Introduction to XML Yanlei Diao UMass Amherst April 19, 2007 Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau.


Introduction to XML

Yanlei Diao
UMass Amherst
April 19, 2007

Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau.


Structure in Data Representation

Relational data is highly structured
– structure is defined by the schema
– good for system design
– good for precise query semantics / answers

Structure can be limiting
– data exchange is hard: integration of different schemas
– authoring is constrained: schema-first
– querying is constrained: must know the schema
– changes to structure are not easy


Data Integration

1. Find all departments whose total employee salaries exceed 1% of the budget of the company.

2. Find names of employees with the top sales record last month.

[Figure: sites in the US, Europe, Asia, and Australia connected over the Internet]


Integration of Text and Structured Data

[Figure: diagram relating the WWW, structured data (databases), unstructured text (documents), and semistructured data]


Need for A New Data Model

– Loose (and rich) structure
– Integration of structured, but heterogeneous data sources
– Evolving, unknown, or irregular structure
– Textual data with tags and links
– Combination of data models


XML: Universal Data Exchange Format

XML is the confluence of many factors:
– Databases needed a more flexible interchange format.
– Data needed to be generated and consumed by applications.
– The Web needed a more declarative format for data.
– Documents needed a mechanism for extended tags.

XML was originally proposed for online publishing and is becoming the wire format for data exchange.

W3C Recommendation: http://www.w3.org/TR/REC-xml/


From HTML to XML

HTML describes the presentation.


HTML

<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu
    <br> Addison Wesley, 1995
<p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu
    <br> Morgan Kaufmann, 1999


XML

<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>

XML describes the content!
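
Because the tags name the content, a program can ask content-level questions that the HTML version cannot answer directly. A minimal sketch in Python, assuming the standard `xml.etree.ElementTree` parser; the document string is an illustration built from the two books on the HTML slide, and `titles_by_author` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Illustrative document modeled on the slide; the second book is taken from
# the HTML version of the same bibliography.
BIB = """
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
  <book>
    <title>Data on the Web</title>
    <author>Abiteboul</author>
    <author>Buneman</author>
    <author>Suciu</author>
    <publisher>Morgan Kaufmann</publisher>
    <year>1999</year>
  </book>
</bibliography>
"""

def titles_by_author(xml_text, author):
    """Return the titles of all books that list `author`."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title")
        for book in root.findall("book")
        if author in (a.text for a in book.findall("author"))
    ]

print(titles_by_author(BIB, "Suciu"))
```

The same question against the HTML version would require guessing which italic run is a title and which comma-separated words are authors.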


XML: Syntax & Typing


XML Syntax

Tags: book, title, author, …
– start tag: <book>
– end tag: </book>

Elements: <book>…</book>, <author>…</author>
– elements are nested
– empty element: <red></red>, abbreviated <red/>

An XML document: single root element

An XML document is well formed if it has matching tags
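
Well-formedness can be checked by simply trying to parse the text. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; `is_well_formed` is a hypothetical helper name and the sample strings are illustrations:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as a well-formed document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Matching, properly nested tags.
print(is_well_formed("<book><title>Foundations of Databases</title></book>"))
# Tags that cross instead of nesting.
print(is_well_formed("<book><title>Foundations of Databases</book></title>"))
# Two top-level elements, so there is no single root.
print(is_well_formed("<book/><book/>"))
# Abbreviated empty element.
print(is_well_formed("<red/>"))
```

Note that this only tests well-formedness; it says nothing about whether the document conforms to a DTD.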


XML Syntax

<book price = "55" currency = "USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  …
  <year> 1995 </year>
</book>

Attributes are alternative ways to represent data.
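
To make the "alternative ways" point concrete, the sketch below stores the price once as an attribute (as on the slide) and once as a child element, then reads it back from either form. The element-based variant and the helper `price_of` are assumptions for illustration, not part of the slides:

```python
import xml.etree.ElementTree as ET

# Price and currency stored as attributes, as on the slide.
as_attributes = ET.fromstring(
    '<book price="55" currency="USD">'
    "<title>Foundations of Databases</title>"
    "</book>"
)

# The same facts stored as child elements (an illustrative alternative).
as_elements = ET.fromstring(
    "<book>"
    "<title>Foundations of Databases</title>"
    "<price>55</price><currency>USD</currency>"
    "</book>"
)

def price_of(book):
    """Read the price from either representation; returns None if absent."""
    value = book.get("price")           # attribute lookup
    if value is None:
        value = book.findtext("price")  # child-element lookup
    return None if value is None else int(value)

print(price_of(as_attributes), price_of(as_elements))
```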


XML Syntax

<person id="o555"> <name> Jane </name> </person>

<person id="o456"> <name> Mary </name>
                   <children idref="o123 o555"/>
</person>

<person id="o123" mother="o456"> <name>John</name> </person>

Oids and references in XML are just syntax.
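
"Just syntax" means a plain parser does not follow the references; the application has to resolve them. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the `<people>` wrapper element and the helper names are hypothetical, added only so the three elements form one document:

```python
import xml.etree.ElementTree as ET

# The three <person> elements from the slide, wrapped in a hypothetical
# <people> root so that the text is a single well-formed document.
DOC = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""

root = ET.fromstring(DOC)

# Build the oid -> element index ourselves; the parser does not do it.
by_id = {p.get("id"): p for p in root.iter("person")}

def name_of(oid):
    return by_id[oid].findtext("name").strip()

def children_of(oid):
    """Follow the space-separated references in <children idref="..."/>."""
    refs = by_id[oid].find("children")
    if refs is None:
        return []
    return [name_of(r) for r in refs.get("idref").split()]

def mother_of(oid):
    ref = by_id[oid].get("mother")
    return None if ref is None else name_of(ref)

print(children_of("o456"))
print(mother_of("o123"))
```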


XML Semantics: a Tree !

<data>
  <person id="o555">
    <name> Mary </name>
    <address>
      <street> Maple </street>
      <no> 345 </no>
      <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address> Thailand </address>
    <phone> 23456 </phone>
  </person>
</data>

[Figure: the document drawn as a tree. Element nodes: data, person, name, address, street, no, city, phone. Text nodes: Mary, Maple, 345, Seattle, John, Thai, 23456. Attribute node: id = o555.]

Order matters! IDREF will turn it into a graph.
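
The tree view can be reproduced by walking a parsed document. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; `outline` is a hypothetical helper that lists element, attribute, and text nodes in document order:

```python
import xml.etree.ElementTree as ET

DOC = """<data>
  <person id="o555">
    <name>Mary</name>
    <address><street>Maple</street><no>345</no><city>Seattle</city></address>
  </person>
  <person>
    <name>John</name>
    <address>Thailand</address>
    <phone>23456</phone>
  </person>
</data>"""

def outline(elem, depth=0):
    """Yield one line per node, in document order, indented by depth."""
    pad = "  " * depth
    yield f"{pad}element {elem.tag}"
    for key, value in elem.attrib.items():
        yield f"{pad}  attribute {key}={value}"
    text = (elem.text or "").strip()
    if text:
        yield f"{pad}  text {text}"
    for child in elem:  # children come back in document order
        yield from outline(child, depth + 1)

lines = list(outline(ET.fromstring(DOC)))
print("\n".join(lines))
```

The walk never follows `id` values, which is why the structure stays a tree until the application chooses to resolve references.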


XML Data

XML is self-describing

Schema elements become part of the data
– Relational schema: persons(name,phone)
– In XML <persons>, <name>, <phone> are part of the data, and are repeated many times

Consequence: XML is much more flexible

Some real data:

http://www.cs.washington.edu/research/xmldatasets/


Relational Data as XML

person

name   phone
John   3634
Sue    6343
Dick   6363

XML:

<person>
  <row> <name>John</name> <phone>3634</phone> </row>
  <row> <name>Sue</name>  <phone>6343</phone> </row>
  <row> <name>Dick</name> <phone>6363</phone> </row>
</person>

[Figure: the same data as a tree. The person root has three row children, each with a name child and a phone child, holding "John" 3634, "Sue" 6343, and "Dick" 6363.]
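
The table-to-XML encoding is mechanical: one element for the relation, one `row` per tuple, one child per column. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; `relation_to_xml` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

def relation_to_xml(table, columns, rows):
    """Encode a relation as <table><row><col>value</col>...</row>...</table>."""
    root = ET.Element(table)
    for values in rows:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, values):
            ET.SubElement(row, column).text = str(value)
    return root

person = relation_to_xml(
    "person",
    ["name", "phone"],
    [("John", 3634), ("Sue", 6343), ("Dick", 6363)],
)
xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)
```

The column names are repeated in every row, which is the price of the self-describing format.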


XML is Semi-structured Data

Missing attributes:

<data>
  <person> <name> John</name> <phone>1234</phone> </person>
  <person> <name>Joe</name> </person>      ← no phone !
</data>

Could represent in a table with nulls:

name   phone
John   1234
Joe    -
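
Going from the XML back to the table with nulls can be sketched as follows, assuming Python's standard `xml.etree.ElementTree` and using `None` for the null; `to_rows` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

DOC = """<data>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
</data>"""

def to_rows(xml_text, columns=("name", "phone")):
    """One tuple per <person>; a missing child element becomes None."""
    root = ET.fromstring(xml_text)
    return [
        tuple(person.findtext(column) for column in columns)
        for person in root.findall("person")
    ]

rows = to_rows(DOC)
print(rows)
```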


XML is Semi-structured Data

Repeated attributes

<person>
  <name> Mary</name>
  <phone>2345</phone>
  <phone>3456</phone>      ← two phones !
</person>

Impossible in tables: nested collections (non 1NF)

name   phone
Mary   2345 3456   ???
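
Two common ways to handle the repeated element when mapping to tuples are sketched below: keep the nesting (a set-valued column), or flatten to one row per phone. This is an illustration assuming Python's standard `xml.etree.ElementTree`, not a mapping prescribed by the slides:

```python
import xml.etree.ElementTree as ET

person = ET.fromstring(
    "<person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>"
)

name = person.findtext("name")
phones = [p.text for p in person.findall("phone")]

# Option 1: keep the nesting (a set-valued column, so not 1NF).
nested_row = (name, phones)

# Option 2: flatten to 1NF by emitting one (name, phone) row per phone.
flat_rows = [(name, phone) for phone in phones]

print(nested_row)
print(flat_rows)
```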


XML is Semi-structured Data

Attributes with different types in different objects

<data>
  <person>
    <name> <first> John </first> <last> Smith </last> </name>      ← structured name !
    <phone>1234</phone>
  </person>
  <person>
    <name> M. Carey</name>                                         ← unstructured name !
    <phone>3456</phone>
  </person>
</data>

Mixed content:
– <db> contains both <book>s and <publisher>s


Data Typing in XML

Data typing in the relational model: schema

Data typing in XML
– Much more complex
– Typing restricts valid trees that can occur
  • theoretical foundation: tree languages
– Practical methods:
  • DTD (Document Type Definition)
  • XML Schema


Document Type Definitions (DTD)

Part of the original XML specification

To be replaced by XML Schema
– Much more complex

An XML document may have a DTD

XML document:
– well-formed = if tags are correctly closed
– valid = if it has a DTD and conforms to it

Validation is useful in data exchange


DTD Example

<!DOCTYPE company [
  <!ELEMENT company ((person|product)*)>
  <!ELEMENT person (ssn, name, office, phone?)>
  <!ELEMENT ssn (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT office (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT product (pid, name, description?)>
  <!ELEMENT pid (#PCDATA)>
  <!ELEMENT description (#PCDATA)>
]>


DTD Example

Example of valid XML document:

<company>
  <person>
    <ssn> 123456789 </ssn>
    <name> John </name>
    <office> B432 </office>
    <phone> 1234 </phone>
  </person>
  <person>
    <ssn> 987654321 </ssn>
    <name> Jim </name>
    <office> B123 </office>
  </person>
  <product> ... </product>
  ...
</company>


DTD: The Content Model

<!ELEMENT tag (CONTENT)>      (CONTENT is the content model)

Content model:
– Complex = a regular expression over other elements
– Text-only = #PCDATA
– Empty = EMPTY
– Any = ANY
– Mixed content = (#PCDATA | A | B | C)*


DTD: Regular Expressions

sequence:
<!ELEMENT name (firstName, lastName)>

<name>
  <firstName> . . . . . </firstName>
  <lastName> . . . . . </lastName>
</name>

optional:
<!ELEMENT name (firstName?, lastName)>

Kleene star:
<!ELEMENT person (name, phone*)>

<person>
  <name> . . . . . </name>
  <phone> . . . . . </phone>
  <phone> . . . . . </phone>
  <phone> . . . . . </phone>
  . . . . . .
</person>

alternation:
<!ELEMENT person (name, (phone|email))>
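
Since a content model is a regular expression over child element names, its effect can be imitated with an ordinary regex engine. The sketch below is an illustration only, not a DTD validator: the four content models above are translated by hand into Python `re` patterns over the sequence of child tag names, each name followed by a space (an encoding chosen for this sketch). It checks one level of children and ignores text and attributes.

```python
import re
import xml.etree.ElementTree as ET

def child_tags(elem):
    """Child tag names in document order, each followed by a space."""
    return "".join(child.tag + " " for child in elem)

def conforms(elem, model):
    """True if the element's child sequence matches the whole pattern."""
    return re.fullmatch(model, child_tags(elem)) is not None

# Hand translations of the content models (assumed encoding, see above).
SEQUENCE = r"firstName lastName "       # (firstName, lastName)
OPTIONAL = r"(firstName )?lastName "    # (firstName?, lastName)
KLEENE_STAR = r"name (phone )*"         # (name, phone*)
ALTERNATION = r"name (phone |email )"   # (name, (phone|email))

last_only = ET.fromstring("<name><lastName>Smith</lastName></name>")
three_phones = ET.fromstring(
    "<person><name>Mary</name><phone>1</phone><phone>2</phone><phone>3</phone></person>"
)
with_email = ET.fromstring("<person><name>Mary</name><email>m@example.org</email></person>")

print(conforms(last_only, SEQUENCE), conforms(last_only, OPTIONAL))
print(conforms(three_phones, KLEENE_STAR), conforms(three_phones, ALTERNATION))
print(conforms(with_email, KLEENE_STAR), conforms(with_email, ALTERNATION))
```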


Attributes in DTDs

<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age CDATA #REQUIRED>

<person age="25">
  <name> ....</name>
  ...
</person>


Attributes in DTDs

<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age     CDATA  #REQUIRED
                 id      ID     #REQUIRED
                 manager IDREF  #REQUIRED
                 manages IDREFS #REQUIRED>

<person age="25" id="p29432" manager="p48293" manages="p34982 p423234">
  <name> ....</name>
  ...
</person>


Attributes in DTDs

Types:
– CDATA = string
– ID = key
– IDREF = foreign key
– IDREFS = foreign keys separated by space
– (Monday | Wednesday | Friday) = enumeration
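
The key / foreign-key reading of ID, IDREF, and IDREFS suggests two integrity checks: ID values are unique, and every reference points to an existing ID. A validating parser performs these checks; the sketch below imitates them by hand with Python's standard `xml.etree.ElementTree`. The sample document, the `<company>` wrapper, and `reference_errors` are hypothetical, and the sketch does not check #REQUIRED, so the sample omits some attributes.

```python
import xml.etree.ElementTree as ET

# Attribute typing copied by hand from the ATTLIST on the earlier slide.
ID_ATTR = "id"
IDREF_ATTRS = ("manager",)
IDREFS_ATTRS = ("manages",)

def reference_errors(root, tag="person"):
    """Return a list of problems: duplicate IDs and dangling references."""
    errors, seen = [], set()
    people = list(root.iter(tag))
    for p in people:
        oid = p.get(ID_ATTR)
        if oid in seen:
            errors.append(f"duplicate ID {oid}")
        seen.add(oid)
    for p in people:
        refs = [p.get(a) for a in IDREF_ATTRS if p.get(a) is not None]
        for a in IDREFS_ATTRS:
            refs.extend((p.get(a) or "").split())  # space-separated list
        for ref in refs:
            if ref not in seen:
                errors.append(f"{p.get(ID_ATTR)} refers to unknown ID {ref}")
    return errors

# Made-up data: p2 manages p1 and a non-existent p9.
DOC = """<company>
  <person id="p1" manager="p2"/>
  <person id="p2" manages="p1 p9"/>
</company>"""

print(reference_errors(ET.fromstring(DOC)))
```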


Attributes in DTDs

Kind:
– #REQUIRED
– #IMPLIED = optional
– value = default value
– #FIXED value = the only value allowed


Using DTDs

Must include in the XML document

Either include the entire DTD:
– <!DOCTYPE rootElement [ ....... ]>

Or include a reference to it:
– <!DOCTYPE rootElement SYSTEM "http://www.mydtd.org">

Or mix the two... (e.g. to override the external definition)


XML Schema

DTDs capture grammatical structure, but have some drawbacks:
– Not themselves in XML, inconvenient to build tools
– Don't capture database datatypes' domains
– No way of defining OO-like inheritance…

XML Schema addresses shortcomings of DTDs
– XML syntax
– Subclassing
– Domains and built-in datatypes
– Min. and max. # of occurrences of elements
– http://www.w3.org/XML/Schema


Basics of XML Schema

Need to use the XML Schema namespace (generally named xsd)

simpleTypes are a way of restricting domains on scalars
– Can define a simpleType based on integer, with values within a particular range

complexTypes are a way of defining element structures
– Basically equivalent to !ELEMENT, but more powerful
– Specify sequence, choice between child elements
– Specify minOccurs and maxOccurs (default 1)

Must associate an element/attribute with a simpleType, or an element with a complexType


Simple Schema Example

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="mastersthesis" type="ThesisType"/>
  <xsd:complexType name="ThesisType">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="year" type="xsd:integer"/>
      <xsd:element name="school" type="xsd:string"/>
      <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="mdate" type="xsd:date"/>
    <xsd:attribute name="key" type="xsd:string"/>
    <xsd:attribute name="advisor" type="xsd:string"/>
  </xsd:complexType>
</xsd:schema>
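
To show what the schema buys beyond a DTD (datatypes such as integer and date), the sketch below checks an instance by hand against the ThesisType declarations. It is not an XML Schema validator: Python's standard library has none, so each rule is re-implemented as ordinary code. The instance document and all of its values are made up, `check_thesis` is a hypothetical helper, and the date check is a simplification that does not accept the optional time zone of `xsd:date`.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical instance document; every value here is made up for illustration.
INSTANCE = """
<mastersthesis mdate="2007-04-19" key="ms/Doe2007" advisor="A. Advisor">
  <author>J. Doe</author>
  <title>An Example Thesis</title>
  <year>2007</year>
  <school>Example University</school>
</mastersthesis>
"""

REQUIRED_SEQUENCE = ["author", "title", "year", "school"]

def check_thesis(elem):
    """Hand-written checks mirroring ThesisType; returns a list of problems."""
    problems = []
    tags = [child.tag for child in elem]
    # committeemember has minOccurs="0" and maxOccurs defaults to 1, so it
    # may appear at most once, after the four required elements.
    if tags not in (REQUIRED_SEQUENCE, REQUIRED_SEQUENCE + ["committeemember"]):
        problems.append(f"unexpected child sequence: {tags}")
    year = elem.findtext("year", default="")
    try:
        int(year)
    except ValueError:
        problems.append(f"year is not an integer: {year!r}")
    mdate = elem.get("mdate")
    if mdate is not None:  # attributes are optional unless declared required
        try:
            date.fromisoformat(mdate)
        except ValueError:
            problems.append(f"mdate is not a date: {mdate!r}")
    return problems

print(check_thesis(ET.fromstring(INSTANCE)))
```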


Questions


How the Web was Yesterday

HTML documents
• often generated by applications
• consumed by humans only
• easy access: across platforms, across organizations

No application interoperability:
• HTML not understood by applications
• Database technology: client-server


Application Interoperability

[Figure: a purchase order exchanged between Amazon and Supplier1, Supplier2, and Supplier3 over the Internet]


New Universal Data Exchange Format: XML

– A recommendation from the W3C
– XML = data
– XML generated by applications
– XML consumed by applications
– Easy access: across platforms, organizations


XML

A W3C standard to complement HTML

Origins: structured text, SGML
• Large-scale electronic publishing
• Data exchange on the web

Motivation:
• HTML describes presentation
• XML describes content

http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, 10/2000)


Paradigm Shift on the Web

From documents (HTML) to data (XML)

From information retrieval to data management

For databases, also a paradigm shift:
• from relational model to XML model
• from data processing to data/query translation
• from storage to transport


Database Issues

How are we going to model XML? (graphs)

Compared to relational model,
• XML is hierarchical
• XML allows missing or additional attributes
• XML allows multiple instances of an attribute (set-valued)
• XML allows different types in different objects
• XML integrates structure and text data
• …

How are we going to query XML? (XQuery)

How are we going to store XML (in a relational database? object-oriented? native?)

How are we going to process XML efficiently? (many interesting research questions!)


Designing an XML Schema/DTD

Not as formalized as relational data design
– We can still use ER diagrams to break into entity, relationship sets
– Note that often we already have our data in relations and need to design the XML schema to export them!

Generally orient the XML tree around the "central" objects

Big decision: element vs. attribute
– Element if it has its own properties, or if you *might* have more than one of them
– Attribute if it is a single property – or perhaps not!