introduction to xmlavid.cs.umass.edu/courses/445/f2009/lectures/lec14-xml.pdf · introduction to...

Post on 25-Aug-2020

3 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Introduction to XML

Yanlei DiaoUMass Amherst

Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau.

2

Data Integration & Sharing

Intranet………

NULLAnna Wang111

513-123-0102Frank James109

413-555-1223Joe Black107

ContactNameSid

…………

mike.levine@bio.umass.eduLevineMike171

alee@math.umass.eduLeeAnna34

joe@cs.umass.eduSmithJoe12

Contact (NOT NULL)LastNameFirstNameSid

Amherst College Student Database

UMass Student Database

3

WWW

Structured data - Databases

Unstructured Text - Documents

Semistructured Data

Integration of Text & Structured Data

4

Needs for A New Data Model

Loose and rich structure

Evolving, unknown, irregular structures

Integration of structured, but heterogeneous data sources

Textual data with tags and links

Combination of data models (relational, hierarchical)

Answer to all of above : Extensible Markup Language (XML) http://www.w3.org/TR/REC-xml/

5

From HTML to XML

6

HTML

<h1> Bibliography </h1><p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995<p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999

HTML describes the presentation.

7

XML

<bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> …

</bibliography>

XML describes the content!

8

Outline

XML Syntax

XML typing

9

XML Syntax

Tags: book, title, author, …− start tag: <book>− end tag: </book>

Elements: <book>…</book>,<author>…</author>− An element can have text data in its content.− An element can have nested elements.− Empty element: <red></red>, abbrv. <red/>

XML document: single root element

An XML document is well formed if it has a single rootelement and matching tags.

10

XML Syntax

<book price = “55” currency = “USD”> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year></book>

Attributes are alternative ways to represent data.

11

XML Syntax

<family><person id=“o555”> <name> Jane </name> </person><person id=“o456”> <name> Mary </name>

<children idrefs=“o123 o555”/></person><person id=“o123” mother=“o456”><name>John</name></person>

</family>

Oids and references in XML are just syntax.

12

XML Semantics: a Tree !

<data> <person id=“o555” > <name> Mary </name> <address> <street> Maple </street> <no> 345 </no> <city> Seattle </city> </address> </person> <person> <name> John </name> <address> Thailand </address> <phone> 23456 </phone> </person></data>

data

Mary

person

person

name address

name address

street no city

Maple 345 Seattle

JohnThai

phone

23456

id

o555

Elementnode

Textnode

Attributenode

Order matters ! IDREF will turn it to a graph.

13

XML Data

XML is self-describing due to the use of tags

Schema elements become part of the data– Relational schema: persons(name,phone)– In XML <persons>, <name>, <phone> are part of the

data, and are repeated many times

Consequence: XML is much more flexible

Some real data:http://www.cs.washington.edu/research/xmldatasets/

14

Relational Data as XML

<person><row> <name>John</name> <phone> 3634</phone></row> <row> <name>Sue</name> <phone> 6343</phone> <row> <name>Dick</name> <phone> 6363</phone></row>

</person>

row row row

name name namephone phone phone

“John” 3634 “Sue” “Dick”6343 6363

personXML: person

name phone

John 3634

Sue 6343

Dick 6363

15

XML is Semi-structured Data

Missing attributes:

Could represent ina table with nulls

<data> <person> <name> John</name> <phone>1234</phone> </person> <person> <name>Joe</name> </person></data>

← no phone !

name phone

John 1234

Joe -

16

XML is Semi-structured Data

Repeated attributes

Impossible in tables:nested collections

(non 1NF)

<person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone></person>

← two phones !

name phone

Mary 2345 3456 ???

17

XML is Semi-structured Data

Attributes with different types in different objects

Mixed content: <p><person>Joe</person> went to <school>Umass</school>.</p>

<data> <person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone> </person> <person> <name> M. Carey</name> <phone>3456</phone> </person></data>

← structured name !

← unstructured name !

18

Outline

XML Syntax

XML typing

19

Data Typing in XML

Data typing in the relational model: schema

Data typing in XML– Much more complex– Typing restricts valid trees that can occur

• theoretical foundation: tree languages

– Practical methods:• DTD (Document Type Definition)• XML Schema: superset of DTD (not covered in detail in

this class)

20

Document Type Definitions (DTD)

Part of the original XML specification.• DTD can be part of an XML document.

An XML document may have a DTD.• But it is not required.

XML document iswell-formed = if tags are correctly closedValid = if it further has a DTD and conforms to it

21

DTD Example

<!DOCTYPE company [ <!ELEMENT company ((person|product)*)> <!ELEMENT person (ssn, name, office, phone?)> <!ELEMENT ssn (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT office (#PCDATA)> <!ELEMENT phone (#PCDATA)> <!ELEMENT product (pid, name, description?)> <!ELEMENT pid (#PCDATA)> <!ELEMENT description (#PCDATA)>]>

22

DTD Example

<company> <person> <ssn> 123456789 </ssn> <name> John </name> <office> B432 </office> <phone> 1234 </phone> </person> <person> <ssn> 987654321 </ssn> <name> Jim </name> <office> B123 </office> </person> <product> ... </product> ...</company>

Example of valid XML document:

23

DTD: The Content Model

Content model:– Complex = a regular expression over other elements– Text-only = #PCDATA– Empty = EMPTY– Mixed content = (#PCDATA | A | B | C)*

<!ELEMENT tag_name (CONTENT)>

contentmodel

24

DTD: Regular Expressions

<!ELEMENT name (firstName, lastName))

<name> <firstName> . . . . . </firstName> <lastName> . . . . . </lastName></name>

<!ELEMENT name (firstName?, lastName))

DTD XML

<!ELEMENT person (name, phone*))

sequence

optional

<!ELEMENT person (name, (phone|email)))

Kleene star

alternation

<person> <name> . . . . . </name> <phone> . . . . . </phone> <phone> . . . . . </phone> <phone> . . . . . </phone> . . . . . .</person>

25

Attributes in DTDs

<!ELEMENT person (ssn, name, office, phone?)><!ATTLIST person age CDATA #REQUIRED>

<person age=“25”> <name> ....</name> ...</person>

26

Attributes in DTDs

<!ELEMENT person (ssn, name, office, phone?)><!ATTLIST person age CDATA #REQUIRED id ID #REQUIRED manager IDREF #REQUIRED manages IDREFS #REQUIRED>

<person age=“25” id=“p29432” manager=“p48293” manages=“p34982 p423234”> <name> ....</name> ...</person>

27

Attributes in DTDs

Types: CDATA = string ID = key IDREF = foreign key IDREFS = foreign keys separated by space (Monday | Wednesday | Friday) = enumeration

28

Attributes in DTDs

Kind: #REQUIRED #IMPLIED = optional value = default value value #FIXED = the only value allowed

29

Using DTDs

Must include in the XML document

Either include the entire DTD:– <!DOCTYPE nameRootElement [ DTD ]>

Or include a reference to it:– <!DOCTYPE nameRootElement SYSTEM “http://www.mydtd.org/”>

30

XML Schema

DTDs capture grammatical structure, but have limitations:− Not themselves in XML, inconvenient to build tools− Don’t capture database data types− No way of defining OO-like inheritance…

XML Schema addresses shortcomings of DTDs− XML syntax− Subclassing− Domains and built-in data types− MIN and MAX # of occurrences of elements− http://www.w3.org/XML/Schema

top related