cse 636 data integration

CSE 636 Data Integration XML Semistructured Data Document Type Definitions

Upload: booth

Post on 19-Jan-2016


TRANSCRIPT

Page 1: CSE 636 Data Integration

CSE 636 Data Integration

XML

Semistructured Data

Document Type Definitions

Page 2: CSE 636 Data Integration

Semistructured Data

• Another data model, based on trees
• Motivation: flexible representation of data
  – Often, data comes from multiple sources with differences in notation, meaning, etc.
• Motivation: sharing of documents among systems and databases

Page 3: CSE 636 Data Integration

Graphs of Semistructured Data

• Nodes = objects
• Labels on arcs (attributes, relationships)
• Atomic values at leaf nodes (nodes with no arcs out)
• Flexibility: no restriction on:
  – Labels out of a node
  – Number of successors with a given label
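A minimal Python sketch of such a data graph may help make the model concrete. The `Node` class and the particular arcs below are assumptions made for illustration (the values borrow names from the beer/bar example, but the exact arcs are not taken from the slides); the point is only that any node may carry any labels, any number of times.

```python
from collections import defaultdict


class Node:
    """An object in the data graph. Labeled arcs lead to other nodes or to atomic values."""

    def __init__(self):
        # label -> list of successors (Node instances or atomic strings)
        self.arcs = defaultdict(list)

    def add(self, label, target):
        """Add an arc with the given label and return its target."""
        self.arcs[label].append(target)
        return target

    def successors(self, label):
        """Return all successors reached by arcs with this label (possibly none)."""
        return list(self.arcs.get(label, []))


# Hypothetical graph: two beer objects and one bar object under a root.
root = Node()
bud = root.add("beer", Node())
bud.add("name", "Bud")
bud.add("manf", "A.B.")

second_beer = root.add("beer", Node())
second_beer.add("name", "M'lob")  # no "manf" arc here: labels are not mandatory

joes = root.add("bar", Node())
joes.add("name", "Joe's")
joes.add("addr", "Maple")

beer_names = [beer.successors("name")[0] for beer in root.successors("beer")]
```

Here `root` has two successors labeled `beer`, and `second_beer` simply lacks a `manf` arc; neither situation needs a schema change.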

Page 4: CSE 636 Data Integration

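Example: Data Graph

[Figure: a data graph rooted at "root", with arcs labeled beer, beer, and bar leading to two beer objects and one bar object. Further arc labels: name, manf, servedAt, addr, prize, year, award. Leaf values: Bud, M’lob, A.B., Gold, 1995, Maple, Joe’s. Callouts: "The bar object for Joe’s Bar" and "The beer object for Bud".]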

Page 5: CSE 636 Data Integration

XML

HTML
• Uses tags for formatting the presentation (e.g., “italic”)
• Hard for applications to process

XML = Extensible Markup Language
• Uses tags for semantics (e.g., “this is an address”)
  – Similar to labels in semistructured data
• Allows you to invent your own tags
• Easy for applications to process

Page 6: CSE 636 Data Integration

HTML vs. XML

HTML:

    <html>
      <body>
        <h1> Bibliography </h1>
        <p> <i>Foundations of Databases</i>
            Abiteboul, Hull, Vianu <br/>
            Addison Wesley, 1995 </p>
        <p> <i> Data on the Web </i>
            Abiteboul, Buneman, Suciu <br/>
            Morgan Kaufmann, 1999 </p>
      </body>
    </html>

XML:

    <?xml version = "1.0" standalone = "yes" ?>
    <bibliography>
      <book>
        <title>Foundations of Databases</title>
        <author> Abiteboul </author>
        <author> Hull </author>
        <author> Vianu </author>
        <publisher> Addison Wesley </publisher>
        <year> 1995 </year>
      </book>
      …
    </bibliography>
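To illustrate why the XML form is easier for applications to process, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The choice of Python and the variable names are assumptions for illustration; the document text is the first book from the bibliography example, with the elided remainder left out.

```python
import xml.etree.ElementTree as ET

BIBLIOGRAPHY = """<?xml version="1.0" standalone="yes"?>
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>"""

root = ET.fromstring(BIBLIOGRAPHY)

# Because tags carry meaning, each field can be addressed by name.
books = [
    {
        "title": book.findtext("title").strip(),
        "authors": [author.text.strip() for author in book.findall("author")],
        "year": int(book.findtext("year")),
    }
    for book in root.findall("book")
]
```

Extracting the same fields from the HTML version would require guessing that the italic text is a title and that the text before `<br/>` is an author list.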

Page 7: CSE 636 Data Integration

Why XML is of Interest to Us

• XML is just syntax for data
  – Note: we have no syntax for relational data
  – But XML is not relational: semistructured
• This is exciting because:
  – Can translate any data to XML
  – Can ship XML over the Web (HTTP, SOAP)
  – Can input XML into any application
  – Thus: data sharing and exchange on the Web

Page 8: CSE 636 Data Integration

XML Data Sharing and Exchange

[Figure: boxes labeled Relational DB, Web Site, Web Service, XML DB, and Applications (twice), connected through "XML Data" on the "Web (HTTP, SOAP)", with the operations Transform, Integrate, and Warehouse.]

Page 9: CSE 636 Data Integration

XML Tags & Elements

• Tags: book, title, author, …
  – XML tags are case sensitive
• Tags, as in HTML, are normally matched pairs
  – <book> … </book>
  – Start tag: <book>, End tag: </book>
• Elements: a start tag, its matching end tag, and everything between them
  – Example 1: <title>Foundations of Databases</title>
  – Example 2:
      <book>
        <title>Foundations of Databases</title>
      </book>
• Elements may be nested arbitrarily
• Empty element: <book></book>
  – Abbreviation <book/>
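As a quick check of two of these points, the following sketch (Python's standard `xml.etree.ElementTree`, chosen here only for illustration) parses both spellings of the empty element and compares tag names that differ only in case.

```python
import xml.etree.ElementTree as ET

# The long and abbreviated forms of an empty element parse to the same thing.
long_form = ET.fromstring("<book></book>")
short_form = ET.fromstring("<book/>")
same_empty_element = ET.tostring(long_form) == ET.tostring(short_form)

# Tags are case sensitive: <Book> and <book> are different element names.
lower = ET.fromstring("<book/>")
upper = ET.fromstring("<Book/>")
different_names = lower.tag != upper.tag
```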

Page 10: CSE 636 Data Integration

XML Attributes

    <book price = "55" currency = "USD">
      <title> Foundations of Databases </title>
      <author> Abiteboul </author>
      …
      <year> 1995 </year>
    </book>

• Attributes are alternative ways to represent data

Page 11: CSE 636 Data Integration

Replacing Attributes with Elements

    <book>
      <title> Foundations of Databases </title>
      <author> Abiteboul </author>
      …
      <year> 1995 </year>
      <price> 55 </price>
      <currency> USD </currency>
    </book>
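The two encodings carry the same information but are read differently by an application. The sketch below assumes Python's `xml.etree.ElementTree`; the helper `price_of` is hypothetical, and the book documents are shortened versions of the two examples.

```python
import xml.etree.ElementTree as ET

with_attributes = ET.fromstring(
    '<book price="55" currency="USD">'
    "<title> Foundations of Databases </title>"
    "</book>"
)
with_elements = ET.fromstring(
    "<book>"
    "<title> Foundations of Databases </title>"
    "<price> 55 </price><currency> USD </currency>"
    "</book>"
)


def price_of(book):
    """Return (amount, currency), whether stored as attributes or as child elements."""
    # Attributes are read with get(); child elements with findtext().
    amount = book.get("price") or book.findtext("price")
    currency = book.get("currency") or book.findtext("currency")
    return int(amount), currency.strip()
```

Both `price_of(with_attributes)` and `price_of(with_elements)` yield the same pair, which is the sense in which attributes are an alternative representation.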

Page 12: CSE 636 Data Integration

Elements vs. Attributes

• Too many attributes make documents hard to read
• Attributes do not specify document structure
• Attributes are good for simple information

Page 13: CSE 636 Data Integration

More XML: CDATA Section

• Syntax: <![CDATA[ ... any text here ... ]]>
• Example:

    <example>
      <![CDATA[ some text here </notAtag> <>]]>
    </example>
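A short sketch, assuming Python's `xml.etree.ElementTree`, shows what a parser hands to the application for a CDATA section: the enclosed characters arrive as ordinary text, and the markup-like characters inside are not interpreted as tags.

```python
import xml.etree.ElementTree as ET

example = ET.fromstring(
    "<example><![CDATA[ some text here </notAtag> <>]]></example>"
)

# The CDATA wrapper is gone; its content is plain character data.
cdata_text = example.text
has_no_children = len(example) == 0
```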

Page 14: CSE 636 Data Integration

More XML: Entity References

• Syntax: &entityname;
• Example:
    <element> this is less than &lt; </element>
• Some entities:

    &lt;      <
    &gt;      >
    &amp;     &
    &apos;    '
    &quot;    "
    &#38;     character reference by Unicode code point (38 is &)
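The following sketch, using Python's standard `xml.etree.ElementTree` and `xml.sax.saxutils` (an illustrative choice, not something the slides prescribe), shows entity references being decoded on parsing and produced again when raw text is escaped for inclusion in a document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing replaces the entity reference with the character it stands for.
element = ET.fromstring("<element> this is less than &lt; </element>")
decoded = element.text

# Going the other way: escape raw text before embedding it in XML.
raw = "a < b & c > d"
escaped = escape(raw)
round_trip = ET.fromstring(f"<t>{escaped}</t>").text

# A numeric character reference names a code point; 38 is the ampersand.
numeric = ET.fromstring("<t>&#38;</t>").text
```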

Page 15: CSE 636 Data Integration

More XML: Comments

• Syntax: <!-- ... comment text ... -->
• Yes, they are part of the data model!

Page 16: CSE 636 Data Integration

XML Semantics: a Tree!

    <data>
      <person age="25">
        <name> Mary </name>
        <address>
          <street> Maple </street>
          <no> 345 </no>
          <city> Seattle </city>
        </address>
      </person>
      <person>
        <name> John </name>
        <address>Thailand</address>
        <phone> 23456 </phone>
      </person>
    </data>

[Figure: the corresponding tree, with element, attribute, and text nodes]

    data                      (element node)
      person                  (element node)
        age: 25               (attribute node)
        name
          Mary                (text node)
        address
          street
            Maple
          no
            345
          city
            Seattle
      person
        name
          John
        address
          Thailand
        phone
          23456

• Order matters!
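The tree reading can be reproduced with a short traversal. This is a sketch assuming Python's `xml.etree.ElementTree`; the generator `walk` is a hypothetical helper, and it ignores text that follows a child element (mixed content), which the example does not use.

```python
import xml.etree.ElementTree as ET

DATA = """<data>
  <person age="25">
    <name> Mary </name>
    <address>
      <street> Maple </street> <no> 345 </no> <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address>Thailand</address>
    <phone> 23456 </phone>
  </person>
</data>"""


def walk(element, depth=0):
    """Yield (depth, kind, label) for element, attribute, and text nodes in document order."""
    yield depth, "element", element.tag
    for name, value in element.attrib.items():
        yield depth + 1, "attribute", f"{name}={value}"
    if element.text and element.text.strip():
        yield depth + 1, "text", element.text.strip()
    for child in element:  # children are visited in document order
        yield from walk(child, depth + 1)


nodes = list(walk(ET.fromstring(DATA)))
text_nodes = [label for _, kind, label in nodes if kind == "text"]
```

Printing each entry indented by its depth gives an outline of the tree; because children are visited in document order, swapping two elements in the source changes the result.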

Page 17: CSE 636 Data Integration

Well-Formed XML

• Start the document with a declaration, surrounded by <?xml … ?>
• Normal declaration is:
    <?xml version = "1.0" standalone = "yes" ?>
  – standalone = "yes" means the document does not depend on an external DTD
• Has single root element surrounding nested elements
• Has matching tags
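Well-formedness is what any XML parser checks before anything else. A minimal sketch, assuming Python's `xml.etree.ElementTree` (the function name `is_well_formed` is hypothetical), treats a parse error as "not well-formed":

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


ok = is_well_formed('<?xml version="1.0" standalone="yes"?><a><b></b></a>')
crossed_tags = is_well_formed("<a><b></a></b>")  # end tags in the wrong order
two_roots = is_well_formed("<a/><b/>")           # more than one root element
```

Note that this checks only well-formedness; it says nothing about validity against a DTD.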

Page 18: CSE 636 Data Integration

XML Data

• XML is self-describing
• Schema elements become part of the data
  – Relational schema: person(name, phone)
  – In XML <person>, <name>, <phone> are part of the data, and are repeated many times
• Consequence: XML is much more flexible
• XML = semistructured data
  – Well-Formed XML with nested tags is exactly the same idea as trees of semistructured data
  – XML also enables nontree structures, as does the semistructured data model

Page 19: CSE 636 Data Integration

XML is Semistructured Data

• Missing attributes:

    <person>
      <name> John</name>
      <phone>1234</phone>
    </person>

    <person>
      <name>Joe</name>
    </person>

  (no phone!)

• Could represent in a table with nulls:

    name    phone
    John    1234
    Joe     -

Page 20: CSE 636 Data Integration

XML is Semistructured Data

• Repeated attributes:

    <person>
      <name>Mary</name>
      <phone>2345</phone>
      <phone>3456</phone>
    </person>

  (two phones!)

• Impossible in tables:

    name    phone
    Mary    2345  3456  ???
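One way to see the mismatch with tables is to try flattening the `<person>` examples into (name, phone) rows. The sketch below assumes Python's `xml.etree.ElementTree`; the `<people>` wrapper element and the flattening policy (a null for a missing phone, one row per repeated phone) are assumptions for illustration, not something the slides prescribe.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper <people> around the person examples.
PEOPLE = """<people>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</people>"""

rows = []
for person in ET.fromstring(PEOPLE).findall("person"):
    name = person.findtext("name")
    phones = [phone.text for phone in person.findall("phone")]
    # Missing phone -> a single row with None (a null);
    # repeated phones -> the name is duplicated, one row per phone.
    for phone in phones or [None]:
        rows.append((name, phone))
```

The missing phone fits a table only as a null, and the repeated phone fits only by duplicating the name across rows or by moving phones to a separate table.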

Page 21: CSE 636 Data Integration

XML is Semistructured Data

• Attributes with different types in different objects

    <person>
      <name>
        <first>John</first>
        <last>Smith</last>
      </name>
      <phone>1234</phone>
    </person>

  (structured name!)

• Nested collections (no 1NF)
• Heterogeneous collections:
  – <db> contains both <book>s and <publisher>s

Page 22: CSE 636 Data Integration

Document Type Definition (DTD)

• Part of the original XML specification
• An XML document may have a DTD
• Valid XML: if it has a DTD and conforms to it
• Validation is useful in data exchange

Page 23: CSE 636 Data Integration

Very Simple DTD

    <!DOCTYPE db [
      <!ELEMENT db ((book|publisher)*)>
      <!ELEMENT book (title,author*,year?)>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT author (#PCDATA)>
      <!ELEMENT year (#PCDATA)>
      <!ELEMENT publisher (#PCDATA)>
    ]>

Page 24: CSE 636 Data Integration

DTD: The Content Model

• Content model: <!ELEMENT tag (CONTENT)>, where CONTENT is the content model
  – Complex = a regular expression over other elements
  – Text-only = #PCDATA
  – Empty = EMPTY
  – Any = ANY
  – Mixed content = (#PCDATA | A | B | C)*

Page 25: CSE 636 Data Integration

DTD: Regular Expressions

Each DTD declaration is shown with XML that conforms to it.

sequence
    DTD:  <!ELEMENT name (firstName, lastName)>
    XML:  <name> <firstName>…</firstName> <lastName>…</lastName> </name>

optional
    DTD:  <!ELEMENT name (firstName?, lastName)>
    XML:  <name> <lastName>…</lastName> </name>
          <name> <firstName>…</firstName> <lastName>…</lastName> </name>

zero or more
    DTD:  <!ELEMENT person (name, phone*)>
    XML:  <person> <name>…</name> <phone>…</phone> <phone>…</phone> <phone>…</phone> … </person>
          <person> <name>…</name> </person>

one or more
    DTD:  <!ELEMENT person (name, phone+)>
    XML:  <person> <name>…</name> <phone>…</phone> <phone>…</phone> … </person>
          <person> <name>…</name> <phone>…</phone> </person>

alternation
    DTD:  <!ELEMENT person (name, (phone|email))>
    XML:  <person> <name>…</name> <phone>…</phone> </person>
          <person> <name>…</name> <email>…</email> </person>
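Because a content model is a regular expression over child element names, it can be checked with an ordinary regex engine. The sketch below is one possible illustration, not how a real validator is implemented: it assumes element-only content models (no #PCDATA, EMPTY, or ANY), and the functions `content_model_to_regex` and `conforms` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET


def content_model_to_regex(model):
    """Translate an element-only DTD content model into a regex over child tag names.

    Each child is encoded as the token "<tag>" so that names cannot run together.
    """
    body = re.sub(r"\s+", "", model)
    # Wrap every element name in a group matching its token, e.g. name -> (?:<name>).
    body = re.sub(r"[A-Za-z_][\w.-]*", lambda m: "(?:<" + m.group(0) + ">)", body)
    # ',' means sequence, which is plain concatenation in a regex.
    # The operators | * + ? and the parentheses already mean the same thing.
    body = body.replace(",", "")
    return re.compile(body)


def conforms(element, model):
    """Check that the sequence of child tags of `element` matches the content model."""
    children = "".join(f"<{child.tag}>" for child in element)
    return content_model_to_regex(model).fullmatch(children) is not None


two_phones = ET.fromstring(
    "<person><name>a</name><phone>1</phone><phone>2</phone></person>"
)
no_phone = ET.fromstring("<person><name>b</name></person>")
```

For example, `conforms(two_phones, "(name, phone*)")` holds, while `conforms(no_phone, "(name, phone+)")` does not, mirroring the zero-or-more and one-or-more cases above.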

Page 26: CSE 636 Data Integration

DTD: Attributes

    <!ELEMENT person (ssn, name, office, phone?)>
    <!ATTLIST person age    CDATA #REQUIRED
                     height CDATA #IMPLIED>

    <person age="25" height="6">
      <name> ...</name>
      ...
    </person>

Page 27: CSE 636 Data Integration

DTD: Attributes

<!ATTLIST tag (name type kind)+>

Types:
• CDATA = string
• (Mon | Wed | Fri) = enumeration
• ID = key
• IDREF = foreign key
• IDREFS = foreign keys separated by space
• others = rarely used

Kind:
• #REQUIRED
• #IMPLIED = optional
• "value" = default value
• #FIXED "value" = the only value allowed

Page 28: CSE 636 Data Integration

XML: IDs and References

• Attributes can be pointers from one object to another
  – Compare to HTML’s NAME = "foo" and HREF = "#foo"
• Allows the structure of an XML document to be a general graph, rather than just a tree

Page 29: CSE 636 Data Integration

XML: Creating IDs

• Give an element E an attribute A of type ID
• When using tag <E> in an XML document, give its attribute A a unique value
• Example:
    <E A = "xyz">

Page 30: CSE 636 Data Integration

XML: Creating References

• To allow objects of type F to refer to another object with an ID attribute, give F an attribute of type IDREF
• Or, let the attribute have type IDREFS, so the F-object can refer to any number of other objects

Page 31: CSE 636 Data Integration

XML: IDs and References

    <person id="o555">
      <name>Jane</name>
    </person>

    <person id="o456">
      <name> Mary </name>
      <children idref="o123 o555"/>
    </person>

    <person id="o123" mother="o456">
      <name>John</name>
    </person>

• IDs and references in XML are just syntax
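Since IDs and references are just syntax, following a reference is left to the application. A minimal sketch of that lookup, assuming Python's `xml.etree.ElementTree`: the `<people>` wrapper element and the helper `name_of` are hypothetical, and the parser here treats `id`, `idref`, and `mother` as ordinary string attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper <people> around the three person elements.
DOC = """<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name> Mary </name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>"""

root = ET.fromstring(DOC)

# Index every person by its id so that references can be resolved.
by_id = {person.get("id"): person for person in root.iter("person")}


def name_of(person):
    """Return the person's name with surrounding whitespace removed."""
    return person.findtext("name").strip()


mary = by_id["o456"]
# An IDREFS-style value is a space-separated list of ids.
children_of_mary = [
    name_of(by_id[ref]) for ref in mary.find("children").get("idref").split()
]
mother_of_john = name_of(by_id[by_id["o123"].get("mother")])
```

Resolving the references turns the tree into a graph: John points to Mary, and Mary points back to John and Jane.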

Page 32: CSE 636 Data Integration

DTD: ID and IDREF(S) Attributes

    <!ELEMENT person (ssn, name, office, phone?)>
    <!ATTLIST person age     CDATA  #REQUIRED
                     id      ID     #REQUIRED
                     manager IDREF  #REQUIRED
                     manages IDREFS #REQUIRED
    >

    <person age="25" id="p29432" manager="p48293"
            manages="p34982 p423234">
      <name> ....</name>
      ...
    </person>

Page 33: CSE 636 Data Integration

Use of DTDs

1. Set standalone = "no"
2. Either:
   a) Include the DTD as a preamble of the XML document, or
   b) Follow DOCTYPE and the <root tag> by SYSTEM and a path to the file where the DTD can be found, or
   c) Mix the two... (e.g. to override the external definition)

Page 34: CSE 636 Data Integration

Example (a)

    <?xml version = "1.0" standalone = "no" ?>
    <!DOCTYPE BARS [
      <!ELEMENT BARS (BAR*)>
      <!ELEMENT BAR (NAME, BEER+)>
      <!ELEMENT NAME (#PCDATA)>
      <!ELEMENT BEER (NAME, PRICE)>
      <!ELEMENT PRICE (#PCDATA)>
    ]>
    <BARS>
      <BAR>
        <NAME>Joe’s Bar</NAME>
        <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
        <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
      </BAR>
      <BAR> …
    </BARS>

The <!DOCTYPE BARS [ … ]> block is the DTD; <BARS> … </BARS> is the document.

Page 35: CSE 636 Data Integration

Example (b)

• Assume the BARS DTD is in file bar.dtd

    <?xml version = "1.0" standalone = "no" ?>
    <!DOCTYPE BARS SYSTEM "bar.dtd">
    <BARS>
      <BAR>
        <NAME>Joe’s Bar</NAME>
        <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
        <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
      </BAR>
      <BAR> …
    </BARS>

The <!DOCTYPE … SYSTEM "bar.dtd"> line says: get the DTD from the file bar.dtd.

Page 36: CSE 636 Data Integration

DTDs as Grammars

    <!DOCTYPE db [
      <!ELEMENT db ((book|publisher)*)>
      <!ELEMENT book (title,author*,year?)>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT author (#PCDATA)>
      <!ELEMENT year (#PCDATA)>
      <!ELEMENT publisher (#PCDATA)>
    ]>

Page 37: CSE 636 Data Integration

DTDs as Grammars

Same thing as:

    db        ::= (book|publisher)*
    book      ::= (title,author*,year?)
    title     ::= string
    author    ::= string
    year      ::= string
    publisher ::= string

• A DTD is an EBNF (Extended BNF) grammar
• An XML tree is precisely a derivation tree
• A valid XML document = a parse tree for that grammar

Page 38: CSE 636 Data Integration

DTDs as Grammars

    <!DOCTYPE paper [
      <!ELEMENT paper (section*)>
      <!ELEMENT section ((title,section*) | text)>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT text (#PCDATA)>
    ]>

    <paper>
      <section> <text> </text> </section>
      <section>
        <title> </title>
        <section> … </section>
        <section> … </section>
      </section>
    </paper>

• XML documents can be nested arbitrarily deep

Page 39: CSE 636 Data Integration

DTDs as Schemas

Not so well suited:
• impose unwanted constraints on order:
  – <!ELEMENT person (name,phone)>
• references cannot be constrained
  – IDREF/IDREFS can reference any ID
• can be too vague:
  – <!ELEMENT person ((name|phone|email)*)>

Page 40: CSE 636 Data Integration

DTDs as Schemas

No context-dependent typing

• Cannot distinguish between used car ads and new car ads
  – Different structure in different contexts

[Figure: a tree rooted at dealer with children UsedCars and NewCars, each containing an ad element; labels below the ads: model, year, year.]

Page 41: CSE 636 Data Integration

XML APIs

• Document Object Model - DOM
  – Manipulation of XML Data
  – Provides a representation of an XML Document as a tree
  – Reads XML Document into memory
  – http://www.w3.org/DOM
  – Many implementations (Sun JAXP, Apache Xerces, …)
• Simple API for XML - SAX
  – Event-based framework for parsing XML data
  – http://www.saxproject.org/
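The difference between the two styles can be sketched with the DOM and SAX implementations in Python's standard library (`xml.dom.minidom` and `xml.sax`). This is an illustrative choice rather than the Java implementations named above, and the `TitleHandler` class is a hypothetical example. Both halves extract the same titles from a shortened bibliography document.

```python
import xml.dom.minidom
import xml.sax

DOC = (
    "<bibliography>"
    "<book><title>Foundations of Databases</title><year>1995</year></book>"
    "<book><title>Data on the Web</title><year>1999</year></book>"
    "</bibliography>"
)

# DOM: the whole document is read into memory as a tree, then navigated.
dom = xml.dom.minidom.parseString(DOC)
dom_titles = [node.firstChild.data for node in dom.getElementsByTagName("title")]


# SAX: the parser reports events; the handler keeps only what it needs.
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False


handler = TitleHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_titles = handler.titles
```

The DOM version is shorter to write but holds the entire tree in memory; the SAX version never builds a tree, which matters for documents too large to load at once.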

Page 42: CSE 636 Data Integration

References

• Lecture Slides
  – Jeffrey D. Ullman
    http://www-db.stanford.edu/~ullman/dscb/pslides/pslides.html
  – Dan Suciu
    http://www.cs.washington.edu/homes/suciu/COURSES/590DS/02xmlsyntax.htm
    http://www.cs.washington.edu/homes/suciu/COURSES/590DS/11dtd.htm
  – Alon Levy
    http://www.cs.washington.edu/education/courses/csep544/02sp/lectures/lecture5cut.ppt
• BRICS XML Tutorial
  – A. Moeller, M. Schwartzbach
    http://www.brics.dk/~amoeller/XML/index.html
• W3C's XML homepage
  – http://www.w3.org/XML
• XML School: an XML tutorial
  – http://www.w3schools.com/xml