cse 636 data integration
DESCRIPTION
CSE 636 Data Integration. XML Semistructured Data Document Type Definitions. Semistructured Data. Another data model, based on trees Motivation : flexible representation of data Often, data comes from multiple sources with differences in notation, meaning, etc. - PowerPoint PPT PresentationTRANSCRIPT
CSE 636Data Integration
XML
Semistructured Data
Document Type Definitions
2
Semistructured Data
• Another data model, based on trees• Motivation: flexible representation of data
– Often, data comes from multiple sources with differences in notation, meaning, etc.
• Motivation: sharing of documents among systems and databases
3
• Nodes = objects• Labels on arcs (attributes, relationships)• Atomic values at leaf nodes (nodes with no
arcs out)• Flexibility: no restriction on:
– Labels out of a node– Number of successors with a given label
Graphs of Semistructured Data
4
Bud
A.B.
Gold1995
MapleJoe’s
M’lob
beer beerbar
manfmanf
servedAt
name
namename
addr
prize
year award
root
The bar objectfor Joe’s Bar
The beer objectfor Bud
Example: Data Graph
5
XML
HTML• Uses tags for formatting the presentation (e.g.,
“italic”)• Hard for applications to process
XML = Extensible Markup Language• Uses tags for semantics
(e.g., “this is an address”)– Similar to labels in semistructured data
• Allows you to invent your own tags• Easy for applications to process
6
HTML XML
<html> <body> <h1> Bibliography </h1> <p> <i>Foundations of Databases</i> Abiteboul, Hull, Vianu <br/> Addison Wesley, 1995 </p> <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br/> Morgan Kaufmann, 1999 </p> </body></html>
<?xml version = “1.0” standalone = “yes” ?><bibliography> <book> <title>Foundations of Databases</title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> …</bibliography>
7
Why XML is of Interest to Us
• XML is just syntax for data– Note: we have no syntax for relational data– But XML is not relational: semistructured
• This is exciting because:– Can translate any data to XML– Can ship XML over the Web (HTTP, SOAP)– Can input XML into any application– Thus: data sharing and exchange on the Web
8
XML Data Sharing and Exchange
Applications
RelationalDB
WebSite Web
Service
Transform Integrate
Warehouse
XML DataWeb
(HTTP, SOAP)
XMLDB
Applications
9
XML Tags & Elements
• Tags: book, title, author, …– XML tags are case sensitive
• Tags, as in HTML, are normally matched pairs– <book> … </book>– Start tag: <book>, End tag: </book>
• Elements: everything between tags– Example 1: <title>Foundations of Databases</title>– Example 2: <book>
<title>Foundations of Databases</title> </book>
• Elements may be nested arbitrarily• Empty element: <book></book>
– Abbreviation <book/>
10
XML Attributes
<book price = “55” currency = “USD”> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year></book>
• Attributes are alternative ways to represent data
11
Replacing Attributes with Elements
<book> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> <price> 55 </price> <currency> USD </currency></book>
12
Elements vs. Attributes
• Too many attributes make documents hard to read
• Attributes do not specify document structure• Attributes are good for simple information
13
More XML: CDATA Section
• Syntax: <![CDATA[ .....any text here...]]>
• Example:
<example> <![CDATA[ some text here </notAtag> <>]]>
</example>
14
More XML: Entity References
< <
> >
& &
' ‘
" “
& Unicode char
• Syntax: &entityname;• Example:
<element> this is less than < </element>• Some entities:
15
More XML: Comments
• Syntax <!-- .... Comment text... -->
• Yes, they are part of the data model !!!
16
<data> <person age=“25” > <name> Mary </name> <address> <street> Maple </street> <no> 345 </no> <city> Seattle </city> </address> </person> <person> <name> John </name> <address>Thailand</address> <phone> 23456 </phone> </person></data>
XML Semantics: a Tree !
data
Mary
person
person
name address
name address
street no city
Maple 345 Seattle
JohnThai
phone
23456
age
25
Elementnode
Textnode
Attributenode
• Order matters!!!
17
Well-Formed XML
• Start the document with a declaration, surrounded by <?xml … ?>
• Normal declaration is:<?xml version = “1.0” standalone = “yes” ?>– “Standalone” = “no DTD provided”
• Has single root element surrounding nested elements
• Has matching tags
18
XML Data
• XML is self-describing• Schema elements become part of the data
– Relational schema: person(name, phone)– In XML <person>, <name>, <phone> are part of
the data, and are repeated many times
• Consequence: XML is much more flexible• XML = semistructured data
– Well-Formed XML with nested tags is exactly the same idea as trees of semistructured data
– XML also enables nontree structures, as does the semistructured data model
19
XML is Semistructured Data
• Missing attributes:
• Could represent ina table with nulls
<person> <name> John</name> <phone>1234</phone></person>
<person> <name>Joe</name></person>
<person> <name> John</name> <phone>1234</phone></person>
<person> <name>Joe</name></person>
no phone !
name phone
John 1234
Joe -
20
XML is Semistructured Data
• Repeated attributes
• Impossible in tables:
<person> <name>Mary</name> <phone>2345</phone> <phone>3456</phone></person>
<person> <name>Mary</name> <phone>2345</phone> <phone>3456</phone></person>
two phones !
name phone
Mary 2345 3456 ???
21
XML is Semistructured Data
• Attributes with different types in different objects
• Nested collections (no 1NF)• Heterogeneous collections:
– <db> contains both <book>s and <publisher>s
<person> <name> <first>John</first> <last>Smith</last> </name> <phone>1234</phone></person>
<person> <name> <first>John</first> <last>Smith</last> </name> <phone>1234</phone></person>
structured name !
22
Document Type Definition (DTD)
• Part of the original XML specification• An XML document may have a DTD• Valid XML: if it has a DTD and conforms to it• Validation is useful in data exchange
23
Very Simple DTD
<!DOCTYPE db [ <!ELEMENT db ((book|publisher)*)> <!ELEMENT book (title,author*,year?)> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)> <!ELEMENT year (#PCDATA)> <!ELEMENT publisher (#PCDATA)>]>
<!DOCTYPE db [ <!ELEMENT db ((book|publisher)*)> <!ELEMENT book (title,author*,year?)> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)> <!ELEMENT year (#PCDATA)> <!ELEMENT publisher (#PCDATA)>]>
24
• Content model:<!ELEMENT tag (CONTENT)>
– Complex = a regular expression over other elements
– Text-only = #PCDATA– Empty = EMPTY– Any = ANY– Mixed content= (#PCDATA | A | B | C)*
DTD: The Content Model
contentmodel
25
<person> <name>…</name> <phone>…</phone> <phone>…</phone> <phone>…</phone> …</person><person> <name>…</name></person>
<person> <name>…</name> <phone>…</phone> <phone>…</phone> <phone>…</phone> …</person><person> <name>…</name></person>
<person> <name>…</name> <phone>…</phone> <phone>…</phone> …</person><person> <name>…</name> <phone>…</phone></person>
<person> <name>…</name> <phone>…</phone> <phone>…</phone> …</person><person> <name>…</name> <phone>…</phone></person>
<person> <name>…</name> <phone>…</phone></person><person> <name>…</name> <email>…</email></person>
<person> <name>…</name> <phone>…</phone></person><person> <name>…</name> <email>…</email></person>
<name> <lastName>…</lastName></name><name> <firstName>…</firstName> <lastName>…</lastName></name>
<name> <lastName>…</lastName></name><name> <firstName>…</firstName> <lastName>…</lastName></name>
<name> <firstName>…</firstName> <lastName>…</lastName></name>
<name> <firstName>…</firstName> <lastName>…</lastName></name>
DTD: Regular Expressions
<!ELEMENT name (firstName, lastName))<!ELEMENT name (firstName, lastName))
<!ELEMENT name (firstName?, lastName))<!ELEMENT name (firstName?, lastName))
<!ELEMENT person (name, phone*))<!ELEMENT person (name, phone*))
sequence
optional
<!ELEMENT person (name, (phone|email)))<!ELEMENT person (name, (phone|email)))
zero or more
alternation
DTD XML
<!ELEMENT person (name, phone+))<!ELEMENT person (name, phone+))
one or more
26
DTD: Attributes
<!ELEMENTperson (ssn, name, office, phone?)><!ATTLIST person age CDATA #REQUIRED
height CDATA #IMPLIED>
<!ELEMENTperson (ssn, name, office, phone?)><!ATTLIST person age CDATA #REQUIRED
height CDATA #IMPLIED>
<person age=“25”height=“6”>
<name> ...</name> ...</person>
<person age=“25”height=“6”>
<name> ...</name> ...</person>
27
DTD: Attributes
<!ATTLIST tag (name type kind)+>Types:• CDATA = string• (Mon | Wed | Fri) = enumeration• ID = key• IDREF = foreign key• IDREFS = foreign keys separated by space• others = rarely usedKind:• #REQUIRED• #IMPLIED = optional• “value” = default value• “value” #FIXED = the only value allowed
28
XML: IDs and References
• Attributes can be pointers from one object to another– Compare to HTML’s
NAME = “foo” and HREF = “#foo”
• Allows the structure of an XML document to be a general graph, rather than just a tree
29
XML: Creating ID’s
• Give an element E an attribute A of type ID• When using tag <E> in an XML document,
give its attribute A a unique value• Example:
<E A = “xyz”>
30
XML: Creating References
• To allow objects of type F to refer to another object with an ID attribute, give F an attribute of type IDREF
• Or, let the attribute have type IDREFS, so the F –object can refer to any number of other objects
31
<person id=“o555”> <name>Jane</name></person>
<person id=“o456”> <name> Mary </name> <children idref=“o123 o555”/></person>
<person id=“o123” mother=“o456”>
<name>John</name></person>
<person id=“o555”> <name>Jane</name></person>
<person id=“o456”> <name> Mary </name> <children idref=“o123 o555”/></person>
<person id=“o123” mother=“o456”>
<name>John</name></person>
XML: IDs and References
• IDs and references in XML are just syntax
32
DTD: ID and IDREF(S) Attributes
<!ELEMENT person (ssn, name, office, phone?)><!ATTLIS person age CDATA #REQUIRED
id ID #REQUIREDmanager IDREF #REQUIREDmanages IDREFS #REQUIRED
>
<!ELEMENT person (ssn, name, office, phone?)><!ATTLIS person age CDATA #REQUIRED
id ID #REQUIREDmanager IDREF #REQUIREDmanages IDREFS #REQUIRED
>
<person age=“25” id=“p29432” manager=“p48293”
manages=“p34982 p423234”> <name> ....</name> ...</person>
<person age=“25” id=“p29432” manager=“p48293”
manages=“p34982 p423234”> <name> ....</name> ...</person>
33
Use of DTDs
1. Set standalone = “no”2. Either:
a) Include the DTD as a preamble of the XML document, or
b) Follow DOCTYPE and the <root tag> by SYSTEM and a path to the file where the DTD can be found, or
c) Mix the two... (e.g. to override the external definition)
34
<?xml version = “1.0” standalone = “no” ?><!DOCTYPE BARS [
<!ELEMENT BARS (BAR*)><!ELEMENT BAR (NAME, BEER+)><!ELEMENT NAME (#PCDATA)><!ELEMENT BEER (NAME, PRICE)><!ELEMENT PRICE (#PCDATA)>
]><BARS>
<BAR><NAME>Joe’s Bar</NAME><BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER><BEER><NAME>Miller</NAME>
<PRICE>3.00</PRICE></BEER></BAR> <BAR> …
</BARS>
The DTD
The document
Example (a)
35
• Assume the BARS DTD is in file bar.dtd<?xml version = “1.0” standalone = “no” ?><!DOCTYPE BARS SYSTEM “bar.dtd”><BARS>
<BAR><NAME>Joe’s Bar</NAME><BEER><NAME>Bud</NAME>
<PRICE>2.50</PRICE></BEER><BEER><NAME>Miller</NAME>
<PRICE>3.00</PRICE></BEER></BAR><BAR> …
</BARS>
Example (b)
Get the DTDfrom the filebar.dtd
36
DTDs as Grammars
<!DOCTYPE db [ <!ELEMENT db ((book|publisher)*)> <!ELEMENT book (title,author*,year?)> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)> <!ELEMENT year (#PCDATA)> <!ELEMENT publisher (#PCDATA)>]>
<!DOCTYPE db [ <!ELEMENT db ((book|publisher)*)> <!ELEMENT book (title,author*,year?)> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)> <!ELEMENT year (#PCDATA)> <!ELEMENT publisher (#PCDATA)>]>
37
DTDs as Grammars
Same thing as:
• A DTD is a EBNF (Extended BNF) grammar• An XML tree is precisely a derivation tree• A valid XML document = a parse tree for that
grammar
db ::= (book|publisher)*book ::= (title,author*,year?)title ::= stringauthor ::= stringyear ::= stringpublisher ::= string
db ::= (book|publisher)*book ::= (title,author*,year?)title ::= stringauthor ::= stringyear ::= stringpublisher ::= string
38
DTDs as Grammars
<!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT section ((title,section*) | text)> <!ELEMENT title (#PCDATA)> <!ELEMENT text (#PCDATA)>]>
<!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT section ((title,section*) | text)> <!ELEMENT title (#PCDATA)> <!ELEMENT text (#PCDATA)>]>
<paper> <section> <text> </text> </section> <section> <title> </title> <section> … </section> <section> … </section> </section></paper>
• XML documents can be nested arbitrarily deep
39
DTDs as Schemas
Not so well suited:• impose unwanted constraints on order:
– <!ELEMENT person (name,phone)>
• references cannot be constrained– ID/IDREFS can reference any ID
• can be too vague:– <!ELEMENT person ((name|phone|email)*)>
40
DTDs as Schemas
No context-dependant typing
• Cannot distinguish between used car ads and new car ads– Different structure in different contexts
dealer
UsedCars
NewCars
ad
ad
model year year
41
XML APIs
• Document Object Model - DOM– Manipulation of XML Data– Provides a representation of an XML Document as a tree– Reads XML Document into memory– http://www.w3.org/DOM– Many implementations (Sun JAXP, Apache Xerces, …)
• Simple API for XML - SAX– Event-based framework for parsing XML data– http://www.saxproject.org/
42
References
• Lecture Slides– Jeffrey D. Ullman
– http://www-db.stanford.edu/~ullman/dscb/pslides/pslides.html
– Dan Suciu– http://www.cs.washington.edu/homes/suciu/COURSES/590DS/
02xmlsyntax.htm– http://www.cs.washington.edu/homes/suciu/COURSES/590DS/11dtd.htm
– Alon Levy– http://www.cs.washington.edu/education/courses/csep544/02sp/lectures/
lecture5cut.ppt
• BRICS XML Tutorial– A. Moeller, M. Schwartzbach
– http://www.brics.dk/~amoeller/XML/index.html
• W3C's XML homepage– http://www.w3.org/XML
• XML School: an XML tutorial – http://www.w3schools.com/xml