1 XML and Databases

Post on 21-Dec-2015


1

XML and Databases

2

Outline (ambitious)

• Background: documents (SGML/HTML) and databases (structured and semistructured data)

• XML Basics and Document Type Descriptors

• XML query languages: XML-QL and XSL.

• XML additions: Xlink, Xpointer, RDF, SOX, XML-Data

• Document Object Model (XML API's)

3

Some Useful Articles

XML, Java, and the future of the web: http://webreview.com/wr/pub/97/12/19/xml/index.html

XML and the Second-Generation Web: http://www.sciam.com/1999/0599issue/0599bosak.html

Articles/standards for XML, XSL, XML-QL http://www.w3c.org/

http://www.w3.org/TR/REC-xml

4

Background

What’s the difference between the world of documents and information retrieval, and the world of databases and query interfaces?

5

Documents vs Databases

Document world

> plenty of small documents
> usually static
> implicit structure: section, paragraph, toc
> tagging
> human friendly
> content: form/layout, annotation
> paradigms: “Save as”, wysiwyg
> meta-data: author name, date, subject

Database world

> a few large databases
> usually dynamic
> explicit structure (schema)
> records
> machine friendly
> content: schema, data, methods
> paradigms: Atomicity, Consistency, Isolation, Durability
> meta-data: schema description

6

What to do with them

Documents

• editing

• printing

• spell-checking

• counting words

• retrieving (IR)

• searching

Database

• updating

• cleaning

• querying

• composing/transforming

7

HTML

• Publishing hypertext on the World Wide Web
• Designed to describe how a Web browser should arrange text, images and push-buttons on a page.
• Easy to learn, but does not convey structure.
• Fixed tag set.

<HTML>
<HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>
<BODY>
<H1>Introduction</H1>
<IMG SRC="dragon.jpeg" WIDTH="200" HEIGHT="150">
</BODY>
</HTML>

[Slide callouts on the markup: opening tag; text (PCDATA); closing tag; “bachelor” tag; attribute name; attribute value]

8

The Structure of XML

• XML consists of tags and text

• Tags come in pairs: <date> ... </date>

• They must be properly nested:
<date> <day> ... </day> ... </date> --- good
<date> <day> ... </date> ... </day> --- bad

(You can’t do <i> ... <b> ... </i> ... </b> in HTML)
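The nesting rule can be checked mechanically by any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed`, the `<month>` element and the sample values are made up for illustration and are not part of the slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested: every tag closes before its parent does.
good = "<date> <day> 21 </day> <month> 12 </month> </date>"
# Badly nested: </date> arrives while <day> is still open.
bad = "<date> <day> 21 </date> </day>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

The parser reports the second string as an error because the closing tag does not match the most recently opened element.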

9

XML text

XML has only one “basic” type -- text.

It is bounded by tags, e.g.
<title> The Big Sleep </title>
<year> 1935 </year> --- 1935 is still text

XML text is called PCDATA (for parsed character data). Characters are Unicode and can be written as numeric character references, e.g. &#x05DE; for the Hebrew letter Mem.

Later we shall see how new types are specified by XML-Data
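To make the "everything is text" point concrete, here is a small sketch with Python's standard library. The `<movie>` and `<letter>` wrapper elements are assumptions added only so that the fragments parse; the numeric reference used is U+05DE, which the Unicode database names HEBREW LETTER MEM.

```python
import unicodedata
import xml.etree.ElementTree as ET

movie = ET.fromstring(
    "<movie> <title> The Big Sleep </title> <year> 1935 </year> </movie>"
)

# The parser hands back a string, including the surrounding spaces.
year_text = movie.find("year").text
print(type(year_text).__name__, repr(year_text))

# Any typing is the application's job: convert explicitly when a number is needed.
year = int(year_text.strip())

# A numeric character reference is resolved to a single Unicode character.
mem = ET.fromstring("<letter>&#x05DE;</letter>").text
print(unicodedata.name(mem))
```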

10

XML structure

Nesting tags can be used to express various structures. E.g. a tuple (record):

<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <email> [email protected] </email>
</person>

11

XML structure (cont.)

• We can represent a list by using the same tag repeatedly:

<addresses>
  <person> ... </person>
  <person> ... </person>
  <person> ... </person>
  ...
</addresses>

12

Terminology

The segment of an XML document between an opening and a corresponding closing tag is called an element.

<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <tel> (215) 898 4321 </tel>
  <email> [email protected] </email>
</person>

[Slide callouts on the markup: element; not an element; element, a sub-element of]

13

XML is tree-like

person
  name: Malcolm Atchison
  tel: (215) 898 4321
  tel: (215) 898 4321
  email: [email protected]

Semistructured data models typically put the labels on the edges
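The tree view is exactly what a parser builds in memory. As a rough sketch, the code below walks the person element with Python's `xml.etree.ElementTree` and prints one indented line per node. The function name `outline` is hypothetical, and the e-mail address is a placeholder because the address is obscured in the transcript.

```python
import xml.etree.ElementTree as ET

doc = """<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <tel> (215) 898 4321 </tel>
  <email> someone@example.org </email>
</person>"""  # placeholder e-mail address

def outline(elem, depth=0):
    """Return the tree as indented lines: the tag, plus its text if non-blank."""
    text = (elem.text or "").strip()
    lines = ["  " * depth + elem.tag + (": " + text if text else "")]
    for child in elem:  # children are visited in document order
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(doc)
print("\n".join(outline(root)))
```

Here the labels sit on the nodes; an edge-labelled model, as mentioned above, would attach `name`, `tel` and `email` to the edges leading from the person node to the text values.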

14

Mixed Content

An element may contain a mixture of sub-elements and PCDATA

<airline>
  <name> British Airways </name>
  <motto> World’s <dubious> favorite</dubious> airline </motto>
</airline>

Data of this form is not typically generated from databases. It is needed for consistency with HTML
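Mixed content is also where tree APIs become slightly awkward. The sketch below shows how Python's `xml.etree.ElementTree` exposes the three pieces of text in the motto: `text` holds what precedes the first sub-element and `tail` holds what follows a sub-element's closing tag. A plain ASCII apostrophe is used in place of the slide's typographic one.

```python
import xml.etree.ElementTree as ET

motto = ET.fromstring(
    "<motto> World's <dubious> favorite</dubious> airline </motto>"
)
dubious = motto.find("dubious")

print(repr(motto.text))    # text before the first sub-element
print(repr(dubious.text))  # text inside <dubious>
print(repr(dubious.tail))  # text after </dubious>, still inside <motto>

# itertext() yields all the text fragments in document order.
full = "".join(motto.itertext())
print(" ".join(full.split()))
```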

15

A Complete XML Document

<?xml version="1.0"?>
<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <email> [email protected] </email>
</person>

16

Two ways of representing a DB

projects: title, budget, managedBy

employees: name, ssn, age

17

Project and Employee relations in XML

<db>
  <project>
    <title> Pattern recognition </title>
    <budget> 10000 </budget>
    <managedBy> Joe </managedBy>
  </project>
  <employee>
    <name> Joe </name>
    <ssn> 344556 </ssn>
    <age> 34 </age>
  </employee>
  <employee>
    <name> Sandra </name>
    <ssn> 2234 </ssn>
    <age> 35 </age>
  </employee>
  <project>
    <title> Auto guided vehicle </title>
    <budget> 70000 </budget>
    <managedBy> Sandra </managedBy>
  </project>
  ...
</db>

Projects and employees are intermixed

18

Project and Employee relations in XML (cont’d)

<db>
  <projects>
    <project>
      <title> Pattern recognition </title>
      <budget> 10000 </budget>
      <managedBy> Joe </managedBy>
    </project>
    <project>
      <title> Auto guided vehicles </title>
      <budget> 70000 </budget>
      <managedBy> Sandra </managedBy>
    </project>
    ...
  </projects>
  <employees>
    <employee>
      <name> Joe </name>
      <ssn> 344556 </ssn>
      <age> 34 </age>
    </employee>
    <employee>
      <name> Sandra </name>
      <ssn> 2234 </ssn>
      <age> 35 </age>
    </employee>
    ...
  </employees>
</db>

Employees follow projects

19

Project and Employee relations in XML (cont’d)

<db>
  <projects>
    <title> Pattern recognition </title>
    <budget> 10000 </budget>
    <managedBy> Joe </managedBy>
    <title> Auto guided vehicles </title>
    <budget> 70000 </budget>
    <managedBy> Sandra </managedBy>
    ...
  </projects>
  <employees>
    <name> Joe </name>
    <ssn> 344556 </ssn>
    <age> 34 </age>
    <name> Sandra </name>
    <ssn> 2234 </ssn>
    <age> 35 </age>
    ...
  </employees>
</db>

Or without “separator” tags …
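Dropping the separator tags pushes work onto the reader of the data: records have to be recovered from position alone. The sketch below regroups the flat employees list into records, assuming every record has exactly the fields name, ssn and age in that order. The function name `regroup` and that fixed-order assumption are illustrative, not something the slides prescribe.

```python
import xml.etree.ElementTree as ET

flat = """<employees>
  <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age>
  <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age>
</employees>"""

FIELDS = ("name", "ssn", "age")  # assumed fixed field order

def regroup(parent, fields=FIELDS):
    """Cut the children into consecutive chunks of len(fields) and build dicts."""
    children = list(parent)
    if len(children) % len(fields) != 0:
        raise ValueError("child count is not a multiple of the field count")
    records = []
    for start in range(0, len(children), len(fields)):
        chunk = children[start:start + len(fields)]
        tags = tuple(child.tag for child in chunk)
        if tags != tuple(fields):
            raise ValueError(f"unexpected tag order {tags}")
        records.append({child.tag: child.text.strip() for child in chunk})
    return records

employees = regroup(ET.fromstring(flat))
print(employees)
```

With `<employee>` separator tags the same result would come from a plain loop over the children, and optional or reordered fields would not break the grouping.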

20

Attributes

An (opening) tag may contain attributes. These are typically used to describe the content of an element.

<entry>
  <word language="en"> cheese </word>
  <word language="fr"> fromage </word>
  <word language="ro"> branza </word>
  <meaning> A food made … </meaning>
</entry>

Order of attributes in an element does not matter.

XML elements are ordered.

21

Attributes (cont’d)

Another common use for attributes is to express dimension or type

<picture>
  <height dim="cm"> 2400 </height>
  <width dim="in"> 96 </width>
  <data encoding="gif" compression="zip"> M05-.+C$@02!G96YE<FEC ... </data>
</picture>

A document that obeys the “nested tags” rule and does not repeat an attribute within a tag is said to be well-formed.
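Both halves of this slide can be tried out with a parser. In this sketch (Python standard library; the helper name `parses` and the abbreviated `<entry>` are assumptions for illustration) attributes are read with `get`, and a tag that repeats an attribute is rejected as not well-formed.

```python
import xml.etree.ElementTree as ET

entry = ET.fromstring(
    "<entry>"
    '<word language="en"> cheese </word>'
    '<word language="fr"> fromage </word>'
    "</entry>"
)

# Attributes are name/value pairs attached to the opening tag.
by_language = {w.get("language"): w.text.strip() for w in entry.findall("word")}
print(by_language)

def parses(text):
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Repeating an attribute within one tag violates well-formedness.
repeated = '<height dim="cm" dim="in"> 2400 </height>'
print(parses(repeated))
```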

22

When to use attributes

It’s not always clear when to use attributes

<person ssno="123 45 6789">
  <name> F. MacNiel </name>
  <email> [email protected] </email>
  ...
</person>

<person>
  <ssno> 123 45 6789 </ssno>
  <name> F. MacNiel </name>
  <email> [email protected] </email>
  ...
</person>

23

XML Misc.

Apart from elements and attributes, XML allows processing instructions and comments. A processing instruction is a statement of the form:

<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>

A comment takes the following form: enclose comments between <!-- and -->

<!-- A Comment -->

24

Document Type Descriptors

Imposing structure on XML documents

25

Document Type Descriptors

• Document Type Descriptors (DTDs), which the XML specification calls document type definitions, impose structure on an XML document.

• There is some relationship between a DTD and a schema, but it is not close -- hence the need for additional “typing” systems.

• The DTD is a syntactic specification.

26

Example: The Address Book

<person>

<name> MacNiel, John </name>

<greet> Dr. John MacNiel </greet>

<addr>1234 Huron Street </addr>

<addr> Rome, OH 98765 </addr>

<tel> (321) 786 2543 </tel>

<fax> (321) 786 2543 </fax>

<tel> (321) 786 2543 </tel>

<email> [email protected] </email>

</person>

[Slide callouts on the markup:]

• Exactly one name

• At most one greeting

• As many address lines as needed (in order)

• Mixed telephones and faxes

• As many as needed

27

Specifying the structure

• name to specify a name element

• greet? to specify an optional (0 or 1) greet element

• name,greet? to specify a name followed by an optional greet

28

Specifying the structure (cont)

• addr* to specify 0 or more address lines

• tel | fax a tel or a fax element

• (tel | fax)* 0 or more repeats of tel or fax

• email* 0 or more email elements

29

Specifying the structure (cont)

So the whole structure of a person entry is specified by

name, greet?, addr*, (tel | fax)*, email*

This is known as a regular expression. Why is it important?
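One practical consequence is that checking a person entry against this structure reduces to ordinary regular-expression matching over the sequence of child tags. The sketch below spells the expression out by hand in Python's `re` syntax, writing each tag name followed by a space so that names cannot run together. The names `PERSON` and `matches_person` are hypothetical, and the test elements are empty stand-ins for real entries.

```python
import re
import xml.etree.ElementTree as ET

# name, greet?, addr*, (tel | fax)*, email*
# Each tag is encoded as its name plus a trailing space.
PERSON = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")

def matches_person(elem):
    """True if the child tags of `elem`, in order, match the person structure."""
    tags = "".join(child.tag + " " for child in elem)
    return PERSON.fullmatch(tags) is not None

ok = ET.fromstring(
    "<person><name/><greet/><addr/><addr/><tel/><fax/><tel/><email/></person>"
)
wrong = ET.fromstring("<person><greet/><name/></person>")

print(matches_person(ok))
print(matches_person(wrong))
```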

30

Regular Expressions

Each regular expression determines a corresponding finite state automaton. Let’s start with a simpler example:

name, addr*, email

This suggests a simple parsing program

[Figure: finite state automaton with edges labelled name, addr, email]
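The "simple parsing program" can be written directly as a transition table. This is a minimal sketch for name, addr*, email; the state numbering and the names `TRANSITIONS`, `ACCEPTING` and `accepts` are choices made here for illustration, not taken from the slide's diagram.

```python
# States: 0 = start, 1 = name seen (addr may repeat here), 2 = email seen.
TRANSITIONS = {
    (0, "name"): 1,
    (1, "addr"): 1,
    (1, "email"): 2,
}
ACCEPTING = {2}

def accepts(tags):
    """Run the automaton over a sequence of tag names."""
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:  # no edge for this tag: reject immediately
            return False
    return state in ACCEPTING

print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "addr"]))
```

Each tag is examined once and only the current state is kept, so a document can be checked in a single pass while it is being read.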

31

Another example

name,address*,(tel | fax)*,email*

[Figure: finite state automaton with edges labelled name, address, tel, fax, email]

Adding in the optional greet further complicates things

32

A DTD for the address book

<!DOCTYPE addressbook [
  <!ELEMENT addressbook (person*)>
  <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT greet (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT tel (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>

33

Our relational DB revisited

projects: title, budget, managedBy

employees: name, ssn, age

34

Two DTDs for the relational DB

<!DOCTYPE db [
  <!ELEMENT db (projects,employees)>
  <!ELEMENT projects (project*)>
  <!ELEMENT employees (employee*)>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>

<!DOCTYPE db [
  <!ELEMENT db (project | employee)*>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>

35

Some things are hard to specify

Each employee element is to contain name, age and ssn elements in some order.

<!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) | ... )>

Suppose there were many more fields !
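The cost is easy to quantify: allowing n fields in any order takes n! alternatives. The sketch below generates the full content model for the three employee fields and prints the count for ten fields; the function name `any_order_model` is hypothetical.

```python
from itertools import permutations
from math import factorial

def any_order_model(fields):
    """Build a DTD content model that accepts the fields in every possible order."""
    alternatives = ["(" + ", ".join(p) + ")" for p in permutations(fields)]
    return "(" + " | ".join(alternatives) + ")"

model = any_order_model(["name", "age", "ssn"])
print(f"<!ELEMENT employee {model}>")

# Number of alternatives the same trick would need for ten fields.
print(factorial(10))
```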

36

Summary of XML regular expressions

• A          The tag A occurs
• e1,e2      The expression e1 followed by e2
• e*         0 or more occurrences of e
• e?         Optional -- 0 or 1 occurrences
• e+         1 or more occurrences
• e1 | e2    either e1 or e2
• (e)        grouping
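Because these operators line up almost one-for-one with those of ordinary regular expressions, a content model can be translated mechanically. The following sketch converts a model into a Python `re` pattern over space-terminated tag names. The names `model_to_regex` and `conforms` are made up here, and the translation assumes a model built only from the operators listed above (no #PCDATA, EMPTY or ANY).

```python
import re

# A tag name, or one of the operator characters from the summary.
TOKEN = re.compile(r"[A-Za-z_][\w.-]*|[()|*?+,]")

def model_to_regex(model):
    """Translate a DTD content model into a compiled Python regular expression."""
    parts = []
    for tok in TOKEN.findall(model):
        if tok == ",":
            continue                  # sequence is plain concatenation
        elif tok == "(":
            parts.append("(?:")       # grouping, without capturing
        elif tok in ")|*?+":
            parts.append(tok)         # same meaning in both notations
        else:
            parts.append("(?:" + re.escape(tok) + " )")  # tag name plus space
    return re.compile("".join(parts))

def conforms(model, tags):
    """True if the sequence of tag names matches the content model."""
    encoded = "".join(tag + " " for tag in tags)
    return model_to_regex(model).fullmatch(encoded) is not None

PERSON = "(name, greet?, address*, (fax | tel)*, email*)"
print(conforms(PERSON, ["name", "address", "tel", "fax"]))
print(conforms(PERSON, ["greet", "name"]))
```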

37

Specifying attributes in the DTD

<!ELEMENT height (#PCDATA)>
<!ATTLIST height
  dimension CDATA #REQUIRED
  accuracy CDATA #IMPLIED >

The dimension attribute is required; the accuracy attribute is optional.

CDATA is the “type” of the attribute -- it means string.

38

The DTD Language

• Default modifiers in DTD attributes:

Modifier     Description

#REQUIRED    The attribute’s value must be specified with the element.
#IMPLIED     The attribute value can remain unspecified.
#FIXED       The attribute value is fixed and cannot be changed by the user.

39

The DTD Language

• Datatypes in DTD attributes:

Type         Description

CDATA        Character data
enumerated   A series of values of which only 1 can be chosen
ENTITY       An entity declared in the DTD
ENTITIES     Multiple whitespace separated entities declared in the DTD
ID           A unique element identifier
IDREF        The value of a unique ID type attribute
IDREFS       Multiple whitespace separated IDREFs of elements
NMTOKEN      An XML name token
NMTOKENS     Multiple whitespace separated XML name tokens
NOTATION     A notation declared in the DTD

40

Consistency of ID and IDREF attribute values

• If an attribute is declared as ID
– the associated values must all be distinct (no confusion)
– ID is a poor cousin of a key in relational databases.

• If an attribute is declared as IDREF
– the associated value must exist as the value of some ID attribute (no dangling “pointers”)
– IDREF is a poor cousin of a foreign key in relational databases.

• Similarly for all the values of an IDREFS attribute
– An attribute of type IDREFS represents a space-separated list of references to valid IDs.

• ID and IDREF attributes are not typed

41

Specifying ID and IDREF attributes

<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (name)>
  <!ELEMENT name (#PCDATA)>
  <!ATTLIST person
    id ID #REQUIRED
    mother IDREF #IMPLIED
    father IDREF #IMPLIED
    children IDREFS #IMPLIED>
]>

42

Some conforming data

<family>
  <person id="jane" mother="mary" father="john">
    <name> Jane Doe </name>
  </person>
  <person id="john" children="jane jack">
    <name> John Doe </name>
  </person>
  <person id="mary" children="jane jack">
    <name> Mary Doe </name>
  </person>
  <person id="jack" mother="mary" father="john">
    <name> Jack Doe </name>
  </person>
</family>
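Python's standard parser does not validate against a DTD, so the two ID/IDREF rules are re-implemented by hand in the sketch below for the family data: IDs must be distinct, and every mother, father or children reference must name an existing ID. The function name `check_references` and the wording of its messages are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

family = ET.fromstring("""<family>
  <person id="jane" mother="mary" father="john"><name> Jane Doe </name></person>
  <person id="john" children="jane jack"><name> John Doe </name></person>
  <person id="mary" children="jane jack"><name> Mary Doe </name></person>
  <person id="jack" mother="mary" father="john"><name> Jack Doe </name></person>
</family>""")

def check_references(root):
    """Return a list of problems: repeated IDs and dangling references."""
    problems = []
    ids = set()
    for person in root.iter("person"):
        pid = person.get("id")
        if pid in ids:
            problems.append(f"duplicate id {pid!r}")
        ids.add(pid)
    for person in root.iter("person"):
        # IDREF attributes hold one reference, IDREFS a space-separated list.
        refs = [person.get("mother"), person.get("father")]
        refs += (person.get("children") or "").split()
        for ref in refs:
            if ref is not None and ref not in ids:
                problems.append(
                    f"dangling reference {ref!r} in {person.get('id')!r}"
                )
    return problems

print(check_references(family))
```

Nothing in this check confirms that a mother reference points at a person rather than some other kind of element, which is the sense in which ID and IDREF attributes are untyped.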

43

An alternative specification

<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (name, mother?, father?, children?)>
  <!ATTLIST person id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT mother EMPTY>
  <!ATTLIST mother idref IDREF #REQUIRED>
  <!ELEMENT father EMPTY>
  <!ATTLIST father idref IDREF #REQUIRED>
  <!ELEMENT children EMPTY>
  <!ATTLIST children idrefs IDREFS #REQUIRED>
]>

44

The revised data

<family>
  <person id="jane">
    <name> Jane Doe </name>
    <mother idref="mary"></mother>
    <father idref="john"></father>
  </person>
  <person id="john">
    <name> John Doe </name>
    <children idrefs="jane jack"> </children>
  </person>
  ...
</family>

45

The DTD Language

• Example: Sales Order Document

“An order document is comprised of several sales orders. Each individual order has a number and it contains the customer information, the date when the order was received, and the items ordered. Each customer has a number, a name, street, city, state, and ZIP code. Each item has an item number, parts information and a quantity. The parts information contains a number, a description of the product and its unit price.

The numbers should be treated as attributes.”

46

The DTD Language

• Example: Sales Order Document DTD

<!-- DTD for example sales order document -->
<!ELEMENT Orders (SalesOrder+)>
<!ELEMENT SalesOrder (Customer,OrderDate,Item+)>
<!ELEMENT Customer (CustName,Street,City,State,ZIP)>
<!ELEMENT OrderDate (#PCDATA)>
<!ELEMENT Item (Part,Quantity)>
<!ELEMENT Part (Description,Price)>
<!ELEMENT CustName (#PCDATA)>
<!ELEMENT Street (#PCDATA)>
<!ELEMENT ... (#PCDATA)>
<!ATTLIST SalesOrder SONumber CDATA #REQUIRED>
<!ATTLIST Customer CustNumber CDATA #REQUIRED>
<!ATTLIST Part PartNumber CDATA #REQUIRED>
<!ATTLIST Item ItemNumber CDATA #REQUIRED>

47

The DTD Language

• Example: Sales Order XML Document

<Orders>
  <SalesOrder SONumber="12345">
    <Customer CustNumber="543">
      <CustName>ABC Industries</CustName>
      <Street>123 Main St.</Street>
      <City>Chicago</City>
      <State>IL</State>
      <ZIP>60609</ZIP>
    </Customer>
    <OrderDate>10222000</OrderDate>
    <Item ItemNumber="1">
      <Part PartNumber="234">
        <Description>Turkey wrench</Description>
        <Price>9.95</Price>
      </Part>
      <Quantity>10</Quantity>
    </Item>
  </SalesOrder>
</Orders>
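A document shaped like this is straightforward to consume once its structure is known. As an illustrative sketch, the code below totals each sales order from its items, using `Decimal` so that prices are not subject to binary floating-point rounding. The function name `order_totals` is hypothetical, and the order-total calculation is an example of use rather than anything the slides specify.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

orders = ET.fromstring("""<Orders>
  <SalesOrder SONumber="12345">
    <Customer CustNumber="543">
      <CustName>ABC Industries</CustName>
      <Street>123 Main St.</Street>
      <City>Chicago</City>
      <State>IL</State>
      <ZIP>60609</ZIP>
    </Customer>
    <OrderDate>10222000</OrderDate>
    <Item ItemNumber="1">
      <Part PartNumber="234">
        <Description>Turkey wrench</Description>
        <Price>9.95</Price>
      </Part>
      <Quantity>10</Quantity>
    </Item>
  </SalesOrder>
</Orders>""")

def order_totals(root):
    """Map each SONumber to the sum of price * quantity over its items."""
    totals = {}
    for order in root.findall("SalesOrder"):
        total = Decimal("0")
        for item in order.findall("Item"):
            price = Decimal(item.findtext("Part/Price"))
            quantity = int(item.findtext("Quantity"))
            total += price * quantity
        totals[order.get("SONumber")] = total
    return totals

print(order_totals(orders))
```

Note that the numbers arrive as attribute strings and element text; the conversions to `Decimal` and `int` are the application's responsibility, since the DTD only declares them as CDATA and #PCDATA.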

48

A useful abbreviation

When an element has empty content we can use

<tag blahblahbla/> for <tag blahblahbla></tag>

For example:

<family>
  <person id="jane">
    <name> Jane Doe </name>
    <mother idref="mary"/>
    <father idref="john"/>
  </person>
  ...
</family>
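The abbreviation is purely syntactic. A quick sketch with Python's standard parser shows that both spellings produce the same element: same tag, same attributes, no text and no children. The variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<mother idref="mary"/>')
long_form = ET.fromstring('<mother idref="mary"></mother>')

def summary(elem):
    """Reduce an element to the parts that matter for comparison."""
    return (elem.tag, dict(elem.attrib), elem.text or "", len(elem))

same = summary(short_form) == summary(long_form)
print(same)
```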

49

Schema.dtd

<!DOCTYPE db [
  <!ELEMENT db (movie+, actor+)>
  <!ELEMENT movie (title,director,casts,budget)>
  <!ATTLIST movie id ID #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT director (#PCDATA)>
  <!ELEMENT casts EMPTY>
  <!ATTLIST casts idrefs IDREFS #REQUIRED>
  <!ELEMENT budget (#PCDATA)>

50

Schema.dtd (cont’d)

  <!ELEMENT actor (name, acted_In, age?, directed*)>
  <!ATTLIST actor id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT acted_In EMPTY>
  <!ATTLIST acted_In idrefs IDREFS #REQUIRED>
  <!ELEMENT age (#PCDATA)>
  <!ELEMENT directed (#PCDATA)>
]>

51

Connecting the document with its DTD

In line:

<?xml version="1.0"?>
<!DOCTYPE db [<!ELEMENT ...> … ]>
<db> ... </db>

Another file:

<!DOCTYPE db SYSTEM "schema.dtd">

A URL:

<!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd">

52

Well-formed and Valid Documents

• Well-formed applies to any document (with or without a DTD): proper nesting of tags and unique attributes

• Valid specifies that the document conforms to the DTD: conforms to regular expression grammar, types of attributes correct, and constraints on references satisfied

53

DTDs vs. Schemas (or Types)

• By database (or programming language) standards DTDs are rather weak specifications.
– Only one base type -- PCDATA
– No useful “abstractions” e.g., sets
– IDREFs are untyped. You point to something, but you don’t know what!
– No constraints e.g., child is inverse of parent
– No methods
– Tag definitions are global

• Some of the XML extensions impose something like a schema or type on an XML document. We’ll see these later

54

Lots of possibilities for schemas

• XML Schema (under W3C’s spotlight)
• XDR (Microsoft’s BizTalk)
• SOX (Schema for Object-Oriented XML)
• Schematron
• DSD (AT&T Labs and BRICS)
• and more.

55

Some tools

• XML Authority: http://www.extensibility.com/tibco/solutions/xml_authority/index.htm

• XML Spy: http://www.xmlspy.com/download.html

56

Summary

• XML is a new data format. Its main virtues are widespread acceptance and the (important) ability to handle semistructured data (data without schema)

• DTDs provide some useful syntactic constraints on documents. As schemas they are weak

• How to store large XML documents?

• How to query them?

• How to map between XML and other representations?