cis550 handout 6 fall 2001 1 cis 550 fall 2001 handout 6 xml

108
CIS550 Handout 6 Fall 2001 1 CIS 550 Fall 2001 Handout 6 XML

Upload: calvin-woods

Post on 28-Dec-2015

218 views

Category:

Documents


1 download

TRANSCRIPT

CIS550 Handout 6 Fall 2001 1

CIS 550Fall 2001

Handout 6 XML

CIS550 Handout 6 Fall 2001 2

Outline (ambitious)

• Background: documents (SGML/HTML) and databases (structured and semistructured data)

• XML Basics and Document Type Descriptors

• XML query languages: XML-QL and XSL.

• XML additions: Xlink, Xpointer, RDF, SOX, XML-Data

• Document Object Model (XML API's)

CIS550 Handout 6 Fall 2001 3

Some Useful Articles

XML, Java, and the future of the webhttp://webreview.com/wr/pub/97/12/19/xml/index.html

XML and the Second-Generation Webhttp://www.sciam.com/1999/0599issue/0599bosak.html

Articles/standards for XML, XSL, XML-QL http://www.w3c.org/

http://www.w3.org/TR/REC-xml

CIS550 Handout 6 Fall 2001 4

Part I: Background

What’s the difference between the world of documents and information retrieval and databases and query

interfaces?

CIS550 Handout 6 Fall 2001 5

Documents vs DatabasesDocument world

> plenty of small documents

> usually static

> implicit structuresection, paragraph, toc,

> tagging

> human friendly

> contentform/layout, annotation

> Paradigms“Save as”, wysiwyg

> meta-dataauthor name, date, subject

Database world> a few large databases> usually dynamic

> explicit structure (schema)

> records

> machine friendly

> contentschema, data, methods

> ParadigmsAtomicity, Concurrency, Isolation, Durability

> meta-dataschema description

CIS550 Handout 6 Fall 2001 6

What to do with themDocuments

• editing

• printing

• spell-checking• counting words

• retrieving (IR)

• searching

Database

• updating

• cleaning

• querying

• composing/transforming

CIS550 Handout 6 Fall 2001 7

HTML• Lingua franca for publishing hypertext on the

World Wide Web• Designed to describe how a Web browser should

arrange text, images and push-buttons on a page.• Easy to learn, but does not convey structure.• Fixed tag set.

<HTML><HEAD><TITLE>Welcome to the XML course</TITLE></HEAD><BODY>

<H1>Introduction</H1><IMG SRC=”dragon.jpeg" WIDTH="200" HEIGHT="150” >

</BODY></HTML>

Opening tag Text (PCDATA)

Closing tag “Bachelor” tagAttribute name Attribute

value

CIS550 Handout 6 Fall 2001 8

Thin red line• The line between the document world and the

database world is not clear.• In some cases, both approaches are

legitimate.• An interesting middle ground is data formats --

of which XML is an example

• Examples– Personal address book– Swissprot– ASN.1

CIS550 Handout 6 Fall 2001 9

Personal address book over 20 years1977

N Achison, MalcolmF Dr. M.P. AchisonA Dept. of Computer ScienceA University of EdinburghA Kings BuildingsA Edinburgh E12 8QQA ScotlandT 031-123-8855 ext. 4359 (work)T 031-345-7570 (home)

N Albani, PaoloF Prof. Paolo AlbaniA Dip. Informatica e SistemisticaA Universita di Roma La Sapienza ...

1980

N Achison, MalcolmF Dr. M.P. AchisonA Dept. of Computer Science ....T 031-667-7570 (home)C [email protected]

1990

N Achison, MalcolmF Prof. M.P. AchisonA Dept. of Computing ScienceA University of GlasgowA Lilybank GardensA Glasgow G12 8QQA ScotlandT 041-339-8855 ext. 4359T 041-357-3787 (private)T 031-667-7570 (home)X 041-339-0090C [email protected]

N Achison, MalcolmF Prof. M.P. AchisonA 34 Inverness PlaceA Edinburgh, EH3 8UV

1997

N Achison, MalcolmF Prof. M.P. AchisonA Department of Computing Science...T 031-667-7570 (home)X 041-339-0090C [email protected] http://www.dcs.gla.ac.uk/mpa

2000?

CIS550 Handout 6 Fall 2001 10

SwissprotID 11SB_CUCMA STANDARD; PRT; 480 AA.AC P13744;DT 01-JAN-1990 (REL. 13, CREATED)DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;OC VIOLALES; CUCURBITACEAE.RN [1]RP SEQUENCE FROM N.A.RC STRAIN=CV. KUROKAWA AMAKURI NANKIN;RX MEDLINE; 88166744.RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARANISHIMURA I.;RL EUR. J. BIOCHEM. 172:627-632(1988).RN [2]RP SEQUENCE OF 22-30 AND 297-302.RA OHMIYA M., HARA I., MASTUBARA H.;RL PLANT CELL PHYSIOL. 21:157-167(1980).

CIS550 Handout 6 Fall 2001 11

Swissprot (cont’d)CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND ACC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY ACC DISULFIDE BOND.CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).DR EMBL; M36407; G167492; -.DR PIR; S00366; FWPU1B.DR PROSITE; PS00305; 11S_SEED_STORAGE; 1.KW SEED STORAGE PROTEIN; SIGNAL.FT SIGNAL 1 21FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).FT CHAIN 297 480 DELTA CHAIN (BASIC).FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).FT CONFLICT 27 27 S -> E (IN REF. 2).FT CONFLICT 30 30 E -> S (IN REF. 2).SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32; MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL

ENLRAQDPVR RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN

RGQSDLVLIV ...

CIS550 Handout 6 Fall 2001 12

The Structure of XML• XML consists of tags and text

• Tags come in pairs <date> ...</date>

• They must be properly nested <date> <day> ... </day> ... </date> --- good <date> <day> ... </date>... </day> --- bad

(You can’t do <i> ... <b> ... </i> ...</b> in HTML)

CIS550 Handout 6 Fall 2001 13

XML textXML has only one “basic” type -- text.

It is bounded by tags e.g. <title> The Big Sleep </title> <year> 1935 </ year> --- 1935 is still text

XML text is called PCDATA (for parsedcharacter data). It uses a 16-bit encoding,e.g. \&\#x0152 for the Hebrew letter Mem

Later we shall see how new types are specified by XML-data

CIS550 Handout 6 Fall 2001 14

XML structureNesting tags can be used to express various

structures. E.g. A tuple (record) :

<person> <name> Malcolm Atchison </name> <tel> (215) 898 4321 </tel> <email> [email protected] </email></person>

CIS550 Handout 6 Fall 2001 15

XML structure (cont.)• We can represent a list by using the same tag repeatedly:

<addresses> <person> ... </person> <person> ... </person> <person> ... </person> ...</addresses>

CIS550 Handout 6 Fall 2001 16

Terminology

The segment of an XML document between an opening and a corresponding closing tag is called an element.

<person> <name> Malcolm Atchison </name>

<tel> (215) 898 4321 </tel> <tel> (215) 898 4321 </tel>

<email> [email protected] </email> </person>

element

not an elementelement, a sub-elementof

CIS550 Handout 6 Fall 2001 17

XML is tree-like

person

name emailtel tel

Malcolm Atchison

(215) 898 4321

(215) 898 4321

[email protected]

Semistructured data models typically put the labels on the edges

CIS550 Handout 6 Fall 2001 18

Mixed Content

An element may contain a mixture of sub-elements and PCDATA

<airline> <name> British Airways </name> <motto> World’s <dubious> favorite</dubious> airline </motto>

</airline>

Data of this form is not typically generated from databases. It is needed for consistency with HTML

CIS550 Handout 6 Fall 2001 19

A Complete XML Document<?xml version="1.0"?><person> <name> Malcolm Atchison </name> <tel> (215) 898 4321 </tel> <email> [email protected] </email></person>

CIS550 Handout 6 Fall 2001 20

Two ways of representing a DB projects:

title budget managedBy

employees:

name ssn age

CIS550 Handout 6 Fall 2001 21

Project and Employee relations in XML

<db> <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy> </project> <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 < /age> </employee>

<employee> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> </employee> <project> <title> Auto guided vehicle </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy> </project> :</db>

Projects and employees are intermixed

CIS550 Handout 6 Fall 2001 22

<db><projects>

<project> <title> Pattern recognition

</title> <budget> 10000 </budget> <managedBy> Joe

</managedBy></project>

<project> <title> Auto guided vehicles

</title> <budget> 70000 </budget> <managedBy> Sandra

</managedBy> </project> : </projects>

Project and Employee relations in XML (cont’d)

<employees><employee>

<name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> </employee> <employee> <name> Sandra

</name> <ssn> 2234 </ssn>

<age>35 </age> </employee> : <employees></db>

Employees follow projects

CIS550 Handout 6 Fall 2001 23

<db> <projects> <title> Pattern recognition

</title> <budget> 10000 </budget> <managedBy> Joe

</managedBy> <title> Auto guided vehicles

</title> <budget> 70000 </budget> <managedBy> Sandra

</managedBy> : </projects>

Project and Employee relations in XML (cont’d)

<employees> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> : </employees></db>

Or without “separator” tags …

CIS550 Handout 6 Fall 2001 24

AttributesAn (opening) tag may contain attributes. These are typically used to describe the content of an element

<entry> <word language = “en”> cheese </word> <word language = “fr”> fromage </word> <word language = “ro”> branza </word> <meaning> A food made … </meaning>

</entry>

CIS550 Handout 6 Fall 2001 25

Attributes (cont’d)Another common use for attributes is to express dimension or type

<picture> <height dim= “cm”> 2400 </height> <width dim= “in”> 96 </width> <data encoding = “gif” compression = “zip”> M05-.+C$@02!G96YE<FEC ... </data></picture>

A document that obeys the “nested tags” rule and does not repeat an attribute within a tag is said to be well-formed .

CIS550 Handout 6 Fall 2001 26

When to use attributesIt’s not always clear when to use attributes

<person ssno= “123 45 6789”> <name> F. MacNiel </name> <email> [email protected] </email> ...</person>

<person> <ssno> 123 45 6789 </ssno> <name> F. MacNiel </name> <email> [email protected] </email> ...</person>

CIS550 Handout 6 Fall 2001 27

Using IDs<family> <person id="jane" mother="mary" father="john"> <name> Jane Doe </name> </person> <person id="john" children="jane jack"> <name> John Doe </name> <mother/> </person> <person id="mary" children="jane jack"> <name> Mary Doe </name> </person> <person id="jack" mother=”mary" father="john"> <name> Jack Doe </name> </person></family>

CIS550 Handout 6 Fall 2001 28

ODL schema

class Movie

( extent Movies, key title )

{

attribute string title;

attribute string director;

relationship set<Actor> casts

inverse Actor::acted_In;

attribute int budget;

} ;

class Actor

( extent Actors, key name )

{

attribute string name;

relationship set<Movie> acted_In

inverse Movie::casts;

attribute int age;

attribute set<string> directed;

} ;

CIS550 Handout 6 Fall 2001 29

An example<db> <movie id=“m1”> <title>Waking Ned

Divine</title> <director>Kirk Jones

III</director> <cast idrefs=“a1

a3”></cast> <budget>100,000</budget>

</movie> <movie id=“m2”> <title>Dragonheart</title> <director>Rob

Cohen</director> <cast idrefs=“a2 a9

a21”></cast> <budget>110,000</budget>

</movie> <movie id=“m3”> <title>Moondance</title> <director>Dagmar

Hirtz</director> <cast idrefs=“a1

a8”></cast> <budget>90,000</budget> </movie> :

<actor id=“a1”> <name>David Kelly</name> <acted_In idrefs=“m1 m3 m78” > </acted_In> </actor> <actor id=“a2”> <name>Sean Connery</name> <acted_In idrefs=“m2 m9 m11”> </acted_In> <age>68</age> </actor> <actor id=“a3”> <name>Ian Bannen</name> <acted_In idrefs=“m1 m35”> </acted_In> </actor> :</db>

CIS550 Handout 6 Fall 2001 30

Part II The Document Object Model (DOM)

Programming with XML

CIS550 Handout 6 Fall 2001 31

XML Parsers• traditional: build data structure (DOM)• event based: SAX (Simple API for XML)

– http://www.megginson.com/SAX– write handler for start tag and for end tag

CIS550 Handout 6 Fall 2001 32

Programming in XML -- why we need database technology

• Let’s examine some elements of the Document Object Model. It provides an interface to parsed data.

CIS550 Handout 6 Fall 2001 33

DOM -- overview

• Interface to parsed XML• “… language-neutral...” interface (IDL)• Level 1. Functionality for XML document

navigation and manipulation. – Level 2. Stylesheets and namespaces Level 3.

(future) Document loading and saving DTDs and schemas

http://www.w3.org/DOM/

CIS550 Handout 6 Fall 2001 34

DOM representation -- a tree of nodes

<partdb>

<part id = “a123” units= “mks”>

<name> widget </name>

<weight> 0.454 </weight>

</part>

</partdb>

Document

ElementgetTagName()= “name”

Element

getTagName()= “weight”

Element getTagName()= “part”

Attr

getName()= “part”getValue=“at23”

Attr

getName()= “units”getValue=“mks”

CharacterData

getData()= “widget”

CharacterData

getData()= “0.454”

CIS550 Handout 6 Fall 2001 35

Node interface

public interface Node {

...

public String getNodeName();

...

public NodeList getChildNodes();

...

public NamedNodeMap getAttributes();

...

}

CIS550 Handout 6 Fall 2001 36

public interface NodeList {

public Node item(int index);

public int getLength();

}

public interface NamedNodeMap {

public Node getNamedItem(String name);

...

}

Sub-elements --an “array”

Attributes-- a “dictionary”

CIS550 Handout 6 Fall 2001 37

A common form of data extraction<doc1> <employee>

<name> John Doe </name><contact-info>

<address> … </address><tel> 123 7456 </tel><email> [email protected]</email>

</contact-info> <dept> Math </dept>

</employee><employee>

…</employee>

...</doc1>

John Doe 123 7456 Jane Dee 234 5678 …...

Find the names and telephones of all employees in Math

CIS550 Handout 6 Fall 2001 38

Top-level traversal in DOMpublic class Test

{

public static void main(String args[]) throws Exception

{

Parser parser = new Parser( args[0] );

Document doc = parser.readStream( new FileInputStream( args[0] ));

NodeList nodes = doc.getDocumentElement.getChildNodes();

for (int i=0; i<nodes.getLength(); i++)

{

Node n = (Element) nodes.item(i); //coercion

// select Math depts }

}

}

Not specifiedby DOM

CIS550 Handout 6 Fall 2001 39

Selecting the math departments

NodeList ndl = n.getChildNodes();

for(int idl=0; idl<ndl.getLength(); idl++)

{

if ((n.tagName = "dept") &&

(((CharacterData) (n.getFirstChild)).getData="math")) // coercion

{

//inner code

return;

}

}

CIS550 Handout 6 Fall 2001 40

The inner codeNodeList nnl= n.getChildNodes();

for(int ii=0; ii < nnl.getLength(); ii++)

{

Node nn = (Element) nodes.item(i); //coercion

if (nn.tagName = "name" )

Nodelist ncl = n.getElementsByTagName("tel");

for(int iii = 0; iii < cl.getLength(); iii++)

{

nc = ncl.item(iii);

System.out.print((CharacterData) nn.firstChild).data //coercion

System.out.println((CharacterData) nc.firstChild).data //coercion

}

}

Preorder search of subtree.Convenient, but is it what we want?

CIS550 Handout 6 Fall 2001 41

Comments on our code• Already quite cumbersome. Compare with an “equivalent”

semistructured database query:

• Code may fail wherever there is a coercion, or give “empty” results.

• Code is already inefficient (double iterations over the same set of nodes)

• Need for types !!!

select {name: $N, tel: $T}where {name: $N, dept: “Math”, contact-info.tel: $T} <- DB

select {name: $N, tel: $T}where {name: $N, dept: “Math”, contact-info.tel: $T} <- DB

CIS550 Handout 6 Fall 2001 42

Constructing data<doc1> <employee>

<name> John Doe </name><contact-info>

<address> … </address><tel> 123 7456 </tel><email> [email protected]</email>

</contact-info> <dept> Math </dept>

</employee><employee>

…</employee>

...</doc1>

<doc2> <employee>

<name> John Doe </name><tel> 123 7456 </tel>

</employee><employee>

<name> Jane Dee </name> <tel> 234 5678 </tel>

</employee> ...</doc2>

CIS550 Handout 6 Fall 2001 43

Constructing Data using the DOMDocument d = new DocumentImplementation …Element root = d.createElement(“doc2”)// set root of document//top level loop

{Element emp = d.createElement(“employee”)

root.appendChild(emp)//innermost loop

{ ...Element name = d.createElement(“name”)

// set s to appropriate character stringname.appendChild(createCDATASection(s))emp.appendChild(name)

...}

…}

All node constructors are methods of the document implementation

Could also use an existing node or a “cloned” node

CIS550 Handout 6 Fall 2001 44

A join<doc1> <employee>

<name> John Doe </name><contact-info>...</contact-info>

<dept> Math </dept> </employee>

...</doc1>

<doc3> <row>

<name> John Doe </name><building> A123 </building> </row>

<row><name> Jane Dee </name> <building> B456 </building> </row>

...</doc3>

The names of employees and their department buildings

<doc2> <department>

<dname> Math </dname> <building> A123 </building>

</department> ...</doc2>

CIS550 Handout 6 Fall 2001 45

Implementing a Join in the DOM??

nl1 = r1.getChildNodes() //r1 is root of doc1for (int i1 = 1; i1 < n1.getLength; i1++)

{nl2 = r2.getChildNodes() //r1 is root of doc2for (int i2 = 1; i2 < n2.getLength; i2++)

…}

nl1 = r1.getChildNodes() //r1 is root of doc1for (int i1 = 1; i1 < n1.getLength; i1++)

{nl2 = r2.getChildNodes() //r1 is root of doc2for (int i2 = 1; i2 < n2.getLength; i2++)

…}

•Even if we can get both documents in core, this is not the most efficient method

•If not?•This is a typical database query!

CIS550 Handout 6 Fall 2001 46

Conclusions so far

• We need types for our XML documents, and we’d like to be able to use them as a static check of our program correctness.

• We need database technology for large “documents” and -- perhaps -- for simplicity of code.

CIS550 Handout 6 Fall 2001 47

Before we leave APIs… SAX, a low-level alternative to DOM

• SAX = Simple API for XML• supported by most XML parsers• event-driven • Instead of reading the entire file in memory and building a

tree, SAX reads a stream of tokens and triggers events, e.g.,– startDocument– startElement– endElement– endDocument

• The programmer has to write a document handler that captures these events and do something with the tokens

• Simpler than DOM: programmer must do more storage management

• Less likely than DOM to fail on large documents!!

CIS550 Handout 6 Fall 2001 48

Why Databases Can Help

• Efficiency. Querying and Storage

• Simplicity (a consequence of efficiency?) Query languages have a simple “algebraic” structure. Query transformation = optimization.

• (Maybe) suggestions for type systems/constraints.

select name, bldgfrom emps, deptswhere emps.dname = depts.dname

select name, bldgfrom emps, deptswhere emps.dname = depts.dname

Efficiently executed in SQL through use of indexing and clever join algorithms)

CIS550 Handout 6 Fall 2001 49

Other things that databases might do for XML

• Concurrency (allow a granularity finer than that of a file)– Proposals for updates?

• Integrity. Keys and referential integrity (XML proposals still weak)

• Data mining• Compression

CIS550 Handout 6 Fall 2001 50

The Down-Side

• Query languages are typically rather weak. You can’t always express a computation in a QL.

• We may have to tamper with the semantics of XML (sets vs. lists)

• We may have to simplify XML or its attendant “type systems” (DTDs, XML-schema) -- No bad thing :)

CIS550 Handout 6 Fall 2001 51

Part III: Document Type Descriptors

Imposing structure on XML documents

CIS550 Handout 6 Fall 2001 52

Document Type Descriptors

• Document Type Descriptors (DTDs) impose structure on an XML document.

• There is some relationship between a DTD and a schema, but it is not close -- hence the need for additional “typing” systems.

• The DTD is a syntactic specification.

CIS550 Handout 6 Fall 2001 53

Example: The Address Book<person>

<name> MacNiel, John </name>

<greet> Dr. John MacNiel </greet>

<addr>1234 Huron Street </addr>

<addr> Rome, OH 98765 </addr>

<tel> (321) 786 2543 </tel>

<fax> (321) 786 2543 </fax>

<tel> (321) 786 2543 </tel>

<email> [email protected] </email>

</person>

Exactly one name

At most one greeting

As many address lines as needed (in order)

Mixed telephones and faxes

As manyas needed

CIS550 Handout 6 Fall 2001 54

Specifying the structure

• name to specify a nameelement

• greet? to specify an optional (0 or 1) greet elements

• name,greet? to specify a name followed by an optional greet

CIS550 Handout 6 Fall 2001 55

Specifying the structure (cont)

• addr* to specify 0 or more address lines

• tel | fax a tel or a fax element

• (tel | fax)* 0 or more repeats of tel or fax

• email* 0 or more email elements

CIS550 Handout 6 Fall 2001 56

Specifying the structure (cont)

So the whole structure of a person entry is specified by

name, greet?, addr*, (tel | fax)*, email*

This is known as a regular expression. Why is it important?

CIS550 Handout 6 Fall 2001 57

Regular Expressions

Each regular expression determines a corresponding finite state automaton. Let’s start with a simpler example:

name, addr*, email

This suggests a simple parsing program

name

addr

email

CIS550 Handout 6 Fall 2001 58

Another example

name,address*,(tel | fax)*,email*

name

address

tel

tel

fax

fax

email

email

Adding in the optional greet furthercomplicates things

email

CIS550 Handout 6 Fall 2001 59

A DTD for the address book

<!DOCTYPE addressbook [ <!ELEMENT addressbook (project*)> <!ELEMENT project (name, greet?, address*, (fax | tel)*,

email*)> <!ELEMENT name (#PCDATA)> <!ELEMENT greet (#PCDATA)> <!ELEMENT address(#PCDATA)> <!ELEMENT tel (#PCDATA)> <!ELEMENT fax (#PCDATA)> <!ELEMENT email (#PCDATA)>]>

CIS550 Handout 6 Fall 2001 60

Our relational DB revisited projects:

title budget managedBy

employees:

name ssn age

CIS550 Handout 6 Fall 2001 61

Two DTDs for the relational DB

<!DOCTYPE db [

<!ELEMENT db (projects,employees)><!ELEMENT projects (project*)><!ELEMENT employees (employee*)>

<!ELEMENT project (title, budget, managedBy)><!ELEMENT employee (name, ssn, age)>...

]>

<!DOCTYPE db [<!ELEMENT db (project | employee)*><!ELEMENT project (title, budget,

managedBy)><!ELEMENT employee (name, ssn, age)>...

]>

CIS550 Handout 6 Fall 2001 62

Recursive DTDs

<DOCTYPE genealogy [<!ELEMENT genealogy (person*)><!ELEMENT person (

name,dateOfBirth,person, --

motherperson )> --

father ...

]>

What is the problem with this?

CIS550 Handout 6 Fall 2001 63

Recursive DTDs cont’d.

<DOCTYPE genealogy [<!ELEMENT genealogy (person*)><!ELEMENT person (

name,dateOfBirth,person?, --

motherperson? )> --

father ...

]>

What is now the problem with this?

CIS550 Handout 6 Fall 2001 64

Some things are hard to specify

Each employee element is to contain name, age and ssn elements in some order.

<!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) |

(ssn, name, age) | ... )>

Suppose there were many more fields !

CIS550 Handout 6 Fall 2001 65

Summary of XML regular expressions

• A The tag A occurs• e1,e2 The expression e1 followed by e2• e* 0 or more occurrences of e• e? Optional -- 0 or 1 occurrences• e+ 1 or more occurrences• e1 | e2 either e1 or e2• (e) grouping

CIS550 Handout 6 Fall 2001 66

Take Home Message 1Do not write something to confuse yourself

<!ELEMENT PARTNER (NAME?, ONETIME?, PARTNRID?, PARTNRTYPE?, SYNCIND?, ACTIVE?, CURRENCY?, DESCRIPTN?, DUNSNUMBER?, GLENTITYS?, NAME*, PARENTID?, PARTNRIDX?, PARTNRRATG?, PARTNRROLE?, PAYMETHOD?, TAXEXEMPT?, TAXID?, TERMID?, USERAREA?, ADDRESS*, CONTACT*)>

Cited from oagis_segments.dtd (one of the files in Novell Developer Kit http://developer.novell.com/ndk/indexexe.htm)

<PARTNER> <NAME> Ben Franklin </NAME> </PARTNER>

Q. Which NAME is it?

CIS550 Handout 6 Fall 2001 67

Specifying attributes in the DTD

<!ELEMENT height (#PCDATA)><!ATTLIST height dimension CDATA #REQUIRED accuracy CDATA #IMPLIED >

The dimension attribute is required; the accuracy attribute is optional.

CDATA is the “type” of the attribute -- it means string.

CIS550 Handout 6 Fall 2001 68

Specifying ID and IDREF attributes

<!DOCTYPE family [ <!ELEMENT family (person)*> <!ELEMENT person (name)> <!ELEMENT name (#PCDATA)> <!ATTLIST person

id ID #REQUIRED mother IDREF #IMPLIED father IDREF #IMPLIED children IDREFS #IMPLIED>]>

CIS550 Handout 6 Fall 2001 69

Some conforming data

<family> <person id="jane" mother="mary" father="john"> <name> Jane Doe </name> </person> <person id="john" children="jane jack"> <name> John Doe </name> </person> <person id="mary" children="jane jack"> <name> Mary Doe </name> </person> <person id="jack" mother=”mary" father="john"> <name> Jack Doe </name> </person></family>

CIS550 Handout 6 Fall 2001 70

Consistency of ID and IDREF attribute values

•If an attribute is declared as ID– the associated values must all be distinct (no

confusion)

•If an attribute is declared as IDREF– the associated value must exist as the value

of some ID attribute (no dangling “pointers”)

•Similarly for all the values of an IDREFS attribute

•ID and IDREF attributes are not typed

CIS550 Handout 6 Fall 2001 71

An alternative specification

<!DOCTYPE family [ <!ELEMENT family (person)*> <!ELEMENT person (mother?, father?, children, name)> <!ATTLIST person id ID #REQUIRED> <!ELEMENT name (#PCDATA)> <!ELEMENT mother EMPTY> <!ATTLIST mother idref IDREF #REQUIRED> <!ELEMENT father EMPTY> <!ATTLIST father idref IDREF #REQUIRED> <!ELEMENT children EMPTY> <!ATTLIST children idrefs IDREFS #REQUIRED>]>

CIS550 Handout 6 Fall 2001 72

The revised data

<family> <person id = "jane”>

<name> Jane Doe </name> <mother idref = "mary”></mother> <father idref = "john"></father>

</person> <person id = "john”>

<name> John Doe </name><children idrefs = "jane jack"> </children>

</person> ...

</family>

CIS550 Handout 6 Fall 2001 73

A useful abbreviation

When an element has empty content we can use

<tag blahblahbla/> for <tag blahblahbla></tag>

For example:<family>

<person id = "jane”>

<name> Jane Doe </name>

<mother idref = "mary”/>

<father idref = "john”/></person>

...

</family>

CIS550 Handout 6 Fall 2001 74

ODL schema

class Movie

( extent Movies, key title )

{

attribute string title;

attribute string director;

relationship set<Actor> cast

inverse Actor::acted_In;

attribute int budget;

} ;

class Actor

( extent Actors, key name )

{

attribute string name;

relationship set<Movie> acted_In

inverse Movie::cast;

attribute int age;

attribute set<string> directed;

} ;

CIS550 Handout 6 Fall 2001 75

Schema.dtd

<!DOCTYPE db [ <!ELEMENT db (movie+, actor+)> <!ELEMENT movie

(title,director,cast,budget)> <!ATTLIST movie id ID

#REQUIRED> <!ELEMENT title (#PCDATA)> <!ELEMENT director (#PCDATA)> <!ELEMENT casts EMPTY> <!ATTLIST casts idrefs IDREFS

#REQUIRED> <!ELEMENT budget (#PCDATA)>

CIS550 Handout 6 Fall 2001 76

Schema.dtd (cont’d)

<!ELEMENT actor (name, acted_In,age?, directed*)>

<!ATTLIST actor id ID #REQUIRED> <!ELEMENT name (#PCDATA)> <!ELEMENT acted_In EMPTY> <!ATTLIST acted_In idrefs IDREFS

#REQUIRED> <!ELEMENT age (#PCDATA)> <!ELEMENT directed (#PCDATA)>]>

CIS550 Handout 6 Fall 2001 77

More on ODL and DTD

• The Object Data Management Group (ODMG) suggested OIFML, a XML document type of Object Interchange Format.

• http://www.odmg.org/library/readingroom/oifml.pdf

CIS550 Handout 6 Fall 2001 78

Constraints on IDs and IDREFs

• ID stands for identifier. No two ID attributes with the same name may have the same value (of type CDATA)

• IDREF stands for identifier reference. Every value associated with an IDREF attribute must exist as an ID attribute value

• IDREFS specifies several (0 or more) identifiers

CIS550 Handout 6 Fall 2001 79

Connecting the document with its DTD

In line:<?xml version="1.0"?>

<!DOCTYPE db [<!ELEMENT ...> … ]><db> ... </db>

Another file:

<!DOCTYPE db SYSTEM "schema.dtd">

A URL: <!DOCTYPE db SYSTEM

"http://www.schemaauthority.com/schema.dtd">

CIS550 Handout 6 Fall 2001 80

Well-formed and Valid Documents

• Well-formed applies to any document (with or without a DTD): proper nesting of tags and unique attributes

• Valid specifies that the document conforms to the DTD: conforms to regular expression grammar, types of attributes correct, and constraints on references satisfied

CIS550 Handout 6 Fall 2001 81

DTDs v.s Schemas (or Types)• By database (or programming language)

standards DTDs are rather weak specifications. – Only one base type -- PCDATA– No useful “abstractions” e.g., sets– IDREFs are untyped. You point to something, but

you don’t know what!– No constraints e.g., child is inverse of parent– No methods– Tag definitions are global

• Some of the XML extensions impose something like a schema or type on an XML document. We’ll see these later

CIS550 Handout 6 Fall 2001 82

Lots of possibilities for schemas• XML Schema (under W3C’s spotlight)• XDR (Microsoft’s BizTalk)• SOX (Schema for Object-Oriented XML)• Schematron• DSD (AT&T Labs and BRICS)• and more.

CIS550 Handout 6 Fall 2001 83

Some tools• XML Authority

http://www.extensibility.com/tibco/solutions/xml_authority/index.htm

• XML Spy http://www.xmlspy.com/download.html

CIS550 Handout 6 Fall 2001 84

Summary• XML is a new data format. Its main virtues are

widespread acceptance and the (important) ability to handle semistructured data (data without schema)

• DTDs provide some useful syntactic constraints on documents. As schemas they are weak

• How to store large XML documents?• How to query them?• How to map between XML and other

representations?

CIS550 Handout 6 Fall 2001 85

Part IV: Query LanguagesWhy a query language? Extracting, Restructuring, Integration, Browsing…

XML-QL http://www.w3.org/TR/NOTE-xml-ql

http://db.cis.upenn.edu/XML-QL/

XPATH (part of a query language) http:www.w3.org/TR/xpath XSLT

http://www.w3.org/TR/xslt

http://www.mulberrytech.com/quickref/XSLTquickref.pdf

CIS550 Handout 6 Fall 2001 86

XQuery -- likely to gain acceptance...

… and, like other things in W3C, not necessarily the best. XQuery is a group of projects. General blather and informal specifications:

http://www.w3.org/XML/QueryIngredients:•A concrete syntax: http://www.w3.org/TR/xquery

•Based on XPath: http://www.w3.org/TR/xpath.html

•A formal semantics and “algebra”: http://www.w3.org/TR/query-semantics/

•Some test cases: http://www.w3.org/TR/xmlquery-use-cases

CIS550 Handout 6 Fall 2001 87

Query Languages and DTDs• The DOM does not interact with DTDs (later

“levels” may do this)• For query languages there is almost no

interaction (only to find out the names of IDs and IDREFs)

• XDUCE (developed at Penn) is the only well-developed language that uses DTDs as a type system. www.cis.upenn.edu/~hahosoya/xduce/

CIS550 Handout 6 Fall 2001 88

XML-QL (XML Query Language)

• W3C proposal, August 1998

• authors: – Mary Fernandez AT&T– Dana Florescu INRIA– Alon Levy Univ. of Washington– Dan Suciu AT&T– Alin Deutsch Univ. of Pennsylvania

CIS550 Handout 6 Fall 2001 89

Address Book Revisited<addrBook>

<person SSN=“111-22-3333”>

<name> Caesar </name>

<greet> Caesar Imperator</greet>

<addr> The Capitol </addr>

<addr> Rome, OH 98765 </addr>

<tel> (321) 786 2543 </tel>

<fax> (321) 786 2543 <fax>

<tel> (321) 786 2543 </tel>

<email> [email protected]

</email>

</person>

</addrBook>

CIS550 Handout 6 Fall 2001 90

XML-QL: Pattern Matching

Find Caesar’s e-mail address:

where <addrBook> <person> <name>Caesar</name> <email>$e</email> </person> </addrBook> in “http://db.cis.upenn.edu/~peter/address.xml”construct $e

<XML>[email protected]</XML>

Data Extraction

CIS550 Handout 6 Fall 2001 91

XML-QL: Constructing New XML Data

Whom can we contact electronically?

where <addrBook> <person> <greet>$g</greet> <email>$e</email> </person> </addrBook> in “http://...”construct <e-contact> <who>$g</who> <where>$e</where> </e-contact>

<XML> <e-contact> <who>Caesar Imperator</who> <where>[email protected] </where> </e-contact> <e-contact> <who>Brutus</who> <where>[email protected] </where> </e-contact> ...</XML>

Data Restructuring

CIS550 Handout 6 Fall 2001 92

XML-QL: Joins

Who of our contacts was involved in a movie?

where <addrBook> <person> <greet>$g</greet> <email>$e</email> </person> </addrBook> in “http://…address.xml” <movie><title>$t</> <character>$g</> </movie> in “http://www.imdb.com”construct <cine-contact> <who>$g</who> <movie>$t</movie> <where>$e</where> </cine-contact>

CIS550 Handout 6 Fall 2001 93

XML-QL: Joins (cont’d)

<XML> <cine-contact> <who>Caesar Imperator</who> <where>[email protected]</where> <movie>Asterix and Cleopatra</movie> </cine-contact>

<cine-contact> <who>Dr. Strangelove</who> <where>[email protected]</where> <movie>Dr. Strangelove or How I Stopped ...</movie> </cine-contact>...</XML>

Data Integration

CIS550 Handout 6 Fall 2001 94

XML-QL Data Model

• Directed, labeled graph

• Tags represented as edge labels

• Sets of attribute name-value pairs as node labels

• Two models: ordered and unordered

CIS550 Handout 6 Fall 2001 95

XML-QL Data Model (cont’d)

<person SSN=“111-22-3333”>

<name> Caesar </name>

<greet> Caesar Imperator </greet>

<addr> The Capitol </addr>

<addr> Rome, OH 98765 </addr>

<tel> (321) 786 2543 </tel>

<fax> (321) 786 2543 <fax>

<tel> (321) 786 2543 </tel>

<email> [email protected]

</email>

</person>

person

name tel fax tel emailgreet

addr addr

SSN=“111-…”

Caesar

addrBook

Caesar Imperator

The Capitol Rome, OH

(321) 786 2543

CIS550 Handout 6 Fall 2001 96

XML-QL Semantics: Variable Bindings

person

name tel fax tel emailgreet

addr addr

SSN=“111-…”

Caesar

addrBook

Caesar Imperator

The Capitol Rome, OH

(321) 786 2543

name tel fax tel emailgreet

addr addr

SSN=“111-…”

Stragelove

Dr. Strangelove

The Capitol Washington, DC

person

strangelov@[email protected]

where <addrBook> <person> <name>$n</> <email>$e</> </> </>

$n $eCaesar [email protected] [email protected]

CIS550 Handout 6 Fall 2001 97

XML-QL Semantics: XML Output$n $eCaesar [email protected] [email protected]

construct <e-contact> <who>$n</who> <where>$e</where> </e-contact>

XML

e-contact e-contact

who where who where

Caesar [email protected] Strangelove [email protected]

CIS550 Handout 6 Fall 2001 98

Advanced XML-QL

Find tags of person subelements:

where <addrBook.person.$tag></> in “http://db.cis.upenn.edu/~peter/address.xml”construct <childOfPerson>$tag</>

Find all email addresses and fax numbers :

where <addrBook._*. (email | fax)>$eORf</> in “http://db.cis.upenn.edu/~peter/address.xml”construct <emailOrFax>$eORf</>

Schema browsing

CIS550 Handout 6 Fall 2001 99

More Advanced XML-QL

Find attributes of person elements:

where <_*.person $attrName=$attrVal></> in “http://db.cis.upenn.edu/~peter/address.xml”construct <personAttribute> <name>$attrName</> <value>$attrVal</> </>

Schema browsing

CIS550 Handout 6 Fall 2001 100

The XML Tree Model<person SSN=“111-22-3333”>

<name> Caesar </name>

<greet> Caesar Imperator</greet>

<addr>1234 Huron Street </addr>

<addr> Rome, OH 98765 </addr>

<tel> (321) 786 2543 </tel>

<fax> (321) 786 2543 <fax>

<tel> (321) 786 2543 </tel>

<email> [email protected] </email>

</person>

person

name tel fax tel emailgreet

addr addr

@SSN

CIS550 Handout 6 Fall 2001 101

XML-QL Semantics: XML Output$n $eCaesar [email protected] [email protected] [email protected]

construct <e-contact> <who>$n</who> <where>$e</where> </e-contact>

XML

e-contact e-contact

who where

Caesar [email protected]

who where

Strangelove [email protected]

who where

[email protected]

e-contact

e(Caesar,jc@forum) e(Strangelove,@love)

e(Strangelove,@hate)

w(Caesar,jc@forum)

CIS550 Handout 6 Fall 2001 102

XML-QL Skolem Functionsconstruct <e-contact ID=e($n) > <who ID=w($n) >$n</who> <where>$e</where> </e-contact>

XML

e-contact e-contact

who where

Caesar [email protected]

who where

Strangelove [email protected]

[email protected]

where

e(Caesar) e(Strangelove)

w(Caesar) w(Strangelove)

CIS550 Handout 6 Fall 2001 103

XSL (Extensible Stylesheet Language)

• W3C working draft by Adobe Systems

• Original purpose was to specify rendering of XML documents (mainly by Web browsers: HTML)

• Consists of two parts:

– an XML transformation language

– a formatting vocabulary denoting typographic abstractions (paragraph, page, rule, footer, etc.)

CIS550 Handout 6 Fall 2001 104

The XSL Processor

XML

XML

XML

original document

stylesheet

restructured document,elements are formatting objects

presentation (other formats possible)

HTML

formatter

transformer

plug in your favorite

CIS550 Handout 6 Fall 2001 105

Formatting Example

<doc> <title>An example</title> <p>This is a test.</p> <p>This is <emph>another</emph>test.</p></doc>

<fo:queue> <fo:block font-weight=“bold”>An example</fo:block> <fo:block>This is a test.</fo:block> <fo:block>This is <fo:inline-sequence font-style=“italic”>another </fo:inline-sequence> test. </fo:block></fo:queue>

transformed document:

original document:

CIS550 Handout 6 Fall 2001 106

Template Rules<xsl:template match=“title”> <fo:block font-weight=“bold”> <xsl:apply-templates/> </fo:block> </xsl:template>

<xsl:template match=“p”> <fo:block><xsl:apply-templates/> </fo:block></xsl:template>

<xsl:template match=“emph”> <fo:inline-sequence font-style=“italic”><xsl:apply-

templates/> </fo:inline-sequence></xsl:template>

CIS550 Handout 6 Fall 2001 107

Template Rules (cont’d)

<xsl:template match=“addrBook”> <xsl:apply-templates/> </xsl:template>

<xsl:template match=“person”> <contact> <name><xsl:value-of select=“name”/></name> <e-mail><xsl:value-of select=“email”></e-mail> </contact></xsl:template>

where <addrBook.person><name>$n</> <email>$e</></>construct <contact><name>$n</> <e-mail>$e</></>

CIS550 Handout 6 Fall 2001 108

XSL vs. XML-QL

XML output no schema required

data extractiondata restructuring

data integrationschema browsing

relational complete

XML-QL XSL