
COMP6037 Semi-structured Data and the Web: Things around XML

U. Sattler, University of Manchester


Organisational

• COMP6037 consists of 2 parts:
  1. Semi-structured Data and XML (myself, this and next week)
  2. Web applications (Bijan Parsia, remainder)

• Prerequisites: good familiarity with databases and programming
• Teaching period: Mondays of the next 5 weeks
  – with demonstrators present to ask during labs
• Coursework and Exercises: 10 days
• Assessment: 50% exam, 50% coursework
• http://www.cs.man.ac.uk/~sattler/teaching/comp6037.html

• Please do not hesitate to ask if you have a question!


Literature

To obtain more detailed information, please refer to

• W3C documents at http://www.w3.org/TR/...

• S. Abiteboul, P. Buneman, and D. Suciu: Data on the Web. Morgan Kaufmann Publishers, 2000.

• E. R. Harold and W. S. Means: XML in a Nutshell. O’Reilly, 2004.

• or choose some of the various available web resources.


Preliminary outline of the first part of the course

1. Introduction to semi-structured data
2. XML: core concepts
3. DTDs, a simple schema language for XML documents
4. XPath, a navigation language for XML documents
5. XML namespaces: a concept ignored so far
6. XSLT, a transformation language for XML documents
7. DOM and SAX, programmatic interfaces for manipulating XML documents
8. XML Schema, a more expressive schema language for XML documents
9. Schematron, yet another schema language
10. XQuery, a query language for XML documents
11. Storing XML documents in RDBMSs


Data, documents, and the Web

The Web
• extremely rich information source
  – www.worldwidewebsize.com, November 2006: ~12,000,000,000 pages
  – January 2008: ~45,000,000,000 pages
• mostly web pages (HTML), accessible via a URL
• HTML structures document/text:
  – intra-document structure: lay-out and format
  – inter-document structure: links to other web pages
• content of web pages is often accessible to humans only: text
• query mechanisms: keyword-based
• used/accessed by everybody with internet access



Relational databases

• proven technology, currently storing/managing vast amounts of data

• separation between 3 levels:

– conceptual: ER diagrams

– logical: tables, and

– physical: implementation of tables, indices

• data is accessed via queries, mostly SQL queries

• used/accessed mainly by “insiders”

• integrity constraints play an important role

– e.g., (foreign) key constraints, not null, etc.

– preserving data integrity is important

• main issue: efficient implementation of query answering over large DBs



Why we need a bridge between DBs and the Web:

• a lot of information is made available on the web by querying RDBMSs

– http://www.schoolswebdirectory.co.uk/postcode.php

• output is in HTML, possibly wrapped

• difficult to access output data with a program

• access is limited to the queries that are hard-coded by the system, e.g., query for schools by postcode

Data, documents, and the Web: bio data from SWISSPROT

Protein ACEK_BORBR on www.ebi.uniprot.org:

ID ACEK_BORBR Reviewed; 619 AA.
AC Q7WDP2;
DT 16-JAN-2004, integrated into UniProtKB/Swiss-Prot.
DT 01-OCT-2003, sequence version 1.
DT 31-OCT-2006, entry version 23.
DE Isocitrate dehydrogenase kinase/phosphatase (EC 2.7.11.5) (EC 3.1.3.-)
DE (IDH kinase/phosphatase) (IDHK/P).
GN Name=aceK; OrderedLocusNames=BB4946;
OS Bordetella bronchiseptica (Alcaligenes bronchisepticus).
OC Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales;
OC Alcaligenaceae; Bordetella.
OX NCBI_TaxID=518;
RN [1]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC STRAIN=RB50 / ATCC BAA-588 / NCTC 13252;
RX MEDLINE=22827954; PubMed=12910271; DOI=10.1038/ng1227;
RA Parkhill J., Sebaihia M., Preston A., Murphy L.D., Thomson N.R.,
RA Harris D.E., Holden M.T.G., Churcher C.M., Bentley S.D., Mungall K.L.,
RA Cerdeno-Tarraga A.-M., Temple L., James K.D., Harris B., Quail M.A.,
RA Achtman M., Atkin R., Baker S., Basham D., Bason N., Cherevach I.,
RA Chillingworth T., Collins M., Cronin A., Davis P., Doggett J.,
RA Feltwell T., Goble A., Hamlin N., Hauser H., Holroyd S., Jagels K.,
RA Leather S., Moule S., Norberczak H., O'Neil S., Ormond D., Price C.,
RA Rabbinowitsch E., Rutter S., Sanders M., Saunders D., Seeger K.,
RA Sharp S., Simmonds M., Skelton J., Squares R., Squares S., Stevens K.,
RA Unwin L., Whitehead S., Barrell B.G., Maskell D.J.;
RT "Comparative analysis of the genome sequences of Bordetella pertussis,
RT Bordetella parapertussis and Bordetella bronchiseptica.";
RL Nat. Genet. 35:32-40(2003).

Data, documents, and the Web: bio data from SWISSPROT

CC -!- FUNCTION: Bifunctional enzyme which can phosphorylate or
CC dephosphorylate isocitrate dehydrogenase (IDH) on a specific
CC serine residue. This is a regulatory mechanism which enables
CC bacteria to bypass the Krebs cycle via the glyoxylate shunt in
CC response to the source of carbon. When bacteria are grown on
CC glucose, IDH is fully active and unphosphorylated, but when grown
CC on acetate or ethanol, the activity of IDH declines drastically
CC concomitant with its phosphorylation (By similarity).
CC -!- CATALYTIC ACTIVITY: ATP + [isocitrate dehydrogenase (NADP(+))] =
CC ADP + [isocitrate dehydrogenase (NADP(+))] phosphate.
CC -!- SUBCELLULAR LOCATION: Cytoplasm (By similarity).
CC -!- SIMILARITY: Belongs to the aceK family.
CC -----------------------------------------------------------------------
CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC Distributed under the Creative Commons Attribution-NoDerivs License
CC -----------------------------------------------------------------------
DR EMBL; BX640452; CAE35310.1; -; Genomic_DNA.
DR GenomeReviews; BX470250_GR; BB4946.
DR KEGG; bbr:BB4946; -.
DR BioCyc; BBRO518:BB4946-MONOMER; -.
DR GO; GO:0005524; F:ATP binding; IEA:HAMAP.
DR GO; GO:0016788; F:hydrolase activity, acting on ester bonds; IEA:HAMAP.
DR GO; GO:0006097; P:glyoxylate cycle; IEA:HAMAP.
DR GO; GO:0006099; P:tricarboxylic acid cycle; IEA:HAMAP.
DR HAMAP; MF_00747; -; 1.
DR InterPro; IPR010452; AceK.
DR Pfam; PF06315; AceK; 1.
DR PIRSF; PIRSF000719; AceK; 1.
KW ATP-binding; Complete proteome; Glyoxylate bypass; Hydrolase; Kinase;
KW Multifunctional enzyme; Nucleotide-binding; Protein phosphatase;
KW Transferase; Tricarboxylic acid cycle.

Data, documents, and the Web: bio data from SWISSPROT

FT CHAIN 1 619 Isocitrate dehydrogenase
FT kinase/phosphatase.
FT /FTId=PRO_0000057894.
FT NP_BIND 354 360 ATP (By similarity).
FT ACT_SITE 409 409 By similarity.
FT BINDING 375 375 ATP (By similarity).
SQ SEQUENCE 619 AA; 70681 MW; F434CC157EFD9CB4 CRC64;
   MIYSGDVQRI EPAPVAGPAP LDVAHLILAG FDRHYALFRY SAQRAKSLFE SGDWHGMQRL
   SRERIEYYDM RVRECATQLD SALRGSDART ADGSRANGSA ALSEAQTAFW QAVKQEFVGL
   LADHRQPECA ETFFNSVSCR ILHRDYFHND FLFVRPAIAT DYLDSRIPSY RVYYPVAEGL
   HKSLIRMVAD FGLAVPYADL PRDARLLARA AVRQLRGQLP RHAGPRLASD CQIQVLGSLF
   FRNTGAYIVG RLINQGTVYP FAVALRRNPA GQVCLDALLL GADDLSTLFS FTRAYFLVDM
   ETPAAVVNFL ASLLPRKPKA ELYTMLGLQK QGKTLFYRDF LHHLTHSRDA FDIAPGIRGM
   VMCVFTLPSY PYVFKLIKDR IDKDGMDHAT VRRKYQMVKL HDRVGRMADT WEYSQVALPR
   SRFAPRLLEE LRRLVPSLIE ENGDTVVIRH VYIERRMMPL NLYLRHASDP LLEVAVREYG
   DAIRQLATAN IFPGDMLYKN FGVTRLGRVV FYDYDEIQRM TEMNFRAIPP APNEEAELSS
   EPWYAVGPND VFPEEFGRFL LGDPRVRQAF LRHHADLLAP QWWQACRARV AQGRIEEFFP
   YDTDRRLHPQ AAPPPRTAA
//

SWISSPROT provides a web query interface to their database...
biologists need to integrate, share, query, analyse, and search this data...
so what format is/should it be in?

XML as a bridge

XML

• is a format for the representation of semi-structured data (more on this later)

• is not designed to lay-out documents

• alone will not solve the problem of efficiently querying web data: we might have to use RDBMS technology as well

Conventional databases and the Web

Conventional DB systems:

• client-server architecture

• queries issued to a server

• server processes queries:

– process/compile/optimise

– execute

[Figure: several clients connected over a network to a single server]

Conventional databases and the Web

Data processing on the web (ideally):

• multi-tier
• all data sources translate data into common format: XML
• to share & integrate & combine data, common schema is used
• clients consume data
• servers provide data
• intermediate middleware to transform and integrate data

[Figure: multi-tier architecture with several clients, several middleware components, and several servers]

Why XML as a bridge?

XML is

• designed to describe contents rather than presentation

• meant to be consumed by programs, not by humans

• suitable for exchanging data between applications/platforms:

– the way characters are encoded in an XML document is defined within the document itself through an encoding declaration, e.g., Unicode

– additional constraints on the content of an XML document can be specified separately, e.g., in

• Document Type Definition, DTD or

• XML Schema (also an XML document) or

• RelaxNG or ...

– DTDs and schemas can be placed locally or across the network, and can be found using the universal notation of URLs

The Basics First: Semi-structured data

Semi-structured data

• predates XML

• is an attempt to reconcile (Web) document view and (DB) strict structures

• is data organised in semantic entities, where

• similar entities are grouped together

• entities in same group may not have same attributes

• order of attributes not necessarily important

• not all attributes may be required

• carries its own description

Example: {name: “Uli”, tel: 56176, email:”[email protected]”}

a simple set of attribute-value pairs

The Basics First: Semi-structured data

Example (ctd):

Values can in turn be structured:

{name: {first:”Uli”, last: “Sattler”},

tel: 56176,

email:”[email protected]”}

And we can have several values for the same attribute:


{name: {first: “Uli”, last: “Sattler”},

tel: 56176,

tel: 56182,

email: “[email protected]”}

The Basics First: Semi-structured data (SSD)

Graphical representation as a tree:

(root)
  name
    first: “Uli”
    last: “Sattler”
  tel: 56176
  tel: 56182
  email: “[email protected]”

for the expression

{name: {first: “Uli”, last: “Sattler”},
 tel: 56176,
 tel: 56182,
 email: “[email protected]”}

The Basics First: Semi-structured data (SSD)

• In general, a piece of SSD can be represented as a graph

– leaf nodes standing for single data items

– edges labelled with attribute names
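The tree view of SSD can be sketched in a few lines of Python. This is only an illustration under our own assumptions: a node is either an atomic value (a leaf) or a list of (label, child) pairs, and the helper name `leaves` is hypothetical. A list of pairs is used instead of a dictionary because the same attribute, such as tel, may occur several times.

```python
# SSD as a labelled tree: a node is either an atomic value (leaf)
# or a list of (edge label, child node) pairs.
person = [
    ("name", [("first", "Uli"), ("last", "Sattler")]),
    ("tel", 56176),
    ("tel", 56182),  # the same label may occur more than once
]


def leaves(node, path=()):
    """Yield (path of edge labels, leaf value) for every leaf below node."""
    if isinstance(node, list):
        for label, child in node:
            yield from leaves(child, path + (label,))
    else:
        yield path, node


for path, value in leaves(person):
    print("/".join(path), "=", value)
```

Each printed line is one root-to-leaf path, which is exactly the "self-describing" reading of the data: every value is reached through the names of its attributes.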


Semi-structured data: tuples

We can easily represent nested tuples as sets of attribute-value pairs:

{person: {name: “Uli”, tel: 56176, email: “[email protected]”},
 person: {name: “Bijan”, tel: 56183, email: “[email protected]”},
 person: {name: “Leo”, tel: 8488342, email: “[email protected]”}
}

Semi-structured data: tuples with variations

We can easily represent nested tuples as sets of attribute-value pairs even if they have missing or duplicated pairs

{person: {name: {first: “Uli”, last: “Sattler”}, tel: 56176, email: “[email protected]”},
 person: {name: “Bijan”, tel: 56183, tel: 783 4672, email: “[email protected]”},
 person: {name: “Leo”, tel: 8488342, email: “[email protected]”}}

• serialization: converting SSD into a byte stream, for easy transmission
• self-describing: annotate each data item (e.g., 56176) with its description (e.g., tel): space consuming, but enhances inter-operability
• we will see later how to efficiently store SSD


SSD: representing relational data

Consider two relations:

R:  a   b   c        S:  c   d
    a1  b1  c1           c2  d2
    a2  b2  c2           c3  d3
                         c4  d4

and their tree representation:

(root)
  R
    row: a: a1, b: b1, c: c1
    row: a: a2, b: b2, c: c2
  S
    row: c: c2, d: d2
    row: c: c3, d: d3
    row: c: c4, d: d4

[Figure: further tree layouts of the same two relations]
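The translation from relations to a tree is mechanical, as the following Python sketch suggests. It uses the standard library module `xml.etree.ElementTree` to build the tree; the root label `db`, the dictionary layout of `R` and `S`, and the helper `relation_to_tree` are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# The two relations from the slide, each as column names plus rows.
R = {"columns": ["a", "b", "c"],
     "rows": [["a1", "b1", "c1"], ["a2", "b2", "c2"]]}
S = {"columns": ["c", "d"],
     "rows": [["c2", "d2"], ["c3", "d3"], ["c4", "d4"]]}


def relation_to_tree(name, relation):
    """One node per relation, one 'row' child per tuple, one child per column."""
    rel = ET.Element(name)
    for row in relation["rows"]:
        row_el = ET.SubElement(rel, "row")
        for column, value in zip(relation["columns"], row):
            ET.SubElement(row_el, column).text = value
    return rel


db = ET.Element("db")  # hypothetical label for the root
db.append(relation_to_tree("R", R))
db.append(relation_to_tree("S", S))
print(ET.tostring(db, encoding="unicode"))
```

Note that the column names are repeated in every row: this is the space overhead of self-describing data mentioned earlier.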


SSD: representing object databases

• some DBMSs are object-oriented
• such data can be represented as SSD

Example:

{persons:
  {person: &o1 {name: “John”,
                age: 47,
                relatives: {child: &o2,
                            child: &o3}},
   person: &o2 {name: “Mary”,
                age: 21,
                relatives: {father: &o1,
                            sister: &o3}},
   person: &o3 {name: “Paula”,
                age: 23,
                relatives: {father: &o1,
                            sister: &o2}}}}

SSD: representing object databases

• some DBMSs are object-oriented
• such data can be represented as SSD
• &o1, &o2,... are object identifiers
• objects can refer to each other
• relational structure of data (i.e., when drawn as a graph) is no longer tree-like
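A small Python sketch may help to see why object identifiers turn the tree into a graph. The dictionary layout and the helpers `related` and `has_cycle` are hypothetical and chosen only for illustration; references are kept as plain strings such as "&o2".

```python
# Objects keyed by identifier; references are stored as "&oN" strings.
objects = {
    "&o1": {"name": "John", "age": 47,
            "relatives": [("child", "&o2"), ("child", "&o3")]},
    "&o2": {"name": "Mary", "age": 21,
            "relatives": [("father", "&o1"), ("sister", "&o3")]},
    "&o3": {"name": "Paula", "age": 23,
            "relatives": [("father", "&o1"), ("sister", "&o2")]},
}


def related(oid, relation):
    """Names of all objects that oid points to via the given relation."""
    return [objects[target]["name"]
            for label, target in objects[oid]["relatives"]
            if label == relation]


def has_cycle(start):
    """True if following references from start leads back to start."""
    seen, todo = set(), [start]
    while todo:
        for _, target in objects[todo.pop()]["relatives"]:
            if target == start:
                return True
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return False


print(related("&o1", "child"))
print(has_cycle("&o1"))
```

John refers to Mary as a child and Mary refers back to John as father, so the data contains a cycle and cannot be drawn as a tree.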


2. XML - eXtensible Markup Language


What is XML?

• XML is a specialization of SGML, similar to HTML
• XML has been a W3C standard since 1998, see http://www.w3.org/XML/
• was designed to be simple, generic, and extensible
• a “piece of XML”
  – is called an XML document and contains
    • structure
    • data
  – can be associated with a tree
• an XML document is divided into smaller pieces called elements (associated with nodes in the tree):
  – an XML document contains elements
  – elements can contain elements
  – with a non-ambiguous hierarchical structure amongst elements
• an XML document consists of
  – some administrative information followed by
  – an element containing all other elements

(I mark technical terms in red when they are used for the first time.)

What is XML? (ctd)

General things about XML:

• elements are delimited by tags

• tags are enclosed in angle brackets, e.g., <panel>, </from>

• tags are case-sensitive, i.e., <FROM> is not the same as <from>

• we distinguish

– start tags: <...>, e.g., <panel>

– end tags: </...>, e.g., </from>

• a pair of matching start- and end tags delimits an element (like parentheses)

• empty elements of the form <foo></foo> can be written as <foo/>

• attributes specify properties of an element, e.g., <cartoon copyright="United Feature Syndicate">

Example

<cartoon copyright="United Feature Syndicate" year="2000">
  <prolog>
    <series>Dilbert</series>
    <author>Scott Adams</author>
    <characters>
      <character>The Pointy-Haired Boss</character>
      <character>Dilbert</character>
    </characters>
  </prolog>
  <panels>
    <panel colour="none">
      <scene>Pointy-Haired Boss and Dilbert sitting at table.</scene>
      <bubbles>
        <bubble>
          <speaker>Dilbert</speaker>
          <speech>You haven’t given me enough resources to do my project.</speech>
        </bubble>
      </bubbles>
    </panel>
    ......

What is XML? (ctd)

The above-mentioned administrative information of an XML document:

1. XML declaration, e.g., <?xml version="1.0" encoding="iso-8859-1"?>
   identifies the
   – XML version (1.0) and
   – character encoding (iso-8859-1)

2. document type declaration: references a grammar describing the document, called a Document Type Definition
   – e.g., <!DOCTYPE cartoon SYSTEM "cartoon.dtd">
   – a DTD constrains the structure, content & tags of a document
   – can either be local or remote

3. after these 2 declarations, we find the root element, also called the document element

Example

Administrative information:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE cartoon SYSTEM "cartoon.dtd">

Root element:

<cartoon copyright='United Feature Syndicate' year='2000'>
  <prolog>
    <series>Dilbert</series>
    <author>Scott Adams</author>
    <characters>
      <character>The Pointy-Haired Boss</character>
      <character>Dilbert</character>
    </characters>
  </prolog>
  <panels>....</panels>
</cartoon>

What is XML? (ctd)

• in XML, the set of tags is not fixed -- in HTML, the tag set is fixed!

• elements can be nested, to arbitrary depth

• XML itself is not a markup language, but we can specify markup languages with XML

– an XML document can contain or refer to its specification: !DOCTYPE


When is an XML document well-formed?

An XML document is well-formed if

1. there is exactly one root element

2. tags, <, and > are correct (incl. no unescaped < or & in character data)

3. tags are properly nested

4. attributes are unique for each tag and attribute values are quoted

5. no comments inside tags

This is a very weak notion of well-formedness: basically, it only ensures that we can parse a document into a tree

The following are not well-formed:

1. <equation< a + b</equation> and <equation> a < b</equation>

2. <panel> <bubble>Hi there</panel>

3. <panel colour="none" colour="b&w">
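Since well-formedness only means "can be parsed into a tree", any XML parser can serve as a checker. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `is_well_formed` and the sample strings (adapted from the list above) are our own.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the parser can turn text into a tree, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<panel><bubble>Hi there</bubble></panel>",  # well-formed
    "<equation> a < b</equation>",               # unescaped < in character data
    "<panel> <bubble>Hi there</panel>",          # tags not properly nested
    '<panel colour="none" colour="bw"/>',        # attribute occurs twice
    "<a/><b/>",                                  # more than one root element
]

for sample in samples:
    print(is_well_formed(sample), sample)
```

The parser says nothing about whether the document makes sense for an application; that is the job of the schema languages discussed next.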

Further restricting the structure of XML docs

In certain applications, we want XML documents to have a certain structure
• e.g., for exchanging/managing cartoons, we want XML docs with

<cartoon copyright=STRING year=INTEGER>
  <prolog>
    OPTIONAL(<series>name-of-series</series>)
    ONE_OR_MORE(<author>author-name</author>)
    OPTIONAL(<characters>
      ONE_OR_MORE(<character>character-name</character>)
    </characters>)
  </prolog>
  <panels>
    ONE_OR_MORE(<panel colour=STRING>
      <scene>scene-description</scene>
      OPTIONAL(<bubbles>
        ONE_OR_MORE(<bubble>
          <speaker>speaker-name</speaker>
          <speech>bubble-text</speech>
        </bubble>)
      </bubbles>)
    </panel>)
  </panels>
</cartoon>


In applications, we want XML documents to have a certain structure
• with certain elements
• with certain nesting structure
• with certain attributes

This requires a
• document structure specification language or
• grammar or
• model or ...

For XML documents, numerous such formalisms have been developed. We discuss
• DTDs
• XML Schema
• Schematron
• but there are others, e.g., RelaxNG


The structure can be exploited by query languages: we can answer queries related to content and structure

– which scenes does Dilbert occur in?
– does Dilbert swear?
– is there a scene with the pointy-haired boss alone?
– find me all cartoons with characters talking about databases!
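To give a flavour of such structural queries before XPath and XQuery are introduced, here is a rough sketch using Python's standard `xml.etree.ElementTree`. The first panel is taken from the cartoon example; the second panel and the two helper functions are invented for this illustration, and "occurs in" is approximated by "speaks in".

```python
import xml.etree.ElementTree as ET

cartoon = ET.fromstring("""
<cartoon copyright="United Feature Syndicate" year="2000">
  <panels>
    <panel colour="none">
      <scene>Pointy-Haired Boss and Dilbert sitting at table.</scene>
      <bubbles>
        <bubble>
          <speaker>Dilbert</speaker>
          <speech>You haven't given me enough resources to do my project.</speech>
        </bubble>
      </bubbles>
    </panel>
    <panel colour="none">
      <scene>Pointy-Haired Boss alone in his office.</scene>
    </panel>
  </panels>
</cartoon>
""")


def scenes_where_speaks(root, who):
    """Scene descriptions of all panels in which `who` has a speech bubble."""
    return [panel.findtext("scene")
            for panel in root.iter("panel")
            if any(s.text == who for s in panel.iter("speaker"))]


def scenes_without_bubbles(root):
    """Scene descriptions of all panels that contain no bubbles element."""
    return [panel.findtext("scene")
            for panel in root.iter("panel")
            if panel.find("bubbles") is None]


print(scenes_where_speaks(cartoon, "Dilbert"))
print(scenes_without_bubbles(cartoon))
```

Both queries rely on element names and nesting, not only on the words in the text, which is what a keyword search cannot do.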

Contrast this with, e.g., Google keyword search

Next: briefly discuss DTDs and XML Schema

Document Type Definitions: DTDs

• were built into the XML 1.0 specification
• DTDs can be inside or separate from XML documents
• a DTD is a collection of declarations
• an element declaration describes an element in terms of
  – which elements it can contain
  – in which order it can contain these elements
• an element declaration does not constrain the type of data inside elements
• an XML document is valid if
  1. it has an associated document type declaration and
  2. the document complies with the constraints expressed in it.
• e.g., <?xml version="1.0"?><greeting>Hello, world!</greeting> is well-formed, but not valid

More on this later.

XML Schema

• was developed later

• constrains the structure and data of XML documents

– e.g., inside <date-of-birth>...</date-of-birth> must be something of type date

More on this later...

A brief history of XML

• SGML (Standard Generalised Markup Language), 1985:
  – flexible, expressive
  – customised tags
• HTML (Hypertext Markup Language), early 1990s:
  – application of SGML
  – designed for presentation of documents
  – single document type, presentation-oriented tags, e.g., <h1>...</h1>
  – led to the web as we know it
• XML, 1998: first edition of XML 1.0 (now 4th edition)
  – a W3C standard
  – subset/fragment of SGML
  – designed for the exchange/sharing of data
• XHTML is
  – an application of XML
  – almost a fragment of HTML

A rough map of a small part of the acronym world

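[Figure: map relating SGML, HTML, XML, XHTML, DTD, XML Schema, Schematron, RelaxNG, and XSLT, with edges labelled “is an application of”, “is basically a restriction of”, “describes”, and “transforms”]

• HTML is an application of SGML
• XHTML is an application of XML
• XML is basically a restriction of SGML
• XHTML is basically a restriction of HTML
• DTD, XML Schema, Schematron, and RelaxNG each describe XML
• XSLT transforms XML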

How to view or edit XML?

• XML is not for human consumption

– in contrast to HTML, your browser won’t help: you can only do a “view source” or

– first use XSLT (later more) to transform XML into HTML, then use your web browser to view it

• you can use your favourite editor, e.g., emacs in xml mode

• you can use an XML editor, e.g., XMLSpy, Stylus Studio, <oXygen>, MyEclipse, and many more

• we have <oXygen> installed on the lab machines; it supports many features, query languages, schemas, etc. and has been given to us for free

– if you want to use it at home/on your laptop, use a free 30 day trial

XML and HTML

• XML is always case sensitive, i.e., "Hello" is different from "hello"
  – HTML isn’t: it uses SGML's default "ignore case"
• in XML, all tags must be present
  – in HTML, some "tag omission" may be permissible (e.g., <br>)
• in XML, we have a special way to write empty elements: <myname/>
  – which can’t be used in HTML
• in XML, all attribute values must be quoted, e.g., <name lang="eng">...
  – in SGML (and therefore in HTML) this is only required if the value contains a space
• in XML, attribute names cannot be omitted
  – in HTML they may be omitted using shorttags

XML Core Concepts: Prologue -- XML declaration

<?xml param1 param2 ...?>

Each param_i is of the form

parameter-name="parameter-value"

Parameters for
• the XML version used within the document
• the character encoding
• whether the document is standalone or uses external declarations (see the validity constraint for when standalone="yes" is required)

Example: <?xml version="1.0" encoding="US-ASCII" standalone="yes"?>

An XML document should have an XML declaration (but does not need to)

More at http://www.w3.org/TR/REC-xml/

XML Core Concepts: Prologue -- doctype declaration

<!DOCTYPE element-name PUBLIC "pub-id" "f-name.dtd" | SYSTEM "f-name.dtd" | [dt-declarations]>

• one such declaration, before the root element
• element-name is the name of the root element of the document
• the optional dt-declarations is
  – called the internal subset
  – a list of document type definitions
• the optional f-name.dtd refers to the external subset, also containing document type definitions
• e.g., <!DOCTYPE html PUBLIC "http://www.abc.org/dtds/html.dtd" "http://www.abc.org/dtds/html.dtd">

XML Core Concepts: entity declaration

<!ENTITY entity-name identifier-or-value>

• can only occur in a DTD
• associates the entity-name with
  – a value or
  – a piece of XML outside the document, referred to by entity-name
• e.g., we can state <!ENTITY donau "Donaudampfschifffahrtskapitaen"> in the DTD and then refer to it by
  <text>Then the &donau; entered the room.</text>
• e.g., we can state <!ENTITY chap SYSTEM "chap02.XML"> in the DTD and then refer to it in
  <book><title>Short Book</title> &chap;</book>
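The effect of an internal entity can be observed with any parser that reads the internal subset. The following sketch uses Python's standard `xml.etree.ElementTree`, which expands internal entities declared in the document's own DOCTYPE; placing the declaration in the internal subset rather than in a separate DTD file is an assumption made to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal subset of the DOCTYPE.
doc = """<!DOCTYPE text [
  <!ENTITY donau "Donaudampfschifffahrtskapitaen">
]>
<text>Then the &donau; entered the room.</text>"""

text = ET.fromstring(doc).text
print(text)


def parses(source):
    """True if source parses, False if the parser reports an error."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


# Without the declaration, the reference &donau; is an error.
print(parses("<text>Then the &donau; entered the room.</text>"))
```

The application never sees the reference itself, only the replacement text; an entity reference without a declaration makes the document not well-formed.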


XML Core Concepts: entities

• entities are placeholders in XML (declared in the DTD)
• once declared, they can be referred to (several times) within a document
• to reference an entity called entity-name, we use &entity-name; or %entity-name;
• we can distinguish between
  – [which content] parsed vs. unparsed entities; the latter is indicated by NDATA
    • e.g., an image is unparsed (because we don’t want it to be parsed/validated)
      <!ENTITY my-pic SYSTEM "../grafix/my-photo.gif" NDATA gif>
  – [which name] entity references (&...;), character references (&#...;), parameter entities (%...;)
    • e.g., &chap; references my defined entity “chap”; pre-defined entity references include &amp;, &gt; and &lt;
    • character references: e.g., &#xE7; (hexadecimal) and &#231; (decimal) both denote ç
    • parameter entities are used within DTDs (more later)
  – [how declared] internal vs. external entities: external <=> (SYSTEM or PUBLIC)
    • e.g., chap is external, and my-pic with
      <!ENTITY my-pic SYSTEM "../grafix/my-photo.gif" NDATA gif>
      is external and unparsed
    • internal entities are parsed entities, e.g., <!ENTITY title "My title"> is internal

XML Core Concepts: elements (the main concept)

<element-name attr-decl1 ... attr-decln>
  element-content
</element-name>

• arbitrary number of attributes is allowed
• but each attribute occurs at most once in one element
• each attr-decl_i is of the form attr-name="attr-value"
• the element-content can contain
  – text and/or
  – one or more other elements
  (elements only: element content; text and/or elements: mixed content)
• an empty element can be abbreviated as <element-name attr-decl1 ... attr-decln/>

XML Core Concepts: comments

<!-- comment-text -->

• can go in every part of an XML document except
  – inside tags
  – before the XML declaration
• the ending must be -->
  – ending in ---> is not allowed
• they can be used to “turn off” a part of an XML document

XML Core Concepts: CDATA sections

<![CDATA[ whatever-you-want ]]>

• CDATA stands for character data, i.e., not markup
• the only thing that cannot go into a CDATA section is the ending ]]>
• useful when we want to use special characters such as &, <, etc.
• e.g., <text>You can type <![CDATA[ if (&x < &y )]]> and this is alright.</text>
• left angle brackets and ampersands may occur in their literal form; they need not and cannot be escaped using "&lt;" and "&amp;"
• e.g., <text>You can type <![CDATA[ if you write &lt; ]]> and this is alright.</text>
  but then "&lt;" will not be replaced by "<"
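The difference between CDATA sections and escaping is easy to check with a parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the three sample documents are our own.

```python
import xml.etree.ElementTree as ET

# The same text, once inside a CDATA section and once with escaping.
with_cdata = ET.fromstring("<text><![CDATA[if (x < y && y < z)]]></text>")
escaped = ET.fromstring("<text>if (x &lt; y &amp;&amp; y &lt; z)</text>")

# Inside CDATA, "&lt;" is just four characters and is not expanded.
literal = ET.fromstring("<text><![CDATA[&lt;]]></text>")

print(with_cdata.text)
print(escaped.text)
print(literal.text)
```

For the application, the first two documents carry the same character data; CDATA is only a more convenient way of writing it. The third shows that an escape inside a CDATA section stays as it is.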


XML: the information set

• the information set, or “infoset”, is a W3C recommendation
• the infoset of an XML document is an abstract data structure that explicates
  – the information contained in the XML document and
  – how to refer to the bits and pieces of an XML document
• it associates every
  – well-formed (!) XML document that conforms to the namespaces recommendation
  – with a tree-like data structure over 7 (or 11) kinds of information items
    • document information item
    • element information item
    • attribute information item
    • etc.
• more on http://www.w3.org/TR/xml-infoset/

XML: the information set - an example

<?xml version="1.0"?>
<mytext content="medium">
  <title>Hallo!</title>
  <content>Bye!</content>
</mytext>

[Figure: the corresponding tree of nodes, connected by “is child of” edges]

Document: nodeType = DOCUMENT_NODE, nodeName = #document, nodeValue = (null)
  Element: nodeType = ELEMENT_NODE, nodeName = mytext, nodeValue = (null); firstchild, lastchild, attributes
    Attribute: nodeType = ATTRIBUTE_NODE, nodeName = content, nodeValue = medium
    Element: nodeType = ELEMENT_NODE, nodeName = title, nodeValue = (null); firstchild
      Text: nodeType = TEXT_NODE, nodeName = #text, nodeValue = Hallo!
    Element: nodeType = ELEMENT_NODE, nodeName = content, nodeValue = (null); firstchild
      Text: nodeType = TEXT_NODE, nodeName = #text, nodeValue = Bye!
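The node properties shown here (nodeType, nodeName, nodeValue) are those of the DOM, so the example can be reproduced with a DOM parser. The sketch below uses Python's standard `xml.dom.minidom`; the generator `walk` is a hypothetical helper, and the document is written without whitespace between elements so that no extra whitespace-only text nodes appear.

```python
from xml.dom import Node, minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<mytext content="medium">'
    '<title>Hallo!</title><content>Bye!</content>'
    '</mytext>'
)


def walk(node, depth=0):
    """Yield (depth, nodeType, nodeName, nodeValue) in document order."""
    yield depth, node.nodeType, node.nodeName, node.nodeValue
    if node.nodeType == Node.ELEMENT_NODE:
        # Attributes are not children; they hang off the element separately.
        for i in range(node.attributes.length):
            attr = node.attributes.item(i)
            yield depth + 1, attr.nodeType, attr.nodeName, attr.nodeValue
    for child in node.childNodes:
        yield from walk(child, depth + 1)


for depth, node_type, name, value in walk(doc):
    print("  " * depth + f"{name} (nodeType {node_type}): {value!r}")
```

Note that the attribute named content and the element named content are different nodes with different node types.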


3. DTDs: Document Type Definitions


Why do we need a schema formalism for XML?

Benefits of a schema:

• it communicates the structure of an XML document, e.g. to a computer

• it can function as a publishable, shareable specification
  – e.g., we can agree on a schema to share protein data in

• it helps catch high-level mistakes: being well-formed is not enough

• it allows one to implement/use tools that check validity in a
  – portable
  – efficient way
  (the alternative would be to write dedicated programs/parsers...)

• schemata are extensible and re-usable

Drawbacks of a schema:

• it requires additional work


Example: bio data from SWISSPROT data in XML

<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
  <entry dataset="Swiss-Prot" created="2004-01-16" modified="2006-10-31" version="23">
    <accession>Q7WDP2</accession>
    <name>ACEK_BORBR</name>
    <protein>
      <name ref="1 2">Isocitrate dehydrogenase kinase/phosphatase</name>
      <name>IDH kinase/phosphatase</name>
      <name>IDHK/P</name>
    </protein>
    <gene>
      <name type="primary">aceK</name>
      <name type="ordered locus">BB4946</name>
    </gene>
    <organism key="3">
      <name type="scientific">Bordetella bronchiseptica</name>
      <name type="synonym">Alcaligenes bronchisepticus</name>
      <dbReference type="NCBI Taxonomy" id="518" key="4"/>
      <lineage>
        <taxon>Bacteria</taxon>
        <taxon>Proteobacteria</taxon>
        <taxon>Betaproteobacteria</taxon>
        <taxon>Burkholderiales</taxon>
        <taxon>Alcaligenaceae</taxon>
        <taxon>Bordetella</taxon>
      </lineage>
    </organism>

Example: bio data from SWISSPROT data in XML (ctd)

    <reference key="5">
      <citation type="journal article" date="2003" name="Nat. Genet." volume="35" first="32" last="40">
        <title>Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica.</title>
        <authorList>
          <person name="Parkhill J."/>
          ...
          <person name="Maskell D.J."/>
        </authorList>
        <dbReference type="PubMed" id="12910271" key="6"/>
        <dbReference type="MEDLINE" id="22827954" key="7"/>
        <dbReference type="DOI" id="10.1038/ng1227" key="8"/>
      </citation>
      <scope>NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].</scope>
      <source>
        <strain>RB50 / ATCC BAA-588 / NCTC 13252</strain>
      </source>
    </reference>
    <comment type="function" status="By similarity">
      <text>Bifunctional enzyme which can phosphorylate or dephosphorylate isocitrate dehydrogenase (IDH) on a specific serine residue. This is a regulatory mechanism which enables bacteria to bypass the Krebs cycle via the glyoxylate shunt in response to the source of carbon. When bacteria are grown on glucose, IDH is fully active and unphosphorylated, but when grown on acetate or ethanol, the activity of IDH declines drastically concomitant with its phosphorylation</text>
    </comment>
    <comment type="catalytic activity">
      <text>ATP + [isocitrate dehydrogenase (NADP(+))] = ADP + [isocitrate dehydrogenase (NADP(+))] phosphate</text>
    </comment>
    <comment type="subcellular location" status="By similarity">
      <text>Cytoplasm</text>
    </comment>
    ....

Why do we need a schema for XML?

So, using a schema for SWISSPROT we can

• publish it so that everybody knows about the structure of our data

• validate data from other sources before it is added to SWISSPROT

• build tools around this schema to

– analyse,

– search, and

– transform data

Also, other XML tools (for querying/transforming, etc.) can make use of this schema, e.g., for query optimisation

Schemata for XML

Interestingly, there are/were various different schema languages for XML:

• DTDs

• XML Schema

• Schematron

• SOX (Schema for Object-Oriented XML)

• RELAX NG (Regular Language Description for XML New Generation)

• etc.

DTDs

• a DTD is a collection of declarations that specify
  – which elements are allowed,
  – how elements can be nested, i.e., what child elements an element can have, incl. their
    • type
    • order
    • number
  – what attributes an element can have, incl. their
    • attribute name
    • type
    • number: whether it is compulsory or optional (obviously at most 1)
• DTDs are inspired by SGML
  – good for describing documents
  – not so good for describing data
  – simple and easy to understand and validate
• DTDs are standardized in the XML recommendations

(remember: an element comes with tags)

Data Types in DTDs

• for text content: PCDATA
  – stands for parsed character data
  – stands for raw text, possibly with entity references like &chap; or &lt; but without tags or child elements
• for attribute values:
  – CDATA: a simple string
  – NMTOKEN: a string without blanks
  – NMTOKENS: list of such strings separated by blanks
  – ID: like NMTOKEN, but unique value in document
  – IDREF: like NMTOKEN, but each value must also occur as ID in document
  – IDREFS: list of IDREFs separated by blanks
• for element types: regular expressions denoting structure
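The ID and IDREF(S) constraints can be illustrated with a small check written by hand. Python's standard library does not validate against DTDs, so the sketch below assumes that a hypothetical DTD declares the attribute `id` as ID and the attribute `ref` as IDREFS; the function `check_ids` and both sample documents are made up for this purpose.

```python
import xml.etree.ElementTree as ET


def check_ids(root, id_attr="id", idref_attr="ref"):
    """Return a list of problems: duplicate IDs and dangling IDREFs."""
    problems, seen = [], set()
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append(f"duplicate ID {value}")
            seen.add(value)
    for element in root.iter():
        # An IDREFS value is a blank-separated list of references.
        for ref in element.get(idref_attr, "").split():
            if ref not in seen:
                problems.append(f"dangling IDREF {ref}")
    return problems


good = ET.fromstring('<people><p id="o1"/><p id="o2" ref="o1"/></people>')
bad = ET.fromstring('<people><p id="o1"/><p id="o1" ref="o1 o9"/></people>')

print(check_ids(good))
print(check_ids(bad))
```

This is the DTD counterpart of the object identifiers &o1, &o2, ... seen earlier: IDs name nodes, and IDREFs turn the document tree into a graph.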


DTD: element type declarations

<!ELEMENT element-name content-model>

The content-model can be of one of the following forms:

• EMPTY: empty element

• ANY: anything

• (#PCDATA): text only

• (expression), where expression is built over parentheses, element names and #PCDATA using the following operators:

– , : concatenation or sequence

– | : exclusive-or or choice

– * : Kleene Star or repetition (repeat as often as you like, but finitely often)

– + : repetition (as *, but repeat at least once)

– ? : optional

• each element can be declared at most once!


<!ELEMENT element-name content-model>

If we declare an element with
• element content, all its children are elements, hence its content-model may not contain #PCDATA
  – e.g., <!ELEMENT spec (front,body,back?)>
• mixed content, its children can be text and/or elements (or pure text), and its content-model must be of the form
  – '(' '#PCDATA' ('|' element-name)* ')*'
  – i.e., it may not contain “,” and
  – no element name may occur more than once
  – e.g., <!ELEMENT p (#PCDATA|a|ul|b|i|em)*> is ok
    but <!ELEMENT p (#PCDATA|a|(ul,b)|i|em)*> is not ok
    and <!ELEMENT p (#PCDATA|a|ul|ul|i|em)*> is not ok
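The restrictions on mixed content are simple enough to be checked with a regular expression. The following Python sketch covers only the starred form given above; the pattern `MIXED` and the function `is_valid_mixed_model` are hypothetical names, and element names are approximated by a simplified pattern rather than the full XML Name production.

```python
import re

# '(' '#PCDATA' ('|' element-name)* ')*', with optional whitespace.
MIXED = re.compile(
    r"^\(\s*#PCDATA\s*((?:\|\s*[A-Za-z_][\w.\-]*\s*)*)\)\*$")


def is_valid_mixed_model(model):
    """Check the form of a mixed content model and reject repeated names."""
    match = MIXED.match(model.strip())
    if match is None:
        return False  # contains ',' or nested groups, or is not starred
    names = [name.strip() for name in match.group(1).split("|")
             if name.strip()]
    return len(names) == len(set(names))


for model in ["(#PCDATA|a|ul|b|i|em)*",
              "(#PCDATA|a|(ul,b)|i|em)*",
              "(#PCDATA|a|ul|ul|i|em)*"]:
    print(is_valid_mixed_model(model), model)
```

The three models are the ones from the list above: the first is accepted, the second fails because of the nested sequence, and the third because ul occurs twice.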



• please note that the root element is always declared in a DTD since it is the name ofthe DTD: <!DOCTYPE mytext [...]>

or <!DOCTYPE mytext SYSTEM “mydtd.dtd”>

• for a document to be valid w.r.t. a DTD, each of its objects (elements, attributes, etc.)must be valid

• an element e-name is valid w.r.t. a DTD if e-name is declared in the DTD and, if e-name’s content is declared as

– EMPTY, then any element whose name is e-name must have no content (not even whitespace)

– ANY, then it can only contain PCDATA or valid elements (and therefore only ones that are declared in the DTD)

– (expression), then its content must match this expression

• an attribute is valid w.r.t. a DTD if it is declared in the DTD and

– if it conforms to its declaration (see later)


61

DTD: structure of element content - examples

• <!ELEMENT myEmpty EMPTY>

• <!ELEMENT country (name, population, ethnicgroup*, religion*, border*, (province+|city+))>

• <!ELEMENT section (header, paragraph, (paragraph | image | figure | subsection)*, bibliography?)>

62

DTD: structure of element content -- determinism

• as we see below, the content-model in an element type declaration for element content must be deterministic

• (for mixed content, it is deterministic by definition)

• e.g., the content model ((b, c) | (b, d)) is non-deterministic

The content of an element matches a content model if and only if it is possible to trace out a path through the content model, obeying the sequence, choice, and repetition operators and matching each element in the content against an element type in the content model. For compatibility, it is an error if an element in the document can match more than one occurrence of an element type in the content model. For more information, see "E. Deterministic Content Models".

http://www.w3.org/TR/REC-xml/#sec-element-content

• “for compatibility” is “for backward compatibility with SGML”

63

DTDs: element content of element type

• element declarations specify the “legal” structure of an element’s content

– so far, we haven’t said what it means for an element to “match” an expression in an element declaration

• but we need to know in order to decide the following question:

given an XML document and a DTD, is the document valid w.r.t. the DTD?

i.e., does the document conform to the DTD?

• we can view element declarations as

– a grammar and parse an XML document

– ...but we need to know what it means for the list of children of an element to match a regular expression

64


Example: does the content of this “mytext” element match its content model?

<!ELEMENT mytext (title,(author*)?,content*)>

[Figure: DOM node diagram. Element (nodeType = ELEMENT_NODE, nodeName = mytext, nodeValue = null) with firstchild ... lastchild and attributes; child shown: Element (nodeType = ELEMENT_NODE, nodeName = content, nodeValue = null).]


65


• we consider
– an element e1
– its element type declaration <!ELEMENT e1 regexp>
– and figure out whether regexp is “legal”, i.e., 1-unambiguous
– for this (and to understand when a document is valid wrt a DTD), we need to know what it means for an (ordered) string w = a1...an (of element names aj) to match regexp

• an alphabet is a finite set, say Σ, whose elements are called letters
– here, our letters are element names

• a word is a finite string of letters

• concatenation is defined as usual: (a1...an)(b1...bm) = a1...anb1...bm

• we use ε for the empty word: w ε = ε w = w

• a language is a set of words

66


What is the language L(rexp) described by a (regular) expression rexp over an alphabet Σ and operators { , | * + ? } ?

This language is inductively defined as follows:

• L(a) = {a} for a ∈ Σ

• L(rexp1,rexp2 ) = { w1 w2 | w1 ∈ L(rexp1) and w2 ∈ L(rexp2) }

• L(rexp1|rexp2 ) = L(rexp1) ∪ L(rexp2)

• L(rexp?) = {ε } ∪ L(rexp)

• L(rexp+) = L(rexp) ∪ L(rexp,rexp) ∪ L(rexp,rexp,rexp) ∪ ... = ⋃_{n≥1} L(rexp^n)

• L(rexp*) = {ε} ∪ L(rexp) ∪ L(rexp,rexp) ∪ L(rexp,rexp,rexp) ∪ ... = ⋃_{n≥0} L(rexp^n)

(here rexp^n stands for rexp,...,rexp with n copies of rexp, and L(rexp^0) = {ε})
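The inductive definition above can be turned almost literally into a membership test. The sketch below is one possible rendering in Python, under the assumption that expressions are written as nested tuples such as ("seq", e1, e2), ("alt", e1, e2), ("opt", e), ("star", e), ("plus", e), with plain strings as letters; this encoding and the function names are illustrative choices, not a standard API.

```python
def ends(e, w, i):
    """Set of positions j such that the slice w[i:j] is in L(e)."""
    if isinstance(e, str):                      # L(a) = {a}
        return {i + 1} if i < len(w) and w[i] == e else set()
    op, *args = e
    if op == "seq":                             # L(e1,e2): concatenation
        current = {i}
        for a in args:
            current = {k for j in current for k in ends(a, w, j)}
        return current
    if op == "alt":                             # L(e1|e2): union
        return set().union(*(ends(a, w, i) for a in args))
    if op == "opt":                             # L(e?) = {ε} ∪ L(e)
        return {i} | ends(args[0], w, i)
    if op in ("star", "plus"):                  # union over n >= 0 (star) or n >= 1 (plus)
        reached = ends(args[0], w, i)
        frontier = set(reached)
        while frontier:
            new = {k for j in frontier for k in ends(args[0], w, j)} - reached
            reached |= new
            frontier = new
        return reached | {i} if op == "star" else reached
    raise ValueError(f"unknown operator {op!r}")

def in_language(e, w):
    """True if the whole word w (a list of letters) is in L(e)."""
    return len(w) in ends(e, w, 0)

# (title,(author*)?,content*)
MYTEXT = ("seq", "title", ("opt", ("star", "author")), ("star", "content"))
print(in_language(MYTEXT, ["title", "content", "content"]))
print(in_language(MYTEXT, ["content", "title"]))
```

This tracks all possible end positions at once, so it does not rely on the expression being deterministic; it is meant to mirror the definition, not to be fast.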

67


Example: now we know that the “mytext” element matches its content model

<!ELEMENT mytext (title,(author*)?,content*)>

because

title content content ∈ L(title,(author*)?,content*)

Hence we almost know what it means for a document to be valid wrt a DTD (attributes are still missing)

[Figure: DOM node diagram as on slide 64. Element (nodeType = ELEMENT_NODE, nodeName = mytext, nodeValue = null) with firstchild ... lastchild and attributes; child shown: Element (nodeType = ELEMENT_NODE, nodeName = content, nodeValue = null).]

68

DTDs: checking whether a document is valid wrt a DTD

• as we have seen, for a document to be valid wrt a DTD, each of its objects (elements, attributes, etc.) must be valid

• consider an element e-name with child nodes a1...an :

e-name is valid w.r.t. a DTD if e-name is declared in the DTD and, if e-name’s content is declared as

– expression (in <!ELEMENT e-name expression>) then a1...an must be in L(expression)

• this is to be checked for each element in a document, i.e., for

– mixed content elements (easy -- why?) and for

– element content elements...how? is it costly? 1-unambiguity helps!
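As a rough illustration of this per-element check, the sketch below translates each element-content model into a Python regular expression over the names of an element's children and tests every element of a parsed document. It is a simplification under stated assumptions: the `models` dictionary, `model_to_regex` and `invalid_elements` are hypothetical names, only element content and (#PCDATA) are handled (no EMPTY, ANY, mixed content or attributes), and text between child elements is ignored.

```python
import re
import xml.etree.ElementTree as ET

def model_to_regex(model):
    """Translate an element-content model into a regex over 'name ' tokens."""
    parts = []
    # Commas and whitespace are skipped: sequence becomes plain concatenation.
    for tok in re.findall(r"[A-Za-z_][\w.\-]*|[()|*+?]", model):
        if tok == "(":
            parts.append("(?:")
        elif tok in ")|*+?":
            parts.append(tok)
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))

def invalid_elements(root, models):
    """Names of elements that are undeclared or whose children do not match."""
    bad = []
    for el in root.iter():
        model = models.get(el.tag)
        if model is None:
            bad.append(el.tag)                       # not declared at all
        elif model == "(#PCDATA)":
            if len(el) > 0:                          # text only: no child elements
                bad.append(el.tag)
        else:
            children = "".join(child.tag + " " for child in el)
            if not model_to_regex(model).fullmatch(children):
                bad.append(el.tag)
    return bad

models = {
    "mytext": "(title,(author*)?,content*)",
    "title": "(#PCDATA)", "author": "(#PCDATA)", "content": "(#PCDATA)",
}
good = ET.fromstring("<mytext><title>T</title><content>a</content><content>b</content></mytext>")
wrong = ET.fromstring("<mytext><content>a</content><title>T</title></mytext>")
print(invalid_elements(good, models), invalid_elements(wrong, models))
```

Python's `re` module matches by backtracking, so this sketch does not show the efficiency gain that 1-unambiguity allows; it only shows what has to be checked for each element.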


69

DTDs: Ambiguity of expressions -- an example

• Consider the expression E = (a | b)* , a , a*

• mark the ith occurrence in E of a letter x as xi: E = (a1 | b1)* , a2 , a3*

• clearly, aaa ∈ L(E), but there are three derivations or witnesses:

1. a1a1a2

2. a1a2a3

3. a2a3a3

• hence we call E ambiguous: given E, a parser would need to “look ahead” or guess when parsing aaa

• but there is F = (a | b)* , a such that
1. F and E are equivalent, i.e., L(E) = L(F), and
2. F is unambiguous, i.e., for each w ∈ L(F), there is exactly 1 witness
– hence we say that E is ambiguous, but L(E) is not
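The witnesses for a word can be enumerated by brute force. The following Python sketch assumes marked expressions are written as nested tuples ("seq", ...), ("alt", ...), ("star", e) with marked letters as strings such as "a1"; this encoding and the helper names are illustrative assumptions. It also assumes that the body of a star never matches the empty word, which holds for E and F.

```python
def witnesses(e, w, i=0):
    """Yield (j, marked) where marked is a marked word in L(e) that unmarks to w[i:j]."""
    if isinstance(e, str):                       # marked letter, e.g. "a1"
        if i < len(w) and e.rstrip("0123456789") == w[i]:
            yield i + 1, (e,)
        return
    op, *args = e
    if op == "seq":
        if not args:
            yield i, ()
            return
        for j, m1 in witnesses(args[0], w, i):
            for k, m2 in witnesses(("seq", *args[1:]), w, j):
                yield k, m1 + m2
    elif op == "alt":
        for a in args:
            yield from witnesses(a, w, i)
    elif op == "star":
        yield i, ()                              # zero repetitions
        for j, m1 in witnesses(args[0], w, i):
            if j > i:                            # require progress to avoid looping
                for k, m2 in witnesses(e, w, j):
                    yield k, m1 + m2
    else:
        raise ValueError(f"unknown operator {op!r}")

def all_witnesses(e, word):
    """All marked words in L(e) that unmark to the whole of word."""
    return sorted({m for j, m in witnesses(e, word) if j == len(word)})

E = ("seq", ("star", ("alt", "a1", "b1")), "a2", ("star", "a3"))
F = ("seq", ("star", ("alt", "a1", "b1")), "a2")
print(all_witnesses(E, "aaa"))   # three witnesses: E is ambiguous
print(all_witnesses(F, "aaa"))   # one witness
```

For E and the word aaa this yields exactly the three witnesses listed above; for F it yields a single one.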

70

DTDs: 1-Ambiguity of expressions -- an example

• Consider the expression F = (a | b)* , a

• mark the ith occurrence of a letter x as xi: F = (a1 | b1)* , a2

• clearly, baa ∈ L(F), but consider its marking from left to right:
1. clearly, we start with b1aa
2. but how to proceed? We have two choices:

• b1a1a and
• b1a2a

– to decide which to choose, we need to “look ahead” in baa, see that there is a second a and thus choose b1a1a2

• hence we say that F is not 1-unambiguous

• but it is still unambiguous

71

1-Ambiguity -- preparation of formal definition

• in our examples, we marked letters in regular expressions with subscripts

• for E a regular expression, we use m(E) for the marked version of E: m(E) is obtained from E by marking each letter with a subscript such that
– each subscripted letter occurs at most once in m(E)

• if E is a marked regular expression, we use u(E) for the expression obtained through dropping all subscripts

• clearly, if F = m(E), then u(F) = u(m(E)) = E

• we can extend marking and dropping to strings, alphabets and languages:

– m(w) is a marked version of the string w (note: one w can have more than one m(w))
– m(Σ) is the set of all marked letters in Σ
– m(L) is the set of words m(w) with w ∈ L
– ...etc.

72

Ambiguity -- formal definition

• an expression E over Σ is unambiguous if, for every w ∈ L(E), there exists exactly one word w’ ∈ L(m(E)) such that w = u(w’)

• i.e., this is simply the formalization of what we said on slide 69

• but, as we have seen, unambiguity is not enough to save a parser from having to look ahead --> 1-unambiguity!


73

1-Ambiguity -- formal definition

• an expression E over Σ is 1-unambiguous if, for every u,v,w over m(Σ) and every x,y ∈ m(Σ), we have that

– if both uxv and uyw ∈ L(m(E)) and x ≠ y, then u(x) ≠ u(y)
– e.g., not both ua3v and ua4w ∈ L(m(E))

• a language L is 1-unambiguous if there exists a 1-unambiguous expression E such that L = L(E)

• 1-unambiguous
– is the formalization of “Deterministic content model”
– allows us to check validity of an element without look-ahead or backtracking, i.e., fast -- important to validate large documents!

? how to test E for 1-unambiguity? how to test an XML document for errors?

• not every regular L is 1-unambiguous (a standard example is L((a | b)* , a , (a | b))); remember also that the expression F = (a | b)* , a is not 1-unambiguous, although L(F) is: (b*, a)+ is an equivalent 1-unambiguous expression

? how to test L for 1-unambiguity?

? how to test whether we can make an XML document error-free?

74

1-Ambiguity -- a consequence

Lemma: if E is 1-unambiguous, then for every word w ∈ L(E), there is at most one word v ∈ L(m(E)) such that u(v) = w.

• let’s introduce some notation for E over Σ:
– first(E) = {a ∈ Σ | there is some aw ∈ L(E)}
– last(E) = {a ∈ Σ | there is some wa ∈ L(E)}
– follow(E,a) = {b ∈ Σ | there is some wabv ∈ L(E)}

Lemma: let m(E) be a marked regular expression and x_1 ... x_n letters in m(Σ). Then x_1 ... x_n ∈ L(m(E)) if and only if

– x_1 ∈ first(m(E)),
– x_n ∈ last(m(E)), and
– for all i with 1 ≤ i < n: x_{i+1} ∈ follow(m(E), x_i)

• In this lemma, marking is essential: consider E = aa and the word aaa
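To see why marking matters, one can apply the three conditions of the lemma to E = a , a once with marked and once with unmarked letters. The Python sketch below writes the first, last and follow sets out by hand from their definitions; the function name `passes_set_test` is an illustrative assumption.

```python
from itertools import product

def passes_set_test(word, first, last, follow):
    """Check the three conditions of the lemma for a non-empty sequence of letters."""
    return (len(word) > 0
            and word[0] in first
            and word[-1] in last
            and all(b in follow.get(a, set()) for a, b in zip(word, word[1:])))

# Marked expression m(E) = a1 , a2: its only word is a1 a2.
marked = dict(first={"a1"}, last={"a2"}, follow={"a1": {"a2"}})
# Unmarked E = a , a: the sets can no longer tell the two occurrences of a apart.
unmarked = dict(first={"a"}, last={"a"}, follow={"a": {"a"}})

markings_of_aaa = list(product(["a1", "a2"], repeat=3))
accepted_marked = [m for m in markings_of_aaa if passes_set_test(m, **marked)]
accepted_unmarked = passes_set_test(("a", "a", "a"), **unmarked)

print(accepted_marked)     # no marking of aaa passes
print(accepted_unmarked)   # but the unmarked sets wrongly accept aaa
```

With marked letters no marking of aaa passes the test, as expected since aaa is not in L(E); with unmarked letters the same test accepts aaa.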

75

1-Ambiguity -- more consequences

• let’s keep in mind some notation for E over Σ:

– first(E) = {a ∈ Σ | there is some aw ∈ L(E)}

– last(E) = {a ∈ Σ | there is some wa ∈ L(E)}

– follow(E,a) = {b ∈ Σ | there is some wabv ∈ L(E)}

Lemma: A regular expression E is 1-unambiguous if and only if

1. for every x,y ∈ first(m(E)), if x ≠ y, then u(x) ≠ u(y), and

2. for every z and every x,y ∈ follow(m(E),z), if x ≠ y, then u(x) ≠ u(y)

This allows us to test whether a given DTD is free of errors due to non-determinism of expressions in element declarations:

• for each expression E in the DTD’s element declarations,

– construct m(E), first(m(E)), and follow(m(E),z) for each letter z in m(E) (all finite, all easily constructible)

– for all letters x, y, z in these sets, test (1) and (2) above
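The test described above can be sketched in a few lines of Python. The sketch assumes expressions are given as nested tuples ("seq", ...), ("alt", ...), ("star", e), ("plus", e), ("opt", e) over element names; the marking is done on the fly by pairing each letter with its position. The encoding and the names `analyse` and `is_one_unambiguous` are illustrative assumptions, and the computation of first and follow follows the usual structural (Glushkov-style) rules.

```python
from itertools import count

def analyse(e, counter, follow):
    """Return (nullable, first, last) of m(e) and fill follow[x] for each marked letter x."""
    if isinstance(e, str):
        x = (e, next(counter))                 # marked letter: (name, position)
        follow[x] = set()
        return False, {x}, {x}
    op, *args = e
    if op == "seq":
        nullable, first, last = True, set(), set()
        for a in args:
            n, f, l = analyse(a, counter, follow)
            for x in last:                     # whatever ends the prefix can be followed by first(a)
                follow[x] |= f
            first = first | f if nullable else first
            last = last | l if n else l
            nullable = nullable and n
        return nullable, first, last
    if op == "alt":
        parts = [analyse(a, counter, follow) for a in args]
        return (any(p[0] for p in parts),
                set().union(*(p[1] for p in parts)),
                set().union(*(p[2] for p in parts)))
    n, f, l = analyse(args[0], counter, follow)   # star, plus or opt
    if op in ("star", "plus"):
        for x in l:                            # repetition: last letters loop back to first letters
            follow[x] |= f
    return n or op != "plus", f, l

def is_one_unambiguous(e):
    """Conditions (1) and (2): no two distinct marked letters with the same name
    in first(m(E)) or in any follow(m(E), z)."""
    follow = {}
    _, first, _ = analyse(e, count(1), follow)
    def deterministic(letters):
        names = [name for name, _ in letters]
        return len(names) == len(set(names))
    return deterministic(first) and all(deterministic(s) for s in follow.values())

F = ("seq", ("star", ("alt", "a", "b")), "a")                     # (a | b)* , a
G = ("plus", ("seq", ("star", "b"), "a"))                         # (b*, a)+
MYTEXT = ("seq", "title", ("opt", ("star", "author")), ("star", "content"))
print(is_one_unambiguous(F), is_one_unambiguous(G), is_one_unambiguous(MYTEXT))
```

F fails because first(m(F)) contains both a1 and a2, whereas (b*, a)+, which describes the same language, passes.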

76

1-Ambiguity -- more consequences

In case a DTD contains an error due to non-determinism of expressions in element declarations, we can even check whether we can “repair” it without changing its elements and structure:

Theorem: Given a regular expression E, we can decide in time polynomial in |E| whether L(E) is 1-unambiguous.

• this important fact is proven in One-Unambiguous Regular Languages (1997), Anne Brüggemann-Klein, Derick Wood


77

DTDs: attribute definitions

• remember: an element (of a well-formed XML document) cannot have more than 1 attribute-value pair with the same attribute name!

• DTDs can specify attribute names and their types:

<!ATTLIST element-name
    attr-name1 attr-type1 attr-constr1
    ...
    attr-namek attr-typek attr-constrk>

• attr-namei for names of attributes
• attr-typei can be one of CDATA, NMTOKEN, NMTOKENS, ID, IDREF, IDREFS (see slide 57) or (c1 | ... | cn)
• attr-constri see next slide

78

DTDs: attribute definitions - attribute constraints

DTDs allow us to express the following attribute constraints:

• #REQUIRED: element must have one attribute-value pair for this attribute

• #IMPLIED: element may have an attribute-value pair for this attribute

• “default-value”: we can specify a default value for this attribute

• #FIXED “fixed-value”: each element has the same value fixed-value

– and I don’t need to mention this explicitly

79

DTDs: attribute definitions - attribute constraints - examples

<!ATTLIST country
    code ID #REQUIRED
    capital IDREF #REQUIRED
    memberships IDREFS #IMPLIED
    products NMTOKENS #IMPLIED>

<!ATTLIST desert
    id ID #REQUIRED
    type (sand|rock|ice) 'sand'
    climate NMTOKEN #FIXED 'dry'>

80

DTDs: attribute definitions - attribute constraints

An attribute att-name in an element e-name is valid w.r.t. a DTD if att-name is declared for e-name, i.e., if it occurs in <!ATTLIST e-name .... att-name type constr ...>, and att-name’s value conforms to its type type and satisfies its constraints constr

E.g., in order to conform to ID, an attribute’s value must be unique in document

E.g., in order to satisfy #REQUIRED an attribute must be present

[Figure: DOM node diagram. Element (nodeType = ELEMENT_NODE, nodeName = mytext, nodeValue = null) with attributes: Attribute (nodeType = ATTRIBUTE_NODE, nodeName = content, nodeValue = medium) and Attribute (nodeType = ATTRIBUTE_NODE, nodeName = year, nodeValue = 2008).]
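A few of these attribute checks can be sketched in Python on top of the standard `xml.etree.ElementTree` parser. This is only an illustration under assumptions: the dictionary `ATTLISTS` is a hand-written stand-in for parsed <!ATTLIST ...> declarations (the `city` entry is hypothetical), and the function `attribute_errors` covers only "declared", #REQUIRED, ID uniqueness and IDREF/IDREFS targets, not NMTOKEN syntax, enumerations, defaults or #FIXED.

```python
import xml.etree.ElementTree as ET

# element name -> attribute name -> (type, constraint); stand-in for ATTLIST declarations
ATTLISTS = {
    "country": {"code": ("ID", "#REQUIRED"),
                "capital": ("IDREF", "#REQUIRED"),
                "memberships": ("IDREFS", "#IMPLIED")},
    "city": {"id": ("ID", "#REQUIRED")},       # hypothetical declaration for the example
}

def attribute_errors(root, attlists):
    """Collect violations of a few attribute validity conditions."""
    errors, ids, refs = [], set(), []
    for el in root.iter():
        declared = attlists.get(el.tag, {})
        for name, value in el.attrib.items():
            if name not in declared:
                errors.append(f"{el.tag}/@{name}: not declared")
                continue
            kind = declared[name][0]
            if kind == "ID":
                if value in ids:               # ID values must be unique in the document
                    errors.append(f"{el.tag}/@{name}: duplicate ID {value}")
                ids.add(value)
            elif kind in ("IDREF", "IDREFS"):
                refs += [(el.tag, name, v) for v in value.split()]
        for name, (_, constraint) in declared.items():
            if constraint == "#REQUIRED" and name not in el.attrib:
                errors.append(f"{el.tag}/@{name}: required but missing")
    # every IDREF value must occur as an ID somewhere in the document
    errors += [f"{t}/@{n}: no ID {v}" for t, n, v in refs if v not in ids]
    return errors

ok = ET.fromstring('<world><country code="F" capital="paris"/><city id="paris"/></world>')
broken = ET.fromstring('<world><country code="F" capital="lyon"/>'
                       '<city id="paris"/><city id="paris"/></world>')
print(attribute_errors(ok, ATTLISTS))
print(attribute_errors(broken, ATTLISTS))
```

IDREF targets are checked only after the whole document has been read, since an ID may be declared after the element that refers to it.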


81

DTDs: entity declarations

<!ENTITY entity-name content>

<!ENTITY entity-name SYSTEM "filename.xml">

(entity-name possibly preceded by a %)

• entities can reference each other, and they can be “localised”

• entities preceded by % are parameter entities, they can

– only be used within other declarations if these are in an external DTD,

– only be referenced inside the DTD (outside, % has no meaning)

– be over-written

• parameter entities are used to make DTDs maintainable and succinct, e.g.:

<!ENTITY % article "art_number, art_name, art_size">
<!ENTITY % aux-article "description | class">
<!ELEMENT storage (goods-in | goods-out)*>
<!ELEMENT goods-in (goods_in_number, (%article;), (%aux-article;))>
<!ELEMENT goods-out (goods_out_number, (%article;), (%aux-article;))>
<!ELEMENT art_number (#PCDATA)>
...

82

DTDs: entity declarations

<!ENTITY entity-name content>

<!ENTITY entity-name SYSTEM "filename.xml">

(entity-name possibly preceded by a %)

• entities can reference each other, and they can be “localised”

• entities preceded by % are parameter entities, they can

– only be used within other declarations if these are in an external DTD,
– only be referenced inside the DTD (outside, % has no meaning)
– be over-written

• Example:

<!ENTITY % pub "Éditions Gallimard">
<!ENTITY rights "All rights reserved">
<!ENTITY book "La Peste: Albert Camus,
© 1947 %pub;. &rights;">

• the replacement text of “&book;”:

La Peste: Albert Camus,
© 1947 Éditions Gallimard. &rights;

• the general entity reference "&rights;" is expanded when "&book;" appears in the document's content or an attribute value.
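The two stages of expansion in this example can be imitated with plain string substitution. The Python sketch below is only an illustration, not how an XML parser is implemented: `replacement_text` and `expand_general` are hypothetical helper names, entity names are simplified to ASCII-style names, and there is no handling of undefined or recursive entities or of the predefined entities such as &lt;.

```python
import re

NAME = r"[A-Za-z_][\w.\-]*"

def replacement_text(literal, parameter_entities):
    """Stage 1 (at declaration time): expand %name; references, keep &name; as is."""
    return re.sub("%(" + NAME + ");",
                  lambda m: parameter_entities[m.group(1)], literal)

def expand_general(text, general_entities):
    """Stage 2 (when referenced in content): expand &name; references recursively."""
    return re.sub("&(" + NAME + ");",
                  lambda m: expand_general(general_entities[m.group(1)], general_entities),
                  text)

parameter_entities = {"pub": "Éditions Gallimard"}
book = replacement_text("La Peste: Albert Camus, © 1947 %pub;. &rights;", parameter_entities)
general_entities = {"rights": "All rights reserved", "book": book}

print(book)
print(expand_general("&book;", general_entities))
```

The first printed line still contains &rights;, matching the replacement text shown above; only the second line, which imitates a reference to &book; in document content, has it expanded.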

83

DTDs: others

• there are a few other kinds of declarations possible in DTDs:

– notations

– conditional sections

• please refer to your favourite XML book for these
