
COMP60362 Advanced Database Technology: Semi-structured Data and XML

U. Sattler, University of Manchester

Organisational

• COMP60362 consists of 2 parts:
1. Semi-structured Data and XML (myself, this and next week)
2. Data Mining (Alvaro Fernandes, remainder)

• Prerequisites: good familiarity with databases and programming

• Teaching period: Mondays of the next 5 weeks

– with demonstrators present to ask during labs

• Coursework and Exercises: 10 days

• Assessment: 33% exam, 67% coursework

• http://www.cs.man.ac.uk/~sattler/teaching/cs636.html

• Please do not hesitate to ask if you have a question!


Literature

To obtain more detailed information, please refer to

• W3C documents at http://www.w3.org/TR/...

• S. Abiteboul, P. Buneman, and D. Suciu: Data on the Web. MorganKaufmann Publishers, 2000.

• E. R. Harold and W. S. Means: XML in a Nutshell. O’Reilly, 2004.

• or choose some of the various available web resources.


Outline of the first part of the course

1. Introduction to semi-structured data
2. XML: core concepts
3. DTDs, a simple schema language for XML documents
4. XPath, a navigation language for XML documents
5. XML namespaces: a concept ignored so far
6. XSLT, a transformation language for XML documents
7. DOM and SAX, programmatic interfaces for manipulating XML documents
8. XML Schema, a more expressive schema language for XML documents
9. XQuery, a query language for XML documents
10. Storing XML documents in RDBMSs

Data, documents, and the Web

The Web

• extremely rich information source (www.worldwidewebsize.com, November 2006: ~12,000,000,000 pages)

• mostly web pages (HTML), accessible via a URL

• HTML structures document/text:

– intra-document structure: lay-out and format

– inter-document structure: links to other web pages

• content of web pages is often accessible to humans only: text

• query mechanisms: keyword-based



Relational databases

• proven technology, currently storing/managing vast amounts of data

• separation between 3 levels:

– conceptual: ER diagrams

– logical: tables, and

– physical: implementation of tables, indices

• data is accessed via queries, mostly SQL queries

• integrity constraints play an important role

– e.g., (foreign) key constraints, not null, etc.

– preserving data integrity is important

• main issue: efficient implementation of query answering over large DBs



Why we need a bridge between DBs and the Web:

• a lot of information is available on the web by querying RDBMSs

– http://www.schoolswebdirectory.co.uk/postcode.php

• output is in HTML, possibly wrapped

• difficult to access output data with a program

• access is limited to the queries that are hard-coded by the system, e.g., query for schools by postcode

Data, documents, and the Web: bio data from SWISSPROT

Protein ACEK_BORBR on www.ebi.uniprot.org:

ID ACEK_BORBR Reviewed; 619 AA.
AC Q7WDP2;
DT 16-JAN-2004, integrated into UniProtKB/Swiss-Prot.
DT 01-OCT-2003, sequence version 1.
DT 31-OCT-2006, entry version 23.
DE Isocitrate dehydrogenase kinase/phosphatase (EC 2.7.11.5) (EC 3.1.3.-)
DE (IDH kinase/phosphatase) (IDHK/P).
GN Name=aceK; OrderedLocusNames=BB4946;
OS Bordetella bronchiseptica (Alcaligenes bronchisepticus).
OC Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales;
OC Alcaligenaceae; Bordetella.
OX NCBI_TaxID=518;
RN [1]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC STRAIN=RB50 / ATCC BAA-588 / NCTC 13252;
RX MEDLINE=22827954; PubMed=12910271; DOI=10.1038/ng1227;
RA Parkhill J., Sebaihia M., Preston A., Murphy L.D., Thomson N.R.,
RA Harris D.E., Holden M.T.G., Churcher C.M., Bentley S.D., Mungall K.L.,
RA Cerdeno-Tarraga A.-M., Temple L., James K.D., Harris B., Quail M.A.,
RA Achtman M., Atkin R., Baker S., Basham D., Bason N., Cherevach I.,
RA Chillingworth T., Collins M., Cronin A., Davis P., Doggett J.,
RA Feltwell T., Goble A., Hamlin N., Hauser H., Holroyd S., Jagels K.,
RA Leather S., Moule S., Norberczak H., O'Neil S., Ormond D., Price C.,
RA Rabbinowitsch E., Rutter S., Sanders M., Saunders D., Seeger K.,
RA Sharp S., Simmonds M., Skelton J., Squares R., Squares S., Stevens K.,
RA Unwin L., Whitehead S., Barrell B.G., Maskell D.J.;
RT "Comparative analysis of the genome sequences of Bordetella pertussis,
RT Bordetella parapertussis and Bordetella bronchiseptica.";
RL Nat. Genet. 35:32-40(2003).

Data, documents, and the Web: bio data from SWISSPROT

CC -!- FUNCTION: Bifunctional enzyme which can phosphorylate or
CC dephosphorylate isocitrate dehydrogenase (IDH) on a specific
CC serine residue. This is a regulatory mechanism which enables
CC bacteria to bypass the Krebs cycle via the glyoxylate shunt in
CC response to the source of carbon. When bacteria are grown on
CC glucose, IDH is fully active and unphosphorylated, but when grown
CC on acetate or ethanol, the activity of IDH declines drastically
CC concomitant with its phosphorylation (By similarity).
CC -!- CATALYTIC ACTIVITY: ATP + [isocitrate dehydrogenase (NADP(+))] =
CC ADP + [isocitrate dehydrogenase (NADP(+))] phosphate.
CC -!- SUBCELLULAR LOCATION: Cytoplasm (By similarity).
CC -!- SIMILARITY: Belongs to the aceK family.
CC -----------------------------------------------------------------------
CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC Distributed under the Creative Commons Attribution-NoDerivs License
CC -----------------------------------------------------------------------
DR EMBL; BX640452; CAE35310.1; -; Genomic_DNA.
DR GenomeReviews; BX470250_GR; BB4946.
DR KEGG; bbr:BB4946; -.
DR BioCyc; BBRO518:BB4946-MONOMER; -.
DR GO; GO:0005524; F:ATP binding; IEA:HAMAP.
DR GO; GO:0016788; F:hydrolase activity, acting on ester bonds; IEA:HAMAP.
DR GO; GO:0006097; P:glyoxylate cycle; IEA:HAMAP.
DR GO; GO:0006099; P:tricarboxylic acid cycle; IEA:HAMAP.
DR HAMAP; MF_00747; -; 1.
DR InterPro; IPR010452; AceK.
DR Pfam; PF06315; AceK; 1.
DR PIRSF; PIRSF000719; AceK; 1.
KW ATP-binding; Complete proteome; Glyoxylate bypass; Hydrolase; Kinase;
KW Multifunctional enzyme; Nucleotide-binding; Protein phosphatase;
KW Transferase; Tricarboxylic acid cycle.

Data, documents, and the Web: bio data from SWISSPROT

FT CHAIN 1 619 Isocitrate dehydrogenase
FT kinase/phosphatase.
FT /FTId=PRO_0000057894.
FT NP_BIND 354 360 ATP (By similarity).
FT ACT_SITE 409 409 By similarity.
FT BINDING 375 375 ATP (By similarity).
SQ SEQUENCE 619 AA; 70681 MW; F434CC157EFD9CB4 CRC64;
MIYSGDVQRI EPAPVAGPAP LDVAHLILAG FDRHYALFRY SAQRAKSLFE SGDWHGMQRL
SRERIEYYDM RVRECATQLD SALRGSDART ADGSRANGSA ALSEAQTAFW QAVKQEFVGL
LADHRQPECA ETFFNSVSCR ILHRDYFHND FLFVRPAIAT DYLDSRIPSY RVYYPVAEGL
HKSLIRMVAD FGLAVPYADL PRDARLLARA AVRQLRGQLP RHAGPRLASD CQIQVLGSLF
FRNTGAYIVG RLINQGTVYP FAVALRRNPA GQVCLDALLL GADDLSTLFS FTRAYFLVDM
ETPAAVVNFL ASLLPRKPKA ELYTMLGLQK QGKTLFYRDF LHHLTHSRDA FDIAPGIRGM
VMCVFTLPSY PYVFKLIKDR IDKDGMDHAT VRRKYQMVKL HDRVGRMADT WEYSQVALPR
SRFAPRLLEE LRRLVPSLIE ENGDTVVIRH VYIERRMMPL NLYLRHASDP LLEVAVREYG
DAIRQLATAN IFPGDMLYKN FGVTRLGRVV FYDYDEIQRM TEMNFRAIPP APNEEAELSS
EPWYAVGPND VFPEEFGRFL LGDPRVRQAF LRHHADLLAP QWWQACRARV AQGRIEEFFP
YDTDRRLHPQ AAPPPRTAA
//

...biologists need to integrate, share, query, analyse, and search this data!

XML as a bridge

XML

• is a format for the representation of semi-structured data (more on this later)

• is not designed to lay out documents

• alone will not solve the problem of efficiently querying web data: we might have to use RDBMS technology as well

Conventional databases and the Web

Conventional DB systems:

• client-server architecture

• queries issued to a server

• server processes queries:

– process/compile/optimise

– execute

[Figure: three clients connected to one server via a network]

Conventional databases and the Web

Data processing on the web (ideally):

• multi-tier

• all data sources translate data into common format: XML

• to share & integrate & combine data, common schema is used

• clients consume data

• servers provide data

• intermediate middleware to transform and integrate data

[Figure: multi-tier architecture with clients, layers of middleware, and servers]

Why XML as a bridge?

XML is

• designed to describe contents rather than presentation

• meant to be consumed by programs, not by humans

• suitable for exchanging data between applications/platforms:

– the way characters are encoded in an XML document is declared within the document itself through an encoding declaration, e.g., a Unicode encoding such as UTF-8

– additional constraints on the content of an XML document can be specified separately, e.g., in a Document Type Definition (DTD) or an XML Schema (also an XML document)

– DTDs and XML schemas can be placed locally or across the network, and can be found using the universal notation of URLs

The Basics First: Semi-structured data

Semi-structured data

• predates XML

• is an attempt to reconcile (Web) document view and (DB) strict structures

• is organised in semantic entities, where

• similar entities are grouped together

• entities in same group may not have same attributes

• order of attributes not necessarily important

• not all attributes may be required

• carries its own description

Example: {name: "Uli", tel: 56176, email: "[email protected]"}

This is a simple set of attribute-value pairs.

The Basics First: Semi-structured data

Example (ctd):

Values can in turn be structured:

{name: {first: "Uli", last: "Sattler"},
 tel: 56176,
 email: "[email protected]"}

And we can have several values for the same attribute:

{name: {first: "Uli", last: "Sattler"},
 tel: 56176,
 tel: 56182,
 email: "[email protected]"}
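The example above can be mirrored in code. The following is a minimal Python sketch, assuming we model a piece of SSD as a list of (label, value) pairs rather than a dictionary, so that a label such as tel may occur several times. The helper name `values_of` is hypothetical, and the email pair is left out.

```python
# A piece of SSD as a list of (label, value) pairs. A value is either
# atomic (str, int) or again a list of pairs (nested structure).
uli = [
    ("name", [("first", "Uli"), ("last", "Sattler")]),
    ("tel", 56176),
    ("tel", 56182),  # the same label may occur more than once
]


def values_of(ssd, label):
    """Return all values attached to `label`, in order of occurrence."""
    return [value for (lab, value) in ssd if lab == label]


print(values_of(uli, "tel"))
print(values_of(values_of(uli, "name")[0], "last"))
```

A plain dictionary would silently keep only one of the two tel entries, which is why the sketch uses a list of pairs.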


The Basics First: Semi-structured data (SSD)

Graphical representation as a tree:

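[Figure: tree whose root has outgoing edges labelled name, tel, tel, and email; the name edge leads to a node with edges first and last, ending in the leaves "Uli" and "Sattler"; the tel edges end in the leaves 56176 and 56182; the email edge ends in the leaf "[email protected]"]

{name: {first: "Uli", last: "Sattler"},
 tel: 56176,
 tel: 56182,
 email: "[email protected]"}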

The Basics First: Semi-structured data (SSD)

• In general, a piece of SSD can be represented as a graph

– leaf nodes standing for single data items

– edges labelled with attribute names


Semi-structured data: tuples

We can easily represent nested tuples as sets of attribute-value pairs:

{person: {name: "Uli", tel: 56176, email: "[email protected]"},
 person: {name: "Alvaro", tel: 56183, email: "[email protected]"},
 person: {name: "Leo", tel: 8488342, email: "[email protected]"}
}

Semi-structured data: tuples with variations

We can easily represent nested tuples as sets of attribute-value pairs even if they have missing or duplicated pairs:

{person: {name: {first: "Uli", last: "Sattler"}, tel: 56176, email: "[email protected]"},
 person: {name: "Alvaro", tel: 56183, tel: 783 4672, email: "[email protected]"},
 person: {name: "Leo", tel: 8488342, email: "[email protected]"}}

• serialization: converting SSD into a byte stream, for easy transmission

• self-describing: each data item (e.g., 56176) is annotated with its description (e.g., tel:); this is space-consuming, but enhances interoperability

• we will see later how to efficiently store SSD

SSD: representing relational data

Consider two relations:

R:
a   b   c
a1  b1  c1
a2  b2  c2

S:
c   d
c2  d2
c3  d3
c4  d4

and their tree representation:

[Figure: a root with edges R and S; below R there are two row edges, each leading to a node with edges a, b, c and the leaves (a1, b1, c1) and (a2, b2, c2); below S there are three row edges, each leading to a node with edges c, d and the leaves (c2, d2), (c3, d3), (c4, d4)]
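The relation-to-tree encoding can be sketched in a few lines. This is only an illustration, assuming we build the tree with Python's standard `xml.etree.ElementTree` module. The function name `relation_to_tree` and the root label `db` are hypothetical choices, not part of the slides.

```python
import xml.etree.ElementTree as ET


def relation_to_tree(name, attributes, rows):
    """Encode a relation as a tree: relation -> row -> attribute -> value."""
    relation = ET.Element(name)
    for row in rows:
        row_node = ET.SubElement(relation, "row")
        for attribute, value in zip(attributes, row):
            ET.SubElement(row_node, attribute).text = value
    return relation


db = ET.Element("db")  # hypothetical root that holds both relations
db.append(relation_to_tree("R", ["a", "b", "c"],
                           [("a1", "b1", "c1"), ("a2", "b2", "c2")]))
db.append(relation_to_tree("S", ["c", "d"],
                           [("c2", "d2"), ("c3", "d3"), ("c4", "d4")]))

print(ET.tostring(db, encoding="unicode"))
```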


SSD: representing object databases

• some DBMSs are object-oriented

• such data can be represented as SSD

Example:

{person: &o1 {name: "John",
              age: 47,
              relatives: {child: &o2, child: &o3}}
         &o2 {name: "Mary",
              age: 21,
              relatives: {father: &o1, sister: &o3}}
         &o3 {name: "Paula",
              age: 23,
              relatives: {father: &o1, sister: &o2}}}

SSD: representing object databases

• some DBMSs are object-oriented

• such data can be represented as SSD

• &o1, &o2,... are object identifiers

• objects can refer to each other

• the structure of the data is then no longer tree-like: references can form a graph with cycles
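To see why object identifiers turn the tree into a graph, here is a small Python sketch. It assumes we store each object under its identifier in a dictionary and keep references as identifier strings. The helper name `follow` is hypothetical.

```python
# Objects keyed by their identifier. References are stored as identifiers,
# so two objects can point at each other without infinite nesting.
objects = {
    "o1": {"name": "John", "age": 47,
           "relatives": [("child", "o2"), ("child", "o3")]},
    "o2": {"name": "Mary", "age": 21,
           "relatives": [("father", "o1"), ("sister", "o3")]},
    "o3": {"name": "Paula", "age": 23,
           "relatives": [("father", "o1"), ("sister", "o2")]},
}


def follow(oid, label):
    """Return the identifiers reachable from `oid` via edges labelled `label`."""
    return [target for (lab, target) in objects[oid]["relatives"] if lab == label]


children = follow("o1", "child")
print([objects[c]["name"] for c in children])
# Following child and then father leads back to o1: a cycle, not a tree.
print(follow(children[0], "father"))
```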


2. XML - eXtensible Markup Language


What is XML?

• XML is a specialization of SGML, similar to HTML

• XML has been a W3C standard since 1998, see http://www.w3.org/XML/

• was designed to be simple, generic, and extensible

• a "piece of XML"

– is called an XML document and contains structure and data

– can be associated with a tree

• an XML document is divided into smaller pieces called elements (associated with nodes in the tree):

– an XML document contains elements

– elements can contain elements

– with a non-ambiguous hierarchical structure amongst elements

• an XML document consists of

– some administrative information followed by

– an element containing all other elements

What is XML? (ctd)

General things about XML:

• elements are delimited by tags

• tags are enclosed in angle brackets, e.g., <panel>, </from>

• tags are case-sensitive, i.e., <FROM> is not the same as <from>

• we distinguish

– start tags: <...>, e.g., <panel>

– end tags: </...>, e.g., </from>

• like parentheses, pairs of matching start- and end tags delimit elements

• empty elements of the form <foo></foo> can be written as <foo/>

• attributes specify properties of an element, e.g., <cartoon copyright="United Feature Syndicate">

Example

<cartoon copyright="United Feature Syndicate" year="2000">
  <prolog>
    <series>Dilbert</series>
    <author>Scott Adams</author>
    <characters>
      <character>The Pointy-Haired Boss</character>
      <character>Dilbert</character>
    </characters>
  </prolog>
  <panels>
    <panel colour="none">
      <scene>Pointy-Haired Boss and Dilbert sitting at table.</scene>
      <bubbles>
        <bubble>
          <speaker>Dilbert</speaker>
          <speech>You haven't given me enough resources to do my project.</speech>
        </bubble>
      </bubbles>
    </panel>
    ......

What is XML? (ctd)

The administrative information of an XML document:

1. XML declaration, e.g., <?xml version="1.0" encoding="iso-8859-1"?>, identifies the

– XML version (1.0) and

– character encoding (iso-8859-1)

2. document type declaration, which references a grammar describing the document, called Document Type Definition (DTD)

– e.g., <!DOCTYPE cartoon SYSTEM "cartoon.dtd">

– a DTD constrains the structure, content & tags of a document

– can either be local or remote

3. after these 2 declarations, we find the root element, also called document element

Example

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE cartoon SYSTEM "cartoon.dtd">
<cartoon copyright='United Feature Syndicate' year='2000'>
  <prolog>
    <series>Dilbert</series>
    <author>Scott Adams</author>
    <characters>
      <character>The Pointy-Haired Boss</character>
      <character>Dilbert</character>
    </characters>
  </prolog>
  <panels>....</panels>
</cartoon>

The first two lines are the administrative information; the <cartoon> element is the root element.
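A program can read such a document with any XML parser. The sketch below assumes Python's standard `xml.etree.ElementTree` module and replaces the elided panels with an empty `<panels/>` element. The parser used here is non-validating and does not fetch cartoon.dtd.

```python
import xml.etree.ElementTree as ET

# Bytes are used so that the encoding declaration is honoured by the parser.
document = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE cartoon SYSTEM "cartoon.dtd">
<cartoon copyright='United Feature Syndicate' year='2000'>
  <prolog>
    <series>Dilbert</series>
    <author>Scott Adams</author>
    <characters>
      <character>The Pointy-Haired Boss</character>
      <character>Dilbert</character>
    </characters>
  </prolog>
  <panels/>
</cartoon>"""

root = ET.fromstring(document)           # the root (document) element
print(root.tag, root.get("year"))        # element name and one attribute
print([c.text for c in root.iter("character")])
```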


What is XML? (ctd)

• in XML, the set of tags is not fixed -- in HTML, the tag set is fixed!

• structures can be nested, to arbitrary depth

• XML itself is not a markup language, but we can specify markup languages with XML

– an XML document can contain or refer to its specification: !DOCTYPE

When is an XML document well-formed?

An XML document is well-formed if

1. tags, <, and > are correct
2. tags are properly nested
3. attributes are unique for each tag

This is a very weak notion of well-formedness: basically, it only ensures that we can parse a document into a tree

The following are not well-formed:

1. <equation< a + b</equation> and <equation> a < b</equation>
2. <panel><bubble>Hi there</panel>
3. <panel colour="none" colour="b&w">
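Since well-formedness just means "can be parsed into a tree", a parser is a ready-made checker. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is hypothetical. The third example is simplified to isolate the duplicated attribute (the value is changed to "bw" and the tag is closed).

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` can be parsed into a tree, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


candidates = [
    "<equation< a + b</equation>",               # broken tag
    "<equation> a < b</equation>",               # unescaped < in content
    "<panel><bubble>Hi there</panel>",           # tags not properly nested
    '<panel colour="none" colour="bw"/>',        # attribute occurs twice
    "<panel><bubble>Hi there</bubble></panel>",  # a well-formed variant
]

for candidate in candidates:
    print(is_well_formed(candidate), candidate)
```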

Further restricting the structure of XML docs

In certain applications, we want XML documents to have a certain structure

• e.g., for exchanging/managing cartoons, we want XML docs with

<cartoon copyright=STRING year=INTEGER>
  <prolog>
    OPTIONAL(<series>name-of-series</series>)
    ONE_OR_MORE(<author>author-name</author>)
    OPTIONAL(<characters>
      ONE_OR_MORE(<character>charactername</character>)
    </characters>)
  </prolog>
  <panels>
    ONE_OR_MORE(<panel colour=STRING>
      <scene>scene-description</scene>
      OPTIONAL(<bubbles>
        ONE_OR_MORE(<bubble>
          <speaker>speaker-name</speaker>
          <speech>bubble-text</speech>
        </bubble>)
      </bubbles>)
    </panel>)
  </panels>
</cartoon>


In applications, we want XML documents to have a certain structure

• with certain elements

• with certain nesting structure

• with certain attributes

This requires a

• document structure specification language or

• grammar or

• model or ...

For XML documents, numerous such formalisms have been developed. We discuss

• DTDs

• XML Schema


The structure can be exploited by query languages: we can answer queries related to content and structure

– which scenes does Dilbert occur in?

– does Dilbert swear?

– is there a scene with the pointy-haired boss alone?

– find me all cartoons with characters talking about databases!
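Two of these queries can be sketched by walking the element tree. This is only an illustration under stated assumptions: it uses Python's standard `xml.etree.ElementTree`, reads "Dilbert occurs in a scene" as "Dilbert is a speaker in that panel", and adds a second panel that is invented for the example. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

cartoon = ET.fromstring("""
<cartoon>
  <panels>
    <panel>
      <scene>Pointy-Haired Boss and Dilbert sitting at table.</scene>
      <bubbles>
        <bubble>
          <speaker>Dilbert</speaker>
          <speech>You haven't given me enough resources to do my project.</speech>
        </bubble>
      </bubbles>
    </panel>
    <panel>
      <scene>Pointy-Haired Boss alone at his desk.</scene>
    </panel>
  </panels>
</cartoon>""")


def scenes_with_speaker(root, name):
    """Scenes of all panels in which `name` appears as a speaker."""
    return [panel.findtext("scene")
            for panel in root.iter("panel")
            if any(s.text == name for s in panel.iter("speaker"))]


def scenes_without_bubbles(root):
    """Scenes of all panels that contain no <bubbles> element."""
    return [panel.findtext("scene")
            for panel in root.iter("panel")
            if panel.find("bubbles") is None]


print(scenes_with_speaker(cartoon, "Dilbert"))
print(scenes_without_bubbles(cartoon))
```

A keyword search could find the word "Dilbert", but it could not tell a speaker from a mention in a scene description; the element structure makes that distinction available.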

Contrast this with, e.g., Google keyword search

Next: briefly discuss DTDs and XML schema


Document Type Definitions: DTDs

• were built into the XML 1.0 specification

• DTDs can be inside or separate from XML documents

• a DTD is a collection of declarations

• an element declaration describes an element in terms of

– which elements it can contain

– in which order it can contain these elements

• an element declaration does not constrain the type of data inside elements

• an XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it

• e.g., <?xml version="1.0"?><greeting>Hello, world!</greeting> is well-formed, but not valid

XML Schema

• was developed later

• constrains the structure and data of XML documents

– e.g., inside <date-of-birth>...</date-of-birth> must be something of type date

More on this later...

A brief history of XML

• SGML (Standard Generalised Markup Language), 1985:

– flexible, expressive

– customised tags

• HTML (Hypertext Markup Language), early 1990s:

– application of SGML

– designed for presentation of documents

– single document type, presentation-oriented tags, e.g., <h1>...</h1>

– led to the web as we know it

• XML, 1998: first edition of XML 1.0 (now 4th edition)

– a W3C standard

– subset/fragment of SGML

– designed for the exchange/sharing of data

• XHTML is

– an application of XML

– almost a fragment of HTML

A rough map of a small part of the acronym world

[Figure: map relating SGML, HTML, XML, XHTML, DTD, XML Schema, and XSLT through arrows labelled "is an application of", "is basically a restriction of", "describes", and "transforms"]

How to view or edit XML?

• XML is not for human consumption

– in contrast to HTML, your browser won't help: you can only do a "view source" or

– first use XSLT (more on this later) to transform XML into HTML, then use your web browser to view it

• you can use your favourite editor, e.g., emacs in xml mode

• you can use an XML editor, e.g., XMLSpy, Stylus Studio, Oxygen, MyEclipse, and many more

XML and HTML

• XML is always case sensitive, i.e., "Hello" is different from "hello"

– HTML isn’t: it uses SGML's default "ignore case"

• in XML, all tags must be present

– in HTML, some "tag omission" may be permissible (e.g., <br>)

• in XML, we have a special way to write empty elements: <myname/>

– which can't be used in HTML

• in XML, all attribute values must be quoted, e.g., <name lang="eng">...

– in SGML (and therefore in HTML) this is only required if the value contains a space

• in XML, attribute names cannot be omitted

– in HTML they may be omitted using short tags

XML Core Concepts: Prologue -- XML declaration

<?xml param1 param2 ...?>

Each param_i is of the form

parameter-name="parameter-value"

Parameters for

• the XML version used within the document

• the character encoding

• whether the document is standalone or uses external declarations (see validity constraint for when standalone="yes" is required)

Example: <?xml version="1.0" encoding="US-ASCII" standalone="yes" ?>

An XML document should have an XML declaration (but does not need to)

More at http://www.w3.org/TR/REC-xml/

XML Core Concepts: Prologue -- doctype declaration

<!DOCTYPE element-name (PUBLIC "pub-id" "f-name.dtd" | SYSTEM "f-name.dtd")? [dt-declarations]>

• one such declaration, before root element

• element-name is the name of the root element of the document

• the optional dt-declarations is

– called internal subset

– a list of document type definitions

• the optional f-name.dtd refers to the external subset, also containing document type definitions

• e.g., <!DOCTYPE html PUBLIC "http://www.abc.org/dtds/html.dtd" "http://www.abc.org/dtds/html.dtd">

XML Core Concepts: entity declaration

<!ENTITY entity-name identifier-or-value>

• associates the entity-name with

– a value or

– a piece of XML outside the document, referred to by entity-name

• e.g., we can state <!ENTITY donau "Donaudampfschifffahrtskapitaen"> in the prologue and then refer to it by

<text>Then the &donau; entered the room.</text>

• e.g., we can state <!ENTITY chap SYSTEM "chap02.XML"> in the prologue and then refer to it in

<book><title>Short Book</title> &chap;</book>
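The first example can be tried out directly. The sketch below assumes Python's standard `xml.etree.ElementTree`, which expands internal entities declared in the internal subset of the document type declaration. The external entity &chap; is not covered here, since it would require the file chap02.XML.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal subset and referenced in the content.
document = """<!DOCTYPE text [
  <!ENTITY donau "Donaudampfschifffahrtskapitaen">
]>
<text>Then the &donau; entered the room.</text>"""

root = ET.fromstring(document)
print(root.text)  # the reference has been replaced by the entity's value
```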


XML Core Concepts: entities

• entities are placeholders in XML, declared in the DTD/prologue

• once declared, they can be referred to (several times) within a document

• to reference an entity called entity-name, we use &entity-name; or %entity-name;

• we can distinguish between

– [content] parsed vs. unparsed entities; the latter is indicated by NDATA
  • e.g., an image is unparsed (because we don't want it to be parsed/validated):
    <!ENTITY my-pic SYSTEM "../grafix/my-photo.gif" NDATA gif>

– [name] entity references (&...;), character references (&#...;), parameter entities (%...;)
  • entity references: e.g., the declared reference &chap; and pre-defined entities such as &amp;, &gt;, &lt;
  • character references: e.g., &#xE7; (hexadecimal) and &#231; (decimal), both denoting ç
  • parameter entities are used in DTDs (more later)

– [declaration] internal vs. external entities (external if declared with SYSTEM or PUBLIC)
  • e.g., &chap; is external; my-pic, declared with
    <!ENTITY my-pic SYSTEM "../grafix/my-photo.gif" NDATA gif>
    is external and unparsed
  • internal entities are parsed entities, e.g., <!ENTITY title "My title"> is internal

XML Core Concepts: elements (the main concept)

<element-name attr-decl1 ... attr-decln>
element-content
</element-name>

• an arbitrary number of attributes is allowed

• but each attribute occurs at most once in one element

• each attr-decl_i is of the form attr-name="attr-value"

• the element-content can contain

– text and/or

– one or more other elements

– content consisting only of elements is called element content; content that includes text is called mixed content

• an empty element can be abbreviated as <element-name attr-decl1 ... attr-decln/>

XML Core Concepts: comments

<!-- comment-text -->

• can go in every part of an XML document except

– inside tags

– before the XML declaration

• the ending must be -->

– ending in ---> is not allowed

• they can be used to "turn off" a part of an XML document

XML Core Concepts: CDATA sections

<![CDATA[ whatever-you-want ]]>

• CDATA stands for character data, i.e., not markup

• the only thing that cannot go into a CDATA section is the ending ]]>

• useful when we want to use special characters such as &, <, etc.

• e.g., <text>You can type <![CDATA[ if (&x < &y )]]> and this is alright. </text>

• left angle brackets and ampersands may occur in their literal form; they need not and cannot be escaped using "&lt;" and "&amp;"
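A parser hands the content of a CDATA section to the application as ordinary character data. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the second document shows the same characters written with escapes instead.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<text>You can type <![CDATA[ if (&x < &y )]]> and this is alright. </text>"
)
with_escapes = ET.fromstring("<text>if (&amp;x &lt; &amp;y )</text>")

print(with_cdata.text)    # & and < arrive literally, the CDATA markers are gone
print(with_escapes.text)  # same characters, written with &amp; and &lt;
```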


XML: the information set

• the information set, or "infoset", is a W3C description

• the infoset of an XML document is an abstract data structure that explicates

– the information contained in the XML document and

– how to refer to the bits and pieces of an XML document

• it associates every

– well-formed (!) XML document that conforms to the namespaces recommendation

– with a tree-like data structure over 11 kinds of information items
  • document information item
  • element information item
  • attribute information item
  • etc.

• more on http://www.w3.org/TR/xml-infoset/

[Figure: tree of nodes for a small document]

Document: nodeType = DOCUMENT_NODE, nodeName = #document, nodeValue = (null)
– Element: nodeType = ELEMENT_NODE, nodeName = mytext, nodeValue = (null); links: firstchild, lastchild, attributes
  – Attribute: nodeType = ATTRIBUTE_NODE, nodeName = content, nodeValue = medium
  – Element (firstchild): nodeType = ELEMENT_NODE, nodeName = title, nodeValue = (null)
    – Text: nodeType = TEXT_NODE, nodeName = #text, nodeValue = Hallo!
  – Element (lastchild): nodeType = ELEMENT_NODE, nodeName = content, nodeValue = (null)
    – Text: nodeType = TEXT_NODE, nodeName = #text, nodeValue = Bye!
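The node properties listed above (nodeType, nodeName, nodeValue) are those exposed by DOM implementations. The sketch below assumes Python's standard `xml.dom.minidom` and a small document reconstructed to be consistent with the node listing; the exact document behind the figure is an assumption, and the function name `describe` is hypothetical.

```python
from xml.dom import minidom

document = minidom.parseString(
    '<mytext content="medium"><title>Hallo!</title><content>Bye!</content></mytext>'
)


def describe(node, depth=0):
    """Return one line per node: name, numeric node type, and value."""
    lines = ["  " * depth + f"{node.nodeName} type={node.nodeType} value={node.nodeValue!r}"]
    if node.nodeType == node.ELEMENT_NODE:
        # Attributes hang off the element, but are not among its children.
        for i in range(node.attributes.length):
            attr = node.attributes.item(i)
            lines.append("  " * (depth + 1) + f"@{attr.nodeName} value={attr.nodeValue!r}")
    for child in node.childNodes:
        lines.extend(describe(child, depth + 1))
    return lines


print("\n".join(describe(document)))
```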


3. DTDs: Document Type Definitions


Why do we need a schema formalism for XML?

Benefits of a schema:

• it communicates the structure of an XML document, e.g. to a computer

• it can function as a publishable, shareable specification

– e.g., we can agree on a schema to share protein data in

• it helps catch high-level mistakes: being well-formed is not enough

• it allows one to implement/use tools that check validity in a portable and efficient way (the alternative would be to write dedicated programs/parsers...)

• schemata are extensible and re-usable

Drawbacks of a schema:

• it requires additional work


Example: bio data from SWISSPROT data in XML

<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
  <entry dataset="Swiss-Prot" created="2004-01-16" modified="2006-10-31" version="23">
    <accession>Q7WDP2</accession>
    <name>ACEK_BORBR</name>
    <protein>
      <name ref="1 2">Isocitrate dehydrogenase kinase/phosphatase</name>
      <name>IDH kinase/phosphatase</name>
      <name>IDHK/P</name>
    </protein>
    <gene>
      <name type="primary">aceK</name>
      <name type="ordered locus">BB4946</name>
    </gene>
    <organism key="3">
      <name type="scientific">Bordetella bronchiseptica</name>
      <name type="synonym">Alcaligenes bronchisepticus</name>
      <dbReference type="NCBI Taxonomy" id="518" key="4"/>
      <lineage>
        <taxon>Bacteria</taxon>
        <taxon>Proteobacteria</taxon>
        <taxon>Betaproteobacteria</taxon>
        <taxon>Burkholderiales</taxon>
        <taxon>Alcaligenaceae</taxon>
        <taxon>Bordetella</taxon>
      </lineage>
    </organism>

Example: bio data from SWISSPROT data in XML (ctd)

    <reference key="5">
      <citation type="journal article" date="2003" name="Nat. Genet." volume="35" first="32" last="40">
        <title>Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica.</title>
        <authorList>
          <person name="Parkhill J."/>
          ...
          <person name="Maskell D.J."/>
        </authorList>
        <dbReference type="PubMed" id="12910271" key="6"/>
        <dbReference type="MEDLINE" id="22827954" key="7"/>
        <dbReference type="DOI" id="10.1038/ng1227" key="8"/>
      </citation>
      <scope>NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].</scope>
      <source>
        <strain>RB50 / ATCC BAA-588 / NCTC 13252</strain>
      </source>
    </reference>
    <comment type="function" status="By similarity">
      <text>Bifunctional enzyme which can phosphorylate or dephosphorylate isocitrate dehydrogenase (IDH) on a specific serine residue. This is a regulatory mechanism which enables bacteria to bypass the Krebs cycle via the glyoxylate shunt in response to the source of carbon. When bacteria are grown on glucose, IDH is fully active and unphosphorylated, but when grown on acetate or ethanol, the activity of IDH declines drastically concomitant with its phosphorylation</text>
    </comment>
    <comment type="catalytic activity">
      <text>ATP + [isocitrate dehydrogenase (NADP(+))] = ADP + [isocitrate dehydrogenase (NADP(+))] phosphate</text>
    </comment>
    <comment type="subcellular location" status="By similarity">
      <text>Cytoplasm</text>
    </comment>
    ....

Why do we need a schema for XML?

So, using a schema for SWISSPROT we can

• publish it so that everybody knows about the structure of our data

• validate data from other sources before it is added to SWISSPROT

• build tools around this schema to

– analyse,

– search, and

– transform data

Also, other XML tools (for querying/transforming, etc.) can make use of this schema, e.g., for query optimisation

Schemata for XML

Interestingly, there are/were various different schema languages for XML:

• DTDs

• XML Schema

• SOX (Schema for Object-Oriented XML)

• RELAX NG (Regular Language Description for XML New Generation)

• Schematron

• etc.


DTDs

• a DTD is a collection of declarations that specify

– which elements are allowed (remember: an element includes/comes with its tags!)

– how elements can be nested, i.e., what sub-elements an element can have, incl. their
  • type
  • order
  • number

– what attributes an element can have, incl. their
  • attribute name
  • type
  • number: whether it is compulsory or optional (obviously at most 1)

• DTDs are inspired by SGML

– good for describing documents

– not so good for describing data

– simple and easy to understand and validate

• DTDs are standardized in the XML recommendations

Data Types in DTDs

• for text content: PCDATA

• for attribute values:

– CDATA: a simple string

– NMTOKEN: a string without blanks

– NMTOKENS: list of such strings separated by blanks

– ID: like NMTOKEN, but unique value in document

– IDREF: like NMTOKEN, but each value must also occur as ID in the document

– IDREFS: list of IDREF values separated by blanks

• for element types: regular expressions denoting structure

DTD: element type declarations

<!ELEMENT element-name content-model>

The content-model can be of one of the following forms:

• EMPTY: empty element

• ANY: anything

• (#PCDATA): text only

• (expression), where expression is built over parentheses, other element names, and #PCDATA using the following operators:

– , : concatenation or sequence

– | : choice (exclusive or)

– * : Kleene star or repetition (repeat as often as you like, but finitely often)

– + : repetition (as *, but repeat at least once)

– ? : optional

• each element can be declared at most once!


<!ELEMENT element-name content-model>

If we declare an element with

• element content, all its children are elements, hence its content-model may not contain #PCDATA

– e.g., <!ELEMENT spec (front,body,back?)>

• mixed content, its children can be text and/or elements (or pure text), and its content-model must be of the form

– '(' '#PCDATA' ('|' Name)* ')*'

– i.e., it may not contain "," and

– no element name may occur more than once

– e.g., <!ELEMENT p (#PCDATA|a|ul|b|i|em)*>


• please note that the root element is always declared in a DTD since it is the name of the DTD

• for a document to be valid w.r.t. a DTD, each of its objects (elements, attributes, etc.) must be valid

• an element e-name is valid w.r.t. a DTD if it is declared in the DTD and, if e-name's content is declared as

– EMPTY, then any element whose name is e-name must have no content (not even whitespace)

– ANY, then it can only contain PCDATA or valid elements (and therefore only ones that are declared in the DTD)

– (expression), then its content must match this expression

• an attribute is valid w.r.t. a DTD if it is declared in the DTD and if it conforms to its declaration (see later)

DTD: structure of element content - examples

• <!ELEMENT myEmpty EMPTY>

• <!ELEMENT country (name, population, ethnicgroup*, religion*, border*, (province+|city+))>

• <!ELEMENT section (header, paragraph, (paragraph | image | figure | subsection)*, bibliography?)>
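Because a content model for element content is a regular expression over element names, checking a sequence of children against it can be sketched with an ordinary regex engine. The following is a rough illustration, not how a validating parser is implemented: it assumes element content only (no #PCDATA, EMPTY, or ANY), and the function names `content_model_to_regex` and `matches` are hypothetical.

```python
import re


def content_model_to_regex(model):
    """Translate a DTD content model for element content into a Python regex.

    Each element name becomes a group that matches the name followed by one
    space; "," (sequence) becomes plain concatenation; | * + ? keep their meaning.
    """
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.\-]*|[(),|*+?]", model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # sequence is concatenation in a regex
        elif token in ")|*+?":
            parts.append(token)
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def matches(model, children):
    """Check whether the list of child element names matches the content model."""
    encoded = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(encoded) is not None


section = "(header, paragraph, (paragraph | image | figure | subsection)*, bibliography?)"
print(matches(section, ["header", "paragraph", "image", "paragraph"]))
print(matches(section, ["paragraph", "header"]))
```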


DTD: structure of element content -- determinism

• as we will see soon, it is required that the content-model in element type declarations for element content be deterministic

• (for mixed content, it is deterministic by definition)

• e.g., the content model ((b, c) | (b, d)) is non-deterministic: on reading b, a parser cannot tell which of the two occurrences of b it has matched without looking ahead; the equivalent model (b, (c | d)) is deterministic

From the XML 1.0 specification:

"The content of an element matches a content model if and only if it is possible to trace out a path through the content model, obeying the sequence, choice, and repetition operators and matching each element in the content against an element type in the content model. For compatibility, it is an error if an element in the document can match more than one occurrence of an element type in the content model. For more information, see E Deterministic Content Models."

• "for compatibility" is "for backward compatibility with SGML"

DTDs and Grammars

• element declarations specify the "legal" structure of an element's content

– so far, we haven't said what it means for an element to "match" an expression in an element declaration

• we can view element declarations as

– a grammar and parse an XML document

• but how to decide the following question: given an XML document and a DTD, does the document conform to the DTD?

• for this, we need to know what a grammar is

– more precisely, what an extended context-free grammar is

Extended context-free grammars

An extended context-free grammar (ECFG) G is of the form (N,T,P,S) where

• N is a finite set of non-terminal symbols

• T is a finite set of terminal symbols

• T ∩ N = {}

• P is a set of productions, i.e., a set of expressions of the form

– A → E for A ∈ N and E a regular expression over T and N

• a regular expression over some alphabet is built using (, ), ·, +, *

• S ∈ N is the start symbol

64


For example G = (N, T, P, paper) with
• N = {paper, intro, section, unstruct}
• T = {DATA}
• P = {paper → intro · section · section* + unstruct,
       intro → DATA,
       section → DATA,
       unstruct → DATA}

is an extended context-free grammar, and it is “close” to

<!DOCTYPE paper [
  <!ELEMENT paper ((intro,section+) | unstruct)>
  <!ELEMENT intro (#PCDATA)>
  <!ELEMENT section (#PCDATA)>
  <!ELEMENT unstruct (#PCDATA)>
]>


65


Given an ECFG G and two strings u and v, we say that
• G directly derives v from u if
  – u = w₁ A w₂,
  – v = w₁ w w₂, and
  – there is some A → E ∈ P such that w ∈ L(E),
  where L(E) is the set of strings denoted by the regular expression E
• G derives v from u if
  – there is a sequence of strings u₁, ... , uₙ such that
  – u = u₁,
  – v = uₙ, and
  – G directly derives uᵢ₊₁ from uᵢ for each 1 ≤ i ≤ n-1
• the language L(G) of G = (N,T,P,S) is the set of all strings w over T such that G derives w from S

66

Extended context free grammars -- example continued

For example G = (N, T, P, paper) with
• N = {paper, intro, section, unstruct}
• T = {DATA}
• P = {paper → intro · section · section* + unstruct,
       intro → DATA,
       section → DATA,
       unstruct → DATA}

• G directly derives DATA · section · section · section from intro · section · section · section

• G derives DATA · DATA · DATA · DATA from intro · section · section · section

• DATA · DATA · DATA · DATA ∈ L(G)
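The two notions of derivation can be tried out on this example. The Python lines below are a minimal sketch under some assumptions: strings are lists of symbols, each production is written by hand as a Python regex over space-terminated symbols, and unstruct is treated as a non-terminal because it has a production. The names directly_derives and derives are made up for this illustration.

```python
import re

NONTERMINALS = {"paper", "intro", "section", "unstruct"}
# Right-hand sides as regexes over space-terminated symbols ("intro " etc.).
PRODUCTIONS = {
    "paper": r"(?:intro section (?:section )*|unstruct )",
    "intro": r"DATA ",
    "section": r"DATA ",
    "unstruct": r"DATA ",
}

def in_regex_language(regex, symbols):
    """True if the symbol sequence is in the language of the regex."""
    return re.fullmatch(regex, "".join(s + " " for s in symbols)) is not None

def directly_derives(u, v):
    """True if v is u with one non-terminal A replaced by some w in L(E), A -> E."""
    w_len = len(v) - (len(u) - 1)             # length that w must have
    if w_len < 0:
        return False
    for i, a in enumerate(u):
        if a not in NONTERMINALS:
            continue
        w = v[i:i + w_len]
        if (u[:i] == v[:i] and u[i + 1:] == v[i + w_len:]
                and in_regex_language(PRODUCTIONS[a], w)):
            return True
    return False

def derives(steps):
    """Check a sequence u1, ..., un in which each step is a direct derivation."""
    return all(directly_derives(steps[k], steps[k + 1]) for k in range(len(steps) - 1))

chain = [
    ["paper"],
    ["intro", "section", "section", "section"],
    ["DATA", "section", "section", "section"],
    ["DATA", "DATA", "section", "section"],
    ["DATA", "DATA", "DATA", "section"],
    ["DATA", "DATA", "DATA", "DATA"],
]
```

Here derives(chain) holds, which witnesses that DATA · DATA · DATA · DATA is in L(G); the function only checks a given sequence of steps and does not search for one.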

67

ECFGs and content models in DTDs

• as we have just seen, we can

– associate a DTD with an ECFG

– by associating

• each of its element declarations with a production rule and thus

• each content model of an element declaration with a regular expression

68

XML documents, DTDs, and ECFGs

• A valid XML document is

– well-formed and

– conforms to its DTD

• a derivation tree of an ECFG is a node-labelled tree

• an XML document can be seen as a node-labelled tree

– with element names as labels

– since we are mainly interested in declarations for element content, we ignore text

• [docs conforming to DTDs] an XML document conforms to a DTD if its tree is a derivation tree of the ECFG associated with the DTD

• [DTDs being valid] a content model in an element content declaration is valid if its corresponding regular expression is 1-unambiguous

• [can we find a valid DTD] can we “repair” an invalid element content declaration? I.e., can we turn an invalid content declaration into a valid one (for the same structure)?
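The first question, whether a document tree is a derivation tree, can be sketched in Python for the paper DTD used earlier. This is an illustration under stated assumptions, not a validating parser: the content models are transcribed by hand into regexes over child names, text is ignored as on this slide, and the function names conforms and valid are hypothetical. Only xml.etree.ElementTree from the standard library is used, which checks well-formedness but not validity.

```python
import re
import xml.etree.ElementTree as ET

# One regex per declared element, over space-terminated child names;
# None stands for (#PCDATA), i.e. no child elements are allowed.
CONTENT = {
    "paper": r"(?:intro (?:section )+|unstruct )",
    "intro": None,
    "section": None,
    "unstruct": None,
}

def conforms(elem):
    """Check that the tree rooted at elem is a derivation tree for CONTENT."""
    if elem.tag not in CONTENT:
        return False                          # undeclared element
    children = list(elem)
    model = CONTENT[elem.tag]
    if model is None:
        return not children
    word = "".join(child.tag + " " for child in children)
    return re.fullmatch(model, word) is not None and all(conforms(c) for c in children)

def valid(xml_text, root_name="paper"):
    """Well-formed (ET raises ParseError otherwise) and conforming to CONTENT."""
    root = ET.fromstring(xml_text)
    return root.tag == root_name and conforms(root)
```

For instance, valid("<paper><intro>x</intro><section>y</section></paper>") holds, while a paper without an intro is rejected.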


69

Ambiguity -- an example

• Consider the expression E = (a + b)* · a · a*
• mark the ith occurrence in E of a letter x as xᵢ: E = (a₁ + b₁)* · a₂ · a₃*
• clearly, aaa ∈ L(E), but there are three derivations or witnesses:
  1. a₁a₁a₂
  2. a₁a₂a₃
  3. a₂a₃a₃
• hence we call E ambiguous: given E, a parser would need to “look ahead” or guess when parsing aaa
• but there is F = (a + b)* · a such that
  1. F and E are equivalent, i.e., L(E) = L(F), and
  2. F is unambiguous, i.e., for each w ∈ L(F), there is exactly 1 witness
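The witnesses can be enumerated mechanically. In the Python sketch below, each marked letter is encoded as a single character so that Python's re module can match the marked expression; this encoding (p, q, r, s) and the function name witnesses are choices made for the illustration only.

```python
import re
from itertools import product

# Marked letters encoded as single characters:
# a1 -> 'p', b1 -> 'q', a2 -> 'r', a3 -> 's'
MARK_NAMES = {"p": "a1", "q": "b1", "r": "a2", "s": "a3"}
UNMARK = {"p": "a", "q": "b", "r": "a", "s": "a"}

E_MARKED = re.compile(r"(?:p|q)*rs*")   # (a1 + b1)* · a2 · a3*
F_MARKED = re.compile(r"(?:p|q)*r")     # (a1 + b1)* · a2

def witnesses(marked_expr, word):
    """All marked words in the marked language that give `word` once subscripts are dropped."""
    options = [[m for m, x in UNMARK.items() if x == letter] for letter in word]
    return [
        [MARK_NAMES[m] for m in candidate]
        for candidate in product(*options)
        if marked_expr.fullmatch("".join(candidate))
    ]
```

witnesses(E_MARKED, "aaa") yields the three witnesses listed above, whereas witnesses(F_MARKED, "aaa") yields only a1 a1 a2.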

70

1-Ambiguity -- example

• Consider the expression F = (a + b)* · a
• mark the ith occurrence of a letter x as xᵢ: F = (a₁ + b₁)* · a₂
• clearly, baa ∈ L(F), but consider its marking from left to right:
  1. clearly, we start with b₁aa
  2. but how to proceed? We have two choices:
     • b₁a₁a and
     • b₁a₂a
  – to decide which to take, we need to “look ahead”, see that there is a second a and thus mark b₁a₁a₂
• hence we say that F is not 1-unambiguous
• but it is still unambiguous

71

1-Ambiguity -- preparation of formal definition

• in our examples, we marked letters in regular expressions with subscripts

• for E a regular expression, we use E’ for the marked version of E: E’ is obtained from E by marking each letter with a subscript such that

– each subscripted letter occurs at most once in E’

• if E is a marked regular expression, we use ‘E for the expression obtained through dropping of all subscripts

• clearly, if F = E’, then ‘F = E

• we can extend marking and dropping to strings, alphabets and languages:

– w’ is a marked version of the string w

– Σ’ is the set of all marked letters in Σ

– ‘L is the set of words ‘w with w ∈ L

– etc.

72

1-Ambiguity -- formal definition

• let E be a regular expression over Σ
• E is 1-unambiguous if for every u,v,w ∈ (Σ’)* and every x,y ∈ Σ’, we have that
  – if both uxv and uyw ∈ L(E’) and x ≠ y, then ‘x ≠ ‘y
  – e.g., not both ua₃v and ua₄w ∈ L(E’)
• a language L is 1-unambiguous if there exists a 1-unambiguous regular expression E such that L = L(E)

? how to test E for 1-unambiguity?
? how to test an XML document for validity?

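• not every regular expression is 1-unambiguous: remember F = (a + b)* · a
• not every regular language L is 1-unambiguous either: e.g., there is no 1-unambiguous expression for L((a + b)* · a · (a + b))

? how to test L for 1-unambiguity?
? how to test whether we can make an XML document valid?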


73

1-Ambiguity -- a consequence

Lemma: if E is 1-unambiguous, then for every word w ∈ L(E), there is at most one word v ∈ L(E’) such that ‘v = w.

• let’s introduce some notation for E over Σ:
  – first(E) = {a ∈ Σ | there is some aw ∈ L(E)}
  – last(E) = {a ∈ Σ | there is some wa ∈ L(E)}
  – follow(E,a) = {b ∈ Σ | there is some wabv ∈ L(E)}

Lemma: let E’ be a marked regular expression and x₁ ... xₙ ∈ (Σ’)*. Then x₁ ... xₙ ∈ L(E’) if and only if
  – x₁ ∈ first(E’),
  – xₙ ∈ last(E’), and
  – for all i with 1 ≤ i < n: xᵢ₊₁ ∈ follow(E’,xᵢ)

• In this lemma, marking is essential: consider E = aa and the word aaa
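To see why marking matters, the criterion of the lemma can be applied once to the unmarked sets of E = a · a and once to the marked sets of E’ = a1 · a2. The Python sketch below writes these small sets out by hand from the definitions; the function name accepted_by_sets is made up for this illustration.

```python
def accepted_by_sets(word, first, last, follow):
    """The criterion of the lemma: only first, last and follow are consulted."""
    if not word:
        return False
    return (
        word[0] in first
        and word[-1] in last
        and all(word[i + 1] in follow[word[i]] for i in range(len(word) - 1))
    )

# Unmarked E = a · a: the sets cannot tell the two occurrences of a apart.
unmarked = dict(first={"a"}, last={"a"}, follow={"a": {"a"}})
# Marked E' = a1 · a2.
marked = dict(first={"a1"}, last={"a2"}, follow={"a1": {"a2"}, "a2": set()})

# aaa is not in L(E), yet the unmarked sets accept it.
wrongly_accepted = accepted_by_sets(["a", "a", "a"], **unmarked)
```

With the marked sets, a1 a2 is accepted and no marking of aaa passes the test, because follow(E’, a2) is empty.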

74

1-Ambiguity -- more consequences

• let’s keep in mind some notation for E over Σ:
  – first(E) = {a ∈ Σ | there is some aw ∈ L(E)}
  – last(E) = {a ∈ Σ | there is some wa ∈ L(E)}
  – follow(E,a) = {b ∈ Σ | there is some wabv ∈ L(E)}

Lemma: A regular expression E is 1-unambiguous if and only if
  – for every x,y ∈ first(E’), if x ≠ y, then ‘x ≠ ‘y, and
  – for every z ∈ Σ’ and every x,y ∈ follow(E’,z), if x ≠ y, then ‘x ≠ ‘y

• the following important fact is proven in One-Unambiguous Regular Languages (1997), Anne Brüggemann-Klein, Derick Wood:

Theorem: Given a regular expression E, we can decide in time polynomial in |E| whether L(E) is 1-unambiguous.

• this is important since it allows us to answer whether we can make an XML document that is invalid due to non-deterministic content models valid
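The lemma gives a direct test for an expression (not for a language, which is what the theorem addresses and which is not attempted here). The Python lines below are a sketch under assumptions: expressions are nested tuples, marked letters are pairs (letter, position), and first and follow are computed by the usual structural recursion; all names are hypothetical.

```python
from itertools import count

def analyse(expr):
    """Compute first(E') and follow(E', x) for the marked version E' of expr.

    expr is a nested tuple: ("sym", a), ("cat", e1, e2), ("alt", e1, e2)
    or ("star", e). Marked letters are pairs (letter, position number).
    """
    ids = count(1)
    follow = {}

    def walk(e):
        # returns (nullable, first, last) of the sub-expression e
        kind = e[0]
        if kind == "sym":
            x = (e[1], next(ids))
            follow[x] = set()
            return False, {x}, {x}
        if kind == "star":
            _, first, last = walk(e[1])
            for x in last:
                follow[x] |= first            # the body may be repeated
            return True, first, last
        n1, f1, l1 = walk(e[1])
        n2, f2, l2 = walk(e[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for x in l1:                          # kind == "cat"
            follow[x] |= f2
        return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())

    _, first, _ = walk(expr)
    return first, follow

def clash(marked_letters):
    """True if two different marked letters carry the same underlying letter."""
    letters = [letter for letter, _ in marked_letters]
    return len(letters) != len(set(letters))

def is_one_unambiguous(expr):
    first, follow = analyse(expr)
    return not clash(first) and not any(clash(s) for s in follow.values())

a, b, c, d = ("sym", "a"), ("sym", "b"), ("sym", "c"), ("sym", "d")
F = ("cat", ("star", ("alt", a, b)), a)                                  # (a + b)* · a
G = ("cat", ("cat", ("star", b), a), ("star", ("cat", ("star", b), a)))  # b* · a · (b* · a)*
```

is_one_unambiguous(F) is False, in line with the earlier example, while the expression G, which also describes the words over a and b that end in a, passes the test.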

75

DTDs: attribute definitions

• remember: an element (of a well-formed XML document) cannot have more than 1 attribute-value pair with the same attribute name!

• DTDs can specify attribute names and their types:

<!ATTLIST element-name
    attr-name₁ attr-type₁ attr-constr₁
    ...        ...        ...
    attr-nameₖ attr-typeₖ attr-constrₖ>

• attr-nameᵢ for names of attributes
• attr-typeᵢ can be one of CDATA, NMTOKEN, NMTOKENS, ID, IDREF, IDREFS (see above) or (c₁ | ... | cₙ)
• attr-constrᵢ see next slide

76

DTDs: attribute definitions - attribute constraints

DTDs allow us to express the following attribute constraints:

• #REQUIRED: element must have one attribute-value pair for this attribute

• #IMPLIED: element may have an attribute-value pair for this attribute

• “default-value”: we can specify a default value for this attribute

• #FIXED “fixed-value”: each element has the same value fixed-value

– and I don’t need to mention this explicitly


77

DTDs: attribute definitions - attribute constraints - examples

<!ATTLIST country
    code        ID       #REQUIRED
    capital     IDREF    #REQUIRED
    memberships IDREF    #IMPLIED
    products    NMTOKENS #IMPLIED>

<!ATTLIST desert
    id      ID              #REQUIRED
    type    (sand|rock|ice) 'sand'
    climate NMTOKEN         #FIXED 'dry'>
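The effect of the constraints on a single element can be sketched in Python. The table below is a hand-written copy of the desert declarations, and check_attributes is a hypothetical helper; it covers #REQUIRED, defaults, #FIXED and enumerated types only, and does not check ID uniqueness, IDREF targets or NMTOKEN syntax.

```python
# attribute name -> (type, constraint, default or fixed value)
DESERT_ATTRS = {
    "id": ("ID", "#REQUIRED", None),
    "type": (("sand", "rock", "ice"), "default", "sand"),
    "climate": ("NMTOKEN", "#FIXED", "dry"),
}

def check_attributes(attrs, decls):
    """Return (errors, effective attributes after applying DTD-supplied values)."""
    errors = []
    effective = dict(attrs)
    for name in attrs:
        if name not in decls:
            errors.append(f"undeclared attribute {name}")
    for name, (atype, constr, value) in decls.items():
        if name not in attrs:
            if constr == "#REQUIRED":
                errors.append(f"missing required attribute {name}")
            elif constr in ("default", "#FIXED"):
                effective[name] = value       # value supplied by the DTD
            continue
        if constr == "#FIXED" and attrs[name] != value:
            errors.append(f"{name} must have the fixed value {value}")
        if isinstance(atype, tuple) and attrs[name] not in atype:
            errors.append(f"{name} must be one of {atype}")
    return errors, effective
```

For an element with only id="d1", no errors are reported and the effective attributes also contain type="sand" and climate="dry".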

78

DTDs: entity declarations

<!ENTITY entity-name content>
<!ENTITY entity-name SYSTEM "filename.xml">
(entity-name possibly preceded by a %)

• entities can reference each other, and they can be “localised”

• entities preceded by % are parameter entities, they can
  – only be used within other declarations in external DTDs,
  – only be referenced inside the DTD (outside, % has no meaning)
  – be over-written

• parameter entities are used to make DTDs maintainable and succinct, e.g.:

<!ENTITY % article "art_number, art_name, art_size">
<!ENTITY % aux-article "description | class">
<!ELEMENT storage (goods-in | goods-out)*>
<!ELEMENT goods-in (goods_in_number, (%article;), (%aux-article;))>
<!ELEMENT goods-out (goods_out_number, (%article;), (%aux-article;))>
<!ELEMENT art_number (#PCDATA)>
...
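What a declaration looks like once its parameter entity references are replaced can be sketched with a textual substitution in Python. This is a simplification and not how an XML processor is specified to work (for example, a processor pads the replacement text with spaces); the dictionary and the function expand are assumptions of this sketch, using the declared names article and aux-article.

```python
import re

PARAMETER_ENTITIES = {
    "article": "art_number, art_name, art_size",
    "aux-article": "description | class",
}

def expand(declaration, entities=PARAMETER_ENTITIES):
    """Replace each %name; by the replacement text of that parameter entity."""
    return re.sub(r"%([\w.-]+);", lambda m: entities[m.group(1)], declaration)

expanded = expand("<!ELEMENT goods-in (goods_in_number, (%article;), (%aux-article;))>")
```

The result is the content model (goods_in_number, (art_number, art_name, art_size), (description | class)), which is why the references are wrapped in parentheses in the DTD.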

79

DTDs: entity declarations

<!ENTITY entity-name content>
<!ENTITY entity-name SYSTEM "filename.xml">
(entity-name possibly preceded by a %)

• entities can reference each other, and they can be “localised”
• entities preceded by % are parameter entities, they can
  – only be used within other declarations in external DTDs,
  – only be referenced inside the DTD (outside, % has no meaning)
  – be over-written

• Example:

<!ENTITY % pub "Éditions Gallimard">
<!ENTITY rights "All rights reserved">
<!ENTITY book "La Peste: Albert Camus, © 1947 %pub;. &rights;">

• the replacement text of “&book;”:

La Peste: Albert Camus, © 1947 Éditions Gallimard. &rights;

the general entity reference "&rights;" is expanded when "&book;" appears in the document's content or an attribute value.
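The two expansion steps of this example can be mimicked in Python. The sketch below uses plain textual substitution with a hypothetical helper expand_refs; it is meant to show the order of expansion only and ignores everything else an XML processor does.

```python
import re

def expand_refs(text, entities, marker):
    """Expand references of the form <marker>name; using `entities`."""
    pattern = re.escape(marker) + r"([\w.-]+);"
    return re.sub(pattern, lambda m: entities[m.group(1)], text)

parameter = {"pub": "Éditions Gallimard"}
general = {"rights": "All rights reserved"}

# Step 1, when the declaration of book is processed: only %pub; is expanded.
general["book"] = expand_refs("La Peste: Albert Camus, © 1947 %pub;. &rights;", parameter, "%")

# Step 2, when &book; is used in the document: &rights; is expanded as well.
in_document = expand_refs(expand_refs("&book;", general, "&"), general, "&")
```

After step 1, general["book"] still contains the reference &rights;; only in_document contains the text All rights reserved.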

80

DTDs: others

• there are a few other kinds of declarations possible in DTDs:

– notations

– conditional sections

• please refer to your favourite XML book for these
