B. Piwowarski – XML Databases – Taller Web 2006



XML Databases

Benjamin Piwowarski
Universidad de Chile

Taller Web 2006


Outline

● Introduction
● XML Documents
● XML Databases
● XML Information Retrieval


XML Databases

Part I. Introduction


XML Roots
1967: Generic Coding

● In the 1960s, electronic manuscripts were composed of text and control codes (macros)

● The macros were instructions used to format the document

● Generic coding, as proposed by William Tunnicliffe, uses descriptive tags (for example, "heading" rather than "format-17")


XML Roots
1969: GML/TDL

● Pele scores his 1000th goal
● The human race lands on the moon
● [...]
● Creation of ARPANET
● Invention of Unix
● Charles F. Goldfarb, Ed Mosher and Ray Lorie invent the Generalised Markup Language (or Text Description Language)


XML Roots
1973: GML (official)

    :h1.Chapter 1: Introduction
    :p.GML supported hierarchical containers, such as
    :ol
    :li.Ordered lists (like this one),
    :li.Unordered lists, and
    :li.Definition lists
    :eol.

● C. Goldfarb, E. Mosher and R. Lorie: “This analysis of the mark-up process suggests that it should be possible to design a generalized mark-up language so that mark-up would be useful for more than one application or computer system (...)”


XML Roots
1980-86: SGML

● A grammar: Document Type Definition (DTD) = regular expressions of element sequences + character content
– Validation

● <Tags> to delimit structure
    <a> <b><c></c><d></d></b> <e></e> </a>
  [Figure: the corresponding tree, with root a, children b and e, and c and d as children of b]

● But... quite complex


XML Roots
1990-92: HTML, an SGML instance/application

    <html>
    <body>
    <p> Hello world.
    </body>
    </html>


XML Development Goals (World Wide Web Consortium W3C)

● Straightforward to use and easy to create
● Support of a variety of applications
● Compatibility with SGML
● Easy to write programs
● No optional features
● Human readable
● Terse


Simplifications

Changes: SGML to XML

1. Differences Between XML and SGML

XML allows only documents that use the SGML declaration in this note. This declares all the following SGML features as NO:
– DATATAG
– OMITTAG
– RANK
– LINK (SIMPLE, IMPLICIT and EXPLICIT)
– CONCUR
– SUBDOC
– FORMAL

Note that it differs from the reference concrete syntax in a number of ways:
– It also declares no short reference delimiters; it follows that SHORTREF and USEMAP declarations cannot occur in XML
– The PIC (processing instruction close) delimiter is ?>
– Quantities and capacities are effectively unlimited
– Names are case sensitive (NAMECASE GENERAL is NO)
– Underscore and colon are allowed in names
– Names can use Unicode characters and are not restricted to ASCII

The following constructs which are permitted in SGML when SHORTTAG is YES are not allowed in XML:
– Unclosed start-tags
– Unclosed end-tags
– Empty start-tags
– Empty end-tags
– Attribute values in attribute specifications entered directly rather than as literals
– Attribute specifications that omit the attribute name

NET delimiters can be used only to close an empty element. In SGML without the Web SGML Adaptations Annex, the NET delimiter is declared as />. With this approach, XML is not allowing null end-tags and is allowing net-enabling start-tags only for elements with no end-tag. In SGML with the Web SGML Adaptations Annex, there is a separate NESTC (net-enabling start tag close) delimiter. This allows the XML <e/> syntax to be handled as a combination of a net-enabling start-tag <e/ and a null end-tag >. With this approach, XML is allowing a net-enabling start-tag only when immediately followed by a null end-tag.

XML imposes the following restrictions not in SGML:

● Entity references
– Entity references must be closed with a REFC delimiter
– References to external data entities in content are not allowed
– General entity references in content are required to be synchronous
– External entity references in attribute values are not allowed
– Parameter entity references are allowed in the internal subset only within a declaration separator (that is, at a point where a markup declaration could occur)

● Character references
– Character references must be closed with a REFC delimiter
– Named character references are not allowed
– Numeric character references to non-SGML characters are not allowed

● Entity declarations
– A #DEFAULT entity cannot be declared
– External SDATA entities are not allowed
– External CDATA entities are not allowed
– Internal SDATA entities are not allowed
– Internal CDATA entities are not allowed
– PI entities are not allowed
– Bracketed text entities are not allowed
– External identifiers must include a system identifier
– Attributes cannot be specified for an entity
– The replacement text of general text entities and external parameter entities is required to be well-formed
– An ampersand in a parameter literal must be followed by a syntactically valid entity reference or numeric character reference

● Attribute definition list declarations
– Associated element type in attribute definition list declarations cannot be a name group
– Attributes cannot be declared for a notation
– CURRENT attributes are not allowed
– Content reference attributes are not allowed
– NUTOKEN(S) declared values are not allowed
– NUMBER(S) declared values are not allowed
– NAME(S) declared values are not allowed
– A name token group must use the or connector
– Attribute values specified as defaults in attribute definition list declarations must be literals (SGML allows them not to be even when SHORTTAG is NO)

● Element type declarations
– Associated element type in element type declaration cannot be a name group
– In an element declaration, a generic identifier cannot be specified as a rank stem and rank suffix (SGML allows this even when the RANK feature is NO)
– Minimization parameters in element declarations are not allowed
– RCDATA declared content are not allowed
– CDATA declared content are not allowed
– Content models cannot use the and connector
– Content models for mixed content have a restricted form
– Inclusions are not allowed
– Exclusions are not allowed

● Comments
– A parameter separator cannot contain comments; this means that markup declarations (other than comment declarations) cannot contain comments
– Empty comment declarations (<!> in the reference concrete syntax) are not allowed
– A comment declaration cannot contain more than one comment
– In a comment declaration, an S separator is not allowed before the final MDC

● Processing instructions
– Processing instructions must start with a name (the PI target)
– A processing instruction whose PI target is xml can only occur at the beginning of a external entity and must be an XML declaration if it occurs in the document entity, and otherwise an text declaration
– A PI target must not match [Xx][Mm][Ll] unless it is xml

● Marked sections
– In marked section declarations, TEMP status keyword is not allowed
– RCDATA marked sections are not allowed
– INCLUDE/IGNORE marked sections are not allowed in the document instance
– In a marked section declaration, a status keyword specification that contains no status keywords is not allowed
– In a marked section declaration, a status keyword specification cannot contain more than one status keyword
– Marked sections are not allowed in the internal subset
– Parameter separators are not allowed in status keyword specifications in the document instance; in particular, parameter entity references are not allowed

● Other
– Names beginning with [Xx][Mm][Ll] are reserved
– The SGML declaration must be implied and cannot be explicitly present in the document entity
– When < and & occur as data, they must be entered as &lt; and &amp;
– A parameter separator required by the formal syntax must always be present and cannot be omitted when it is adjacent to a delimiter

XML predefines the semantics of the attributes xml:space and xml:lang. It also reserves all attribute, element type and notation names beginning with [Xx][Mm][Ll].

XML requires that an SGML parser use an entity manager that behaves as follows:
– Lines are terminated by newline (Unicode code #X000A) rather than being delimited by RS and RE as with a typical SGML entity manager
– System identifiers are treated as URLs
– The entity manager must support entities encoded in UTF-16 and UTF-8, and must be able automatically to detect which encoding an entity uses based on the presence of the byte order mark
– The entity manager should be able to recognize the encoding declaration in the XML declaration and encoding PI and use it to determine the encoding of entity

XML imposes requirements on the information that a parser must make available to an application.

XML depends on the following changes to SGML made by Web SGML Adaptations Annex:
– HCRO delimiter (for hex numeric character references); for XML this is &#x
– EMPTYNRM feature that allows elements declared EMPTY to have end-tags
– NESTC delimiter
– Duplicate enumerated attribute tokens are allowed
– Relaxation of rules on use of parameter entity references inside groups
– Multiple ATTLIST declarations for a single element type
– ATTLIST declarations which don't declare any attributes
– KEEPRSRE feature that turns off SGML's rules for ignoring RSs and REs
– Fully-tagged SGML documents; a document that is fully-tagged but not type-valid is a conforming SGML document; this makes all XML documents, including those that are well-formed but not valid, conforming SGML documents
– Predefined data character entities in the SGML declaration (for lt, amp and so on)
– Unlimited capacities and quantities

The Web SGML Adaptations Annex also enables some XML restrictions to be enforced in SGML:
– SHORTTAG is unbundled, so the SGML declaration can allow attribute defaulting and NET without allowing other SHORTTAG constructs
– The SGML declaration can assert that a document is integrally stored, which disallows improperly nested entity references in content


XML Roots
1998: XML 1.0

● W3C standard (version 1.1 in 2004)
● Simplification of SGML
● Simple yet powerful
● New grammars (XML Schema, ...)
● Originally intended as a document markup language, not a database language (or technology)


Advantages of XML over SGML

● Simplicity (no optional features)
● The grammar is optional; an XML document can be:
– Well-Formed (WF): XML graph (tree)
    <a><p>blah</p><p>blah</p></a> is WF
    <a><p>blah<p>blah</a> is not WF
– Valid: grammar-authorised subgraphs (valid implies WF)
● Proposed grammars can be powerful: DTD (SGML/XML) to XML Schema, RelaxNG
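As a rough illustration of the well-formed / not well-formed distinction, here is a minimal Python sketch that relies on the standard library parser xml.etree.ElementTree, which rejects input that is not well-formed. The helper name is_well_formed is an assumption made for this example, and the check covers well-formedness only, not validity against a grammar.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser refuses anything that is not well-formed.
        return False
    return True


# The two fragments from the slide.
closed = "<a><p>blah</p><p>blah</p></a>"
unclosed = "<a><p>blah<p>blah</a>"
```

Calling is_well_formed on the first fragment should succeed, while the second fails because the p elements are never closed.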


Summary of (X)ML advantages

● Increased lifespan
● Communication between applications
● Documents can be modified with a simple text editor or an XML-aware editor
● Information can be processed by external applications easily and partially
● Semantics
● Querying (XQuery, XPath, ...) and transformations (XSLT)


Example

    <?xml version="1.0"?>
    <w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
      <w:body>
        <w:p>
          <w:r>
            <w:rPr><w:b/></w:rPr>
            <w:t>Hello World</w:t>
          </w:r>
        </w:p>
      </w:body>
    </w:wordDocument>

    [...]
    00001470 00 00 00 00 00 00 00 00 00 00 02 00 d9 00 00 00 |................|
    00001480 48 00 65 00 6c 00 6c 00 6f 00 20 00 77 00 6f 00 |H.e.l.l.o. .w.o.|
    00001490 72 00 6c 00 64 00 0d 00 00 00 00 00 00 00 00 00 |r.l.d...........|
    000014a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
    00001680 00 04 00 00 18 04 00 00 fc 00 00 00 00 00 00 00 |................|
    [...]


XML Databases

Part II. XML Documents


Mr. the XML document

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE book SYSTEM "/etc/printers.dtd">
    <driver id="driver/appledmp">
      <name>appledmp</name>
      <url>http://www.ghostscript.com/...</url>
      <execution>
        <ghostscript/>
      </execution>
      <printers>
        <printer>
          <!-- Apple Dot Matrix -->
          <id>printer/Apple-Dot_Matrix</id>
        </printer>
      </printers>
    </driver>


XML (eXtensible Markup Language)

● Documents have tags giving extra information about sections of the document
    <title> XML </title> <slide> Introduction … </slide>

● Extensible, unlike HTML (an SGML application): users can add new tags, and separately specify how the tag should be handled for display

● Goal was (is?) to replace HTML as the language for publishing documents on the Web


XML Document

● The ability to specify new tags, and to create nested tag structures, made XML a great way to exchange data, not just documents
– Much of the use of XML has been in data exchange applications, not as a replacement for HTML

● Tags make data self-documenting


Nesting

● Tag: label for a section of data
● Element: section of data beginning with <tagname> and ending with matching </tagname>
● Elements must be properly nested
– Proper nesting
    <account> … <balance> … </balance> </account>
– Improper nesting
    <account> … <balance> … </account> </balance>
● Every document must have a single top-level element
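The nesting rule and the single top-level element rule can both be observed with a parser. The following Python sketch uses the standard xml.etree.ElementTree module and reports where parsing first fails; the helper name first_error and the sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def first_error(text):
    """Return None if the text is well-formed, else the (line, column) of the first error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return err.position
    return None


proper = "<account><balance>400</balance></account>"
# </account> closes before </balance>: improper nesting.
improper = "<account><balance>400</account></balance>"
# Two top-level elements: not a single root.
two_roots = "<account/><account/>"
```

first_error returns None only for the properly nested document; the other two yield a position because the parser stops at the offending tag.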


XML Elements & text

● Mixture of text with sub-elements:
    <account>
      This account is seldom used any more.
      <account-number> A-102</account-number>
      <branch-name> Perryridge</branch-name>
      <balance>400 </balance>
    </account>

● Useful for document markup but discouraged for data representation
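To see why mixed content is awkward for data representation, consider how a program has to read it. This is a small Python sketch with xml.etree.ElementTree, where free text before the first sub-element is exposed through the parent's text attribute and has to be handled separately from the fields; the variable names are assumptions for this example.

```python
import xml.etree.ElementTree as ET

doc = """<account>This account is seldom used any more.
  <account-number>A-102</account-number>
  <branch-name>Perryridge</branch-name>
  <balance>400</balance>
</account>"""

account = ET.fromstring(doc)

# Text that precedes the first child lives on the parent element itself.
leading_text = account.text.strip()

# The "data" part: one entry per sub-element.
fields = {child.tag: child.text.strip() for child in account}
```

The fields dictionary maps cleanly onto a record, whereas leading_text has no natural column to go into.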


XML Attributes

● Elements can have attributes
    <account acct-type="checking">
      <account-number> A-102 </account-number>
      <branch-name> Perryridge </branch-name>
      <balance> 400 </balance>
    </account>

● Attributes are specified by name="value" pairs inside the starting tag of an element

● An element may have several attributes, but each attribute name can only occur once
    <account acct-type="checking" monthly-fee="5">
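A short Python sketch of both points, using xml.etree.ElementTree: attributes are read as name/value pairs, and a start tag that repeats an attribute name is rejected by the parser. The helper parses and the second attribute value ("savings") are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

account = ET.fromstring(
    '<account acct-type="checking" monthly-fee="5">'
    "<account-number>A-102</account-number>"
    "</account>"
)
# Attributes are exposed as a mapping from name to value.
attributes = dict(account.attrib)


def parses(text):
    """Return True if the parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The same attribute name twice in one start tag is not well-formed.
duplicate_ok = parses('<account acct-type="checking" acct-type="savings"/>')
```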


XML - Attributes vs Elements

● In the context of documents, attributes are part of markup, while element contents are part of the basic document contents

● In the context of data representation, the difference is unclear and may be confusing
    <account account-number="A-101"> ... </account>
    <account>
      <account-number>A-101</account-number> …
    </account>

● Suggestion: use attributes for identifiers of elements, and use elements for contents


Miscellaneous XML

● Elements without sub-elements or text content can be abbreviated by ending the start tag with /> and omitting the end tag
    <account number="A-101" branch="Perryridge" balance="200" />

● Comments: enclosed between <!-- and -->

● CDATA sections: instruct the XML processor to ignore markup characters and pass the enclosed text directly to the application
    <![CDATA[<account> … </account>]]>
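Both abbreviations can be checked with a parser. In this Python sketch (standard xml.etree.ElementTree), the empty-element form and the explicit start/end pair produce the same element, and the markup inside a CDATA section arrives as plain text. The wrapping element name note and the CDATA payload are assumptions for this example.

```python
import xml.etree.ElementTree as ET

short = ET.fromstring('<account number="A-101" branch="Perryridge" balance="200"/>')
long = ET.fromstring('<account number="A-101" branch="Perryridge" balance="200"></account>')

# Same tag, same attributes, no text, no children in both forms.
same = (short.tag, short.attrib, short.text, len(short)) == (
    long.tag, long.attrib, long.text, len(long)
)

# Inside CDATA, "<account>" is character data, not an element.
note = ET.fromstring("<note><![CDATA[<account>A-101</account>]]></note>")
```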


XML Ordering

● In XML, elements are ordered
● In contrast, attributes are unordered

Equivalence between
    <t a1="v1" a2="v2">...</t>
    <t a2="v2" a1="v1">...</t>
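A minimal Python sketch of the difference, assuming xml.etree.ElementTree and two hypothetical child elements x and y added for the comparison: swapping attributes leaves the attribute mapping equal, while swapping children changes the child sequence.

```python
import xml.etree.ElementTree as ET

t1 = ET.fromstring('<t a1="v1" a2="v2"><x/><y/></t>')
t2 = ET.fromstring('<t a2="v2" a1="v1"><x/><y/></t>')  # attributes swapped
t3 = ET.fromstring('<t a1="v1" a2="v2"><y/><x/></t>')  # children swapped

# Attribute order is not significant: the mappings compare equal.
same_attributes = t1.attrib == t2.attrib

# Element order is significant: the child sequences differ.
same_children = [c.tag for c in t1] == [c.tag for c in t3]
```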


Namespaces

● Combine fragments from different documents without any naming conflicts

● Write reusable code modules that can be invoked for specific elements and attributes

● Define elements and attributes that can be reused in other schemas or instance documents without fear of name collisions


Namespaces: example

    <math>
      <mfrac>
        <mrow> <mi>a</mi> <mo>+</mo> <mi>b</mi> </mrow>
        <mrow> <mn>2</mn> </mrow>
      </mfrac>
    </math>

    <svg version="1.1">
      <ellipse cx="240" cy="100" rx="220" ry="30" style="fill:yellow"/>
      <ellipse cx="220" cy="100" rx="190" ry="20" style="fill:white"/>
    </svg>

    <div>
      <p>This is a text with a <b>MathML</b> formula </p>
      <p>and a <b>SVG</b> figure</p>
    </div>


Namespaces: Example (cont'd)

    <div xmlns="http://www.w3.org/1999/xhtml">
      <p>This is a text with a <b>MathML</b> formula </p>
      <math xmlns="http://www.w3.org/1998/Math/MathML">
        <mfrac>
          <mrow> <mi>a</mi><mo>+</mo><mi>b</mi> </mrow>
          <mrow><mn>2</mn></mrow>
        </mfrac>
      </math>
      <p>and a <b>SVG</b> figure</p>
      <svg version="1.1" xmlns="http://www.w3.org/2000/svg">
        <ellipse cx="240" cy="100" rx="220" ry="30" style="fill:yellow"/>
        <ellipse cx="220" cy="100" rx="190" ry="20" style="fill:white"/>
      </svg>
    </div>


Namespaces: Example (cont'd)

    <h:div xmlns:h="http://www.w3.org/1999/xhtml"
           xmlns:m="http://www.w3.org/1998/Math/MathML"
           xmlns:s="http://www.w3.org/2000/svg">
      <h:p>This is a text with a <h:b>MathML</h:b> formula </h:p>
      <m:math>
        <m:mfrac>
          <m:mrow> <m:mi>a</m:mi><m:mo>+</m:mo><m:mi>b</m:mi> </m:mrow>
          <m:mrow><m:mn>2</m:mn></m:mrow>
        </m:mfrac>
      </m:math>
      <h:p>and a <h:b>SVG</h:b> figure</h:p>
      <s:svg version="1.1">
        <s:ellipse cx="240" cy="100" rx="220" ry="30" style="fill:yellow"/>
        <s:ellipse cx="220" cy="100" rx="190" ry="20" style="fill:white"/>
      </s:svg>
    </h:div>
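To show how a program addresses elements from several namespaces in one document, here is a Python sketch with xml.etree.ElementTree. It parses a prefixed document of the same shape as the example and selects elements by namespace URI. The prefixes used on the query side (xhtml, mathml, svg) are deliberately different from those in the document, as an assumption to stress that only the URI identifies a namespace.

```python
import xml.etree.ElementTree as ET

doc = """<h:div xmlns:h="http://www.w3.org/1999/xhtml"
       xmlns:m="http://www.w3.org/1998/Math/MathML"
       xmlns:s="http://www.w3.org/2000/svg">
  <h:p>This is a text with a <h:b>MathML</h:b> formula</h:p>
  <m:math><m:mfrac>
    <m:mrow><m:mi>a</m:mi><m:mo>+</m:mo><m:mi>b</m:mi></m:mrow>
    <m:mrow><m:mn>2</m:mn></m:mrow>
  </m:mfrac></m:math>
  <h:p>and a <h:b>SVG</h:b> figure</h:p>
  <s:svg version="1.1">
    <s:ellipse cx="240" cy="100" rx="220" ry="30" style="fill:yellow"/>
    <s:ellipse cx="220" cy="100" rx="190" ry="20" style="fill:white"/>
  </s:svg>
</h:div>"""

# Query-side prefixes: arbitrary names bound to the same URIs.
ns = {
    "xhtml": "http://www.w3.org/1999/xhtml",
    "mathml": "http://www.w3.org/1998/Math/MathML",
    "svg": "http://www.w3.org/2000/svg",
}

root = ET.fromstring(doc)
paragraphs = root.findall("xhtml:p", ns)       # direct XHTML children
ellipses = root.findall(".//svg:ellipse", ns)  # SVG descendants
identifiers = [mi.text for mi in root.iter("{http://www.w3.org/1998/Math/MathML}mi")]
```

Internally the parser stores each name as {URI}local-name, which is why root.tag reads {http://www.w3.org/1999/xhtml}div regardless of the prefix chosen in the document.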


XML Model

● It is necessary to separate the logical model from the internal (physical) model for
– Data independence from physical storage
– Definition of query languages / XML database, API
● Nowadays, the major models are:
– XML Information Set
– DOM 1.0 Level 2
– XQuery 1.0 and XPath 2.0 data model


XML Model (2)

● A tree-like structure
– Root node: Document (virtual)
– Inner node: Element
– Leaf nodes:
    ● Element
    ● Text
    ● Comment
    ● Processing instructions
    ● (attributes) DOM
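The tree model can be made concrete with a DOM implementation. This Python sketch uses the standard xml.dom.minidom module to walk a small document and list the kind of each node; the sample document, the outline helper and the label strings are assumptions for illustration. In this implementation attributes are reached through an element's attributes property and do not appear among its child nodes.

```python
from xml.dom import Node, minidom

NODE_KINDS = {
    Node.DOCUMENT_NODE: "Document",
    Node.ELEMENT_NODE: "Element",
    Node.TEXT_NODE: "Text",
    Node.COMMENT_NODE: "Comment",
    Node.PROCESSING_INSTRUCTION_NODE: "ProcessingInstruction",
}


def outline(node, depth=0):
    """Return one indented label per node, in document order."""
    kind = NODE_KINDS.get(node.nodeType, "Other")
    label = f"{kind} {node.nodeName}" if node.nodeType == Node.ELEMENT_NODE else kind
    lines = ["  " * depth + label]
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines


document = minidom.parseString("<a><!-- note --><b>text</b><?target data?><e/></a>")
tree = outline(document)
```

Printing "\n".join(tree) gives an indented view with the virtual Document node at the top, elements as inner nodes, and text, comment, processing-instruction and empty-element nodes as leaves.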


Programming with XML: JavaScript 1.6 (E4X)

● In JavaScript 1.6 (E4X), XML is a native type
    var x = <p><b>Hello</b> world <b>!</b></p>;
  x.b[0] is the element <b>Hello</b>

● x.b[0] can be updated
    x.b[0] = <i>Hello</i>;
  x is then <p><i>Hello</i> world <b>!</b></p>


Where can XML be found?

● XHTML
● RSS
● SVG
● MathML
● RDF
● Office applications (OpenOffice, KOffice, Microsoft Office, ...)
● ...


XML in the Web
The XML Web: a First Study (Mignet and Barbosa, 2002)

● 190 147 documents and 19 254 sites
● The ".com" and ".net" domains combined contain 53% of the documents and 76% of the volume of XML content on the Web
● WAP and RDF make up 26% and 17% of all documents
● The average document size is around 4 KB
● 99% of the documents have fewer than 8 levels of element nesting


XML Databases

Part III. XML Databases


Motivations
Why XML data?

● Data can contain fields not known at design time
● Data is self-describing
● Data may be sparse
● Hierarchies are naturally handled
● Natural form for document-centric data


Motivations
Why a database?

● Ease of management
– Integrity constraints
– Independence from storage
● Enhanced query performance
● Transactional safety (ACID): Atomicity, Consistency, Isolation and Durability
● Security
● Managing huge quantities of data
● ...


Motivations
Why an enabled XML DB?

● Enabled XML Database
– Built on top of a relational database (tables)
– XML is only an “exchange” format
– Construction of wrappers
– Fixed schema
● Storage of XML documents
● Structured query languages like XQuery


Motivations
Why an enabled XML DB (Example)?

● Two tables:
– articles (ID, title)
– authors (articleID, name)

● Exchange format:
    <article id="...">
      <title>The life of the great sea lion</title>
      <author>...</author>
      <author>...</author>
    </article>
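The wrapper idea behind an enabled XML database can be sketched in a few lines of Python with the standard sqlite3 and xml.etree.ElementTree modules: data stays in the two relational tables, and XML is produced only at export time. The in-memory database, the function name export_article and the author names are hypothetical, chosen for this example since the original elides them.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (ID INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE authors (articleID INTEGER REFERENCES articles(ID), name TEXT);
    INSERT INTO articles VALUES (1, 'The life of the great sea lion');
    INSERT INTO authors VALUES (1, 'First Author');   -- hypothetical name
    INSERT INTO authors VALUES (1, 'Second Author');  -- hypothetical name
""")


def export_article(article_id):
    """Wrapper: read both tables and emit the XML exchange format."""
    (title,) = conn.execute(
        "SELECT title FROM articles WHERE ID = ?", (article_id,)
    ).fetchone()
    article = ET.Element("article", id=str(article_id))
    ET.SubElement(article, "title").text = title
    names = conn.execute(
        "SELECT name FROM authors WHERE articleID = ? ORDER BY rowid", (article_id,)
    )
    for (name,) in names:
        ET.SubElement(article, "author").text = name
    return ET.tostring(article, encoding="unicode")


xml_text = export_article(1)
```

The fixed schema shows up directly in the code: a new kind of sub-element would require both a table change and a change to the wrapper.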


Motivations
Native XML DB (According to the XMLDb mailing list)

● Defines a (logical) model for an XML document (as opposed to the data in that document) and stores and retrieves documents according to that model.

● Has an XML document as its fundamental unit of (logical) storage

● Is not required to have any particular underlying physical storage model.


Motivations
Why a native XML Database?

● Avoid some bottlenecks
– A simple XQuery might involve numerous joins (e.g. recursive property)
– XML data-model
● More flexibility in schema evolution
– Schemas can be changed
– Unknown version of a schema
● Links, versioning
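The join bottleneck can be illustrated with a toy comparison in Python. A small tree is shredded into a relational table of (id, parent, tag) rows, one possible mapping assumed here for illustration, and "all descendants of the root" then needs a recursive self-join, whereas a tree API answers it with a single traversal. ElementTree stands in for a native tree model; it is not an XQuery engine.

```python
import sqlite3
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b><c/><d/></b><e/></a>")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent INTEGER, tag TEXT)")


def shred(element, parent_id=None):
    """Store one row per element, in document order."""
    cur = conn.execute(
        "INSERT INTO nodes (parent, tag) VALUES (?, ?)", (parent_id, element.tag)
    )
    for child in element:
        shred(child, cur.lastrowid)


shred(root)

# Relational side: one self-join per level, expressed as a recursive query.
rows = conn.execute("""
    WITH RECURSIVE descendants(id, tag) AS (
        SELECT id, tag FROM nodes
        WHERE parent = (SELECT id FROM nodes WHERE parent IS NULL)
        UNION ALL
        SELECT n.id, n.tag FROM nodes AS n
        JOIN descendants AS d ON n.parent = d.id
    )
    SELECT tag FROM descendants ORDER BY id
""").fetchall()
relational_tags = [tag for (tag,) in rows]

# Tree side: a single traversal, skipping the root itself.
native_tags = [el.tag for el in root.iter()][1:]
```

Both lists contain the same tags, but the relational version pays for one join step per nesting level.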


Motivations
XML Applications

● Le Monde (Xyleme Zone Server): over 800 000 documents, 6 gigabytes
● Flight information (Schiphol Airport in Amsterdam) uses Tamino to integrate data from more than 38 systems in real time
● Customer profiles
● ...


Influences on design
Typical queries

● Queries on text only
– Includes keyword, stemming, proximity search
● Queries on text and structure
– Content constraints
– Structure constraints
● Queries that span structure
– Structure might be “superfluous”
– Users might not know the structure (or might not want to)
● Other issues: joins, construction
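The three query types can be sketched over a tiny collection in Python with xml.etree.ElementTree. The collection, the element names body and p, and the three function names are hypothetical, invented for this example; plain substring matching stands in for real keyword search (no stemming or proximity).

```python
import xml.etree.ElementTree as ET

collection = ET.fromstring("""
<articles>
  <article id="a1">
    <title>The life of the great sea lion</title>
    <body><p>Sea lions live in <b>large</b> colonies.</p></body>
  </article>
  <article id="a2">
    <title>Notes on markup</title>
    <body><p>Generic coding uses descriptive tags.</p></body>
  </article>
</articles>
""")


def text_only(keyword):
    """Text-only query: match the keyword anywhere in an article."""
    return [a.get("id") for a in collection.iter("article")
            if keyword.lower() in "".join(a.itertext()).lower()]


def text_and_structure(keyword):
    """Content constraint (keyword) plus structure constraint (inside title)."""
    return [a.get("id") for a in collection.iter("article")
            if keyword.lower() in (a.findtext("title") or "").lower()]


def spans_structure(phrase):
    """Phrase query that ignores inline markup such as <b> inside a paragraph."""
    return [a.get("id") for a in collection.iter("article")
            if any(phrase in "".join(p.itertext()) for p in a.iter("p"))]
```

For instance, text_only("tags") finds article a2 while text_and_structure("tags") finds nothing, and spans_structure("in large colonies") matches a1 even though the phrase crosses a b element boundary.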


Influences on design: Two types of documents

● Regular (data-centric)
– Resembles relational data
– Regular structure
– Scalar values
● Mixed (document-centric)
– Flexible structures
– Arbitrary depth
– Sparse data
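One rough way to tell the two kinds of documents apart is to look for mixed content, that is, elements in which text is interleaved with child elements. The sketch below is only a heuristic under that assumption; the function names and the two tiny sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def has_mixed_content(elem):
    """True if elem has child elements interleaved with non-blank text."""
    children = list(elem)
    if not children:
        return False
    own_text = (elem.text or "").strip()
    tails = any((c.tail or "").strip() for c in children)
    return bool(own_text) or tails

def looks_document_centric(root):
    """Heuristic (an assumption): any mixed-content element suggests a
    document-centric file."""
    return any(has_mixed_content(e) for e in root.iter())

# Hypothetical samples: one regular record, one piece of marked-up prose.
regular = ET.fromstring("<Order><Id>1</Id><City>Chicago</City></Order>")
mixed = ET.fromstring("<Para>The <b>finest</b> steel</Para>")
print(looks_document_centric(regular), looks_document_centric(mixed))
```

A real system would probably also look at depth and at how regular the structure is; mixed content is just the easiest signal to compute.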


Influences on design: Two types of documents (examples)

<SalesOrder SONumber="12345">
  <Customer CustNumber="543">
    <CustName>ABC Industries</CustName>
    <Street>123 Main St.</Street>
    <City>Chicago</City>
    <State>IL</State>
    <PostCode>60609</PostCode>
  </Customer>
  <OrderDate>981215</OrderDate>
  ...
</SalesOrder>


Influences on design: Two types of documents (examples)

<FlightInfo>
  <Airline>ABC Airways</Airline> provides <Count>three</Count>
  non-stop flights daily from <Origin>Dallas</Origin> to
  <Destination>Fort Worth</Destination>. Departure times are
  <Departure>09:15</Departure>, <Departure>11:15</Departure>,
  and <Departure>13:15</Departure>. Arrival times are minutes later.
</FlightInfo>


Influences on design: Two types of documents (examples)

<Product>
  <Intro>
    The <ProductName>Turkey Wrench</ProductName> from <Developer>Full Fabrication Labs, Inc.</Developer> is <Summary>like a monkey wrench, but not as big.</Summary>
  </Intro>
  <Description><Para>The turkey wrench, which comes in <i>both right- and left-handed versions (skyhook optional)</i>, is made of the <b>finest stainless steel</b>. The Readi-grip rubberized handle quickly adapts to your hands, ...


Influences on design: Misc

● XML Data and Web are tightly related
– Local = efficiency
– Distributed = up-to-date
● PDOM (Persistent DOM)
– The DOM tree returned is “live”
– Similar to one of the roles of object databases
● Content Management System


Problematics

● Representation
– How to store (speed, size, access)?
– How to update?
● Schema design (normalisation)
● Transactions: ACID properties, isolation in particular
● Querying
– How to evaluate e.g. XQuery expressions?
– What are the appropriate index(es) and representation(s)?


Problematics: Storing documents

● “Enabled” relational database
– From XML schemas to relational schemas
– Wrappers
● “Native” database
– Text-based
– Model-based
● Relational storage: how to capture identity, structure and order?
● Compressed representation
● Others
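To make the question of identity, structure and order concrete, here is a minimal sketch of one common relational mapping, a single "node" table in which each element is a row. The table layout, the column names and the use of SQLite are assumptions chosen for this illustration; attributes and text that follows child elements are ignored to keep it short.

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(root, conn):
    """Store each element as one row:
    id = identity, parent = structure, ord = order among siblings."""
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, "
        "ord INTEGER, tag TEXT, text TEXT)"
    )
    next_id = 0
    stack = [(root, None, 0)]  # (element, parent id, position among siblings)
    while stack:
        elem, parent, ordinal = stack.pop()
        node_id = next_id
        next_id += 1
        conn.execute(
            "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
            (node_id, parent, ordinal, elem.tag, (elem.text or "").strip()),
        )
        for i, child in enumerate(elem):
            stack.append((child, node_id, i))

conn = sqlite3.connect(":memory:")
doc = ET.fromstring(
    "<Customer><CustName>ABC Industries</CustName><City>Chicago</City></Customer>"
)
shred(doc, conn)

# A child step becomes a self-join on the parent column.
rows = conn.execute(
    "SELECT c.tag, c.text FROM node p JOIN node c ON c.parent = p.id "
    "WHERE p.tag = 'Customer' ORDER BY c.ord"
).fetchall()
print(rows)
```

Each additional step in a path expression adds another self-join in this layout, which illustrates why a simple query can turn into numerous joins on an XML-enabled relational store.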


Storing XML: Example (pre-post)

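[Figure: an example tree (nodes a to h) with one node marked as the context node, shown next to the pre/post plane (both axes ranging from 0 to 8) in which every node is plotted; each node is stored as a (pre, post, level) triple.]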
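The pre/post encoding can be sketched in a few lines: a single depth-first traversal assigns each node its preorder rank, postorder rank and level, after which the descendant test reduces to two comparisons. The tree below is a hypothetical one, not necessarily the tree drawn on the slide, and element tags are assumed to be unique so that they can be used as keys.

```python
import xml.etree.ElementTree as ET

def pre_post(root):
    """Return {tag: (pre, post, level)} from one depth-first traversal.

    Tags are assumed unique so they can serve as keys in this toy example.
    """
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(elem, level):
        pre = counters["pre"]          # rank when entering the node
        counters["pre"] += 1
        for child in elem:
            visit(child, level + 1)
        ranks[elem.tag] = (pre, counters["post"], level)  # rank when leaving
        counters["post"] += 1

    visit(root, 0)
    return ranks

def is_descendant(ranks, v, context):
    """v is a descendant of context iff pre(v) > pre(c) and post(v) < post(c)."""
    return ranks[v][0] > ranks[context][0] and ranks[v][1] < ranks[context][1]

# Hypothetical tree, not necessarily the one drawn on the slide.
tree = ET.fromstring("<a><b><d/><e/></b><c><f/></c></a>")
ranks = pre_post(tree)
descendants_of_b = sorted(t for t in ranks if is_descendant(ranks, t, "b"))
print(ranks)
print(descendants_of_b)
```

With the context node fixed, the four regions of the pre/post plane around it correspond to the ancestor, descendant, preceding and following axes, so each axis step becomes a range condition over two columns.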


Problematics: Normalisation

● Normalisation = do not duplicate the information
● Relational databases
– Functional dependencies (1-3NF, BCNF)
– Multivalued dependencies
● XML and normalisation
– How to extend relational concepts?
– How to normalise a schema?


Problematics: Transactions and security

● Relational databases
– Nothing to do!
● Other (native XML) databases
– Durability, Consistency, Atomicity
– Isolation
● Security
– How to restrict access?


Problematics: Query evaluation and indexes

● Evaluating query plans
– How to predict selectivity of operators?
– Optimisation
– Use of “benchmarks” (XML + queries)
● Indexes
– Uni-dimensional: structure, content
– Multidimensional indexing
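The two uni-dimensional indexes can be sketched as follows: a structure index maps each root-to-node label path to node identifiers, and a content index maps each term to the nodes whose own text contains it. The data layout, the node numbering and the sample document are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_indexes(root):
    """Build a path index (structure) and an inverted index (content)."""
    path_index = defaultdict(list)     # label path -> node ids
    content_index = defaultdict(set)   # term -> node ids whose own text has it
    counter = 0
    stack = [(root, "")]
    while stack:
        elem, prefix = stack.pop()
        node_id = counter
        counter += 1
        path = prefix + "/" + elem.tag
        path_index[path].append(node_id)
        for term in (elem.text or "").lower().split():
            content_index[term].add(node_id)
        # Reversed so that children are popped in document order.
        for child in reversed(list(elem)):
            stack.append((child, path))
    return path_index, content_index

# Hypothetical sample document.
doc = ET.fromstring(
    "<book><chapter><title>XML storage</title></chapter>"
    "<chapter><title>XML retrieval</title></chapter></book>"
)
paths, terms = build_indexes(doc)
title_nodes = paths["/book/chapter/title"]
# Combining both indexes answers "titles containing 'retrieval'".
hits = sorted(set(title_nodes) & terms["retrieval"])
print(title_nodes, hits)
```

Answering a combined query here means intersecting the results of two separate lookups; a multidimensional index would aim to serve the structure and content conditions in one access.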


XML Databases

Part IV. XML Information Retrieval


Motivations

● Traditional IR is about finding documents relevant to a user’s information need, e.g. an entire book.

● Structured document retrieval (SDR) allows users to retrieve document components that are more focused on their information needs (e.g. a chapter of a book instead of the entire book).

● The structure of documents is exploited to identify which document components to retrieve.


Conceptual model for IR

[Figure: Documents → Indexing → Document representation; Query → Formulation → Query representation; Retrieval function → Retrieval results; Relevance feedback; Evaluation and Assessments]


Aims of XML IR

● Retrieve document components of varying granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure) that are relevant to the user’s information need with regard to both content and structure

● SDR involves the same tasks as in the conceptual model for IR...

● but with different inner functionality (e.g. indexing, query formulation, retrieval, result presentation, feedback, ...)


XML IR Concepts

● Like in IR
– Transformation of queries and documents into an adequate representation
– A score (RSV, retrieval status value) between the query and the element representations
– Feedback can be used to update either document or query representations
● But... documents and possibly queries are structured
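As a minimal sketch of these concepts, the code below turns the query and every element into a bag-of-words representation and ranks elements by a score. Cosine similarity is used here only as a stand-in for the retrieval status value; the sample document, the identifiers and the function names are assumptions made for this illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def representation(text):
    """Bag-of-words representation (lower-cased term counts)."""
    return Counter(text.lower().split())

def rsv(query_rep, elem_rep):
    """Cosine similarity, used here as a stand-in retrieval status value."""
    dot = sum(query_rep[t] * elem_rep[t] for t in query_rep)
    q_norm = math.sqrt(sum(v * v for v in query_rep.values()))
    e_norm = math.sqrt(sum(v * v for v in elem_rep.values()))
    if q_norm == 0 or e_norm == 0:
        return 0.0
    return dot / (q_norm * e_norm)

def rank_elements(root, query):
    """Score every element (not only the whole document) against the query."""
    q = representation(query)
    scored = [
        (rsv(q, representation(" ".join(e.itertext()))), e.tag, e.get("id"))
        for e in root.iter()
    ]
    return sorted(scored, key=lambda s: s[0], reverse=True)

# Hypothetical two-section article.
doc = ET.fromstring(
    '<article id="a1"><sec id="s1">metro lines in Santiago</sec>'
    '<sec id="s2">road works and traffic</sec></article>'
)
ranking = rank_elements(doc, "Santiago metro")
print(ranking)
```

In this toy collection the focused section outranks the whole article because the article's representation is diluted by the unrelated section, which is the effect element retrieval tries to exploit.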


XML IR: Queries

● Content-only (CO) queries
– Standard IR queries, but here we are retrieving document components
– Example: “Santiago metro”
● Structure-only queries
– Usually not that useful from an IR perspective
– Example: “Paragraph containing a diagram next to a table”


XML IR: Queries (2)

● Content-and-structure (CAS) queries
– Put constraints on which types of components are to be retrieved
– Example: “Articles that contain sections about congestion charges in Santiago, and that contain a picture of a hole in the road”
//article[about(.//section, congestion charge) and about(.//picture, hole in the road)]


XML IR: Example (augmentation weight)

● A system should always retrieve the most specific part of a document answering a query.

● Example query “XQL”: the subsection is retrieved

[Figure: tree of a section and two nested sections, each linked with augmentation weight 0.5; term weights shown: 0.5 example, 0.8 XQL, 0.7 syntax, 0.4 XQL]


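XML IR: Example (cont'd) (augmentation weight)

[Figure: the same tree of a section and two nested sections, each linked with augmentation weight 0.5; term weights shown: 0.5 example, 0.8 XQL, 0.7 syntax, 0.4 XQL, and 0.5 and 0.9 XQL for the enclosing section]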

Example query “XQL”: the section is retrieved
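The two cases can be reproduced with a small sketch. It assumes that a term weight moving up one level is multiplied by the augmentation weight 0.5 and then simply added to the parent's own weight (capped at 1), which is one reading consistent with the 0.4 and 0.9 values on the slides; the original system may combine weights differently. The tree layout, the class and the function names are also assumptions.

```python
from dataclasses import dataclass, field

AUGMENTATION = 0.5  # weight applied when a term weight moves up one level

@dataclass
class Element:
    name: str
    weights: dict = field(default_factory=dict)   # own term weights
    children: list = field(default_factory=list)

def augmented_weight(elem, term):
    """Own weight plus down-weighted child weights (additive: an assumption)."""
    total = elem.weights.get(term, 0.0)
    for child in elem.children:
        total += AUGMENTATION * augmented_weight(child, term)
    return min(total, 1.0)

def best_element(root, term):
    """Return (name, weight) of the highest-weighted element for the term."""
    best = (root.name, augmented_weight(root, term))
    for child in root.children:
        candidate = best_element(child, term)
        if candidate[1] > best[1]:
            best = candidate
    return best

# Assumed layout: one subsection carries 0.8 XQL and 0.7 syntax.
sub = Element("subsection", {"XQL": 0.8, "syntax": 0.7})
other = Element("other subsection", {"example": 0.5})
# First case: the section has no own weight for XQL, so it gets 0.5 * 0.8 = 0.4.
case1 = Element("section", {}, [other, sub])
# Second case: the section has its own weight 0.5 for XQL: 0.5 + 0.4 = 0.9.
case2 = Element("section", {"XQL": 0.5}, [other, sub])
result1 = best_element(case1, "XQL")
result2 = best_element(case2, "XQL")
print(result1)
print(result2)
```

Under these assumptions the subsection wins in the first case and the enclosing section wins in the second, matching the two captions.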


XML IR: Open Questions

● What to search?
● How to express the search?
● How to search?
● How to present the search results?
● How to know if what we found is relevant?
● Querying new XML applications (SVG, MathML, etc.)


XML IR: Current tasks

● Ad-hoc: element retrieval
● Interactive: study the user
● Relevance feedback: user sessions
● Natural language: extract structural constraints from text
● Multimedia: query on different media
● Heterogeneous: mixed XML collections
● Mining: classification


Links

● Charles F. Goldfarb, The Roots of SGML -- A Personal Recollection, 1996: http://www.sgmlsource.com/history/roots.htm
● James Clark, Comparison of SGML and XML, 1997: http://www.w3.org/TR/NOTE-sgml-xml.html
● W3C: http://www.w3.org
● W3Schools: http://www.w3schools.com
● Ronald Bourret, XML and Databases: http://www.rpbourret.com/xml/XMLAndDatabases.htm
● INEX (INitiative for the Evaluation of XML Retrieval): http://inex.is.informatik.uni-duisburg.de
● My page of links: http://benjamin.piwowarski.free.fr/links.php?view=XML%20World