
XML: Looking at the Forest Instead of the Trees

Guy Lapalme
RALI-DIRO
Université de Montréal
P.O. Box 6128, Succ. Centre-Ville
Montréal, Qc, Canada, H3C 3J7

e-mail: [email protected]
http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees

November 18, 2005


Abstract

This report gives a high-level overview of the main principles of some XML technologies: DTD, XML Schema, RELAX NG, XPath, XSL stylesheets, Formatting Objects, DOM and SAX models of processing. They are presented from the point of view of a computer scientist, without the hype too often associated with them. We do not give a detailed description but we focus on the relations between the main ideas of XML and other computer language technologies. A single compact pretty-print example is used throughout the text to illustrate the processing of an XML structure with XML technologies or by programming in Java. We also show how to create an XML document by programming in Java.

A first version of this report was written in Fall 2002 during my sabbatical at the Université de Grenoble and at Xerox Research Centre Europe. I wish to thank Gilles Sérasset, Christian Boitet, Pierre Isabelle and Marc Dymetman for many fruitful discussions. Since then, the document has been improved (at least increased in the number of pages...) after using it in teaching undergraduate and graduate courses at the Université de Montréal: IFT3220 and IFT6281. I especially thank Fabrizio Gotti for his careful reading and for many insightful comments.

Contents

1 Introduction

2 Instance Document
    2.1 Namespaces

3 Document Validation
    3.1 Document Type Declaration (DTD)
        3.1.1 Associating an Instance File to DTD
    3.2 Schema
        3.2.1 Simple Types
        3.2.2 Complex Types
        3.2.3 Keys and Keyrefs
        3.2.4 Namespaces in Schemas
        3.2.5 Overview of the XML Schemas of Our Application
    3.3 RELAX NG
    3.4 Associating an Instance File to a Schema
    3.5 Additional Information on XML Schema

4 Document Transformation
    4.1 XPath
    4.2 XSL Transformations
    4.3 Transformation in HTML
        4.3.1 Bulleted Lists
        4.3.2 Table
        4.3.3 Computing New Information
    4.4 Transformation into a Compact Textual Form
    4.5 Transformation into PDF with XSL-FO
        4.5.1 XSL-FO Input to the Renderer
        4.5.2 From the Instance Document to the XSL-FO file
    4.6 Associating an Instance File to a Stylesheet
    4.7 Additional Information on XSL

5 Document Processing by Programming
    5.1 Document Object Model (DOM)
    5.2 Simple API for XML (SAX)
    5.3 Showing an Interactive Tree View
        5.3.1 Building a JTree with DOM
        5.3.2 Building a JTree with SAX
    5.4 Additional Information on Programming Models

6 Document Creation by Programming
    6.1 Creating a DOM Document
    6.2 Creating a Document with SAX Events
    6.3 Additional Information on XML Document Creation

7 Conclusion

Some XML Related Technologies and Systems

Quick Reference Tables
    DTD
    XML Schema
    RELAX NG
    XSLT

List of Tables

3.1 DTD syntax
3.2 XML Schema syntax
3.3 RELAX NG Compact and RELAX NG syntax

4.1 Examples of XPath expressions
4.2 XSLT syntax

List of Figures

1.1 Simple XML structure
1.2 Tree, Web browser, grid and table views of an XML file
1.3 Overview of XML technologies
1.4 HTML and in displayed HTML compact form
1.5 Text and PDF compact form

3.1 Graphical view of the Schema for the cellar book
3.2 Graphical view of the Schema for the wine catalog
3.3 Built-in datatypes for XML Schema

4.1 HTML display the cellar
4.2 HTML display of the red wines in the catalog
4.3 HTML display of information about the cellar
4.4 PDF output of compaction by Formatting Objects
4.5 Outline of the XSL-FO file produced by the nested box presentation

5.1 JTree display (on Mac OS X) of listing 2.2

Listings

2.1 Outline of CellarBook.xml which includes WineCatalog.xml
2.2 [CellarBook.xml]: XML instance document for the content of the cellar
2.3 [WineCatalog.xml]: XML instance document for the wine catalog
3.1 [CellarBook.dtd]: DTD for the cellar book
3.2 [WineCatalog.dtd]: DTD to validate the wine catalog
3.3 [CellarBook.xsd]: XML Schema for the cellar book
3.4 [WineCatalog.xsd]: Schema for the wine catalog
3.5 Outline of CellarBook.xsd which imports WineCatalog.xsd
3.6 [CellarBook.rnc]: RELAX NG compact notation schema for the cellar book
3.7 [CellarBook.rng]: RELAX NG schema for the cellar book
3.8 [WineCatalog.rnc]: RELAX NG Schema for the wine catalog
3.9 [WineCatalog.rng]: RELAX NG schema for the wine catalog
4.1 [compactHTML.html]: HTML output produced by the transformation on the cellar book
4.2 [compactHTML.xsl]: XSLT transformation to produce a bulleted list
4.3 [WineCatalog.html]: HTML output of the red wines in the wine catalog
4.4 [WineCatalog.xsl]: XSLT to select the red wines in the wine catalog
4.5 [CellarBook.html]: HTML output about the cellar
4.6 [CellarBook.xsl]: XSLT stylesheet to produce information about the cellar
4.7 [CellarBook.txt]: Text compaction of the cellar book
4.8 [compact.xsl]: Stylesheet used to compact the cellar book
4.9 [compactFO.xsl]: Stylesheet to transform into colored nested blocks
5.1 [DOMCompact.java]: Text compaction of the cellar book with Java using DOM
5.2 [CompactErrorHandler.java]: DOM error handling
5.3 [SAXCompact.java]: Text compaction of the cellar book with Java using SAX
5.4 [CompactHandler.java]: SAX Handler for text compacting an XML file
5.5 [TreeViewer.java]: JTree building with DOM
5.6 [JTreeHandler.java]: JTree building with SAX
6.1 [DOMExpand.java]: Compact form parsing to create a DOM XML document
6.2 [CompactTokenizer.java]: Specialized stream tokenizer
6.3 [SAXExpand.java]: XML document creation using SAX events
6.4 [CompactReader.java]: Compact form parsing to generate SAX events

Chapter 1

Introduction

XML has been developed to facilitate the annotation of information to be shared between computer systems. It is intended to be easily generated and parsed by computer systems on diverse platforms so its format is based on character streams rather than internal binary ones. Being character based, it also has the nice property of being readable and editable by humans using standard text editors.

XML is based on a uniform, simple and yet powerful model of data organization: the generalized tree. Such a tree is defined as either a single element or an element having other trees as its sub-elements called children (see the middle of figure 1.1). This is the same model as the one chosen for the Lisp programming language almost 50 years ago. This hierarchical model is very simple and allows a simple annotation of the data. As in Lisp, the same tree notation used for data representation is also employed to write programs to transform tree structures into other tree structures. On top of this identity of data and program representation, in XML, the tree notation is also used to denote type information to validate XML data.

As is shown at the top of figure 1.1, an arbitrary name between < and > symbols is given to a node of a tree. This is called a start-tag. Everything up until a corresponding end-tag (the same tag except that it starts with </) forms the content of the node, which can itself be a tree. Such trees (e.g. wine, properties and color in figure 1.1) are called elements. Elements can also contain character data and even mix character data and elements (e.g. food-pairing). In Lisp (bottom of figure 1.1), trees are represented by embedded lists (i.e. identifiers or lists enclosed between opening and closing parentheses) whose first element is the name of the node; character data is represented by character strings. An XML element with no content can be indicated with an end-tag immediately following a start-tag and can be abridged as an empty-element tag: a start-tag with a terminating / (see rating in figure 1.1). Comments can be added to an XML file by means of a special element that starts with <!-- and ends with -->.

Additional information can be added to an element tag with attribute pairs comprising the name of the attribute (e.g. format), an equal sign and the corresponding character string value within double or single quotes (e.g. "1l" or '1l'). Attributes can also be added to an empty element (e.g. rating).

<?xml version="1.0" encoding="UTF-8"?>
<wine name="M" code="00518712" format="1l">
  <properties>
    <color>red</color>
    <alcoholic-strength>12</alcoholic-strength>
  </properties>
  <origin>
    <country>Italy</country>
    <region>Abruzzo</region>
    <producer>Cantina Miglianico SCARL</producer>
  </origin>
  <rating stars="2"/>
  <food-pairing>Cold cuts, <bold>Meatloaf</bold>, Pizza</food-pairing>
  <price>9.95</price>
  <year>2004</year>
</wine>

wine  name:"M" code:"00518712" format:"1l"
  properties
    color
      red
    alcoholic-strength
      12
  origin
    country
      Italy
    region
      Abruzzo
    producer
      Can..SCARL
  rating  stars:2
  food-pairing
    Cold cuts,
    bold
      Meatloaf
    , Pizza
  price
    9.95
  year
    2004

(wine (:name "M" :code "00518712" :format "1l")
  (properties
    (color red)
    (alcoholic-strength 12))
  (origin
    (country Italy)
    (region Abruzzo)
    (producer Cantina Miglianico SCARL))
  (rating :stars "2")
  (food-pairing Cold cuts, (bold Meatloaf), Pizza)
  (price 9.95)
  (year 2004)
)

Figure 1.1: A simple XML structure (top) and a corresponding Lisp style structure (bottom). In the middle is shown an equivalent tree structure in which the element names have been shown in bold and the attributes in italics. The real information is the character data which appears in roman font. This shows the relations between nodes: properties has wine as parent and color, alcoholic-strength as children; a sibling of region is country.

Figure 1.2: On top right, file of figure 1.1 as displayed in Internet Explorer; the + at the left of <properties> indicates that this element is hidden by collapsing. By clicking on it, the + becomes - and the tree is displayed in full. The other parts of the figure show alternative views of the same file available in commercial XML editors in order to hide the tags from the view of the user: on the left, the <oXygen/> tree editor view; on the middle right is an XMLSpy grid view; on bottom, a table view offered by XMLSpy as a transpose of the grid view.

As is shown in the middle part of figure 1.1, these notations are equivalent to a tree data structure where each node is labelled with its name and attributes. Character data appears as leaf nodes. An empty element is a node with no sub-tree. In Lisp, the attributes can be represented by a list of pairs with names indicated by keywords (i.e. identifiers starting with a colon) followed by the corresponding value.
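The equivalence between the XML notation and the Lisp-style notation can be sketched in a few lines of Java. The following is only an illustration, not one of the listings of this report: the class name LispView and its methods are hypothetical, it relies on the standard DOM parser of the JDK, and it only handles elements, attributes and character data. Unlike the bottom of figure 1.1, it quotes all character data as strings.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of the report's source files):
// renders an XML tree as a Lisp-style nested list.
class LispView {

    /** Parses an XML string and returns a Lisp-style rendering of its root element. */
    static String xmlToLisp(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return toLisp(root);
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse XML", e);
        }
    }

    /** Element: (name (:attr "value" ...) child ...); character data: "text". */
    static String toLisp(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            return quote(node.getNodeValue());
        }
        StringBuilder sb = new StringBuilder("(").append(node.getNodeName());
        NamedNodeMap attrs = node.getAttributes();
        if (attrs != null && attrs.getLength() > 0) {
            // Attributes become a list of keyword/value pairs; their order is
            // not significant in XML and depends on the parser.
            sb.append(" (");
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                if (i > 0) sb.append(' ');
                sb.append(':').append(attr.getNodeName())
                  .append(' ').append(quote(attr.getNodeValue()));
            }
            sb.append(')');
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            // Comments, processing instructions and CDATA sections are skipped in this sketch.
            if (type == Node.ELEMENT_NODE || type == Node.TEXT_NODE) {
                sb.append(' ').append(toLisp(child));
            }
        }
        return sb.append(')').toString();
    }

    static String quote(String s) {
        return "\"" + s.replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(xmlToLisp(
            "<wine name=\"M\"><color>red</color><rating stars=\"2\"/></wine>"));
    }
}
```

On this small input, the program prints (wine (:name "M") (color "red") (rating (:stars "2"))): the empty element rating is a node with attributes but no sub-tree.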

XML has the (well deserved) reputation of being verbose but it must be kept in mind that this notation is primarily aimed at communication between machines for which verbosity is not a problem but uniformity of notation is a real asset. In fact, humans should not really be required to type all these start-tags and end-tags. Indeed, many useful structural XML editors are now available which hide the verbosity, either by keeping only the important structural information or by displaying embedded tables instead of tags. Figure 1.2 shows alternative views of an XML file.

As has been shown by Lisp over the years, this tree notation is very general and can be used not only to represent data but also its processing. Programs for transforming XML tree structures into other tree structures can be written in XSL (eXtensible Stylesheet Language) stylesheets which are a declarative notation for XML transformation also written in XML.

An important aspect of XML (and one that differs from Lisp) is the a priori type checking that can be done on the file and the validation that can be performed before processing. XML type information can be provided either with a DTD or with schemas, which offer a more powerful and flexible type system. A schema is also written as an XML file which can itself be type checked. An alternative schema notation called RELAX NG will also be presented later in this document.

XML originated from the need for a flexible way of organizing natural language texts and thus its designers used standard representations of the characters (most often Unicode) and standard encodings such as UTF-8 or UTF-16, which will not be discussed here.

XML is also widely used in computing systems to systematize structured data as an alternative to databases. Many relational databases also offer XML specific features for indexing and searching. Because of the portability of its encoding and the fact that XML parsers are freely available, it is also used for many tasks requiring flexible data manipulation to transfer data between systems, as configuration files of programs and for keeping information about other files. This document will not present these applications but will focus on a single one (creating a compact representation of an XML file) that will be used throughout so that one can feel the similarities and differences between some XML technologies.

Figure 1.3 presents the XML technologies we will describe in this report and their relations. The focus of the whole process is an XML instance document that contains the data (towards the top of figure 1.1). This XML document can be validated against a specification described either as a Document Type Definition (DTD) or an XML Schema, itself another XML file. The validation process will be described in chapter 3. Once validated, XML data can be used by application programs through specific Application Programming Interfaces (APIs) described in chapters 5 and 6. XML data can also be processed by transformations (chapter 4) written as stylesheets, a special kind of validated XML file, to create new XML, HTML, PDF or text files.

[Diagram labels: Validation (Chapter 3): DTD, Schema (.dtd, .xsd, .rng, .rnc); Transformations (Chapter 4): StyleSheets (.xsl); XML Instance Document (.xml); Text, XHTML, Formatting Objects (.xml); Rendering (Chapter 4.3): PDF, HTML, PDA, ...; API (Chapters 5, 6): Application Programs. Legend: Types, XML document, Process Chapter, Output Document.]

Figure 1.3: Overview of XML technologies

For example, the XML file at the top of figure 1.1 can be transformed with a stylesheet into an HTML one. The top of figure 1.4 shows such a possible HTML output, displayed in a web browser shown at the bottom of the figure. This is the kind of tree to tree transformation for which XSLT was specifically designed. To better illustrate the power of the more general transformations that XSLT allows, we will show how to obtain a more compact form¹ shown in figure 1.5 either as a text file (top) or in PDF (bottom) through a transformation using Formatting Objects.

This chapter has shown that XML is a flexible notation for adding information to natural language text but it is more and more used in other areas as well. The raw XML is verbose and not very user-friendly but it can be hidden by appropriate tools. Programmers can also rely on freely available XML parsers and validators in order to get a well-organized data structure from an XML file.

This report tries to give an overall impression of some XML techniques and should not be considered as a definitive or exhaustive manual. We will describe the main principles and present general rules and, for the sake of simplicity, we will sometimes tell white lies that seasoned XML experts could point out.

[T]he right abstraction [for XML ...] is a labeled tree of elements. Each element has an ordered list of children in which each child is a Unicode string or an element. An element is labeled with a two-part name consisting of a URI and local part. Each element also has an unordered collection of attributes where each attribute has a two-part name, distinct from the name of the other attributes in the collection, and a value, which is a Unicode string. That is the complete abstraction. [...]. If you understand this, then you understand XML.

James Clark, in [34, pp. ix-x].

¹ This compaction notation is similar to the one used in the Formal Description of XML [15] and must be seen as a programming exercise and not as a compression technique for XML files.

<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>HTML compaction of the XML file</title></head>
<body>
  <ul>
    <li xmlns=""><b>wine</b> name="M" code="00518712" format="1l"
      <ul>
        <li><b>properties</b>
          <ul>
            <li><b>color</b> red</li>
            <li><b>alcoholic-strength</b> 12</li>
          </ul>
        </li>
        <li><b>origin</b>
          <ul>
            <li><b>country</b> Italy</li>
            <li><b>region</b> Abruzzo</li>
            <li><b>producer</b> Cantina Miglianico SCARL</li>
          </ul>
        </li>
        <li><b>rating</b> stars="2" </li>
        <li><b>food-pairing</b>
          <ul>Cold cuts, <li><b>bold</b> Meatloaf</li>, Pizza </ul>
        </li>
        <li><b>price</b> 9.95</li>
        <li><b>year</b> 2004</li>
      </ul>
    </li>
  </ul>
</body>
</html>

Figure 1.4: Representation of the tree of figure 1.1 in source HTML and as it appears in a browser window. This HTML output (slightly reformatted here to fit in the page) was produced by our example stylesheet compactHTML.xsl shown in listing 4.2.

<?xml version="1.0" encoding="utf-8"?>
wine[@name[M]
     @code[00518712]
     @format[1l]
     properties[color[red]
                alcoholic-strength[12]]
     origin[country[Italy]
            region[Abruzzo]
            producer[Cantina Miglianico SCARL]]
     rating[@stars[2]]
     food-pairing[Cold cuts,
                  bold[Meatloaf]
                  , Pizza]
     price[9.95]
     year[2004]]

[PDF rendering of the same structure as nested blocks: each element name or @attribute labels a block containing its value or its children (e.g. wine, @name M, @code 00518712, @format 1l, properties, color red, ...); character data mixed with elements is preceded by _ (e.g. _ Cold cuts,). Page footer: wine Page 1]

Figure 1.5: Compact form of the tree of figure 1.1 in text and PDF format. These outputs were produced by the stylesheets of listing 4.8 and listing 4.9. The overlap, in the PDF output, between the label alcoholic-strength and its value will be explained in section 4.5.

Chapter 2

Instance Document

Because there are many types of XML documents, either for transforming or validating data, an XML file that contains data is usually called an instance document. Any XML file must be well-formed, which means that

• all element start-tags and end-tags must be properly nested

• there should only be one top-level element in the file.
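These two conditions are checked by any XML parser. As a minimal sketch, assuming the standard DOM parser bundled with the JDK (the class name WellFormedCheck is ours and is not one of the source files of this report), well-formedness can be tested by simply trying to parse the text:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: a text is well-formed when a (non-validating) parser accepts it.
class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them on stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;   // the parser rejected the text
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<wine><color>red</color></wine>")); // properly nested
        System.out.println(isWellFormed("<wine><color>red</wine></color>")); // tags overlap
        System.out.println(isWellFormed("<wine/><wine/>"));                  // two top-level elements
    }
}
```

Only the first call returns true. Note that this says nothing about the validity of the document with respect to a DTD or a schema, which is the subject of the next chapter.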

But there are also other peculiarities that we will describe shortly in this chapter.

In the rest of this report, we will be using as input the XML instance files shown in listings 2.2 and 2.3, whose outline is shown in listing 2.1.¹ They describe a wine cellar containing wine bottles defined in a separate wine catalog.² The structure of these files is the following:

CellarBook.xml (listing 2.2) describes the cellar in four parts:

    wine catalog: described in an external file WineCatalog.xml

    owner: name and address

¹ The XML and Java listings have been produced by the listings LaTeX package, which displays ␣ to indicate that a whitespace is significant because it appears within quotes. For the sake of brevity, some listings do not show the full content of the files. Ellipsis is indicated by ... The source files are available online at the companion website of this document at http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees

On the website, there are XML instance files having their name ending with DTD, XSD, RNC or RNG depending on the type of validation used (e.g. WineCatalogXSD.xml). These instance files use file inclusion to build the complete instance file. In this document, we will instead use the plain names of the instance files without indicating the validation type used (e.g. WineCatalog.xml). These files also exist on the website but they contain the full XML text and use no file inclusion. This can be useful with XML editors (such as XMLSpy) that do not support XInclude.

The source files contain XML or Java comments of the form |\label... which can be ignored by the reader. As the content of the listings in this document is most often taken directly from these source files, the labels are used for keeping references with the LaTeX source file of this report.

² This application was inspired by the Livre de cave example used by Benoît Habert in his book on the Common Lisp Object System (CLOS) programming [22].



Listing 2.1: Outline of CellarBook.xml (listing 2.2) which includes (line 4) WineCatalog.xml (listing 2.3), which uses a given namespace. The XML processor replaces this line at inclusion time by the content of the box.

    <cellar-book ...
        xsi:noNamespaceSchemaLocation="CellarBook.xsd"
        xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog">
 4    <xi:include href="WineCatalog.xml" ... />
      +----------------------------------------------------------------------
      |     <wine-catalog ...
      |         xsi:schemaLocation="http://www.iro.umontreal.ca/lapalme/wine-catalog
      |                             WineCatalog.xsd"
      |         xmlns="http://www.iro.umontreal.ca/lapalme/wine-catalog"
      |  5      xml:base="WineCatalog.xml">
      |       <wine name="Domaine de l'Ile Margaux" code="C00043125" ... >
      |         ...
      |       </wine>
      |       <wine name="Riesling Hugel" code="C00042101" ... >
      | 10      ...
      |       </wine>
      |       <wine name="Chateau Montgueret" code="C10263859" ...>
      |         ...
      |       </wine>
      | 15    <wine name="Mumm Cordon Rouge" code="C00312363">
      |         ...
      |       </wine>
      |       <wine name="Prado Rey Roble" code="C00929026"...>
      |         ...
      | 20    </wine>
      |     </wine-catalog>
      +----------------------------------------------------------------------

      <owner> ... </owner>
      <location> ... </location>
      <cellar>
 9      <wine code="C00043125">...</wine>
        <wine code="C00312363">...</wine>
        <wine code="C10263859">...</wine>
        <wine code="C00929026">...</wine>
      </cellar>
14  </cellar-book>

    location: address of the cellar (if different from that of the owner)

    cellar: list of wine bottle lots (using codes from the wine catalog) and, for each, the quantity currently held in the cellar and the purchase date of the lot

WineCatalog.xml (listing 2.3) gives the description of each wine product with a code that will be matched by the ones of the cellar.

Listing 2.2 shows the content of the cellar-book as an XML instance document. The first line starting with <?xml is the XML declaration, written like a processing instruction, which indicates the XML version used³ and the encoding of the file, here UTF-8.

The real content of the file, corresponding to the tree structure storing the information, starts with the root element cellar-book (line 6). Its first child is the content of the file WineCatalog.xml (shown in listing 2.3), which is included at run-time via the element xi:include (line 9); it is followed by three other children: owner (line 11), location (line 21) and cellar (line 27).

The !DOCTYPE declaration (line 2), which is not a well-formed XML element, defines entities that can be used in the XML instance document. This notation will be explained further in section 3.1 but for the moment entities can be considered as text macros that will perform string substitutions before the XML file is processed. Substitution occurs when an entity is referred to by enclosing its name between & and ;. For example, entity guy (line 2) is replaced by Guy Lapalme when &guy; is encountered in the file. Entities can refer to other entities: &GL; (line 32) will be replaced by Guy Lapalme, Montréal. When an entity declaration is followed by SYSTEM and the name of a file, then a reference to this entity is replaced by the content of the file.
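The macro-like behaviour of entities can be observed with a short Java program. This is a sketch under the assumption that the standard JDK DOM parser is used with its default settings, which expand entity references; the class EntityDemo and the reduced sample document are ours and only mimic the declarations of listing 2.2.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example: entity references are replaced before the program sees the text.
class EntityDemo {

    // A reduced document with the same kind of entity declarations as listing 2.2.
    static final String SAMPLE =
          "<!DOCTYPE comment [\n"
        + "  <!ENTITY guy \"Guy Lapalme\">\n"
        + "  <!ENTITY mtl \"Montr\u00e9al\">\n"
        + "  <!ENTITY GL  \"&guy;, &mtl;\">\n"
        + "]>\n"
        + "<comment>&GL;: should reorder soon</comment>";

    /** Returns the character data of the root element after entity substitution. */
    static String expand(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expand(SAMPLE));
    }
}
```

The program prints Guy Lapalme, Montréal: should reorder soon, which shows that &GL; has been replaced by its value, itself obtained by substituting &guy; and &mtl;.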

Listing 2.2: [CellarBook.xml]: XML instance document describing the content of the cellar

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE cellar-book [<!ENTITY guy "Guy Lapalme">
      <!ENTITY eacute "é">
      <!ENTITY mtl "Montréal">
 5    <!ENTITY GL "&guy;, &mtl;">]>
    <cellar-book xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog"
        xsi:noNamespaceSchemaLocation="CellarBook.xsd">
      <xi:include href="WineCatalog.xml"
10        xmlns:xi="http://www.w3.org/2001/XInclude"/>
      <owner>
        <name>
          <first>Jude</first>
          <family>Raisin</family>
15      </name>
        <street>1234 rue des Chateaux</street>
        <city>St-George</city>
        <province>ON</province>
        <postal-code>M7W 7S0</postal-code>
20    </owner>
      <location>
        <street>4587 des Futailles</street>
        <city>Vallee des crus</city>
        <province>QC</province>
25      <postal-code>H3C 4J8</postal-code>
      </location>
      <cellar>
        <wine code="C00043125">
          <purchaseDate>2005-06-20</purchaseDate>
30        <quantity>2</quantity>
          <comment>
            <cat:bold>&GL;</cat:bold>: should reorder soon
          </comment>
        </wine>
35      ...........
        <wine code="C00929026">
          <purchaseDate>2003-10-15</purchaseDate>
          <quantity>1</quantity>
          <comment>for <cat:bold>big</cat:bold> parties</comment>
40      </wine>
      </cellar>
    </cellar-book>

³ Although XML version 1.1 exists, very few processors deal with it, so most of the time version="1.0" is used.

Listing 2.3 is the content of the catalog of available types of wines storing information such as their properties (color, alcoholic strength), their origin, their price and their year of production.⁴ Other information such as the name, the code and format are given as attributes within the start-tag. While the value of an element can be an arbitrarily complex tree of elements, attribute values can only be single string values. Strings for attribute values must be delimited by either matching ' or ". These delimiters have the same meaning and this convention is convenient when embedding a quote of one type within a string value. In case the two types of quotes are needed within a single string, one can use the predefined entities &apos; and &quot; (explained in section 3.1).
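As a small illustration of the quoting conventions for attribute values, the following sketch (the class QuoteEntities and the attribute text are hypothetical, not taken from the listings) reads an attribute whose value contains both kinds of quotes by means of the predefined entities:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example: both kinds of quotes inside one attribute value.
class QuoteEntities {

    /** Returns the value of the given attribute of the root element. */
    static String attributeValue(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The value is delimited by " so the inner double quotes must be written &quot;
        // the apostrophe could be written as is, but &apos; is also accepted.
        String xml = "<comment text=\"l&apos;Ile &quot;Margaux&quot;\"/>";
        System.out.println(attributeValue(xml, "text"));
    }
}
```

The parser replaces the entities, so the program receives the string l'Ile "Margaux" without having to know which delimiters were used in the file.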

The structure of an XML instance file may seem arbitrary and, in a sense, it is. In order to make sure that its processing is efficient, it is important that the structure of the information be in the right format (i.e. embedded within the correct tags and in the correct order) and that all the mandatory information be present. This verification could be done by the program using the information but it would be more helpful to detect errors or lack of information when the instance file is created. Thus the program needing the data can be sure that the file structure follows the expected format. This validation process, similar to the static type checking for a programming language, is explained in the next chapter, but first we will look at namespaces, another important concept in XML instance documents.

⁴ This information was inspired by data found on the web site of the Société des Alcools du Québec (SAQ), http://www.saq.com

Listing 2.3: [WineCatalog.xml]: XML instance document for the wine catalog; it is included in listing 2.2 at line 9

    <wine-catalog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.iro.umontreal.ca/lapalme/wine-catalog
                            WineCatalog.xsd"
        xmlns="http://www.iro.umontreal.ca/lapalme/wine-catalog">
 5    <wine name="Domaine de l'Ile Margaux" code="C00043125"
          classification="a.c" appellation="Bordeaux superieur"
          format="750ml">
        <properties>
          <color>red</color>
10        <alcoholic-strength>12.5</alcoholic-strength>
          <nature>still</nature>
        </properties>
        <origin>
          <country>France</country>
15        <region>Bordeaux</region>
          <producer>SCEA Domaine de L'Ile Margaux
            (B.P. 5)
          </producer>
        </origin>
20      <comment>Ready for drinking now</comment>
        <food-pairing>
          Accompanies <emph>Bordelaise ribsteak</emph>,
          <bold>pork with prunes</bold> or magret de canard.
        </food-pairing>
25      <price>22.80</price>
        <year>2002</year>
      </wine>
      ...
      <wine name="Prado Rey Roble" code="C00929026"
30        classification="d.o." appellation="Ribera-del-duero"
          format="magnum">
        <properties>
          <color>red</color>
          <alcoholic-strength>12.5</alcoholic-strength>
35        <nature>still</nature>
        </properties>
        <origin>
          <country>Spain</country>
          <region>Old Castille</region>
40        <producer>Real Sitio de Ventosilla SA</producer>
        </origin>
        <price>35.25</price>
        <year>2002</year>
      </wine>
45  </wine-catalog>

2.1 Namespaces

Namespaces allow a graceful combination of independent XML files. As can be seen in listing 2.1, listing 2.2 includes listing 2.3 via the element xi:include (line 9). These files both use the wine element in different ways:⁵ in listing 2.3, wine (line 5) refers to a description of a type of wine while in listing 2.2 wine (line 28) refers to a batch of bottles. So both references must be distinguished from one another in order to validate them with the appropriate XML Schema.

Each element name in an XML file is defined within a context, called a namespace, indicated as a prefix ending with a colon (:). The definition of a namespace prefix is done using attributes of the root element of an instance file defined in the xmlns namespace (how about the circularity of this definition!). For example, on line 6 of listing 2.2, two namespace prefixes are defined: xsi and cat, for which are given two arbitrary unique identifiers that will be used to distinguish their namespaces. Most often identifiers of namespaces are URLs (URIs more precisely) because the authors of an XML file use a URL designating a web site that they own. If authors take care not to use the same URL for different purposes, this pretty much guarantees the uniqueness of the namespaces. This does not necessarily mean that the URLs used as names for namespaces do exist. It must be remembered that the URL notation is nothing more than a useful convention, although this name can also be used by validators as a hint to find the corresponding schema.

By default, names without a prefix belong either to the empty namespace or to the namespace given as the value of the xmlns attribute. To create elements in a specific namespace without prefixing them, we assign a default namespace, as we did at the start of listing 2.3, by specifying a value for the xmlns attribute (line 4). In principle, any element can set a value for the xmlns attribute to change the default namespace, or declare the prefixes of new namespaces for nested elements. Namespace prefixes are thus inherited: the search for the URI corresponding to a prefix starts from the current element and follows the parent links in the tree until it finds a corresponding prefix declared with an xmlns attribute.
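The search along the parent links can be sketched in a few lines of Python using the standard xml.dom.minidom module. This is only an illustration under stated assumptions: the miniature document and the function name resolve_prefix are hypothetical, and a namespace-aware parser normally performs this lookup for us.

```python
from xml.dom import minidom

CAT_NS = "http://www.iro.umontreal.ca/lapalme/wine-catalog"

# Hypothetical miniature document: the cat prefix is declared on the root only.
DOC = f"""<cellar-book xmlns:cat="{CAT_NS}">
  <cellar><wine><comment><cat:bold>Guy</cat:bold></comment></wine></cellar>
</cellar-book>"""


def resolve_prefix(element, prefix):
    """Return the URI bound to `prefix`, searching from `element` up to the root.

    An empty prefix looks for the default namespace (the xmlns attribute).
    """
    attribute = "xmlns:" + prefix if prefix else "xmlns"
    node = element
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        if node.hasAttribute(attribute):
            return node.getAttribute(attribute)
        node = node.parentNode  # follow the parent link
    return None  # the prefix is not declared on any ancestor


document = minidom.parseString(DOC)
bold = document.getElementsByTagName("cat:bold")[0]
print(resolve_prefix(bold, "cat"))
print(resolve_prefix(bold, "xsi"))  # None: never declared in this document
```

The declaration is found four levels above the cat:bold element, which is why a single declaration at the root is enough for the whole file.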

As shown in listing 2.1, namespaces are most often declared at the root element of the file. In this listing, the box indicates the frontier of the namespace. An element outside the box must use a prefix to refer to an element inside the box. Within the box, no prefix is necessary because the namespace has been given a null prefix on line 4 within the box. It is possible to define a namespace on any element (it will then also apply to its subelements), but this makes it hard for the human reader to keep track of the current namespace of an element, even though a namespace-aware XML processor has no such problem because a namespace is associated with each element. For example, in listing 2.2, cat:bold (line 32) designates the bold element in the http://www.iro.umontreal.ca/lapalme/wine-catalog namespace. All elements in listing 2.3 also belong to this namespace, so its bold elements (line 23) are the same: as will be explained later, when they are processed by an XML system, they will be identified as being of the same type.

5 In such a simple case as this one, it would be an easy matter, and probably a better design, to have different names for these two concepts, but we want to illustrate the use of namespaces in a small-scale example. The same name clash would occur if we wanted to combine independently created XML files.

The use of namespaces will be better understood once we have seen their use in validation with schemas (section 3.2.4). An excellent short introduction to the concept of namespace can be found in [34, p. 160-166].
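To make concrete the claim that a prefixed cat:bold and an unprefixed bold in the same default namespace are the same element, here is a small sketch with Python's standard xml.etree.ElementTree, which represents every element name in the expanded form {URI}local-name. The two fragments are hypothetical and only mimic the situation of listings 2.2 and 2.3.

```python
import xml.etree.ElementTree as ET

CAT_NS = "http://www.iro.umontreal.ca/lapalme/wine-catalog"

# Hypothetical fragments: one uses the cat prefix, the other a default namespace.
WITH_PREFIX = f'<comment xmlns:cat="{CAT_NS}"><cat:bold>Guy</cat:bold></comment>'
WITH_DEFAULT = f'<comment xmlns="{CAT_NS}"><bold>Special</bold></comment>'

first = ET.fromstring(WITH_PREFIX)[0]    # the cat:bold element
second = ET.fromstring(WITH_DEFAULT)[0]  # the bold element in the default namespace

print(first.tag)                # {http://www.iro.umontreal.ca/lapalme/wine-catalog}bold
print(first.tag == second.tag)  # True: same namespace, same local name
```

The prefix itself has disappeared from the parsed tree: only the namespace identifier and the local name are compared.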


Chapter 3

Document Validation

As we have mentioned in the previous section, an XML file must be well-formed in order to be processed correctly. XML designers have created a more thorough checking method called validation that verifies whether an XML file is well-formed and, furthermore, ensures that the ordering and nesting of its elements obey certain rules. These rules are specified by a DTD or an XML Schema. Validation is done prior to any further processing so that programs that process an XML file do not waste time checking for such errors. An application is even allowed to stop any processing if it encounters an invalid XML file.

The author of an XML file can usually be warned of its invalidity at creation time. Validation can be done either within the XML text editor itself (e.g. XMLSpy [9], <oXygen/> [36] or the nXML mode in Emacs [18]) or by an external validator program (e.g. Xerces [10] or XSV [32]). XML editors can also play an active role in the creation of a valid XML file by suggesting, at each point, valid continuations (acceptable elements, attributes or values) according to the DTD or the XML Schema.

XML, like its ancestor SGML, defines the validation of a file with respect to a Document Type Definition (DTD) declared at the start of the file. Most often the DTD is an accompanying external document, so that different files can follow the same rules by sharing it. A DTD is relatively simple to define but the validation rules it can enforce are quite rudimentary: they can only constrain the nesting of elements and perform simple checks on the values of attributes. In order to validate the content of elements, XML designers have defined a more elaborate type system called a Schema, which is available in at least two technologies: XML Schema, presented in section 3.2, and RELAX NG, described in section 3.3.

3.1 Document Type Definition (DTD)

A DTD is a notation to define the elements that are allowed to appear in an XML file as well as the type of information they can contain. Table 3.1 gives an overview of some of the more frequent definitions of elements, attributes and entities that can appear in a DTD. These definitions are pseudo XML tags in the sense that they look like XML start-tags without their corresponding end-tags. For mainly historical reasons, DTDs are not well-formed XML files.

<!DOCTYPE rootElement SYSTEM "file.dtd" {[ !ENTITY* ]}? >

<!ELEMENT NCName ( {#PCDATA |}? regexpOf !ELEMENT ) >
<!ELEMENT NCName (#PCDATA) >
<!ELEMENT NCName EMPTY >

<!ATTLIST elementNCName attributeNCName declValue default >
    declValue = CDATA | ID | IDREF | (CNAME {| CNAME}+ )
    default   = #REQUIRED | #IMPLIED

<![CDATA[ ... ]]>

<!ENTITY name " ... ">
<!ENTITY % name " ... ">
<!ENTITY name SYSTEM "file.xml">

Table 3.1: A reminder of the subset of DTD syntax used in listings 3.1 and 3.2. CDATA is character data taken as is, but PCDATA is parsed character data that can contain references to entities. Names in italics refer to other elements. declValue and default above are not part of the DTD syntax; they are only useful abbreviations in this table. Regular expressions are used to describe the allowed forms: braces are used for grouping, ? indicates that the preceding grouping is optional, * that it can be repeated as often as necessary, possibly not at all, and + that it must appear at least once.

The types for DTDs are most often given as either:

• (#PCDATA) (Parsed Character DATA), which corresponds to character string information; parsed means that the character data can contain entity references, as explained below

• a regular expression in parentheses involving other element names.

The regular expression for the sequencing of children elements follows the now well-known conventions (footnote 1):

,      sequence
|      choice
( )    grouping of expressions
?      the previous expression is optional
*      repetition, possibly none, of the previous expression
+      repetition, at least once, of the previous expression

1 Regular expressions used in table 3.1 to define what can appear in a DTD should be distinguished from the regular expressions used in the DTDs themselves, even though they use the same symbols with the same meaning. We have used two different fonts (a sans-serif font is used for meta regular expressions) but they can be hard to distinguish in some cases. The context should make clear which type of regexp is referred to in each case.


Listing 3.1 is a DTD that validates the XML instance document given in listing 2.2. Elements are defined with an !ELEMENT tag, see wine (line 5), containing a regular expression that constrains its children elements: a wine element must contain a purchaseDate element followed by a quantity, optionally followed by a rating and then by a comment, for at most four children in this order. Elements purchaseDate (line 6) or city (line 28) can contain character data and no other elements. A cellar (line 3) is a list (possibly empty) of wine elements. A name (line 12) is a non-empty list of first, initial or family elements in any order; these elements can even be repeated, which shows the limitations on the types of constraints that can be easily represented with a DTD.
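Since a content model is a regular expression over element names, the check made by a DTD validator on the children of an element can be imitated with an ordinary regular expression engine. The following Python sketch is a simplification under stated assumptions: it handles only content models made of element names (no #PCDATA, EMPTY or parameter entities), and the function names are ours, not those of any XML library.

```python
import re

WINE = "(purchaseDate,quantity,rating?,comment?)"  # wine, listing 3.1
NAME = "(first | family | initial)+"               # name, listing 3.1


def content_model_to_regex(model):
    """Translate a DTD content model (element names only) into a Python regex."""
    pattern = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[(),|?*+]", model):
        if token == ",":
            continue                   # a sequence is plain concatenation
        elif token == "(":
            pattern.append("(?:")      # non-capturing group
        elif token in ")|?*+":
            pattern.append(token)      # same meaning in both notations
        else:
            # each child name is encoded as the name followed by one space
            pattern.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(pattern))


def children_match(model, children):
    """Check a list of child element names against a content model."""
    encoded = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(encoded) is not None


print(children_match(WINE, ["purchaseDate", "quantity", "comment"]))  # True
print(children_match(WINE, ["quantity", "purchaseDate"]))             # False
print(children_match(NAME, ["family", "first", "first"]))             # True
print(children_match(NAME, []))                                       # False
```

The third call shows the limitation mentioned above: the content model of name cannot prevent a first element from appearing twice.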

Attributes are defined using !ATTLIST tags indicating the element to which they belong, their name, their type and whether they are mandatory (#REQUIRED) or optional (#IMPLIED). See for example the !ATTLIST for the code attribute (line 10) of the wine element.

A DTD can also contain definitions of entities, which act as text macros that are replaced textually either in the instance document or in the DTD itself. Entities whose definitions start with <!ENTITY, such as guy (line 20) (already illustrated in listing 2.2), define textual replacements that occur when they are called, i.e. when their name appears between & and ;. This entity mechanism is necessary in order to be able to insert a less-than sign (< typed as &lt;) in an XML file because < is reserved to indicate the start of a tag. So now we also need a way to insert an ampersand (& typed as &amp;), which indicates the start of an entity. Three other predefined entities also exist for XML files: &quot; for ", &apos; for ' and &gt; for > (this last one by symmetry with &lt; even though it is not strictly needed).
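The textual replacement of entities can be observed with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree on a hypothetical document that declares, in an internal subset, entities named like those of listing 3.1 (the accent of Montréal is dropped here to keep the example in plain ASCII).

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entities are declared in an internal DTD subset.
DOC = """<!DOCTYPE owner [
  <!ENTITY guy "Guy Lapalme">
  <!ENTITY mtl "Montreal">
  <!ENTITY GL "&guy;, &mtl;">
]>
<owner><who>&GL;</who><note>1 &lt; 2 &amp; "3" &gt; 2</note></owner>"""

owner = ET.fromstring(DOC)
print(owner.find("who").text)   # Guy Lapalme, Montreal
print(owner.find("note").text)  # 1 < 2 & "3" > 2
```

The GL entity is itself defined in terms of two other entities, and the application only ever sees the fully expanded text.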

Macro replacements are also quite useful to modularize DTDs but, in order to be used within the definitions of a DTD, they must be distinguished from ordinary entities; this different type of entity is called a parameter entity. Its definition has a percent sign between !ENTITY and the name of the parameter entity, which is followed by its replacement text; see address (line 26). Its call is preceded by a percent sign instead of an ampersand (see owner (line 32) and location (line 33)). Another special kind of entity, indicated by SYSTEM, refers to a file, as in wine-catalog (line 35). This entity can then be used to include a file, as shown on the last line of listing 3.1, which includes the file given in listing 3.2.

Listing 3.1: [CellarBook.dtd]: DTD for the cellar book. It can validate the instance file in listing 2.2. ELEMENTs and ATTLISTs are independent; indentation is ignored by the DTD processor and is used here only to highlight some inclusion dependencies for the human reader.

    <?xml version="1.0" encoding="UTF-8"?>

    <!ELEMENT cellar (wine)* >

5   <!ELEMENT wine (purchaseDate,quantity,rating?,comment?) >
      <!ELEMENT purchaseDate (#PCDATA) >
      <!ELEMENT quantity (#PCDATA) >
      <!ELEMENT rating EMPTY >
        <!ATTLIST rating stars CDATA #IMPLIED >
10    <!ATTLIST wine code IDREF #REQUIRED >

    <!ELEMENT name (first | family | initial)+ >
      <!ELEMENT first (#PCDATA) >
      <!ELEMENT family (#PCDATA) >
15    <!ELEMENT initial (#PCDATA) >

    <!ELEMENT cellar-book (wine-catalog, owner, location, cellar) >


20  <!ENTITY guy "Guy Lapalme" >
    <!ENTITY eacute "é" >
    <!ENTITY mtl "Montréal" >
    <!ENTITY GL "&guy;, &mtl;" >

25
    <!ENTITY % address "(street, city, province, postal-code)" >
      <!ELEMENT street (#PCDATA) >
      <!ELEMENT city (#PCDATA) >
      <!ELEMENT province (#PCDATA) >
30    <!ELEMENT postal-code (#PCDATA) >

    <!ELEMENT owner (name,%address;) >
    <!ELEMENT location %address; >

35  <!ENTITY % wine-catalog SYSTEM "WineCatalog.dtd" >
    %wine-catalog;

We now look at the validation of the wine catalog (listing 3.2). Given that all element names must be unique in a DTD (there are no namespaces in DTDs), the wine element of the catalog must be given a name different from that of the wine element of listing 3.1. Here we decided to call it cat-wine (line 4). The attribute format (line 10) shows an example of an enumeration of values from which the attribute value must be chosen. The link between the wine and the cat-wine elements is made through the code attribute (line 13) of listing 3.2, of type ID, and its reference in the code attribute (line 10) of listing 3.1, which is of type IDREF. In an XML file, all values of type ID must be distinct and values of type IDREF must refer to an existing ID.
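The two rules on ID and IDREF values can be expressed as a small check. The Python sketch below is not a DTD validator: it assumes that we already know which attributes are of type ID (code on cat-wine) and IDREF (code on wine), and the instance, including the codes other than C00929026, is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance mixing catalog entries (ID) and cellar entries (IDREF).
DOC = """<cellar-book>
  <wine-catalog>
    <cat-wine code="C00929026"/>
    <cat-wine code="C00043125"/>
  </wine-catalog>
  <cellar>
    <wine code="C00929026"/>
    <wine code="C00000000"/>
  </cellar>
</cellar-book>"""


def check_ids(root):
    """Return (duplicated ID values, IDREF values without a matching ID)."""
    ids = [element.get("code") for element in root.iter("cat-wine")]
    refs = [element.get("code") for element in root.iter("wine")]
    known = set(ids)
    duplicates = sorted({value for value in ids if ids.count(value) > 1})
    dangling = sorted(value for value in refs if value not in known)
    return duplicates, dangling


print(check_ids(ET.fromstring(DOC)))  # ([], ['C00000000'])
```

Here all IDs are distinct, but the second wine of the cellar refers to a code that is not in the catalog, so a validating parser would reject the document.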

Listing 3.2: [WineCatalog.dtd]: DTD to validate the instance file in listing 2.3. It is included in listing 3.1. ELEMENTs and ATTLISTs are independent; indentation is ignored by the DTD processor and is used here only to highlight some inclusion dependencies for the human reader.

    <?xml version="1.0" encoding="UTF-8"?>
    <!ELEMENT wine-catalog (cat-wine*) >

    <!ELEMENT cat-wine (properties, origin,
5                       (tasting-note?,food-pairing?,comment?)*,
                        price,year) >
      <!ATTLIST cat-wine name CDATA #REQUIRED >
      <!ATTLIST cat-wine appellation CDATA #IMPLIED >
      <!ATTLIST cat-wine classification CDATA #IMPLIED >
10    <!ATTLIST cat-wine format (375ml | 750ml | 1l | magnum | jeroboam
                                 | rehoboam | mathusalem | salmanazar
                                 | balthazar | nabuchodonosor) #REQUIRED >
      <!ATTLIST cat-wine code ID #REQUIRED >
      <!ELEMENT properties (color,alcoholic-strength,nature?) >
15      <!ELEMENT color (#PCDATA) >
        <!ELEMENT alcoholic-strength (#PCDATA) >
        <!ELEMENT nature (#PCDATA) >
      <!ELEMENT origin (country,region,producer) >
        <!ELEMENT country (#PCDATA) >
20      <!ELEMENT region (#PCDATA) >
        <!ELEMENT producer (#PCDATA) >

    <!ENTITY % Comment "(#PCDATA | emph | bold)*" >
      <!ELEMENT emph (#PCDATA) >
25    <!ELEMENT bold (#PCDATA) >

    <!ELEMENT comment %Comment; >
    <!ELEMENT tasting-note %Comment; >
    <!ELEMENT food-pairing %Comment; >
30
    <!ELEMENT price (#PCDATA) >
    <!ELEMENT year (#PCDATA) >

3.1.1 Associating an Instance File to DTD

The link between a DTD and an XML file that it validates can be made externally using an XML editor, but most DTD validators insist that we add a !DOCTYPE declaration at the start of the XML file. For example, one can use declarations such as the following:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cellar-book SYSTEM "CellarBook.dtd" [
  <!ENTITY WC SYSTEM "WineCatalogContentNoNS.xml" >
  <!ENTITY CB SYSTEM "CellarBookContentNoNS.xml" >
]>
<cellar-book>
  <wine-catalog>&WC;</wine-catalog>
  &CB;
</cellar-book>

In the !DOCTYPE declaration, the name of the root element of the XML instance document comes first, followed by the keyword SYSTEM and a reference to the DTD file. In the previous example, we have also put the content of the wine catalog and of the cellar book in separate files that are included as system entities. These lines will be seen as a complete XML file (in fact listing 2.2) by a program using the standard XML tools and APIs.
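The parts of a !DOCTYPE declaration are also available to programs. As a minimal sketch, Python's standard xml.dom.minidom exposes them on the parsed document; note that this non-validating parser only records the declaration and does not read CellarBook.dtd, so the hypothetical one-element document below is accepted even though it would not be valid.

```python
from xml.dom import minidom

# Hypothetical instance: only the declaration matters here, not the content.
DOC = """<!DOCTYPE cellar-book SYSTEM "CellarBook.dtd">
<cellar-book/>"""

document = minidom.parseString(DOC)
doctype = document.doctype

print(doctype.name)      # cellar-book (the declared root element)
print(doctype.systemId)  # CellarBook.dtd (the reference to the DTD file)
print(doctype.name == document.documentElement.tagName)  # True
```

A validator would start by making the last comparison: the root element of the instance must be the one named in the declaration.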

3.2 Schema

As we have seen in the previous section, a DTD describes some constraints on the order and nesting of elements in an XML file, but the kinds of constraints it can express are quite limited and it does not allow any validation of the character content of elements. There are also other drawbacks: all element names in a DTD must be unique, so combining separately developed DTDs can become quite cumbersome. Moreover, a DTD file is not a well-formed XML file, so one cannot easily use an XML tool to create or process it. This is why XML Schema was introduced, with a comprehensive set of elementary types and a way to combine them to create new types. The concept of namespaces (presented in section 2.1) is also used in order to facilitate the combination of independent files without name clashes.

A Schema is a well-formed XML file (usually with a .xsd extension) that defines types which are used to validate the elements of the XML file. In a way similar to variable declarations in a programming language, we can define types (footnote 2) for many elements instead of using inline definitions of embedded elements. In a Schema, there are two kinds of types: simple and complex. A simple type defines constraints on the text content of an element, which cannot contain any element. A complex type can contain nested elements.

There are many different ways of organizing a Schema, as described by Van der Vlist [33]: one can use a Russian doll approach, in which a single element is defined with all its embedded elements defined internally; a bottom-up approach, in which elements are defined before being used in more complex elements; or a top-down approach, which first defines the higher-level elements before defining the lower-level ones. All these styles of definition are possible and we will sometimes use a mix of them in order to show some features of XML Schema.

Table 3.2 presents the XML elements we use in our example to define the types needed for the validation of our wine catalog and cellar book. Since a schema is itself an XML file, it is important to distinguish the elements defining the Schema from the elements being defined. This is done by using different namespaces: the defining elements (definiens) use the xs: prefix (xsd: is also commonly used) and the defined elements (definiendum) have no prefix, i.e. they are in the default namespace. Unlike a DTD, an XML Schema is itself an XML file, so it can be validated using the XML Schema of XML Schemas, which is usually included in all XML editors.

2 We follow the Java convention of starting type identifiers with an upper case letter. Element identifiers start with a lower case letter. In a name comprising more than one word, each word starts with an upper case letter; no underscores or dashes are used.

<xs:schema targetNamespace="URI">
    xs:import* {xs:simpleType | xs:complexType | xs:element | xs:group}*
</xs:schema>

<xs:import namespace="URI" schemaLocation="URI"/>

<xs:simpleType name="NCName">
    xs:restriction
</xs:simpleType>

<xs:complexType name="NCName" mixed="true"?>
    {xs:choice | xs:sequence | xs:group}? xs:attribute*
</xs:complexType>

<xs:element name="QName" type="TName"/>
<xs:element ref="EName"/>
<xs:element name="QName">
    {xs:simpleType | xs:complexType}? {xs:unique | xs:key | xs:keyref}*
</xs:element>

<xs:sequence minOccurs|maxOccurs="nonNegativeInteger|unbounded">
    {xs:element | xs:choice | xs:sequence | xs:group}*
</xs:sequence>

<xs:choice minOccurs|maxOccurs="nonNegativeInteger|unbounded">
    {xs:element | xs:choice | xs:sequence | xs:group}*
</xs:choice>

<xs:group name="NCName">
    xs:choice | xs:sequence
</xs:group>

<xs:attribute name="NCName" type="TName" use="required"?/>

<xs:restriction base="TName">
      <xs:maxInclusive|minInclusive|maxExclusive|minExclusive value="anySimpleType"/>
    | <xs:maxLength|minLength|length value="nonNegativeInteger"/>
    | <xs:pattern value="regExp"/>
    | <xs:enumeration value="anyValue"/>
</xs:restriction>

<xs:unique|key name="NCName">
    xs:selector xs:field+
</xs:unique|key>

<xs:keyref name="NCName" refer="NCName">
    xs:selector xs:field+
</xs:keyref>

<xs:selector|field xpath="XPathExpr"/>

Table 3.2: A reminder of the subset of XML Schema syntax used in listings 3.3 and 3.4. Names in italics refer to other elements. An NCName (non-colonized name) is a name without a namespace prefix. Regular expressions are used to describe the allowed forms: braces are used for grouping, ? indicates that the preceding grouping is optional, * that it can be repeated and + that it can be repeated but must appear at least once.

An XML Schema has an xs:schema element as root, which can contain different kinds of definition elements:

• xs:import allows the combination of different schemas into a single one; in our case, the schema for the wine catalog is imported into the one for the cellar book

• xs:simpleType gives supplementary constraints on predefined types; this is explained further in section 3.2.1

• xs:complexType defines a new type in terms of a choice or a sequence of other types; xs:group gives a name to an incomplete type. xs:attribute elements are given at the end of the definition, even though attributes appear in the start-tag

• xs:element is the fundamental way of defining an element that can appear in an instance file. It can be given with a name and a type, it can refer to another element definition, or it can be defined with an anonymous simple or complex type followed by key and keyref definitions

• xs:sequence (resp. xs:choice) combines other elements by making sure that they occur sequentially (resp. alternatively, i.e. only one of the elements can appear)

• xs:group clusters elements that can be used together

• xs:attribute gives the name and the simple type associated with an attribute. Attributes are unordered, and they are optional unless their use attribute is given the value required

• xs:restriction gives range or pattern constraints on the value of a simple type; xs:enumeration limits the allowed values to those of a given list

• xs:key, xs:unique and xs:keyref define co-occurrence constraints that will be explained in section 3.2.3.

In the rest of this section, we first give in their entirety the XML Schemas (listings 3.3 and 3.4) that correspond to the DTDs given in listings 3.1 and 3.2 respectively. Figure 3.1 gives the overall structure of the XML Schema of listing 3.3 and figure 3.2 that of the Schema in listing 3.4. As we will see, the validation of the text content of the elements can be much more thorough with an XML Schema than with a DTD.

We will then explain the structure of the type system: first simple types (section 3.2.1), then complex types (section 3.2.2) and finally how to define keys and their references (section 3.2.3).


Figure 3.1: Graphical view of the Schema for the cellar book (listing 3.3). A name in a rectangular box is an element name or, if preceded by @, an attribute name. A complex type name is preceded by a square and a simple type name by a triangle. A sequence is shown with 4 dots horizontally aligned in a hexagon and a choice with the 4 dots aligned vertically (see figure 4.3 for an example). A + after a box indicates that further details have been omitted. Three small squares in front of an element name indicate that its definition is referred to somewhere else in the schema; the reference is indicated by a small arrow at the bottom right of the rectangle. The figure was produced by the <oXygen/> XML editor from the XML Schema file given in listing 3.3.


Listing 3.3: [CellarBook.xsd]: XML Schema for the cellar book. It can validate the instance file in listing 2.2. It can be compared with the DTD in listing 3.1.

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog">

5     <xs:import namespace="http://www.iro.umontreal.ca/lapalme/wine-catalog"
                 schemaLocation="WineCatalog.xsd"/>

      <xs:element name="cellar">
        <xs:complexType>
10        <xs:sequence minOccurs="0" maxOccurs="unbounded">
            <xs:element name="wine" type="Wine"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
15
      <xs:complexType name="Wine">
        <xs:sequence>
          <xs:element name="purchaseDate" type="xs:date"/>
          <xs:element name="quantity" type="xs:nonNegativeInteger"/>
20        <xs:element name="rating" minOccurs="0">
            <xs:complexType>
              <xs:attribute name="stars" type="xs:positiveInteger"/>
            </xs:complexType>
          </xs:element>
25        <xs:element name="comment" type="cat:Comment" minOccurs="0"/>
        </xs:sequence>
        <xs:attribute name="code" type="cat:SAQ-code" use="required"/>
      </xs:complexType>

30    <xs:element name="name">
        <xs:complexType>
          <xs:sequence maxOccurs="unbounded">
            <xs:choice>
              <xs:element name="first" type="xs:string"/>
35            <xs:element name="family" type="xs:string"/>
              <xs:element name="initial" type="xs:string"/>
            </xs:choice>
          </xs:sequence>
        </xs:complexType>
40    </xs:element>

      <xs:element name="cellar-book">
        <xs:complexType>
          <xs:sequence>
45          <xs:element ref="cat:wine-catalog"/>
            <xs:element name="owner" type="Owner"/>
            <xs:element name="location" minOccurs="0">
              <xs:complexType>
                <xs:group ref="address"/>
50            </xs:complexType>
            </xs:element>
            <xs:element ref="cellar"/>
          </xs:sequence>
        </xs:complexType>
55      <xs:keyref refer="cat:WineNumber" name="SAQ-UPC">


          <xs:selector xpath="cellar/wine"/>
          <xs:field xpath="@code"/>
60      </xs:keyref>
      </xs:element>

      <xs:group name="address">
        <xs:sequence>
65        <xs:element name="street" type="xs:string"/>
          <xs:element name="city" type="xs:string"/>
          <xs:element name="province" type="ProvinceCA"/>
          <xs:element name="postal-code" type="PostalCodeCA"/>
        </xs:sequence>
70    </xs:group>

      <xs:simpleType name="ProvinceCA">

        <xs:restriction base="xs:string">
75        <xs:enumeration value="AB"/>
          <xs:enumeration value="BC"/>
          <xs:enumeration value="MB"/>
          <xs:enumeration value="NB"/>
          <xs:enumeration value="NL"/>
80        <xs:enumeration value="NT"/>
          <xs:enumeration value="NS"/>
          <xs:enumeration value="NU"/>
          <xs:enumeration value="ON"/>
          <xs:enumeration value="QC"/>
85        <xs:enumeration value="SK"/>
          <xs:enumeration value="YT"/>
        </xs:restriction>
      </xs:simpleType>

90    <xs:complexType name="Owner">
        <xs:sequence>
          <xs:element ref="name"/>
          <xs:group ref="address"/>
        </xs:sequence>
95    </xs:complexType>

      <xs:simpleType name="PostalCodeCA">
        <xs:restriction base="xs:string">
          <xs:pattern value="[A-Z][0-9][A-Z] [0-9][A-Z][0-9]"/>
100     </xs:restriction>
      </xs:simpleType>
    </xs:schema>


Figure 3.2: Graphical view of the Schema for the wine catalog (listing 3.4). See the caption of figure 3.1 for an explanation of the symbols used in the figure.


Listing 3.4: [WineCatalog.xsd]: Schema for the wine catalog. It can validate the instance document shown in listing 2.3. It can be compared with the DTD in listing 3.2.

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
3              elementFormDefault="qualified"
               attributeFormDefault="unqualified"
               xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog"
               targetNamespace="http://www.iro.umontreal.ca/lapalme/wine-catalog">

8     <xs:import namespace="http://www.w3.org/XML/1998/namespace"
                 schemaLocation="http://www.w3.org/2001/03/xml.xsd"/>


      <xs:element name="wine-catalog">
13      <xs:complexType>
          <xs:sequence minOccurs="0" maxOccurs="unbounded">
            <xs:element name="wine" type="cat:Wine"/>
          </xs:sequence>

18        <xs:attribute ref="xml:base"/>
        </xs:complexType>
        <xs:key name="WineNumber">
          <xs:selector xpath="cat:wine"/>

23        <xs:field xpath="@code"/>
        </xs:key>
        <xs:unique name="WineName">
          <xs:selector xpath="cat:wine"/>
          <xs:field xpath="@name"/>
28        <xs:field xpath="@appellation"/>
        </xs:unique>
      </xs:element>

      <xs:complexType name="Wine">
33      <xs:sequence>
          <xs:element name="properties" type="cat:Properties"/>
          <xs:element name="origin" type="cat:Origin"/>
          <xs:choice minOccurs="0" maxOccurs="unbounded">
            <xs:element name="tasting-note"
38                      type="cat:Comment" minOccurs="0"/>
            <xs:element name="food-pairing"
                        type="cat:Comment" minOccurs="0"/>
            <xs:element name="comment"
                        type="cat:Comment" minOccurs="0"/>
43        </xs:choice>
          <xs:element name="price" type="xs:decimal"></xs:element>
          <xs:element name="year" type="xs:gYear"/>
        </xs:sequence>
        <xs:attribute name="name" type="xs:string" use="required"/>
48      <xs:attribute name="appellation" type="xs:string"/>
        <xs:attribute name="classification" type="xs:string"/>
        <xs:attribute name="code" type="cat:SAQ-code"></xs:attribute>
        <xs:attribute name="format" type="cat:Format"/>
      </xs:complexType>
53
      <xs:complexType name="Properties">
        <xs:sequence>
          <xs:element name="color" type="cat:Color"/>
          <xs:element name="alcoholic-strength" type="cat:Percentage"/>
58        <xs:element name="nature" type="xs:string" minOccurs="0"/>
        </xs:sequence>
      </xs:complexType>

      <xs:complexType name="Origin">
63      <xs:sequence>
          <xs:element name="country" type="xs:string"/>
          <xs:element name="region" type="xs:string"/>
          <xs:element name="producer" type="xs:string"/>
        </xs:sequence>
68    </xs:complexType>

      <xs:simpleType name="Format">
        <xs:restriction base="xs:string">
          <xs:enumeration value="375ml"/>
73        <xs:enumeration value="750ml"/>
          <xs:enumeration value="1l"/>
          <xs:enumeration value="magnum">
            <xs:annotation>
              <xs:documentation> 1.5 litres </xs:documentation>
78          </xs:annotation>
          </xs:enumeration>
          <xs:enumeration value="jeroboam">
            <xs:annotation>
              <xs:documentation> 3 litres </xs:documentation>
83          </xs:annotation>
          </xs:enumeration>
          <xs:enumeration value="rehoboam">
            <xs:annotation>
              <xs:documentation> 4.5 litres </xs:documentation>
88          </xs:annotation>
          </xs:enumeration>
          <xs:enumeration value="mathusalem">
            <xs:annotation>
              <xs:documentation> 6 litres </xs:documentation>
93          </xs:annotation>
          </xs:enumeration>
          <xs:enumeration value="salmanazar">
            <xs:annotation>
              <xs:documentation> 9 litres </xs:documentation>
98          </xs:annotation>
          </xs:enumeration>
          <xs:enumeration value="balthazar">
            <xs:annotation>
              <xs:documentation>12 litres </xs:documentation>
103         </xs:annotation>
          </xs:enumeration>
          <xs:enumeration value="nabuchodonosor">
            <xs:annotation>
              <xs:documentation>15 litres </xs:documentation>
108         </xs:annotation>
          </xs:enumeration>
        </xs:restriction>
      </xs:simpleType>

113   <xs:complexType name="Comment" mixed="true">
        <xs:sequence minOccurs="0" maxOccurs="unbounded">
          <xs:choice>
            <xs:element name="emph" type="xs:string"/>
            <xs:element name="bold" type="xs:string"/>
118       </xs:choice>
        </xs:sequence>
      </xs:complexType>

      <xs:simpleType name="SAQ-code">
123     <xs:restriction base="xs:string">
          <xs:pattern value="C\d{8}"/>
        </xs:restriction>
      </xs:simpleType>

128   <xs:simpleType name="Color">
        <xs:restriction base="xs:string">
          <xs:enumeration value="red"/>
          <xs:enumeration value="white"/>
          <xs:enumeration value="rosé"/>
133     </xs:restriction>
      </xs:simpleType>

      <xs:simpleType name="Percentage">
        <xs:restriction base="xs:decimal">
138       <xs:minInclusive value="0"/>
          <xs:maxInclusive value="100"/>
          <xs:fractionDigits value="2"/>
        </xs:restriction>
      </xs:simpleType>
143 </xs:schema>

3.2.1 Simple Types

A simple type is a primitive datatype such as xs:string, xs:decimal, xs:double, xs:date (XML Schema has 19 of them, shown in figure 3.3) or a derivation of a primitive datatype. A derivation is a restriction on the original type such as constraining the maximum length of a string, giving a list of acceptable values, or requiring that the value match a regular expression. Figure 3.3 shows a number of built-in derived types: xs:normalizedString, xs:integer and all types that are derived from them (i.e. that appear under them). Users can also define their own simple types using the xs:simpleType element.

We can see uses of simple types in listing 3.3: stars (line 22), which must be not only an integer but a positive one, and first (line 34), which is a string (essentially the same thing as #PCDATA in a DTD). The listing also contains definitions of simple types: ProvinceCA (line 72) constrains a string to be one of a list of choices, and PostalCodeCA (line 97) requires the string to match a regular expression.
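What a validator checks for these simple types can be approximated by ordinary string tests. The Python sketch below is a simplification written for illustration: the function names are ours, whitespace normalization is ignored, and an xs:pattern must match the whole value, which is why fullmatch is used.

```python
import re

# Values of the ProvinceCA enumeration and PostalCodeCA pattern of listing 3.3.
PROVINCES = {"AB", "BC", "MB", "NB", "NL", "NT", "NS", "NU", "ON", "QC", "SK", "YT"}
POSTAL_CODE = re.compile(r"[A-Z][0-9][A-Z] [0-9][A-Z][0-9]")


def is_positive_integer(text):
    """Simplified xs:positiveInteger: optional + sign, digits, value above zero."""
    return re.fullmatch(r"\+?[0-9]+", text) is not None and int(text) > 0


def is_province(text):
    """Enumeration: the value must be one of the listed strings."""
    return text in PROVINCES


def is_postal_code(text):
    """Pattern: the whole value must match the regular expression."""
    return POSTAL_CODE.fullmatch(text) is not None


print(is_positive_integer("4"), is_positive_integer("0"))       # True False
print(is_province("QC"), is_province("Quebec"))                 # True False
print(is_postal_code("H3C 3J7"), is_postal_code("h3c 3j7"))     # True False
```

None of these checks is possible with the DTD of listing 3.1, where stars, province and postal-code are simply character data.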

A new simple type can also be created using a list (allowing a series of primitive type values) or a union (allowing one of many primitive types). It is thus possible to define a whole gamut of types. These are quite straightforward if one refers to the specification [31, 13], so they will not be described further here.

3.2.2 Complex Types

A complex type can contain element declarations, element references and attribute declarations. We will illustrate some of these possibilities with listing 3.3.

An element declaration is done with an xs:element giving the name of the element and its type, which can either be defined inline as the content of the xs:element, as for cellar (line 8), or be indicated with the type attribute, as for wine (line 11) or purchaseDate (line 18).

A complex type is defined either by a sequence of elements contained in an xs:sequence element (e.g. cellar (line 8)) or by a choice between many elements contained in an xs:choice element, such as within name (line 30). Attributes are defined after the definitions of the elements in sequence or in choice, even though they appear in the start-tag (see code (line 27), an attribute of wine (line 11)).


Figure 3.3: Built-in datatypes for XML Schema. ur-types serve as the root of the type hierarchy for all derivations. ur is the German prefix meaning ancestral, as in Ursprung (beginning). Figure taken from section 3 of XML Schema Part 2: Datatypes [13].


xs:choice and xs:sequence can be nested. For example, name (line 30) indicates a choice between three elements of type string, first, family and initial, which can be repeated any number of times. Indeed, although an element or a group occurs exactly once by default (i.e. minOccurs="1" and maxOccurs="1"), the enclosing xs:sequence is given maxOccurs="unbounded", so the choice can be repeated and each element can thus appear as often as we wish.

An existing element can also be referred to using the ref attribute, like name (line 30), which is used in Owner (line 90). But be aware that, in this case, if you had mistakenly used the name attribute instead of ref, you would have declared a new element with no connection to the one you wished to reference; this can lead to errors that are difficult to track down.

In listing 3.4, the mixed="true" attribute in the definition of a type (see Comment (line 113)) means that character data can also appear between the elements described by the content of the type. In this case, character data can thus be interspersed with any number of emph and bold elements.
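Mixed content is visible in the parsed tree as character data placed before, between and after the child elements. As a small sketch, Python's standard xml.etree.ElementTree stores the text preceding the first child in the text attribute and the text following each child in its tail attribute; the comment below is a hypothetical instance of the Comment type.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the mixed Comment type of listing 3.4.
COMMENT = "<comment>A <emph>fruity</emph> wine, <bold>not</bold> too dry.</comment>"


def mixed_content(element):
    """List the character data and child elements of `element` in document order."""
    parts = [("text", element.text)] if element.text else []
    for child in element:
        parts.append((child.tag, child.text))
        if child.tail:
            parts.append(("text", child.tail))  # character data after the child
    return parts


for kind, value in mixed_content(ET.fromstring(COMMENT)):
    print(kind, repr(value))
```

This prints five items: the text 'A ', the emph element, the text ' wine, ', the bold element and the text ' too dry.'.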

3.2.3 Keys and Keyrefs

As shown at the bottom left of figure 3.3, the ID and IDREF types of DTDs are built-in XML Schema types and thus allow the simple uniqueness and reference constraints that we explained in section 3.1. But XML Schema has also defined a much more involved (footnote 3) system using xs:key and xs:unique elements to define uniqueness constraints on values and xs:keyref to refer to these values.

Within the wine catalog (listing 3.4), to ensure that each wine has a different code attribute, we add constraints after the xs:complexType element within the wine-catalog element (line 12):

• xs:key element for which we give a name WineNumber (line 20) to be used for referencing; WineNumber will never appear in the XML instance file, it is used internally by the validator. A key is defined in two parts: an xs:selector which identifies the scope within which the key must appear only once and an xs:field which indicates the value that will be used in the equality comparisons for the keys. If more than one xs:field element is present, they are considered as forming a tuple of values that must be distinct, i.e. they must be different in at least one of their components. The values designated in these elements are indicated by an XPath expression⁴ associated with the xpath attribute.

• xs:unique element using xs:selector and xs:field elements as for xs:key. xs:unique defines the same type of constraints as an xs:key except that the values so defined cannot be referenced by xs:keyref elements. On line 25, we ensure that the combination of the name and appellation attributes is unique for each wine.

³ In fact so involved and complex that RELAX NG designers decided to leave it out of their proposition.

⁴ XPath syntax will be explained in section 4.1 but, for the moment, we only need to know that each level in the tree is separated by a forward slash; each element is designated by its name, an attribute name is preceded by a @.

The wine code identified by WineNumber will be used in the description of the cellar (listing 3.3). It is the value associated with the code attribute (line 27) of the Wine type (line 16) used to define element wine (line 11). To make the wine code of the cellar match a code in the catalog, we define an xs:keyref element with xs:selector and xs:field sub-elements (as we have done for xs:key and xs:unique) but, in this case, the value identified must match an existing value of an xs:key element.

Considering the above, we would expect the definition of the xs:keyref element to appear after the type definition of the cellar element (line 8). But, for implementation reasons, the xs:keyref element should appear at a level high enough so that it covers the uniqueness domain of the key (the whole catalog in this case). This is why the xs:keyref is defined (line 57) within the cellar-book element.⁵
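The interplay of xs:key and xs:keyref can be tried out with the validator that ships with the JDK. The following is only a minimal sketch under stated assumptions: the schema is a hypothetical, much reduced version of our cellar book (no namespaces, wines carrying only a code attribute), both constraints are placed on the cellar-book element for simplicity, and the names KeyrefDemo, Wines, CellarWine, GOOD, DUPLICATE and DANGLING are invented for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class KeyrefDemo {
    // Hypothetical reduced schema (assumption): no namespaces, only code attributes.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:complexType name='Wines'><xs:sequence>"
      + "<xs:element name='wine' minOccurs='0' maxOccurs='unbounded'>"
      + "<xs:complexType>"
      + "<xs:attribute name='code' type='xs:string' use='required'/>"
      + "</xs:complexType>"
      + "</xs:element></xs:sequence></xs:complexType>"
      + "<xs:element name='cellar-book'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='wine-catalog' type='Wines'/>"
      + "<xs:element name='cellar' type='Wines'/>"
      + "</xs:sequence></xs:complexType>"
      // each catalog wine must have a distinct code
      + "<xs:key name='WineNumber'>"
      + "<xs:selector xpath='wine-catalog/wine'/><xs:field xpath='@code'/>"
      + "</xs:key>"
      // each cellar wine must refer to an existing catalog code
      + "<xs:keyref name='CellarWine' refer='WineNumber'>"
      + "<xs:selector xpath='cellar/wine'/><xs:field xpath='@code'/>"
      + "</xs:keyref>"
      + "</xs:element></xs:schema>";

    static final String GOOD =
        "<cellar-book><wine-catalog><wine code='C1'/><wine code='C2'/></wine-catalog>"
      + "<cellar><wine code='C2'/></cellar></cellar-book>";
    // two catalog wines share the same code: violates the key
    static final String DUPLICATE =
        "<cellar-book><wine-catalog><wine code='C1'/><wine code='C1'/></wine-catalog>"
      + "<cellar><wine code='C1'/></cellar></cellar-book>";
    // the cellar refers to a code absent from the catalog: violates the keyref
    static final String DANGLING =
        "<cellar-book><wine-catalog><wine code='C1'/></wine-catalog>"
      + "<cellar><wine code='C9'/></cellar></cellar-book>";

    /** Returns true when the instance satisfies the schema, constraints included. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            // without an error handler, validation errors are thrown as SAXException
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("good:      " + isValid(GOOD));
        System.out.println("duplicate: " + isValid(DUPLICATE));
        System.out.println("dangling:  " + isValid(DANGLING));
    }
}
```

Placing both constraints on the common ancestor is a deliberate simplification; it sidesteps the question of where the xs:keyref should be attached when the key is declared on a descendant element.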

3.2.4 Namespaces in Schemas

We have introduced namespaces for instance documents in section 2.1, but they show their full power during the validation process in which the combination of element names and namespaces must correspond between the instance and the schema. Of course, namespaces must be properly combined during file inclusion and the details can become quite intricate. We will illustrate a simple but quite frequent case with listings 3.3 and 3.4. These two schemas define a wine element having different meanings and content which must be well distinguished. This is achieved with namespace declarations. A similar kind of name clash would occur if one needed to use an element called type or sequence, names that are already used by the schema vocabulary. This is why we define a namespace (usually bound to the prefix xs or xsd) for the names of an XML Schema.

By default, names without prefixes are defined in the empty namespace or the namespace assigned to the xmlns attribute. To create elements in a specific namespace (and not the empty one), we set a value for the targetNamespace attribute as is done in listing 3.4 (line 6). The same namespace is also assigned to the prefix cat. In order for all global elements and types of an included file to be visible in the including file, elementFormDefault should be assigned qualified and attributeFormDefault unqualified, as is seen on line 2 of listing 3.4.

The importation of the elements of an external schema file along with its namespaces is done using xs:import as shown in listing 3.3 (line 5), indicating both the namespace used here for the target namespace of the imported file (here we keep the same) and the location of the file to be imported. The imported namespace must be given a prefix definition with an xmlns declaration like we do in the xs:schema opening tag on the first line of listing 3.3. Because the name associated with the target namespace of the imported file is the same as the one associated with the cat prefix, we use cat:wine to refer to the wine element of listing 3.4. Namespace and importation of RELAX NG schemas are similar in principle to what we have shown for XML Schema.

⁵ It seems that there is a bug in the XMLSpy 2006 validator when validating the XML instance file (listing 2.2) because of the xs:keyref element; strangely, this file validates with the public domain validators but not with XMLSpy. For the moment, we suggest commenting out the keyref element definition instead of reorganising the whole definition as suggested by the XMLSpy support people.

We can now better understand how namespaces are used in the instance documents and how they are linked to their schemas. For example, the first lines of listing 2.3 define the namespace associated with the null prefix (i.e. only the element name) as the value of the xmlns attribute. We also indicate the namespace and the location of the Schema to be used for validation as value of the xsi:schemaLocation attribute. The xsi prefix must also be defined by an attribute starting with xmlns:. Because all elements defined in this file are in the same namespace, to which we have assigned the null prefix, no namespace prefix is used in this file.
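To see what a namespace-aware processor makes of the null prefix, one can parse a small document in Java and print the namespace of each element. This is a sketch only: the miniature instance below is hypothetical (it is not listing 2.3), and the class and method names are invented for the illustration. It shows that two elements written with the same tag wine end up with different qualified names once the xmlns declaration is taken into account.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceDemo {
    static final String CAT = "http://www.iro.umontreal.ca/lapalme/wine-catalog";

    // Hypothetical miniature instance (assumption): the catalog part declares a
    // default namespace with xmlns, the cellar part is in no namespace.
    static final String XML =
        "<cellar-book>"
      + "<wine-catalog xmlns='" + CAT + "'><wine code='C1'/></wine-catalog>"
      + "<cellar><wine code='C1'/></cellar>"
      + "</cellar-book>";

    /** Lists, in document order, "{namespace}localName" for every element
     *  with the given local name, whatever its namespace. */
    static List<String> qualifiedNames(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagNameNS("*", localName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                String ns = e.getNamespaceURI(); // null when in no namespace
                result.add("{" + (ns == null ? "" : ns) + "}" + e.getLocalName());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String name : qualifiedNames(XML, "wine")) {
            System.out.println(name);
        }
    }
}
```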

3.2.5 Overview of the XML Schemas of Our Application

We now come back to listing 2.1, which shows the outline of our XML instance files validated by the two XML Schemas described in this section; the outline of these schemas is shown in listing 3.5. These listings show the inclusion of both the instance file and the corresponding XML Schema of the wine catalog into the cellar book. Note the use of the namespace prefixes in both XML instance and the corresponding XML Schema files. Boxes in listings 2.1 and 3.5 correspond to frontiers of namespaces⁶. In listing 3.5, we see that references from outside the box to the inside need to use the namespace prefix cat: for the SAQ-code type (line 18), the wine-catalog element (line 24) or the WineNumber key name (line 30).

These listings show another interesting use of namespaces: to make sure that relative references are kept, an xml:base attribute (line 5 within the included box) is added to the root element. This is why xml:base is added as an attribute in the WineCatalog.xsd schema (line 16 of included listing of listing 3.5). xml:base itself belongs to the predefined XML namespace, whose definition must also be imported (line 8 of included listing of listing 3.5).

⁶ Here for simplicity (which should be the rule) we have kept the same name for the namespaces but we could have changed between the instance and the schema.

Listing 3.5: Outline of CellarBook.xsd which imports (line 4) WineCatalog.xsd which uses a different namespace.

   1  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
                 xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog">

        <xs:import namespace="http://www.iro.umontreal.ca/lapalme/wine-catalog"
                   schemaLocation="WineCatalog.xsd"/>
   6
        +---------------- included box: WineCatalog.xsd ----------------
        |      <xs:schema
        |        xmlns:xs='http://www.w3.org/2001/XMLSchema'
        |        elementFormDefault="qualified"
        |        attributeFormDefault="unqualified"
        |   5    xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog"
        |        targetNamespace="http://www.iro.umontreal.ca/lapalme/wine-catalog">
        |
        |        <xs:import namespace="http://www.w3.org/XML/1998/namespace"
        |                   schemaLocation="http://www.w3.org/2001/03/xml.xsd"/>
        |  10
        |        <xs:element name="wine-catalog">
        |          <xs:complexType>
        |            <xs:sequence minOccurs="0" maxOccurs="unbounded">
        |              <xs:element name="wine" type="cat:Wine"/>
        |  15        </xs:sequence>
        |            <xs:attribute ref="xml:base"/>
        |          </xs:complexType>
        |          <xs:key name="WineNumber">...</xs:key>
        |          <xs:unique name="WineName">...</xs:unique>
        |  20    </xs:element>
        |        <xs:complexType name="Wine">...</xs:complexType>
        |        ...
        |      </xs:schema>
        +----------------------------------------------------------------

        <xs:element name="cellar">
          <xs:complexType>
            <xs:sequence minOccurs="0" maxOccurs="unbounded">
  11          <xs:element name="wine" type="Wine"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>

  16    <xs:complexType name="Wine">
          <xs:sequence>...</xs:sequence>
          <xs:attribute name="code" type="cat:SAQ-code" use="required"/>
        </xs:complexType>

  21    <xs:element name="cellar-book">
          <xs:complexType>
            <xs:sequence>
              <xs:element ref="cat:wine-catalog"/>
              <xs:element name="owner" type="Owner"/>
  26          <xs:element name="location" minOccurs="0">...</xs:element>
              <xs:element ref="cellar"/>
            </xs:sequence>
          </xs:complexType>
          <xs:keyref refer="cat:WineNumber" name="SAQ-UPC">...</xs:keyref>
  31    </xs:element>
        ...
      </xs:schema>

Compact Syntax (RNC)              XML syntax (RNG)
--------------------------------  --------------------------------------------------
default? namespace id = URI       <grammar>
| datatypes id = URI *              <start> pattern </start>
start = pattern                     <define name="NCName"> pattern+ </define> *
| id = pattern *                  </grammar>
--------------------------------  --------------------------------------------------
element QName { pattern }         <element name="QName"> pattern+ </element>
attribute QName { pattern }       <attribute name="QName"> pattern+ </attribute>
pattern , pattern +               <group name="QName"> pattern+ </group>
pattern & pattern +               <interleave name="QName"> pattern+ </interleave>
pattern | pattern +               <choice name="QName"> pattern+ </choice>
pattern ?                         <optional name="QName"> pattern+ </optional>
pattern *                         <zeroOrMore name="QName"> pattern+ </zeroOrMore>
pattern +                         <oneOrMore name="QName"> pattern+ </oneOrMore>
mixed { pattern }                 <mixed name="QName"> pattern+ </mixed>
id                                <ref name="NCName"/>
empty                             <empty/>
text                              <text/>
dataTypeValue                     <value name="NCName"?> string </value>
dataTypeName { id = value * }     <data type="NCName"?>
                                    <param name="NCName"> string </param> *
                                  </data>

Table 3.3: Reminder of the RELAX NG Compact and RELAX NG XML syntax used in our examples. The top cells of the table give the start of the file for RNC and the root element for RNG. Each line of the bottom cells is a different pattern that can be combined almost freely with the others. The corresponding RNC and RNG elements appear in the same line of the bottom cell of the table. Braces ({ }) are terminals of the RNC syntax. In RNC, dataTypeName can be NCName, string or token.

3.3 RELAX NG

As we have seen in the previous section, XML Schema allows a thorough validation of XML instance files. The type extension mechanism is very powerful but its XML format is not user-friendly, especially for complex embedding of sequences and choices. This is why the graphical editing of schemas, provided by editors such as XMLSpy, is very useful. In fact, when it comes to ease of use, the grammar-like format of DTDs is much more convenient. In order to get the best of both worlds, an alternative schema notation has been suggested, called RELAX NG (REgular LAnguage for XML, New Generation), which features a simpler, intuitive notation to define schemas. RELAX NG is based on the same mathematical theory underlying regular expressions but adapted to the XML context. The mathematical foundations are both simpler and more powerful than the ones of XML Schema.

RELAX NG has two equivalent syntaxes:⁷ one is XML-based and the other (called compact) is more convenient because it allows grammar-like definitions. Eric van der Vlist [34] has written an excellent book explaining both notations in detail. First, he introduces the XML patterns, which are the theoretical foundations of the formalism; they are combined into ordered and unordered groups and used in choices among alternatives. He then shows how the compact notation can simplify the XML notation. In this report, we use the compact notation to write RELAX NG schemas; we will use the Trang automatic schema converter to get the XML notation should one need it for further processing. Most validators can deal directly with the compact notation. Listing 3.6, a RELAX NG compact notation schema for our cellar book, looks more intuitive than the equivalent XML Schema of listing 3.3.

⁷ Trang [17] is a tool that can transform one notation into the other and even a RELAX NG schema into an XML Schema.

As can be seen in table 3.3, the structure of RELAX NG Compact definitions is quite regular and simple: on the last line of the top left cell, a definition is simply a name followed by an equal sign and a pattern definition (each line of the bottom cell of the table corresponds to a different pattern definition). A pattern can start with either the keyword element or attribute, followed by a name and another pattern within braces. Patterns can be combined sequentially (with a comma), with alternatives (with a vertical bar) or by interleaving (with an ampersand); this last case means that all patterns must occur but not necessarily in order. A pattern can also be qualified to be optional, to appear zero or more times or once or more. Mixed patterns allow text elements to appear between patterns. Reference to another pattern is indicated by simply giving its name. empty means that the content of the element must be empty. text corresponds to any number of text nodes in the instance document. Giving a value (usually within braces) means that the element in the document should match this value. It is also possible to specify facets (in the XML Schema sense) on a type with a list of triples of the form: the name of the facet, an equal sign and then the value of the facet.

In listing 3.6, we can see examples of element definitions (line 9, line 19 and line 25). A definition can also be a comma-separated sequence of patterns (line 32 and line 38). We use it here for type definitions but the concept is more general and can be applied to any kind of definition. The content of a definition starts with the keyword attribute or element followed by its name and the type of its content between braces. Similarly to the regular expression conventions used for DTDs, a definition or a reference to a definition can be followed by a ? to indicate that it is optional (see rating and comment within wine (line 10)), a * to indicate a repetition of 0 or more times (see cellar-element (line 9)) or a + for a repetition of at least one element. If a & is used instead of a comma to separate elements (such as for name-element (line 19)), it indicates an interleave, meaning that elements in the pattern are unordered. In this case it means that the parts of the name can appear in any order, any of them being optional because they are followed by a ?.⁸ The root element of the schema is defined by the rule associated with the start keyword.

⁸ This is a slight difference from the syntax allowed for a name element as defined by the DTD (listing 3.1) and XML Schema (listing 3.3) in which the only way to indicate this constraint would have been to enumerate all possible orderings of first, family and initial.

When there is no constraint on the string inside an element then the type is text but it can also refer to the built-in data types of XML Schema (see wine (line 10)). Restrictions can also be added on types by indicating them within braces: patterns (see PostalCodeCA (line 40)) or enumerations (see province element in Address (line 32)).

Listing 3.6 includes (line 4) the definitions of the wine catalog in a separate file (listing 3.8). Because the included file also has a start symbol, we override its definition by the definition in braces after the name of the file. Any other included definition could be overridden in this way. There are many other possibilities to combine definitions of many files but we will not deal with them in this document. One should consult [34, chapter 10] for more details.

Namespace prefixes are declared by a definition following the keyword namespace (line 2). To use the predefined types of XML Schema (figure 3.3), we declare similarly the prefix used for referring to them. RELAX NG does not implement the notions of XML Schema keys and keyref so that one must resort to the simpler (but often sufficient) notion of DTD ID and IDREF explained in section 3.1.

Listing 3.6: [CellarBook.rnc]: RELAX NG compact notation schema for the cellar book. It can validate listing 2.2. Compare it with listing 3.3.

      datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"
      namespace cat = "http://www.iro.umontreal.ca/lapalme/wine-catalog"

      include "WineCatalog.rnc" {
   5     start = cellar-book
      }

      cellar-element =
        element cellar {
  10      element wine {
            attribute code {xs:IDREF},
            element purchaseDate {xs:date},
            element quantity {xs:nonNegativeInteger},
            element rating {attribute stars {xs:positiveInteger}?}?,
  15        element comment {Comment}?
          }*
        }

      name-element = element name {
  20                   element first {text}?
                     & element family {text}?
                     & element initial {text}?
                   }

  25  cellar-book = element cellar-book {
                      wine-catalog,
                      element owner {Owner},
                      element location {Address},
                      cellar-element
  30                }

      Address = element street {text},
                element city {text},
                element province {"AB"|"BC"|"MB"|"NB"|"NL"|"NT"|
  35                              "NS"|"NU"|"ON"|"QC"|"SK"|"YT"},
                element postal-code {PostalCodeCA}

      Owner = name-element, Address

  40  PostalCodeCA = xs:string {pattern="[A-Z][0-9][A-Z] [0-9][A-Z][0-9]"}
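The pattern facet attached to PostalCodeCA on the last line of listing 3.6 can be explored outside of any validator. The sketch below makes one assumption worth stating: it uses java.util.regex instead of the XML Schema regular expression dialect. The two dialects differ in general, but they agree on this simple pattern, and Matcher.matches() reproduces the fact that a schema pattern must match the whole value. The class name PostalCodeDemo is invented for this illustration.

```java
import java.util.regex.Pattern;

class PostalCodeDemo {
    // Same expression as the pattern facet of PostalCodeCA in listing 3.6.
    private static final Pattern POSTAL_CODE =
        Pattern.compile("[A-Z][0-9][A-Z] [0-9][A-Z][0-9]");

    /** True when the whole string has the letter-digit-letter digit-letter-digit shape. */
    static boolean isPostalCode(String value) {
        // matches() anchors the expression at both ends, like a schema pattern
        return POSTAL_CODE.matcher(value).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"H3C 3J7", "H3C3J7", "h3c 3j7"};
        for (String candidate : candidates) {
            System.out.println("'" + candidate + "' -> " + isPostalCode(candidate));
        }
    }
}
```

Only the first candidate, the postal code of the address on the title page of this report, is accepted; the second lacks the space and the third uses lowercase letters.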

Should one need to manipulate a RELAX NG schema with a program, it would be simpler to use the corresponding RELAX NG XML notation as illustrated in listing 3.7. As we have obtained it automatically from the compact notation, we will not explain it further but we want to point out that it is much simpler to write than the corresponding XML Schema because of the uniformity of the underlying concepts (everything is a pattern).
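As an illustration of such a manipulation, the following sketch reads a schema in the XML notation with the DOM parser of the JDK and reports, for each define element, the names of the definitions it refers to. The schema fragment is a shortened, hypothetical excerpt in the style of listing 3.7, and the class and method names are invented for this example; a real program would read the .rng file instead of a string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RngDefinitions {
    static final String RNG_NS = "http://relaxng.org/ns/structure/1.0";

    // Shortened excerpt in the style of listing 3.7 (assumption: not the full file).
    static final String SAMPLE =
        "<grammar xmlns='" + RNG_NS + "'>"
      + "<start><ref name='Owner'/></start>"
      + "<define name='Owner'>"
      + "<ref name='name-element'/><ref name='Address'/>"
      + "</define>"
      + "<define name='PostalCodeCA'><data type='string'/></define>"
      + "</grammar>";

    /** Maps each definition name to the names it references, in document order. */
    static Map<String, List<String>> references(String rng) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(rng)));
            Map<String, List<String>> result = new LinkedHashMap<>();
            NodeList defines = doc.getElementsByTagNameNS(RNG_NS, "define");
            for (int i = 0; i < defines.getLength(); i++) {
                Element define = (Element) defines.item(i);
                List<String> refs = new ArrayList<>();
                // all ref elements nested at any depth inside this definition
                NodeList nested = define.getElementsByTagNameNS(RNG_NS, "ref");
                for (int j = 0; j < nested.getLength(); j++) {
                    refs.add(((Element) nested.item(j)).getAttribute("name"));
                }
                result.put(define.getAttribute("name"), refs);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(references(SAMPLE));
    }
}
```

Because every construct of the XML notation is an element in a single namespace, no special parsing is needed, which is precisely the advantage of this notation for programs.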

Listing 3.7: [CellarBook.rng]: RELAX NG schema for the cellar book in XML notation to be compared with listing 3.3. It was obtained automatically (using the Trang converter) from listing 3.6.

    <?xml version="1.0" encoding="UTF-8"?>
    <grammar xmlns="http://relaxng.org/ns/structure/1.0"
             datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
      <include href="WineCatalog.rng">
        <start>
          <ref name="cellar-book"/>
        </start>
      </include>
      <define name="cellar-element">
        <element name="cellar">
          <zeroOrMore>
            <element name="wine">
              <attribute name="code">
                <data type="IDREF"/>
              </attribute>
              <element name="purchaseDate">
                <data type="date"/>
              </element>
              <element name="quantity">
                <data type="nonNegativeInteger"/>
              </element>
              <optional>
                <element name="rating">
                  <optional>
                    <attribute name="stars">
                      <data type="positiveInteger"/>
                    </attribute>
                  </optional>
                </element>
              </optional>
              <optional>
                <element name="comment">
                  <ref name="Comment"/>
                </element>
              </optional>
            </element>
          </zeroOrMore>
        </element>
      </define>
      <define name="name-element">
        <element name="name">
          <interleave>
            <optional>
              <element name="first">
                <text/>
              </element>
            </optional>
            <optional>
              <element name="family">
                <text/>
              </element>
            </optional>
            <optional>
              <element name="initial">
                <text/>
              </element>
            </optional>
          </interleave>
        </element>
      </define>
      <define name="cellar-book">
        <element name="cellar-book">
          <ref name="wine-catalog"/>
          <element name="owner">
            <ref name="Owner"/>
          </element>
          <element name="location">
            <ref name="Address"/>
          </element>
          <ref name="cellar-element"/>
        </element>
      </define>
      <define name="Address">
        <element name="street">
          <text/>
        </element>
        <element name="city">
          <text/>
        </element>
        <element name="province">
          <choice>
            <value>AB</value>
            <value>BC</value>
            <value>MB</value>
            <value>NB</value>
            <value>NL</value>
            <value>NT</value>
            <value>NS</value>
            <value>NU</value>
            <value>ON</value>
            <value>QC</value>
            <value>SK</value>
            <value>YT</value>
          </choice>
        </element>
        <element name="postal-code">
          <ref name="PostalCodeCA"/>
        </element>
      </define>
      <define name="Owner">
        <ref name="name-element"/>
        <ref name="Address"/>
      </define>
      <define name="PostalCodeCA">
        <data type="string">
          <param name="pattern">[A-Z][0-9][A-Z] [0-9][A-Z][0-9]</param>
        </data>
      </define>
    </grammar>

The beginning of listing 3.8 illustrates how to declare a default namespace for the elements of this file, included in listing 3.6 (line 4). The definition of elements follows the same principles explained for the cellar book. wine-catalog (line 6) must add an optional attribute xml:base that is used by the XML processor during the file inclusion process. It is needed in order to ensure the integrity of both the including and included file. Element Format (line 34) shows that comments start with a # and go up to the end of the line. These comments are also preserved during the transformation to the XML notation in listing 3.9 (line 83).

Listing 3.8: [WineCatalog.rnc]: RELAX NG schema for the wine catalog in compact notation. It can validate the instance document of listing 2.3. It can be compared with listing 3.4.

      default namespace = "http://www.iro.umontreal.ca/lapalme/wine-catalog"
      datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"

      start = wine-catalog
   5
      wine-catalog = element wine-catalog {
         # needed because this schema will be imported
         attribute xml:base {text}?,
         element wine {Wine}*
  10  }

      Wine = attribute name {text},
             attribute appellation {text},
             attribute classification {text},
  15         attribute code {xs:ID},
             attribute format {Format},
             element properties {Properties},
             element origin {Origin},
             ( element tasting-note {Comment}
  20         | element food-pairing {Comment}
             | comment-element
             )*,
             element price {xs:decimal},
             element year {xs:gYear}
  25
      Properties = element color {Color},
                   element alcoholic-strength {Percentage},
                   element nature {text}?

  30  Origin = element country {text},
               element region {text},
               element producer {text}

      Format = "375ml" | "750ml" | "1l"
  35         | "magnum"          # 1.5 litres
             | "jeroboam"        # 3 litres
             | "rehoboam"        # 4.5 litres
             | "mathusalem"      # 6 litres
             | "salmanazar"      # 9 litres
  40         | "balthazar"       # 12 litres
             | "nabuchodonosor"  # 15 litres

      Comment = mixed { element emph {text}* & element bold {text}* }
      comment-element = element comment {Comment}
  45
      Color = "red" | "white" | "rose"

      Percentage = xs:decimal {
                      minInclusive = "0"
  50                  maxInclusive = "100"
                      fractionDigits = "2" }

Listing 3.9: [WineCatalog.rng]: RELAX NG schema for the wine catalog in XML notation. It can validate listing 2.3. It was obtained automatically (using the Trang converter) from listing 3.8 and slightly reformatted here to fit in the page.

      <?xml version="1.0" encoding="UTF-8"?>
      <grammar ns="http://www.iro.umontreal.ca/lapalme/wine-catalog"
   3           xmlns="http://relaxng.org/ns/structure/1.0"
               datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
        <start>
          <ref name="wine-catalog"/>
        </start>
   8    <define name="wine-catalog">
          <element name="wine-catalog">
            <optional>
              <attribute name="xml:base"/>
  13        </optional>
            <zeroOrMore>
              <element name="wine">
                <ref name="Wine"/>
              </element>
  18        </zeroOrMore>
          </element>
        </define>
        <define name="Wine">
          <attribute name="name"/>
  23      <attribute name="appellation"/>
          <attribute name="classification"/>
          <attribute name="code">
            <data type="ID"/>
          </attribute>
  28      <attribute name="format">
            <ref name="Format"/>
          </attribute>
          <element name="properties">
            <ref name="Properties"/>
  33      </element>
          <element name="origin">
            <ref name="Origin"/>
          </element>
          <zeroOrMore>
  38        <choice>
              <element name="tasting-note">
                <ref name="Comment"/>
              </element>
              <element name="food-pairing">
  43            <ref name="Comment"/>
              </element>
              <ref name="comment-element"/>
            </choice>
          </zeroOrMore>
  48      <element name="price">
            <data type="decimal"/>
          </element>
          <element name="year">
            <data type="gYear"/>
  53      </element>
        </define>
        <define name="Properties">
          <element name="color">
            <ref name="Color"/>
  58      </element>
          <element name="alcoholic-strength">
            <ref name="Percentage"/>
          </element>
          <optional>
  63        <element name="nature">
              <text/>
            </element>
          </optional>
        </define>
  68    <define name="Origin">
          <element name="country">
            <text/>
          </element>
          <element name="region">
  73        <text/>
          </element>
          <element name="producer">
            <text/>
          </element>
  78    </define>
        <define name="Format">
          <choice>
            <value>375ml</value>
            <value>750ml</value>
  83        <value>1l</value>
            <value>magnum</value>
            <value>jeroboam</value>
            <value>rehoboam</value>
            <value>mathusalem</value>
  88        <value>salmanazar</value>
            <value>balthazar</value>
            <value>nabuchodonosor</value>
          </choice>
        </define>
  93    <define name="Comment">
          <mixed>
            <interleave>
              <zeroOrMore>
                <element name="emph">
  98              <text/>
                </element>
              </zeroOrMore>
              <zeroOrMore>
                <element name="bold">
 103              <text/>
                </element>
              </zeroOrMore>
            </interleave>
          </mixed>
 108    </define>
        <define name="comment-element">
          <element name="comment">
            <ref name="Comment"/>
          </element>
 113    </define>
        <define name="Color">
          <choice>
            <value>red</value>
            <value>white</value>
 118        <value>rose</value>
          </choice>
        </define>
        <define name="Percentage">
          <data type="decimal">
 123        <param name="minInclusive">0</param>
            <param name="maxInclusive">100</param>
            <param name="fractionDigits">2</param>
          </data>
        </define>
 128  </grammar>

3.4 Associating an Instance File to a Schema

An instance XML file can specify its validating schema by adding some information in the attributes of the root tag. This is illustrated in listing 2.2 (line 6) where we indicate the location of the schema with no namespace using the xsi:noNamespaceSchemaLocation attribute. We then include (using an xi:include element) the WineCatalog.xml file (listing 2.3) so that its elements can be referred to. In fact, the XML processor sees the full content of these files (i.e. the cellar and the wine catalog). Listing 2.1 illustrates the file inclusion mechanism and how the instance files are linked to their respective XML Schema in listing 3.5.

xi:include refers to the W3C standard [26] which specifies a general purpose inclusion mechanism to merge information from different XML files. So it is possible to include only some well-formed parts of the included file, but here we include the whole wine catalog. This is a principled way of including information and not mere character inclusions like the one specified with DTD system entities we used in section 3.1.1.

Listing 2.2 also shows that even if a file is validated with an XML Schema, a DOCTYPE declaration can be added to define new entities. In fact, it is the only way to define an entity when using an XML Schema.

Listing 2.3 (line 1) shows how to link an instance file and define its namespace. The default namespace, defined by the xmlns attribute in the root tag (line 4), indicates that all element tags without prefix are defined in the http://www.iro.umontreal.ca/lapalme/wine-catalog namespace. The schema location is indicated as the value of the xsi:schemaLocation (line 3) attribute with two values (blank separated). The first part indicates the namespace corresponding to the target namespace of the schema and the second part gives its URI (here a local file).
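The two blank-separated parts of xsi:schemaLocation can be extracted with a few lines of Java. This is a sketch under assumptions: the root tag below is a hypothetical stand-in for the first lines of listing 2.3, the class and method names are invented, and the parser is used without validation so that the schema file itself is never opened.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class SchemaLocationDemo {
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    // Hypothetical root tag in the style of listing 2.3 (assumption).
    static final String XML =
        "<wine-catalog"
      + " xmlns='http://www.iro.umontreal.ca/lapalme/wine-catalog'"
      + " xmlns:xsi='" + XSI + "'"
      + " xsi:schemaLocation='http://www.iro.umontreal.ca/lapalme/wine-catalog"
      + " WineCatalog.xsd'/>";

    /** Maps each target namespace to the schema location paired with it. */
    static Map<String, String> schemaLocations(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)))
                                  .getDocumentElement();
            Map<String, String> result = new LinkedHashMap<>();
            // getAttributeNS returns the empty string when the attribute is absent
            String value = root.getAttributeNS(XSI, "schemaLocation").trim();
            if (!value.isEmpty()) {
                String[] parts = value.split("\\s+");
                // the value is a list of (namespace, location) pairs
                for (int i = 0; i + 1 < parts.length; i += 2) {
                    result.put(parts[i], parts[i + 1]);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        schemaLocations(XML).forEach((namespace, location) ->
            System.out.println(namespace + " -> " + location));
    }
}
```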

RELAX NG specifications [20] do not prescribe how an instance file should be linked to its schema, so each XML editor or validator has an implementation-specific way of associating these files (either internally or externally). For example, <oXygen/> uses processing instructions inserted at the top of the file such as the following (depending on whether the compact syntax is used or not).

    <?oxygen RNGSchema="CellarBook.rnc" type="compact"?>
    <?oxygen RNGSchema="CellarBook.rng" type="xml"?>

3.5 Additional Information on XML Schema

Although XML schemas have been standardized, the area of validation is still a research subject and alternatives have been proposed: see [24] for a comparison of some of them. Interesting links are being made with relational database models [25] in order to build on their strong theoretical background. Schemas and the validation process are being formalized [15].

We have only skimmed over the subject of validation of XML files but the same essential ideas apply throughout. On top of the official and informal information available at www.w3.org/xml/Schema, some good sources of information and interesting tutorials can be found in the following resources:

http://www.XML.com is maintained by the publisher O’Reilly with many excerpts from their books

http://www.XML.org is a market-oriented site with interesting files in the resources section

http://www.mulberrytech.com/quickref/XMLquickref.pdf is a very useful XML Syntax Quick Reference Sheet (US letter size)

http://www.xfront.com/xml-schema.html gives a complete tutorial in roughly 150 Microsoft Powerpoint slides.

http://www.xmlspy.com XMLSpy is a good commercial XML editor on the PC platform, complete with a powerful structure editor, internal validation and real-time suggestions of allowable elements and attributes (strangely, these suggestions are not adequate in the text view, i.e. the mode in which XML tags are explicitly typed). It is easy to switch between the text view and the structural view of the editor. There is also a good stylesheet designer module (Stylevision) to create stylesheet transformations interactively and graphically. These transformations can then be used as a basis for what is called the authentic view which can effectively hide the XML tags from the user of an XML document.

http://www.oxygenxml.com/ <oXygen/> is a good XML editor for PC, Linux, MacOS X and Solaris. Real-time valid suggestions are offered in the text view. Validation can be done within the editor. Stylesheet transformations can be displayed in a window of the editor. It also features a tree editing mode and a graphical output of a schema similar to what is provided by XMLSpy. Unfortunately, it is not possible to edit the schema graphically.

http://www.thaiopensource.com/nxml-mode/ nXML mode in Emacs [18] offers real-time valid suggestions for editing XML files when their schema is written in RELAX NG. Trang can be used for translating an XML Schema or a DTD into RELAX NG. The most interesting feature of nXML is its real-time validation during editing as it incrementally reparses and validates the document during idle periods in the typing process.


Chapter 4

Document Transformation

Since XML is a tree-structured representation of information, it is relatively simple to process this information either to change its shape or to select some sub-trees. To achieve this, XML designers have defined the eXtensible Stylesheet Language (XSL) [16] technology which refers to two components:

XSLT [16] a transformation language to convert an XML document into either another XML document, into HTML, or into a plain text document (a very wide one-level tree!)

XSL-FO a platform- and media-independent formatting language composed of a set of XML elements, called formatting objects, that describe parts of a printed page at a high level, e.g. <block>, <table>, etc. These elements are most often produced by XSLT transformations of an XML document.

XSLT depends on XPath [19] (explained in section 4.1), a syntax to identify nodes in an XML document. This specification is separate because several other W3C specifications depend on it; we saw an example in section 3.2.3 where XPath expressions were used to define keys and keyrefs.

XSLT is an XML-based formalism to define production rules (similar to OPS5 or Prolog without unification) that match nodes in a tree of an XML document and produce a new tree. These rules are defined in stylesheets (XML files named with the .xsl extension) that can be validated with a predefined XSLT Schema. This transformation mechanism is very general and can be used to produce any kind of tree, but most often it is used for presentation, one simple kind of tree being an HTML document. In fact, most web browsers can process XML documents linked with XSLT stylesheets to display the resulting transformation. For example, Internet Explorer (figure 1.2) and Firefox have a predefined stylesheet for XML files to explore them gradually by folding and unfolding elements.
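The idea of rules that match nodes and produce a new tree can be made concrete with the XSLT 1.0 processor that ships with the JDK. The following is a sketch, not one of the stylesheets of this report: the two-rule stylesheet, the miniature catalog and its wine names are hypothetical, as are the class and method names. The first rule matches the root element and produces a list; the second matches each wine and produces one list item.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltDemo {
    // Hypothetical stylesheet (assumption): two production rules.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      // rule 1: the catalog becomes a bulleted list
      + "<xsl:template match='/wine-catalog'>"
      + "<ul><xsl:apply-templates select='wine'/></ul>"
      + "</xsl:template>"
      // rule 2: each wine becomes an item showing its name attribute
      + "<xsl:template match='wine'>"
      + "<li><xsl:value-of select='@name'/></li>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical miniature catalog with invented wine names.
    static final String XML =
        "<wine-catalog>"
      + "<wine name='Domaine A' code='C1'/>"
      + "<wine name='Domaine B' code='C2'/>"
      + "</wine-catalog>";

    /** Applies the stylesheet to the document and returns the serialized result. */
    static String transform(String xsl, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XSL, XML));
    }
}
```

Note that the rules never state the order in which the tree is traversed; xsl:apply-templates simply asks the processor to find the rule matching each selected node.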

In section 4.3.1, we will show how to transform our cellar book instance document into an HTML page with indented bulleted lists. We will see in section 4.3.2 how to create an HTML tabular presentation of our wine-catalog. Section 4.3.3 illustrates features of stylesheets that allow one to better select information and perform some simple calculations to produce information that was not present in the original XML file. We will then show, in section 4.4, how to transform our XML instance document into the compact text representation we presented in figure 1.4. Finally, we will illustrate in section 4.5 the use of Formatting Objects to produce a PDF output from an XML document.

4.1 XPath

Because XML documents are tree-structured, we must be able to designate nodes in their trees either absolutely (i.e. starting from the root) or relatively to a given node. An XPath expression¹ refers to either a single node or to a set of nodes in the document tree.

There are seven types of nodes in an XML document:

root the starting point of the document

element the most common type of node, it may contain other elements

text containing the real information; it cannot contain any element

attribute string information contained in the start-tag of the containing element; the element is the parent of the attribute, although the attribute is not counted among the children of that element

comment information that is normally ignored for processing but that is nevertheless kept in the structure of the document

processing instruction nodes for instructions starting with <? that will not be discussed in this paper

namespace information about the namespace of an element; its processing will not be discussed in this section.

An XPath expression designating a set of nodes in the document tree consists of three parts.

1. an axis specifier gives the path to a set of nodes; we will only use here the abbreviated syntax, similar to the path notation used in computer systems to designate files and directories.

• / is the path separator between levels in the tree

• an absolute path from the root starts with /

• a relative path from the current node starts with something other than a /

• .. is the parent of the current node

• . is the current node

• // indicates a path with any number of intervening levels between two nodes; if // appears at the start of the expression then it means any arbitrary path between a node and the root

Element names are used to select nodes in the path but attributes (preceded by @) can also be used. The unabbreviated syntax allows access to siblings or ancestors in the tree, but we will not use it in this document.

2. node test can be the name of a node (with or without the namespace prefix), * to indicate all elements (or all attributes when preceded by @), or it can be a function name such as node(), text() or comment() to indicate the type of the node that is looked for.

3. predicate is a boolean expression given between square brackets ([]) that can further filter the set of nodes identified with the axis specifier and the node test. If the expression is a number i, then it refers to the ith node of that set, most often the ith child element (numbering starts at 1).

¹ In this report, we only use XPath 1.0 syntax [19]. Recently, a more involved specification, XPath 2.0 [11], has been proposed and is already implemented in some XML processors.

A predicate can use variables (their creation will be shown later) by prefixing their name by $, string manipulation functions (concat(.,.), substring(.,.,.), ...), number functions (sum(.), floor(.), ...) and node set functions (position(), count(.), local-name(.), ...). We will give some examples of their use but the full set can be found in the XPath reference document [19].

Table 4.1 presents examples of absolute XPath expressions that return a node or a set of nodes on the cellar book document shown in listing 2.2. The table makes explicit the three parts of an XPath expression. The XPath expressions of the table can be paraphrased as follows:

1. refers to the owner element of the cellar

2. returns the wines for which we have 2 bottles or less. The nodes returned are the wine elements even though the predicate uses an internal element; please note that the predicate is evaluated in the current context of the path specified. When XPath expressions are used in the context of an XSL file, as is most often the case, the < must be replaced by &lt; (even within strings!)

3. refers to the first wine of the cellar

4. returns the elements which contain a postal-code element. This is achieved by finding a postal-code anywhere in the tree from the root and then getting the parent element

5. finds the street of the owner of the cellar

6. returns the value of the code attribute for all wines in the cellar.

7. returns the value of the code attribute for all wines in the catalog (note the use of the namespace prefix)

8. returns the code of the last wine of the cellar

9. returns the cat:bold element (note again the use of the namespace prefix) within the comment of the first wine of the cellar.

10. finds the total number of bottles in the cellar by applying the predefined XPath function sum to the value of all quantity elements of the wines in the cellar.

    Axis specifier                   Node test            Predicate       Line number(s) of start of node(s) in listing 2.2
 1  /cellar-book/                    owner                                11
 2  /cellar-book/cellar/             wine                 [quantity<=2]   28, 36
 3  …                                …                    [1]             28
 4  //postal-code/..                                                      11, 21

    Axis specifier                   Node test            Predicate       Values of expression
 5  /cellar-book/owner/              street                               1234 rue des Chateaux
 6  //wine/                          @code                                C00043125, C00312363, C10263859, C00929026
 7  //cat:wine/                      …                                    C00043125, C00042101, C10263859, C00312363, C00929026
 8  //                               …                    [last()]
    /                                …                                    C00929026
 9  …[1]                             /comment/cat:bold                    Guy Lapalme, Montreal
10  sum(/cellar-book/cellar/wine/    quantity)                            14

Table 4.1: Examples of XPath expressions and their result when applied to Listing 2.2

Many other uses of XPath expressions will be shown in the XSL stylesheets in the following sections.
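The kind of expression paraphrased above can also be tried outside a stylesheet. The following is only a sketch using the standard javax.xml.xpath API of the JDK; the class name XPathSketch, its helper methods and the miniature document (three wines with invented codes C1 to C3) are assumptions made for illustration and do not reproduce listing 2.2.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {
    // Hypothetical miniature cellar book; it only mimics the shape of listing 2.2.
    static final String XML =
        "<cellar-book>"
      + "<owner><name>Jude Raisin</name><postal-code>M7W 7S0</postal-code></owner>"
      + "<cellar>"
      + "<wine code='C1'><quantity>2</quantity></wine>"
      + "<wine code='C2'><quantity>5</quantity></wine>"
      + "<wine code='C3'><quantity>1</quantity></wine>"
      + "</cellar>"
      + "</cellar-book>";

    // Evaluates an expression returning a node set; element nodes are shown
    // by name, other nodes (attributes here) by value, separated by spaces.
    static String nodes(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList found = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < found.getLength(); i++) {
                Node n = found.item(i);
                if (i > 0) sb.append(' ');
                sb.append(n.getNodeType() == Node.ELEMENT_NODE
                          ? n.getNodeName() : n.getNodeValue());
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates an expression returning a number, such as a call to sum().
    static double number(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Double) xpath.evaluate(
                expr, new InputSource(new StringReader(XML)), XPathConstants.NUMBER);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // In Java source the < needs no escaping, unlike in an XSL file.
        System.out.println(nodes("/cellar-book/cellar/wine[quantity<=2]/@code"));
        System.out.println(nodes("//postal-code/.."));
        System.out.println(nodes("//wine[last()]/@code"));
        System.out.println(number("sum(/cellar-book/cellar/wine/quantity)"));
    }
}
```

On this small document, the first expression keeps the wines C1 and C3, the second one returns the owner element, the third one the code of the last wine and the last one the total of 8 bottles.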

4.2 XSL Transformations

A template to transform a node in a tree has the following form:

<xsl:template match="XPath expr">
  value replacing the matching node(s)
</xsl:template>

As with XML Schemas, we must distinguish between the XSLT predefined elements and the elements used to create the document. The namespace prefix xsl is most often used for elements of XSLT. In order to trigger an xsl:template (a production rule), the transformation process must first identify the node (or nodes) to which it applies. This is done with the value of the match attribute that specifies an XPath expression. The content of the xsl:template defines the new structure of the tree by combining any part of the matched tree, new parts or even other parts of the document tree. The parts of the tree used as building blocks are referenced by XPath expressions and combined with functions, conditions, restricted looping constructs, etc. But the reader must remember that XSLT is a declarative language (similar to Prolog in some ways), so the ordering of templates cannot be used to influence the order of processing of the document tree.

A stylesheet follows a simple process: find a node for which a template applies and then, according to the content of the template, build a new tree structure in the context of this node: a context gives access to the current node, its parent, its siblings and its position within its siblings. To build the new tree structure, a template usually involves the application of templates to its children and their combination. This is done with the xsl:apply-templates element; without attributes, this forces the application of templates to all the children nodes of the current node, but the transformation can be applied to other nodes by using the select attribute which specifies an XPath expression.

Templates can also be named and called with xsl:call-template, similar to procedures in standard programming languages. But be aware that these procedures are rules and that they cannot have variables that can change their value: XSLT is thus a single assignment language much like functional languages; parameters are the only means of passing variable information between templates. Unlike ordinary templates, named templates do not change the context of their application. We thus see that the principles underlying XSLT are general, simple and powerful.
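The rule-based process described above can be observed with the XSLT 1.0 processor bundled with the JDK. This is a minimal sketch, assuming a hypothetical two-wine document and an illustrative class named TemplateSketch: the rule for wine is deliberately written before the rule for cellar to show that the order of the templates does not drive the processing, which follows the document tree.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TemplateSketch {
    // Hypothetical source document, much smaller than the cellar book.
    static final String XML =
        "<cellar><wine code='C1'><quantity>2</quantity></wine>"
      + "<wine code='C2'><quantity>5</quantity></wine></cellar>";

    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
        // Rule for wine: uses the context node to read its attribute and child.
      + "<xsl:template match='wine'>"
      +   "[<xsl:value-of select='@code'/>=<xsl:value-of select='quantity'/>]"
      + "</xsl:template>"
        // Rule for cellar: combines the results of the rules applied to its wines.
      + "<xsl:template match='cellar'>"
      +   "wines:<xsl:apply-templates select='wine'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Compiles the stylesheet, applies it to the document and returns the text output.
    static String run() {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(XML)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The text output method is used here only to keep the result easy to read; the same rules could just as well build elements of a new tree.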

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  xsl:output? xsl:template*
</xsl:stylesheet>

<xsl:output method="xml" indent="yes" encoding="UTF-8"/>

<xsl:template match="pattern">
  xsl:param*
  {xsl:apply-templates | xsl:call-template | xsl:variable | xsl:attribute | xsl:element}*
</xsl:template>
<xsl:template name="QName">
  xsl:param*
  {xsl:apply-templates | xsl:call-template | xsl:variable | xsl:attribute | xsl:element}*
</xsl:template>

<xsl:apply-templates select="node-set-exp"?/>
<xsl:apply-templates select="node-set-exp"?>
  {xsl:sort | xsl:with-param}*
</xsl:apply-templates>

<xsl:with-param name="QName"> ... </xsl:with-param>
<xsl:with-param name="QName" select="expr"/>

<xsl:call-template name="QName"/>
<xsl:call-template name="QName">
  xsl:with-param*
</xsl:call-template>

<xsl:param name="QName"> ... </xsl:param>
<xsl:param name="QName" select="expr"/>

<xsl:value-of select="expr"/>

<xsl:variable name="QName"> ... </xsl:variable>
<xsl:variable name="QName" select="expr"/>

<xsl:if test="boolean-expr"> ... </xsl:if>

<xsl:choose>
  <xsl:when test="expr"> ... </xsl:when>+
  <xsl:otherwise> ... </xsl:otherwise>?
</xsl:choose>

<xsl:for-each select="XPathExpr">
  xsl:sort* ...
</xsl:for-each>

<xsl:sort select="XPathExpr" order="ascending|descending"? data-type="number"?/>

<xsl:element name="QName" namespace="URI"> ... </xsl:element>
<xsl:attribute name="QName" namespace="URI"> ... </xsl:attribute>
<xsl:text> #PCDATA </xsl:text>

Table 4.2: A reminder of the subset of XSLT syntax used in our examples. Names written without angle brackets refer to other elements. Regular expressions are used to describe the allowed forms: braces are used for grouping, ? indicates an optional grouping, * a repetition (possibly none) and + a repetition at least once.

Table 4.2 shows the main stylesheet elements to define transformation rules. As a stylesheet is itself an XML document, it can be validated with the appropriate schema. The root element is xsl:stylesheet, which contains a certain number of templates.

An xsl:template element with a match attribute will be called when an element matching its pattern is encountered during the processing of the XML instance document. An xsl:template element with a name attribute must be called explicitly by an xsl:call-template. The content of the matched element in the source document is replaced by the content of the template, which usually involves the application of templates to its children.

xsl:apply-templates is the fundamental operation for traversing the document tree. Without attributes, it indicates that processing should be recursively applied to all child elements and text nodes (not to attributes). If the element is empty, then the nodes are processed in document order, but it is possible to specify a different ordering with an xsl:sort element. Actual parameters can also be given by name and value using xsl:with-param elements. Formal parameters are declared at the start of the xsl:template elements.

xsl:value-of is the fundamental way of getting the information contained in an element from the source document.

Local single-assignment variables can be defined within templates with an xsl:variable element. Its value can then be recovered in an XPath expression by prefixing its name with $.

Conditional processing is achieved using either xsl:if, which returns its content when the value of its test attribute is true or a non-empty set of nodes, or xsl:choose, which selects the first value from a series of alternatives indicated by xsl:when. The xsl:when elements are tested in sequence and the content of the first one whose test succeeds is the value of the xsl:choose element. If no test succeeds and an xsl:otherwise element is present, its value is the value of the xsl:choose element.

Although recursive traversal of the document tree is the preferred way of going through nodes, it is also possible to do this traversal iteratively with an xsl:for-each.

Processing of the nodes is usually done in document order but the order can be changed using xsl:sort, which specifies the sorting key with the select attribute and an ascending or descending sort according to the value of the order attribute. The values are usually sorted as text, but their numeric value can also be used for sorting by specifying the data-type attribute.

The dynamic creation of target document elements, attributes and text nodes is done using the xsl:element, xsl:attribute and xsl:text elements.

We will now look at many examples of use of these principles and XSL elements in the following sections; first with straightforward transformations into HTML, then into plain text and finally into formatting objects to produce more complex formatting.


Listing 4.1: [compactHTML.html]: HTML output (slightly reformatted here to fit in the page) produced by the transformation of listing 4.2 on the cellar book (listing 2.2)

<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>HTML compaction of the XML file</title></head>
  <body>
    <ul>
      <li xmlns=""><b>cellar-book</b> noNamespaceSchemaLocation="CellarBook.xsd"
        <ul>
          <li><b>wine-catalog</b> ...</li>
          <li><b>owner</b>
            <ul>
              <li><b>name</b>
                <ul>
                  <li><b>first</b> Jude</li>
                  <li><b>family</b> Raisin</li>
                </ul>
              </li>
              <li><b>street</b> 1234 rue des Chateaux</li>
              <li><b>city</b> St-George</li>
              <li><b>province</b> ON</li>
              <li><b>postal-code</b> M7W 7S0</li>
            </ul>
          </li>
          <li><b>location</b>
            <ul>
              <li><b>street</b> 4587 des Futailles</li>
              <li><b>city</b> Vallee des crus</li>
              <li><b>province</b> QC</li>
              <li><b>postal-code</b> H3C 4J8</li>
            </ul>
          </li>
          <li><b>cellar</b>
            <ul>
              <li><b>wine</b> code="C00043125"
                <ul>
                  <li><b>purchaseDate</b> 2005-06-20</li>
                  <li><b>quantity</b> 2</li>
                  <li><b>comment</b>
                    <ul>
                      <li><b>bold</b> Guy Lapalme, Montreal</li>: should reorder soon
                    </ul>
                  </li>
                </ul>
              </li>
              ....
              <li><b>wine</b> code="C00929026"
                <ul>
                  <li><b>purchaseDate</b> 2003-10-15</li>
                  <li><b>quantity</b> 1</li>
                  <li><b>comment</b>
                    <ul>for <li><b>bold</b> big</li> parties</ul>
                  </li>
    ...
  </body>
</html>


Figure 4.1: [compactHTML.jpg]: HTML display of listing 4.1


4.3 Transformation in HTML

4.3.1 Bulleted Lists

XSLT was designed from the start to transform trees into other trees. One kind of tree easy to produce with XSLT from an XML tree is a well-formed HTML document, most often an XHTML document which is a valid XML file that can also be displayed directly by most modern web browsers. When the body of a template contains XML elements that are not XSLT elements (i.e. transformation instructions), they are copied verbatim to the output. So it is relatively easy to build the structure of an HTML document in which only some parts will be processed. This is similar in spirit to the backquote macro processing in Lisp. To transform the cellar book (listing 2.2) into the HTML of listing 4.1 (rendered in figure 4.1) we can use the code given in listing 4.2 which has three templates:

• one matching the root element (line 10) that produces the overall structure of the HTML file with its head and body elements. The processing of subelements (line 17) is called within an unnumbered list delimited by ul tags.

• matching attributes (line 23) is done by outputting a space (with an entity defined on line 3), the name of the attribute followed by an equal sign and its value within double quotes.

• other elements (line 28) are transformed by outputting the name of the element returned by the function local-name() in bold (line 30) followed by its attributes. If the element is a node without any child element (line 33) then it is output with only its content, otherwise (line 36) a new unnumbered list is started and template matching is applied on children nodes (line 38).

The * in the match attribute (line 28) indicates that this rule applies to all element nodes not matched by a more specific rule such as the one on line 10 that matches only the root node.

An observant reader might wonder where the text within a comment element is coming from, because there is no explicit rule in our program for this. This output is achieved by a built-in template which says that text nodes (i.e. a template with the attribute match="text()") have <xsl:value-of select="."/> as content. The same built-in rule exists for attribute nodes, but attributes are not children of their element, so they are only processed when they are explicitly selected (line 31); here we output both the attribute name and value.
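The effect of the built-in rules can be checked with a stylesheet that defines a single template. This sketch relies on the XSLT processor of the JDK; the class name BuiltInRulesSketch and the small owner document are assumptions made for illustration. Only first has an explicit rule: the other elements are traversed by the built-in rule for elements, their text is copied by the built-in rule for text nodes, and the province attribute does not appear because nothing selects it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class BuiltInRulesSketch {
    // Hypothetical fragment inspired by the owner element of the cellar book.
    static final String XML =
        "<owner><name><first>Jude</first><family>Raisin</family></name>"
      + "<city province='ON'>St-George</city></owner>";

    // A single explicit rule; everything else is left to the built-in rules.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='first'>[<xsl:value-of select='.'/>]</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the document and returns the text output.
    static String run() {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(XML)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```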

Listing 4.2: [compactHTML.xsl]: XSLT transformation to produce a bulleted list (listing 4.1) from the cellar book (listing 2.2)

     <?xml version="1.0" encoding="UTF-8"?>
     <!DOCTYPE stylesheet [
     <!ENTITY space "<xsl:text> </xsl:text>">
     ]>
  5  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                     version="1.0">
       <xsl:strip-space elements="*"/>
       <xsl:output indent="yes"/>

 10    <xsl:template match="/">
         <html xmlns="http://www.w3.org/1999/xhtml">
           <head>
             <title>HTML compaction of the XML file</title>
           </head>
 15        <body>
             <ul>
               <xsl:apply-templates/>
             </ul>
           </body>
 20      </html>
       </xsl:template>

       <xsl:template match="@*">
         &space;<xsl:value-of select="local-name()"/>
 25      <xsl:text>="</xsl:text><xsl:value-of select="."/><xsl:text>"</xsl:text>
       </xsl:template>

       <xsl:template match="*">
         <li>
 30        <b><xsl:value-of select="local-name()"/></b>
           <xsl:apply-templates select="@*"/>
           <xsl:choose>
             <xsl:when test="count(*)=0">
               &space;<xsl:value-of select="."/>
 35          </xsl:when>
             <xsl:otherwise>
               <ul>
                 <xsl:apply-templates/>
               </ul>
 40          </xsl:otherwise>
           </xsl:choose>
         </li>
       </xsl:template>

 45  </xsl:stylesheet>

4.3.2 Table

XSLT can also be used to select information in an XML file to produce tabulated information. Selecting only red wines in our wine catalog (listing 2.3) and outputting an HTML table of a subset of the available information for each (figure 4.2) can be done with the XSLT stylesheet given in listing 4.4.

The stylesheet in listing 4.4 first defines a template matching the root node (line 13). Because the wine catalog is defined in a specific namespace, its prefix must be declared (line 4) and used for selection. This template outputs the overall structure of the XHTML file like in the previous example and then selects (line 29) only wines whose color property is red. A value set by xsl:variable or xsl:param at the level of the stylesheet (line 5) acts as a global variable. But when it has been defined as a parameter of the stylesheet, it is possible to change its value when the stylesheet is called by the XSLT processor. The way of setting this value from outside the stylesheet depends on each processor. So that the value of the select attribute is the character string 'red' and not the value associated with the node red (which is empty at this moment), single quotes must be added within the double quotes indicating the value of the attribute. Note again the use of the namespace prefix. The predicate used in square brackets limits the nodes to which the templates will be applied but does not change the context of the node, which is still the wine node.

The output for each selected wine is defined with a template that applies to each cat:wine (line 36). It outputs, in a single line of the table, the values of its attributes (namespaces do not apply to attributes), its color, its year (right aligned) and formats its price to start with a dollar sign (right aligned) and volume of each bottle in milliliters. As the information in the wine catalog is not given in milliliters, we call (line 46) a named template to transform it appropriately.

The toML named template (line 53) has a parameter named fmt referred to by the XPath expression $fmt within the template. This is a very simple template that chooses the value to return depending on the value of the format attribute that has been given as value of the parameter when the named template was called (line 46).
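With the XSLT processor of the JDK, one way of setting a stylesheet parameter from outside is the setParameter method of javax.xml.transform.Transformer. The sketch below assumes a hypothetical three-wine catalog and a reduced stylesheet that keeps only the color parameter and the selection predicate discussed above; the class name ParamSketch and the wine names A, B and C are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ParamSketch {
    static final String CAT = "http://www.iro.umontreal.ca/lapalme/wine-catalog";

    // Builds one hypothetical catalog entry with a name and a color.
    static String wine(String name, String color) {
        return "<cat:wine name='" + name + "'><cat:properties><cat:color>" + color
             + "</cat:color></cat:properties></cat:wine>";
    }

    static final String XML =
        "<cat:wine-catalog xmlns:cat='" + CAT + "'>"
      + wine("A", "red") + wine("B", "white") + wine("C", "red")
      + "</cat:wine-catalog>";

    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:cat='" + CAT + "'>"
      + "<xsl:output method='text'/>"
        // Default value: the string 'red', hence the nested single quotes.
      + "<xsl:param name='color' select=\"'red'\"/>"
      + "<xsl:template match='/cat:wine-catalog'>"
      +   "<xsl:apply-templates select='cat:wine[cat:properties/cat:color=$color]'/>"
      + "</xsl:template>"
      + "<xsl:template match='cat:wine'><xsl:value-of select='@name'/>;</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet; a null color keeps the default declared in xsl:param.
    static String run(String color) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            if (color != null) {
                t.setParameter("color", color);
            }
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(XML)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(null));     // default color
        System.out.println(run("white"));  // color given by the caller
    }
}
```

Other processors offer the same possibility through their own command line options or APIs.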

Listing 4.4: [WineCatalog.xsl]: XSLT to select the red wines in the wine catalog (listing 2.3) and producing listing 4.3 shown as figure 4.2

     <?xml version="1.0" encoding="UTF-8" ?>
     <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
         version="1.0"
         xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog">
  5    <xsl:param name="color" select="'red'"/>

       <xsl:output method = "xml"
           doctype-public = "-//W3C//DTD XHTML 1.0 Strict//EN"
           doctype-system =
 10          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
           indent = "yes" encoding = "UTF-8"/>

       <xsl:template match="/cat:wine-catalog">
         <html>
 15        <head><title>Wine Catalog</title></head>
           <body>

             <h1>Wine Catalog (<xsl:value-of select="$color"/> only)</h1>

 20          <table border="1">
               <tr>
                 <th width="200">Wine Name</th>
                 <th>Code</th>
                 <th>Color</th>
 25              <th>Year</th>
                 <th>Price</th>
                 <th>ml</th>
               </tr>
               <xsl:apply-templates
 30                select="cat:wine[cat:properties/cat:color=$color]"/>
             </table>
           </body>
         </html>
       </xsl:template>
 35
       <xsl:template match="cat:wine">
         <tr>
           <td><xsl:value-of select="@name"/></td>
           <td><xsl:value-of select="@code"/></td>
 40        <td><xsl:value-of select="cat:properties/cat:color"/></td>
           <td align="right"><xsl:value-of select="cat:year"/></td>
           <td align="right">
             <xsl:value-of select="format-number(cat:price,'$0.00')"/>
           </td>
 45        <td align="right">
             <xsl:call-template name="toML">
               <xsl:with-param name="fmt" select="@format"/>
             </xsl:call-template>
           </td>
 50      </tr>
       </xsl:template>

       <xsl:template name="toML">
         <xsl:param name="fmt"/>
 55      <xsl:choose>
           <xsl:when test="$fmt='375ml'">375</xsl:when>
           <xsl:when test="$fmt='750ml'">750</xsl:when>
           <xsl:when test="$fmt='1l'">1000</xsl:when>
           <xsl:when test="$fmt='magnum'">1500</xsl:when>
 60        <xsl:otherwise>big</xsl:otherwise>
         </xsl:choose>
       </xsl:template>
     </xsl:stylesheet>

Listing 4.3: [WineCatalog.html]: HTML tabular output (slightly reformatted and trimmed to fit on the page) of the red wines in the wine catalog produced by the transformation of listing 2.3

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog">
  <head><title>Wine Catalog</title></head>
  <body>
    <h1>Wine Catalog (red only)</h1>
    <table border="1">
      <tr><th width="200">Wine Name</th>
          <th>Code</th><th>Color</th><th>Year</th>
          <th>Price</th><th>ml</th></tr>
      <tr><td>Domaine de l'Ile Margaux</td>
          <td>C00043125</td>
          <td>red</td>
          <td align="right">2002</td>
          <td align="right">$22.80</td>
          <td align="right">750</td>
      </tr>
      <tr><td>Prado Rey Roble</td>
          <td>C00929026</td>
          <td>red</td>
          <td align="right">2002</td>
          <td align="right">$35.25</td>
          <td align="right">1500</td>
      </tr>
    </table>
  </body>
</html>

Figure 4.2: [WineCatalog.jpg]: HTML rendering of listing 4.3. It was produced by the execution of the stylesheet given in listing 4.4

4.3.3 Computing New Information

XSLT can also be used for more complex selections and transformations. We will now show how to create a web page presenting the content of the cellar and integrating information from the wine catalog. The end result is shown in figure 4.3 (an outline of the underlying HTML is in listing 4.5). There are external links in order to get more information about the wines directly from the website of the Societe des Alcools du Quebec (SAQ). There are two similar links for each wine, only to show and compare two ways of creating them in XSL.

Listing 4.5: [CellarBook.html]: HTML output (slightly reformatted and trimmed to fit on the page) about the cellar. The XSLT code is shown in listing 4.6. Because of the use of a namespace in the html tag, the XML processor has generated xmlns="" attributes for all top level HTML elements.

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog">
  <head><title>Cellar of Jude Raisin</title></head>
  <body>
    <h1>Cellar of Jude Raisin</h1>
    <table xmlns="" border="1">
      <tr><th>Personal address</th><th>Cellar address</th></tr>
      <tr>
        <td>1234 rue des Chateaux<br/>St-George<br/>
            ON<br/>M7W 7S0<br/></td>
        <td>4587 des Futailles<br/>Vallee des crus<br/>
            QC<br/>H3C 4J8<br/></td>
      </tr>
    </table>
    <p xmlns=""/>
    <table xmlns="" border="1">
      <tr><th>Code</th><th>Name</th><th>Purchase date</th>
          <th>Rating</th><th>Nb bottles</th></tr>
      <tr>
        <td>00043125</td>
        <td><a href="http://www.saq.com/...rech=00043125">
              Domaine de...</a>
            <br/>
            <i>
              <a href="http://www.saq.com/...rech=00043125">
                Domaine de...</a></i>
        </td>
        <td align="right">2005-06-20</td>
        <td align="center"/>
        <td align="right">2</td>
      </tr>
      ....
      <tr>
        <td colspan="3">Estimated value</td>
        <td align="right">333.75 $</td>
        <td align="right">14</td>
      </tr>
    </table>
    <h3 xmlns="">Comments</h3>
    C00043125 :<b xmlns="">Guy Lapalme, Montreal</b>: should reorder soon
    ....
  </body>
</html>

Figure 4.3: [CellarBook.jpg]: HTML display of listing 4.5

Listing 4.6 illustrates some new features. The root template (line 15) creates the high-level structure of the XHTML file: the title of the page, also displayed at the top of the page, refers to the name of the owner. In listing 2.2, element name (line 12) is structured in two elements: first and family. When the value of such an element is returned by an xsl:value-of, it is the text content of all its descendants; in fact, there are 5 parts in this case:

• the text node comprising a \n² following the name opening tag and white space until the start of first

• the content of the element first

• the \n and spaces between the closing tag of first and the opening tag of family

• the content of the element family

• the \n and spaces between the closing tag of family and the closing tag of name

Given that a sequence of \n and spaces in HTML is displayed as a single space by the browser, we get an appropriate display in this case. But this shows that dealing with text content can become a bit tricky. The next section will explain how to deal with some of the most frequent cases.
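The five parts listed above can be made visible by evaluating the string value of the name element with the XPath API of the JDK. This is a sketch on a hypothetical indented fragment (the class name TextValueSketch is an assumption); it also shows the standard XPath 1.0 function normalize-space, one common way of collapsing such runs of white space.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class TextValueSketch {
    // Hypothetical fragment indented like a hand-written instance document.
    static final String XML =
        "<owner>\n"
      + "  <name>\n"
      + "    <first>Jude</first>\n"
      + "    <family>Raisin</family>\n"
      + "  </name>\n"
      + "</owner>";

    // Evaluates an XPath expression and returns its result converted to a string.
    static String eval(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expr, new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The raw string value keeps the end of lines and the indentation.
        System.out.println("[" + eval("string(/owner/name)") + "]");
        // normalize-space trims both ends and collapses inner white space.
        System.out.println("[" + eval("normalize-space(/owner/name)") + "]");
    }
}
```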

The content of the cellar book is obtained by an explicit call (line 26) to the template defined at line 34, which creates a table with the address of the owner (line 40) and of the cellar (line 44). It then calls the cellar template (line 51). The lines of the addresses are obtained by looping over all elements with a for-each and outputting a <br/> after the text value of each element of the owner and location elements. Because we want to skip the first element of the owner (the owner name has already been given at the top), we only keep those (line 40) with a position number greater than 1.

The comment template (line 55) outputs the value of the code of the parent node, a colon and the text value of the comment followed by a line break. In getting the content of the comment, the cat:bold elements will be processed by the appropriate template (line 61) to transform them into HTML b tags.

² \n is the notation for an end-of-line character.

The cellar template (line 66) produces a table of information about wines in the cellar sorted by their code. This is why the content of the xsl:apply-templates element (line 75) is an xsl:sort element indicating the sorting key and the sort order. The last line of the table contains an estimated total value of the cellar obtained by a call (line 81) to the total named template. To compute the total number of bottles (line 88), we can use the predefined sum function. Finally, if there are any comments in the wine elements of the cellar (line 93), we add a Comments section and output each of them, also in increasing order of wine codes so that they are in the same order as in the table of wines.

We define the wine template (line 102) to create a row in the table of wines. We first define a variable code (line 104) having the value of the code attribute. We display it without the first letter, to make it match the official SAQ code.³ In order to gather some information about this wine from the wine catalog, we pass the wine node from the wine-catalog as a parameter to the call to the nameAndUrl template (line 107). The link between the current element and the corresponding element in the catalog is done using the value of the code variable given in the XPath expression. The remaining elements of the row are the purchase date (right-aligned), a number of stars corresponding to the rating and the quantity (right-aligned).

The named template total (line 121) computes the estimated value of the cellar. Since there is no assignment in XSLT, we cannot change the value of a variable in a template, so we call total recursively in order to update the sum. This is why this template has two parameters: index is the index of the current node in the list of wines and sum is used as an accumulator. These are initialized at the call (line 81) respectively to 1 and 0. The total template first tests whether index is greater than the number of wine elements (line 125). If so, it returns the current value of sum properly formatted. Otherwise, we process the indexth wine, for which we get the quantity (line 129) and the code. We then search the wine catalog to get its price, which is used to compute a new value for the sum parameter for the recursive call to total (line 133).
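The accumulator technique can be isolated in a much smaller stylesheet. The sketch below is a simplification made for illustration, not the code of listing 4.6: it assumes that each hypothetical wine carries its own price element (so no catalog lookup is needed), it writes the positional test explicitly as position() = $index, and it outputs the sum without calling format-number. The class name RecursiveTotalSketch is also an assumption.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RecursiveTotalSketch {
    // Hypothetical cellar: 2 x 22.50 + 1 x 35.25 + 3 x 10.00.
    static final String XML =
        "<cellar>"
      + "<wine><quantity>2</quantity><price>22.50</price></wine>"
      + "<wine><quantity>1</quantity><price>35.25</price></wine>"
      + "<wine><quantity>3</quantity><price>10.00</price></wine>"
      + "</cellar>";

    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/cellar'>"
      +   "<xsl:call-template name='total'>"
      +     "<xsl:with-param name='index' select='1'/>"
      +     "<xsl:with-param name='sum' select='0'/>"
      +   "</xsl:call-template>"
      + "</xsl:template>"
      + "<xsl:template name='total'>"
      +   "<xsl:param name='index'/>"
      +   "<xsl:param name='sum'/>"
      +   "<xsl:choose>"
            // Past the last wine: the accumulator holds the result.
      +     "<xsl:when test='$index &gt; count(wine)'>"
      +       "<xsl:value-of select='$sum'/>"
      +     "</xsl:when>"
            // Otherwise add the value of the current wine and recurse.
      +     "<xsl:otherwise>"
      +       "<xsl:variable name='w' select='wine[position() = $index]'/>"
      +       "<xsl:call-template name='total'>"
      +         "<xsl:with-param name='index' select='$index + 1'/>"
      +         "<xsl:with-param name='sum' select='$sum + $w/quantity * $w/price'/>"
      +       "</xsl:call-template>"
      +     "</xsl:otherwise>"
      +   "</xsl:choose>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the document and returns the text output.
    static String run() {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(XML)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because a named template does not change the context, the relative path wine is evaluated from the cellar element at every level of the recursion.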

nameAndUrl (line 142) is another named template that receives a wine element as parameter. From this wine element, we create an XHTML link (line 146) with an a element with an href attribute whose value is a string giving the address of the SAQ website. This is a bit involved because the link must be created dynamically using the xsl:element and the xsl:attribute elements with appropriate contents. For that, we use two variables:

$SAQquery set in xsl:param for the whole stylesheet (line 5). This string corresponds to the URL-encoded CGI call to query the SAQ database from a webpage.

$SAQcode is the code of the wine (minus its first character).

The values are used to create the value of the attribute href of the a element in the resulting HTML.

Because creating such dynamic elements and attributes is often required, XSLT designers have defined a non-XML shortcut formalism (called attribute value templates in the XSLT specification) in which we only indicate the shape of the element to create. Within this element, we can use a variable between braces that will be replaced by its current value (line 155). One should consult the documentation to determine which contexts allow this shortcut notation.

³ We had added an initial C in order to be the right format for an ID type of a DTD or XML Schema.
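The two ways of creating the link can be compared on a one-element document. In this sketch, the URL http://example.org/search?code=, the parameter name base and the class name LinkSketch are placeholders chosen for the example; only the stripping of the first character of the code follows the text above. Both template bodies are run with the XSLT processor of the JDK and should produce the same a element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class LinkSketch {
    // Hypothetical catalog entry.
    static final String XML = "<wine code='C00043125' name='Domaine'/>";

    // Explicit construction with xsl:element and xsl:attribute.
    static final String EXPLICIT =
        "<xsl:element name='a'>"
      +   "<xsl:attribute name='href'>"
      +     "<xsl:value-of select='concat($base,$SAQcode)'/>"
      +   "</xsl:attribute>"
      +   "<xsl:value-of select='@name'/>"
      + "</xsl:element>";

    // Shortcut: the braces are replaced by the value of the expressions.
    static final String SHORTCUT =
        "<a href='{$base}{$SAQcode}'><xsl:value-of select='@name'/></a>";

    // Wraps a template body in a stylesheet and applies it to the wine element.
    static String run(String body) {
        String xsl =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
          + "<xsl:param name='base'>http://example.org/search?code=</xsl:param>"
          + "<xsl:template match='wine'>"
          +   "<xsl:variable name='SAQcode' select='substring(@code,2)'/>"
          +   body
          + "</xsl:template>"
          + "</xsl:stylesheet>";
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(XML)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(EXPLICIT));
        System.out.println(run(SHORTCUT));
    }
}
```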

Listing 4.6: [CellarBook.xsl]: XSLT stylesheet to produce information about the cellar (listing 2.2). The resulting HTML (listing 4.5) is displayed in figure 4.3.

     <?xml version="1.0" encoding="UTF-8"?>
     <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                     version="1.0"
                     xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog">
  5    <xsl:param name="SAQquery">http://www.saq.com/pls/devsaq/recherche.pp_build_query?p_iden_tran=19944336&amp;P_type_rese_dist=99&amp;P_NIVE=2&amp;P_mot_rech=</xsl:param>


       <xsl:output method="xml"
         doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
 10      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
         indent="yes"
         encoding="UTF-8"/>


 15    <xsl:template match="/">
         <html xmlns="http://www.w3.org/1999/xhtml">
           <head>
             <title>Cellar of <xsl:value-of
               select="cellar-book/owner/name"/>
 20          </title>
           </head>
           <body>
             <h1>Cellar of <xsl:value-of
               select="cellar-book/owner/name"/>
 25          </h1>
             <xsl:apply-templates select="cellar-book"/>
           </body>
         </html>
       </xsl:template>
 30



       <xsl:template match="cellar-book">
 35      <table border="1">
           <tr>
             <th>Personal address</th>
             <th>Cellar address</th>
           </tr>
 40        <tr>
             <td><xsl:for-each select="owner/*[position()>1]">
               <xsl:apply-templates/><br/>
             </xsl:for-each></td>
             <td><xsl:for-each select="location/*">
 45            <xsl:apply-templates/><br/>
             </xsl:for-each>
             </td>
           </tr>
         </table>
 50      <p/>
         <xsl:apply-templates select="cellar"/>
       </xsl:template>


 55    <xsl:template match="comment">
         <xsl:value-of select="../@code"/> :
         <xsl:apply-templates/><br/>
       </xsl:template>

 60    <xsl:template match="cat:bold">
         <b><xsl:apply-templates/></b>
       </xsl:template>


 65    <xsl:template match="cellar">

         <table border="1">
           <tr>
             <th>Code</th>
 70          <th>Name</th>
             <th>Purchase date</th>
             <th>Rating</th>
             <th>Nb bottles</th>
           </tr>
 75        <xsl:apply-templates select="wine">
             <xsl:sort select="@code" order="ascending"/>
           </xsl:apply-templates>
           <tr>
             <td colspan="3">Estimated value</td>
 80          <td align="right">
               <xsl:call-template name="total">
                 <xsl:with-param name="index" select="1"/>
                 <xsl:with-param name="sum" select="0"/>
               </xsl:call-template>
 85          </td>
             <td align="right">
               <xsl:value-of select="sum(wine/quantity)"/>

             </td>
 90        </tr>
         </table>

         <xsl:if test="count(wine/comment)>0">
           <h3>Comments</h3>
 95        <xsl:apply-templates select="wine/comment">
             <xsl:sort select="../@code" order="ascending"/>
           </xsl:apply-templates>
         </xsl:if>
       </xsl:template>
100

       <xsl:template match="wine">
         <tr>
           <xsl:variable name="code" select="@code"></xsl:variable>
105        <td><xsl:value-of select="substring($code,2)"/></td>
           <td><xsl:call-template name="nameAndUrl">
             <xsl:with-param name="wine"
               select="/cellar-book/cat:wine-catalog/cat:wine[@code=$code]"/>

110        </xsl:call-template></td>
           <td align="right"><xsl:value-of select="purchaseDate"/></td>
           <td align="center">
             <xsl:value-of select="substring('*****',1,rating/@stars)"/>

115        </td>
           <td align="right"><xsl:value-of select="quantity"/></td>
         </tr>
       </xsl:template>

120    <xsl:template name="total">
         <xsl:param name="index"/>
         <xsl:param name="sum"/>

         <xsl:choose>
125        <xsl:when test="$index>count(./wine)">
             <xsl:value-of select="format-number($sum,'$0.00')"/>
           </xsl:when>
           <xsl:otherwise>
             <xsl:variable name="qty" select="./wine[$index]/quantity"/>
130          <xsl:variable name="code" select="./wine[$index]/@code"/>
             <xsl:variable name="price"
               select="/cellar-book/cat:wine-catalog/cat:wine[@code=$code]/cat:price"/>
             <xsl:call-template name="total">
               <xsl:with-param name="index" select="$index + 1"/>
135            <xsl:with-param name="sum" select="$sum + $qty*$price"/>
             </xsl:call-template>
           </xsl:otherwise>
         </xsl:choose>
       </xsl:template>
140

       <xsl:template name="nameAndUrl">
         <xsl:param name="wine"/>
         <xsl:variable name="SAQCode" select="substring($wine/@code,2)"/>
145      <xsl:element name="a">
           <xsl:attribute name="href">
             <xsl:value-of select="$SAQquery"/>
             <xsl:value-of select="$SAQCode"/>

150        </xsl:attribute>
           <xsl:value-of select="$wine/@name"/>
         </xsl:element>
         <br/>

155      <i><a href="{$SAQquery}{$SAQCode}">
           <xsl:value-of select="$wine/@name"/>
         </a></i>

       </xsl:template>
160
     </xsl:stylesheet>

4.4 Transformation into a Compact Textual Form

We will now use an XSLT stylesheet (shown in listing 4.8) to produce a compact form of an XML file. The output of this transformation will only be a stream of plain characters without any tags. This illustrates that XSLT can be used to transform XML input into something other than XML.

The algorithm follows the same pattern as the one explained in section 4.3.1. We recursively follow the structure of the tree and output a corresponding stream of characters. In our case, only one rule is applied to all element nodes: we output the name of the element and then output its attributes and children, but with an indentation corresponding to the number of characters in the name of the element. The root node has an indentation of 0. Because we want an output with few blank lines, we start a new line only after the first attribute or child element has been output. This rule is implemented by the template starting on line 50. The * in the match attribute indicates that this rule applies to all element nodes not matched by a more specific rule such as the one on line 10 that matches only the root node. In this latter rule, we apply the general rule to all children with <xsl:apply-templates> without any select attribute. All templates have a parameter indicating the indentation, passed with an xsl:with-param element and declared in the template with an xsl:param element.

Listing 4.7: [CellarBook.txt]: Text compaction (some lines have been omitted here) of the cellar book of listing 2.2. It was produced by the stylesheet in listing 4.8.

cellar-book[@noNamespaceSchemaLocation[CellarBook.xsd]
            wine-catalog[@schemaLocation[http: ... catalog WineCatalog.xsd]
                         @base[WineCatalogXSD.xml]
                         ....
                         wine[@name[Prado Rey Roble]
                              @appellation[Ribera-del-duero]
                              @classification[d.o.]
                              @code[C00929026]
                              @format[magnum]
                              properties[color[red]
                                         alcoholic-strength[12.5]
                                         nature[still]]
                              origin[country[Spain]
                                     region[Old Castille]
                                     producer[Real Sitio de Ventosilla SA]]
                              price[35.25]
                              year[2002]]]
            owner[name[first[Jude]
                       family[Raisin]]
                  street[1234 rue des Chateaux]
                  city[St-George]
                  province[ON]
                  postal-code[M7W 7S0]]
            location[street[4587 des Futailles]
                     city[Vallee des crus]
                     province[QC]
                     postal-code[H3C 4J8]]
            cellar[wine[@code[C00043125]
                        purchaseDate[2005-06-20]
                        quantity[2]
                        comment[bold[Guy Lapalme, Montreal]
                                : should reorder soon]]
                   .....
                   wine[@code[C00929026]
                        purchaseDate[2003-10-15]
                        quantity[1]
                        comment[for
                                bold[big]
                                parties]]]]

Characters are put in the output stream literally by xsl:text elements or computed by xsl:value-of elements whose select attribute is an expression containing functions or special values such as "." which designates the content of the current node, or local-name() which is the name of the element. Conditions can be introduced with an xsl:if element whose test attribute can refer to the current context; here count(../@*) counts the number of attributes of the parent and position() indicates the rank of the current node among its siblings. With this information, we can decide whether the current line should be ended and insert the appropriate number of blanks according to the value of the indent parameter before outputting the next information. After the output of the name of the element, we apply the templates to all attribute, element and text nodes of this node (this is achieved with the select attribute on line 59). In our case, we also update the current indentation that will be given to all these nodes. The output of all these recursive template applications will be enclosed in a pair of square brackets.

Because text in a stylesheet is copied to the output as it appears, including newlines and leading and trailing spaces, and because whitespace between the elements of the source document is also copied by default, it can be difficult to get a specific output format. This is not really a problem if the output of the transformation is HTML, because in this case these spurious spaces and newlines are removed before being displayed to the user. In our case, the transformation output is given as is to the user, so it is simpler to output only the text we explicitly ask for, without any stray newlines or leading and trailing spaces. This is why on line 8 we declare with xsl:strip-space that whitespace-only text nodes are to be ignored in all elements (*). To output explicit spaces and newlines, we declare at the start of the file (up to line 5) two entities that designate a space character and a line break.

On line 27, we define the template to output an attribute, which is simply the name of the attribute preceded by a @ and followed by its value in square brackets. If the attribute is not the first one, the line is ended and a new indentation is produced.

On line 40, in the template for text nodes, we also check whether we need to end the current line and then we output the value of the current node with all extraneous space removed using the normalize-space function. This removes all white space at the start and end of the value and leaves only one space between non-space characters. Note that the xsl:strip-space declaration (line 8) only discards text nodes made up entirely of spaces and newlines, whereas the XPath function normalize-space removes leading and trailing spaces and newlines inside the text of the source XML file.
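The effect of normalize-space can be checked outside of any stylesheet. The following minimal sketch evaluates the function with the XPath API of the JDK; the class name NormalizeSpaceDemo, the helper normalized and the sample comment element are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class NormalizeSpaceDemo {
    /** Returns normalize-space() of the string value of the node selected by path. */
    static String normalized(String xml, String path) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate("normalize-space(" + path + ")",
                                  new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate " + path, e);
        }
    }

    public static void main(String[] args) {
        // The text keeps the newlines and indentation of a hand-edited file.
        String xml = "<comment>\n   for   big\n   parties  </comment>";
        System.out.println("[" + normalized(xml, "/comment") + "]");
    }
}
```

Leading and trailing whitespace disappears and each inner run of spaces and newlines collapses to a single space, so the text printed between the brackets is "for big parties".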

To output a given number of spaces, we have defined a named template called spaces (line 18) with one parameter. If this template is called with a parameter greater than 0, it outputs one space and then calls itself recursively until the required number of spaces has been written. This is another example of how looping can be achieved. However, to loop on elements, we usually rely on the implicit repetition inherent to the xsl:apply-templates element.

Listing 4.8: [compact.xsl]: Stylesheet used to transform the cellar book instance document (listing 2.2) into listing 4.7.

     <!DOCTYPE stylesheet [
     <!ENTITY space "<xsl:text> </xsl:text>">
     <!ENTITY cr "<xsl:text>
     </xsl:text>">
  5  ]>
     <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                     version="1.0">
       <xsl:strip-space elements="*"/>

 10    <xsl:template match="/">
         &cr;
         <xsl:apply-templates>
           <xsl:with-param name="indent" select="0"/>
         </xsl:apply-templates>
 15      &cr;
       </xsl:template>

       <xsl:template name="spaces">
         <xsl:param name="nb"/>
 20      <xsl:if test="$nb>0">
           &space;<xsl:call-template name="spaces">
             <xsl:with-param name="nb" select="$nb - 1"/>
           </xsl:call-template>
         </xsl:if>
 25    </xsl:template>

       <xsl:template match="@*">
         <xsl:param name="indent"/>
         <xsl:if test="position()>1">
 30        &cr;<xsl:call-template name="spaces"><xsl:with-param
               name="nb" select="$indent"/>
           </xsl:call-template>
         </xsl:if>
         <xsl:text>@</xsl:text>
 35      <xsl:value-of select="local-name()"/>
         <xsl:text>[</xsl:text><xsl:value-of
           select="."/><xsl:text>]</xsl:text>
       </xsl:template>

 40    <xsl:template match="text()">
         <xsl:param name="indent"/>
         <xsl:if test="count(../@*)>0 or position()>1">
           &cr;<xsl:call-template name="spaces"><xsl:with-param
               name="nb" select="$indent"/>
 45        </xsl:call-template>
         </xsl:if>
         <xsl:value-of select="normalize-space(.)"/>
       </xsl:template>

 50    <xsl:template match="*">
         <xsl:param name="indent"/>
         <xsl:if test="count(../@*)>0 or position()>1">
           &cr;<xsl:call-template name="spaces"><xsl:with-param
               name="nb" select="$indent"/>
 55        </xsl:call-template>
         </xsl:if>
         <xsl:value-of select="local-name()"/>
         <xsl:text>[</xsl:text>
         <xsl:apply-templates select="@*|*|text()">
 60        <xsl:with-param name="indent"
               select="$indent + string-length(local-name())+1"/>
         </xsl:apply-templates>
         <xsl:text>]</xsl:text>
       </xsl:template>
 65
     </xsl:stylesheet>

4.5 Transformation into PDF with XSL-FO

The previous sections have illustrated the principles of XSLT templates for producing HTML and character output. XSL also defines a more involved and powerful formatting tool: XSL-FO, which stands for eXtensible Stylesheet Language Formatting Objects. It is similar in principle to the Cascading Style Sheets (CSS) defined for HTML to separate the information computation process from the rendering on a specific device (screen, paper, PDA, speech). As shown in the middle of the flow diagram of figure 1.3, transformations can produce Formatting Objects, XML elements in the http://www.w3.org/1999/XSL/Format namespace with prefix fo, which are then rendered on different devices, particularly in PDF.

XSL-FO is a declarative language to describe the content of pages in terms of nested areas, laid out under certain constraints. The main purpose of this approach is the production of printed pages: it allows the definition of the general shape of pages (margins, headers, page numbers, etc.) and the relative placement and nesting of the areas containing the information of the document. Great care has been taken to provide a uniform processing of multiple languages and writing systems (not necessarily going from left to right and top to bottom) in the same page. It is also possible to create HTML-like tables and generalized lists as pairs of items with aligned labels and bodies. We will use these lists to illustrate the nesting of XML elements in our PDF output as shown at the bottom of figure 1.5 and in figure 4.4.

Figure 4.4 shows the three pages generated by the stylesheet of listing 4.9 on listing 2.2.

[Figure 4.4 appears here: three PDF pages showing the cellar book as nested, labeled blocks. The page headers read "cellar-book" (page 1), "cellar-book | wine-catalog | wine | origin" (page 2) and "cellar-book | location" (page 3), each followed by the page number.]

Figure 4.4: [compactFO.pdf]: PDF output of compaction by Formatting Objects (XSL-FO) of listing 2.2. The output spans three pages; note the page headers that show the context and the page number.

Figure 4.5: [compactFO.xml]: Outline of the XSL-FO file produced by running the XSL stylesheet of listing 4.9 on the cellar book example of listing 2.2. This picture was produced with the fold/unfold feature of the <oXygen/> XML editor.

An XML element is displayed with its name in green (see footnote 4) aligned with its contents. In some cases, the label overlaps the value, but we could not find a reliable way of adjusting the position of a list item body depending on the length of its list item label. We simplify by leaving a distance of 30 mm between the start of the label, given by the element name, and the start of the indented block describing the element value. This limitation is understandable because the relative positions of the label and body must be determined when the fo elements are generated by the transformation process, whereas the length of a label is only known when it is rendered on the PDF page.

4.5.1 XSL-FO Input to the Renderer

As the output of an XSL-FO transformation is processed by another program to get the final PDF output, it is a bit difficult to grasp the processing behind the XSL-FO path. So we will go backwards by first looking at the XSL-FO given as input to the renderer. In principle we could write this XML file by hand, but looking at figure 4.5, which shows only the outline of its 1740 lines for three PDF pages, we appreciate the fact that it can be produced by a machine. The figure is the output produced by applying the XSL templates of listing 4.9 to the cellar book instance document (listing 2.2).

An XSL-FO file is an XML file starting with a fo:root element with two child elements:

fo:layout-master-set that describes the shape of the different types of pages that occur in the document and the sequence in which they appear. In our case, we have a simple document so

    fo:simple-page-master (lines 4-7) defines a single model for all pages with 1 cm margins. Within it, the header, defined with fo:region-before, will take the top 1 cm and the real content of the page will start another 1 cm lower. Here we define only a single type of page master but for more complex documents it would be possible to have different page masters for title pages, for first pages, for even or odd pages, etc.

    fo:page-sequence-master (lines 8-10) declares that the document is an infinite repetition of the above page.

fo:page-sequence (lines 12-1739) defines the content of the document that will be rendered according to the page layout we have defined above. The content of the page is given in the fo:flow element (lines 28-1738), which starts in the current page and continues in the region-body of the next pages. In fact the only visible text from figure 4.5 that appears in figure 4.4 is the first word (cellar-book) produced on line 32. Content that appears at the same place within each page, such as headers and footers, is called static-content (line 12).

4 The green and blue appear as shades of gray on a black and white printed page.

4.5.2 From the Instance Document to the XSL-FO file

We will now look at how to build a stylesheet to produce the XSL-FO file described in the previous section from the XML instance document. Similarly to what we have done to produce HTML output (section 4.3.1), all tree structures defined with elements with the fo namespace prefix will appear verbatim in the output. XSL-FO elements can also be created by xsl:element instructions but we will not need to do it here.

We must start with a fo:root element (line 6 of listing 4.9) with two children:

• fo:layout-master-set (line 7) defines the shape of a page with a fo:simple-page-master element (line 8) that defines its margins relative to the page; within it, we define the region-body in which the content will appear; we also define areas for the header (called fo:region-before) and the footer (not used in our example). Then the sequence of page masters is given (line 14): here a simple repetition of our single page master.

• fo:page-sequence (line 18) refers to a page-sequence-master; the content of the pages is given in the fo:flow element (line 37), which starts in the current page and continues on the next pages within their region-body. The static-content (line 19) is the content of a header divided in two parts: a left (start) aligned string (line 24), computed when the page is laid out, that shows the context of the elements at the start of the page; and a right (end) aligned part (line 31) with the page number.

The overall tree is defined once for the root of the document (line 5) and the traversal of the instance document starts on line 39 within the fo:list-block element in the top-level fo:flow element, which corresponds to line 28 of figure 4.5.

The nested boxes will be at a distance of 30 mm from each other (line 38). We use the same recursive tree traversal algorithm as the one for the HTML compaction (listing 4.2) and text compaction (listing 4.8).

Formatting objects create lists as aligned blocks whose relative size and position must satisfy presentation constraints. As can be seen in the outline of figure 4.5, a fo:list-block (created on line 38 of listing 4.9) is composed of fo:list-items placed one on top of the other. A fo:list-item (line 47) is composed of a fo:list-item-label aligned horizontally with a fo:list-item-body (line 53) even if they are not of the same height. In our example, the top and left borders of blocks are colored to show the nesting of blocks, which corresponds to the nesting of elements in the XML file. The starting horizontal position of each list-item-label is computed from the value of the enclosing block but its end position must be specified; here it is computed by a predefined function, label-end(), which takes into account the value specified for the distance between blocks (line 38). The starting position of the list item body must also be specified, most often again with a predefined function, body-start() (line 53).

The processing for elements starts by creating a new fo:list-item (line 47) with the name in bold green as fo:list-item-label. The fo:list-item-body (line 53) processing is divided in two parts depending on whether there are child elements or attributes:

• when there is no child element (line 55) but possibly a text node, we display the content of the text node in a fo:block.

• when there are child nodes (line 62), the fo:list-item-body is a bordered fo:block whose content is a recursively built (line 73) fo:list-block. As we will explain later, the current context is also computed and saved in the fo:marker (line 67).

Attributes (line 82) are displayed using the labeledValue named template: their name is in blue italicized text (as the fo:list-item-label) and their text content is the fo:list-item-body. Note here the use of a tree fragment value as actual parameters (lines 85 and 89) to the labeledValue named template. fo:inline elements create ordinary inline text. The area for display is created with a fo:block element in the labeledValue named template (line 109). This template creates a fo:list-item comprising a fo:list-item-label and a fo:list-item-body. In both cases (lines 115 and 120), we insert the tree given as parameter by means of xsl:copy-of and not the usual xsl:value-of, which would return only the text content of the tree given as parameter and not the tree value itself.

Text nodes (line 97) are output as a list item with an empty label (see footnote 5), again using the labeledValue named template, and the text content as body for the list item. Text nodes comprising only spaces and newlines are ignored.

As the output of this program is longer than a single page (see figure 4.4), it flows onto the following ones. It is then useful to show, in the header of each page, the context at the start of that page. This is done by creating a fo:marker (line 67) at each new nested block. The current value of the marker at the start of the page is then put in the left part of the header (line 25) (see footnote 6).

To compute a string describing the current context of a block, we use the context named template (line 126), which is called (line 67) with the current node as parameter and which goes up the document tree (line 131) until it is called at the first element of the document excluding the root node; this is why the recursive stopping criterion checks whether a node has a grandparent (line 129). At each recursion level, the template computes a string with all the ancestor elements separated by vertical bars (line 136).
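The same upward recursion can be written against the DOM. The following Java sketch is only an illustration of the stopping test and of the order of concatenation; the class name ContextDemo and its helper parse are hypothetical, and it uses getNodeName where the stylesheet uses local-name(), which is equivalent only for names without a prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ContextDemo {
    /** Names of the ancestors of current, outermost first, separated by " | ". */
    static String context(Node current) {
        Node parent = current.getParentNode();
        // The document element has the document node as parent but no
        // grandparent: this is the stopping test $current/../.. of the template.
        if (parent == null || parent.getParentNode() == null) {
            return current.getNodeName();
        }
        return context(parent) + " | " + current.getNodeName();
    }

    /** Parses XML text into a DOM document (helper for the example only). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse input", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<cellar-book><wine-catalog><wine><origin/>"
                           + "</wine></wine-catalog></cellar-book>");
        Node origin = doc.getElementsByTagName("origin").item(0);
        System.out.println(context(origin));
    }
}
```

For the origin element, this prints cellar-book | wine-catalog | wine | origin, the string shown in the header of the second page of figure 4.4.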

5 For implementation reasons, a label cannot really be empty, so we output a white underline character as a value that will not be visible on white paper.

6 An observant reader might suggest that a better context would be the one exactly at the top instead of the current one which displays the first new element within a page. Looking at the documentation, I am quite sure that this is what the current code should do but it seems that the implementation of FOP that I used does not implement this correctly.

Listing 4.9: [compactFO.xsl]: Stylesheet used to transform the information of the cellar book (listing 2.2) into the colored nested blocks representation of figure 4.4.

     <?xml version="1.0" encoding="UTF-8"?>
     <xsl:stylesheet version="1.0" xmlns:fo="http://www.w3.org/1999/XSL/Format"
                     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  5    <xsl:template match="/">
         <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
           <fo:layout-master-set>
             <fo:simple-page-master master-name="a-page"
                 margin-bottom="1cm" margin-left="1cm"
 10              margin-right="1cm" margin-top="1cm">
               <fo:region-before extent="1cm"/>
               <fo:region-body margin-top="1cm" margin-bottom="1cm"/>
             </fo:simple-page-master>
             <fo:page-sequence-master master-name="page-layout">
 15            <fo:repeatable-page-master-reference master-reference="a-page"/>
             </fo:page-sequence-master>
           </fo:layout-master-set>
           <fo:page-sequence master-reference="page-layout">
             <fo:static-content flow-name="xsl-region-before">
 20            <fo:list-block provisional-distance-between-starts="12cm"
                   provisional-label-separation="0cm">
                 <fo:list-item>
                   <fo:list-item-label>
                     <fo:block text-align="start">
 25                    <fo:retrieve-marker retrieve-class-name="context"
                           retrieve-position="first-including-carryover"
                           retrieve-boundary="page-sequence"/>
                     </fo:block>
                   </fo:list-item-label>
 30                <fo:list-item-body>
                     <fo:block text-align="end"> Page <fo:page-number/>
                     </fo:block>
                   </fo:list-item-body>
                 </fo:list-item>
 35            </fo:list-block>
             </fo:static-content>
             <fo:flow flow-name="xsl-region-body">
               <fo:list-block provisional-distance-between-starts="30mm">
                 <xsl:apply-templates/>
 40            </fo:list-block>
             </fo:flow>
           </fo:page-sequence>
         </fo:root>
       </xsl:template>
 45
       <xsl:template match="*">
         <fo:list-item>
           <fo:list-item-label end-indent="label-end()">
             <fo:block font-weight="bold" color="green">
 50            <xsl:value-of select="local-name()"/>
             </fo:block>
           </fo:list-item-label>
           <fo:list-item-body start-indent="body-start()">
             <xsl:choose>
 55            <xsl:when test="count(*)=0 and count(@*)=0">
                 <fo:block>
                   <fo:inline font-style="normal" color="black">
                     <xsl:value-of select="."/>
                   </fo:inline>
 60              </fo:block>
               </xsl:when>
               <xsl:otherwise>
                 <fo:block border-color="black"
                     border-left-style="solid" border-left-width="thin"
 65                  border-top-style="solid" border-top-width="thin"
                     padding-left="2mm" space-after="1mm">
                   <fo:marker marker-class-name="context">
                     <xsl:call-template name="context">
                       <xsl:with-param name="current" select="."/>
 70                  </xsl:call-template>
                   </fo:marker>
                   <fo:list-block>
                     <xsl:apply-templates select="@*|*|text()"/>
                   </fo:list-block>
 75              </fo:block>
               </xsl:otherwise>
             </xsl:choose>
           </fo:list-item-body>
         </fo:list-item>
 80    </xsl:template>

       <xsl:template match="@*">
         <xsl:call-template name="labeledValue">
           <xsl:with-param name="label">
 85          <fo:inline font-style="italic" color="blue">
               @<xsl:value-of select="local-name()"/>
             </fo:inline>
           </xsl:with-param>
           <xsl:with-param name="value">
 90          <fo:inline font-style="normal" color="black">
               <xsl:value-of select="."/>
             </fo:inline>
           </xsl:with-param>
         </xsl:call-template>
 95    </xsl:template>

       <xsl:template match="text()">
         <xsl:variable name="content" select="normalize-space(.)"/>
         <xsl:if test="string-length($content)>0">
100        <xsl:call-template name="labeledValue">
             <xsl:with-param name="label">
               <fo:inline color="white">_</fo:inline>
             </xsl:with-param>
             <xsl:with-param name="value" select="$content"/>
105        </xsl:call-template>
         </xsl:if>
       </xsl:template>

       <xsl:template name="labeledValue">
110      <xsl:param name="label"/>
         <xsl:param name="value"/>
         <fo:list-item>
           <fo:list-item-label end-indent="label-end()">
             <fo:block>
115            <xsl:copy-of select="$label"/>
             </fo:block>
           </fo:list-item-label>
           <fo:list-item-body start-indent="body-start()">
             <fo:block>
120            <xsl:copy-of select="$value"/>
             </fo:block>
           </fo:list-item-body>
         </fo:list-item>
       </xsl:template>
125
       <xsl:template name="context">
         <xsl:param name="current"/>
         <xsl:choose>
           <xsl:when test="$current/../..">
130          <xsl:variable name="ancestors">
               <xsl:call-template name="context">
                 <xsl:with-param name="current" select="$current/.."/>
               </xsl:call-template>
             </xsl:variable>
135          <xsl:value-of
                 select="concat($ancestors,' | ',local-name($current))"/>
           </xsl:when>
           <xsl:otherwise>
             <xsl:value-of select="local-name($current)"/>
140        </xsl:otherwise>
         </xsl:choose>
       </xsl:template>

     </xsl:stylesheet>

This section has shown how to produce publication-quality output from an XML file. We have used nested list blocks to align the name of the element with its content, but nested tables could also have been used. This would allow the horizontal centering of the element name with respect to its content. The principles remain the same but the code would be a bit longer because tables have more options. The table would still not be able to adapt dynamically to the length of the information contained in it because the widths of columns have to be given or computed relative to the overall page. Should a reader find a way to solve this problem, we would be very interested to know about it.

4.6 Associating an Instance File to a Stylesheet

Transforming an instance file with a stylesheet is most often done by specifying externally the transformation stylesheet file to apply to a given instance file. This can be achieved using an XML editor in which we can associate an XML file with a stylesheet (and vice versa); some editors also allow the definition of many transformation scenarios. The transformation can also be done in batch mode by specifying a stylesheet to a transformation engine. For example, in the companion website of this report, we show a simple Java program which can be used as a Unix filter on its standard input. We can get on the standard output the compact text output of the wine catalog with the following call:

java Transform compact.xsl < WineCatalog.xml

This program also allows run-time parameters to be passed to the stylesheet (declared with top-level xsl:param elements as shown in section 4.3.2). To get an HTML file with the table of the white wines of the catalog (similar to listing 4.3), we can use the following:

java Transform WineCatalog.xsl color white <WineCatalog.xml >whites.html
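The Transform program itself is not reproduced in this report. As a rough idea of what such a filter involves, here is a minimal sketch built on the javax.xml.transform API of the JDK. It is an assumption about how such a tool could be written, not the program of the companion website: the class name TransformSketch, the run method and the convention that parameters follow the stylesheet as name/value pairs are all choices made for this example.

```java
import java.io.File;
import java.util.Arrays;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TransformSketch {
    /**
     * Applies a stylesheet to an input document. params holds alternating
     * names and values for the top-level xsl:param elements of the stylesheet.
     */
    static void run(Source stylesheet, Source input, Result output, String... params) {
        try {
            Transformer transformer =
                    TransformerFactory.newInstance().newTransformer(stylesheet);
            for (int i = 0; i + 1 < params.length; i += 2) {
                transformer.setParameter(params[i], params[i + 1]);
            }
            transformer.transform(input, output);
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println(
                "usage: java TransformSketch stylesheet.xsl [name value]... < input.xml");
            return;
        }
        // Unix filter: the instance document comes from standard input and
        // the result of the transformation goes to standard output.
        run(new StreamSource(new File(args[0])),
            new StreamSource(System.in),
            new StreamResult(System.out),
            Arrays.copyOfRange(args, 1, args.length));
    }
}
```

Keeping the transformation in a method that takes Source and Result objects, rather than reading files directly in main, makes the same code usable on strings or in-memory trees.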

Because stylesheets can also be interpreted by web browsers, it is also possible to link an instance file directly to a stylesheet by means of the xml-stylesheet processing instruction. For example, if one adds the following to the start of listing 2.3

<?xml-stylesheet type="text/xsl" href="compactHTML.xsl"?>

then, upon loading the XML file WineCatalog.xml into a web browser, the catalog will be displayed after the transformation defined by listing 4.2. But one must be careful with such automatic transformations, because not all browsers implement all XSL transformations. Internet Explorer on Windows seems to be one of the most reliable. Most others (Firefox, Mozilla, Safari on MacOS X) seem to do something reasonable but, unfortunately, not always consistently across platforms.


4.7 Additional Information on XSL

The official information on XSL [16] is comprehensive and detailed (400 pages). The first 50 pages describe the basic principles; the remaining pages describe all the possible options for all parameters. One should also consult the XPath language description [19] because the XSL description takes it for granted.

http://www.w3.org/Style/XSL/ is the best starting point to get information on XSLT, with links to tools and tutorials.

http://www.mulberrytech.com/quickref/XSLTquickref.pdf is a nice XSLT and XPath Quick Reference (US legal size).

http://www.dpawson.co.uk/xsl/ is another very useful site with a lot of practical information about XSL.

http://xml.apache.org/xalan-j Xalan is an open-source stylesheet processor that works in conjunction with Xerces [10].

http://www.ibiblio.org/xml/books/bible2/chapters/ch18.html is a very thorough tutorial on Formatting Objects.

http://xml.apache.org/fop/ FOP is an open-source Formatting Objects renderer written in Java; it still has a few limitations with respect to the official standards.

http://www.renderx.com/ RenderX sells XEP, a commercial XSL-FO rendering engine written in Java that implements the official specification. An academic license agreement is available. They also publish a tutorial (http://www.renderx.com/tutorial.html) which is a very good starting point for learning XSL.


Chapter 5

Document Processing by Programming

We have described in the previous sections how to process XML files with declarative XML tools, but it is also possible to use standard programming languages to process XML files, thus allowing much finer control over the output. Processing XML files is often done in Java, but Perl and Python also have comprehensive packages to process XML files. Prolog, the logic programming language, and Haskell, a functional language, also have comprehensive XML libraries. But before explaining how to program our compacting example, it is important to understand two generally accepted models of processing: the Document Object Model (DOM) and the Simple API for XML (SAX).

The DOM programming model is similar to the one we have used implicitly in previous chapters: by reading an XML file, a parser builds an internal tree structure which the program then traverses and modifies. When using an unrestricted programming language instead of a rule language such as XSLT, it is easy to lose the tree structure in those manipulations, so it is important to have a rigorous programming discipline.

Building the entire tree structure in memory before starting to process it can be prohibitive in the case of big XML files, so SAX, an alternative programming model, has been defined: as elements are parsed, user-defined call-back procedures are invoked to do the processing. This requires much less memory because only a part of the document needs to be kept in memory at any time. However, the program is more limited in the kind of processing it can do efficiently or simply. This limitation is similar to the one observed between algorithms reading random-access files and those reading sequential-access files. As we will see, our compact pretty-print application shown in section 4.4 is relatively simple to implement with both the DOM and SAX programming models.
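The contrast between the two models can be seen on a very small task, counting the elements of a document. The sketch below is only an illustration under assumptions of ours: the class CountElements, its method names and the sample document are hypothetical and do not come from the listings of this chapter.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CountElements {
    // DOM: the whole tree is built first, then traversed recursively.
    static int countWithDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return count(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static int count(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) return 0;
        int n = 1;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling())
            n += count(c);
        return n;
    }

    // SAX: no tree is kept; a call-back updates a counter at each start-tag.
    static int countWithSax(String xml) {
        final int[] counter = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                counter[0]++;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return counter[0];
    }

    public static void main(String[] args) {
        String xml = "<cellar><wine><name>Merlot</name></wine><wine/></cellar>";
        System.out.println(countWithDom(xml) + " " + countWithSax(xml)); // prints 4 4
    }
}
```

The DOM version can revisit any node as often as it likes, while the SAX version sees each start-tag exactly once and must store whatever it will need later.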

5.1 Document Object Model (DOM)

The document object model is standardized by the W3C but its Java bindings can depend on the implementation. As Java 1.4 already integrates XML processing packages, we use them in our examples because no other special library is needed this way. Another popular XML package is the Xerces Java API. The Java program given in listing 5.1 is a command line application that accepts an XML file as parameter and outputs the same compact text representation (listing 4.7) that we obtained with the compact XSLT stylesheet (listing 4.8). The first lines of listing 5.1 import the necessary packages to process the XML files. The main method, which starts on line 26, creates a DocumentBuilderFactory object (line 34) from which we will obtain a DOM parser after having configured the necessary options. By default, parsing only checks for well-formedness, so in order for the parsing to validate against a DTD a flag must be set (line 36) and to validate against an XML Schema another one must be set (line 38). As explained in sections 3.1.1 and 3.4, the XML instance makes reference to its corresponding DTD or XML Schema.
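The difference between a well-formedness check and a validating parse can be tried in isolation. The following sketch is not part of the report's listings: ValidationDemo and its errors method are hypothetical names, and the sample documents carry an internal DTD subset so that no external file is needed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidationDemo {
    // Returns the number of validity errors reported while parsing 'xml'.
    static int errors(String xml, boolean validating) {
        final int[] count = {0};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating); // DTD validation on or off
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // validity errors are recoverable: count them and go on
                public void error(SAXParseException e) { count[0]++; }
                // well-formedness errors are fatal: stop the parse
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b (#PCDATA)>]>";
        String invalid = dtd + "<a><c/></a>"; // well-formed but not valid
        System.out.println(errors(invalid, false)); // prints 0
        System.out.println(errors(invalid, true) > 0); // prints true
    }
}
```

The same well-formed document is accepted silently by the default parser and reported as erroneous only once the validation flag is set.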

Creating a parser to build a new DOM document is done (line 41) by using a factory method which returns a DocumentBuilder object, to which an error handler (line 5 in listing 5.2) is assigned to get a notification of possible error messages. If the file is valid (i.e. no SAXParseException has been raised),1 a Document object can be obtained and the compact method (line 63) is called (line 44) on the root element.

The processing depends on the type of the node, which is obtained on line 65. If it is an element node (line 69), we first print the node name followed by an open bracket. On line 71, attributes are printed with their name preceded by @, followed by their value in square brackets. A new line is started between attributes when there is more than one. The processing of children first starts by removing (line 77) empty text nodes that appear in the original file. We also remove other nodes that are not text or element nodes. This could be done while printing but we want to show here how to filter nodes correctly, making sure that the correct links are preserved in the final tree.

The processing of the children elements, starting on line 93, is a simple traversal algorithm with a recursive call to compact, followed by the printing of a closing bracket.

1 Even in the DOM model, SAXExceptions and SAXParseExceptions can be raised by the document builder.

Listing 5.1: [DOMCompact.java]: Text compaction of the cellar book (listing 2.2) with Java processing using the DOM model

     import org.w3c.dom.Attr;
     import org.w3c.dom.Document;
     import org.w3c.dom.NamedNodeMap;
     import org.w3c.dom.Node;
  5
     import javax.xml.parsers.DocumentBuilder;
     import javax.xml.parsers.DocumentBuilderFactory;
     import javax.xml.parsers.FactoryConfigurationError;
     import javax.xml.parsers.ParserConfigurationException;
 10
     import org.xml.sax.InputSource;
     import org.xml.sax.ErrorHandler;
     import org.xml.sax.SAXException;
     import org.xml.sax.SAXParseException;
 15
     import java.io.IOException;

     import javax.swing.JTree;
     import javax.swing.JFrame;
 20  import javax.swing.JScrollPane;
     import javax.swing.tree.DefaultMutableTreeNode;
     import javax.swing.tree.DefaultTreeModel;

     public class DOMCompact {
 25
         public static void main(String argv[]) {
             // is there anything to do?
             if (argv.length != 1) {
                 System.out.println("Usage: java DOMCompact file");
 30              System.exit(1);
             }
             // parse file
             try {
                 DocumentBuilderFactory factory =
 35                  DocumentBuilderFactory.newInstance();
                 factory.setValidating(true);
                 factory.setNamespaceAware(true);
                 factory.setAttribute(
                     "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
 40                  "http://www.w3.org/2001/XMLSchema");
                 DocumentBuilder builder = factory.newDocumentBuilder();
                 builder.setErrorHandler(new CompactErrorHandler());
                 Document doc = builder.parse(argv[0]);
                 compact(doc.getDocumentElement(),"");
 45              System.out.println();
                 new TreeViewer(
                     new JTree(new DefaultTreeModel(
                         TreeViewer.jTreeBuild(doc.getDocumentElement()))
                 )).show();
 50          } catch (SAXParseException e) {
                 System.out.println(argv[0]+" is not well-formed");
                 System.out.println(e.getMessage()+" at line "+e.getLineNumber()+
                                    ", column "+e.getColumnNumber());
             } catch (SAXException e) {
 55              System.out.println(e.getMessage());
             } catch (ParserConfigurationException e) {
                 System.out.println("Parser configuration error");
             } catch (IOException e) {
                 System.out.println("IO Error on "+argv[0]);
 60          }
         }

         public static void compact(Node node,String indent) {
             if (node == null)return;
 65          short type = node.getNodeType();
             // System.out.println("compact:"+node+":"+indent+":"+type);
             switch (type) {
             case Node.ELEMENT_NODE: {
                 System.out.print(node.getNodeName()+'[');
 70              indent += blanks(node.getNodeName().length()+1);
                 NamedNodeMap attrs = node.getAttributes();
                 for (int i = 0; i < attrs.getLength(); i++) {
                     if(i>0) System.out.print('\n'+indent);
                     System.out.print('@'+attrs.item(i).getNodeName()+'['
 75                                   +attrs.item(i).getNodeValue()+']');
                 }
                 Node child = node.getFirstChild();
                 // remove empty text nodes (ie nothing else than spaces
                 // and return)
 80              // and nodes that are not text or element ones
                 while(child!=null){
                     // save the sibling of the node that will
                     // perhaps be removed and set to null
                     Node c = child.getNextSibling();
 85                  if((child.getNodeType()==Node.TEXT_NODE &&
                         child.getNodeValue().trim().length()==0) ||
                        ((child.getNodeType()!=Node.TEXT_NODE)&&
                         (child.getNodeType()!=Node.ELEMENT_NODE)))
                         node.removeChild(child);
 90                  child=c;
                 }
                 // process children
                 child = node.getFirstChild();
                 while(child != null) {
 95                  if(attrs.getLength()>0||child!=node.getFirstChild())
                         System.out.print('\n'+indent);
                     compact(child,indent);
                     child = child.getNextSibling();
                 }
100              System.out.print(']');
                 break;
             }
             // process a text node
             case Node.TEXT_NODE: {
105              System.out.print(node.getNodeValue().trim());
                 break;
             }
             }
         }
110
         // production of string of spaces with a lazy StringBuffer
         private static StringBuffer blanks = new StringBuffer();
         private static String blanks(int n){
             for(int i=blanks.length();i<n;i++)
115              blanks.append(' ');
             return blanks.substring(0,n);
         }
     }

Listing 5.2: [CompactErrorHandler.java]: Error handler of the DOM parsing of listing 5.1

     import org.xml.sax.ErrorHandler;
     import org.xml.sax.SAXException;
  3  import org.xml.sax.SAXParseException;

     public class CompactErrorHandler implements ErrorHandler {
         private void message(String mess,SAXParseException e)
             throws SAXException {
  8          System.out.println("\n"+mess+
                                "\n Line:"+e.getLineNumber()+
                                "\n URI:" +e.getSystemId()+
                                "\n Message:"+e.getMessage());
         }
 13
         public void fatalError(SAXParseException e) throws SAXException {
             message("Fatal error",e);
         }

 18      public void error(SAXParseException e) throws SAXException {
             message("Error",e);
         }

         public void warning(SAXParseException e) throws SAXException {
 23          message("Warning",e);
         }
     }


5.2 Simple API for XML (SAX)

The SAX model of processing is stream-oriented: as the system parses the file, it calls appropriate event-handler methods. This is quite memory efficient, but means that global variables have to be maintained across calls to allow communication between handler methods.

For our example, this is achieved quite simply once we notice that a new line should be output when opening a new bracket only if the last thing printed was a closing bracket. Many closing brackets can be put on the same line though. So we keep a shared boolean variable for this state and a shared integer giving the current number of blank spaces for indentation.

Creating a SAX parser is done via a factory in much the same way as we have shown in the previous section for a DOM parser. Listing 5.3 describes the main procedure which creates a factory (line 29) and sets flags for validation. The parser is obtained on line 32 and a property is set to indicate that we want validation to be done with an XML Schema. Parsing is then started on line 36 by passing a reference to a handler object which receives call-backs during the parsing process.

Listing 5.3: [SAXCompact.java]: Text compaction of the cellar book (listing 2.2) with Java processing using the SAX model.

     import org.xml.sax.SAXException;
     import org.xml.sax.SAXParseException;
     import org.xml.sax.helpers.XMLReaderFactory;

  5  import org.xml.sax.helpers.DefaultHandler;
     import javax.xml.parsers.SAXParserFactory;
     import javax.xml.parsers.ParserConfigurationException;
     import javax.xml.parsers.SAXParser;

 10  import java.io.IOException;

     import javax.swing.JTree;
     import javax.swing.JFrame;
     import javax.swing.JScrollPane;
 15

     public class SAXCompact {
         private static JTree jtree = new JTree();

 20      public static void main(String argv[]) {
             if (argv.length != 1) {
                 System.out.println("Usage: java SAXCompact file");
                 return;
             }
 25          // XMLParser creation
             SAXParserFactory factory;
             SAXParser saxParser;
             try {
                 factory = SAXParserFactory.newInstance();
 30              factory.setNamespaceAware(true);
                 factory.setValidating(true);
                 saxParser = factory.newSAXParser();
                 saxParser.setProperty("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
                                       "http://www.w3.org/2001/XMLSchema");
 35              // parse file and print compact form
                 saxParser.parse(argv[0], new CompactHandler());
                 System.out.println();

                 // parse file and build a tree form
 40              saxParser.parse(argv[0],new JTreeHandler(jtree));
                 // display the built tree
                 new TreeViewer(jtree).show();
             } catch (ParserConfigurationException e) {
                 System.out.println("Bad parser configuration");
 45          } catch (SAXParseException e) {
                 System.out.println(argv[0]+" is not well-formed");
                 System.out.println(e.getMessage()+
                                    " at line "+e.getLineNumber()+
                                    ", column "+e.getColumnNumber());
 50          } catch (SAXException e) {
                 System.out.println(e.getMessage());
             } catch (IOException e) {
                 System.out.println("IO Error on "+argv[0]);
             }
 55      } // main(String[])

     } // class SAXCompact

Listing 5.4 shows the structure of a SAX event handler, a subclass of the DefaultHandler class, which defines empty handlers for all types of events, including errors. In our case, only startElement (line 20), endElement (line 38) and characters (line 45), which is called when encountering text nodes, are redefined, together with the error handlers.

For a start-tag (line 20), we first check whether the current line should be terminated. Then we print the name of the current element and an opening bracket. Last, we update the shared indentation value and output the attributes if there are any.

For an end-tag (line 38), we only output a closing bracket and decrease the indentation value by the length of the name plus one (for the bracket). Because the parser has checked that start-tags and end-tags match, we can be sure that the localName variable is the same as the one used to increase the indentation in the corresponding start-tag method.

For text nodes (line 45), we only output the characters after having removed leading and trailing whitespace. Note that the characters method is not called with a String but with an array of characters as well as a start position and the number of characters to use from the character array. This can sometimes avoid the allocation of a new String for each element.
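One point worth keeping in mind is that a SAX parser is allowed to deliver the text of a single element through several successive calls to characters, for instance around an entity reference. A handler that trims each chunk separately may then lose inner spaces. A more defensive variant, sketched here under hypothetical names (TextCollector, collect) that are not part of the report's listings, accumulates the chunks and only processes them at the next tag boundary.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> texts = new ArrayList<>();

    // Record the accumulated text, if any, once a tag boundary is reached.
    private void flush() {
        String s = buffer.toString().trim();
        if (!s.isEmpty()) texts.add(s);
        buffer.setLength(0);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        flush();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // may be called several times for one text node: only accumulate
        buffer.append(ch, start, length);
    }

    // Parses 'xml' and returns its non-blank text nodes in document order.
    static List<String> collect(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.texts;
    }

    public static void main(String[] args) {
        // prints [x & y, z]
        System.out.println(collect("<a><b> x &amp; y </b><c>z</c></a>"));
    }
}
```

The price is one buffer copy per text node, which gives up part of the allocation saving mentioned above in exchange for text that is always seen whole.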

When an error is encountered during the parsing process, one of the methods defined on the lines starting at 67 is called. When this happens, we call the message method (line 58), which gives some useful information about the error.

Listing 5.4: [CompactHandler.java]: SAX Handler for text compacting an XML file such as the one described in listing 2.2.

     import org.xml.sax.SAXException;
     import org.xml.sax.helpers.DefaultHandler;
     import org.xml.sax.Attributes;
     import org.xml.sax.SAXParseException;
  5
     public class CompactHandler extends DefaultHandler {

         // production of string of spaces with a lazy StringBuffer
         private static StringBuffer blanks = new StringBuffer();
 10      private String blanks(int n){
             for(int i=blanks.length();i<n;i++)
                 blanks.append(' ');
             return blanks.substring(0,n);
         }
 15
         private boolean closed = false; // closed mode?

         protected int indent; // current indentation value

 20      public void startElement(String uri, String localName,
                                  String raw, Attributes attrs)
             throws SAXException {
             if(closed){
                 System.out.print('\n'+blanks(indent));
 25              closed = false;
             }
             indent=indent+1+localName.length();
             System.out.print(localName+'[');
             // deal with attributes
 30          for (int i = 0; i < attrs.getLength(); i++) {
                 if(i>0) System.out.print('\n'+blanks(indent));
                 System.out.print('@'+attrs.getLocalName(i)+'['
                                  +attrs.getValue(i)+']');
                 closed=true;
 35          }
         } // startElement(String,String,String,Attributes)

         public void endElement(String uri, String localName, String raw)
             throws SAXException {
 40          System.out.print(']');
             closed = true;
             indent=indent-1-localName.length();
         } // endElement(String,String,String)

 45      public void characters(char[] ch, int start, int length)
             throws SAXException {
             if(closed){
                 System.out.print('\n'+blanks(indent));
                 closed = false;
 50          }
             String s = new String(ch,start,length).trim();
             System.out.print(s);
             if(s.length()>0)
                 closed=true;
 55      } // characters(char[],int,int)

         // error handling...
         private void message(String mess,SAXParseException e)
             throws SAXException {
 60          System.out.println("\n"+mess+
                                "\n Line:"+e.getLineNumber()+
                                "\n URI:" +e.getSystemId()+
                                "\n Message:"+e.getMessage());
             //throw new SAXException("Raised:"+mess);
 65      }

         public void fatalError(SAXParseException e) throws SAXException {
             message("Fatal error",e);
         }
 70
         public void error(SAXParseException e) throws SAXException {
             message("Error",e);
         }

 75      public void warning(SAXParseException e) throws SAXException {
             message("Warning",e);
         }
     }


5.3 Showing an Interactive Tree View

Figure 5.1: JTree display (on Mac OS X) of listing 2.2

The Java API already provides a graphical view of trees with the JTree class. It displays the nodes in a window and they can be expanded and collapsed by clicking on their handles. This kind of display can also be obtained by XSL-FO (see section 6.9 of [8]) but few systems currently implement the full specification which would allow this to happen. Internet Explorer (top right of figure 1.2) and Firefox use a similar scheme when displaying XML files. This form of interaction is used on most operating systems to display contents of directories. For example, the XML file of listing 2.2 can be displayed in a window like the one shown in figure 5.1. Nodes are shown there with directory icons and can be expanded or collapsed by clicking on the triangle to the left of the icon. The triangle of a node points downward when its subtree is expanded. A different display can be obtained by changing the look and feel in the Java API but the principle stays the same.

5.3.1 Building a JTree with DOM

Creating such a view from a DOM structure is only a matter of traversing the structure to create nodes that will be part of the JTree display. Its nodes are instances of the predefined DefaultMutableTreeNode class. To obtain the display in figure 5.1, listing 5.1 (line 46) creates a new TreeViewer instance that displays a JTree using a tree model built from the DOM tree constructed using the jTreeBuild method. Listing 5.5 defines the class TreeViewer (line 13) that creates a window to display the JTree instance. It is assigned an initial constant position and size and made scrollable; the application should terminate when the window is closed.

jTreeBuild (line 21), a static method (so it is called as TreeViewer.jTreeBuild(.)), follows the same algorithm as compact (line 63 in listing 5.1) by recursively processing element or text DOM nodes. In the case of an element node (line 25), it creates a new DefaultMutableTreeNode for the element and adds its attributes as its first children; it then processes each child (line 39) by recursively building its subtree (line 37), which is added as a child of the current node. A non-empty text node is simply a DefaultMutableTreeNode having the text as label.

Listing 5.5: [TreeViewer.java]: JTree building with DOM Processing of an XML file

     import org.w3c.dom.Attr;
     import org.w3c.dom.Document;
     import org.w3c.dom.NamedNodeMap;
     import org.w3c.dom.Node;
  5  import org.w3c.dom.NodeList;

     import javax.swing.JTree;
     import javax.swing.JFrame;
     import javax.swing.JScrollPane;
 10  import javax.swing.tree.DefaultMutableTreeNode;


     public class TreeViewer extends JFrame {
         TreeViewer(JTree jtree){
 15          super("Tree viewer");
             setBounds(100,100,600,450);
             getContentPane().add(new JScrollPane(jtree));
             setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
         }
 20
         public static DefaultMutableTreeNode jTreeBuild(Node node){
             if (node == null)return null;

             switch (node.getNodeType()) {
 25          case Node.ELEMENT_NODE:
                 DefaultMutableTreeNode treeNode =
                     new DefaultMutableTreeNode(node.getNodeName());
                 NamedNodeMap attrs = node.getAttributes();
                 for (int i = 0; i < attrs.getLength(); i++)
 30                  treeNode.add(
                         new DefaultMutableTreeNode(
                             '@'+attrs.item(i).getNodeName()+'['
                             +attrs.item(i).getNodeValue()+']'));
                 // process children
 35              Node child = node.getFirstChild();
                 while(child != null) {
                     DefaultMutableTreeNode childTree = jTreeBuild(child);
                     if(childTree!=null)
                         treeNode.add(childTree);
 40                  child = child.getNextSibling();
                 }
                 return treeNode;

             case Node.TEXT_NODE:
 45              String text = node.getNodeValue().trim();
                 return text.length()==0 ? null
                     : new DefaultMutableTreeNode(text);
             default : return null; // ignore other types of nodes
             }
 50      }
     }

5.3.2 Building a JTree with SAX

Building a JTree using SAX processing is quite simple: we define a special-purpose handler whose methods will be called when XML elements are encountered during the parsing process. This is similar to the process described in section 5.2, where we only needed to keep the current indentation value.

Listing 5.6: [JTreeHandler.java]: JTree building with SAX Processing of an XML file

     import org.xml.sax.SAXException;
     import org.xml.sax.Attributes;
     import org.xml.sax.helpers.DefaultHandler;

  5  import javax.swing.JTree;
     import javax.swing.tree.DefaultMutableTreeNode;
     import javax.swing.tree.DefaultTreeModel;

     public class JTreeHandler extends DefaultHandler {
 10
         private DefaultTreeModel treeModel;
         private DefaultMutableTreeNode node = null;
         private JTree jtree;

 15      JTreeHandler(JTree jtree){
             this.jtree=jtree;
         }

         public void startElement(String uri, String localName, String raw,
 20                               Attributes attrs) throws SAXException {
             super.startElement(uri,localName,raw,attrs);
             if(node==null) // initialise node and model
                 jtree.setModel(
                     new DefaultTreeModel(
 25                      node=new DefaultMutableTreeNode(localName)));
             else // add child and set node to the child
                 node.add(node=new DefaultMutableTreeNode(localName));
             // add attributes as children
             for (int i = 0; i < attrs.getLength(); i++)
 30              node.add(new DefaultMutableTreeNode(
                     '@'+attrs.getLocalName(i)+"["+attrs.getValue(i)+']'));
         }

         public void endElement(String uri, String localName, String raw)
 35          throws SAXException {
             super.endElement(uri,localName,raw);
             node = (DefaultMutableTreeNode) node.getParent();
         }

 40      public void characters(char[] ch, int start, int length)
             throws SAXException {
             super.characters(ch,start,length);
             String text = new String(ch,start,length).trim();
             if(text.length()>0) node.add(new DefaultMutableTreeNode(text));
 45      }
     }

In listing 5.6, we build a tree using DefaultMutableTreeNode instances and keep track of the current node being built (line 12). When an XML element start-tag is encountered, startElement (line 19) is called: a new tree node is created and made the current node (line 26). Its attributes are then added as children (line 28). A text node is simply added as a child of the current node (line 41). When an end-tag is encountered (line 34), we set the current node to the parent of the current node. The constructor (line 15) only keeps a reference to the JTree that will be displayed. Its model is initialised on the first call of startElement (line 22).

To create a JTree, we pass (line 40 of listing 5.3) a new JTreeHandler as argument to the SAX parser, which builds the tree during the parse. The tree is then displayed by creating an instance (line 42) of the TreeViewer class (line 13 of listing 5.5).

SAX allows the pipelining of handlers that work in succession during a single parse of the file, a process called filtering, but here we simplify by parsing the file twice: once for the textual output (line 36) and once more for the tree display (line 40).
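Such a pipeline can be sketched with the XMLFilterImpl helper class of the SAX API. This is only an illustration, not what the listings of this chapter do: FilterPipeline, DropBlankText and Bracketer are hypothetical names, and the one-line bracketed output is a simplification of the compact form.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class FilterPipeline {
    // First stage: forwards every event but drops whitespace-only text.
    static class DropBlankText extends XMLFilterImpl {
        DropBlankText(XMLReader parent) { super(parent); }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (new String(ch, start, length).trim().length() > 0)
                super.characters(ch, start, length);
        }
    }

    // Last stage: writes a one-line bracketed form of what it receives.
    static class Bracketer extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attrs) {
            out.append(qName).append('[');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append(']');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    // A single parse: parser -> DropBlankText -> Bracketer.
    static String run(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            XMLFilterImpl filter = new DropBlankText(reader);
            Bracketer last = new Bracketer();
            filter.setContentHandler(last);
            filter.parse(new InputSource(new StringReader(xml)));
            return last.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints a[b[x]c[]]
        System.out.println(run("<a>\n  <b>x</b>\n  <c/>\n</a>"));
    }
}
```

Because a filter is itself an XMLReader, further stages can be stacked by passing one filter as the parent of the next.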

5.4 Additional Information on Programming Models

In this chapter, we have used Java to access the content of an XML file. But there are now XML parsers for almost any other computer language, such as Perl [30] [6], Python, Haskell [35], Prolog [37], even COBOL [7].

In our case, we were dealing similarly with all document nodes without ever looking at the names of the elements. As it often happens that the information needed in a document is deep down in the tree, one must then access specific nodes using a sequence of getFirstChild() and getNextSibling() calls until the right element is reached, and do this for every level. So programmers often define special-purpose functions and methods to traverse a specific document tree. But there are other ways: for example, XMLSpy can automatically generate access methods, in either C++ or Java, from the XML Schema that validates a document. This is very useful and much less error-prone than relying on a hand traversal of the tree. When the schema is changed, the methods can be regenerated and the processing of a valid XML file will always follow the structure of the XML Schema. There are also tools to convert database schemas into XML Schema. So we see that XML processing can be easily integrated with other systems and languages.
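To make the cost of hand traversal concrete, here is a small sketch that reaches the same deep node in two ways: with a special-purpose helper called once per level, and with an XPath expression evaluated through the javax.xml.xpath package (which appeared in Java 5, after the Java 1.4 API used in this report). The class DeepAccess, the helper childNamed and the element names of the sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DeepAccess {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Hand traversal: first child element of 'parent' named 'name', or null.
    static Element childNamed(Node parent, String name) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling())
            if (c.getNodeType() == Node.ELEMENT_NODE
                    && c.getNodeName().equals(name))
                return (Element) c;
        return null;
    }

    // One helper call per level; assumes that every level is present.
    static String byHand(Document doc) {
        Element wine = childNamed(doc.getDocumentElement(), "wine");
        Element origin = childNamed(wine, "origin");
        Element country = childNamed(origin, "country");
        return country.getTextContent();
    }

    // The same access expressed as a single XPath expression.
    static String byXPath(Document doc) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate("/cellar/wine/origin/country", doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<cellar><wine><origin><country>France"
                             + "</country></origin></wine></cellar>");
        System.out.println(byHand(doc) + " " + byXPath(doc)); // prints France France
    }
}
```

The hand-written version breaks with a NullPointerException as soon as one level is missing, which illustrates why generated or declarative access is less error-prone.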

Because the Java API version 1.4 (http://java.sun.com/j2se/1.4/docs/api/) includes classes for both the DOM and SAX approaches, XML processing is now integrated into some Java books such as Big Java [23]. XML in a Nutshell [28] provides a short but thorough presentation of these programming models. Sun publishes an excellent online tutorial [21].

Xerces [10] (http://xml.apache.org/#xerces), named after the Xerces Blue butterfly, is the de facto XML Schema processor; validating parsers are available for Java and C++. There are also wrappers for Perl and COM.

Java & XML [27] is a very good information source about creating and manipulating XML documents in Java.

Chapter 6

Document Creation by Programming

In the previous chapter, we have seen how to parse an existing XML document by programming. It is also important to see how an XML document can be created by programming. We will therefore show the inverse of the programs shown previously by writing a program that parses the compact form we produced to expand it into the corresponding XML document.1 Although one can write a program to create a file with XML tags and their content using print methods, we will show (section 6.1) that it is both simpler and more systematic to create an XML document in memory using the DOM model, to modify it and then to print it using a serializer. In section 6.2, we will describe how to create an XML document by parsing a text file and sending SAX events to a transformer.

6.1 Creating a DOM Document

The DOM API provides an exhaustive collection of methods to create and modify a document. The most frequently used are:

new DocumentImpl() creates an empty document to which elements can be added

doc.createElement(String s) creates a new element named s in document doc

parent.appendChild(Element e) adds element e as the last child of element parent

e.setAttribute(String name, String value) adds the attribute name with the corresponding value to the element e; if the attribute already exists, its value is replaced

doc.createTextNode(String s) creates a new text node with content s in document doc
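The methods above can be combined as in the following sketch. It is not taken from the report: BuildDocument is a hypothetical class, the element and attribute names are invented, and the empty document is obtained with the JAXP call builder.newDocument() (as in listing 6.1) rather than with new DocumentImpl().

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class BuildDocument {
    // Builds <wine code="..."><name>Merlot</name></wine> in memory.
    static Document build() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        Element wine = doc.createElement("wine");
        doc.appendChild(wine);               // wine becomes the root element
        wine.setAttribute("code", "W1");
        wine.setAttribute("code", "W42");    // same name: the value is replaced
        Element name = doc.createElement("name");
        name.appendChild(doc.createTextNode("Merlot"));
        wine.appendChild(name);              // added as last child of wine
        return doc;
    }

    // Serializes the document with an identity transformation.
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(build()));
    }
}
```

Since the tags are produced by the serializer, well-formedness of the output does not depend on the care taken when printing, which is the main argument for building the document in memory.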

In order to simplify parsing, we create a customized StreamTokenizer which returns a single token for all characters between separators used in the compact form (i.e. open and close brackets, at-sign and newline). The separators are also returned as a single token. The implementation of this tokenizer is given in listing 6.2.

1 In principle, after compaction and expansion, we should recover the original XML document with which we started but since we have not faithfully transformed whitespace, the files are not strictly identical.

In listing 6.1, the main method (line 22) first creates a Document instance (line 27) which will hold the XML tree. It then creates a specialized tokenizer (line 30) from the file name given as argument to the program. It goes on to find the name of the root element (line 35) and calls the expand method (line 39) which returns the whole content of the element, which is added as a child of the document. To output the DOM structure, we create an identity transformation (line 41) and use the document as source and System.out as output (line 44). We also set an output property so that the output is nicely indented (line 43).

Expansion (line 52) is a recursive process that

• creates (line 54) an element of the name received as a parameter

• processes each attribute (line 56) by getting the name and value of the attribute and adding it to the current element (line 62)

• processes the content of the element (line 66) and creates either a new child recursively (line 71) if the next token is followed by an open bracket or a text node (line 73) otherwise.

Listing 6.1: [DOMExpand.java]: Compact form parsing to create a DOM XML document. A sample input for this program is listing 4.7 to give back listing 2.2.

     import org.w3c.dom.Element;
     import org.w3c.dom.Document;
     import javax.xml.parsers.DocumentBuilder;
     import javax.xml.parsers.DocumentBuilderFactory;
  5
     import javax.xml.transform.OutputKeys;
     import javax.xml.transform.Transformer;
     import javax.xml.transform.TransformerFactory;
     import javax.xml.transform.TransformerException;
 10  import javax.xml.transform.TransformerConfigurationException;
     import javax.xml.transform.dom.DOMSource;
     import javax.xml.transform.stream.StreamResult;


 15  import java.io.IOException;
     import java.io.BufferedReader;
     import java.io.FileInputStream;
     import java.io.InputStreamReader;
     import java.io.StreamTokenizer;
 20
     public class DOMExpand {
         public static void main( String[] argv ) {
             try {
                 DocumentBuilderFactory factory =
 25                  DocumentBuilderFactory.newInstance();
                 DocumentBuilder builder = factory.newDocumentBuilder();
                 Document doc = builder.newDocument();
                 String rootName = "dummyElement";
                 CompactTokenizer st
 30                  = new CompactTokenizer(
                         new BufferedReader(
                             new InputStreamReader(
                                 new FileInputStream(argv[0]))));
                 // ignore everything preceding the word before the first "["
 35              while(st.getTokenType()!='['){
                     rootName=st.getString();
                     st.nextToken();
                 }
                 doc.appendChild(expand(st,doc,rootName));
 40              // output with an "identity" Transformer
                 TransformerFactory tFactory = TransformerFactory.newInstance();
                 Transformer transformer = tFactory.newTransformer();
                 transformer.setOutputProperty(OutputKeys.INDENT,"yes");
                 DOMSource source = new DOMSource(doc);
 45              StreamResult result = new StreamResult(System.out);
                 transformer.transform(source,result);
             } catch ( Exception ex ) {
                 ex.printStackTrace();
             }
 50      }

         static Element expand(CompactTokenizer st,Document doc,
                               String elementName) throws IOException {
             Element elem = doc.createElement(elementName.trim());
 55          st.nextToken(); // skip [
             while(st.getTokenType()=='@'){// process attributes
                 st.nextToken();
                 String attName = st.getString();
                 st.nextToken(); // skip [
 60              st.nextToken();
                 String attValue=st.getString();
                 elem.setAttribute(attName,attValue);
                 st.nextToken(); // skip ]
                 st.nextToken();
 65          }
             while(st.getTokenType()!=']'){ // process content of element
                 if(st.getTokenType()==StreamTokenizer.TT_WORD){
                     String s = st.getString().trim();
                     st.nextToken();
 70                  if(st.getTokenType()=='[')
                         elem.appendChild(expand(st,doc,s));
                     else
                         elem.appendChild(doc.createTextNode(s));
                 }
 75          }
             st.nextToken(); // skip ]
             return elem;
         }
     }

The CompactTokenizer class is a simple customization of the standard Java StreamTokenizer. Its constructor (line 8) receives a Reader and creates an internal StreamTokenizer that is customized. Because we do not want to have numbers and Java comments dealt with, we reset the syntax (line 10) and indicate that all characters can be part of a word except for the special separators used in the compact form. nextToken (line 19) calls the Java tokenizer and skips newlines and empty text nodes. getString (line 27) and getTokenType (line 31) are convenient calls to access the information returned by the Java tokenizer.

Listing 6.2: [CompactTokenizer.java]: Specialized stream tokenizer that ignores blank tokens.

     import java.io.Reader;
     import java.io.IOException;
     import java.io.StreamTokenizer;

  5  public class CompactTokenizer {
         private StreamTokenizer st;

         CompactTokenizer(Reader r){
             st = new StreamTokenizer(r);
 10          st.resetSyntax(); // remove parsing of numbers...
             st.wordChars('\u0000','\u00FF'); // everything is part of a word
                                              // except the following...
             st.ordinaryChar('\n');
             st.ordinaryChar('[');
 15          st.ordinaryChar(']');
             st.ordinaryChar('@');
         }

         public void nextToken() throws IOException {
 20          st.nextToken();
             while(st.ttype=='\n'||
                   (st.ttype==StreamTokenizer.TT_WORD &&
                    st.sval.trim().length()==0))
                 st.nextToken();
 25      }

         public String getString(){
             return st.sval;
         }
 30
         public int getTokenType(){
             return st.ttype;
         }
     }

6.2 Creating a Document with SAX Events

Another way of creating an XML document is to let a Transformer do it for us. Since the transformer must receive an XML document, we could think that this is pointless. However, the transformer can also build a document from SAX events such as those we saw in section 5.2. So what we have to do is to create a parser for our compact textual form and have it generate SAX events as it parses its contents. This illustrates a clever and efficient way to convert non-XML files into XML.

Listing 6.3: [SAXExpand.java]: XML document creation using SAX events

     import org.xml.sax.InputSource;
     import javax.xml.transform.sax.SAXSource;
     import javax.xml.transform.stream.StreamResult;
     import javax.xml.transform.Transformer;
  5  import javax.xml.transform.TransformerFactory;
     import javax.xml.transform.TransformerException;
     import javax.xml.transform.TransformerConfigurationException;

     import java.io.BufferedReader;
 10  import java.io.FileReader;
     import java.io.IOException;

     public class SAXExpand {
         public static void main( String[] argv ) {
 15          try {
                 InputSource inputSource =
                     new InputSource(
                         new BufferedReader(new FileReader(argv[0])));
                 CompactReader saxReader = new CompactReader();
 20              SAXSource source = new SAXSource(saxReader,inputSource);
                 StreamResult result = new StreamResult(System.out);

                 TransformerFactory tFactory =
                     TransformerFactory.newInstance();
 25              Transformer transformer = tFactory.newTransformer();
                 transformer.transform(source,result);
             } catch (TransformerException ex){
                 System.out.println("TransformerException"+ex);
                 ex.printStackTrace();
 30          } catch (IOException ex) {
                 System.out.println("IOException"+ex);
                 ex.printStackTrace();
             }
         }
 35  }

The main class for the SAX transformation (listing 6.3) is very simple: it creates a CompactReader (line 19) and pairs it, in a SAXSource (line 20), with the file to read wrapped as an InputSource; then it creates a transformer (line 25) to process this source into an output stream (here System.out). All the magic of setting the input file and creating the document elements is done via the transformation process.

Listing 6.4: [CompactReader.java]: Compact form parsing to generate SAX events

import org.xml.sax.XMLReader;
import org.xml.sax.ContentHandler;
import org.xml.sax.DTDHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

import java.io.IOException;
import java.io.StreamTokenizer;
import java.util.Arrays;

public class CompactReader implements XMLReader {
    private String nsu = ""; // no namespace URI

    private ContentHandler handler;

    private static char[] blanks = "\n         ".toCharArray();

    private void ignorableSpacing(int nb) throws SAXException {
        if(nb>blanks.length){// extend the length of space array
            blanks = new char[nb];
            blanks[0]='\n';
            Arrays.fill(blanks,1,blanks.length,' ');
        }
        handler.ignorableWhitespace(blanks,0,nb);
    }

    // Return the current content handler.
    public ContentHandler getContentHandler(){ return handler;}
    //Allow an application to register a content event handler.
    public void setContentHandler(ContentHandler handler){
        this.handler=handler;
    }

    //Parse an XML document.
    private CompactTokenizer st;

    public void parse(InputSource input){
        try {
            String rootName = "dummyRoot";
            st = new CompactTokenizer(input.getCharacterStream());
            // ignore everything before the word before the first "["
            while(st.getTokenType()!='['){
                rootName=st.getString();
                st.nextToken();
            }
            handler.startDocument();
            expand(rootName,1);
            ignorableSpacing(1);
            handler.endDocument();
        } catch (SAXException e){
            System.out.println(e.getMessage());
        } catch (IOException e) {
            System.out.println("IO Error:"+e);
        }
    }

    void expand(String elementName,int indent)
        throws IOException,SAXException {
        AttributesImpl attrs = new AttributesImpl();
        st.nextToken();
        while(st.getTokenType()=='@'){
            st.nextToken();
            String attName = st.getString();
            st.nextToken();
            st.nextToken();
            String attValue=st.getString();
            attrs.addAttribute(nsu,attName,attName,"CDATA",attValue);
            st.nextToken();
            st.nextToken();
        }
        ignorableSpacing(indent);
        handler.startElement(nsu,elementName,elementName,attrs);
        while(st.getTokenType()!=']'){ // process content of element
            if(st.getTokenType()==StreamTokenizer.TT_WORD){
                String s = st.getString().trim();
                st.nextToken();
                if(st.getTokenType()=='[')
                    expand(s,indent+3);
                else {
                    ignorableSpacing(indent+3);
                    handler.characters(s.toCharArray(),0,s.length());
                }
            }
        }
        st.nextToken(); // remove "]"
        ignorableSpacing(indent);
        handler.endElement(nsu,elementName,elementName);
    }

    // dummy definitions...
    // Return the current DTD handler.
    public DTDHandler getDTDHandler(){ return null;}

    // Return the current entity resolver.
    public EntityResolver getEntityResolver(){ return null;}

    // Return the current error handler.
    public ErrorHandler getErrorHandler(){ return null;}

    //Look up the value of a feature.
    public boolean getFeature(String name){ return false;}

    //Look up the value of a property.
    public Object getProperty(String name){ return null;}

    //Parse an XML document from a system identifier (URI).
    public void parse(String systemId){}

    //Allow an application to register a DTD event handler.
    public void setDTDHandler(DTDHandler handler){}

    //Allow an application to register an entity resolver.
    public void setEntityResolver(EntityResolver resolver){}

    //Allow an application to register an error event handler.
    public void setErrorHandler(ErrorHandler handler){}

    //Set the state of a feature.
    public void setFeature(String name, boolean value){}

    //Set the value of a property.
    public void setProperty(String name, Object value){}
}

The reading of the file and generation of SAX events are done by CompactReader (listing 6.4), a specialized XMLReader. As CompactReader implements the XMLReader interface, it must define many methods, but only a few of them are really important in this special case. This explains the many dummy definitions of methods at the end of listing 6.4.
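As an aside, a different technique from the one used in this report can avoid most of the dummy definitions: extending the standard helper class org.xml.sax.helpers.XMLFilterImpl, which already supplies default versions of the XMLReader methods, and overriding only parse. The sketch below is a minimal illustration under that assumption; the class GreetingReader, the method toXml and the hard-coded greeting element are hypothetical and merely stand in for a real parser of non-XML input.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical reader: ignores its input and always emits one fixed element.
public class GreetingReader extends XMLFilterImpl {

    // Without a parent reader, XMLFilterImpl rejects every feature and
    // property, so they are accepted silently here, as in listing 6.4.
    @Override public void setFeature(String name, boolean value) { }
    @Override public void setProperty(String name, Object value) { }

    // The only method that matters: send SAX events to the registered handler.
    @Override
    public void parse(InputSource input) throws SAXException {
        ContentHandler h = getContentHandler();
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "lang", "lang", "CDATA", "en");
        h.startDocument();
        h.startElement("", "greeting", "greeting", attrs);
        char[] text = "hello".toCharArray();
        h.characters(text, 0, text.length);
        h.endElement("", "greeting", "greeting");
        h.endDocument();
    }

    // Runs an identity transformation from the SAX events to a string.
    static String toXml() {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            InputSource unused = new InputSource(new StringReader(""));
            t.transform(new SAXSource(new GreetingReader(), unused),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml());
    }
}
```

The trade-off is that the inherited defaults are less visible than the explicit dummy methods of listing 6.4, which spell out exactly what the reader does not support.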

The SAX parsing events will be sent to an event handler (line 18) for which we define accessor functions (get... and set...) that are available to the user of the reader. The main method is parse (line 41), which uses the same algorithm as the DOM approach. parse uses a CompactTokenizer instance (line 39) of the same class used with the DOM approach in listing 6.2. This method generates SAX events as it goes through the text file. It first finds the name of the root element (line 47) and then calls expand (line 51) with an indentation level (used only to obtain a nicer output).

Most of the processing is done within the expand method (line 61), which gets the name of the current element as a parameter. It first creates an AttributesImpl data structure which is populated with the names and values of all attributes (line 65). Once all the attributes have been gathered, we can send a startElement event to the handler (line 76). We then process the content of the element either by a recursive call to expand (line 82) or by creating a text node by sending a characters event to the handler (line 85). Once all child elements have been processed, an endElement event is sent to the handler (line 91).
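The sequence of events that expand has to send is the one a standard SAX parser would report for the equivalent XML text. The following minimal sketch records that sequence with a small handler so that it can be inspected. The class EventTrace, its trace method and the sample document are hypothetical, and the sample is only assumed to be the expansion of a compact text such as note[@lang[en] to[Ann]].

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records the SAX events it receives.
public class EventTrace extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        StringBuilder sb = new StringBuilder("startElement " + qName);
        for (int i = 0; i < atts.getLength(); i++) {
            sb.append(" @").append(atts.getQName(i))
              .append("=").append(atts.getValue(i));
        }
        events.add(sb.toString());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {           // whitespace-only text is not recorded
            events.add("characters " + text);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement " + qName);
    }

    // Parses xml with the default JAXP SAX parser and returns the event trace.
    static List<String> trace(String xml) {
        EventTrace handler = new EventTrace();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        for (String event : trace("<note lang=\"en\"><to>Ann</to></note>")) {
            System.out.println(event);
        }
    }
}
```

The ignorableWhitespace events are not recorded because DefaultHandler ignores them by default, so the trace shows only the structural events.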

The previous processing produces a correct XML file, but one that is very hard to read because everything appears on the same line. A way to produce a nicely formatted output is to send ignorable spacing to the handler. As the formatting rule we use is to always end the current line and add some indentation, we have defined an auxiliary method ignorableSpacing (line 22) that manages an array of characters of the appropriate length; it is initialized with nine spaces but it expands as necessary. This method is called at the appropriate moments in the expansion process, i.e. before creating a new start (line 75) or end element (line 90) or before a new text node (line 84).
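The growth of the spacing array can be observed in isolation with the following minimal sketch. The class SpacingBuffer and the method spacing are hypothetical; the method returns the characters as a string instead of sending an ignorableWhitespace event, and the initial content of one newline followed by nine spaces is taken from the description above.

```java
import java.util.Arrays;

// Hypothetical stand-alone version of the spacing logic of listing 6.4.
public class SpacingBuffer {
    // One newline followed by nine spaces, shared by all calls.
    private static char[] blanks = ("\n" + "         ").toCharArray();

    // Returns a newline followed by nb-1 spaces, growing the array if needed.
    static String spacing(int nb) {
        if (nb > blanks.length) {          // extend the space array
            blanks = new char[nb];
            blanks[0] = '\n';
            Arrays.fill(blanks, 1, blanks.length, ' ');
        }
        return new String(blanks, 0, nb);
    }

    public static void main(String[] args) {
        // Brackets make the line break and the indentation visible.
        System.out.println("[" + spacing(4) + "]");
        System.out.println(spacing(25).length());
    }
}
```

Because the array is shared and only ever grows, deeply nested elements cost one reallocation per new maximum depth rather than one allocation per event.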


6.3 Additional Information on XML Document Creation

Java & XML [27] is a very good information source about creating and manipulating XML documents in Java. It also shows how to integrate stylesheet processing with Java.

More information on the SAX event model for data conversion into XML can be found in chapter 7 of [21], an excellent tutorial on Java for XML processing.


Chapter 7

Conclusion

This report has presented some XML techniques using a single, simple example in order to give a pedagogical overview of the different approaches to processing XML files. In fact, it was our own way of learning XML, so in no way should these techniques be viewed as optimal or definitive. We also made some connections with other computer science techniques because this helped us to learn XML by making links to our previous knowledge.

We have deliberately chosen to ignore many details so that the main ideas, which are relatively simple, can emerge. XML processing is only starting and much more remains to be done, especially as it is one of the fundamental building blocks of the Semantic Web initiative [12]. XML is the encoding for upper-level languages such as RDF, for defining information about documents, and OWL, for defining ontologies.


Bibliography

[1] Information processing, Text and Office Systems Standard Generalized Markup Language (SGML). First edition. Technical report, ISO (International Organization for Standardization). ISO 8879:1986(E), 1986.

[2] Document Object Model (DOM) Technical Reports. Technical report, W3C, http://www.w3.org/DOM/DOMTR, 2003.

[3] Uniform resource locators. Technical report, W3C, http://www.w3.org/Addressing/URL/Overview.html, 2003.

[4] Universal Resource Identifiers. Technical report, W3C, http://www.w3.org/Addressing/URL/URI_Overview.html, 2003.

[5] Portable document format. Technical report, Adobe Corporation, http://partners.adobe.com/public/developer/pdf/index_reference.html, 2005.

[6] XML package for Python. Technical report, http://pyxml.sourceforge.net/, 2005.

[7] XML4cobol SE. Technical report, ECM Systemintegration, http://xml4cobol.com/, 2005.

[8] Sharon Adler, Anders Berglund, Jeff Caruso, Stephen Deach, Tony Graham, Paul Grosso, Eduardo Gutentag, Alex Milowski, Scott Parnell, Jeremy Richman, and Steve Zilles. Extensible Stylesheet Language (XSL). Technical report, W3C, http://www.w3.org/TR/xsl, 2001.

[9] Altova Corp., http://www.xmlspy.com/. XML Spy 5 Enterprise Edition Manual, 2005.

[10] Apache XML Project. Xerces Java and C++ Parsers. http://xml.apache.org/#xerces, 2.7.1 edition, 2005.

[11] Anders Berglund, Scott Boag, Don Chamberlin, Mary F. Fernandez, Michael Kay, Jonathan Robie, and Jerome Simeon. XML Path Language (XPath) 2.0. Technical report, W3C, http://www.w3.org/TR/xpath20, 2005.

[12] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American, May 2001.


[13] Paul V. Biron and Ashok Malhotra. XML Schema Part 2: Datatypes. Technical report, W3C, http://www.w3.org/TR/xmlschema-2/, 2004.

[14] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, and Eve Maler. Extensible Markup Language (XML) 1.0 (Third Edition). Technical report, W3C, http://www.w3.org/TR/REC-xml, 2004.

[15] Allen Brown, Matthew Fuchs, Jonathan Robie, and Philip Wadler. XML Schema: Formal Description. Technical report, W3C, http://www.w3.org/TR/xmlschema-formal/, Sept 2001.

[16] James Clark. XSL Transformations (XSLT). http://www.w3.org/TR/xslt, 1999.

[17] James Clark. Multi-format schema converter based on RELAX NG. Technical report, Thai Open Source Software Center Ltd, http://www.thaiopensource.com/relaxng/trang.html, 2003.

[18] James Clark. New mode for XML. http://www.thaiopensource.com/nxml-mode/, 2004.

[19] James Clark and Steve DeRose. XML Path Language (XPath). Technical report, W3C, http://www.w3.org/TR/xpath, 1999.

[20] James Clark and Makoto Murata. RELAX NG specification. Technical report, Organization for the Advancement of Structured Information Standards [OASIS], http://www.oasis-open.org/committees/relax-ng/spec-20011203.html, 2001.

[21] E. Armstrong et al. The J2EE 1.4 Tutorial for Sun Java System Application Server Platform Edition 8.1 2005Q2. Sun corp., http://java.sun.com/j2ee/1.4/docs/tutorial/doc/index.html, June 2005.

[22] Benoît Habert. Objectif : CLOS. Objectif. Masson, Paris, 1996.

[23] Cay Horstmann. Big Java. Wiley, 2002.

[24] Dongwon Lee and Wesley W. Chu. Comparative analysis of six XML schema languages. ACM SIGMOD Record, 29(3):76–87, 2000.

[25] Murali Mani and Dongwon Lee. XML to relational conversion using theory of regular tree grammars. In Proc. VLDB Workshop on Efficiency and Effectiveness of XML Tools, and Techniques (EEXTT), pages 81–103, Hong Kong, China, August 2002.

[26] Jonathan Marsh and David Orchard. XML Inclusions (XInclude) Version 1.0. Technical report, W3C, http://www.w3.org/TR/xinclude/, 2004.

[27] Brett McLaughlin. Java & XML. O'Reilly, 2nd edition, 2001.


[28] W. Scott Means and Elliotte Rusty Harold. XML in a Nutshell. O'Reilly, 2nd edition, 2002.

[29] David Megginson. Official website for SAX. Technical report, Sourceforge, http://www.saxproject.org/, 2005.

[30] Matt Sergeant. XML-Parser-2.34. Technical report, CPAN, http://search.cpan.org/~msergeant/XML-Parser-2.34/Parser.pm, 2003.

[31] Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. XML Schema Part 1: Structures. Technical report, W3C, http://www.w3.org/TR/xmlschema-1/, 2004.

[32] Henry S. Thompson and Richard Tobin. XML Schema Validator. http://www.ltg.ed.ac.uk/~ht/xsv-status.html, 2002.

[33] Eric van der Vlist. XML Schema. O'Reilly, 2002.

[34] Eric van der Vlist. Relax NG. O'Reilly, http://books.xmlschemata.org/relaxng/, 2004.

[35] Malcolm Wallace and Colin Runciman. HaXml. University of York, http://www.cs.york.ac.uk/fp/HaXml/, 2002.

[36] Sean Wheller. <oXygen/> XML Editor User Guide. SyncRO Soft Ltd, http://www.oxygenxml.com/doc/ug-standalone-en/index.htm, 2005.

[37] Jan Wielemaker. SWI-Prolog SGML/XML parser. Technical report, University of Amsterdam, http://www.swi-prolog.org/packages/sgml2pl.html, 2005.


Appendix: Some XML Related Technologies and Systems

Abbreviation  Full name                                          Sections  Specs
DOM           Document Object Model                              5.1       [2]
DTD           Document Type Definition                           3.1       [14]
PDF           Portable Document Format                           4.5       [5]
RELAX NG      REgular LAnguage for Xml, New Generation           3.3       [20]
SAX           Simple Application programming interface for Xml   5.2       [29]
SGML          Standard Generalized Markup Language               1         [1]
URI           Uniform Resource Identifier                        2.1       [4]
URL           Uniform Resource Locator                           2.1       [3]
XML           eXtensible Markup Language                         *         [14]
XML Schema    XML Schema                                         3.2       [31, 13]
XPath         Xml PATH language                                  4.1       [19]
XSL           eXtensible Stylesheet Language                     4.2       [8]
XSLT          XSL Transformations                                4.2       [16]


Quick Reference Tables

These quick reference tables are taken from previous chapters. Names in italics refer to other elements. Regular expressions are used to describe the allowed forms: braces are used for grouping, ? indicates that the preceding grouping is optional, * that it can be repeated as often as necessary (possibly not at all) and + that it must appear at least once.

DTD

<!DOCTYPE rootElement SYSTEM "file.dtd" [ !ENTITY *]? >
<!ELEMENT NCName ( #PCDATA |? regexpOf !ELEMENT ) >
<!ELEMENT NCName (#PCDATA) >
<!ELEMENT NCName EMPTY >
<!ATTLIST elementNCName attributeNCName declValue default>
    declValue = CDATA | ID | IDREF | (CNAME | CNAME+ )
    default = #REQUIRED | #IMPLIED
<![CDATA[ ... ]]>
<!ENTITY name " ... ">
<!ENTITY % name " ... ">
<!ENTITY name SYSTEM "file.xml">

Table 3.1, Section: 3.1, Page 22


RELAX NG

Compact Syntax (RNC)           XML syntax (RNG)
default? namespace id = URI    <grammar>
| datatypes id = URI *           <start>pattern</start>
start = pattern                  | <define name="NCName"> pattern+ </define> *
| id = pattern *               </grammar>
element QName pattern          <element name="QName"> pattern+ </element>
attribute QName pattern        <attribute name="QName"> pattern+ </attribute>
pattern , pattern +            <group name="QName"> pattern+ </group>
pattern & pattern +            <interleave name="QName"> pattern+ </interleave>
pattern | pattern +            <choice name="QName"> pattern+ </choice>
pattern ?                      <optional name="QName"> pattern+ </optional>
pattern *                      <zeroOrMore name="QName"> pattern+ </zeroOrMore>
pattern +                      <oneOrMore name="QName"> pattern+ </oneOrMore>
mixed pattern                  <mixed name="QName"> pattern+ </mixed>
id                             <ref name="NCName"/>
empty                          <empty/>
text                           <text/>
dataTypeValue                  <value name="NCName"?> string </value>
dataTypeName id = value*       <data type="NCName"?>
                                 <param name="NCName">string</param>*
                               </data>



XSLT

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    xsl:output? xsl:template*
</xsl:stylesheet>
<xsl:output method="xml" indent="yes" encoding="UTF-8"/>
<xsl:template match="pattern">
    ...
</xsl:template>
<xsl:template name="QName">
    ...
</xsl:template>
<xsl:apply-templates select="node-set-exp"?/>
<xsl:apply-templates select="node-set-exp"?>
    xsl:sort | xsl:with-param*
</xsl:apply-templates>
<xsl:with-param name="QName"> ...</xsl:with-param>
<xsl:with-param name="QName" select="expr"/>
<xsl:call-template name="QName"/>
<xsl:call-template name="QName">
    xsl:with-param*
</xsl:call-template>
<xsl:param name="QName"> ...</xsl:param>
<xsl:param name="QName" select="expr"/>
<xsl:value-of select="expr"/>
<xsl:variable name="QName"> ...</xsl:variable>
<xsl:variable name="QName" select="expr"/>
<xsl:if test="boolean-expr">...</xsl:if>
<xsl:choose>
    <xsl:when test="expr">...</xsl:when>+
    <xsl:otherwise> ... </xsl:otherwise>?
</xsl:choose>
<xsl:for-each select="XPathExpr">
    xsl:sort* ...
</xsl:for-each>
<xsl:sort select="XPathExpr" order="ascending|descending"? data-type="number"?/>
<xsl:element name="QName" namespace="URI">...</xsl:element>
<xsl:attribute name="QName" namespace="URI">...</xsl:attribute>
<xsl:text> #PCDATA</xsl:text>



Index

xi:include, 53
xs:attribute, 28
xs:complexType, 28
xs:element, 28
xs:group, 28
xs:import, 28
xs:key, 28, 39–40
xs:keyref, 28, 39–40
xs:restriction, 28
xs:sequence, 28
xs:simpleType, 28
xs:unique, 28, 39–40
xsl:apply-templates, 61
xsl:choose, 61
xsl:for-each, 61
xsl:if, 61
xsl:template, 61
xsl:text, 61
xsl:value-of, 61
xsl:variable, 61
XSLT, 55
XMLSpy, 54
<oXygen/>, 54
nXML mode, 54

ATTLIST, 23
attribute, 6
Attribute Value Template, 72
axis specifier, 56

compact notation, 43–53
complex type, 37–39

DOCTYPE, 16
document creation, 105–114
Document Object Model (DOM), 91–96
DOM, 91–96
DTD, 21–26
DTD association, 25
dynamic element creation, 72–76

ENTITY, 23
entity, 16, 23

Formatting object, 80–89

generalized tree, 6

HTML transformation, 64–76

IMPLIED, 23
include, 53
instance document, 14–20

JTree, 100

Lisp, 6

named template, 59
namespace, 19–20, 40–41
node test, 57
node types, 56

parameter entity, 23
PCDATA, 22

reference constraints, 39
Relax NG, 43–53
REQUIRED, 23

SAX, 96–100
SAX error handler, 98
SAX event handling, 97
SAX events creation, 109
schema, 26–54
schema association, 53–54
simple type, 37
stylesheet, 55–90
stylesheet association, 89

template, 59
template named, 59
Trang schema converter, 44
transformation, 55–90
tree, 11

uniqueness constraints, 39

validation, 21–54

well-formed, 14

XPath, 56–59
XSL, 55–90
XSL predicate, 57
XSL transformations, 59–89
XSL-FO, 80–89
