
Compact Storage for Efficient Management of XML Documents

Ramez Alkhatib

A Doctoral Dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Engineering Science (Dr.-Ing.) - Doktor der Ingenieurwissenschaften -

from the University of Konstanz

Department of Computer and Information Science

http://nbn-resolving.de/urn:nbn:de:bsz:352-opus-108503

http://kops.ub.uni-konstanz.de/volltexte/2010/10850/


Advisor:

Prof. Dr. Marc H. Scholl, University of Konstanz, Konstanz, Germany

Reviewers:

Prof. Dr. Marcel Waldvogel, University of Konstanz, Konstanz, Germany

Prof. Dr. Marc H. Scholl, University of Konstanz, Konstanz, Germany

Date of the oral examination: February 19, 2010


DEDICATION

To my parents

and

To my wife and my children


Acknowledgments

I would like to express my gratitude to all those who made it possible for me to complete this thesis.

First and foremost, I would like to thank my God for all his blessings, without which none of my work would have been done.

I would like to express my sincerest thanks to my advisor, Professor Marc H. Scholl, who has helped me shape my research from the very first day, and who has always been supportive and patient throughout the whole period of my study until the very last day before submission.

Special thanks go to all my colleagues at the Department of Computer and Information Engineering for providing a good working atmosphere. Thanks also to the Syrian government for providing me with the financial support for my studies.

Personally, I would like to express my deepest gratitude to my family, Ahmad Alkhatib, Malak Awir, Yasser, Marhaf, Anas, Razan, Eyhab, Malek, Takred Saied, Malak, Rima and Mohamed-Jesan, who serve as an inspiration for me to move on against all odds on my way.


Abstract

XML is becoming widely used for data exchange and manipulation. As a consequence, an increasing number of XML documents need to be managed. Many existing approaches process XML data in main memory. Since XML usage continues to grow and XML is extremely verbose by nature, large or even moderately large XML documents cannot be processed within main memory. Consequently, these approaches suffer from the limitations of current main memory. On the other hand, because of the maturity and widespread deployment of (object-)relational database technologies, they have been suggested as an alternative for storing and managing XML data. However, persistent storage of XML in its native format avoids the transformation cost and is the best alternative. This has generated an increasing need for robust, high-performance XML database systems that are able not only to query and update XML data efficiently, but also to store it in a compact representation.

There have been many proposals for managing XML documents. However, two common strategies are available to provide robust storage and efficient query processing.

The first is based on numbering schemes that gather structural information from XML documents and store it in a way that allows quick identification of structural relationships between nodes. This identification plays a crucial role in efficient XML query processing.

The second strategy tries to reduce the size of XML documents through compaction techniques. While a naive representation of XML documents leads to excessive redundancy, the compaction of XML documents not only reduces the amount of disk space occupied by the data, but also enhances query processing speed.

The thesis presents different solutions for the efficient management of XML data. It proposes approaches that combine the strengths of labeling and compaction technologies and bridge the gaps between them, in order to exploit their benefits, avoid their drawbacks, and achieve better performance than when these technologies are used independently.

An extensive experimental evaluation of the proposed approaches shows that they yield considerable performance improvements for XML processing compared to other approaches in this field.

Keywords: XML Compaction, XML Querying, XML Updating, XML Labeling Scheme


Zusammenfassung (Summary)

XML is increasingly used for data exchange and manipulation. Many approaches process XML data in main memory. Because XML is used more and more frequently and XML syntax requires additional storage, larger XML files cannot be processed in main memory. Consequently, these approaches suffer from the limitations of current main memory. In contrast, object-relational database technologies are cited as alternatives for storing and managing XML data because of their maturity and wide deployment. The persistent storage of XML in its original format avoids losses through conversion and is the best alternative. This results in a growing need for robust, high-performance XML databases that can not only query and update XML data efficiently, but also store it compactly.

There are many approaches to managing XML documents. However, two common strategies are known to ensure robust storage and efficient search.

The first is based on a numbering scheme that extracts structural information from XML documents. This information is stored in a way that allows the relationships between nodes to be identified quickly. This identification plays a decisive role in efficient query processing.

The second strategy shrinks XML files by means of compression techniques. While a naive representation of XML files produces heavy redundancy, compressing XML files not only reduces the required storage space, but also increases query speed.

This thesis presents various solutions for the efficient management of XML data. It introduces approaches that combine the strengths of labeling and compression technologies, close the gap between these technologies, overcome their drawbacks, and deliver better performance than when these technologies are used separately.

An extensive experimental evaluation of the presented approaches shows that, compared with other approaches in this field, they achieve clear performance improvements in XML processing.

Contents

1 Introduction
    1.1 Motivation
    1.2 Contributions
    1.3 Thesis Outline

2 XML and Related Technologies
    2.1 XML: Extensible Markup Language
    2.2 XML Tree
    2.3 Constraints on XML Documents
        2.3.1 Well-formed XML
        2.3.2 Document Type Definitions (DTD)
        2.3.3 XML Schema (XSD)
    2.4 XML Parsing
        2.4.1 Document object model (DOM)
        2.4.2 The simple API for XML (SAX)
    2.5 XML Query Languages
        2.5.1 XML Path Language
        2.5.2 The XQuery Language
    2.6 XML and Databases
        2.6.1 XML-Enabled Databases
        2.6.2 Native XML Databases
    2.7 Summary

3 Related Work
    3.1 XML Labeling Schemes
        3.1.1 Static Labeling Schemes
        3.1.2 Prefix Labeling Schemes
    3.2 XML Compression
        3.2.1 XMill compression
        3.2.2 XML skeleton compression
    3.3 Summary

4 SCQX: Compacting, Storing and Querying XML Documents Using a Static Labeling Scheme
    4.1 The Level-Order Labeling Scheme used in SCQX
    4.2 Compaction Principles of SCQX
    4.3 The Storage Model of SCQX
        4.3.1 Storage structure
        4.3.2 Index methods
        4.3.3 Query Evaluation
    4.4 A Real-Life XML Example
    4.5 Summary

5 CXQU: A Cluster Labeling Scheme for Storing, Querying and Updating Compacted XML Documents
    5.1 The Cluster Labeling Scheme
        5.1.1 The Initial Labeling of the Cluster Labeling Scheme
        5.1.2 Inserting new nodes
        5.1.3 Byte Representation of Cluster Labels
    5.2 Compaction Principles of CXQU
    5.3 The CXQU Storage Model
        5.3.1 Storage structure
        5.3.2 Query Evaluation
        5.3.3 Support for Updates to Compacted XML Structures
    5.4 Summary

6 CXDLS: Compacting, Storing, Querying and Updating XML Documents Using a Dynamic Labeling Scheme
    6.1 The Prefix Labeling Scheme
    6.2 XML Compaction
    6.3 The Storage Model of CXDLS
        6.3.1 Storage structure
        6.3.2 Query Evaluation
        6.3.3 Support for Updates to Compacted XML Structures
    6.4 Summary

7 Experimental Evaluation
    7.1 Experimental Environment
    7.2 Data Sets
    7.3 Queries
    7.4 Experimental Results
        7.4.1 Storage Requirements
        7.4.2 Query Performance
        7.4.3 Update Performance
    7.5 Summary

8 Conclusions and Future Work
    8.1 Conclusions
    8.2 Future Work

A The used Query sets

Bibliography

List of Figures

2.1 A Sample XML Document (bibliography.xml)
2.2 The tree structure of the Sample XML document
2.3 A DTD of the Sample XML Document (bibliography.dtd)
2.4 The basic structure of an XML schema
2.5 The XPath Axes
2.6 The Structure of XML databases

3.1 The preorder ranks pre (left numbers) and the postorder ranks post (right numbers) for tree of simple XML example.
3.2 The pre/post plane illustrates XPath axis conditions for the four major XPath axes ancestor, descendant, following, and preceding as seen from node f.
3.3 A Dewey Encoding (Example)
3.4 An XML tree, its skeleton and storage

4.1 XML document structure with level-order IDs (shown inside nodes)
4.2 Compacted structure with unique numbers of elements and cardinality counters (in parentheses).
4.3 Element Table of SCQX's storage structure
4.4 Value Table of SCQX's storage structure
4.5 Path Table of SCQX's storage structure
4.6 The inverted value index for the XML data value of XML example in figure
4.7 A random part of Hamlet XML document
4.8 The Level-Order labels (a) and the compacted structure (b) of the Part of Hamlet XML document with Level-order labels (red) and cardinality counters (blue)
4.9 Storage structures for the Part of Hamlet XML document
4.10 The Indexes for the Part of Hamlet XML document

5.1 An XML document with cluster labels (CIDs) of element groups and TIDs of the text nodes
5.2 An example for an insertion of subtree after the last child of a parent element
5.3 An example of an insertion of subtrees and nodes before the first child of a parent element
5.4 An example of an insertion of subtrees between two existing element nodes
5.5 A Prefix-Free Encoding of the prefix Bitstrings
5.6 An Alternative Prefix-Free Encoding of the prefix Bitstrings
5.7 Compacted structure of simple XML document in Figure 5.1 with cluster labels (CIDs) and the odd position number (OPN) for each node
5.8 Element Table of CXQU's storage structure
5.9 Value Table of CXQU's storage structure
5.10 An Inverted Path Index
5.11 XML Storage Model (top), Internal Representation in CXQU (bottom)
5.12 Element node insertion, and subtree insertion in CXQU

6.1 An XML document with ORDPATH labeling
6.2 An example for the binary encoding of an ORDPATH label
6.3 The XML structure and its compacted form
6.4 Example for inferring the labels from label of compacted node
6.5 Element table of CXDLS's storage structure
6.6 Value table of CXDLS's storage structure
6.7 Path table of CXDLS's storage structure
6.8 The XML storage Model (left) and its internal representation (right) in CXDLS
6.9 Element node insertion, and subtree insertion in CXDLS

7.1 Comparison of storage requirements for SCQX, Level-Order LS and Tree
7.6 Comparison of storage requirements for CXQU, Cluster LS and ORDPATH
7.11 Comparison of storage requirements for CXDLS and ORDPATH
7.12 Comparison of storage requirements for CXDLS and ORDPATH
7.13 Comparison of storage requirements for CXDLS and ORDPATH
7.14 Comparison of storage requirements for CXDLS and ORDPATH
7.15 Comparison of storage requirements for CXDLS and ORDPATH
7.16 Comparison of storage requirements for different approaches
7.17 Comparison of storage requirements for different approaches
7.18 Comparison of storage requirements for different approaches
7.19 Comparison of storage requirements for different approaches
7.20 Comparison of storage requirements for different approaches
7.21 Comparison of storage requirements for different approaches
7.22 Query Performance of our approaches vs. MonetDB/XQuery System (XMark Queries)
7.23 Query Performance of our approaches vs. MonetDB/XQuery System
7.28 Query Performance of our approaches vs. MonetDB/XQuery System
7.32 The performance of inserting a subtree to Shakespeare
7.33 The performance of inserting a node to Hamlet
7.34 The performance of inserting node to XMark
7.35 The performance of inserting a subtree to XMark

List of Tables

4.1 Path Index for the example XML document

5.1 Cluster labels format

7.1 XML datasets used in the experiments
7.2 Update queries for Shakespeare
7.3 Update queries for XMark

A.1 The Query set for XMark
A.2 The Query set for XMark
A.3 The Query set for Shakespeare

Chapter 1

Introduction

Contents

1.1 Motivation

1.2 Contributions

1.3 Thesis Outline

1.1 Motivation

Due to the growing popularity of XML as a data exchange and storage format, the need to develop efficient techniques for managing XML documents has emerged. In addition, thanks to its flexible, open and extensible features, XML is widely used in numerous domains such as industry, business and Web applications. As a consequence, more and more data are being represented in XML format, and both the size and the number of documents that need to be processed are increasing rapidly. An urgent need has emerged to develop efficient data management systems for storing and querying large repositories of XML data, and database research has seen a rebirth of interest in XML database systems.

Standard database management systems, such as relational database management systems (RDBMS), have long been used for managing, updating and querying databases, which are collections of data stored on persistent media. Researchers have proposed using conventional (relational) database systems to store and query XML documents, which is why traditional databases are adding XML support to their systems.

Unlike data in the relational model, XML data have a rich, ordered and semi-structured nature. Therefore the use of traditional (object-)relational databases may come at the expense of performance. Typically, native XML databases will be more efficient and can provide higher performance, since they store the XML data persistently in its native format and so avoid the transformation cost. Consequently, there may be a need for improved or novel techniques for designing efficient XML data management.


There are several challenges and aspects to consider when designing an efficient XML data management system.

• An important aspect of XML data management systems concerns providing persistent storage structures for XML databases, which represent and maintain the structure of XML trees and at the same time solve the problem of data redundancies and main memory limitation.

• A second important aspect of XML data management systems, in which ongoing research plays a central role, is query processing.

• Another issue in XML data management, which must be addressed, is XML updates.

Since XML documents can exist without an associated schema description, it is also useful to explore techniques for storing and querying such schema-less documents. The focus of this thesis is to address such issues and challenges and to present robust solutions for storing, querying and updating these XML documents.

1.2 Contributions

This thesis presents novel approaches for managing XML data, which aim to provide high performance while minimizing the storage requirements of XML data. Our work makes the following contributions:

• The first contribution of the thesis is an effective algorithm for compacting the structure of XML documents. It exploits repetitive consecutive tags in the XML structure by using a labeling scheme that has a small storage requirement and maintains the relationships among XML tags after compaction. We also present a robust storage structure that stores the compacted XML structure and the data separately, which guarantees that data values are only accessed on demand. It includes a set of access support structures to guarantee fast query performance, and it processes queries directly over the compacted structure without de-compaction.

• The second contribution of this thesis is the introduction of a new hierarchical labeling scheme, called the cluster labeling scheme, that is derived from the ORDPATH labeling scheme and retains all its desirable features. In the cluster labeling scheme, the sibling element nodes are clustered and labeled and then compacted by our compaction algorithm. In addition, a complete and robust storage structure for compacted XML is developed, in which the structural information of an XML document is stored separately from the contents to improve query processing performance by avoiding scans of irrelevant data values. Using this storage structure, it is possible to support efficient query and update processing on compacted XML documents and to reduce storage space dramatically.

• Another important contribution of this thesis combines the strengths of labeling and compaction technologies and bridges the gaps between them, in order to exploit their benefits, avoid their drawbacks, and achieve better performance than when these technologies are used independently. Here, a new approach is proposed for XML compaction based on exploiting the similarity of consecutive tags and subtrees in the structure of XML documents. It also uses the ORDPATH labeling scheme for gathering sufficient structural information from the compacted XML document. It then stores the compacted XML in a way that allows fast access and supports both update and query processing efficiently by using the labeling technique.

• We have conducted an experimental study that compares the performance of the proposed approaches against that of current state-of-the-art solutions.
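The general idea behind exploiting repetitive consecutive tags can be illustrated with a minimal sketch. It is not the compaction algorithm of this thesis, which is presented in Chapter 4; it only shows how a run of identically named sibling elements can be collapsed into one entry with a cardinality counter. The function name `sibling_runs` and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def sibling_runs(parent):
    """Collapse runs of consecutive, identically named children of `parent`
    into (tag, count) pairs, preserving document order."""
    return [
        (tag, sum(1 for _ in run))
        for tag, run in groupby(parent, key=lambda child: child.tag)
    ]


# Hypothetical fragment with a run of three consecutive <line> elements.
speech = ET.fromstring(
    "<speech><speaker/><line/><line/><line/><stagedir/><line/></speech>"
)

for tag, count in sibling_runs(speech):
    print(tag, count)
```

Only consecutive repetitions are merged, so the final `line` element after `stagedir` stays a separate entry and the sibling order remains recoverable from the counters.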

1.3 Thesis Outline

The rest of the thesis is structured as follows:

• Chapter 2, XML and related technologies: contains background information on XML and its underlying technologies. It presents the XML data model and some important languages to query XML data and to describe the structure of XML data. It briefly discusses the different kinds of techniques to store XML documents in databases.

• Chapter 3, Related Work: gives an overview of previous research that is related to the topic of this thesis. It describes two important strategies, namely the labeling scheme and compaction techniques, which are proposed for improving XML performance, and explains the advantages and disadvantages of each approach.

• Chapter 4, SCQX: describes an efficient algorithm for compacting the structure of XML documents, which reduces the amount of space occupied by the structure of XML documents and retains the ability to execute traversals and queries over the structure.

• Chapter 5, CXQU: presents a new labeling scheme, which enables efficient querying and updating and the compact storage of XML documents.

• Chapter 6, CXDLS: provides a solution for efficient compaction of XML documents, especially for those with a regular structure, using a dynamic labeling scheme. This combination of XML compaction and a dynamic labeling scheme can achieve a significant reduction in storage space, and at the same time can enable high performance for both query and update processing.

• Chapter 7, Experimental Evaluation: covers an experimental evaluation of the proposed approaches. It includes performance studies on the storage requirements and the execution times needed to process queries, and examines the update performance of the approaches proposed in this thesis in comparison with other existing approaches in this field.

• Chapter 8, Conclusions and Future Work: concludes this thesis and provides an outlook on possible future work.

Chapter 2

XML and Related Technologies

Contents

2.1 XML: Extensible Markup Language

2.2 XML Tree

2.3 Constraints on XML Documents

2.3.1 Well-formed XML

2.3.2 Document Type Definitions (DTD)

2.3.3 XML Schema (XSD)

2.4 XML Parsing

2.4.1 Document object model (DOM)

2.4.2 The simple API for XML (SAX)

2.5 XML Query Languages

2.5.1 XML Path Language

2.5.2 The XQuery Language

2.6 XML and Databases

2.6.1 XML-Enabled Databases

2.6.2 Native XML Databases

2.7 Summary

This chapter gives a brief introduction to XML and some background information on the underlying technologies used in the following chapters of this thesis. First, in Section 2.1, we begin with a description of the Extensible Markup Language (XML). Section 2.2 then introduces the XML tree model. In Section 2.3, some constraints on XML documents are discussed. Afterwards, in Section 2.4, we present an overview of XML parsing. Next, in Section 2.5, we describe the two most important query languages for XML data. Finally, in Section 2.6, we discuss the techniques to store XML documents, showing the convergence of XML and databases and how they influence each other.


2.1 XML: Extensible Markup Language

Since the Extensible Markup Language [Bray 2008], abbreviated as XML, became an official World Wide Web Consortium recommendation in 1998, it has played an increasingly important role in the exchange of a wide variety of data on the Internet. XML is now a global standard for describing structured data. It was designed as a simplified subset of the Standard Generalized Markup Language (SGML), which was standardized by ISO in 1986 and allows the definition of different markup languages. Another famous markup language is the Hyper Text Markup Language (HTML) [Raggett 1999], which is commonly used for presenting Web pages. HTML, too, is derived from SGML: it is defined as an SGML application. The aim of markup languages is to enrich textual data with metadata; the markup is the means of conveying that metadata. However, SGML and HTML have a number of disadvantages. SGML is a rich and very powerful markup language, and thus it is complex to understand and use, especially on the World Wide Web (WWW). Although HTML has been a success in Web page presentation, its extensibility is limited. It is used to describe the appearance of the data rather than the data itself: HTML provides a fixed vocabulary of markup elements for describing hyperlinked Web documents, and it mixes content with appearance. The wide variety of applications on the Internet calls for a simple, extensible and flexible markup language to exchange data on the Web. Towards that goal, the World Wide Web Consortium (W3C) developed the eXtensible Markup Language (XML) to bridge the gap between SGML and HTML. XML was meant to be SGML on the Web. Owing to its many attractive properties, XML has gained enormous momentum and widespread usage. Some of these properties are:

• XML is more flexible than HTML and it is extensible because, unlike HTML, it does not have a fixed format.

• It also strictly separates content from appearance.

• XML not only retains most of the advantages of SGML, but is also easier to learn, use, and implement than full SGML.

• It allows users to invent and use their own tags to better describe the data. Thus users can define new tags as needed, and the tags can be nested to an arbitrary depth.

• XML documents can be written and edited easily using simple text tools or standard commercial software on any platform.


• Since XML information is encoded in text, it is clearly readable by humans and machines.

• Since XML is based on Unicode, which can represent characters from practically all known written languages, XML is considered multilingual. Thus, XML is poised to play an important role in the exchange of a wide variety of data on the Internet.

XML is simply a format for encoding data that captures both structure and content. The two basic units for structuring documents are elements and attributes. Apart from elements and attributes, other types of nodes can occur in an XML document:

• Elements: Elements are the basic part of an XML document and they represent the logical components of a document. Each element has a name, which appears in its tags, and is delimited by a start tag and an end tag. A start tag starts with the < character and ends with the > character. An end tag starts with </ and ends with >. An element may be empty, or it may encapsulate textual data, called the content of the element, between its start and end tags. An element can also contain other elements, known as the sub-elements of that element. The element that contains all other elements in the document is called the root element. In the sample XML document shown in Figure 2.1, the root element of the document is the bibliography element and the article element is a sub-element of the bibliography element. In addition, elements may contain one or more attributes.

• Attributes: Attributes are normally used to provide additional information about elements; each attribute has a name and a value separated by the `=' character. Attributes appear within the start tag of the element they belong to. Attributes cannot contain elements or sub-attributes. In addition, each attribute may be specified only once per element, and the order of attributes is not important.

• Comments: They are used to add information for human readers to the document. Users may insert comments in an XML document, delimited by the <!-- and --> sequences. Comments may occur anywhere in a document as long as they are outside other markup and do not precede the XML declaration.

• Processing Instruction: It provides information to be used by software applications. Processing instructions are delimited by <? and ?>.


• XML Declaration: It has the form of a processing instruction, appears at most once, and only at the very beginning of an XML document. It is used to identify the document.

• Document Type Declaration: The <!DOCTYPE> declaration is used to specify the DTD (Document Type Definition) for an XML document. This is explained in Section 2.3.2.

Further information on the XML specification can be found in [Bray 2008].
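The node types listed above can be observed with a small sketch based on Python's standard `xml.dom.minidom` module. The sample document is a hypothetical one written for this illustration, not the document of Figure 2.1, and the helper `list_nodes` and the `KIND` table are likewise assumptions.

```python
from xml.dom import minidom, Node

# Hypothetical sample document (not the thesis's bibliography.xml).
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE bibliography SYSTEM "bibliography.dtd">
<!-- A hypothetical sample; not the document from Figure 2.1. -->
<?render style="plain"?>
<bibliography>
  <article year="2009">
    <title>Compact XML storage</title>
  </article>
</bibliography>"""

KIND = {
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.DOCUMENT_TYPE_NODE: "document type declaration",
}


def list_nodes(node, depth=0):
    """Return (depth, kind, detail) triples for all descendants of `node`."""
    rows = []
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            continue  # skip whitespace-only text used for indentation
        kind = KIND.get(child.nodeType, "other")
        if child.nodeType == Node.ELEMENT_NODE:
            detail = child.tagName
        elif child.nodeType == Node.DOCUMENT_TYPE_NODE:
            detail = child.name
        elif child.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            detail = child.target
        else:
            detail = child.data.strip()
        rows.append((depth, kind, detail))
        if child.nodeType == Node.ELEMENT_NODE:
            # Attributes live inside the start tag of their element.
            for name, value in child.attributes.items():
                rows.append((depth + 1, "attribute", f"{name}={value}"))
            rows.extend(list_nodes(child, depth + 1))
    return rows


rows = list_nodes(minidom.parseString(SAMPLE))
for depth, kind, detail in rows:
    print("  " * depth + f"{kind}: {detail}")
```

The XML declaration does not show up as a node of its own in the DOM; the parser consumes it before building the tree. The parser also does not fetch the external `bibliography.dtd` named in the document type declaration.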

Figure 2.1: A Sample XML Document (bibliography.xml)

2.2 XML Tree

An XML document is usually modeled as an ordered, labeled tree of nodes, which is the most natural representation for an XML document. In a node-labeled representation, the tree consists of nodes and edges. Each node represents an element, an attribute or atomic content, which is always of the "string" type. The edges represent relationships between elements or between elements and contents. We have to distinguish leaf nodes from inner nodes. The leaf nodes always contain atomic data, while the inner nodes hold the tag and attribute names. The tree always starts at a single node representing the root element and it develops from the root into child elements. Each element except the root element is contained inside a parent element. The tree for the sample XML document is shown in Figure 2.2.

Figure 2.2: The tree structure of the Sample XML document

2.3 Constraints on XML Documents

As previously mentioned, XML is extensible because it can represent any kind of information using elements and attributes as needed. In order to maintain consistency across XML documents, the XML specification defines a syntactic constraint, called the well-formedness constraint, that applies to all XML documents and comprises the basic requirements. In addition to the restrictions mandated by the XML standard, it is possible to define a general set of rules for a document's elements and attributes. This is done using XML schema description languages such as the Document Type Definition (DTD) [Bosak 1998b] or XML Schema (XSD) [Fallside 2004].

2.3.1 Well-formed XML

A document is said to be well-formed if its structure follows these rules:


• The names of attributes have to be unique per attribute list.

• The position of attributes has to be within element start tags.

• The document contains exactly one root element.

• The element names do not contain spaces.

• The open tags and the close tags must be balanced and properly nested. This property ensures that XML elements can be considered as trees.
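The rules above can be checked mechanically by any XML parser. The following Python sketch uses the standard library parser for this purpose; the helper name is_well_formed and the sample strings are assumptions made for illustration and do not stem from the sample document in Figure 2.1.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical inputs, one per rule violation.
good = "<bibliography><article id='1'><title>XML</title></article></bibliography>"
badly_nested = "<bibliography><article></bibliography></article>"  # tags not properly nested
two_roots = "<a/><b/>"                                             # more than one root element
duplicate_attribute = "<a x='1' x='2'/>"                           # attribute name not unique
attribute_in_end_tag = "<a></a x='1'>"                             # attribute outside a start tag
```

Only the first string is accepted; each of the others violates one of the listed rules and is rejected with a parse error.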

2.3.2 Document Type Definitions (DTD)

The Document Type Definition (DTD) is one of the constraint definition languages that help to restrict the set of permissible document structures. A DTD consists of element declarations and attribute list declarations, which are essential for describing an XML document, and other constructs such as conditional sections, entity declarations and notation declarations.

Definition 2.1 (Element Declaration)

An element type declaration describes the name and content of an element. It has the following form: <!ELEMENT element-name content-model>, where element-name is an element name and content-model describes the content model, which is the format and the syntactical restrictions satisfied by the contents of the elements with the same element-name.

Definition 2.2 (Attribute List Declaration)

An attribute list declaration specifies the attributes associated with a particular element. It has the form <!ATTLIST element-name attribute-definitions>, where element-name is an element name and attribute-definitions is a list of attribute definitions, each of which has the form attribute-name attribute-type default-declaration.

Further information about DTDs can be found in [Bosak 1998b].

A DTD can either be defined directly in the XML document, or it can be referenced indirectly by a document type declaration at the start of an XML document. An XML document is valid if its structure conforms to the rules set by its associated DTD. Note that it is possible for an XML document to be well-formed yet invalid, but a valid document must also be well-formed. Figure 2.3 shows a DTD of the sample XML document in Figure 2.1.

Although DTDs have been widely used, they have some limitations in managing type information. For example, a DTD can only specify that element content consists of text strings; it cannot express data types such as numbers or dates. Furthermore, it provides only very limited support for XML namespaces. Due to these limitations, a DTD does not offer sufficiently advanced ways of specifying constraints on the XML document structure.

Figure 2.3: A DTD of the Sample XML Document (bibliography.dtd)

2.3.3 XML Schema (XSD)

XML Schema [Fallside 2004] is a definition language for describing the structure and constraining the contents of XML documents. It was proposed to overcome some of the deficiencies of DTDs and is developed and standardized by the World Wide Web Consortium. XML Schema is a much more comprehensive schema description language for XML documents and is itself expressed in XML syntax. XML Schema supports namespaces and provides a rich set of data types and rules to declare elements and attributes. XSD also allows the creation of user-defined data types. The basic structure of an XML schema is shown in Figure 2.4. An XML schema is typically stored in a separate file with the file extension *.xsd, which stands for XML Schema Definition. An XML document that is valid with respect to an XML schema is known as an instance document. It is important to note that the techniques described in this thesis do not require DTDs or XML schemas. For this reason, we have included only the general principles of these XML schema languages and have not gone into much detail in this area.


Figure 2.4: The basic structure of an XML schema

2.4 XML Parsing

In order for XML documents to be used by application programs, they have to be parsed. To this end, standardized interfaces between programming languages and XML have been developed. The two most important models for XML parsing are the Document Object Model (DOM) [W3C 1998] and the Simple API for XML (SAX) [Saxproject].

2.4.1 Document object model (DOM)

The document object model (DOM) is a specification by the World Wide Web Consortium (W3C). DOM represents the XML document as an ordered tree of nodes, which can be accessed and manipulated. These nodes can be elements, attributes, comments, text nodes, etc. The main aim of the DOM standard is to provide a programmatic representation of XML documents, which can be used to access documents from different programming languages. DOM parsers typically load an entire XML document into main memory for fast access. The downside of DOM is that building the tree is slow and requires large amounts of memory, since the whole document must be materialized. Therefore DOM is not suitable as an XML parser for very large XML documents.
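As an illustration of the DOM style of access, the following Python sketch uses the minidom implementation from the standard library. The tiny document and the variable names are assumptions made for this example; it is not the bibliography.xml of Figure 2.1.

```python
from xml.dom.minidom import parseString

# The whole (hypothetical) document is parsed into an in-memory tree first.
doc = parseString(
    "<bibliography><article year='2002'><title>XML</title></article></bibliography>"
)

root = doc.documentElement                          # the root element node
article = root.getElementsByTagName("article")[0]   # navigate downward
title = article.getElementsByTagName("title")[0]

title_text = title.firstChild.data                  # text node below <title>
year = article.getAttribute("year")                 # attribute access
parent_name = title.parentNode.tagName              # backward navigation is possible
```

Because the complete tree is held in memory, any node can be reached from any other node, at the price of the memory consumption discussed above.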

2.4.2 The simple API for XML (SAX)

The simple API for XML (SAX) provides an event-based interface for parsing XML documents. It reads the XML document sequentially, and whenever a certain piece of XML text has been recognized, such as a start or end tag, it sends an event to the application, which can then interpret it and take corresponding actions. SAX is fast and requires very little memory because there is no need to keep the whole XML document in memory; it can therefore be used for large XML documents. The disadvantage of SAX is that it does not provide random access to XML because it does not allow backward navigation.
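The event-based style can be sketched with the SAX binding of the Python standard library. The handler class EventRecorder and the sample input are assumptions for illustration; the handler stores the events in a list only to make them visible, whereas a real application would process each event and discard it.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Receives one callback per recognized piece of the document."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Whitespace-only chunks are ignored in this sketch.
        if content.strip():
            self.events.append(("text", content.strip()))


handler = EventRecorder()
xml.sax.parseString(b"<bib><book><title>XML</title></book></bib>", handler)
```

The events arrive strictly in document order, which is why earlier parts of the document cannot be revisited unless the application stores them itself.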


2.5 XML Query Languages

The SAX and DOM APIs, described in the previous section, provide a universal way of accessing information in XML documents from different programming languages. However, these APIs rely on a strictly procedural method of extracting information from documents, which is not very convenient. A more convenient way to navigate through documents is to use query languages, which are usually declarative rather than procedural. A tree is the most widely accepted data model of XML, on which XML query languages are evaluated. The tree has a single document root and elements as the intermediate nodes. A query expression locates nodes in the XML tree by specifying the names or values of the nodes and the paths to reach them. The query processor traverses the data model tree to get to the desired nodes. Many query languages for querying, updating, transforming, and integrating XML data have been proposed. First of all, the query languages specified by the World Wide Web Consortium (W3C) must be mentioned: XPath [Clark 1999] and XQuery [Boag 1999]. Since XPath is an essential part of XQuery and in this thesis we consider only XPath, we do not go into the details of XQuery. We therefore first give an overview of the XPath language and only briefly summarize XQuery.

2.5.1 XML Path Language

The XML Path Language (XPath) was introduced in 1999 by the W3C as the standard query language for defining how a specific part within an XML document can be located. XPath models an XML document as a tree of nodes, which includes element nodes, attribute nodes and text nodes. Thus it enables navigation through XML trees and the return of a set of matching nodes.

2.5.1.1 XPath Expression

An expression is the primary syntactic construct in XPath. An expression evaluates to one of four basic types:

• A node sequence (an unordered collection of nodes without duplicates)

• A boolean value ("true" or "false")

• A number (a floating-point number)

• A string (a sequence of characters)


XPath expressions consist of one or more steps that are evaluated from left to right. Each step generates a sequence of nodes, which is the input for the subsequent step. A step consists of an axis, defining the direction of movement, and a node test, selecting nodes based on their kind, name and/or type. Optionally, it is followed by predicates that filter the node sequence to which the step has evaluated and retain only the nodes that fulfill the predicates.

2.5.1.2 Location Path

The most important kind of expression is the location path. Every location path has an initial starting point, which is called the context node. Location paths are applied to context nodes and produce a node sequence as a result. We have to distinguish between two kinds of location paths: absolute location paths and relative location paths. An absolute location path starts at the root of an XML document. It consists of a slash (/) optionally followed by a relative location path. A relative location path can start at an arbitrary context node. It consists of a sequence of one or more location steps separated by slashes (/). The steps in a relative location path are composed from left to right. Each step selects a set of nodes relative to a context node.

Example 2.1 (Absolute Location Path)

/bib/books/book/title

Example 2.2 (Relative Location Path)

book/title
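The difference between the two kinds of paths can be tried out with the ElementTree module of the Python standard library, which implements only a limited subset of XPath and evaluates every path relative to an element. The sample document below is an assumption chosen to match the element names of Examples 2.1 and 2.2.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose root element is <bib>.
root = ET.fromstring(
    "<bib><books>"
    "<book><title>A</title></book>"
    "<book><title>B</title></book>"
    "</books></bib>"
)

# Counterpart of the absolute path /bib/books/book/title:
# the evaluation starts at the root element <bib>.
absolute_titles = [t.text for t in root.findall("./books/book/title")]

# The relative path book/title is evaluated against a chosen
# context node, here the <books> element.
context = root.find("books")
relative_titles = [t.text for t in context.findall("book/title")]
```

Both evaluations select the same two title elements; the relative path would select nothing if it were evaluated with <bib> as the context node, because <bib> has no book children.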

2.5.1.3 Location Step

In general, a location step has three parts: the axis, the node test, and zero or more predicates.

• Axis: The axis specifies the tree relationship between the nodes selected by the location step and the context node.

• Node test: The node test specifies the type of the nodes selected by the location step.

• Predicates (zero or more): Predicates use arbitrary expressions to filter the set of nodes selected by the location step.

The syntax for a location step is the axis name and node test separated by two colons (::), followed by zero or more expressions, each in square brackets. For example, child::book[price] selects the book children of the context node that have a price child.


2.5.1.4 Axes

An axis defines a node sequence relative to the current node. The XPath specification defines several axes, which are briefly described below. The idea of our illustration, shown in Figure 2.5, is taken from [Esbudellat], where the current context node is marked by a distinct symbol.

• Child: The child-axis contains the children of the context node. Attribute nodes are not included.

• Parent: The parent-axis contains the parent of the context node.

• Self: The self-axis contains the context node itself.

• Descendant: The descendant-axis contains the descendants of the context node. A descendant is a child or a child of a child and so on.

• Ancestor: The ancestor-axis contains the ancestors of the context node; the ancestors of the context node consist of the parent of the context node and the parent's parent and so on; thus, the ancestor-axis will always include the root node, unless the context node is the root node.

• Following-sibling: The following-sibling-axis contains all nodes that follow the context node in the same parent element. If the context node is an attribute or namespace node, the following-sibling-axis is empty.

• Preceding-sibling: Similar to the following-sibling-axis, the preceding-sibling-axis contains all nodes that appear before the context node in the same parent element. If the context node is an attribute or namespace node, the preceding-sibling-axis is empty.

• Following: The following-axis contains all nodes that appear after the context node in document order, excluding any descendants and excluding attribute nodes and namespace nodes.

• Preceding: The preceding-axis contains all nodes that appear before the context node in document order, excluding any ancestors and excluding attribute nodes and namespace nodes.

In addition to these axes, there are some more that consist of the union of two axes, such as descendant-or-self or ancestor-or-self.


Figure 2.5: The XPath Axes

2.5.1.5 Node Test

The node test is a simple test that determines which kinds of nodes are selected along a given axis. The node test is applied to each node on the axis. If the test succeeds, the node is kept; if not, the node is discarded. There are seven types of nodes:

• root nodes

• element nodes

• text nodes

• attribute nodes

• namespace nodes

• processing instruction nodes

• comment nodes


2.5.1.6 Predicates

The predicate is a further test to retain or eliminate nodes. It filters the node sequence into a new node sequence. Each node is evaluated in turn: if the predicate is true for a given node, the node is kept in the new node sequence; if it is false, the node is removed.
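A few predicates can be demonstrated with the XPath subset supported by ElementTree in the Python standard library. The sample data is an assumption for illustration. Since this subset has no numeric comparison, the last filter is written as ordinary Python code, which mimics what a predicate such as [price < 100] would do in a full XPath processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
books = ET.fromstring(
    "<books>"
    "<book year='1999'><title>A</title><price>120</price></book>"
    "<book year='2002'><title>B</title><price>80</price></book>"
    "<book year='2002'><title>C</title></book>"
    "</books>"
)


def titles(nodes):
    return [b.find("title").text for b in nodes]


by_attribute = titles(books.findall("book[@year='2002']"))  # attribute value test
with_price = titles(books.findall("book[price]"))           # existence of a child
second_title = books.find("book[2]/title").text             # positional predicate

# Keep a node only if the condition is true for it, drop it otherwise.
cheap = titles(b for b in books.findall("book[price]")
               if float(b.find("price").text) < 100)
```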

2.5.2 The XQuery Language

XQuery [Boag 1999] is the XML query language proposed by the W3C; it was first released in 2001, and the latest draft was released in 2007. It provides a mechanism to extract and manipulate data from XML data sources, including both databases and documents. The language is based on Quilt [Chamberlin 2000] and borrows many features from other query languages. XQuery is based on the data model used by XPath, in which an XML document is treated as a tree of nodes. The type system of XQuery models all values as sequences of items, which can be either nodes or atomic values. XQuery uses XPath expression syntax to address specific parts of an XML document, and it supports rich functionality such as joins, aggregations, and element construction.

2.5.2.1 FLWOR Expression

A frequently used expression is the FLWOR expression, which binds values to one or more variables and then uses these variables to construct the result. A FLWOR expression is constructed from FOR, LET, WHERE, ORDER BY, and RETURN clauses.

• For and Let Clauses: The for and the let clauses bind variables to sequences of XML nodes or atomic values.

• Where Clause: The items of the sequence are filtered by the conditions expressed in the optional where clause.

• Order By and Return Clauses: The remaining items can be ordered with the optional order by clause and are returned by the return clause as the final result.

Example 2.3 (FLWOR)

Find all books with a price less than 100 and return the titles in alphabetic order.

for $x in doc("bib.xml")/books/book
let $y := $x/price
where $y < 100
order by $x/title
return $x/title
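To make the role of each clause explicit, the query of Example 2.3 can be mimicked procedurally in Python. This is only an analogy under assumed sample data, not an XQuery implementation; each statement is annotated with the clause it corresponds to.

```python
import xml.etree.ElementTree as ET

# Hypothetical content standing in for bib.xml.
doc = ET.fromstring(
    "<books>"
    "<book><title>XQuery</title><price>45</price></book>"
    "<book><title>Databases</title><price>150</price></book>"
    "<book><title>Compression</title><price>60</price></book>"
    "</books>"
)

selected = []
for x in doc.findall("book"):           # for clause: iterate over the books
    y = float(x.find("price").text)     # let clause: bind the price
    if y < 100:                         # where clause: filter
        selected.append(x)

selected.sort(key=lambda x: x.find("title").text)      # order by clause
result = [x.find("title").text for x in selected]      # return clause
```

The declarative query states only what is wanted, while this procedural version also fixes the order in which the individual operations are carried out.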

2.6 XML and Databases

XML has emerged as the predominant mechanism for data storage and exchange, in particular over the World Wide Web. Due to its flexibility and ease of use, XML is nowadays used in a vast number of application areas; new information is increasingly being encoded as XML documents, and much data in existing databases is transformed into XML documents. As a result, XML and databases are converging and influencing each other in many significant ways. Because of the widespread use of XML and the large amounts of data that are represented in XML, it is important to provide a repository for XML documents that supports efficient management and storage of XML data. This need led the major commercial database system vendors (e.g., Microsoft, Oracle, and IBM) as well as many open source projects (e.g., MonetDB, MySQL) to add XML support to their systems. Meanwhile, native XML database management systems are emerging to store and manage XML documents. In the following, we go through some of the database approaches for storing XML data.

2.6.1 XML-Enabled Databases

Since traditional database management systems such as relational databases (RDBMS) or object-relational databases (ORDBMS) have existed for a long time, their technology is well tested and well developed. This maturity and the widespread deployment of (object) relational database technologies make them an obvious candidate for storing XML documents. Therefore the problem of storing and querying XML data in this type of database management system has been widely studied. These database management systems were extended and enriched with XML functionality to support storing and querying of XML data and are therefore called XML-enabled databases. In an XML-enabled database, the storage model is not XML but rather relational or object-oriented. These systems mainly provide two choices for storing and managing XML data: either to decompose XML data into relational instances, or to store XML data using Large Object (LOB) data types, namely Binary Large Objects (BLOB) and Character Large Objects (CLOB).

For the first choice, the XML data is mapped onto a relational data model. Different approaches propose different algorithms for shredding XML data so that it can be stored in relational tables. To evaluate XML queries, a popular approach is to translate them into relational queries and then to use a relational database system to evaluate them. The results of the relational queries are then converted to XML documents before the answer is returned to the user.

For the second choice, as in Oracle XML DB (OXD) and IBM DB2 XML Extender (DB2), the XML document is not shredded into relations but stored as a whole in its native form using the large object data types. In this case, the evaluation of XML queries is similar to XML query processing in a native XML database (see Section 2.6.2) and does not involve XML-to-relational query translation. It is possible to apply XML query languages like XPath.

Overall, the benefit of an XML-enabled database is that it reuses and exploits existing technologies in an RDBMS or in an ORDBMS. Another advantage of this kind of database is that the same database can be used to manage and query both XML data and the already existing data. However, XML-enabled databases have some disadvantages, as they were designed for the (object) relational model and not for the tree-based data model of XML. For example, in the first choice, the data transformation process can be costly and some important information may be lost. Another disadvantage is that reconstructing XML documents by composing nodes fragmented across multiple tables can impair performance. In addition, query times can be lengthy due to the many operations, such as joins, that are expensive to evaluate. The second choice also has a few disadvantages: the contents of XML documents are hard to index and retrieve because the whole XML document is stored together in a large object column, and even if only a small part of a document is updated, the whole document has to be rewritten.
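The first choice, shredding, can be illustrated with a deliberately simple mapping: every element becomes one row of a single table that records its parent. This so-called edge-table layout, the table and column names, and the sample document are assumptions made for this sketch; the systems named above use their own, more elaborate mappings. The sketch uses the sqlite3 module of the Python standard library and ignores attributes.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(xml_text, conn):
    """Store every element as one row (id, parent, tag, text) of a table."""
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, tag TEXT, text TEXT)"
    )
    next_id = 0

    def visit(elem, parent_id):
        nonlocal next_id
        next_id += 1
        my_id = next_id                      # ids follow document order
        conn.execute(
            "INSERT INTO node VALUES (?, ?, ?, ?)",
            (my_id, parent_id, elem.tag, (elem.text or "").strip()),
        )
        for child in elem:
            visit(child, my_id)

    visit(ET.fromstring(xml_text), None)


conn = sqlite3.connect(":memory:")
shred("<books><book><title>A</title></book><book><title>B</title></book></books>", conn)

# The path books/book/title is translated into a chain of self-joins.
rows = conn.execute(
    "SELECT t.text FROM node b "
    "JOIN node k ON k.parent = b.id "
    "JOIN node t ON t.parent = k.id "
    "WHERE b.tag = 'books' AND k.tag = 'book' AND t.tag = 'title' "
    "ORDER BY t.id"
).fetchall()
```

Each additional path step adds another join in this mapping, which hints at why long path queries and document reconstruction can become expensive on shredded data.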

2.6.1.1 MonetDB/XQuery

One of the successful projects that reuse mature relational data management infrastructure, such as relational query processing operators and optimization techniques, to create fast and scalable XML database technology is the Pathfinder project, a cooperation between the University of Konstanz (later TU Munich and the University of Tübingen), Germany, the University of Twente, and CWI Amsterdam, The Netherlands. The project has produced an open source system called MonetDB/XQuery [Boncz 2006a], which consists of the Pathfinder XQuery-to-Relational-Algebra compiler [Boncz 2006b, Boncz 2005, Rittinger 2005, Rittinger 2007, Teubner 2007] on top of the MonetDB relational database management system [Zukowski 2005]. In MonetDB/XQuery, the source XML documents are encoded using the XPath Accelerator relational mapping scheme [Grust 2002] (see next chapter), and Pathfinder uses the loop-lifting technique (Staircase Join) [Grust 2003b, Grust 2003a] to compile the user input into relational algebra and optimizes it for execution on MonetDB.

2.6.2 Native XML Databases

Because of the limitations associated with the use of an XML-enabled database and due to the growing popularity of data that is already formatted in XML, there is great demand for a special type of database that provides a more natural data model and query language for XML data, which is typically viewed as a tree. This new type of database is usually referred to as a native XML database management system (XDBMS). In these alternative systems, the XML data is persistently stored in its native format, which avoids the transformation cost. Also, the XML data can be queried using an XML query language such as XQuery or XPath. The promise of a native XML database is to provide better performance, as it is specifically designed to store, query and manipulate XML data. Today, there are already some commercial and open source implementations, as well as a number of research prototypes, that can natively handle XML data, such as:

2.6.2.1 Natix

Natix is a native XML database management system [Fiebig 2002, Brantner 2005] that is implemented at the University of Mannheim, Germany. It includes a storage manager suitable for XML data. The XML tree is split into small subtrees so that each of them fits into one disk page. The Natix architecture consists of three main components: a storage layer that manages all persistent data storage; a service layer that provides DBMS functionality; and a binding layer that consists of modules that map data and requests from other APIs to the Natix engine interface. Natix also supports the XPath and XQuery languages.

2.6.2.2 Timber

Timber is a scientific open source native XML database [Jagadish 2002, Wu 2008], developed and implemented at the University of Michigan. In Timber, an XML document is mapped to a tree, where each node is either an element, an attribute, or a text node. Internally, each node is given a node identifier that consists of three numbers: start label, end label and level label. Timber supports XML queries in XQuery, which are parsed into an algebraic operator tree using TAX, a tree algebra for XML [Jagadish 2001], which defines a suite of operators suited to manipulating trees.
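How a triple of start, end and level labels supports structural tests can be sketched as follows. The numbering used here (one counter that is incremented when a node is entered and again when it is left) is an assumption for illustration and not necessarily the exact numbering used by Timber; the tree representation as nested (name, children) tuples is likewise hypothetical.

```python
from typing import NamedTuple


class Label(NamedTuple):
    start: int
    end: int
    level: int


def assign_labels(tree):
    """tree is (name, [children]); returns a dict name -> Label."""
    labels = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        name, children = node
        counter += 1
        start = counter                  # number taken when the node is entered
        for child in children:
            visit(child, level + 1)
        counter += 1                     # number taken when the node is left
        labels[name] = Label(start, counter, level)

    visit(tree, 0)
    return labels


def is_ancestor(a: Label, d: Label) -> bool:
    # The interval of a descendant lies strictly inside that of its ancestor.
    return a.start < d.start and d.end < a.end


def is_parent(a: Label, d: Label) -> bool:
    return is_ancestor(a, d) and d.level == a.level + 1


labels = assign_labels(("a", [("b", [("c", [])]), ("d", [])]))
```

With these labels, ancestor-descendant and parent-child tests reduce to a few integer comparisons and never require access to the tree itself.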

2.6.2.3 BaseX

Another native XML database is BaseX [Grün 2006, Grün 2007, Holupirek 2009, Holupirek 2009], which has been developed at the University of Konstanz. It uses a storage format influenced by and derived from the XPath accelerator relational mapping scheme [Grust 2002]. BaseX supports XPath and XQuery and features a graphical user interface, facilitating visual access to the XML data and filesystem data stored in the database.

However, such systems are relatively new, and therefore many components of a native XML database must still be developed. For example, Timber and BaseX use variants of static labeling schemes, which are inefficient for updating XML. Natix, in turn, is good at navigation and updates but bad at complex twig matching. In addition, most native XML databases do not employ compression techniques, which are necessary for efficient storage of such data because XML documents are highly redundant.

Figure 2.6: The Structure of XML databases


2.7 Summary

This chapter provided background on the Extensible Markup Language (XML) and a brief overview of the more useful and relevant techniques that are related to XML and its use, including XML data modeling, XML parsing and XML query languages. I also presented the constraints on XML documents, namely the well-formedness and validity constraints. As has been explained above, XML has become a standard means for data representation, storage, and interchange. XML is now used by many applications, and the amount of data available in XML is rapidly increasing. It is therefore necessary to efficiently manage a large volume of XML documents using database technology. The use of database technology to manage XML data raises many important issues that must be addressed to handle XML efficiently. These issues cover a wide field; in my thesis, I examine the following:

• XML storage including compression, pooling and partitioning

• XML query processing

• XML updates

Chapter 3

Related Work

Contents

3.1 XML Labeling Schemes
3.1.1 Static Labeling Schemes
3.1.2 Prefix Labeling Schemes
3.2 XML Compression
3.2.1 XMill compression
3.2.2 XML skeleton compression
3.3 Summary

XML is becoming widely used for data exchange and manipulation. As a consequence, an increasing number of XML documents need to be stored and processed. There have been many proposals to build native XML stores and query processors. However, two common strategies are available to provide robust storage and efficient query processing. In this chapter, we present these strategies and provide an overview of previous work that is related to the topic of this thesis.

3.1 XML Labeling Schemes

Since the logical structure of an XML document is an ordered tree consisting of tree nodes that represent elements, attributes and text data, establishing a relationship between nodes, such as a parent-child or an ancestor-descendant relationship, is essential for processing the structural part of queries. Therefore, tree navigation is essential to answer XML queries. However, standard tree navigation (such as depth-first or breadth-first traversal) is not sufficient for efficient evaluation of XML queries, especially the evaluation of the ancestor and descendant axes. Many proposals have been made for this purpose, the most common of which are node labeling schemes. The use of labeling schemes to encode XML nodes is a common and beneficial technique to accelerate the processing of XML queries and, in general, to facilitate XML processing when XML data is stored in databases. The main idea of labeling schemes is to gather structural information from XML documents by assigning unique codes to the nodes in the tree and storing them in such a way that the hierarchical structure of the XML documents is preserved. In many cases, there is no need to access the actual documents. The labeling schemes allow a quick identification of structural relationships between nodes by comparing their labels in constant time. This identification plays a crucial role in efficient XML processing. Numerous labeling schemes have been proposed recently to represent the XML tree structure and to support querying and updating XML documents. In general, these schemes may be classified into static labeling schemes [Li 2001, Grust 2002, Agrawal 1989, Zhang 2001], which use a fixed-length encoding, and prefix labeling schemes [Abiteboul 2001, Cohen 2002, Tatarinov 2002, Böhme 2004, Härder 2007], also called dynamic labeling schemes, which use a variable-length encoding.

Figure 3.1: The preorder ranks pre (left numbers) and the postorder ranks post (right numbers) for the tree of a simple XML example.

3.1.1 Static Labeling Schemes

In static labeling schemes, fixed-length codes are used to represent the XML tree. These labeling schemes were originally designed to label static XML documents. Although this type of labeling scheme is effective in generating concise labels, it lacks support for in-document updates of dynamically changing XML documents. If a new node has to be inserted, relabeling of many nodes is inevitable. The best-known member of the family of static labeling schemes is the pre/post labeling scheme.

Figure 3.2: The pre/post plane illustrates XPath axis conditions for the four major XPath axes ancestor, descendant, following, and preceding as seen from node f.

3.1.1.1 The Pre/Post Labeling Scheme

The pre/post labeling scheme, also known as the XPath accelerator encoding, is described by Grust [Grust 2002]: each node in the XML tree is labeled by a pair of unique integer values consisting of its pre- and post-order traversal sequence numbers. The encoding can be generated by counting the opening tags in the document as pre values and the corresponding closing tags as post values during a single sequential scan over the document, e.g., using a SAX parser. Figure 3.1 illustrates this labeling for a small sample tree. With this labeling, the nodes can be plotted in a two-dimensional plane using the pre and post values, as shown in Figure 3.2 (the idea of this illustration is taken from [Teubner 2006]). The context node divides the pre/post plane into four disjoint regions, which correspond to the major XPath axes of the context node as follows:

• All nodes in the top-left quadrant of the plane are the ancestors of the context node.

• All nodes in the top-right quadrant of the plane are the following nodes of the context node.

• All nodes in the bottom-left quadrant of the plane are the preceding nodes of the context node.

• All nodes in the bottom-right quadrant of the plane are the descendant nodes of the context node.

Although the pre/post labeling scheme supports XML query processing efficiently and is effective in generating concise labels, it cannot handle updates efficiently. Insertion or deletion of nodes requires relabeling of existing nodes.
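The single-scan labeling and the four regions can be reproduced with a small SAX handler from the Python standard library. The class and function names, the choice to start both counters at 0, the restriction to element nodes, and the sample document with unique element names are assumptions made for this sketch.

```python
import xml.sax


class PrePostLabeler(xml.sax.ContentHandler):
    """Counts opening tags as pre values and closing tags as post values."""

    def __init__(self):
        super().__init__()
        self.pre = 0
        self.post = 0
        self.open_elements = []   # stack of (name, pre) for unclosed elements
        self.labels = {}          # name -> (pre, post)

    def startElement(self, name, attrs):
        self.open_elements.append((name, self.pre))
        self.pre += 1

    def endElement(self, name):
        name, pre = self.open_elements.pop()
        self.labels[name] = (pre, self.post)
        self.post += 1


def axis(context, node):
    """Classify `node` relative to `context` by comparing (pre, post) pairs."""
    (cpre, cpost), (npre, npost) = context, node
    if npre < cpre and npost > cpost:
        return "ancestor"
    if npre > cpre and npost < cpost:
        return "descendant"
    if npre > cpre and npost > cpost:
        return "following"
    if npre < cpre and npost < cpost:
        return "preceding"
    return "self"


labeler = PrePostLabeler()
xml.sax.parseString(b"<a><b><c/><d/></b><e><f/></e></a>", labeler)
labels = labeler.labels
```

Each axis test needs only two integer comparisons. Inserting a new element, however, shifts the pre and post values of all nodes that follow it, which is the update problem mentioned above.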

3.1.2 Prefix Labeling Schemes

In prefix labeling schemes, also known as path-based labeling schemes, each node is labeled by a path label, which is the concatenation of all the labels of nodes appearing on its incoming path from the root node. A path label can be replaced simply with tag IDs or other labels. The label of a parent node is a prefix of the labels of all of its descendants. Thus, checking for an ancestor-descendant relationship between two nodes is equivalent to determining whether the label of the first node is a prefix of the label of the other node. In schemes of this type, the size of labels is variable and depends on the tree depth. Such labels can easily be obtained, for instance by a traversal in document order, and they can be computed in time linear in the number of nodes in the tree. There are many proposals of this type. The simplest and most common one is the Dewey labeling scheme proposed in [Tatarinov 2002].

Figure 3.3: A Dewey Encoding (Example)


3.1.2.1 The Dewey labeling scheme

In the Dewey labeling scheme, which is based on the Dewey Decimal Classification (DDC) developed for general knowledge classification, the nodes of the XML tree are given Dewey labels that represent the paths from the root to each node in the XML tree. The Dewey label of a node is a combination of its parent's Dewey label and an integer that reflects the position of this node among its siblings. Hence each Dewey label consists of integer divisions separated by dots, with the exception of the Dewey label of the root node, which consists of only a single division because it has no parent. Figure 3.3 shows an XML tree labeled using the Dewey labeling scheme. Like other prefix labeling schemes, Dewey uniquely identifies the nodes in the XML tree and preserves the structural relationships between them. Because the label of the parent node is part of the label of each child node, the structural relationships (parent-child, ancestor-descendant and sibling) between two nodes can be determined simply by comparing the Dewey labels of these nodes. The level of a node in the tree can also be determined from its Dewey label alone.
However, this labeling scheme has some drawbacks in terms of the extra space required to store the paths from the root to each node. Dewey labeling also suffers from the problem of dynamic updating: upon insertion or deletion of nodes in an XML tree labeled by Dewey, many nodes have to be relabeled. New labeling schemes have emerged as a consequence of the appearance of dynamic XML documents. One of these is ORDPATH, which is an enhanced version of the Dewey labeling scheme and preserves structural fidelity by reserving even numbers for future insertions. Hence it allows insertion of nodes anywhere in the XML tree without the need for subsequent relabeling of existing nodes. It also uses a prefix-free binary encoding to circumvent the problem of dots and thus helps to reduce the size of the labels. In Chapter 6, we shall give a detailed overview of ORDPATH and its prefix-free binary encoding.
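The label comparisons described above can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for this illustration; labels are assumed to be dot-separated strings such as "1.2.3". Note that the divisions are compared as integers, not as raw strings, so that "1.2" is not mistaken for an ancestor of "1.20.1".

```python
def parse(label: str) -> tuple[int, ...]:
    """Split a Dewey label such as "1.2.3" into its integer divisions."""
    return tuple(int(part) for part in label.split("."))


def level(label: str) -> int:
    """The level of a node equals the number of divisions in its label."""
    return len(parse(label))


def is_ancestor(a: str, d: str) -> bool:
    """a is an ancestor of d if a's divisions are a proper prefix of d's."""
    pa, pd = parse(a), parse(d)
    return len(pa) < len(pd) and pd[:len(pa)] == pa


def is_parent(p: str, c: str) -> bool:
    """p is the parent of c if c's label is p's label plus one division."""
    return parse(c)[:-1] == parse(p)


def is_sibling(a: str, b: str) -> bool:
    """Two distinct nodes are siblings if they share the same parent prefix."""
    pa, pb = parse(a), parse(b)
    return pa != pb and len(pa) == len(pb) and pa[:-1] == pb[:-1]
```

All four checks inspect only the two labels involved, which is why no access to the document itself is needed.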

3.2 XML Compression

The power of XML comes from the fact that it provides self-describing capabilities. XML repeatedly uses tags to describe the data itself. At the same time, this self-describing nature makes XML verbose, with the result that its storage requirements are often inflated and can be excessive. In addition, the increased size leads to increased costs for data manipulation. The inherent verbosity of XML raises doubts about its efficiency as a standard data format for data exchange over the internet. Therefore compression of XML documents has become an increasingly important research issue, and it seems natural to use compression techniques to increase the efficiency of storing and querying XML data. Using general-purpose compression tools was one of the first ideas in XML compression. These tools can reduce the size of XML documents. However, since XML is used not only for transmitting or storing data but also for managing it, we need a specialized compression method that takes advantage of the XML data structure and compresses XML for the purpose of both exchange and management of data. There are many approaches aimed at the compression of XML. In this section, we describe two of the most significant developments in this field. The XMill compressor [Liefke 2000] is one of the first specialized XML compressors. XML Skeleton compression [Buneman 2003, Buneman 2005] is an efficient strategy to evaluate path queries on compressed XML documents.

3.2.1 XMill compression

The XMill compressor [Liefke 2000] was designed to take advantage of the XML data structure. XMill compresses the XML structure separately from the data. Separating the XML structure from the data is the most important idea of XMill, and it is also applied in other XML compressors. This separation increases the data similarity within each part and enables better compression rates to be achieved. The separation is reminiscent of earlier vertical partitioning techniques for relational data, which divide a table into multiple tables defined over subsets of the attributes [Batory 1979]. This partitioning typically lets queries scan less data and thus improves query performance [Ailamaki 2001]. However, the separation in XMill is used for XML compression only and does not allow direct querying of the compressed data. Therefore the complete compressed XML document must be decompressed before it can be used.
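The separation idea can be sketched in a few lines of Python. This is only an illustration of the principle, not XMill's actual container layout or file format: the token encoding of the structure stream, the grouping of values by tag name, and the function names are all assumptions made for this example, and attributes and mixed content are ignored.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def separate(xml_text):
    """Split a document into a structure stream and per-tag value containers."""
    structure = []                    # open-tag, value-placeholder and close tokens
    containers = defaultdict(list)    # tag name -> text values in document order

    def visit(elem):
        structure.append("<" + elem.tag)
        text = (elem.text or "").strip()
        if text:
            structure.append("#")     # placeholder: the value lives in a container
            containers[elem.tag].append(text)
        for child in elem:
            visit(child)
        structure.append(">")

    visit(ET.fromstring(xml_text))
    return structure, dict(containers)


def compress(structure, containers):
    """Compress the structure and every container independently with zlib."""
    packed_structure = zlib.compress(" ".join(structure).encode("utf-8"))
    packed_values = {
        tag: zlib.compress("\n".join(values).encode("utf-8"))
        for tag, values in containers.items()
    }
    return packed_structure, packed_values
```

Because each container holds values of one kind only, a general-purpose compressor sees more homogeneous input than it would on the interleaved document. The sketch also shows the limitation noted above: nothing can be queried until the relevant streams are decompressed.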

3.2.2 XML skeleton compression

XML Skeleton compression was proposed in [Buneman 2003] and extends the separation idea of XMill to support efficient querying of compressed data. This approach removes the redundancy of the document structure by sharing common subtrees and by replacing identical consecutive branches with one branch and a multiplicity annotation. It works most effectively if the XML documents exhibit a highly regular structure, but it does not compress well in cases where the consecutive branches of trees do not have exactly the same substructure (e.g. semistructured data). The work proposed in [Buneman 2005] extended the XML skeleton compression technique to facilitate the processing of XQuery based on the vectorization approach, which is an extreme form of vertical partitioning. The idea is to partition the document into a compressed skeleton and a set of vectors. For each distinct path from the document root to a text node, a vector named after that path is created; its tuples contain the data values. In this approach main-memory data structures are used for the compressed skeleton, while external-memory data structures hold the text contents. This partitioning guarantees that data vectors are accessed only on demand. However, these techniques have some drawbacks. First, the compressed skeleton is sometimes still too large to fit into main memory. Second, the compressed skeleton is always scanned in its entirety to identify the relevant data vectors.

Figure 3.4: An XML tree, its skeleton and storage
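The two ingredients described above, multiplicity annotations for identical consecutive branches and one data vector per root-to-text path, can be sketched as follows. This is a simplified illustration under our own assumptions: the nested-tuple representation and the function names are hypothetical, and the sharing of common subtrees that are not adjacent is not modeled.

```python
import xml.etree.ElementTree as ET


def skeleton(elem):
    """Return (tag, runs), where runs is a tuple of (child_skeleton, multiplicity)."""
    runs = []
    for child in elem:
        sub = skeleton(child)
        if runs and runs[-1][0] == sub:
            # identical consecutive branch: bump the multiplicity annotation
            runs[-1] = (sub, runs[-1][1] + 1)
        else:
            runs.append((sub, 1))
    return (elem.tag, tuple(runs))


def vectors(root):
    """Map every distinct root-to-text path to the list of its data values."""
    out = {}

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        text = (elem.text or "").strip()
        if text:
            out.setdefault(path, []).append(text)
        for child in elem:
            visit(child, path)

    visit(root, "")
    return out
```

Two adjacent branches are merged only if their entire substructures are equal, which is why the technique loses effectiveness on irregular documents.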


3.3 Summary

Two common strategies are available to provide robust storage and efficient query processing. The first is based on labeling schemes, which are widely used in XML query processing. The labels represent the relationships between nodes and play a crucial role in efficient query processing. However, some labeling schemes either do not support updates to XML documents or need a large amount of storage. The second strategy focuses on decreasing the cost of XML storage through compression techniques. While a naive representation of XML documents leads to excessive redundancy, the compression of XML documents not only reduces the amount of disk space occupied by the data, but also enhances query processing speed by saving scan time. Unfortunately, this strategy is also problematic, as it may not discover all the redundancy present in the structure of XML and thus often does not yield the best compression result. Another significant drawback of such methods is that they do not support direct updates or direct querying, i.e., querying a compressed document without decompressing it. Therefore, in this thesis, we propose three effective methods for storing and processing XML data, which combine the strengths of labeling and compression technologies to exploit their benefits and avoid their drawbacks, producing a level of performance that is better than that of using labeling and compression independently.

Chapter 4

SCQX: Compacting, Storing and Querying XML Documents Using a Static Labeling Scheme

Contents
4.1 The Level-Order Labeling Scheme used in SCQX
4.2 Compaction Principles of SCQX
4.3 The Storage Model of SCQX
4.3.1 Storage structure
4.3.2 Index methods
4.3.3 Query Evaluation
4.4 A Real-Life XML Example
4.5 Summary

XML has a tree structure; thus, to evaluate XML queries efficiently, it is important to quickly determine the structural relationship between any pair of nodes. Structural relationships can be determined by associating with each node a label that follows a particular numbering scheme. Labeling schemes thus play an important role in any XML database, because they allow the requested data to be retrieved without scanning all the data. On the other hand, the verbosity and redundancy of XML raise doubts about its efficiency as a data format for exchanging and storing data. Therefore compaction is an important issue for XML. Compaction of XML documents not only reduces the amount of disk space occupied by the data, but also decreases the overall query processing time. The idea in this chapter is to combine the advantages of both areas. Specifically, we take advantage of XML structural peculiarities to reduce storage space requirements and to improve the efficiency of XML query processing using a labeling scheme called the Level-Order labeling scheme.


4.1 The Level-Order Labeling Scheme used in SCQX

As explained in the last chapter, an XML document is treated as an XML tree, a labeled, partially ordered tree in which each node corresponds to an element, attribute, or piece of text in the document. Labeling schemes assign a unique node ID to every node in the document tree, with the main purposes of accelerating querying by performing operations on labels instead of accessing the XML document and of providing a compact storage representation of nodes. In the Level-Order scheme, each node of the XML tree is assigned a unique integer while traversing the document tree in level order (breadth-first traversal). The nodes on level i are visited from left to right before the nodes on level i + 1 are visited from left to right.
In our encoding scheme, each element node of the XML structure is assigned two numbers as follows. The root node is always labeled with the integer value 1, and its parent ID is 0. In a level-order (breadth-first) traversal of the XML document, an element node is visited and assigned its first number; the second number is the ID of its parent element node. In addition, each text node¹ is assigned a Text-ID, which is the ID of its parent element node. We use this encoding scheme because it has important properties:

• In contrast to other schemes, it supports all XPath axes, especially the child, parent and sibling axes.

• It has a fixed size.

• It allows a faithful representation of the XML document after our compaction method, which is described in the next section.

The tree structure presented in Figure 4.1 is labeled using the Level-Order labeling scheme: the root node, being the first node of the tree, is labeled with the integer 1, and its parent ID is 0. We then continue counting to label the child nodes of the root, proceed with the next level of the XML tree, and so on. The Level-Order labels of the XML structure are shown inside the nodes. (The second component, the parent ID, is not shown in the figure.)
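The labeling procedure can be sketched with a breadth-first traversal in Python. This is a minimal illustration, not the implementation used in SCQX: the function name and the sample bibliography document (written in the spirit of Figure 4.1) are assumptions, and the document is parsed into memory here rather than streamed.

```python
import xml.etree.ElementTree as ET
from collections import deque


def level_order_labels(root):
    """Return (elements, texts) for an element tree.

    elements holds (node_id, parent_id, tag) in level order;
    texts holds (text_id, text), where text_id is the ID of the parent element.
    """
    elements, texts = [], []
    queue = deque([(root, 0)])          # the root's parent ID is 0
    next_id = 1
    while queue:
        elem, parent_id = queue.popleft()
        node_id = next_id
        next_id += 1
        elements.append((node_id, parent_id, elem.tag))
        text = (elem.text or "").strip()
        if text:
            texts.append((node_id, text))
        for child in elem:              # children are queued left to right
            queue.append((child, node_id))
    return elements, texts


# Hypothetical sample document, not taken from the thesis figures.
DOC = """<bib>
  <book><title>T1</title><author>A1</author></book>
  <book><title>T2</title><author>A2</author></book>
  <book><title>T3</title><author>A3</author></book>
</bib>"""

elements, texts = level_order_labels(ET.fromstring(DOC))
```

On this sample, the three book elements receive the IDs 2, 3 and 4, and the title elements of the first two books receive the IDs 5 and 7. Siblings always receive consecutive IDs, a property that the compaction step relies on.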

¹ For simplicity, we do not consider other components such as attributes, comments, namespaces, and processing instructions, as they can be treated like text nodes.


Figure 4.1: XML document structure with level-order IDs (shown inside nodes)

4.2 Compaction Principles of SCQX

This section presents how we can efficiently compact an XML document. Most database queries tend to focus on structural aspects, with only occasional access to character contents. At the same time, most XML structures are highly repetitive; the same tags appear again and again. The inherent verbosity of such XML structures has led to the need for an efficient compaction algorithm that reduces the storage space required for XML documents and also improves query performance. Towards these aims, an efficient algorithm for the compaction of XML documents should meet the following requirements:

• The output of the compaction algorithm can be queried or processed without prior decompaction of the complete compacted XML document.

• The compaction algorithm should support efficient query evaluation for an expressive query language such as XPath or XQuery.

• The compaction algorithm should compact XML documents correctly.

• The compaction algorithm should generally achieve a good compaction ratio and compaction time.

• The compaction algorithm should support the compaction of huge XML documents.

• The compaction algorithm should be usable by applications that rely on a secondary storage system.

In order to meet these requirements, we have developed an XML compaction algorithm. Its main goal is not only to reduce the required storage space, but also to improve query performance. It is based on the following main principles:

• The first principle, similar to XMill (discussed in Chapter 3), is the separation of the XML structure from its data values. Because separation increases the data similarity, and because there is more redundancy in the structure of XML documents than in their text content, separating the XML structure makes compaction of this portion considerably more efficient and thereby improves compaction rates. This separation also typically lets queries scan less data and thus improves query processing performance.

• The second principle of our approach is the use of the Level-Order labeling scheme, which avoids unnecessary scanning of structures and maintains both the connection between structural information and value information and the relationships between nodes after the compaction.

• The third principle of our approach is the exploitation of repetitive consecutive tags in the structure of XML documents.

Based on the above principles, we want to reduce the size of the XML structure in such a way that the result can be used for efficient query processing. Our approach works in the following manner:
The input XML document is labeled using the Level-Order labeling scheme and is scanned using, e.g., a SAX parser. The data values are stored in a special storage structure, and the XML tags are compacted using Algorithm 4.1, which exploits the repetition of similar sibling nodes in the XML structure, where "similar" means elements with the same tag name. These similar nodes are replaced with one compacted node. This compacted node is assigned the label of the first of the similar nodes, and a cardinality counter is added to it, reflecting the number of repetitions. The other similar sibling nodes are not stored in the storage model. Each node in the compacted structure thus represents one node or a set of nodes in the uncompacted tree. It is worth noting that our algorithm works while parsing the tree, without extra cost, and that it works even if the consecutive branches of the tree do not have exactly the same substructure. Another interesting property of this technique is that it does not rely on a DTD or XML schema. Returning to Figure 4.1, consider the three book elements. These three nodes are similar, because they are sibling element nodes and have the same tag name. Therefore, we can replace them by the first book element, which has ID value "2" and parent ID "1", and assign "3" to this element node as its cardinality counter, reflecting the repetition; the same is done for all other element nodes of the given XML structure. Any duplicate information is removed by this process. Figure 4.2 displays the compacted structure of the XML document of Figure 4.1, where the crossed-out nodes will not be stored.

Figure 4.2: Compacted structure with unique numbers of elements and cardinality counters (in parentheses).

Algorithm 4.1 The SCQX-Algorithm for Compacting an XML Tree

if SAX-Event is a start-tag event then
    create a new node N for the current tag
    S.push(N)    { N is pushed onto stack S; S is an array keeping the tag information }
end if
if SAX-Event is an end-tag event then
    N := S.pop()    { N is popped from stack S }
    if there are more entries on the stack then
        S.top().add(N)    { N is added as a child of the node on top of stack S }
    end if
    if N has child nodes then
        Compaction(N)    { the child nodes of N are compacted if they are similar; a cardinality counter is assigned to each stored child node of N }
    end if
end if

PROCEDURE Compaction(N)
init := the first child of node N    { init is the first node of the current run of similar siblings }
card(init) := 1    { card is the cardinality counter of a child node }
for i = 2 to the number of child nodes of node N do
    if init is similar to child(i) then
        card(init)++
        mark child(i) as not stored    { child(i) is represented by init }
    else
        init := child(i)
        card(init) := 1
    end if
end for

As we shall show later in the experimental results, our approach can greatly reduce the storage requirements, and at the same time it can effectively process and optimize the different types of queries that use path expressions to navigate the structure of the data. Several papers [Luo 2009, Chen 2001, Freire 2002, Polyzotis 2004a, Polyzotis 2004b, Zhang 2006] study XML query optimization based on the selectivity estimation of path expressions. The majority of this work has focused on defining a concise synopsis structure for summarizing the XML document. However, all these approaches suffer from poor estimation accuracy. In contrast, our method defines an accurate compact summary structure with low storage requirements and without loss of any information. In addition, it can effectively handle queries such as linear path queries and twig queries. Because of the importance of twig queries, we provide a simple query example to show the significance of our compaction algorithm for this type of query.
Consider, for example, a part of the Hamlet play, as in Figure 4.7. Figure 4.8(a) illustrates the tree representation of the XML document, and Figure 4.8(b) shows its compacted structure. As a twig query example, consider the query //SCENE[/SPEECH]/STAGEDIR, which retrieves all STAGEDIR nodes that have a "SCENE" parent node and at least one "SPEECH" sibling node. The importance of our compaction appears in such queries, in which the number of occurrences of the sibling nodes is immaterial. For instance, if this query is evaluated on the original tree, it has an answer size of 1×4 = 4, while the same twig query evaluated on our compacted tree has an answer size of 1.
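The effect of Algorithm 4.1 can be reproduced with a short Python sketch. It is not the SAX-based implementation: for brevity it assumes that the document has already been parsed into an in-memory tree, and it performs labeling and compaction in one breadth-first pass. The function name and record layout are hypothetical. We also assume, following the example with the nodes 5 and 7 in Section 4.3.2.1, that the children of a merged sibling keep their own records and original parent IDs.

```python
import xml.etree.ElementTree as ET
from collections import deque


def compact(root):
    """Return compacted records (node_id, parent_id, tag, cardinality).

    Nodes are numbered in level order; a run of consecutive siblings with the
    same tag is represented by its first node plus a cardinality counter.
    """
    records = [(1, 0, root.tag, 1)]
    queue = deque([(root, 1)])
    next_id = 2
    while queue:
        elem, elem_id = queue.popleft()
        run_tag = None                      # tag of the current run of siblings
        for child in elem:
            child_id = next_id              # every node still consumes an ID
            next_id += 1
            if child.tag == run_tag:
                # similar sibling: count it on the first node of the run
                node_id, parent_id, tag, card = records[-1]
                records[-1] = (node_id, parent_id, tag, card + 1)
            else:
                run_tag = child.tag
                records.append((child_id, elem_id, child.tag, 1))
            queue.append((child, child_id))
    return records
```

Since siblings receive consecutive level-order IDs, a record with label l and cardinality c stands for the nodes l, l + 1, ..., l + c - 1, so the omitted siblings remain recoverable. For a hypothetical scene with one STAGEDIR followed by four SPEECH elements, the four speeches collapse into a single record with cardinality 4.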

4.3 The Storage Model of SCQX

The storage of XML documents plays an important role, because efficient query processing is critically dependent on the chosen storage structures [Florescu 1999]. To support efficient document retrieval and querying, a storage organization should meet the following requirements:

• The implemented storage solution should be as robust as possible.

• The XML structural information should be stored separately from the text contents, so that node operations can be performed independently.

• The storage model should hold enough auxiliary information, e.g., indexes on tag names and values, to speed up query processing.

• The storage model should be able to store the XML data and indexes persistently.

• The storage model should have minimal storage space requirements.

In this section, we describe the storage model of SCQX for representing compacted XML documents, which consists of three parts: storage structure, index methods and query evaluation.


4.3.1 Storage structure

SCQX's storage structure contains three tables: the Element table, the Value table and the Path table.

Definition 4.1 (Level-Order Label)
The Level-Order label of an element node $u_i$ of the structure $T$ is a unique integer $\mathrm{IDU}(u_i)$ obtained by the level-order labeling scheme. Node $u_0$ is always the root node of the XML structure.

Definition 4.2 (Text ID)
The text identifier is the label of the text contents of element node $u_i$, which is the Level-Order label of the parent of $u_i$.

Definition 4.3 (Path)
Let $P$ be the set of all paths in an XML structure. Each path $\rho \in P$ is a sequence $\mathrm{IDU}(u_0), \mathrm{IDU}(u_1), \ldots, \mathrm{IDU}(u_{\tau_p(\rho)-1})$, where $\tau_p(\rho)$ is the length of the path.

4.3.1.1 Element table

The Element table is a collection of records that represent all element nodes of the compacted XML structure. For each element node of the compacted structure, a record stores the tag name, its Level-Order label, the Level-Order label of its parent, and its cardinality counter:

Figure 4.3: Element Table of SCQX's storage structure

4.3.1.2 Value table

The Value table is a collection of records that represent all text nodes. Each record is composed of the text contents and the text ID of a text node:

Figure 4.4: Value Table of SCQX's storage structure

4.3.1.3 Path table

This table maintains all distinct paths of the XML structure, where a path is the sequence of elements from the root node to any element node:

Figure 4.5: Path Table of SCQX's storage structure
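As a rough illustration of the three tables, the following Python sketch models their records as data classes and fills them by hand for a hypothetical three-book bibliography. The field names, the column layout of the Path table and the sample rows are assumptions; the actual layouts are those shown in Figures 4.3 to 4.5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementRecord:
    label: int      # Level-Order label of the (first) element
    parent: int     # Level-Order label of its parent, 0 for the root
    tag: str
    card: int       # cardinality counter

    def represented_labels(self):
        """Labels of all sibling nodes that this compacted record stands for."""
        return range(self.label, self.label + self.card)


@dataclass(frozen=True)
class ValueRecord:
    text_id: int    # Level-Order label of the parent element
    text: str


@dataclass(frozen=True)
class PathRecord:
    path_id: int
    path: str


# Hand-filled sample rows for a hypothetical document with three books.
ELEMENT_TABLE = [
    ElementRecord(1, 0, "bib", 1),
    ElementRecord(2, 1, "book", 3),
    ElementRecord(5, 2, "title", 1), ElementRecord(6, 2, "author", 1),
    ElementRecord(7, 3, "title", 1), ElementRecord(8, 3, "author", 1),
    ElementRecord(9, 4, "title", 1), ElementRecord(10, 4, "author", 1),
]
VALUE_TABLE = [
    ValueRecord(5, "T1"), ValueRecord(6, "A1"),
    ValueRecord(7, "T2"), ValueRecord(8, "A2"),
    ValueRecord(9, "T3"), ValueRecord(10, "A3"),
]
PATH_TABLE = [
    PathRecord(1, "/bib"), PathRecord(2, "/bib/book"),
    PathRecord(3, "/bib/book/title"), PathRecord(4, "/bib/book/author"),
]
```

Eight element records suffice for the ten element nodes of the sample, because the single book record represents the labels 2, 3 and 4.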

4.3.2 Index methods

4.3.2.1 Path index

For the Path table, we create a path index. This index can be implemented by a hash structure in which the paths are indexed; the index entries are referenced by integer values (so-called pathIDs). The pathID is the same for multiple element nodes with the same path. For example, in Figure 4.2, the element nodes with ID = 5 and ID = 7 refer to two different title nodes, but the paths leading to these nodes are both expressed as "/bib/book/title". As such, they have the same pathID value 3 (see Table 4.1). To achieve better query performance, the path index is extended by references to the Level-Order labels in the Element table, resulting in an inverted list.
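A minimal sketch of such a path index, using Python dictionaries as the hash structure, is given below. The function name and the tuple layout (label, parent, tag, cardinality) are assumptions for this illustration. Because a parent may be a sibling that was merged away, the sketch first expands every record to all labels it represents.

```python
def build_path_index(records):
    """Build (path_ids, postings) from compacted element records.

    records: (label, parent, tag, card) tuples in level order.
    path_ids maps a path string to its pathID;
    postings maps a pathID to the stored Level-Order labels with that path.
    """
    # label -> (parent, tag) for every original node, including merged siblings
    info = {}
    for label, parent, tag, card in records:
        for k in range(card):
            info[label + k] = (parent, tag)

    def path_of(label):
        parts = []
        while label != 0:               # walk up until the root's parent ID 0
            parent, tag = info[label]
            parts.append(tag)
            label = parent
        return "/" + "/".join(reversed(parts))

    path_ids, postings = {}, {}
    for label, parent, tag, card in records:
        path = path_of(label)
        pid = path_ids.setdefault(path, len(path_ids) + 1)
        postings.setdefault(pid, []).append(label)
    return path_ids, postings


# Hypothetical compacted records for a bibliography with three books.
RECORDS = [
    (1, 0, "bib", 1), (2, 1, "book", 3),
    (5, 2, "title", 1), (6, 2, "author", 1),
    (7, 3, "title", 1), (8, 3, "author", 1),
    (9, 4, "title", 1), (10, 4, "author", 1),
]
path_ids, postings = build_path_index(RECORDS)
```

On these sample records the sketch assigns the same pathIDs as Table 4.1, and the inverted list for pathID 3 contains the labels 5, 7 and 9.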

4.3.2.2 Tag index

The tag names are indexed in a hash structure, where the index entries are referenced by integer values.


Table 4.1: Path Index for the example XML document

pathID   Path Expr
1        /bib
2        /bib/book
3        /bib/book/title
4        /bib/book/author

4.3.2.3 Value index

The text contents are indexed in a hash structure, where the index entries are referenced by integer values. To speed up predicate queries, this index is enhanced with an inverted list, a popular data structure for fast information retrieval, pointing to the Level-Order labels in the Element table. The inverted list for the data values of the XML example in Figure 4.1 is shown in Figure 4.6.

Figure 4.6: The inverted value index for the data values of the XML example in Figure 4.1
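An inverted value index of this kind can be sketched with a Python dictionary that maps each distinct text value to the labels at which it occurs. The function name and the sample values are hypothetical and are not taken from Figure 4.6.

```python
from collections import defaultdict


def build_value_index(value_table):
    """Map each distinct text value to the list of text IDs where it occurs.

    value_table: iterable of (text_id, text) pairs, where text_id is the
    Level-Order label of the element that contains the text.
    """
    index = defaultdict(list)
    for text_id, text in value_table:
        index[text].append(text_id)
    return dict(index)


# Hypothetical value table: two books share the same author.
VALUES = [(5, "T1"), (6, "Smith"), (7, "T2"), (8, "Smith")]
value_index = build_value_index(VALUES)
```

A lookup such as value_index.get("Smith") then returns the labels of all elements whose text equals the search string, without scanning the Value table.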

4.3.3 Query Evaluation

In order to evaluate basic XPath queries, with all XPath axes, node tests and basic text predicates (with textual, numeric, and positional matches), an XPath parser has been implemented. We have chosen different algorithms for queries with and without positional predicates, and we perform some optimization steps to simplify and reformulate the XPath query. SCQX enables the compacted XML structure to be queried without decompaction, except when required to show the results.

A full path or simple queries such as "/" or "//" can easily be answered by performing some matching on the path index. For example, suppose we have an XML document in which A, B, C and D are elements, and we wish to evaluate a full path such as Q1: /A/B/C/D. We can evaluate this path by performing an exact-match lookup for it in the path index. If it is found, we return its pathID and the list of Level-Order labels referenced by this pathID.

For a path expression containing the //-axis, such as Q2: //B/C, answering Q2 is similar to Q1, except that it requires a suffix match for B/C in the path index. This yields a set of pathIDs and the lists of Level-Order labels referenced by these pathIDs.

The situation is similar for path expressions containing a wildcard or a //-axis in the middle of the path expression, such as Q3: /A/*/C or Q4: /A/B//D. In the latter case, an exact match on the prefix /A/B combined with a suffix match for D yields a set of pathIDs.
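The matching of Q1 to Q4 against the path index can be sketched as follows. Instead of separate exact, prefix and suffix matchers, this illustration translates a predicate-free path expression into one regular expression over the stored paths, which covers all four cases; this is a simplification of our own, and the function names and the sample index are hypothetical.

```python
import re


def xpath_to_regex(query):
    """Translate a predicate-free path such as /A/*/C or /A/B//D into a regex."""
    pattern = ""
    for step in re.findall(r"//?[^/]+", query):
        name = step.lstrip("/")
        name_re = "[^/]+" if name == "*" else re.escape(name)
        if step.startswith("//"):
            # descendant step: any number of intermediate elements
            pattern += "(?:/[^/]+)*/" + name_re
        else:
            pattern += "/" + name_re
    return re.compile(pattern)


def match_paths(path_ids, query):
    """Return the sorted pathIDs whose stored path matches the query."""
    regex = xpath_to_regex(query)
    return sorted(pid for path, pid in path_ids.items() if regex.fullmatch(path))


# Hypothetical path index: path string -> pathID.
PATHS = {
    "/A": 1, "/A/B": 2, "/A/B/C": 3, "/A/B/C/D": 4,
    "/A/X/C": 5, "/A/B/D": 6, "/AB/C": 7,
}
```

The resulting pathIDs are then used to fetch the referenced Level-Order labels from the inverted lists. Matching whole path steps rather than raw string suffixes keeps //B/C from matching the unrelated path /AB/C.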

For path expressions containing predicates, such as Q5: /A/B[D = "text"], that fit the pattern "path=value", we first find all Level-Order labels referenced by the pathID of /A/B/D. After that, we check which of the resulting Level-Order labels has a text value equal to "text" using exact matching on the Value table, yielding a set of Level-Order labels. The Element table is then used to return the Level-Order label of the parent, to obtain the B elements in the result.

The value index is used to accelerate predicate queries. We apply this index when we find exact string comparisons in the query; the input XPath query is rewritten to call the text content index. Child steps are converted to parent steps and descendant steps are converted to ancestor steps, and vice versa [Olteanu 2002]. For example, consider Q6: /A//C[text()="contents"]/D. The string "contents" is matched against the text content index. The resulting set of Level-Order labels is matched against the C self step and the A ancestor step, and then the D child step is evaluated.
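The evaluation of a "path=value" query such as /A/B[D = "text"] can be sketched by intersecting two inverted lists and then stepping to the parent. The function name, the argument layout and the hand-filled sample tables are assumptions made for this illustration.

```python
def eval_path_equals_value(path_ids, postings, value_index, parent_of, path, value):
    """Return the labels of the parents of all nodes on `path` whose text is `value`.

    path_ids:    path string -> pathID
    postings:    pathID -> Level-Order labels reachable by that path
    value_index: text value -> Level-Order labels carrying that text
    parent_of:   Level-Order label -> parent label (from the Element table)
    """
    pid = path_ids.get(path)
    if pid is None:
        return []
    candidates = set(postings[pid])
    hits = candidates.intersection(value_index.get(value, ()))
    return sorted({parent_of[label] for label in hits})


# Hypothetical document: <A><B><D>text</D></B><B><D>other</D></B></A>
# Level-order labels: A=1, B=2, B=3, D=4 (parent 2), D=5 (parent 3).
PATH_IDS = {"/A": 1, "/A/B": 2, "/A/B/D": 3}
POSTINGS = {1: [1], 2: [2], 3: [4, 5]}
VALUE_INDEX = {"text": [4], "other": [5]}
PARENT_OF = {1: 0, 2: 1, 4: 2, 5: 3}

result = eval_path_equals_value(
    PATH_IDS, POSTINGS, VALUE_INDEX, PARENT_OF, "/A/B/D", "text"
)
```

In this sample the second B element (label 3) has no record of its own, because it was merged into the record with label 2. It can still be returned as a result, since its D child keeps the parent label 3.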

Note that the ancestor and ancestor-or-self axes are also supported using prefix matching; the Element table also permits the efficient evaluation of child-parent, preceding-sibling, and following-sibling relationships. Inferring the other relationships also uses some matching and/or seeks on the Element and Value tables.

Through the above storage model, any compacted XML document obtained by our compaction method can be stored and queried efficiently.

4.4 A Real-Life XML Example

As a real-life example of an XML document, a small, randomly chosen part of Hamlet, a tragedy by William Shakespeare marked up in XML for electronic publication, is used to clarify the benefit of our method (SCQX). The method can be roughly divided into several phases, illustrated in Figures 4.7, 4.8, 4.9 and 4.10, respectively.

• Figure 4.7 shows an example XML document that contains a small part of the Hamlet play. In this example, the root element is PLAY; it contains a sub-element ACT, which contains a sub-element SCENE, which in turn contains five sub-elements: one STAGEDIR and four SPEECH elements, each of which contains a sub-element SPEAKER and several sub-elements LINE.

• Figure 4.8(a) shows the labeled tree obtained by the level-order labeling scheme described in Section 4.1, which computes two numbers for each element node in the XML tree. The first one is assigned in the order in which the node is visited during a breadth-first search on the XML tree, and the other number is the first number of the parent of this element node (not shown in Figure 4.8(a)). Figure 4.8(b) shows the compacted version of this XML structure, obtained by our compaction algorithm described in Section 4.2, in which repetitive sibling nodes are replaced by the first one of them, which is assigned a cardinality counter reflecting the number of repetitions.

• The storage structures for the part of the Hamlet XML document are shown in Figure 4.9. They store the data values and the compacted structural information separately in the Value table and the Element table, respectively, and maintain some extra information (i.e., the labeling scheme), while the Path table maintains all paths in the XML structure. The indexes, which are used in SCQX to reduce the storage requirements and to access the stored nodes quickly, are shown in Figure 4.10.


Figure 4.7: A random part of Hamlet XML document


Figure 4.8: The Level-Order labels (a) and the compacted structure (b) of the part of the Hamlet XML document, with Level-Order labels (red) and cardinality counters (blue)


Figure 4.9: Storage structures for the Part of Hamlet XML document


Figure 4.10: The Indexes for the Part of Hamlet XML document

4.5 Summary

SCQX [Alkhatib 2008b] is a new approach for Storing, Compacting and Querying XML documents. This approach compacts the structure of an XML document without losing any information. It is based on exploiting repetitive consecutive tags in the structure. SCQX subsequently stores the compacted XML structure separately in a robust storage structure, which includes a set of access support structures to guarantee fast query performance. The main idea of SCQX is to take advantage of a particular, level-order XML numbering scheme and of compaction techniques. The result is a storage model that greatly reduces the storage requirements (as we shall show later in the experimental results) and at the same time leverages labeling schemes to support efficient query processing on compacted XML structures. On that model we can efficiently process queries on the content and structure of documents. However, SCQX has one disadvantage: it cannot deal efficiently with updates. This problem is addressed in the next chapter.

Chapter 5

CXQU: A Cluster Labeling Scheme for Storing, Querying and Updating Compacted XML Documents

Contents
5.1 The Cluster Labeling Scheme
5.1.1 The Initial Labeling of the Cluster Labeling Scheme
5.1.2 Inserting new nodes
5.1.3 Byte Representation of Cluster Labels
5.2 Compaction Principles of CXQU
5.3 The CXQU Storage Model
5.3.3 Support for Updates to Compacted XML Structures
5.4 Summary

In the previous chapter we focused mainly on developing an XML compaction method that allows compacted XML to be queried directly, without prior decompaction. However, if XML data needs to be updated frequently, the question of how to efficiently support direct updates on compacted XML data becomes an important research topic. To address this problem, in this chapter we propose a hierarchical labeling scheme to represent compacted XML documents while maintaining all the compaction properties of SCQX described in Chapter 4. We aim at combining the advantages of compaction and labeling techniques in order to minimize storage requirements and to allow efficient access to, and querying and updating of, compacted XML documents. First we present the principles on which CXQU is built. Then we describe how an XML structure is compacted by our compaction method. Finally we describe how compacted XML documents are stored and managed, giving a detailed example.

5.1 The Cluster Labeling Scheme

As previously explained, one of the most important approaches for providing robust storage and efficient query processing is based on labeling schemes, which gather structural information from XML documents and store it in a way that allows quick identification of structural relationships between nodes. This identification plays a crucial role in efficient XML query processing. Hence, a good and compact labeling scheme is a key to the efficient processing of XML queries. In order to facilitate the determination of the structural relationships between nodes (e.g., the ancestor-descendant relationship), various kinds of encoding schemes, such as region labeling schemes [Li 2001, Grust 2002] and prefix-based labeling schemes [O'Neil 2004, Böhme 2004], have been proposed. These labeling schemes assign labels to the nodes of the XML tree in such a way that the relationship between any two nodes can be determined from their labels in constant time. If the XML document is static, region labeling schemes can process different queries efficiently. However, they do not support dynamic updates of XML data: inserting or deleting nodes requires relabeling existing nodes. In contrast to region labeling, the main advantage of prefix-based schemes is their dynamic nature; labels of existing nodes can remain stable in case of insertions and other updates. The drawback of schemes of this type is that the size of the labels is variable and depends on the tree depth.
Therefore, we aim at preserving the benefits of prefix labeling schemes while avoiding their drawbacks. The primary goals of our proposal are to support efficient querying and updating and to reduce storage space by combining labeling and compaction techniques, yielding a solution for the management of XML documents that performs better than labeling or compaction used independently.
Let us consider the main characteristics that a labeling scheme should have in order to be well suited for our representation of XML documents.

• It should allow labels to be updated efficiently when the XML document changes dynamically.

• It should maintain the order information of sibling nodes, so that queries on sibling relationships are supported effectively.

• It should support all navigational operations as well as the evaluation of all axes for declarative query processing.

• It should be able to reconstruct the document in its original form and allow a faithful representation of the XML document after our compaction.

• It should have a compact representation to reduce the label length.

To achieve these goals, we present a hierarchical labeling scheme derived from the ORDPATH labeling scheme [O'Neil 2004], retaining all its desirable features. Furthermore, in our numbering scheme the sibling element nodes are clustered and compacted by an effective algorithm based on exploiting repetitive consecutive tags in the structure of XML documents.
Almost all previous numbering schemes assign a unique identifier to each node in an XML tree in such a way that it clearly shows the relationship between any two given nodes. In contrast, we assign a unique identifier to each group of elements that have the same parent (i.e., sibling element nodes). The assignment of labels in our approach is similar to ORDPATH: we also use positive odd numbers for the initial labeling of groups. Even and negative integer component values are reserved for later insertions into an existing tree.

Figure 5.1: An XML document with cluster labels (CIDs) of element groups and TIDs of the text nodes


5.1.1 The Initial Labeling of the Cluster Labeling Scheme

The following rules apply to the initial labeling of cluster labels:

• The root node of the document is itself considered the first element group and is always labeled with the cluster identifier (CID) "1".

• The child element groups obtain the CID of their parent element group and attach a unique odd integer whose value increases in the ordered set of child element groups from left to right. This unique odd integer also gives the position of the parent element node in the parent group.

• Each text node of the XML tree is assigned a text identifier (TID), which consists of the CID of the element group of its parent element and the odd position of its parent element in that element group.

• The last rule is also applied to the labeling of attribute nodes.

Figure 5.1 displays an XML document with the cluster labels (CIDs) of the element groups and the TIDs of the text nodes.
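The initial labeling rules above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the `Element` class and the function names `assign_labels` and `text_id` are hypothetical and not part of CXQU, and labels are kept as dot-separated strings rather than in the binary form described later.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical in-memory element node used only for this sketch."""
    tag: str
    children: list = field(default_factory=list)
    cid: str = ""   # cluster identifier of the group this element belongs to
    opn: int = 0    # odd position of the element inside its group


def assign_labels(node, cid="1", opn=1):
    """Label `node` and, recursively, the groups formed by its children."""
    node.cid, node.opn = cid, opn
    # Rule 2: a child group takes the CID of the parent group plus the
    # odd position of the parent element in that group.
    child_cid = f"{cid}.{opn}"
    for i, child in enumerate(node.children):
        assign_labels(child, child_cid, 2 * i + 1)


def text_id(node):
    """Rule 3: TID = CID of the parent's group + odd position of the parent."""
    return f"{node.cid}.{node.opn}"


# A bib element with three book children, each holding an author and a title.
bib = Element("bib", [Element("book", [Element("author"), Element("title")])
                      for _ in range(3)])
assign_labels(bib)
```

In this sketch the three book elements share the CID 1.1 and receive the odd positions 1, 3 and 5, while the children of the second book form the group with CID 1.1.3.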

5.1.2 Inserting new nodes

We only consider the insertion of new single element nodes and subtrees into an XML document, since a new leaf node takes its label from its parent element node. When inserting a new element node or subtree into an XML document, the following cases are possible:

5.1.2.1 Insert a node (subtree) after the last child of a parent element

The labeling of a new node (subtree) after the last child element of a parent element works very similarly to the initial labeling. In this case, there is no node after the given node in its group. Thus we do not need to assign a new CID to the inserted node or to the root of the inserted subtree (in the case of a subtree insertion). The first child group of the root of the inserted subtree obtains the CID of the root's group and concatenates the odd position of the root element of the inserted subtree in its group. The other child groups, exactly as in rule 2 of the initial labeling of cluster labels, obtain the CID of their parent group and attach a unique odd integer whose value increases in the ordered set of child groups from left to right. Figure 5.2 shows a subtree inserted after the last child element of the parent element bib in the example XML document in Figure 5.1.

Figure 5.2: An example of an insertion of a subtree after the last child of a parent element

5.1.2.2 Insert a node (subtree) before the first child of a parent element

To label a new node (subtree) inserted before the first child of a parent element, we use negative values. In this case, there is no node before the given node (the first child); thus the CID of the inserted node or of the root of the inserted subtree (in the case of a subtree insertion) is the CID of the given node extended with a unique negative, odd integer. The CIDs of the child groups of the inserted subtree are obtained exactly as in rule 2 of the initial labeling. When further insertions on the left side are needed, we use the other unique negative, odd integers, whose absolute values increase in the ordered set of groups from right to left. An example of a subtree inserted before the first child element of the parent element bib in the example XML document in Figure 5.1 is shown in Figure 5.3.

5.1.2.3 Insert a node (subtree) at any position between two existing nodes

The remaining case is the insertion of a node (subtree) between two existing nodes. In this case, we use the even number that falls between the odd numbers of the two given existing nodes, which represent the odd positions of these nodes in their group. Thus the CID of the inserted node or of the root of the inserted subtree (in the case of a subtree insertion) is the CID of the given nodes extended with a unique even integer. The CIDs of the child groups of the inserted subtree are obtained exactly as in rule 2 of the initial labeling. Further insertions before or after the inserted node follow cases 1 and 2 above. An example of inserting new subtrees between two existing book nodes is given in Figure 5.4.

Figure 5.3: An example of an insertion of subtrees and nodes before the first child of a parent element

Figure 5.4: An example of an insertion of subtrees between two existing element nodes
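The three insertion cases can be summarized in a short Python sketch. The function name `insertion_label` and its parameters are hypothetical; the sketch only covers the first insertion at a given position and returns the CID and odd position of the inserted node (or subtree root) as described in the cases above.

```python
def insertion_label(group_cid, left_opn=None, right_opn=None):
    """Return (cid, opn) for a node inserted into the group `group_cid`.

    `left_opn` / `right_opn` are the odd positions of the existing
    neighbours, or None when there is no neighbour on that side.
    """
    if left_opn is None and right_opn is None:
        raise ValueError("at least one neighbour is required")
    if right_opn is None:
        # Case 1: after the last child, keep the CID and continue counting.
        return group_cid, left_opn + 2
    if left_opn is None:
        # Case 2: before the first child, attach a negative odd integer.
        return f"{group_cid}.-1", 1
    # Case 3: between two nodes, attach the even number between their OPNs.
    return f"{group_cid}.{left_opn + 1}", 1
```

For the group with CID 1.1, this gives `("1.1", 7)` after a last child at position 5, `("1.1.-1", 1)` before the first child, and `("1.1.2", 1)` between the children at positions 1 and 3.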

5.1.3 Byte Representation of Cluster Labels

Similar to ORDPATH and Dewey labels, the cluster labels are not stored as (.-separated) ordinals. Instead, we use a compressed binary representation of Dewey Order that uses successive variable-length Li/Oi bitstrings (see Table 5.1) to represent the cluster label of each node. One Li/Oi bitstring pair represents one component of the cluster label. The Li bitstrings can be represented using the form of prefix-free encoding shown in Figure 5.5 or other possible prefix-free encoding schemes such as the one shown in Figure 5.6. Note that the structure of the given XML document determines which prefix-free encoding scheme is most suitable. Each Li bitstring specifies the length in bits of the succeeding Oi bitstring and is generated to maintain document order. Li/Oi components can specify negative ordinals Oi as well as positive ones. For example, if the Li bitstring "01" is assigned length 3, this Li indicates a 3-bit Oi bitstring. The bitstrings (000, 001, 010, ..., 111) then represent the Oi values of the first eight integers (0, 1, 2, ..., 7). Thus "01001" is the bitstring for ordinal "1".

Table 5.1: Cluster labels format
L0 | O0 | L1 | O1 | ... | Li | Oi

Figure 5.5: A Prefix-Free Encoding of the Prefix Bitstrings

Figure 5.6: An Alternative Prefix-Free Encoding of the Prefix Bitstrings

5.1.3.1 Example for Encoding

The binary encoding of the cluster label (1.3.-9) is produced by locating each component value in the Oi value ranges and appending the corresponding Li bitstring followed by the corresponding number of bits specifying the offset of the component value from the minimum Oi value of that range.

Let us consider the first component value "1" and translate it into a bitstring pair. The first component "1" is located in the Oi value range [0, 7], so the corresponding L0 bitstring is 01 with length 3, indicating a 3-bit O0 bitstring. We therefore encode the component "1" with L0 = 01 and O0 = 001. Similarly, the binary encoding of the component "3" is the bitstring pair L1 = 01, O1 = 011. The component -9 is located in the Oi value range [-24, -9]; its corresponding L2 bitstring is 00011 with length 4. Thus the O2 bitstring is 1111, which is the offset 15 from -24 specified in 4 bits. As the final result, the bitstring 0100101011000111111 (the concatenation of 01 001, 01 011 and 00011 1111) is the binary encoding of the cluster label (1.3.-9). Using the encoding scheme from Figure 5.5, the binary encoding of the cluster label (1.3.-9) will be 01101000011100.
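The component-wise encoding described above can be sketched in Python as follows. The table is an assumption: it contains only the two ranges named in the text ([0, 7] and [-24, -9]) plus two neighbouring ranges in the style of the ORDPATH encoding, and the names `PREFIX_TABLE`, `encode_component` and `encode_label` are hypothetical.

```python
# (Li bitstring, number of Oi bits, smallest ordinal of the range)
PREFIX_TABLE = [
    ("00011", 4, -24),  # ordinals -24..-9 (range named in the text)
    ("001", 3, -8),     # ordinals -8..-1  (assumed, ORDPATH-style)
    ("01", 3, 0),       # ordinals 0..7    (range named in the text)
    ("100", 4, 8),      # ordinals 8..23   (assumed, ORDPATH-style)
]


def encode_component(value):
    """Return the Li bitstring followed by the Oi offset bits for one ordinal."""
    for prefix, nbits, low in PREFIX_TABLE:
        if low <= value < low + (1 << nbits):
            # Oi is the offset from the smallest ordinal of the range.
            return prefix + format(value - low, f"0{nbits}b")
    raise ValueError(f"ordinal {value} is outside the sketched table")


def encode_label(label):
    """Encode a dot-separated label such as '1.3.-9' as one bitstring."""
    return "".join(encode_component(int(c)) for c in label.split("."))
```

With this table, `encode_component(1)` gives `01001`, `encode_component(3)` gives `01011` and `encode_component(-9)` gives `000111111`; `encode_label` simply concatenates the pairs in component order.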

5.2 Compaction Principles of CXQU

As XML database sizes grow, the need has emerged to reduce the amount of space used for storing XML documents while supporting efficient query processing and dynamic updates. As mentioned previously, most XML database queries tend to focus on structural aspects, with only occasional access to character contents. Supporting XML structures is therefore becoming a main factor in query and update performance. In this section we aim to combine the advantages of the compaction principles described in Chapter 4 and the cluster labeling scheme described in the previous section. The result is an efficient approach to representing XML documents that not only supports queries and updates but also compacts the structure of an XML document by exploiting repetitive consecutive tags with the help of the cluster labeling scheme.

To compact an XML document with CXQU, the structure of the XML document is first separated from its data and then compacted using Algorithm 5.1, which exploits the repetition of similar sibling nodes in the XML structure, where "similar" means elements with the same tag name. The algorithm consists of a single depth-first traversal of the original document, which can easily be obtained from a SAX parser. Once each group of sibling nodes of the XML structure is labeled using our cluster labeling scheme (see Section 5.1), each node of these groups is assigned an Odd Position Number (OPN), which represents its position in the sequence of positive odd integers. These nodes are then compared from left to right. If a node is similar to its next node, then only the last node of such a run is stored, together with its OPN, in our storage model for representing compacted XML documents (presented in the next section). We use the OPNs as cardinality counters from which we can infer the number of repetitions of similar sibling nodes. Moreover, the use of odd numbers allows the insertion of new nodes after compaction without having to relabel the existing ones.

Returning to Figure 5.1, consider the second group with three book nodes. These three sibling nodes are similar. Therefore, we can replace them by the last node, which has CID=1.1 and OPN=5, and likewise for all other groups of the given XML structure. Figure 5.7 displays the compacted structure of the XML document of Figure 5.1, where the crossed-out nodes will not be stored.
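The following Python sketch illustrates how a single sibling group might be compacted and how the OPNs can serve as cardinality counters. It assumes a freshly labeled group (positions 1, 3, 5, ... with no later insertions); the function names `compact_group` and `run_lengths` are hypothetical and not taken from the CXQU implementation.

```python
def compact_group(tags):
    """Keep only the last node of every run of equal consecutive tags.

    `tags` lists the tag names of one sibling group in document order.
    Returns the stored (tag, OPN) pairs.
    """
    stored = []
    for i, tag in enumerate(tags):
        opn = 2 * i + 1  # odd position number of the i-th sibling
        if i + 1 < len(tags) and tags[i + 1] == tag:
            continue  # similar to its next sibling, so it is not stored
        stored.append((tag, opn))
    return stored


def run_lengths(stored):
    """Infer how many original siblings each stored node stands for."""
    counts, previous_opn = [], -1
    for tag, opn in stored:
        counts.append((tag, (opn - previous_opn) // 2))
        previous_opn = opn
    return counts
```

For the group of three book nodes, `compact_group(["book", "book", "book"])` keeps the single pair `("book", 5)`, and `run_lengths` recovers the repetition count 3 from that OPN.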


Algorithm 5.1 The CXQU Algorithm for Compacting an XML Tree

if SAX event is a start-tag event then
    create a new node N for the current tag
    S.push(N)          { N is pushed onto stack S; S is an array keeping the tag information }
    OP.push(OPN)       { OPN is the odd position number of N; OP is an integer array keeping the odd position numbers }
    OPN := 1
end if
if SAX event is an end-tag event then
    N := S.pop()       { N is popped from stack S }
    OPN := OP.pop()    { OPN is popped from stack OP }
    OPN := OPN + 2
    if there are more entries on the stack then
        S.top().add(N) { take the top node of stack S and add N to it as a child node }
    end if
    G := the children of N    { G is the list of sibling nodes }
    if G is not empty then
        Compaction(G)  { the sibling nodes are compacted if they are similar }
    end if
end if

PROCEDURE Compaction(G)

for i = 1 to (number of sibling nodes in G) - 1 do
    if node(i) is similar to node(i+1) then
        node(i) := null    { node(i) will not be stored }
    end if
end for


Figure 5.7: Compacted structure of the simple XML document in Figure 5.1 with cluster labels (CIDs) and the odd position number (OPN) of each node

5.3 The CXQU Storage Model

In this section we propose a storage model for XML documents that uses the cluster labeling scheme to represent the relationships between XML nodes after compaction with the CXQU compaction method.

The idea of the CXQU storage model is to store the data values and the compacted structure separately. As we have shown, an XML document is often represented as an XML tree in which the internal nodes represent the tags of the XML document while the leaf nodes represent its data values. The internal nodes depict the structure of the XML document and are only useful for document navigation. Storing them in a table that provides the complete structure information of a compacted XML document supports efficient and easy document traversal. The leaf nodes hold all the data of the XML document but are of little use for document navigation. Storing them in a separate table, apart from the structure information, allows us to improve query processing performance by avoiding scans of irrelevant data values. For query evaluation, however, we also need to maintain the connection between structural information and value information. Storing some extra information (i.e., the cluster labels) therefore guarantees that the two parts can be reconnected.



Definition 5.1 (Cluster Label CL)
The cluster label of an element node ui of the structure T is the cluster identifier CID obtained by the cluster labeling scheme.

Definition 5.2 (Text ID)
The text identifier is the label of the text contents of element node ui, which consists of the CID of the group of the element node ui and the odd position of element node ui in its element group.

Definition 5.3 (Odd Position Number OPN)
The Odd Position Number of element node ui is a positive odd integer that represents the odd position of element node ui in the group of its sibling element nodes.

CXQU is built on the data structures listed below and guarantees a compact mapping of XML files:

5.3.1.1 Element Table

Once the nodes in the tree representation are numbered and compacted, the compacted structure is stored in the Element Table. For each group of the compacted structure, this table stores its sequence identifier (Seq.ID), its CL, and the OPN and tag name of each element node of the group (see Figure 5.8). The tags are uniformly indexed in a hash structure; the index entries are referenced by integer values. Hence, in the internal representation, the tag names of each element group are stored as integer references in an integer array, and the OPN values of each element group are also stored in an integer array, while the CL of an element group is binary encoded and stored in a byte array. The Seq.ID values are given implicitly by the array position.

Figure 5.8: Element Table of CXQU's storage structure


5.3.1.2 Value Table

This table keeps all the leaf node information; for each leaf node, it stores its sequence identifier (Seq.ID), its Text ID and its text contents (see Figure 5.9). The text contents are also indexed in a hash structure; the index entries are referenced by integer values. Thus, in the internal representation, the text contents are stored as integer references in an integer array. The Text IDs are binary encoded as a sequence of Li/Oi bitstring pairs and stored in byte arrays, and their Seq.ID values are stored implicitly. Because a CL or a Text ID is binary encoded and stored in a byte array, the last byte may in some cases be incomplete. It is then padded on the right with zeros to end on an 8-bit boundary. For example, the binary encoding of 1.9 is 010011000001; it occupies 2 bytes and is stored as the bitstring 0100110000010000.
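The padding step can be sketched in a few lines of Python. The helper name `to_byte_array` is hypothetical; it takes an already encoded label as a string of 0 and 1 characters and returns the padded bytes.

```python
def to_byte_array(bits):
    """Pad a bitstring on the right with zeros to an 8-bit boundary
    and return it as bytes."""
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


# The 12-bit encoding of 1.9 from the text becomes two bytes.
encoded = to_byte_array("010011000001")
```

For the example above, the two resulting bytes are 01001100 and 00010000.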

Figure 5.9: Value Table of CXQU's storage structure

5.3.1.3 Path Table

This table maintains all paths in an XML structure, where a path is the sequence of elements from the root node to any element node. For the Path Table, we create a path index, which contains a unique integer (the so-called pathID) for each path. Multiple groups of the compacted structure that share the same path have the same pathID, where the pathID of a group refers to the path one level above it. For example, in Figure 5.7, the element groups with CLs 1.1.1 and 1.1.3 are two different groups, but the paths leading to these groups are both expressed as /bib/book; hence, they have the same pathID value. This index can be implemented by a hash structure. To achieve better query performance, the path index is extended by references to the Seq.ID values of the Element Table, resulting in an inverted index.
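A minimal Python sketch of such an inverted path index is given below. It is an assumption about one possible hash-based layout, not the CXQU implementation: the class `PathIndex`, its methods, and the Seq.ID values used in the example are hypothetical. The suffix lookup is included only to show how a match on the last step of a path could be served from the same structure.

```python
class PathIndex:
    """Hash-based path index with references to Element Table Seq.IDs."""

    def __init__(self):
        self.path_ids = {}  # path string -> pathID
        self.postings = {}  # pathID -> list of Element Table Seq.IDs

    def add(self, path, seq_id):
        """Register the group `seq_id` under `path` and return its pathID."""
        path_id = self.path_ids.setdefault(path, len(self.path_ids) + 1)
        self.postings.setdefault(path_id, []).append(seq_id)
        return path_id

    def exact(self, path):
        """Seq.IDs of all groups reached by exactly this path."""
        path_id = self.path_ids.get(path)
        return [] if path_id is None else list(self.postings[path_id])

    def suffix(self, tag):
        """Seq.IDs of all groups whose path ends with the step `tag`."""
        hits = []
        for path, path_id in self.path_ids.items():
            if path.endswith("/" + tag):
                hits.extend(self.postings[path_id])
        return hits


index = PathIndex()
index.add("/bib", 2)
for seq_id in (3, 4, 5):
    index.add("/bib/book", seq_id)  # different groups, same path and pathID
```

Here the three groups below the book elements share one pathID, so a single lookup of /bib/book returns all of their Seq.IDs.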

Using the CXQU storage model to store compacted XML documents provides a flexible storage scheme that satisfies different storage requests. Figure 5.11 shows the XML storage model of CXQU (top) and its internal representation (bottom).


Figure 5.10: An Inverted Path Index


CXQU supports the evaluation of XPath queries, including all axis types, node tests and predicates, using different algorithms, and some query optimizations are applied to improve query performance. The big advantage of CXQU is that it supports efficient evaluation of queries on compacted XML without prior decompaction, except for what is required to show the results. The following examples explain query processing.

When a full path such as Q1: /bib/book/author is specified, an index search is first performed on the path index to find the pathID for the path expression. This search only involves the first two steps (i.e., /bib/book) of the original path and yields the pathID value "2" and a list of the Seq.ID values of the Element Table referenced by this pathID. The Seq.ID values are 3, 4 and 5 in this case. After that, an exact match for "author" in the groups with these Seq.ID values yields three nodes, "CID=1.1.1, OPN=3", "CID=1.1.3, OPN=5" and "CID=1.1.5, OPN=7". Finally, because each node in the compacted structure represents a node or a set of nodes, decompaction is required to show the final results. For example, the node "CID=1.1.3, OPN=5" has OPN = 5 and occupies the second position in the array. This means that it contains other nodes, namely the node where "CID=1.1.1, OPN=3".

For a path expression containing the //-axis, such as Q2: //TITLE, a suffix match for TITLE in the path index yields a set of paths that can be answered in a way similar to answering Q1.


Figure 5.11: XML Storage Model (top), Internal Representation in CXQU (bottom)


For path expressions containing a wildcard or a //-axis in the middle of the path expression, such as Q3: /bib/*/TITLE or Q4: /bib//TITLE, the exact match for /bib and the suffix match for TITLE yield a set of paths which again can be answered like Q1.

To answer queries with predicates that fit the pattern "path = value", the path expression inside the predicate is rewritten. Descendant steps are converted to ancestor steps and child steps are converted to parent steps, and vice versa [Olteanu 2002]. For example, for Q5: /bib/book[TITLE = "XML"], we first answer the path expression /bib/book/TITLE like Q1, which yields a set of nodes. After that, we check which of these nodes have a text value equal to "XML" using an exact match on the Value Table. This yields a set of nodes. To obtain the book elements in the result, we have to find the parents of the nodes that have been yielded. These can easily be inferred from their labels.

In the case of twig queries, which require a join, such as Q6: /bib/book[TITLE = "XML"]//X, the part /bib/book[TITLE = "XML"] is first answered like Q5, yielding a set of nodes; then the ancestor-descendant relationship between the yielded nodes and X is verified using their labels.

Note that CXQU supports the other XPath axes in the same way as the prefix numbering scheme, as our approach is based on the original prefix numbering scheme ORDPATH.

5.3.3 Support for Updates to Compacted XML Structures

We consider two types of updates, insertion and deletion; a modification is treated as the combination of a deletion and an insertion. CXQU allows for efficient updates, since nodes specified by a query can easily be inserted into or deleted from its compacted storage in the way stipulated by the cluster labeling scheme. We use examples to show how node insertion is processed in CXQU.

Example 5.1

We want to insert a sibling node before the book element "CL = 1.1, OPN = 1" in Figure 5.7. In this case, there is no node before the given node; thus the CL of the new node is "1.1.-1" and its odd position number is "1" (see Figure 5.12).


Figure 5.12: Element node insertion, and subtree insertion in CXQU

Example 5.2

We want to insert two subtrees with the root "article" between the two book elements with CL "1.1" and OPN "1" and "3", respectively, in Figure 5.7. In this case, we use the even number falling between the odd numbers of the two given sibling nodes. Thus the CL of both roots of the inserted subtrees is "1.1.2" and their odd position numbers (OPNs) are 1 and 3, respectively (see Figure 5.12). These root nodes are similar, which is why only the last node is stored with its OPN. The children of the root of the first subtree have CL = "1.1.2.1", with OPN = "1" for the first child and "3" for the second child; the children of the root of the second subtree have CL = "1.1.2.3", with OPN = "1" for the first child and "3" for the second child, and so on.

Example 5.3

We want to insert a sibling node after the book element "CL = 1.1, OPN = 5" in Figure 5.7. In this case, there is no node after the given node in its group. Thus the inserted node takes the same CL "1.1". We continue counting from the last existing OPN to obtain the odd position number of the inserted node; thus its OPN is "7" (see Figure 5.12).

Note that an insertion never requires relabeling other nodes. For the insertion of a leaf node, the TID of this leaf node is the label of its parent element node. For deletions, we can simply mark the corresponding nodes in the compacted structure as deleted, without any relabeling. However, since each node in the compacted structure represents a node or a set of nodes in the uncompacted tree, we must consider the case in which a deletion requires partial decompaction. For instance, suppose we want to delete the second book element, specified by the XPath expression bib/book[2], which selects the book node having "CL = 1.1, OPN = 3". Because the compacted node having "CL = 1.1, OPN = 5" contains the selected node, decompaction is required in order to mark the selected node as deleted.

5.4 Summary

In this chapter, we have proposed CXQU [Alkhatib 2008a], an efficient XML compacting and labeling method that supports efficient query processing and updates on compacted XML structures. CXQU combines a compaction technique, which reduces the size of the structure of an XML document by exploiting repetitive consecutive tags, with a labeling scheme that is derived from ORDPATH and reduces label length. CXQU is able to reduce the storage requirements and to handle query processing and updates on compacted XML structures. Thus we feel that CXQU provides a good compromise between storage consumption and query and update performance.

Chapter 6

CXDLS: Compacting, Storing, Querying and Updating XML Documents Using a Dynamic Labeling Scheme

Contents

6.1 The Prefix Labeling Scheme

6.2 XML Compaction

6.3 The Storage Model of CXDLS

6.3.3 Support for Updates to Compacted XML Structures

6.4 Summary

In this chapter we present an improved technique called CXDLS, which combines the strengths of both labeling and compaction techniques. CXDLS bridges the gap between numbering schemes and compaction technology to provide a solution for the management of XML documents that performs better than using labeling and compaction independently. In this work, we focus on the separation of content from the structure of an XML document, coupled with an effective compaction method that compacts the regular structure of XML efficiently. At the same time, it works well when applied to less regular or irregular structures. While this technique has the potential for compact storage, it also supports efficient querying and update processing of the compacted XML documents by taking advantage of the ORDPATH labeling scheme. In Section 6.1, we introduce the labeling of nodes using the ORDPATH labeling scheme. In Section 6.2, we describe the main idea behind our approach for compacting the structure of XML documents and illustrate how the ORDPATH labels are used to maintain the relationships between nodes after compaction. Section 6.3 describes a storage model for representing compacted XML documents and shows how the compacted XML documents can be queried and updated without prior decompaction.

6.1 The Prefix Labeling Scheme

Existing labeling schemes can be classified into interval labeling and prefix labeling. While almost all interval labeling schemes support XML query processing efficiently, their main drawback is that they do not behave well in the presence of updates: adding or removing nodes from the XML tree results in the relabeling of a large part of the document. In contrast, the main advantage of prefix labeling schemes is that the existing node labels do not change when nodes are inserted or deleted. In schemes of this type, the size of the labels is variable; their overall disadvantage is that the length of the labels increases with the depth of the stored XML tree. To preserve the benefits of prefix labeling schemes and to avoid their drawbacks, we use the ORDPATH labeling scheme [O'Neil 2004] to gather sufficient structural information from the XML document, which is subsequently compacted by a new compaction algorithm (described in the next section). It is then stored in a way that allows fast access and efficient processing over secondary storage.

The ORDPATH numbering scheme [O'Neil 2004] is a particular variant of a hierarchical labeling scheme, which is used in Microsoft SQL Server's XML support. It is essentially an enhanced, insert-friendly version of Dewey tree labeling [Tatarinov 2002]. It specifies a procedure for expressing a path as a binary bitstring. It aims to enable efficient insertion at any position of an XML tree and also supports extremely high performance query plans for native XML queries. ORDPATH encodes the parent-child relationship by extending the parent's ORDPATH label with a component for the child. It assigns node labels of variable length and only uses positive, odd numbers in the initial labeling (see Algorithm 6.1). Even-numbered and negative integer component values are reserved for later insertions into an existing tree. The even numbers are used as carets only; they are not counted as components that increase the depth of the nodes. Hence a new node can be inserted under any designated parent node in an existing tree. Its label is generated using an additional intermediate "careting" component that falls between the components of its left and right siblings.

The ORDPATH encoding for an XML document is generated as follows:


• The root receives label 1.

• The nth (n = 1, 2, ...) child of a node labeled p receives the label p.(2*n - 1).

• Let (V1, ..., Vn) denote a sequence of nodes to be inserted between two existing sibling nodes with labels p.s and p.(s + 2), s odd. After insertion, the new label of Vi is p.(s + 1).(2*i - 1); the label p.(s + 1) is referred to as a caret.

• A new child to the left of all existing children is labeled by adding -2 to the last ordinal of the first child, using negative ordinal values if necessary.

• Insertions at the rightmost side are very easy to make: the label of the inserted node is p.(2*(n + 1) - 1), where p is the parent label and n is the number of children of the parent node.
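The labeling and insertion rules listed above can be written down directly. The following Python sketch represents a label as a list of integer components; all function names are hypothetical and the sketch does not cover repeated insertions at a position that already carries a caret.

```python
def child_label(parent, n):
    """Label of the n-th (n = 1, 2, ...) child of `parent` in the initial labeling."""
    return parent + [2 * n - 1]


def insert_between(left, count):
    """Labels for `count` nodes inserted between siblings p.s and p.(s + 2).

    `left` is the label p.s of the left sibling, with s odd.
    """
    caret = left[:-1] + [left[-1] + 1]  # p.(s + 1), the caret
    return [caret + [2 * i - 1] for i in range(1, count + 1)]


def insert_leftmost(first_child):
    """Label for a new child to the left of the current first child."""
    return first_child[:-1] + [first_child[-1] - 2]


def insert_rightmost(parent, n):
    """Label for a new child to the right of the n existing children."""
    return parent + [2 * (n + 1) - 1]


def depth(label):
    """Even components are carets and do not add to the depth."""
    return sum(1 for component in label if component % 2 != 0)
```

For example, inserting two nodes between the siblings 1.1 and 1.3 gives the labels 1.2.1 and 1.2.3, both of which still have depth 2 because the caret component 2 is not counted.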

Algorithm 6.1 Algorithm for ORDPATH Encoding

if SAX event is a start-tag event then
    idPrefix := ""
    Iterator itr = mainStack.iterator()
    while there are entries on the iterator itr do
        idPrefix := idPrefix + itr.next() + "."
    end while
    mainStack.push(currentID)    { currentID is the odd position number of N; mainStack is an integer array keeping the odd position numbers }
    ORDPATH ID := idPrefix + currentID
    currentID := 1
end if
if SAX event is an end-tag event then
    currentID := mainStack.pop()    { currentID is popped from mainStack }
    currentID := currentID + 2
end if
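As a runnable counterpart to Algorithm 6.1, the following sketch uses the SAX interface of the Python standard library. The class name `OrdpathHandler` and the helper `ordpath_labels` are hypothetical; labels are produced as dot-separated strings for element nodes only, and text nodes and attributes are ignored.

```python
import xml.sax


class OrdpathHandler(xml.sax.ContentHandler):
    """Assign initial ORDPATH labels to elements during one SAX pass."""

    def __init__(self):
        super().__init__()
        self.stack = []    # ordinals of the currently open elements
        self.current = 1   # ordinal of the next element on this level
        self.labels = []   # (tag, label) pairs in document order

    def startElement(self, name, attrs):
        # The prefix is built from the ordinals of all open ancestors.
        prefix = "".join(f"{ordinal}." for ordinal in self.stack)
        self.stack.append(self.current)
        self.labels.append((name, prefix + str(self.current)))
        self.current = 1   # the first child starts at ordinal 1

    def endElement(self, name):
        # The next sibling receives the next odd ordinal.
        self.current = self.stack.pop() + 2


def ordpath_labels(xml_text):
    handler = OrdpathHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.labels


labels = ordpath_labels("<bib><book><title/><author/></book><book/></bib>")
```

For this small document the elements bib, book, title, author and the second book receive the labels 1, 1.1, 1.1.1, 1.1.3 and 1.3.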

An example of an XML document and its ORDPATH labeling is given in Figure 6.1. Note that the document order is exactly the lexicographical ORDPATH order. Internally, ORDPATH labels, like the cluster labels and Dewey labels, are not stored as (.-separated) ordinals; instead, the ORDPATH label is stored in a binary encoding, called prefix-free encoding, that maintains document order and allows cheap and easy node comparisons. This encoding consists of a set of Li/Oi bitstring pairs, one for each component of the ORDPATH label. Each Li bitstring specifies the length in bits of the succeeding Oi bitstring and is generated to maintain document order. The Li bitstrings can be represented using the form of prefix-free encoding shown in Figure 5.5 or using other possible prefix-free encoding schemes such as the one shown in Figure 5.6. Figure 6.2 displays the binary encoding of an ORDPATH label.

Figure 6.1: An XML document with ORDPATH labeling

Figure 6.2: An example for the binary encoding of an ORDPATH label

In our approach we use the ORDPATH labeling scheme to label the XML documents: each element node of the structure is assigned an ORDPATHLB, that is, a label obtained by the ORDPATH labeling scheme, while each text node is assigned a text identifier TID, which is the ORDPATHLB of its parent element. We use the ORDPATH labeling scheme because it is suitable for our work and has the following useful properties:

• All relationships between nodes can be inferred from the labels alone.

• It is easy to determine the order of the nodes.

• It allows the insertion of new nodes at arbitrary positions in the XML tree but nevertheless avoids relabeling existing nodes.

• It has an internal representation which is based on a compressed binary form.

• It allows a faithful representation of the XML document after compaction.

6.2 XML Compaction

The basic idea of our approach combines the encoding scheme described in the previous section with a new compaction technique to achieve a compact representation of XML documents for efficient management. Our compaction method helps to remove redundant, duplicate subtrees and tags in an XML document. It takes advantage of the XMill principle of compacting structure separately from data, and it also uses the ORDPATH labeling scheme to improve query and update processing performance on compacted XML structures.

Definition 6.1 (identical structure)
Two subtrees S and S' of an XML structure are said to be "identical" if they are consecutive and have exactly the same structure.

Definition 6.2 (similar nodes)
Two nodes N and N' of an XML structure are said to be "similar" if they are consecutive elements, i.e., sibling nodes, in the structure and have exactly the same tag name.

Our compaction algorithm is presented in Algorithm 6.2. The original input XML document is parsed by the SAX parser [Saxproject] and labeled using the ORDPATH labeling scheme as illustrated in the previous section. The XML structure is then compacted based on the principle of exploiting the repetitions of similar sibling nodes in the XML structure. These similar nodes are replaced with one compacted node, which is assigned a start label equal to the label of the first of the compacted nodes and an end label equal to the label of the last of the compacted nodes. Another principle is to exploit the repetitions of identical subtrees (see Definition 6.1). These identical subtrees are also replaced with one compacted subtree, where each node in the compacted subtree is assigned a start label equal to the label of the corresponding node in the first of the compacted subtrees and an end label equal to the label of the corresponding node in the last of the compacted subtrees.

Algorithm 6.2 The CXDLS Algorithm for Compacting an XML Tree

if SAX event is a start-tag event then
    create a new node N for the current tag
    S.push(N)          { N is pushed onto stack S; S is an array keeping the node information }
end if
if SAX event is an end-tag event then
    N := S.pop()       { N is popped from stack S }
    if there are more entries on the stack then
        S.top().add(N) { take the top node of stack S and add N to it as a child node and the children of N as descendant nodes }
                       { the hash value of the node N is computed }
                       { the hash value of the subtree rooted at the node N is computed }
    end if
    compaction(N)      { the subtrees are compacted if they are identical }
                       { the sibling nodes are compacted if they are similar }
end if

PROCEDURE compaction(N)

if N is not the first child AND N and the previous child node are identical then
    N is assigned a start label equal to the label of the previous child node and an end label equal to its ORDPATH label
        { two nodes are considered identical if the hash values of the subtrees rooted at these nodes are the same, and not identical if these hash values are different }
    each node of the subtree rooted at N is assigned a start label equal to the label of the corresponding node in the subtree rooted at the previous child node, and an end label equal to its ORDPATH label
        { the subtree rooted at the previous child node will not be stored }
else if N and the previous child node are similar then
    N is assigned a start label equal to the label of the previous child node and an end label equal to its ORDPATH label
        { two nodes are considered similar if the hash values of these nodes are the same, and not similar if their hash values are different }
        { the previous child node will not be stored }
else
    N has only one label, which is its ORDPATH label
end if

Example 6.1

For the document of Figure 6.3(a) the compacted structure is shown in Figure 6.3(b); observe that the first, second and third subtrees, which have the roots "book" "1.1", "1.3" and "1.5", are identical. Therefore, we can replace them by one subtree, whose root is assigned a start label equal to the label of the root of the first compacted subtree, i.e. "1.1", and an end label equal to the label of the root of the third compacted subtree, i.e. "1.5". The new subtree has the root "book" and its label is "1.1, 1.5". Each child node of this subtree is assigned a start label equal to the label of the node in the first compacted subtree and an end label equal to the label of the corresponding node in the third compacted subtree, so that the new subtree has child nodes with the following respective labels: publisher "1.1.1, 1.5.1", author "1.1.3, 1.5.3", title "1.1.5, 1.5.5". Note that the nodes "article" with the labels "1.7" and "1.9" are similar; we can also replace them by one node "article" with the label "1.7, 1.9", while the labels of their child nodes do not change. The process is recursively applied to all repetitive consecutive subtrees and tags in the XML structure. Figure 6.3(b) displays the compacted XML structure, where the crossed-out nodes will not be stored.
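The compaction of Example 6.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis implementation (which is written in Java and driven by SAX events): the `Node` class, the `shape` fingerprint used in place of hash values, and the children of the two "article" nodes are all hypothetical. Unlike Algorithm 6.2, the sketch keeps the first node of a run and drops the later ones, which leads to the same start and end labels.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    start: str                 # ORDPATH label; becomes the start label once compacted
    end: str = ""              # end label; empty while the node is not compacted
    children: list = field(default_factory=list)

def shape(node):
    """Tag structure of a subtree; stands in for the subtree hash value."""
    return (node.tag, tuple(shape(child) for child in node.children))

def absorb_identical(kept, dropped):
    """Fold an identical subtree into `kept`, node by corresponding node."""
    kept.end = dropped.end or dropped.start
    for kept_child, dropped_child in zip(kept.children, dropped.children):
        absorb_identical(kept_child, dropped_child)

def compact(siblings):
    """Return the compacted form of a list of sibling subtrees."""
    result = []
    prev_shape = None          # shape of the previous sibling before compaction
    identical_run = True       # False once the current run merged a merely similar node
    for node in siblings:
        node_shape = shape(node)
        node.children = compact(node.children)
        if result and identical_run and node_shape == prev_shape:
            absorb_identical(result[-1], node)         # identical subtrees
        elif result and result[-1].tag == node.tag:
            result[-1].end = node.start                # similar nodes: same tag only
            result[-1].children.extend(node.children)  # child labels stay unchanged
            identical_run = False
        else:
            result.append(node)
            identical_run = True
        prev_shape = node_shape
    return result

def book(label):
    kids = [Node(tag, f"{label}.{i}")
            for tag, i in (("publisher", 1), ("author", 3), ("title", 5))]
    return Node("book", label, children=kids)

# Tree modelled on Figure 6.3(a); the article children are assumed for illustration.
bib = Node("bib", "1", children=[
    book("1.1"), book("1.3"), book("1.5"),
    Node("article", "1.7", children=[Node("author", "1.7.1"), Node("title", "1.7.3")]),
    Node("article", "1.9", children=[Node("author", "1.9.1"), Node("year", "1.9.3"),
                                     Node("title", "1.9.5")]),
])
compacted = compact([bib])[0]
for child in compacted.children:
    print(child.tag, child.start, child.end)
```

With this input the three "book" subtrees collapse into one subtree labeled "1.1, 1.5", and the two "article" nodes into one node labeled "1.7, 1.9" whose children keep their original labels.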

Using these labels we fully maintain all nesting information after the compaction, so the original XML document can be faithfully reconstructed. Consider, for example, the node "publisher" with the label "start = 1.1.1, end = 1.5.1". We can infer that this compacted node contains other nodes. To get the labels of the decompacted nodes, we compare each component of the start label with its corresponding component in the end label to find all the odd numbers from the one to the other, bounds included. Then we combine the resulting numbers from the first pair of components with the resulting numbers from the second pair of components, and so on. The process for inferring the labels is as follows: from the first pair of components, 1 and 1, we get the value 1; from the second pair, 1 and 5, we get the values 1, 3 and 5; and the last pair, 1 and 1, yields 1. Combining 1 with (1, 3, 5) yields "1.1", "1.3" and "1.5", and combining this result with 1 yields the final decompacted nodes with the labels "1.1.1", "1.3.1" and "1.5.1". Figure 6.4 illustrates the process.


Figure 6.3: The XML structure and its compacted form


Figure 6.4: Example for inferring the labels from label of compacted node
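The label inference described above amounts to a per-component range of odd numbers followed by a cross product. A minimal Python sketch, assuming that start and end labels have the same number of components and that no even (caret) components are involved; the function name `decompact` is hypothetical:

```python
from itertools import product

def decompact(start, end):
    """Expand a compacted (start, end) label pair into the original labels."""
    starts = [int(c) for c in start.split(".")]
    ends = [int(c) for c in end.split(".")]
    if len(starts) != len(ends):
        raise ValueError("start and end labels must have the same length")
    # For every component position, collect the odd ordinals from start to end.
    choices = [[v for v in range(lo, hi + 1) if v % 2 == 1]
               for lo, hi in zip(starts, ends)]
    # Combine the per-position choices into full labels.
    return [".".join(str(v) for v in combo) for combo in product(*choices)]

print(decompact("1.1.1", "1.5.1"))   # ['1.1.1', '1.3.1', '1.5.1']
```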

6.3 The Storage Model of CXDLS

Based on our previous approaches in Chapters 4 and 5, we strongly believe that improving the performance of XML management requires efficient storage structures and access methods. To achieve this goal, we store the compacted XML structural information separately from the data information in the storage structure. This separation allows us to improve query processing performance by avoiding unnecessary scans of structures and irrelevant data values. But for query evaluation, we need to maintain the connection between structural information and value information. The ORDPATH labels can be used to reconnect the two parts of information.


Our storage structure contains three tables: Element Table, Value Table and Path Table. Figure 6.8 shows the storage structure for the compacted XML example shown in Figure 6.3.

6.3.1.1 Element Table

The Element table stores the compacted structure of XML documents, where a sequence identifier, an ORDPATHLB and a tag name are stored for each element node of the compacted XML structure (see Figure 6.5). We must note at this point that the ORDPATHLB of a compacted node consists of a start label and an end label, which are stored as one label whose components alternate between the two: its first component is taken from the start label, its second component from the end label, and so on. For example, the label "start = 1.1.3, end = 1.5.3" of the compacted node "author" is stored as the single label "1.1.1.5.3.3" in the Element table. The tag names are indexed in a hash structure; the index entries are referenced by integer values. Thus each tag name is stored by its integer reference. The sequence identifier is implicitly given by the array position, and each ORDPATHLB is stored in a byte array in binary form. For example, the encoding of the ORDPATHLB "1.9" is the bitstring 010011000001. Because this bitstring is stored in a byte array, the last byte is incomplete. Therefore it is padded on the right with zeros to an 8-bit boundary. Thus the stored bitstring is 0100110000010000. Note that all ORDPATHLBs start with "1.", so it is unnecessary to store this component explicitly, and thus we save 5 bits for each node of an XML document.

Figure 6.5: Element table of CXDLS's storage structure
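The alternating storage of start and end labels and the padding to a byte boundary can be illustrated as follows. This is a sketch on dotted strings and bit strings only; the helper names are hypothetical, and the binary component encoding itself (the prefix-free code of the thesis) is not reproduced here.

```python
def interleave(start, end):
    """Merge a start and an end label into one stored label, component by component."""
    s, e = start.split("."), end.split(".")
    if len(s) != len(e):
        raise ValueError("start and end labels must have the same length")
    return ".".join(part for pair in zip(s, e) for part in pair)

def split_stored(stored):
    """Recover (start, end) from a stored label of a compacted node."""
    parts = stored.split(".")
    return ".".join(parts[0::2]), ".".join(parts[1::2])

def pad_to_byte(bits):
    """Pad a bit string on the right with zeros up to the next 8-bit boundary."""
    return bits + "0" * (-len(bits) % 8)

print(interleave("1.1.3", "1.5.3"))    # 1.1.1.5.3.3
print(split_stored("1.1.1.5.3.3"))     # ('1.1.3', '1.5.3')
print(pad_to_byte("010011000001"))     # 0100110000010000
```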

6.3.1.2 Value Table

The Value table stores all the data values of an XML document, whereby the sequence identifier, the node's own text identifier (TID) and the text content are stored for each leaf node (see Figure 6.6). The sequence identifier is implicitly given by the array position, and the TID is stored in a byte array in binary form. The text contents are indexed in hash structures and stored by their integer references.

Figure 6.6: Value table of CXDLS's storage structure

6.3.1.3 Path Table

The Path table stores all distinct paths in the structure of an XML document, where a path means the sequence of elements from the root node to any element node (see Figure 6.7). The Path table is indexed in a hash structure; the index entries are referenced by integer values. To achieve better query performance, this index is extended by references to the sequence identifier values of the Element table, resulting in an inverted index.

Figure 6.7: Path table of CXDLS's storage structure


CXDLS supports the querying and updating of the compacted structure directly and efficiently. It can efficiently process XPath queries, including almost all axis types, node tests and predicates, by using different algorithms and indexes such as the tag index, value index and path index, and also by applying some query optimizations. Moreover, the key concept in quickly evaluating a query is to use the ORDPATH labels to quickly determine the ancestor-descendant relationship between XML elements and to provide fast access to the desired data.

Example 6.2

An XPath query containing a slash "/" or a double slash "//" can be easily answered by performing an exact match or prefix match on the path index, which yields the pathId(s) and the ORDPATHLB(s) referenced by the resulting pathId(s). Let the XPath expression to be evaluated be Q1: /BIB/ARTICLE/TITLE. Thus, we only need to perform an exact match on the path index. The pathId = 8 is yielded with the ORDPATHLBs "1.7.3" and "1.9.5" referenced by this pathId. A path expression containing the //-axis, such as Q2: //TITLE, requires a suffix match for TITLE in the path index. It yields the pathIds 5 and 8 and the ORDPATHLBs "1.1.1.5.5.5", "1.7.3" and "1.9.5" referenced by these pathIds. Note that the ORDPATHLB "1.1.1.5.5.5" is the label of a compacted node. To show the final results, decompaction is required, using the process described in Section 6.2, which yields three nodes with the ORDPATHLBs "1.1.5", "1.3.5" and "1.5.5". For path expressions containing a wildcard or a //-axis in the middle of the path expression, such as Q3: /BIB/*/TITLE or Q4: /BIB//TITLE, an exact match for BIB and a prefix match for TITLE on the path index yield the same result as Q2.
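The lookups for Q1 and Q2 can be mimicked with a small dictionary. The sketch below is illustrative only: it contains just the two path entries that can be read off the example (pathIds 5 and 8), it maps pathIds directly to labels rather than to sequence identifiers of the Element table, and it scans the entries linearly instead of using a hash structure. The names `PATH_INDEX`, `exact_match` and `suffix_match` are hypothetical.

```python
# pathId -> (path, ORDPATHLBs of the element nodes on that path)
PATH_INDEX = {
    5: ("/BIB/BOOK/TITLE", ["1.1.1.5.5.5"]),        # compacted label
    8: ("/BIB/ARTICLE/TITLE", ["1.7.3", "1.9.5"]),
}

def match(predicate):
    """Return (pathIds, labels) for all index entries whose path satisfies predicate."""
    ids = sorted(pid for pid, (path, _) in PATH_INDEX.items() if predicate(path))
    labels = [label for pid in ids for label in PATH_INDEX[pid][1]]
    return ids, labels

def exact_match(path):
    """Q1-style lookup: the whole path must match."""
    return match(lambda p: p == path)

def suffix_match(tag):
    """Q2-style lookup: the path must end in the given tag."""
    return match(lambda p: p.endswith("/" + tag))

print(exact_match("/BIB/ARTICLE/TITLE"))   # ([8], ['1.7.3', '1.9.5'])
print(suffix_match("TITLE"))               # ([5, 8], ['1.1.1.5.5.5', '1.7.3', '1.9.5'])
```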


Figure 6.8: The XML storage model (left) and its internal representation (right) in CXDLS


Example 6.3

For an XPath query containing predicates, the path expression inside the predicate is rewritten. Child steps are converted to parent steps and descendant steps are converted to ancestor steps, and vice versa [Olteanu 2002]. This type of query fits the pattern "path = value". For example, to answer a predicate query such as Q5: /BIB/ARTICLE[TITLE="P2P"], first the path expression /BIB/ARTICLE/TITLE is answered in a way similar to answering Q1. The ORDPATHLBs "1.7.3" and "1.9.5" are yielded. Next we check which one has a text value equal to "P2P" using an exact match on the Text table. In this case only the ORDPATHLB "1.9.5" has a data value equal to "P2P". To obtain the ARTICLE in the final result, we find the parent's ORDPATHLB of "1.9.5", that is, "1.9", which is easily inferred from the label only.
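The last step of Q5, deriving the parent from the label alone, can be sketched as follows. The parent rule used here (drop the last component, then any trailing even components) is the usual ORDPATH convention as I understand it and is an assumption with respect to this chapter; the text value "XML" for the title "1.7.3" is hypothetical, since the example only states that "1.9.5" holds "P2P".

```python
def parent_label(label):
    """Parent of an uncompacted ORDPATH label, derived from the label only."""
    parts = label.split(".")[:-1]
    # Even components are carets created by insertions and do not name a node.
    while parts and int(parts[-1]) % 2 == 0:
        parts.pop()
    return ".".join(parts)

# Text values of the TITLE nodes returned for /BIB/ARTICLE/TITLE.
TITLE_VALUES = {"1.7.3": "XML", "1.9.5": "P2P"}   # "XML" is a made-up value

def articles_with_title(value):
    """Evaluate /BIB/ARTICLE[TITLE=value] on the labels above."""
    return [parent_label(label) for label, text in TITLE_VALUES.items() if text == value]

print(parent_label("1.9.5"))          # 1.9
print(articles_with_title("P2P"))     # ['1.9']
```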

Example 6.4

For twig queries, such as Q6: /BIB/ARTICLE[TITLE="P2P"]//XX, we first answer the path expression /BIB/ARTICLE[TITLE="P2P"] like Q5; then the ancestor-descendant relationship between the resulting nodes and XX can easily be determined by their labels.

Note that CXDLS supports all XPath axes in the same way as the prefix labeling scheme because it is based on ORDPATH labeling, in which we can easily determine the relative order of nodes and the child-parent and ancestor-descendant relationships by a byte comparison of two ORDPATH labels. It is important to note that the compacted labels also retain all such features of ORDPATH labels. For example, in Figure 6.3 article "1.9" is an ancestor of title "1.9.5" since "1.9" is a prefix of "1.9.5". Also the compacted book "1.1.1.5" is an ancestor of publisher, author and title with the respective compacted labels "1.1.1.5.1.1", "1.1.1.5.3.3" and "1.1.1.5.5.5" because book "1.1.1.5" is a prefix.
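The prefix test and the order comparison can be illustrated on dotted labels. The sketch below compares lists of integer components rather than the byte arrays used in the actual storage; the function names are hypothetical.

```python
def components(label):
    """Split a dotted label into its integer components."""
    return [int(c) for c in label.split(".")]

def is_ancestor(ancestor, descendant):
    """True if `ancestor` is a proper component-wise prefix of `descendant`."""
    a, d = components(ancestor), components(descendant)
    return len(a) < len(d) and d[:len(a)] == a

def precedes(first, second):
    """True if `first` comes before `second` in document order."""
    return components(first) < components(second)

print(is_ancestor("1.9", "1.9.5"))              # True
print(is_ancestor("1.1.1.5", "1.1.1.5.3.3"))    # True: compacted book and author
print(is_ancestor("1.1", "1.11.3"))             # False: string prefix, not a label prefix
print(precedes("1.7.3", "1.9.5"))               # True
```

Comparing whole components avoids the pitfall of plain string prefixes, where "1.1" would wrongly appear to be a prefix of "1.11.3".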

6.3.3 Support for Updates to Compacted XML Structures

The update behavior of CXDLS is identical to that of ORDPATH, which guarantees that existing nodes never need to be relabeled: the labels of newly inserted nodes are generated in accordance with the rules described in Section 6.1. In addition, the new nodes or subtrees are compacted if they fulfill the conditions of compaction mentioned in Section 6.2.


Figure 6.9: Element node insertion, and subtree insertion in CXDLS


Figure 6.9 displays different cases of insertions occurring on compacted structures of the XML document in Figure 6.3.

With regard to deletions, we simply mark the corresponding nodes in the structure as deleted, without any relabeling. However, since compacted nodes represent a set of nodes in the uncompacted structure, we must consider deletion cases that might require partial decompaction. For example, if we want to delete the element node specified by the XPath expression BIB/BOOK[2], which selects the node BOOK "1.3", the compacted node BOOK "1.1.1.5" needs decompaction to mark the selected node and any descendant nodes below it as deleted.

6.4 Summary

The approach (CXDLS) [Alkhatib 2009] described in this chapter focused on the separation of content from the structure of an XML document to avoid unnecessary scans of structures or irrelevant data values, coupled with an effective method for XML compacting. The compaction method is based on the exploitation of the similarity of consecutive tags and subtrees in the structure of the XML documents, and it uses the ORDPATH labeling scheme for gathering sufficient structural information from the compacted XML document. CXDLS stores the compacted XML documents in a way that allows fast access and supports both update and query processing of the compacted XML document efficiently and directly. The significant reduction in processing time and storage space is achieved by a combination of XML compacting and the node labeling scheme.

Chapter 7

Experimental Evaluation

Contents

7.1 Experimental Environment
7.2 Data Sets
7.3 Queries
7.4 Experimental Results
    7.4.1 Storage Requirements
    7.4.2 Query Performance
    7.4.3 Update Performance
7.5 Summary

In this chapter, we empirically evaluate the performance of our approaches for compacting, storing, querying and updating XML documents, as proposed in this thesis, and provide comparisons to other existing approaches in this field. The general hypothesis is that they should perform better than those that have previously been developed for this task. To verify the validity of this hypothesis, our approaches must be implemented and measured. To achieve this goal, a comprehensive series of experiments was carried out to measure and compare the performance of our approaches. The experiments concentrate on the storage requirements and the execution times needed to process the queries as well as on the performance of the updates. We also ran a performance comparison between different approaches that have previously been developed for this task.

7.1 Experimental Environment

The proposed approaches were implemented in Java. The environment used to perform the experiments consisted of a PC at 2.00 GHz with 2 GB RAM and an 80 GB hard disk, running the Windows Vista operating system. The XML documents used in the experiments are parsed using SAX (Simple API for XML parsing) [Saxproject ], which is an event-driven API that calls event methods while parsing the document. Compared to other parser models such as DOM [W3C 1998], the SAX parser allows faster parsing, maintains minimal state information and therefore requires little memory even when parsing large documents.
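To illustrate the event-driven style, the following sketch gathers the kind of structural statistics reported in Table 7.1 (number of elements, maximum depth, number of distinct element names) in a single SAX pass. It uses Python's standard `xml.sax` module as a stand-in for the Java SAX API used in the thesis; the class and function names are hypothetical, and counting the root as depth 1 is an assumption.

```python
import xml.sax

class StructureStats(xml.sax.ContentHandler):
    """Collects element count, maximum depth and distinct tag names."""

    def __init__(self):
        super().__init__()
        self.elements = 0
        self.depth = 0
        self.max_depth = 0
        self.tags = set()

    def startElement(self, name, attrs):
        # Called by the parser for every start tag.
        self.elements += 1
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        self.tags.add(name)

    def endElement(self, name):
        # Called by the parser for every end tag.
        self.depth -= 1

def structure_stats(xml_bytes):
    """Return (number of elements, maximum depth, number of distinct tags)."""
    handler = StructureStats()
    xml.sax.parseString(xml_bytes, handler)
    return handler.elements, handler.max_depth, len(handler.tags)

sample = b"<bib><book><title>T1</title></book><article><title>P2P</title></article></bib>"
print(structure_stats(sample))   # (5, 3, 4)
```

Only the current depth and a set of tag names are kept in memory, which is why this style of parsing scales to large documents.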

7.2 Data Sets

We conducted our experiments using a variety of both synthetic and real datasets that covered a variety of sizes, application domains, and document characteristics such as: simple without recursion, complex with a small degree of recursion, complex with a high degree of recursion, small and large sizes. Table 7.1 displays the different structural properties of the datasets used. They are suitable for a comprehensive evaluation of the performance of our approaches. The well-known XML datasets used in our experiments are listed below:

• Shakespeare: The Shakespeare dataset is a set of the plays of William Shakespeare marked up in XML for electronic publication on the Internet. The document contains only element and text nodes. None of the element nodes have attributes. This document contains only minimal redundancy in the textual part, while certain elements can appear in different contexts in the document. The original document was obtained from [Bosak 1999].

• SwissProt: The SwissProt dataset is a curated database stored in the SwissProt XML document, which contains detailed annotation and organization of protein sequences. The tag set in this document is relatively small and there is a significant amount of text content. Attributes are common and tend to have short names and values. It provides a high level of annotations and a minimal level of redundancy. The original document was obtained from the University of Washington's XML repository [Miklau ] and it was created by Amos Bairoch in 1986 at the Department of Medical Biochemistry of the University of Geneva.

• XMARK: This is a synthetic XML dataset from an XML benchmark [Schmidt 2002], which is becoming widely used for its typical and representative data. It is based on an Internet auction application that consists of relatively structured and data-oriented parts. In our experiments, the XML document instances of the XMark benchmark are generated with scaling factors of 0.01, 0.1 and 1 by means of xmlgen of the XML benchmark project.


• TreeBank: The TreeBank dataset is a corpus of encrypted English sentences from The Wall Street Journal, tagged with parts of speech. We obtained the TreeBank XML document from the University of Washington XML repository [Miklau ]. It has a very deep, recursive and totally irregular structure, which makes it an interesting case for experiments.

• TPC-H benchmark: This is the TPC-H benchmark database converted into XML format, which simulates a decision support system or business intelligence database environment. It is designed and maintained by the Transaction Processing Performance Council (TPC). In our experiments, we use the Part, Lineitem, Customer and Orders documents [Miklau ], which are the XML versions of TPC-H benchmark data used widely in the field of relational databases. They have several numerically typed elements to describe the key values.

• Religion datasets: This is a group of four religious works as examples of real documents marked up in XML for electronic publication. The XML documents making up this group are the Old Testament, the New Testament, the Quran, and the Book of Mormon. We obtained these XML documents, which are respectively abbreviated as OT, NT, Quran and BOM, from [Bosak 1998a].

• TOL: This is an XML document containing the ToL tree structure of the Tree of Life web project [MADDISON 2007], which is available to the public for non-commercial use. The structure of the tree is very deep (more than 240 levels) and it is contained in a single XML element, the TREE element. The TREE element contains a single element, NODE, which contains zero or more NODES elements. This pattern repeats itself out to the tips of the current tree structure of the ToL web project [web project 1998].

• Mondial: This is the XML version of the CIA World Factbook, which is a publicly available database created and maintained by the Central Intelligence Agency (CIA). Mondial contains comprehensive statistical information on every country and territory in the world. It has been made available by the University of Goettingen as part of their Mondial database project [May 1999].

• Medline: MEDLINE (Medical Literature Analysis and Retrieval System Online) is a commercial bibliographic database in the biomedical and medical domain. It contains a large number of citations and author abstracts. MEDLINE is available in XML format to subscribers [NLM ]. For our experiments, three XML documents were chosen arbitrarily among the files of Medline 2006. Their sizes and structural characteristics are shown in Table 7.1.

• Nasa dataset: This contains astrophysics data obtained from the UW Database [Miklau ].

• NCBI dataset: This is XML-formatted biological data downloaded from the National Center for Biotechnology Information (NCBI) Web site [NCBI ].

7.3 Queries

In order for our test to be a fair representation of real query performance, we evaluated the query performance of our proposed approaches for several types of queries. However, the current versions of our approaches are limited to XPath 1.0 queries. Due to this limitation, we chose a set of queries that are compatible with the XPath subset supported by our approaches. We used the XMark and Shakespeare datasets and ran different kinds of queries, such as short queries, long queries, simple twig queries and complex twig queries. Tables A.1, A.2 and A.3 list all the queries used in the experiments, where the first character in a query name indicates the dataset on which the query is executed: "X" denotes XMark, "S" is for Shakespeare.

7.4 Experimental Results

Having described the experimental environment, I shall begin my discussion of the experimental results by presenting the results of the three sets of experiments that were conducted, namely, storage requirements, query performance and update performance for all our approaches in comparison with different approaches that have previously been developed for those tasks.

7.4.1 Storage Requirements

The storage requirements for XML documents are usually measured by the size of the storage structures used to represent the documents. The typical data model presents an XML document as a tree structure, in which each node has a name pointer and three further pointers: to its parent, to its first child, and to its next sibling.


Table 7.1: XML datasets used in the experiments

Dataset  File name        Topic                          Size       No. of elements  Max depth  No. of distinct elements
D1       Mondial          Geographical database          1,77 MB    22423            6          23
D2       OT               Religion                       3,32 MB    25317            6          21
D3       NT               Religion                       0,99 MB    8577             6          15
D4       Quran            Religion                       897 KB     6709             6          16
D5       BOM              Religion                       1,47 MB    7656             6          22
D6       XMark            XML benchmark                  113 MB     1666315          12         74
D7       NCBI             Biological data                427,47 MB  2085385          5          24
D8       SwissProt        DB of protein sequences        112 MB     2977031          6          84
D9       Medline02n0378   Bibliography medicine science  120 MB     2790422          8          78
D10      medline02n0001   Bibliography medicine science  58,13 MB   1895193          8          48
D11      Part             TPC-H benchmark                6,02 MB    200001           4          11
D12      Lineitem         TPC-H benchmark                30,7 MB    1022976          4          18
D13      Customer         TPC-H benchmark                5,14 MB    135001           4          10
D14      Orders           TPC-H benchmark                5,12 MB    150001           4          11
D15      medline02n0078   Bibliography medicine science  38,71 MB   1079702          8          67
D16      TOL              Organisms on Earth             5,36 MB    80057            243        4
D17      Nasa             Astronomical data              24,4 MB    476646           9          61
D18      Shakespeare      Shakespeare's plays            7,53 MB    179727           9          59
D19      XMark            XML benchmark                  1,12 MB    17132            12         74
D20      XMark            XML benchmark                  11,1 MB    167865           12         74
D21      XMark            XML benchmark                  1.09 GB    16703210         12         74
D22      Treebank         Wall Street Journal            85,4 MB    2437666          36         250
D23      medline01660167  Bibliography medicine science  196 MB     5123499          9          75

7.4.1.1 Storage Requirements of SCQX

In our first set of experiments we measure the storage requirements of the level-order labeling scheme used in SCQX and examine the impact of SCQX's compaction method (described in Chapter 4) on the storage needed for the compacted structure of an XML document, where a name pointer, a parent pointer and a cardinality are stored for each node of the compacted XML structure in the storage structures of SCQX. We also compare the storage requirements of SCQX with those of the tree structure representation. As shown in Figures 7.1, 7.2, 7.3, 7.4 and 7.5, the storage requirements of SCQX are much smaller than those of the level-order labeling scheme without compaction and those of the tree representation for all the datasets. These results confirm that the proposed method (SCQX) has a dramatic effect on reducing the storage size of XML documents.

Figure 7.1: Comparison of storage requirements for SCQX, Level-Order LS and Tree


7.4.1.2 Storage Requirements of CXQU

The next experiments measure the storage requirements of the cluster labeling scheme proposed in Chapter 5 and test the impact of the new CXQU compaction method on reducing the storage size needed to store the XML documents. The results shown in Figures 7.6, 7.7, 7.8, 7.9 and 7.10 indicate that the compaction method of CXQU reduces the storage requirements effectively.

Figure 7.6: Comparison of storage requirements for CXQU, Cluster LS and ORDPATH


7.4.1.3 Storage Requirements of CXDLS

The following experiments are conducted on the method described in Chapter 6 to determine the influence of its compaction method on reducing storage requirements. We use different XML datasets to test this influence and, in order to show the importance of such a compaction method, we also compare these results with the storage requirements of the ORDPATH labeling scheme used to store the XML documents. It can be seen clearly from Figures 7.11, 7.12, 7.13, 7.14 and 7.15 that this method improves performance significantly in terms of storage space consumption for almost all the datasets. It can be observed that the storage requirements are very small for documents such as Part, Lineitem, Orders and Customer because they have a regular structure. At the same time, the storage requirements are still relatively small for other documents that have either an irregular structure or a less regular structure.

Figure 7.11: Comparison of storage requirements for CXDLS and ORDPATH


7.4.1.4 Comparison of storage requirements for all proposed approaches vs. competitive approaches

The last group of experiments compares our approaches with other labeling schemes, such as OrdPath, Dewey and pre/post labeling, in terms of label storage requirements. The aim of this experiment is to show the advantages of our approaches for compacting XML structures. To carry out a fair comparison of these labeling schemes, we use the same prefix-free encoding (see Figure 5.5 on page 55) for all the prefix labeling schemes mentioned. Also, we do not store the pre labels in the pre/post labeling because they can be implicitly given by the array position. The results of this experiment, shown in Figures 7.16, 7.17, 7.18, 7.19, 7.20 and 7.21, demonstrate that our approaches can dramatically reduce the storage requirements for various XML datasets when compared to other existing approaches in this field.

Figure 7.16: Comparison of storage requirements for different approaches



7.4.2 Query Performance

The second set of experiments is performed to examine the impact of the proposed methods on the query performance. We executed performance measurements that compare MonetDB/XQuery [Boncz 2006b] with our methods for all query sets presented in Tables A.1, A.2 and A.3. The experiments focus on the pure query evaluation time, excluding the time for parsing, compiling and optimizing the queries as well as serialization times. To gain a better insight into the query performance of all approaches, for each query we recorded the average of the elapsed time over 10 repetitions as the result. Figures 7.22, 7.23, 7.24, 7.25, 7.26, 7.27, 7.28, 7.29, 7.30 and 7.31 contain the results of our performance measurements (elapsed time in milliseconds) and show the effectiveness of our proposed methods in query processing.
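The measurement procedure, averaging the elapsed evaluation time over 10 repetitions, can be sketched as a small harness. This is a generic illustration and not the harness used for the reported figures; `run_query` is a hypothetical callable that evaluates one already parsed and compiled query without serializing its result.

```python
import time

def average_elapsed_ms(run_query, repetitions=10):
    """Average wall-clock time of run_query() in milliseconds."""
    total = 0.0
    for _ in range(repetitions):
        started = time.perf_counter()
        run_query()                      # pure evaluation only
        total += time.perf_counter() - started
    return total / repetitions * 1000.0

# Usage with a stand-in workload in place of a real query evaluation.
print(average_elapsed_ms(lambda: sum(range(10000))))
```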


Figure 7.22: Query performance of our approaches vs. the MonetDB/XQuery system (XMark queries)




Figure 7.27: Query performance of our approaches vs. the MonetDB/XQuery system (Shakespeare queries)


Table 7.2: Update queries for Shakespeare

Name  Queries
USQ1  SHAKESPEARE/ALL_WELL/PLAY/TITLE
USQ2  SHAKESPEARE/DREAM/PLAY//LINE/STAGEDIR[text()="Sings"]
USQ3  SHAKESPEARE/HEN_IV_2/PLAY/ACT/EPILOGUE

7.4.3 Update Performance

SCQX, described in Chapter 4, uses a level-order labeling scheme, which does not support updates efficiently, whereas the CXQU and CXDLS approaches, described in Chapters 5 and 6, use labeling schemes that adapt to updates. The last set of experiments reported in this section therefore deals only with the efficiency of the CXQU and CXDLS approaches in terms of updates on compacted XML documents, which is expected to be one of the primary benefits of these approaches. In order to evaluate update performance, an insertion experiment was set up to measure update processing time by inserting nodes and subtrees at different positions of synthetic and real datasets.

We measured the time for single node insertions at different positions of the Hamlet XML document, which is a play by Shakespeare. Hamlet has 5 ACTs. We add a new ACT before ACT[1], between ACT[4] and ACT[5], and after ACT[5], using "//ACT" as an XPath expression with a predicate that selects the target node. We also measured the time for subtree insertions at different positions of the SHAKESPEARE XML document. We added a subtree at three insertion positions in SHAKESPEARE, selected by the XPath expressions shown in Table 7.2. This subtree consists of 73 element nodes and 52 text nodes and has 6 levels. Another insertion experiment was conducted on XMark to measure the update processing time by inserting nodes and a subtree at different positions using the insertion queries shown in Table 7.3. Figures 7.32, 7.33, 7.34 and 7.35 compare the average insertion times (in milliseconds) of the CXQU and CXDLS approaches with MonetDB/XQuery [Boncz 2006b].


Table 7.3: Update queries for XMark

Name  Queries
UXQ1  /site/regions/africa
UXQ2  //open_auction[2]
UXQ3  //item[40]
UXQ4  /site/closed_auctions

Figure 7.32: The performance of inserting a subtree to Shakespeare

Figure 7.33: The performance of inserting a node to Hamlet


Figure 7.34: The performance of inserting a node to XMark

Figure 7.35: The performance of inserting a subtree to XMark


7.5 Summary

Many observations can be made from the results of our experiments. Some of these observations are described as follows:

In terms of storage requirements, the efficiency of the labeling schemes used in our approaches compared with other labeling schemes depends on the data. However, only the pre/post labeling scheme can compete with our approaches in some rare cases. The other labeling schemes have larger storage requirements for almost all datasets in comparison to our approaches.

Thanks to the proposed compaction methods, CXQU and CXDLS incur the smallest storage requirements among all labeling schemes for almost all datasets, particularly those with regular structures, although insert-friendly prefix-based schemes are used, in which long labels are assigned to elements at a deep level.

As we have seen above in the comparison of different approaches, our work indicates that it is possible to provide significant benefits in terms of enhanced query and update performance by reducing the storage requirements of XML documents and using labeling scheme techniques. Although MonetDB/XQuery currently provides remarkably good performance for querying and updating XML documents, our approaches can compete with MonetDB/XQuery and are even superior to it in many cases, since they basically support simple and efficient computations for all kinds of structural relationships. Therefore our approaches represent suitable strategies for querying and updating compacted XML documents directly and are able to offer significant improvements to performance.

Chapter 8

Conclusions and Future Work

Contents

8.1 Conclusions
8.2 Future Work

8.1 Conclusions

The ability to efficiently manage XML data is essential due to the potential benefits of using XML as a representation method for any kind of data. However, XML is by nature verbose, which raises doubts as to its efficiency as a data representation format. The use of native XML databases would be the best alternative, as they would be specifically designed to handle XML.

In this thesis, we have presented different techniques and algorithms that have been developed especially to deal with XML. We have seen how we were able to manage XML data efficiently and minimize storage requirements. I have exploited three main ideas to improve the performance of XML. These ideas have proven beneficial in this thesis and may also be helpful in other work in this field.

• First, the separation of the XML structure from its data values is the key to improving performance. As we know, most query evaluation tends to focus on the XML structure with only localized access to the data values. Therefore this separation typically means that queries scan less data, thereby improving query-processing performance.

• Second, XML compaction can be useful in reducing the required storage space, because the self-describing nature of XML makes it verbose and its storage requirements often excessive.

• Third, XML labeling schemes are the most beneficial technique to facilitate XML processing when XML data is stored in databases. They gather structural information from XML documents by assigning unique codes to the nodes in the tree and storing them in such a way that the hierarchical structure of the XML documents is preserved. There is no need to access the actual documents themselves. The labeling schemes allow a quick identification of the structural relationships between nodes by performing a comparison of their labels in constant time. This identification plays a crucial role in efficient XML processing.

In the first part of this thesis, we demonstrated how XML documents can be compacted efficiently using our proposed SCQX approach. SCQX compacts the XML structure without losing any information by exploiting repetitive consecutive tags, and it relies on a particular level-order XML labeling scheme to support query processing. SCQX also stores the compacted XML structure and the data separately in a robust storage structure.

In the second part, we proposed CXQU, an efficient XML compaction and labeling method. CXQU stores the compacted XML structure separately from the data. It combines a compaction technique, which reduces the size of the structure of an XML document by exploiting repetitive consecutive tags, with a new labeling scheme that supports efficient query processing and updates on compacted XML structures.

In the third part, we proposed CXDLS, which combines the strengths of labeling and compaction techniques. It exploits repetitive consecutive subtrees and tags to compact the structure of XML documents, taking advantage of the ORDPATH labeling scheme to represent the XML structure and to retain all relationships between the nodes after compaction. CXDLS also stores the compacted structure separately from the data values and supports both update and query processing efficiently. It can reduce storage space dramatically, especially for XML documents with a regular structure.

In the last part, experiments with synthetic and real-life datasets showed that our approaches improve performance significantly in terms of storage space consumption, query processing time, and update execution time.
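The two ideas shared by the approaches summarized above, collapsing runs of repeated consecutive tags and keeping text values apart from the structure, can be illustrated with a deliberately simplified sketch. It is not SCQX, CXQU, or CXDLS: it only run-length encodes consecutive sibling leaves with the same tag and ignores labeling, attributes, and repeated subtrees. The function name `compact` and the output layout are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def compact(node, values):
    """Return a (tag, count, children) entry for node; text goes into values."""
    text = (node.text or "").strip()
    if text:
        values.append(text)  # data values are kept in a separate container
    children = []
    # Group consecutive children by tag; only runs of leaf elements are merged.
    for tag, run in groupby(node, key=lambda child: child.tag):
        run = list(run)
        if all(len(child) == 0 for child in run):
            for child in run:
                compact(child, values)  # still collects each leaf's text value
            children.append((tag, len(run), []))
        else:
            for child in run:
                children.append(compact(child, values))
    return (node.tag, 1, children)


# Hand-written toy document (an assumption for illustration only).
doc = ET.fromstring(
    "<people><person>"
    "<name>Ann</name><phone>1</phone><phone>2</phone><phone>3</phone>"
    "</person></people>"
)
values = []
structure = compact(doc, values)
print(structure)
print(values)
```

In this toy layout the three `phone` elements occupy a single structure entry with a count of 3, while their text values remain individually available. Reconnecting a value to its node is exactly what a labeling scheme is needed for, which the sketch leaves out.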

In summary, we have addressed several important research issues in improving XML performance by managing XML data efficiently and minimizing its storage requirements. We have exploited three main ideas, namely the separation of structure and data, XML labeling schemes, and XML compaction techniques. Our primary focus has been on XML documents with differing types and structural properties: large, static, dynamic, regular, less regular, and irregular.

Our research presents approaches for efficiently managing different types of XML documents. Each approach achieves management and compaction by using a different method for a specific type of XML document.

The first approach is based on a fixed labeling scheme and targets static XML documents. It can effectively compact static XML documents and query the compacted result. It can also be used for non-static XML documents, but relabeling after updates takes comparatively longer.

Our second approach, on the other hand, targets frequently updated XML documents by using the cluster labeling scheme. It exploits the structural properties and compacts the document significantly. The second approach works well for less regular or irregular structures, because it compacts the repetitive consecutive tags in the structure even if they do not have exactly the same substructure.

Lastly, our third approach is specifically designed to compact XML documents with regular structures. It also targets frequently updated XML documents by using dynamic labeling schemes, and it exploits the similarity of consecutive tags and subtrees in the structure of the XML documents.
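The difference between a fixed and a dynamic labeling scheme under insertion can be sketched as follows. The first function counts how many labels a plain preorder numbering must rewrite. The second derives a label between two adjacent siblings in the spirit of ORDPATH, where odd ordinals label nodes and an even ordinal acts as a caret. This is a simplified illustration that only handles two sibling labels sharing the same parent prefix; the function names are hypothetical, and the code is not the labeling implementation used by our approaches.

```python
def relabel_count_preorder(num_nodes, insert_position):
    """Nodes whose zero-based preorder number shifts after an insertion."""
    return num_nodes - insert_position


def label_between(left, right):
    """Return a label that sorts strictly between two sibling labels.

    Labels are tuples of integers. Both inputs must share the same parent
    prefix and end in odd ordinals (a simplifying assumption).
    """
    prefix, lo, hi = left[:-1], left[-1], right[-1]
    if right[:-1] != prefix or lo % 2 == 0 or hi % 2 == 0 or lo >= hi:
        raise ValueError("expected two ordered sibling labels with odd ordinals")
    if hi - lo > 2:
        return prefix + (lo + 2,)   # a free odd ordinal exists between them
    return prefix + (lo + 1, 1)     # caret in: even ordinal, then an odd one


siblings = [(1, 1), (1, 3), (1, 5)]
new_label = label_between(siblings[1], siblings[2])
ordered = sorted(siblings + [new_label])  # tuple order matches document order here

print(new_label)
print(relabel_count_preorder(1000, 10))
```

No existing sibling label changes when the new node is added, whereas the preorder numbering has to shift every node that follows the insertion point.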

8.2 Future Work

There are many interesting open research issues that could build on the results and ideas developed in this thesis.

• One possible direction for future work is to further improve the performance of our approaches by extending them with additional index structures and query processing algorithms.

• Complex operators such as aggregation and join, and complex expressions such as those present in XQuery, are not considered in this work. Evaluating XQuery on compacted XML documents raises a set of open research questions for further improvement.

• A further avenue for future research is to explore new labeling schemes that could achieve an even more compact representation of XML. As shown in the second approach, the cluster labeling scheme played an important role in decreasing the storage requirements, whereas in the third approach the compaction method was key to minimizing them. We therefore plan to combine the ideas of the second and third approaches; the combination may perform better than the previous approaches.


• Another possible direction is to continue our investigation of variations of binary encoding forms. This could offer an extra degree of freedom for query optimization and provide opportunities to further reduce storage costs.

• Compaction of large XML documents in a parallel and distributed environment is also an exciting direction for future work. Storing XML documents in a distributed database provides the basic infrastructure and motivation for carrying out the compaction process on parts of one XML document in a parallel and/or distributed manner.

• Evaluating the ability to update XML documents is crucial for assessing update performance. Currently, however, there is no XML update benchmark that could measure the impact of our approach, particularly in the case of deletions. In our approach, deleted nodes are only marked as deleted, and it is uncertain how these marked deletions affect our approach. This area needs to be explored and presents a potentially exciting field for future research.
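To make the open question about deletions concrete, the following sketch models deletion by marking, under the assumption that each stored node carries a boolean flag. The class and method names are hypothetical and do not describe the actual storage layout of our approach. The point is only that marked nodes keep occupying space and must be skipped by every scan, so the share of dead entries is one quantity a future update benchmark could report.

```python
class NodeStore:
    """Toy label store where deletion only sets a flag (a tombstone)."""

    def __init__(self, labels):
        self.entries = [{"label": label, "deleted": False} for label in labels]

    def delete(self, label):
        """Mark the entry as deleted; nothing is physically removed."""
        for entry in self.entries:
            if entry["label"] == label:
                entry["deleted"] = True

    def scan(self):
        """Return live labels; marked entries are still read but skipped."""
        return [e["label"] for e in self.entries if not e["deleted"]]

    def dead_fraction(self):
        """Share of stored entries that are marked as deleted."""
        dead = sum(1 for e in self.entries if e["deleted"])
        return dead / len(self.entries) if self.entries else 0.0


store = NodeStore(["1", "1.1", "1.3", "1.5"])
store.delete("1.3")
print(store.scan())
print(store.dead_fraction())
```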

Appendix A

The Query Sets Used

Table A.1: The Query set for XMark

QueryId  Queries
XQ1      site/regions
XQ2      site/closed-auctions/closed-auction
XQ3      site/people/person
XQ4      site/regions/europe/item
XQ5      site/open-auctions/open-auction/initial
XQ6      site/regions/asia/item/payment
XQ7      site/regions/namerica/item/name
XQ8      site/closed-auctions/closed-auction/annotation/description/parlist/listitem
XQ9      site/regions/africa/item/description/parlist/listitem/text/keyword
XQ10     site/regions/australia/item//parlist/listitem/text/emph/keyword
XQ12     site//person
XQ13     site/closed-auctions//emph
XQ14     //item/payment
XQ15     //category/description/parlist/listitem
XQ16     //closed-auction//parlist
XQ17     //item//mail
XQ18     //annotation//listitem//text
XQ19     //closed-auction//description//keyword
XQ20     site//closed-auction//description/parlist/listitem/parlist//text/emph/keyword
XQ21     site//closed-auction//description/parlist/listitem/parlist//keyword
XQ22     site/regions/europe/item/*
XQ23     site/*/person
XQ24     site/closed-auctions/closed-auction/annotation/description/parlist/*/parlist/listitem/text/*/keyword


Table A.2: The Query set for XMark

QueryId  Queries
XQ11     //item
XQ25     site//item//mailbox/*
XQ26     //closed-auction//parlist/*
XQ27     //closed-auction//*//keyword
XQ28     //*//description//listitem/*
XQ29     site/*//description/*/listitem/parlist//text//*/keyword
XQ30     site//*/description/parlist/listitem/*//keyword
XQ31     site/regions[asia]//item//name
XQ32     site/closed-auctions/closed-auction[annotation/description[parlist/listitem/text[keyword[bold]]]]/price
XQ33     site/people[person[profile[education]/age]]/person/phone
XQ34     //australia/item[payment='Cash']
XQ35     //address[zipcode='16']
XQ36     //africa//item[location = 'United States']
XQ37     site/people/person/profile[education='College']
XQ38     //item[payment = 'Creditcard']
XQ39     site/people/person[profile/business = 'Yes']
XQ40     //person[profile/business = 'Yes']
XQ41     site/closed-auctions/closed-auction[price >= '40']
XQ42     site//closed-auction//description/parlist/listitem/parlist//keyword[text()='sleeping']
XQ43     site/regions/europe/item/location[text()='Latvia']
XQ44     site//open-auctions//initial[text()='78.57']
XQ45     //initial[text()='78.57']
XQ46     //person/address/city[text()='Orange']
XQ47     //regions//name/text()
XQ48     site/regions/namerica/item/name/text()
XQ49     site/open-auctions/open-auction/bidder/increase/text()


Table A.3: The Query set for Shakespeare

QueryId  Queries
SQ1      /SHAKESPEARE/A-AND-C/PLAY/ACT/SCENE/SPEECH/SPEAKER
SQ2      /SHAKESPEARE/TAMING/PLAY//SCENE//SPEAKER
SQ3      //SPEAKER
SQ4      //FM/P
SQ5      //SCENE/STAGEDIR
SQ6      //PROLOGUE/STAGEDIR
SQ7      //PLAY/INDUCT/SPEECH/SPEAKER
SQ8      //PLAY/ACT/SCENE/SPEECH/SPEAKER
SQ9      //PLAY/ACT/SCENE/SPEECH/LINE/STAGEDIR
SQ10     //SCENE//TITLE
SQ11     //ACT//TITLE
SQ12     //PLAY//EPILOGUE/STAGEDIR
SQ13     //PLAY//SCENE//STAGEDIR
SQ14     //PLAY/*/*
SQ15     //PLAY/ACT[2]
SQ16     //PLAY/ACT[4]
SQ17     //PLAY/ACT/SCENE/SPEECH[2]
SQ18     //PLAY/ACT/SCENE/*[2]
SQ19     //PLAY/INDUCT/SPEECH[SPEAKER='RUMOUR']
SQ20     //PLAY/ACT/EPILOGUE/SPEECH[SPEAKER='KING']
SQ21     //PROLOGUE/SPEECH[SPEAKER='Chorus']
SQ22     //SPEECH[LINE='Amen.']
SQ23     //LINE[STAGEDIR='Awaking']
SQ24     //LINE[STAGEDIR='Aside']
SQ25     //PLAY/ACT/SCENE/SPEECH[SPEAKER ='ANTONY']
SQ26     //PERSONAE/PGROUP[GRPDESCR='senators.']
SQ27     //SPEECH[SPEAKER='PHILO']/LINE
SQ28     //PLAY/ACT/SCENE/SPEECH[SPEAKER='Steward']/LINE
SQ29     /SHAKESPEARE//PLAY[TITLE = 'The Tragedy of Antony and Cleopatra']//PERSONA
SQ30     //*[TITLE = 'The Tragedy of Antony and Cleopatra']/ACT//LINE
SQ31     //*[TITLE = 'The Tragedy of Antony and Cleopatra']//PERSONA
SQ32     ////SPEECH/SPEAKER[text() = 'Mark ANTONY']//LINE
SQ33     /SHAKESPEARE/A-AND-C/PLAY/ACT/SCENE/SPEECH/SPEAKER[text() != 'ANTONY']
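For readers who want to try queries of this kind, the sketch below evaluates a few path and predicate queries in the style of XQ4, XQ14, XQ22, and XQ34 against a tiny hand-written document using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The sample document is invented for illustration and is not XMark data. The wrapper element `doc` is an assumption that lets paths start with `site`, and `//x` is written as `.//x`, as that module requires.

```python
import xml.etree.ElementTree as ET

# Invented miniature document; element names follow the queries above.
SAMPLE = """
<site>
  <regions>
    <europe>
      <item><name>clock</name><payment>Cash</payment></item>
      <item><name>lamp</name><payment>Creditcard</payment></item>
    </europe>
    <australia>
      <item><name>boomerang</name><payment>Cash</payment></item>
    </australia>
  </regions>
</site>
"""

# Wrap the root so that relative paths beginning with "site" can match it.
doc = ET.Element("doc")
doc.append(ET.fromstring(SAMPLE.strip()))

queries = {
    "XQ4": "site/regions/europe/item",
    "XQ14": ".//item/payment",
    "XQ22": "site/regions/europe/item/*",
    "XQ34": ".//australia/item[payment='Cash']",
}

# Number of matching elements per query.
counts = {qid: len(doc.findall(path)) for qid, path in queries.items()}
for qid, count in counts.items():
    print(qid, count)
```

Queries that use `text()` steps or comparison operators other than equality fall outside the subset this module implements and would need a full XPath processor.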
