XML Clustering and Its Application to XML Transformation · 2013-11-13
XML Clustering and Its Application to XML
Transformation
Tien Tran
Discipline of Computer Science
Faculty of Science and Technology at
Queensland University of Technology
Brisbane, Australia
Principal Supervisor: Dr. Richi Nayak
Associate Supervisor: Professor Peter Bruza
Abstract
The continuous growth of XML data poses a great concern in the area of XML data
management. The need for processing large amounts of XML data brings complications
to many applications, such as information retrieval, data integration and many others.
One way of simplifying this problem is to break the massive amount of data into smaller
groups by application of clustering techniques. However, XML clustering is an intricate
task that may involve the processing of both the structure and the content of XML data
in order to identify similar XML data.
This research presents four clustering methods, two methods utilizing the structure of
XML documents and the other two utilizing both the structure and the content. The two
structural clustering methods have different data models. One is based on a path model and the other on a tree model. These methods employ rigid similarity measures which aim to identify corresponding elements between documents with different or similar underlying structures.
The two clustering methods that utilize both the structural and content information vary
in terms of how the structure and content similarity are combined. One clustering method
calculates the document similarity by using a linear weighting combination strategy of
structure and content similarities. The content similarity in this clustering method is
based on a semantic kernel. The other method calculates the distance between documents
by a non-linear combination of the structure and content of XML documents using a
semantic kernel.
Empirical analysis shows that the structure-only clustering method based on the tree
model is more scalable than the structure-only clustering method based on the path model
as the tree similarity measure for the tree model does not need to visit the parents of an
element many times. Experimental results also show that the clustering methods perform
better with the inclusion of the content information on most test document collections.
To further the research, the structural clustering method based on the tree model is extended and employed in XML transformation. The results from the experiments show that the proposed transformation process is faster than the traditional transformation system that translates and converts the source XML documents sequentially. Also, the schema matching process of XML transformation produces a better matching result in a shorter time.
Table of Contents
Abstract ii
List of Figures viii
List of Tables x
Statement of Original Authorship xi
Acknowledgements xii
Chapter 1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Research Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 2 Background and Related Work 13
2.1 XML Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 XML Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Structure-based Clustering . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1.1 Tree-based Approaches . . . . . . . . . . . . . . . . . . . . 22
2.2.1.2 Path-based Approaches . . . . . . . . . . . . . . . . . . . . 26
2.2.1.3 Graph-based Approaches . . . . . . . . . . . . . . . . . . . 28
2.2.2 Content-based Clustering . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.2.1 Feature Reduction . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2.2 Semantic Kernel . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.3 Content and Structure-based Clustering . . . . . . . . . . . . . . . . 35
2.2.3.1 Non-Linear Approaches . . . . . . . . . . . . . . . . . . . . 36
2.2.3.2 Linear Approaches . . . . . . . . . . . . . . . . . . . . . . . 38
2.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3 XML Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.1 Schema Matching Approaches . . . . . . . . . . . . . . . . . . . . . . 42
2.3.1.1 Schema-Matching Systems . . . . . . . . . . . . . . . . . . 44
2.3.1.2 Schema Matching for XML Clustering . . . . . . . . . . . . 46
2.3.1.3 Schema Matching for Transformation Approaches . . . . . 47
2.3.2 Transformation Approaches . . . . . . . . . . . . . . . . . . . . . . . 48
2.3.2.1 XSLT for XML Transformation . . . . . . . . . . . . . . . 49
2.3.2.2 Other Manipulation Languages for XML transformation . . 50
2.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Chapter 3 The Proposed Clustering Methods 53
3.1 The Proposed Clustering Methods: Overview . . . . . . . . . . . . . . . . . 54
3.2 The Structure-Only Clustering Methods . . . . . . . . . . . . . . . . . . . . 55
3.2.1 The XCTree Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1.1 The Tree Model . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1.2 The Tree Similarity Measure: TSim . . . . . . . . . . . . . 58
3.2.2 The XCPath Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.2.1 The Path Model . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.2.2 The Path Similarity Measure: CPSim . . . . . . . . . . . . 65
3.3 The Content and Structure-based Clustering Methods . . . . . . . . . . . . 68
3.3.1 The XCLComb Method . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3.1.1 The Tree Model and The Text Vector Model . . . . . . . . 70
3.3.1.2 The Linear Similarity Measure: LCSim . . . . . . . . . . . 70
3.3.2 The XCTPath Method . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.2.1 The Text-Path Vector Model . . . . . . . . . . . . . . . . . 71
3.3.2.2 The Non-Linear Measure: TPVSim . . . . . . . . . . . . . 73
3.3.3 The Kernel Construction Approach . . . . . . . . . . . . . . . . . . . 74
3.4 The Hybrid Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4.1 The Incremental Clustering Stage . . . . . . . . . . . . . . . . . . . 79
3.4.2 The Iteration Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4.3 The Pair-wise Clustering Stage . . . . . . . . . . . . . . . . . . . . . 84
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Chapter 4 Empirical Evaluation of the Clustering Methods 86
4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2 Data Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.3 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3.1 Purity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.3.2 Normalized Mutual Information . . . . . . . . . . . . . . . . . . . . 94
4.3.3 F1-Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.5 Results of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.5.1 Analysing the Structure-only Clustering Methods . . . . . . . . . . . 99
4.5.1.1 Clustering Threshold . . . . . . . . . . . . . . . . . . . . . 99
4.5.1.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5.1.3 Path Threshold . . . . . . . . . . . . . . . . . . . . . . . . 102
4.5.1.4 Three Stages of the Hybrid Clustering Algorithm . . . . . 105
4.5.1.5 Methods Comparison . . . . . . . . . . . . . . . . . . . . . 108
4.5.2 Analysing the Content and Structure-based Clustering Methods . . 110
4.5.2.1 Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5.2.2 Weighting in the XCLComb method. . . . . . . . . . . . . 113
4.5.2.3 Content-Only Comparison . . . . . . . . . . . . . . . . . . 114
4.5.2.4 Path Length in the XCTPath Method . . . . . . . . . . . . 115
4.5.2.5 Content and Structure-based Methods Comparison. . . . . 117
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Chapter 5 XML Transformation Approach 124
5.1 The XML Transformation Approach: Overview . . . . . . . . . . . . . . . . 125
5.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.3 Element Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.3.1 Discovery of Corresponding Leaf Elements . . . . . . . . . . . . . . . 132
5.3.2 Discovery of All Corresponding Elements . . . . . . . . . . . . . . . 135
5.4 Transformation Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.5 XSLT Script Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.6 Results of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.6.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.6.2 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.6.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.6.4 Element Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Chapter 6 Conclusion 152
6.1 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.2 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.3 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Chapter 7 Appendix 157
7.1 DTD Definitions of the Data Collections for XML Clustering Methods . . . 157
7.2 DTD definitions for the XML Transformation Approach . . . . . . . . . . . 157
Publications 164
Bibliography 165
List of Figures
1.1 The current approach for XML transformation process. . . . . . . . . . . . 7
1.2 The proposed approach for XML transformation process. . . . . . . . . . . 8
2.1 The classification of XML data. . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 An example of a conference XML document . . . . . . . . . . . . . . . . . . 16
2.3 An example of a conference DTD definition . . . . . . . . . . . . . . . . . . 16
2.4 An example of a conference XSD definition . . . . . . . . . . . . . . . . . . 17
2.5 A generic XML data clustering process . . . . . . . . . . . . . . . . . . . . . 19
2.6 Tree representation of the XML document structure . . . . . . . . . . . . . 23
2.7 Complete paths extracted from the tree model in Figure 2.6. . . . . . . . . 26
2.8 Graph representation of an XML definition . . . . . . . . . . . . . . . . . . 29
2.9 Bipartite graph representation . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.10 The VSM model with term frequency . . . . . . . . . . . . . . . . . . . . . 32
2.11 The classification of the XML clustering approaches for XML data. . . . . . 40
2.12 The transformation process for XML data. . . . . . . . . . . . . . . . . . . . 42
3.1 An overview of the proposed clustering methods. . . . . . . . . . . . . . . . 55
3.2 An example of a tree structure (a) and its corresponding summary tree
structure in depth-first string tree encoding format (b). . . . . . . . . . . . 57
3.3 An example of the treeMatching algorithm from tx to ty. . . . . . . . . . . . 63
3.4 An example of the treeMatching algorithm from ty to tx. . . . . . . . . . . . 64
3.5 CNC matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.6 An example of a conference XML document . . . . . . . . . . . . . . . . . . 73
3.7 The hybrid XML clustering approach overview . . . . . . . . . . . . . . . . 78
4.1 The effect of the clustering threshold on the XCTree and XCPath methods. 100
4.2 The processing time of the structure-only clustering methods. . . . . . . . . 102
4.3 The effect of the path thresholds with the clustering threshold of 0.9. . . . . 104
4.4 The effect of the path threshold with different clustering thresholds on the
XCPath method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5 The accuracy of the clustering solution at the three stages of the XCTree
method at clustering threshold 0.9. . . . . . . . . . . . . . . . . . . . . . . . 107
4.6 The accuracy of the clustering solution at the three stages of the XCPath
method at the clustering threshold of 0.9 and path threshold of 0.7. . . . . 108
4.7 The comparison of different structure-only clustering methods. . . . . . . . 109
4.8 The sensitivity of the k value on the kernel. . . . . . . . . . . . . . . . . . . 112
4.9 The effect of the lambda of the XCLComb method. . . . . . . . . . . . . . . 114
4.10 The comparison of the different content clustering methods. . . . . . . . . . 116
4.11 The comparison of the different path length of the XCTPath method. . . . 117
4.12 The comparison of the clustering methods utilizing semantic kernel. . . . . 118
4.13 The comparison of all methods on the Niagara collection. . . . . . . . . . . 121
4.14 The comparison of all methods on the Publication collection. . . . . . . . . 121
4.15 The comparison of all methods on the DBLP collection. . . . . . . . . . . . 122
4.16 The comparison of all methods on the IEEE collection. . . . . . . . . . . . . 122
5.1 The XCTrans approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.2 An example of source document structures in the same cluster. . . . . . . . 130
5.3 An example of a source summary structure format. . . . . . . . . . . . . . . 130
5.4 An example of a target structure definition represented in a tree formats. . 131
5.5 Element mapping algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.6 Element mapping result. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.7 An example of an XSLT Script. . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.8 XML transformation process time on the dataset. . . . . . . . . . . . . . . . 147
5.9 The processing time in seconds in relation to the number of documents in
the DBLP collection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.10 The processing time in seconds with the different numbers of clusters on
the DBLP collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.11 The mapping accuracy based on recall measure. . . . . . . . . . . . . . . . . 149
5.12 The mapping accuracy based on precision measure. . . . . . . . . . . . . . . 149
7.1 An example of the IEEE article DTD definition . . . . . . . . . . . . . . . . 158
7.2 An example of the Berkeley article DTD definition . . . . . . . . . . . . . . 158
7.3 An example of the HCI article DTD definition . . . . . . . . . . . . . . . . 159
7.4 An example of the DBLP article DTD definition . . . . . . . . . . . . . . . 160
7.5 The source Bibliography article DTD definition . . . . . . . . . . . . . . . . 161
7.6 The target Bibliography article DTD definition . . . . . . . . . . . . . . . . 161
7.7 The source Movies DTD definition . . . . . . . . . . . . . . . . . . . . . . . 162
7.8 The target Movies DTD definition . . . . . . . . . . . . . . . . . . . . . . . 162
7.9 A portion of the source DBLP DTD definition . . . . . . . . . . . . . . . . 163
7.10 A portion of the target DBLP DTD definition . . . . . . . . . . . . . . . . . 163
List of Tables
2.1 An overview of the structure-only clustering approaches . . . . . . . . . . . 41
2.2 An overview of the content and structure-based clustering approaches . . . 41
4.1 Data collections for XML clustering . . . . . . . . . . . . . . . . . . . . . . 88
4.2 The classification of the data collections for XML clustering . . . . . . . . . 89
4.3 Details of the pre-processed data collections . . . . . . . . . . . . . . . . . . 92
4.4 The number of clusters generated at the incremental clustering stage with
different clustering thresholds. . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.1 Quantifier mapping between XSD and DTD . . . . . . . . . . . . . . . . . . 131
5.2 The leaf element mapping result . . . . . . . . . . . . . . . . . . . . . . . . 136
5.3 Transformation operators for corresponding elements . . . . . . . . . . . . . 139
5.4 Data collections for XML transformation . . . . . . . . . . . . . . . . . . . . 145
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements
for an award at this or any other higher education institution. To the best of my knowledge
and belief, the thesis contains no material previously published or written by another
person except where due reference is made.
Signature:
Date:
Acknowledgements
I would like to express my sincere gratitude and deep appreciation to my principal supervisor, Dr. Richi Nayak. This thesis would not have been possible without the guidance, encouragement and support of my principal supervisor.
I also thank my associate supervisor, Prof. Peter Bruza, for his valuable time and support.
Special thanks to Helen for her careful proofreading of most chapters in this thesis in a
very short time.
I would like to acknowledge the High Performance Computing and Research Support (HPC) services at Queensland University of Technology (QUT) for providing the supercomputer account on which most of my clustering methods were run.
I am indebted to my many colleagues, especially to Sangeetha Kutty, for their cheer and
support throughout the course of my PhD research.
Finally, I would like to thank my family and friends for their encouragement and support.
Chapter 1 Introduction
The growth in the amount of XML data is inevitable as an increasing number of organizations are starting to take advantage of the Web for data distribution [2, 74]. Consequently, the need for better managing and analysing large collections of XML data is indisputable.
For better management, many researchers have focused their attention on the clustering of XML data [2]. Clustering is a data mining technique for grouping objects into smaller groups according to their feature commonality [25]. XML clustering has played a crucial role in many application domains, such as information retrieval, data integration, document classification, Web mining and query processing [37, 5, 70]. Therefore, the first part of the thesis explores methods for the clustering of XML documents. The first key hypothesis of this research is that clustering methods utilizing both the content and structural information of XML documents produce a more accurate clustering solution than clustering methods utilizing only the content or only the structure of XML documents.
The second part of the research is to use XML clustering for XML transformation. XML transformation is the process of converting the structural representation of an XML document (the source document) into another given structure (the target document). One problem with this process is that the generation of a transformation script is time consuming [31, 55, 54, 65, 67]. For example, if there are ten source documents which need to be converted into the same target document, then the transformation process has to be executed ten times. However, if among these source documents there are some similar substructures, such as in publication articles, then these substructures are repeatedly processed by the schema-matching process when finding corresponding structures in the target document. One way of reducing the work of the schema matching process is by integrating the structures of these documents into a global summary structure, known as schema integration [37]. Integration of similar substructures raises the problem of document heterogeneity, which occurs when documents that are semantically the same have different structures and element names. One way of simplifying the document heterogeneity of large amounts of XML documents is through a clustering process which groups the XML documents based on their similar data and/or structure [37].
Thus, the second key hypothesis of this research is that XML clustering based on the structural information of XML documents can improve the transformation process in terms of time and accuracy for the conversion of more than two source documents into the same target document.
To clearly understand the objective of this research, this chapter outlines the motivation, research questions, aims and contributions of the research. The chapter concludes with the structure of this thesis.
1.1 Motivation
XML1 has become a popular data exchange language due to its flexibility of allowing users
to define their own XML schema definitions. However, such flexibility gives rise to the
problem of document heterogeneity because each organization or application can create its
own XML data according to specific requirements. The document heterogeneity problem
constantly appears in the area of XML transformation [8]. XML transformation is an
important process for data distribution and message exchange of XML over the Web. For
instance in e-business, different companies may have different structures (schemas) for
representing the same information such as the invoice data. In order for the companies
to process the invoice data sent by their suppliers, a transformation process is necessary.
The transformation process is used to extract the invoice data, which is represented in
the structural format of the suppliers, and to store the data in structures that the application systems of the companies can process.
With the continuous growth of XML data on the Web, the problem of document heterogeneity becomes more difficult to manage. To simplify the document heterogeneity problem in large XML data collections, a process such as clustering is used. XML clustering, or clustering in general, is the task of partitioning large amounts of data or objects into small groups of data with similar characteristics [25]. The clustering process is useful for many applications such as schema/data integration, data warehousing, information retrieval, etc. [2, 74]
The following is a discussion of the background of XML clustering, XML transformation,
1http://www.w3.org/XML/
and the related issues for a better understanding of this research.
XML Clustering
In general, there are three tasks in the clustering process: data modelling, data similarity,
and data partitioning [25]. In order to cluster the XML data, a data model such as the
Vector Space Model (VSM) [60] or tree-based model is employed to capture the semantic
content and/or structural relationships in the XML data. Based on the data model, a
data similarity measure is defined to calculate the distance between the data instances.
Finally, based on the similarity value, a clustering algorithm can be applied to group XML
data. For example, if the data is represented as a tree then the tree edit distance [89, 12,
52, 68, 16] can be used to measure the data distance. However, if a data model such as the
VSM model is used then similarity measures such as the cosine or Euclidean distance [10]
can be used. After data similarity is defined, clustering algorithms such as hierarchical or
partitioned clustering can be used to group the XML data.
Due to the popularity of XML in document representation, a myriad of XML clustering methods can be found in the literature. However, many existing clustering approaches [84, 86, 80, 21] have not been able to efficiently combine the structure and content for the clustering of XML documents. Approaches such as Yang et al. [80] and Yoo et al. [86] use complex models to represent the structural and content information. Such approaches consume an inordinate amount of memory space. On the other hand, the approaches of Yao et al. [84] and Doucet et al. [21] have utilized the VSM model to combine the structural and content information contained within XML documents. These latter approaches are less complex than the former; however, they may suffer a loss of accuracy because only one dimension is used to represent both the structural and content information.
This thesis presents two clustering methods which utilize both the content and structure of XML documents. The first clustering method uses two different models and similarity measures for the content and structure. For the content, a semantic kernel is used, whereas for the structure, a tree model is used with a tree similarity measure. The term ‘content’ in this thesis refers to the data of the XML documents, which does not include the elements defined in the schema definitions. In terms of memory space, the first clustering method requires less memory than approaches such as Yang et al. [80] and Yoo et al. [86] because the use of two separate models for the content and structure is easier to process. The document similarity is ascertained by linearly combining the content similarity value and the structural similarity value with different weightings. The linear combination measure is applicable to homogeneous as well as heterogeneous collections. Homogeneous collections contain XML documents conforming to the same schema definition, whereas heterogeneous collections contain XML documents conforming to different schema definitions.
The second proposed method represents the content and structural information together
as a collection of text paths similar to Yao et al. [84]. However, instead of using the VSM
model, the proposed method calculates the document similarity using a semantic kernel.
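As a generic illustration of the semantic-kernel idea (not the thesis's specific kernel construction, which is developed in Chapter 3), document similarity can be computed as k(d1, d2) = d1^T S d2, where S encodes pairwise term similarity. The terms and the values in S below are invented for the example.

```python
# Toy semantic-kernel similarity: k(d1, d2) = d1^T * S * d2, where S is a
# term-by-term semantic similarity matrix.  Terms and S values are invented.
terms = ["car", "automobile", "xml"]
S = [
    [1.0, 0.9, 0.0],   # "car" is semantically close to "automobile"
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
d1 = [1, 0, 0]   # a document mentioning "car"
d2 = [0, 1, 0]   # a document mentioning "automobile"

def kernel(a, b):
    return sum(a[i] * S[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

plain_dot = sum(x * y for x, y in zip(d1, d2))   # 0.0: no shared terms
semantic = kernel(d1, d2)                        # 0.9: related terms matched

print(plain_dot, semantic)
```

Because S relates "car" to "automobile", the kernel assigns a non-zero similarity to documents that share no literal terms, which a plain dot product on the VSM misses.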
XML transformation
The two most important tasks in the XML transformation process, which converts the data from the source document format to the target document format, are (1) the generation of mapping rules by finding the matching elements between the source document and the target document, known as schema matching, and (2) the generation of a script that processes these mapping rules with a manipulation transformation language such as the eXtensible Stylesheet Language Transformation (XSLT)2.
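In code, the two tasks can be sketched as a toy Python stand-in: a dictionary plays the role of the mapping rules that schema matching would produce, and a small loop plays the role of the generated script. The element names and mapping rules here are invented for illustration; a real system would emit an XSLT stylesheet rather than manipulate trees directly.

```python
import xml.etree.ElementTree as ET

# (1) Mapping rules, as schema matching would produce them: each source
# element name is paired with its corresponding target element name.
# These names are hypothetical, chosen only for the example.
mapping = {"writer": "author", "heading": "title", "yr": "year"}

source = ET.fromstring(
    "<paper><writer>T. Tran</writer>"
    "<heading>XML Clustering</heading><yr>2009</yr></paper>"
)

# (2) Script generation/execution: in practice an XSLT script is
# generated from the mapping rules; here we apply them directly,
# rebuilding the document under the target structure.
target = ET.Element("article")
for child in source:
    ET.SubElement(target, mapping[child.tag]).text = child.text

print(ET.tostring(target, encoding="unicode"))
```

The output is the same data re-expressed in the target structure, which is exactly what the generated XSLT script achieves for every source document at once.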
XML transformation is a complicated task. For example, XML designers can define their own tags; therefore, XML documents representing the same information may not share the same structure. Moreover, the XML language contains cardinality operators, defined in the schema, which determine how many instances of an element type are permitted in an XML document. These operators mean that XML documents derived from the same schema may contain structures of varied lengths.
Not many existing XML transformation approaches [31, 55, 54, 65, 67] address the problem of XML transformation using multiple XML documents. Furthermore, most XML transformation approaches only address the transformation problem between one source document and one target document. The problem of dealing with multiple XML sources has been addressed by researchers [51, 59] in the area of schema integration to resolve structural conflicts such as nesting discrepancies and backward path representations; however, the work in the area of schema integration has not gone further to apply the mediated schema in the XML transformation application.
Schema integration is desirable in situations where the target document changes regularly, since the transformation process between the source documents and each new target document would otherwise need to be executed repeatedly.
2www.w3.org/TR/xslt
Figure 1.1: The current approach for XML transformation process.
To solve the above problem, schema integration (or structural integration) can be performed on the source documents to create a global summary structure of the structures in
the source documents. In this case, the schema matching in the transformation process only needs to be performed between the global summary structure and the target document. However, schema integration is a complex task when the source documents are very different in structure. Thus, a task such as clustering can be employed which first groups the source documents into similar structures before performing schema integration. For instance, Figure 1.2 illustrates how the five input source documents are processed in the proposed approach. Let us assume, based on the five input source documents, that three clusters can be formed according to their structural similarity. The concept of schema integration can then be applied by simply combining the structure of the source documents held within each cluster into a global summary structure. The global summary structure acts like a source document definition which can then be used in the transformation process; thus, with four different target documents, the transformation process needs to be executed only twelve times (three global summary structures against four targets) rather than twenty (five sources against four targets). The clustering process needs to be executed only once if the source documents do not change. The saving in time would be significant if there were a large number of source documents that needed to be transformed.
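The saving can be quantified with a simple count of transformation executions, taking the numbers from Figures 1.1 and 1.2 (five sources, three clusters, four targets):

```python
sources, n_clusters, targets = 5, 3, 4

# Traditional approach: every source document is matched and transformed
# against every target document.
without_clustering = sources * targets      # 5 * 4 = 20 executions

# Proposed approach: each cluster's global summary structure is matched
# against every target; the one-off clustering cost is amortized.
with_clustering = n_clusters * targets      # 3 * 4 = 12 executions

print(without_clustering, with_clustering)
```

The gap widens as the number of source documents grows while the number of distinct structures (clusters) stays small.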
Figure 1.2: The proposed approach for XML transformation process.
1.2 Research Questions
Due to the existing limitations of the XML clustering and transformation processes, this thesis addresses two main questions:
1. Can the accuracy of the clustering solution be improved by using both the structure
and content of XML documents?
2. Given a collection of source XML documents and a target document, can the grouping of the source documents into small sets of similar structures improve the processing time and accuracy of the XML transformation?
The first question responds to the first key hypothesis of this research, which is to study the impact on the clustering solution of using the structure as well as the content of XML documents. The second question responds to the second key hypothesis, which is to explore whether structural clustering can improve the accuracy of the schema matching task as well as the running time of the transformation process.
1.3 Research Aim
The first objective of this research, in response to the first research question, is to develop a
number of clustering methods by utilizing the different features, structure and/or content,
of XML documents in the clustering process. These clustering methods are then analysed
to see the impact of the different features on the clustering solutions.
The second objective, corresponding to the second research question, is to develop an XML
transformation approach which utilizes structural clustering as a pre-processing stage. The
structural clustering is expected to reduce the complexity in the structural integration of
the source documents and in the generation of the transformation script for transforming
multiple source documents into a target document simultaneously.
1.4 Contributions
This research has the following contributions.
1. A hybrid clustering algorithm has been proposed which utilizes both the partitioning
clustering and hierarchical clustering process. The proposed clustering algorithm
aims to balance the drawbacks of these two existing processes. Empirical results show
that the proposed clustering algorithm has been able to improve the scalability of
the pair-wise clustering and to improve the accuracy of the clustering solution in the
incremental clustering.
2. A number of clustering methods have been developed for the grouping of the XML
documents: two structure-only clustering methods and two structural and content
clustering methods. The two structure-only clustering methods are based on two
different data models, the tree model and the path model. Two structural similarity
measures based on the tree model and the path model have been included in this
thesis. For the two structural and content clustering methods, the first clustering
method is based on a linear combination of the structural similarity, defined for
the tree model, and the content similarity, using a semantic kernel, with different
weightings. The second method is based on text paths, that is, paths which also carry their content information, measured using a semantic kernel; this is a non-linear combination of the content and structure. In the experimental results, the clustering methods that utilize both the structure and content of XML documents perform better than the structure-only clustering methods.
3. A transformation approach has been proposed which employs one of the structure-
only clustering methods for the pre-processing stage. The proposed approach can be
used for the conversion of more than two XML source documents to another XML
structure. After the grouping of the source documents, the structure of the source
documents in each group is then integrated (or combined) into a global summary
tree structure. Each group has a global summary tree structure which is used in
the schema matching process. Results show that by using the clustering process,
this approach can improve the scalability as well as the accuracy in comparison to
the traditional XML transformation system for the conversion of multiple source
documents into the same target document.
1.5 Thesis Structure
The following is an overview of this thesis. It is broken into the following chapters:
Chapter 2: Background and Related Work
This chapter begins with a brief description of the XML data and its structure. Next
is the background knowledge and related work of XML clustering followed by the back-
ground knowledge and related work of XML transformation. This chapter provides the
fundamental information for the rest of the chapters in this thesis.
Chapter 3: The Proposed Clustering Methods
This chapter describes the two structure-only clustering methods and the two structural
and content clustering methods which have been proposed in this thesis. This chapter
defines the different data models and similarity measures employed by the proposed clus-
tering methods. Each data model has a different similarity measure. This chapter also
introduces a new clustering algorithm which is used by the proposed clustering methods
for the grouping of XML documents.
Chapter 4: Empirical Evaluation of the Clustering Methods
This chapter empirically analyses the clustering methods on different data collections
and evaluation metrics. It compares all the proposed clustering methods which have
been developed in this thesis. The chapter starts with the evaluation of the structure-
only clustering methods. Following that is the evaluation of the structural and content
clustering methods. Finally, there is a discussion and comparison of all the proposed
clustering methods in this research.
Chapter 5: The XML Transformation Approach
This chapter investigates a solution to the second research question of XML transfor-
mation. A transformation approach has been developed which incorporates a clustering
process to improve the transformation process dealing with a collection of input source
documents. A number of experiments have been conducted to analyze the performance
of the proposed approach in terms of scalability. The chapter also compares the quality of the proposed element mapping against another existing element-mapping technique.
Chapter 6: Conclusion
This chapter concludes the thesis with a summary discussion of the obtained results
throughout the course of this research. It also includes the limitations of this thesis
and work that needs to be done in the future.
Chapter 2
Background and Related Work
This chapter discusses the background knowledge and the related work of XML clustering
and XML transformation. To understand why traditional clustering methods for text
documents are not sufficient in the grouping of XML data, this chapter begins with an
introduction to the XML data. Following that is the related work of XML clustering.
The existing XML clustering approaches are addressed according to the structure-only
approaches, content-only approaches, and structure and content approaches. After the
discussion of XML clustering, this chapter continues with the related work in the area of XML transformation, which includes the schema matching approaches and the transformation
languages for XML data. This chapter concludes by addressing the limitations of the
related work and how some of these limitations are approached in this research.
2.1 XML Data
Over the past decade, XML, the eXtensible Markup Language (http://www.w3.org/XML/), has become the standard for
data distribution and message exchange over the Internet and among various organizations
and computing applications [2, 74]. It is an extensible language because it allows users to
define their own markup symbols and to define the structure for representing the XML
data. It is a meta-language which can be used to define other new mark-up languages as
well. The common uses of XML include:
Information Identification - user defined mark-ups have meaningful names which can
be used to identify the text content of a document;
Information Storage - XML can be used to store textual information across any
platform and application;
Information Structure - any kind of hierarchical structure can be defined for storing
any data whether it is simple or complex in structure;
Publishing - a style language such as XSL, the eXtensible Stylesheet Language (http://www.w3.org/Style/XSL/),
can be used to publish the data of an XML document to another format such as
HTML for web viewing, PDF for electronic paper viewing and many others; and
Web Services - it provides a common language for inter-process communication. The
majority of web services such as weather services, e-commerce sites, blog newsfeeds,
and thousands of other data-exchange services use XML for data management and
transmission.
Figure 2.1: The classification of XML data.
Figure 2.1 illustrates the different categories of XML data. The two forms of XML data
are documents (Figure 2.2) and schemas (Figures 2.3 and 2.4). XML schema data contains the grammar that restricts the syntax and structure of accompanying XML documents. The
two most popular languages for defining an XML schema are Document Type Definition
(DTD) (Figure 2.3) and XML-Schema Definition (XSD, http://www.w3.org/XML/Schema) (Figure 2.4). XSD is an enhancement of DTD with additional features such as namespace support and
more data types. Many documents can conform to the same XML schema definition.
There are two types of XML document collection: (1) a collection that contains documents
conforming to the same schema definition is called a homogeneous collection; and (2) a
collection that contains documents conforming to different schema definitions is called a
heterogeneous collection.
<?xml version="1.0"?>
<!DOCTYPE conf SYSTEM "conf.dtd">
<conf id="IE06">
<title> The 16th ACM SIGKDD Conference on Knowledge Discovery and Data mining (KDD-2010)</title>
<year> 2010 </year>
<editor>
<person>
<name>Peter Gavin</name>
<email>[email protected]</email>
<phone>61-9828712</phone>
</person>
</editor>
<paper>
<title>Mining the structure for XML document clustering</title>
<author>
<person>
<name>Susan Smith</name>
<email>[email protected]</email>
</person>
</author>
<reference>
<paper>
<title>A Survey of XML Similarity Measures</title>
<author>
<person>
<name>David MacDonald</name>
<email>[email protected]</email>
</person>
</author>
</paper>
</reference>
</paper>
</conf>
Figure 2.2: An example of a conference XML document
<!ELEMENT conf (title, year, editor?, paper*)>
<!ATTLIST conf id ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT editor (person*)>
<!ATTLIST editor eids IDREFS #IMPLIED>
<!ELEMENT paper (title, author, references?)>
<!ELEMENT author (person*)>
<!ELEMENT person (name, email, phone?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT references (paper*)>
Figure 2.3: An example of a conference DTD definition
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.conferences.org" xmlns="http://www.conferences.org"
elementFormDefault="qualified">
<xsd:element name="conf">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="title" minOccurs="1" maxOccurs= "1"/>
<xsd:element ref="year" minOccurs="1" maxOccurs= "1"/>
<xsd:element ref="editor" minOccurs="0" maxOccurs= "unbounded"/>
<xsd:element ref="paper" minOccurs="1" maxOccurs= "unbounded"/>
</xsd:sequence>
<xsd:attribute ref="id" use="required"/>
</xsd:complexType>
</xsd:element>
<xsd:element name="editor">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="person" minOccurs="1" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="paper">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="title" minOccurs="1" maxOccurs="1"/>
<xsd:element ref="author" minOccurs="1" maxOccurs="1"/>
<xsd:element ref="references" minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="author">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="person" minOccurs="1" maxOccurs="unbounded"/>
</xsd:sequence> </xsd:complexType> </xsd:element>
<xsd:element name="person">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="name" minOccurs="1" maxOccurs="1"/>
<xsd:element ref="email" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="references">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="paper" minOccurs="1" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:attribute name="id" type="xsd:string"/>
<xsd:element name="title" type="xsd:string"/>
<xsd:element name="year" type="xsd:string"/>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="email" type="xsd:string"/>
</xsd:schema>
Figure 2.4: An example of a conference XSD definition
An example of an XML document is given in Figure 2.2. An XML document contains two main types of information: (a) the markup and (b) the content. The main markup which defines the logical components of an XML document is called an element, e.g., title, year, person, etc. A
markup construct that begins with “<” and ends with “>” is a tag. The text between the
start-tag and end-tag of an element is the content. Another component worth mentioning
is the attributes of an element which exists within the start-tag of the element, e.g., the
id attribute of the element conf.
There are many relationships that can exist between elements [89, 12, 68]. A child-parent relationship occurs when an element is contained within another element and they are only one level apart. Consider the document in Figure 2.2 for example: the element title is the child of the element conf, and the element conf is the parent of the element title. A sibling relationship exists when two elements have the same parent; for instance, the elements name, email and phone are siblings. When an element is contained within another element, they have a descendant-ancestor relationship regardless of whether a child-parent relationship holds. For instance, all the elements that exist within the start-tag and end-tag of the element conf, such as title, year, editor, person and paper, are its descendants, and the element conf is their ancestor.
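These relationships can be inspected programmatically. The following is a minimal sketch using Python's standard-library ElementTree parser on a trimmed version of the Figure 2.2 document (the name and email values here are placeholders, not the thesis's own data):

```python
import xml.etree.ElementTree as ET

# A trimmed version of the conference document in Figure 2.2
# (name and email are placeholder values).
doc = """
<conf id="IE06">
  <title>KDD-2010</title>
  <year>2010</year>
  <editor>
    <person><name>P. Gavin</name><email>pg@example.org</email></person>
  </editor>
</conf>
"""

root = ET.fromstring(doc)

# Child-parent: title, year and editor are immediate children of conf.
children = [child.tag for child in root]
print(children)        # ['title', 'year', 'editor']

# Sibling: name and email share the same parent, person.
person = root.find("editor/person")
siblings = [child.tag for child in person]
print(siblings)        # ['name', 'email']

# Descendant-ancestor: every element reachable from conf is a descendant.
descendants = [elem.tag for elem in root.iter() if elem is not root]
print(descendants)     # ['title', 'year', 'editor', 'person', 'name', 'email']
```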
The relationships between elements within an XML document define its structure. The
structure of an XML document is either ill-formed or well-formed [64]. An ill-formed
document does not follow the XML syntax, i.e., has no XML declaration statement or
no ending tag. However, a well-formed document follows the XML syntax which has the
following properties: has one root element; has unique opening and closing tags; and has
tags that are properly nested. A well-formed document that also conforms to its schema definition is known as a valid document; that is, the document contains no constructs that are not permitted by the schema definition.
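Well-formedness can be checked with any XML parser, since parsing fails exactly when the syntax rules above are violated. A small Python sketch (checking validity against a DTD or XSD requires an external validator and is not shown):

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the text follows XML syntax, i.e. it parses."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<conf><title>KDD</title></conf>"))  # True: properly nested
print(is_well_formed("<conf><title>KDD</conf>"))          # False: no matching end-tag
print(is_well_formed("<title/><year/>"))                  # False: more than one root
```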
2.2 XML Clustering
With the structure embedded in the XML data, both the data and structure are important
for the process of XML clustering in order to obtain a good clustering solution. Clustering
is a data mining task that groups or segments a collection of objects into subsets or
“clusters” that share similar characteristics [25].
Figure 2.5: A generic XML data clustering process
Figure 2.5 illustrates a generic clustering process. The clustering process consists of three
main tasks: data modelling, data similarity, and data partitioning [25].
Data Modelling
The input data is represented using a common data model that can capture the semantic
and/or structure information inherent in the input data collection. Some of the most
popular models for XML data are the tree-based model, graph-based model, vector-based
model, and path-based model.
Data Similarity
After the data modelling task, the data similarity task applies the most appropriate measure to compute the degree of similarity between objects in the data collection, utilizing the data model. The selection of the measure depends on the data model; for instance, if the tree-based model is used, then a measure such as the tree edit distance is commonly employed [52, 68, 16].
Data Partitioning
Once the data model and data similarity measure are determined for the input data collection, the next step is to choose a clustering algorithm that can partition the data taking similarity
into consideration. The two most popular types of clustering algorithms are incremental
clustering and pair-wise clustering.
Incremental Clustering - A simple incremental clustering segments the input data collection as follows: (1) the first data in the collection becomes the centroid for the first cluster; (2) the second data in the collection is compared with the existing cluster centroid (or cluster representation) using a similarity measure; and (3) the second data initiates a new cluster and becomes that cluster's centroid if the degree of similarity between it and the existing cluster centroid is not greater than a clustering threshold value; otherwise, it joins the existing cluster. The clustering threshold is the lowest possible value of similarity required to join two objects in one cluster. This value is determined by the user. Each subsequent object in the collection is processed in the same way as the second data. The clustering solution of this method is sensitive to the order of the input data collection. Incremental clustering is a type of partitioning clustering [74, 29].
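The steps above can be sketched as follows, here over numeric feature vectors with cosine similarity; as a simplification, the first object of each cluster serves as its centroid (real variants may update the centroid as objects join):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def incremental_cluster(objects, threshold):
    """Single-pass clustering: an object joins the most similar existing
    cluster, or starts a new one when no centroid similarity exceeds the
    user-chosen clustering threshold."""
    centroids, clusters = [], []
    for obj in objects:
        sims = [cosine(obj, c) for c in centroids]
        if sims and max(sims) > threshold:
            clusters[sims.index(max(sims))].append(obj)
        else:
            centroids.append(obj)   # the object initiates a new cluster
            clusters.append([obj])
    return clusters

data = [(1, 0, 0), (0.9, 0.1, 0), (0, 1, 0), (0, 0.9, 0.2)]
print(len(incremental_cluster(data, threshold=0.8)))  # 2 clusters
```

Raising the threshold splits the data into more clusters, which illustrates why the threshold choice is left to the user.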
Another popular partitioning clustering method is the k-means method. Given a set of objects (o1, o2, ..., on), where each object is a d-dimensional real vector, k-means clustering aims to partition the objects into k sets, where k < n. In this method, the number of clusters is pre-defined. K-means clustering is often preferred over hierarchical clustering as it is faster and easier to implement. One of the drawbacks of k-means clustering is the selection of k. The time complexity of the partitioning methods with only one pass through the input data collection is O(n log n), where n is the number of input data in the collection.
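A minimal k-means sketch over 2-dimensional points follows; for determinism it seeds the centroids with the first k points, whereas real implementations use random or k-means++ initialization:

```python
def kmeans(points, k, iters=20):
    """Plain k-means: repeatedly assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster.
    Simplification: the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Recompute means; keep the old centroid if a cluster emptied out.
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]: the two blobs separate
```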
Pair-wise Clustering - Pair-wise clustering partitions the input data collection based on a similarity matrix, which is obtained by calculating the similarity between all possible pairs of input data in the collection using a similarity measure. This clustering is a type of hierarchical agglomerative clustering method [29]. One of the best known hierarchical methods is the single link method. The single link method operates by joining, at each step, the two most similar objects, either between two input data or between an input data and an existing cluster. The time complexity of hierarchical clustering is at least O(n^2), where n is the number of input objects. Therefore, this method is limited to smaller collections.
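The single link method can be sketched as follows, with a naive implementation over one-dimensional objects and a caller-supplied distance function:

```python
def single_link(objects, distance, k):
    """Naive single-link agglomerative clustering: repeatedly merge the
    two clusters whose closest members are nearest, until k clusters
    remain.  The pair-wise distances make this at least quadratic in the
    number of objects, which is why it suits smaller collections."""
    clusters = [[o] for o in objects]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

dist = lambda a, b: abs(a - b)
print(single_link([1, 2, 10, 11, 12, 50], dist, k=3))
# [[1, 2], [10, 11, 12], [50]]
```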
The rest of this section discusses the existing XML clustering approaches which are cat-
egorized into structure-based, content-based, and structure and content-based clustering
approaches according to the features of the XML documents used.
2.2.1 Structure-based Clustering
There are many data models that can represent the structure of XML data. This section
will present the approaches according to the data models that they use for XML clustering, such as the tree, path and graph models.
2.2.1.1 Tree-based Approaches
The most popular model for representing the structure of XML data is a tree model [89,
12, 68]. A tree is denoted as T = (V, v0, E, f), where V is the set of nodes, v0 is the
root node which does not have incoming edges, E is the set of edges, and f is a mapping
function f : E → V ×V . In a tree model, the components such as elements and attributes
that define the structure of an XML document are referred to as nodes. There are many
different nodes in an XML tree structure such as the element nodes, data (or text) nodes,
comment nodes, a document node and many others. The edges are the child-parent
relationships between the nodes in the tree. An example of the rooted labelled tree model corresponding to the XML document in Figure 2.2 is shown in Figure 2.6. The figure shows the relationships between the nodes in the tree structure. The immediate children of the conf node are title, year, editor, and paper. The children of conf's immediate children are its descendants. The dotted line is the attribute id of the conf node.
There are a number of similarity measures for the tree-based approaches, namely, the
tree edit distance, frequent subtree mining, and level similarity. The tree edit distance
calculates the minimum cost (or distance) of transforming from one tree structure to
Figure 2.6: Tree representation of the XML document structure
another. The frequent subtree mining is the extraction of the most common sub-structures
existing in a collection of tree structures. The trees are clustered based on these sub-
structures. Finally, there are similarity measures based on level similarity, which take into account the levels of the nodes; level similarity is based on the assumption that associated nodes should appear at the same level. In the following sub-sections, the
approaches based on these similarity measures are discussed in more detail.
Tree Edit Distance
As XML documents can be easily modelled as a tree, many researchers [89, 12, 52, 68, 16]
have adapted the tree edit distance for finding the distance between trees. Tree edit distance is usually based on dynamic programming techniques for string-to-string corrections [76]. An edit script is a sequence of tree edit operations, such as insert node, delete node and replace node, that transforms one tree into another tree. The tree edit distance between two trees is the minimum among the costs of all possible tree edit
sequences.
The approach of Zhang and Shasha [89] allows the edit operations to be performed anywhere in a tree. The complexity of this approach is O(|t1||t2| depth(t1) depth(t2)), where t1 and t2 are two trees. The approaches of Nierman et al. [52] and Tekli et al. [68] have expanded the work of Chawathe [11], which restricts insertion and deletion to the leaf nodes only. Nierman et al. [52] introduce two new operations, insert tree and delete tree, to allow insertion and deletion of whole sub-trees. The complexity of the latter approaches is O(|N|D), where |N| is the total number of nodes in the two trees and D is the number of misaligned nodes. On the other hand, Tekli et al. [68] extended the edit operations to measure the semantics of the labels of nodes, which also takes into consideration the depth of the nodes in a tree. The work of Dalamagas [16] claims that real XML documents tend to have many repeated nodes, which affect the performance of the tree edit algorithms. The authors introduce a summary tree structure in which the repeated nesting nodes are reduced (or removed) from the rooted labelled trees.
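The string-to-string correction that tree edit distance builds on can be sketched as a dynamic programming recurrence; shown here on plain label sequences rather than trees, so it illustrates the edit-script idea but is not a full tree edit distance:

```python
def edit_distance(s, t, cost_ins=1, cost_del=1, cost_rep=1):
    """Dynamic programming for the minimum-cost edit script
    (insert, delete, replace) turning sequence s into sequence t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * cost_del            # delete everything in s
    for j in range(1, n + 1):
        d[0][j] = j * cost_ins            # insert everything in t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            rep = 0 if s[i - 1] == t[j - 1] else cost_rep
            d[i][j] = min(d[i - 1][j] + cost_del,
                          d[i][j - 1] + cost_ins,
                          d[i - 1][j - 1] + rep)
    return d[m][n]

# Distance between two root-to-leaf label sequences: one replacement.
print(edit_distance(["conf", "editor", "person", "name"],
                    ["conf", "author", "person", "name"]))   # 1
```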
Frequent Subtree Mining
Computing the edit distance between each pair of trees is expensive for XML clustering; therefore other approaches [69, 88, 33, 32, 39] have been developed that extract and mine subtrees from the whole tree structure. The approach of Termier et al. [69] first clusters the trees based on the occurrence of the same pairs of labels in the ancestor relation using the Apriori algorithm. After the trees are clustered, a maximal common tree is computed to measure the commonality of each cluster to all the trees. This algorithm cannot find all frequent patterns in the subtrees of the ordered labelled tree. To fill the gap, Zaki [88] proposed an algorithm to discover all subtrees in a forest (meaning a large collection of ordered trees). Other methods, such as Kutty et al. [33, 32] and Lin et al. [39], have extended frequent sub-tree mining for finding common sub-trees. The output of the common sub-trees is used in the clustering of XML documents. These researchers claim that clustering the XML documents by extracting the subtrees, rather than using the entire structure of the datasets, is more efficient in terms of scalability and accuracy.
Level Similarity
XCLS [47] extends transactional data clustering algorithms such as CLOPE [82] to XML documents by defining a new concept called the level similarity. The level similarity measures the structural similarity between two objects (tree-tree, cluster-cluster, tree-cluster) by considering their common items in the corresponding levels and giving different weights to different levels. Unlike other approaches that are based on pair-wise similarity between two trees, XCLS computes the level similarities between a tree and the existing clusters, and moves the tree to the cluster which has the maximum level similarity with the tree. Using this approach, the computation time is reduced significantly. The limitation of the XCLS approach is that it does not preserve the child-parent relationship or the sibling relationship. Thus, XCLS+ [4] addresses this limitation of XCLS by using edges rather than nodes only. Another study which extends XCLS further is XEdge [6]. XEdge not only uses the edges for the node representation but also extends the clustering algorithm by using the k-means algorithm. It claims that it can cluster both homogeneous and heterogeneous XML collections.
2.2.1.2 Path-based Approaches
In recent years, a great number of approaches have represented the XML data by breaking down the tree structure into paths. A path model represents the structure of XML
documents as a collection of paths (or transactions as used in database communities [7]).
An XML path can be of two types: complete path and partial path. A complete path
contains the nodes from the root to the leaf node in sequence order. Consider the example
tree model in Figure 2.6, the corresponding complete paths of the tree model are shown
in Figure 2.7. The number of complete paths is equal to the number of leaf nodes in the
tree model.
conf/title, conf/year, conf/id,
conf/editor/person/name, conf/editor/person/email, conf/editor/person/phone,
conf/paper/title, conf/paper/author/person/name, conf/paper/author/person/email,
conf/paper/reference/paper/title, conf/paper/reference/paper/author/person/name,
conf/paper/reference/paper/author/person/email
Figure 2.7: Complete paths extracted from the tree model in Figure 2.6.
A partial path contains the nodes from node m to node n in sequence order in which
node m is the ancestor of node n, and nodes m and n appear in the same complete
path. A complete path can have many partial paths. For example, the complete path conf/paper/reference/paper/author/person/name can have the following partial paths of varied lengths: conf/paper, conf/paper/reference, paper/reference, paper/reference/paper/author/person/name, etc.
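Complete paths such as those in Figure 2.7 can be extracted with a depth-first walk over the document tree. A sketch using Python's standard-library ElementTree on a trimmed document:

```python
import xml.etree.ElementTree as ET

def complete_paths(elem, prefix=""):
    """Collect one root-to-leaf tag path per leaf node."""
    path = f"{prefix}/{elem.tag}" if prefix else elem.tag
    children = list(elem)
    if not children:
        return [path]
    paths = []
    for child in children:
        paths.extend(complete_paths(child, path))
    return paths

doc = ET.fromstring(
    "<conf><title>KDD</title>"
    "<editor><person><name>A</name><email>a@x</email></person></editor></conf>"
)
print(complete_paths(doc))
# ['conf/title', 'conf/editor/person/name', 'conf/editor/person/email']
```

As noted above, the number of complete paths equals the number of leaf nodes.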
Similarity measures for a path model can be categorized into sequential pattern mining
and schema matching techniques. A sequential pattern mining technique is similar to a subtree mining approach; however, it extracts all the common paths of documents. These common
paths are used for clustering of XML data. On the other hand, a schema matching tech-
nique finds corresponding elements between two schema definitions. It has been employed
by a few researchers in calculating the distance between paths. Research relating to these
similarity measures is described in the following sub-sections.
Sequential Pattern Mining
For finding the common paths between documents, many techniques [38, 35, 27, 48, 3] have
incorporated the idea of sequential pattern mining to extract the frequent paths from XML
documents. Considering an XML document as a transaction and paths of documents as
items of the transaction, these techniques find the complete set of frequent sequences from
the set of paths.
Techniques such as Leung et al. [38] and Lee et al. [35] have utilized the idea of sequential pattern mining to extract common paths from a collection of XML trees to measure the structural similarity. However, these methods did not go further in terms of clustering. Hwang and Ryu [27] take a step further: they use sequential pattern mining to extract the frequent paths from the XML documents, treating an XML document as a transaction and the frequent structures of documents as the items of the transaction. Hwang and Ryu [27] then use CLOPE [82] as well as the notion of Large Items [26], clustering methods for transactional data, to cluster a collection of XML documents. XMine [48], by contrast, uses sequential pattern mining to infer the similarity between elements. This approach is for clustering and modelling the relationship between DTD schema definitions using a pair-wise schema distance matrix. On the other hand, XProj [3] uses the frequent sub-structures as the cluster representation for clustering the XML documents using the k-means algorithm.
Schema Matching
Besides using sequential pattern mining, some researchers have employed schema match-
ing techniques in finding the similarity between paths. Schema matching is the process
of finding corresponding elements between two schema definitions. The clustering methods [37, 50, 48, 45] which employ the schema matching technique are generally used in data integration applications. These clustering methods adopt a complex measure for determining similarity between the XML structures as well as the leaf nodes of two XML data. They calculate not only the similarity between the element names of XML nodes but also other properties of XML nodes, such as the data type and constraints. Above all, what these clustering methods try to measure is the similarity between the leaf nodes.
2.2.1.3 Graph-based Approaches
Often the tree-based model is used for representing XML documents, whereas a graph-based model is more suitable for representing an XML schema definition, to show the acyclic relationships of elements in the schema. An example of a graph model is shown in Figure 2.8, which corresponds to the conference DTD definition in Figure 2.3. In contrast to the tree representation, the graph representation in the figure also shows the cardinality
Figure 2.8: Graph representation of an XML definition
operators. Unlike the tree representation, the person node in the graph has two parents, which are the editor node and the author node. A graph can be defined as a triple (V, E, f), where V represents the set of vertices and E represents an edge set with a mapping function f : E → V × V. The vertices are the elements in the schema and the edge set consists of the links that connect the vertices, representing parent-child relationships.
Chawathe [12] computes the distance between XML documents using the concept of edge cover with a bipartite graph. A bipartite graph G is defined as G = (U, V, E), where U and V are two disjoint sets of nodes such that every edge connects a node in U to one in V and no edge connects nodes in the same set. An example of a bipartite graph is shown in Figure 2.9. In the figure, U and V stand for two different
Figure 2.9: Bipartite graph representation
XML documents and the dots represent the nodes in the XML documents. Chawathe's approach [12] establishes a bipartite graph by representing one tree structure as U and the other tree structure as V; an operation is then defined to convert a node from one tree to the other. Once all the possible edges for transforming the nodes in U into the nodes in V are established, the approach calculates the set of edges that connects all the nodes between the two graphs at the lowest possible cost. This is similar to the tree edit distance approaches.
A recent work of Yuan et al. [87] also employs the bipartite graph model to map common
paths between XML documents, where U now is a set of documents and V is a set of
paths. Documents that are closely related should have the most common paths shown
in the bipartite graph. It uses the Jaccard coefficient to compute the similarity between documents dx and dy, which is defined as:

Sim(dx, dy) = |N(dx) ∩ N(dy)| / |N(dx) ∪ N(dy)|     (2.1)

where N(dx) and N(dy) are the sets of paths contained by documents dx and dy respectively. Based on the Jaccard coefficient, a pair-wise similarity matrix is generated for the clustering of XML documents.
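Equation 2.1 is straightforward to compute once each document is reduced to its set of paths. A sketch (the example paths are illustrative):

```python
def jaccard(paths_x, paths_y):
    """Jaccard coefficient of Equation 2.1 over two sets of paths."""
    x, y = set(paths_x), set(paths_y)
    if not x | y:
        return 0.0
    return len(x & y) / len(x | y)

dx = ["conf/title", "conf/year", "conf/editor/person/name"]
dy = ["conf/title", "conf/year", "conf/paper/title"]
print(jaccard(dx, dy))   # 0.5: 2 shared paths out of 4 distinct
```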
2.2.2 Content-based Clustering
The most popular model for representing text content is the Vector Space Model (VSM) [60].
It is widely used in information retrieval, information filtering, indexing and relevancy
rankings. In the VSM model, the data content of XML data is broken down into a set of index terms. A document di is represented as a vector di = (t1, t2, ..., tm), where m is the number of unique index terms in the input data collection. An example of the VSM model for a collection of input data is shown in Figure 2.10. The vector-based model can represent the terms by their frequency (as seen in Figure 2.10), by a binary value (1 or 0, where 1 means that the feature exists in the document and 0 that it does not), or by weights. There are
several ways to compute the weights of features. A popular scheme is term frequency-
inverse document frequency(TF-IDF) weighting. For XML data, the “terms” refer to as
“feature”. Thus, the weight vector for document di is di = (w1i, w2i, ..., wmi) is defined
by:
wti,dj = tfi · log|D|
|ti ∈ |D||(2.2)
tfi is the term frequency of term ti in input data dj divided by the total number of term
frequencies in dj and log |D||ti∈|D|| is inverse document frequency. |D| is the total number
of input data in the collection;|ti ∈ |D|| is the number of input data D containing the
term ti. Another weight is Okapi-BM25 similar to the TF-IDF weighting that is employed
in XML clustering [75]. It has two tuning parameters which are K1 and b. K1 influences
the effect of the term frequency, whereas b affect the influence of the document length.
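A minimal sketch of TF-IDF weighting over tokenised input data (the tiny corpus below is invented for illustration):

```python
import math

def tfidf_weights(docs):
    """TF-IDF weights (equation 2.2) for a list of tokenised documents.

    tf is the term frequency divided by the document length; idf is
    log(|D| / number of documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        total = len(doc)
        weights.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return weights

docs = [["xml", "tree", "xml"], ["xml", "path"], ["graph", "path"]]
w = tfidf_weights(docs)
# "xml" occurs in 2 of the 3 documents, so its idf factor is log(3/2).
```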
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14
d1 1 0 2 0 0 0 0 5 0 0 3 0 2 2
d2 2 2 0 0 2 1 0 0 8 7 0 7 0 0
d3 2 3 0 9 0 6 4 0 0 0 0 0 0 0
d4 3 0 0 5 3 2 1 3 0 0 0 0 0 0
Figure 2.10: The VSM model with term frequency
The Okapi BM25 weighting for a given term ti in document dj is defined as:

w(ti, dj) = (CFW × tf(i) × (K1 + 1)) / (K1 × ((1 − b) + (b × NDLj)) + tf(i))    (2.3)

where CFW is the collection frequency weight for term ti, calculated as log(|D|) − log(ni) with ni the number of input data containing ti, and NDLj is the normalized document length for document dj, NDLj = |dj| / avg(|dj|), where |dj| is the length of document dj in terms of words and avg(|dj|) is the average document length in the text collection from which documents are drawn.
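Equation 2.3 can be sketched as follows; the parameter values and the form of CFW (log of the collection size over the document frequency) follow one common BM25 formulation and are illustrative assumptions, not values from the cited work:

```python
import math

def bm25_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 weight for one term in one document (equation 2.3).

    cfw is the collection frequency weight, here log(|D|) - log(document
    frequency); ndl is the normalized document length; k1 and b are the
    two tuning parameters discussed in the text.
    """
    cfw = math.log(n_docs) - math.log(doc_freq)
    ndl = doc_len / avg_doc_len
    return (cfw * tf * (k1 + 1)) / (k1 * ((1 - b) + b * ndl) + tf)

# A term occurring 3 times in a slightly long document, in a collection of
# 1000 documents of which 50 contain the term (all numbers hypothetical).
w = bm25_weight(tf=3, doc_len=120, avg_doc_len=100, n_docs=1000, doc_freq=50)
```

Raising K1 strengthens the effect of tf; raising b penalises long documents more strongly.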
After the modelling of the content and weightings, there are a number of similarity mea-
sures for finding the distance between instances of the vector-based model. Two of the
most commonly used measures are cosine and Euclidean. Other distances are discussed
in Cha’s survey paper [10] such as Jaccard, Manhattan and many others. The cosine and
Euclidean between two vectors, dx and dy, are defined as:
Cosine(dx, dy) =dx · dy
∥dx∥ ∥dy∥(2.4)
Euclidean(dx, dy) =
√√√√ m∑i=1
(dx,i − dy,i)2 (2.5)
where m is the number of terms in the XML document collection. The difference between the two is that the Euclidean measure takes into account the magnitude of the vectors, whereas the cosine measure considers only the angle between the two vectors.
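The contrast between equations 2.4 and 2.5 is easy to see on two vectors that point in the same direction but differ in length (toy values):

```python
import math

def cosine(dx, dy):
    """Cosine similarity (equation 2.4): angle only, magnitude ignored."""
    dot = sum(a * b for a, b in zip(dx, dy))
    nx = math.sqrt(sum(a * a for a in dx))
    ny = math.sqrt(sum(b * b for b in dy))
    return dot / (nx * ny)

def euclidean(dx, dy):
    """Euclidean distance (equation 2.5): sensitive to magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(dx, dy)))

# dy is dx scaled by 2: cosine reports identical direction, Euclidean does not.
dx, dy = [1.0, 2.0, 0.0], [2.0, 4.0, 0.0]
print(cosine(dx, dy))     # 1.0
print(euclidean(dx, dy))  # sqrt(1 + 4) = sqrt(5)
```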
The above similarity measures for the VSM model can be used for clustering of XML
documents using partitioning clustering or hierarchical clustering.
2.2.2.1 Feature Reduction
Clustering of XML data can be expensive due to the large number of terms present in the documents and the presence of many outliers or insignificant terms. Therefore, a number of XML clustering approaches [28, 40, 71] use feature reduction methods such as ICA (independent component analysis), PCA (principal component analysis) [56] or LSI (Latent Semantic Indexing) [34].
Given a dataset of XML documents d1, d2, ..., dn, an original term-document matrix X of size m × n can be derived, where m and n are the number of unique terms (or mark-up tags, or paths) and the number of documents in the dataset respectively. The LSI method applies Singular Value Decomposition (SVD) to the term-document matrix, which is decomposed into three matrices (equation 2.6), where U and V have orthonormal columns of left and right singular vectors respectively and S is a diagonal matrix of singular values ordered in decreasing magnitude:

X = U S V^T.    (2.6)
The SVD process optimally approximates matrix X in a k-dimensional document space, where k < n, by keeping the k largest singular values and setting the rest to zero. Matrix Uk of size m × k and matrix Vk of size n × k are retained along with the k × k singular value matrix Sk (equation 2.7):

X ≈ Uk Sk Vk^T.    (2.7)
The difference between the PCA and ICA methods is that PCA maximises the variance, and the projections onto its basis vectors are mixtures of the underlying sources, whereas ICA finds basis vectors onto which the projections are statistically independent; ICA can thus be seen as an extension of the PCA method. In ICA, the independent components can be derived from:

S(k×n) = W(k×m) · X(m×n)    (2.8)

where W, known as the unmixing matrix, is the inverse of matrix Uk. The independent components S(k×n) are used to represent the new document collection matrix. Using the reduced document collection matrix, a clustering algorithm such as K-means is used to cluster the document collection.
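The LSI reduction of equations 2.6 and 2.7 can be sketched with NumPy on a toy term-document matrix (the matrix values are invented):

```python
import numpy as np

# A toy term-document matrix X (m = 4 terms, n = 4 documents).
X = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 3., 1.],
              [0., 0., 1., 2.]])

# Full SVD: X = U S V^T (equation 2.6); s holds singular values in
# decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values (equation 2.7).
k = 2
Uk, Sk, VkT = U[:, :k], np.diag(s[:k]), Vt[:k, :]

X_k = Uk @ Sk @ VkT       # rank-k approximation of X
docs_reduced = Sk @ VkT   # each column: one document in the k-dim latent space
```

The columns of `docs_reduced` can then be fed to K-means in place of the original m-dimensional term vectors.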
2.2.2.2 Semantic Kernel
Term weightings alone are not sufficient for learning the latent relationships between terms that may be important in document similarity. Some approaches [37, 44, 50, 45]
have used a semantic dictionary such as WordNet [22] to measure the synonym sense of
keywords. However, using WordNet to find the synonym of words is expensive in terms of
processing time. A few works [80, 71] have utilized the idea of kernels for XML clustering.
The work of Yang et al. [80] builds kernels for learning the terms of the documents in their
true groups. The kernel is an m × m kernel matrix which captures both the similarity
between a pair of XML elements as well as the contribution of the pair to the overall
document similarity. A small entry in the kernel means that the corresponding two XML elements are considered semantically unrelated, so the same words appearing in the two elements should not contribute to the overall similarity, and vice versa. This kernel is then used for the clustering of the documents. Building the kernel is a supervised learning approach. The proposed
method in this thesis [71] presented in Chapter 3, on the other hand, builds a semantic
kernel based on latent semantic indexing [15]. For example, given two vectors, dx and
dy, the closeness of the semantic similarity of the two is measured as the cosine similarity
using the Uk generated from the LSI method defined in equation 2.7:
Sim(dx, dy) = (dx^T Uk Uk^T dy) / (∥Uk^T dx∥ ∥Uk^T dy∥)    (2.9)
A semantic kernel can be applied to unknown data and is much more flexible than using
the WordNet dictionary.
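Equation 2.9 is the cosine similarity of the two document vectors after projecting them through Uk; a small sketch with an artificial Uk (not one learned from real data):

```python
import numpy as np

def lsi_similarity(dx, dy, Uk):
    """Cosine similarity of two document vectors in the latent space
    spanned by the columns of Uk (equation 2.9)."""
    px, py = Uk.T @ dx, Uk.T @ dy
    return float(px @ py / (np.linalg.norm(px) * np.linalg.norm(py)))

# Toy projection: Uk keeps only the first two of four term dimensions, so
# the two vectors below, which differ only outside those dimensions,
# project onto the same latent direction.
Uk = np.eye(4)[:, :2]
sim = lsi_similarity(np.array([1., 2., 0., 5.]),
                     np.array([2., 4., 1., 0.]), Uk)
```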
2.2.3 Content and Structure-based Clustering
There are two ways to combine the content and structure of XML documents for XML
clustering - linear and non-linear. Non-linear approaches combine the content and the
structure of XML data in one model. On the other hand, linear approaches calculate the
similarity values of the content and structure separately. These values are then combined
with weightings to calculate the document similarity.
2.2.3.1 Non-Linear Approaches
A model for representing the content and structure of XML data together is the Structural
Link Vector Model (SLVM) [81]. SLVM represents both the structure and the content information of XML documents using vector linking, so that the structure and the content features are not forced into one vector space model. The SLVM model of an XML document dx is a document feature matrix ∆x ∈ R^(n×m), given as

∆x = [∆x(1), ∆x(2), ..., ∆x(m)]

where m is the number of distinct XML elements and ∆x(i) ∈ R^n is the TF-IDF feature vector representing the ith XML element, whose jth component (for j = 1 to n) is TF(tj, dx, ei) · IDF(tj), where TF(tj, dx, ei) is the frequency of the term tj in the element ei of dx. The basic similarity between two such document models can be measured using the cosine measure.
The SLVM is employed by Yang et al. [80] to represent documents as vectors of terms,
structures, and neighbouring documents. Yang et al. [80] use a kernel matrix to calculate
the document similarity:
Sim(dx, dy) = Σ_{i=1..n} dx(i)^T · Me · dy(i)

where Me is an m × m kernel matrix which captures the semantic similarity between pairs of elements.
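The kernel-weighted sum above can be sketched as follows; the matrices are hypothetical toy values, not learned kernels from the cited work:

```python
import numpy as np

def slvm_kernel_similarity(Dx, Dy, Me):
    """Document similarity Sim(dx, dy) = sum_i dx(i)^T . Me . dy(i).

    Dx and Dy hold one feature vector per summation index (one row each);
    Me is the m x m element-similarity kernel.
    """
    return float(sum(Dx[i] @ Me @ Dy[i] for i in range(Dx.shape[0])))

Dx = np.array([[1., 0.], [0., 2.]])
Dy = np.array([[2., 1.], [1., 1.]])
# With the identity kernel, the measure reduces to a plain inner product.
sim = slvm_kernel_similarity(Dx, Dy, np.eye(2))
```

A non-diagonal Me lets words that occur in different but semantically related elements still contribute to the similarity.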
Another method is given by Yoo et al. [86], which models the relationships of documents, paths and terms in a 3-dimensional matrix called BitCube. An XML document dx is defined in the BitCube model by BC(dx) = [(dx, p1, v1), (dx, p2, v2), ..., (dx, pn, vm)], where pi is a path in dx, vj is a word in the content of pi, and (dx, pi, vj) is 1 if the word vj appears under the path pi in dx and 0 otherwise. The approach simply uses the popularity of common features to cluster the documents in order to optimize query operations. The distance between two documents (or a document and a query) is defined through the Hamming distance:

Sim(dx, dy) = |xOR(BC(dx), BC(dy))|    (2.10)
where xOR is a bit-wise exclusive OR operator applied on the representations of the two
documents in the BitCube.
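Representing each document's BitCube slice as the set of its 1-bits, the XOR of two bit vectors is simply the symmetric difference of the two sets; a toy sketch with invented (path, word) pairs:

```python
def hamming_distance(bc_x, bc_y):
    """Bitwise-XOR (Hamming) distance (equation 2.10).

    Each BitCube slice is represented as the set of (path, word) pairs
    whose bit is 1; XOR counts positions set in exactly one of the two,
    i.e. the size of the symmetric difference.
    """
    return len(bc_x ^ bc_y)

# Hypothetical (path, word) occurrences for two documents.
bc_dx = {("/book/title", "xml"), ("/book/author", "smith")}
bc_dy = {("/book/title", "xml"), ("/book/author", "jones")}
print(hamming_distance(bc_dx, bc_dy))  # 2
```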
The methods discussed above are complex and might not be scalable enough to handle large amounts of XML documents. Therefore, other approaches [21, 84, 85, 32], most of which are from the INitiative for the Evaluation of XML Retrieval (INEX)4, have utilized the simple VSM model.
4http://inex.is.informatik.uni-duisburg.de/
INEX contains an XML Mining Track which allows participants to compare their methods and results on the classification and clustering of XML documents. In the first attempt in INEX, by Doucet et al. [21], the content is treated as a set of words and the structure as a set of node labels. These two features, the content and the structure, are combined in one vector space model. This approach is known as a naive approach, as it is a simple way of combining the structure and content; the relationships between the features are lost in the representation. Other recent attempts are by Yao et al. [84] and Yongming et al. [85]. Instead of representing the features of the content
and the structure separately, they incorporate the content and structure into a collection
of term-paths. Each term-path contains the labels from an element node to a term in
a text node in which the element node is the ancestor of the text node. HCX [32], on
the other hand, takes a different approach. It first extracts frequent subtrees from XML
documents in a collection. These frequent subtrees are then used to extract the content
and only the content that appears in the frequent subtrees is extracted from the XML
documents. The content is then represented in the VSM model for the clustering of the
XML documents. This approach produces a better clustering solution than Yao et al. [84].
However, the drawback of the HCX method is that it may miss the true classes that have
only one or two documents which are unique in structure.
2.2.3.2 Linear Approaches
Using a non-linear computation might degrade the performance of the clustering depending on the nature of the XML documents; for example, the Wikipedia collections are more distinguishable by their data content than by their structure. Measuring the structural and the content similarity of such XML documents using one data model might therefore degrade the accuracy of the clustering process. Thus, linear approaches [85, 73, 43] have also been proposed, which linearly combine different similarity measures to find the similarity between XML documents. Yongming et al. [85] use different vector space models to represent the content and structure separately. The structure
feature is a collection of complete paths, and the distance measure is the product between two vectors. The document similarity is a weighted combination of the structural similarity value and the content similarity value. A clustering method proposed in this thesis takes a similar approach; however, instead of using vectors for the structure, the similarity between leaf nodes is calculated by finding the common ancestors of the leaf nodes [73]. Later, Nagwani et al. [43] employed the path similarity measure for comparing paths, as seen in the work of Tran et al. [73]; they also add one more attribute to the document similarity, namely the similarity between style-sheets.
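The linear combination these approaches use can be sketched in one line; the weight name `alpha` and the example values are hypothetical, since each cited approach chooses its weights empirically:

```python
def combined_similarity(structure_sim, content_sim, alpha=0.5):
    """Linear weighted combination of structure and content similarity.

    alpha in [0, 1] trades structure against content: alpha = 1 uses
    structure only, alpha = 0 content only.
    """
    return alpha * structure_sim + (1 - alpha) * content_sim

# A pair of documents with high structural but modest content overlap.
print(combined_similarity(0.8, 0.4, alpha=0.7))  # 0.7*0.8 + 0.3*0.4 = 0.68
```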
2.2.4 Discussion
Figure 2.11 outlines the different XML clustering approaches which have been discussed
so far. The last level in the diagram shows the different types of similarity methods.
The second last level shows the data models which are used to represent the content and
structure of the XML data.
Tables 2.1 and 2.2 compare the different structure-only clustering approaches, and the
content and structure clustering approaches, respectively. From the literature review, the
following limitations can be ascertained from the existing XML clustering approaches.
The first limitation is that not many clustering approaches have efficiently combined the
structure and content measures for the clustering of XML documents. Depending on the
type of XML document collections, combining both the structure and content measures
can sometimes degrade the quality of the clustering solutions. The second limitation is that the two clustering strategies, incremental and pair-wise clustering, both have drawbacks.
[Figure 2.11 presents a taxonomy of the XML clustering approaches. Structure-only approaches use tree, graph or path models, with similarity methods such as tree edit distance, frequent sub-tree mining, level similarity, graph distance, sequential pattern mining and schema matching. Content-only approaches use the VSM model with vector-based distances (e.g. cosine and Euclidean), feature reduction and semantic kernels. Combined structure-and-content approaches use the VSM, SLVM and BitCube models, the last with the Hamming distance.]

Figure 2.11: The classification of the XML clustering approaches for XML data.
Pair-wise clustering is more expensive in terms of memory and computational time when dealing with large collections. In contrast, incremental clustering can deal with large document collections but suffers from poor accuracy due to its dependence on the input ordering. In this research, a number of clustering methods have been proposed that utilize both the content and the structure. The thesis also proposes a clustering algorithm that balances the scalability problem of pair-wise clustering against the accuracy of incremental clustering.
2.3 XML Transformation
Figure 2.12 illustrates the basics of an XML transformation process using the eXtensible
Stylesheet Language Transformation (XSLT)5 language. Before executing the transforma-
tion process, corresponding nodes between a source schema definition and a target schema
5www.w3.org/TR/xslt
2.3. XML Transformation 41
Table 2.1: An overview of the structure-only clustering approaches

Method Name           | Data Model      | Similarity Measure                | Clustering Method
Nierman et al. [52]   | tree            | tree edit distance                | hierarchical
Dalamagas et al. [16] | summary tree    | tree edit distance                | hierarchical
Kutty et al. [32]     | tree            | frequent sub-tree mining          | k-means
Lin et al. [39]       | tree            | frequent tree mining              | hierarchical
XCLS [47]             | tree            | level similarity                  | partitioning
XCLS+ [4]             | tree            | level similarity                  | partitioning
XEdge [6]             | tree            | level similarity                  | k-means
Hwang and Ryu [27]    | paths           | sequential pattern mining         | items clustering
XProj [3]             | paths           | sequential pattern mining         | k-means
XClust [37]           | paths           | schema matching (path similarity) | hierarchical
XMine [48]            | paths           | sequential pattern mining         | hierarchical
PCXSS [45]            | paths           | path similarity                   | partitioning
Yuan et al. [87]      | bipartite graph | Jaccard measure                   | hierarchical
Table 2.2: An overview of the content and structure-based clustering approaches

Method Name         | Data Model                    | Approach Type | Similarity Measure             | Cluster Algorithm
Doucet et al. [21]  | term-paths, VSM model         | non-linear    | Euclidean                      | k-means
Vries et al. [75]   | terms and links, vector-based | non-linear    | Euclidean                      | k-tree [23]
Nagwani et al. [43] | terms, complete paths         | linear        | Euclidean on similarity matrix | k-means
Kutty et al. [32]   | terms, VSM                    | non-linear    | Cosine                         | partitioning
Yao et al. [84]     | term-paths, VSM               | non-linear    | Cosine                         | partitioning
Yang et al. [80]    | semantic kernel, SLVM         | non-linear    | Euclidean                      | hierarchical
Yoo et al. [86]     | BitCube                       | non-linear    | Hamming                        | partitioning
definition are determined through a schema-matching process. Schema matching is a process of determining a set of correspondences that identify similar elements in two different schemas. The result of the schema-matching process is the element mappings between the target and source documents.

[Figure 2.12 depicts the transformation pipeline: a source schema and a target schema are fed into a schema-matching process that produces element mapping results; a transformation operation turns these mappings into transformation rules; an XSLT script generator creates XSLT scripts from the rules; and a transformation processor applies the scripts to the source XML documents to produce the target XML documents.]

Figure 2.12: The transformation process for XML data.

The transformation operation is the process of assigning
different transformation operators to the different mapping relationships. The result is used by the XSLT script generator to create XSLT transformation script(s). A transformation processor, in this case an XSLT processor, then uses the generated script(s) to convert XML documents that conform to the source definition format into the target definition format.
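The mapping-to-script step can be illustrated with a toy generator; this is a hypothetical sketch (the mapping pairs and the one-rule-per-mapping strategy are invented for illustration), not the generator proposed in this thesis:

```python
def generate_xslt(mappings):
    """Generate a minimal XSLT script from element mappings.

    mappings is a list of (source_path, target_element) pairs, as might be
    produced by a schema-matching step; each pair becomes one value-copying
    rule inside a single root template.
    """
    rules = "\n".join(
        f'    <{target}><xsl:value-of select="{source}"/></{target}>'
        for source, target in mappings
    )
    return (
        '<xsl:stylesheet version="1.0" '
        'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">\n'
        '  <xsl:template match="/">\n'
        f'{rules}\n'
        '  </xsl:template>\n'
        '</xsl:stylesheet>'
    )

script = generate_xslt([("/book/title", "name"), ("/book/author", "creator")])
```

A real generator would also handle cardinality, nesting and indirect matchings, as discussed in the following sections.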
The following sections discuss the related work in the area of schema matching and XML
transformation.
2.3.1 Schema Matching Approaches
Schema matching is the process of finding corresponding elements between two XML data
objects. It is a crucial step in XML transformation as well as in many other applications
such as schema integration, data integration, electronic commerce, data warehousing, and
semantic query processing and optimization.
The input to a schema-matching system is two XML data objects; XML data here refers to both XML documents and document schemas. A schema-matching approach will
usually first model the XML data into a representation, such as a tree structure, that can capture the semantics and structure of the XML data. Then the elements of the data objects are compared using different matchers. According to Smiljanic et al. [63], element matchers can be divided into two groups depending on the type of information used to compute element similarity: localized matchers and structure matchers. Localized matchers compute element similarity by considering properties such as element names, element types or instance values [18]. Structure matchers, on the other hand, compute element similarity by considering the structural properties of elements, such as the relationships between elements in the hierarchical level [41]. There
are a number of challenges associated with schema matching:

- Schemas developed for different applications are heterogeneous in nature, i.e. although the data they describe are semantically similar, the structure and the employed syntax might differ significantly.

- To resolve schematic and semantic conflicts, schema matching often relies on element names, element datatypes, structure definitions, integrity constraints, and data values.
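As a toy illustration of a localized matcher, the sketch below compares element names only, using stdlib string similarity and an arbitrary threshold; it is a hypothetical example, not any of the systems surveyed below:

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """A minimal localized matcher: string similarity of element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_elements(source_names, target_names, threshold=0.6):
    """Pair each source element with its best-scoring target element,
    keeping only pairs above the similarity threshold."""
    mappings = []
    for s in source_names:
        best = max(target_names, key=lambda t: name_similarity(s, t))
        if name_similarity(s, best) >= threshold:
            mappings.append((s, best))
    return mappings

# Hypothetical element names from a source and a target schema.
src = ["bookTitle", "authorName", "price"]
tgt = ["title", "author_name", "cost"]
print(match_elements(src, tgt))
```

A structure matcher would additionally weigh the positions of the elements in the schema hierarchy, as the systems below do.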
A number of surveys have been conducted about this matching problem by researchers
such as Algergawy et al. [5], Shvaiko and Euzenat [62], and Dorneles et al. [20]. The
rest of this section discusses some of the most popular schema-matching systems, and
the schema-matching approaches for XML clustering process and for XML transformation
application.
2.3.1.1 Schema-Matching Systems
There are a number of schema-matching systems and approaches available, some of the
most popular ones are discussed below.
COMA. COMA [18] is a composite approach that uses different matchers, simple and hybrid, to find corresponding elements. These results are then combined to arrive at the final results determining the degree of similarity between the elements. The COMA
results show that the use of different matchers produces a more accurate result than using
only a single matcher. Schemas are transformed to rooted directed acyclic graphs on which
all match algorithms operate. Furthermore, each schema element is uniquely identified by
its complete path from the root of the schema graph to the corresponding node.
Cupid. This system [41] uses multiple criteria to perform element matching. In particular, it combines an element name matcher with a structural element matcher to derive the element similarity coefficient based on the match criteria of their components, thereby emphasizing linguistic, structural and context-dependent similarity. Cupid is biased towards the similarity of the leaf elements, based on the assumption that much of the schema semantics is captured in the leaf elements rather than in the internal structure. Hence, this technique fails to distinguish the varying element contexts that are commonly defined in an XSD schema; for example, Cupid will fail to distinguish between element contexts such as book/name and book/author/name.
Similarity Flooding. Similarity Flooding (SF) [42] takes a different approach. It uses graphs to represent the schemas in order to carry auxiliary similarity information from one graph into the other, and finds corresponding elements using these graphs. The SF approach computes the similarity between two nodes based on the assumption that two elements are similar if their adjacent elements are similar. This technique is not appropriate for comparing schemas from heterogeneous domains, as the structure of the graphs will be totally different.
S-Match. The S-Match [24] is a schema-based matching system which takes two graph-
like structures (e.g. XML schemas or ontologies) and returns semantic relationships be-
tween the nodes of the graphs that correspond semantically to each other. The relation-
ships are determined by analysing the meaning (concepts, not labels) of the elements and
the structures of schemas/ontologies. In particular, labels at nodes, written in natural
language, are translated into propositional formulas which explicitly identify the label’s
intended meaning. This allows for a translation of the matching problem into a proposi-
tional unsatisfiability problem, which can then be efficiently resolved using state of the art
propositional satisfiability deciders. S-Match was designed and developed as a platform for
semantic matching, namely a highly modular system with the core of computing semantic
relations where single components can be plugged, unplugged or suitably customized.
Doan et al. [19]. This matching system is similar to COMA [18] in that it uses a composite approach to combine different matchers, and it uses machine learning for element mappings. In addition, it extends machine learning techniques by introducing a novel learner that exploits the hierarchical structure of the XML data to improve the matching results. A drawback of this technique is that it depends heavily on the user at the training stage: initially, users have to provide some semantic mappings between the input schemas and mediated schemas, and these mappings are then further refined during the training stage.
Xu et al. [79]. This approach finds direct as well as indirect matchings between a source and a target schema. It is based on the assumption that both source and target schemas can be described using rooted conceptual-model graphs and that each element node is associated with a data value or object identifier. The technique can identify indirect matchings between elements from two schemas using structure matching and data value characteristics techniques.
2.3.1.2 Schema Matching for XML Clustering
A few research studies [37, 50, 45, 48] have discussed the schema matching concept in the clustering of XML schema definitions. XClust [37] introduces a complex computational technique to compute the element similarity between schemas of XML data by considering the semantics, immediate descendants and leaf-context information. The main focus of this approach is to cluster DTD schemas into similar groups in order to facilitate the schema integration process. Unlike XClust, Nayak et al. [50, 45] propose a method to find element mappings between XSDs. These methods introduce a rigid function called NCN (number of common nodes) to measure the similarity between leaf nodes using node paths. The drawback of this function is that it does not compute a similarity value when the leaf element of one path does not match the leaf element of another path. Similar to Cupid [41], this approach fails to distinguish between the varying element contexts that are commonly defined in XSD schema definitions. XMine [48], on the other hand, computes a complex schema matching for DTDs. It measures the structural similarity between DTDs by finding the maximal similar paths between schemas using sequential frequent mining. This approach generates a similarity matrix between XML trees and then uses the hierarchical clustering algorithm [30] to perform clustering based on the similarity matrix.
2.3.1.3 Schema Matching for Transformation Approaches
Approaches such as Su et al. [66], Boukottaya et al. [8] and Lee et al. [36] are designed specifically for the transformation of XML documents. Su et al. [66] propose a schema matching approach for the XML Schema language. It represents each schema as a schema graph that captures the schema properties, with nodes and edges representing different relationships between elements within the schema (i.e. containment, of-property and association relationships) and constraints (i.e. ordered composition, exclusive disjunction and referential constraints). For the matching, it considers linguistic, data type and type hierarchy matchings. Besides semantic matching, it also considers structural matching, based on relaxation matching that allows paths to be matched even when nodes are not embedded in the same manner or in the same order: two elements within each path may be matched even if they are not identical, provided their linguistic similarity exceeds a fixed threshold. Instead of generating similarity scores between source and target schemas, this approach uses the schema graph to discover matching nodes and edges, and the transformation operators necessary for the transformation between source and target schemas. Similarly, Boukottaya et al. [8] do not produce similarity scores between source and target schema nodes. The authors suggest using conceptual modelling to model the XML schemas; they represent each schema through two views, a semantic view and a logical view, and the matching process is executed on these views. Lee et al. [36], on the other hand, introduce schema matching based on domain ontology updates. The ontology used in this approach is dynamically updated with user feedback from previous matching results. The proposed ontology is represented by a set of trees in which nodes and edges correspond to concepts and relationships respectively. There are two steps in the schema matching. First, it creates preliminary matchings between leaf nodes based on the domain ontology, lexical similarity and data type similarity; this step creates many-to-many matchings. The second step therefore extracts the final (one-to-one) matchings using path similarities.
2.3.2 Transformation Approaches
The schema matching process is one of the stages in XML transformation: it is used to find corresponding nodes between two XML data objects. After corresponding nodes are found, they are used to generate a transformation script. One of the widely used transformation languages for XML data is the eXtensible Stylesheet Language Transformation (XSLT) [1]. An XSLT program, called a stylesheet, is composed of one or more transformation rules called templates that recursively operate on a single input document. Transformation rules in XSLT are guarded by XPath expressions. XPath6 uses path expressions to select nodes or sets of nodes in an XML document; it operates on XML documents using a tree-based model, navigating through the elements and attributes of an XML document.
6http://www.w3.org/TR/xpath20/
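The path expressions that guard XSLT templates can be illustrated with Python's standard xml.etree.ElementTree, which supports a limited XPath subset; the document and element names below are invented for the sketch:

```python
import xml.etree.ElementTree as ET

# A toy source document.
doc = ET.fromstring(
    "<books>"
    "<book><title>XML Clustering</title><author>Tran</author></book>"
    "<book><title>Data Mining</title><author>Nayak</author></book>"
    "</books>"
)

# Select nodes with path expressions, much as an XSLT template's match
# pattern or xsl:value-of select attribute would.
titles = [t.text for t in doc.findall("./book/title")]
first_author = doc.find("./book[1]/author").text
print(titles, first_author)
```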
2.3.2.1 XSLT for XML Transformation
The majority of existing research [31, 55, 54, 65, 67] translates each matching relationship between nodes into an XSLT template, resulting in a script with many templates. Shin et al. [61] state that XSLT scripts with many templates slow down the transformation process when the script has to be applied repeatedly to a large volume of XML documents. Therefore, the Shin et al. [61] approach generates an XSLT script where the number of templates is proportional to the number of matches between recursive nodes, regardless of the number of matching relationships between internal nodes. A recursive node is a node that contains a reference to its corresponding ancestor. This approach focuses on cardinality operators (defining how many instances of an element type are permitted in a document), since cardinality operators cause XML documents from the same schema to have different structures/representations. The disadvantage of this approach is that the generated XSLT script cannot be re-used to transform XML documents of other XML schemas with similar structures.
On the other hand, Wustner et al. [78] suggest that processing XSLT on the content of the XML documents, instead of on the XML structure, may improve the accuracy of the XML transformation: using the content, some structural problems that cannot be solved by simply transforming DTDs or XSDs can easily be resolved. Approaches such as [77, 53] propose new methods to generate the XSLT script automatically. Given a source XML document and a desired output XML/HTML document, an XSLT stylesheet is automatically generated to transform the source into the output. The generated stylesheet contains rules needed to transform the source document into the output document and
can also be applied to other source documents having the same structure.
2.3.2.2 Other Manipulation Languages for XML transformation
Since the XSLT language has a number of disadvantages, new manipulation languages for XML transformation have been proposed in recent years. Streaming Transformations for XML (STX)7 provides a template-based XML transformation language that operates on streams of SAX (Simple API for XML) events. Unlike XSLT, where the input documents and the result tree need to be built in memory, STX adopts some of the concepts of XSLT but uses SAX as the underlying interface to the XML documents, so that they do not need to be stored in memory. This approach can be used to process large XML files more efficiently.
Approaches such as MTRANS [57] state that writing a long XSLT program is painful, as it requires a good understanding of the XML specification. Thus, the MTRANS language is developed at an abstraction level above XSLT, where XML documents are modelled as class diagrams. It uses UML (Unified Modeling Language) to transform one class diagram (the source document) into another (the target document). Some approaches [54, 55] go in a similar direction, developing a high-level language that is used to specify XML data transformations. They are based on the tree-based model and use XPath expressions. The authors in [54] use an unranked tree transducer approach for XML transformation. XDTrans [55] specifies transformations by means of rules which involve XPath expressions, node variables and non-terminal symbols denoting fragments of a constructed result. These two approaches are developed at an abstraction level in
7http://www.pair.com/lisovsky/transform/stx/
which the transformation rules generated by them can be easily transformed to XSLT for
XML transformation.
Along with many other XML specifications, XQuery8 has also been introduced by the W3C for querying XML documents. Both XSLT and XQuery use XPath expressions to navigate XML documents. Even though the main purpose of XQuery is to query XML documents, it also has the functionality to manipulate and transform XML data [9, 55] into another required XML format. When XQuery is used for XML transformation, the mapping specifications are translated into appropriate XQuery queries over the input document; the result of the query is the expected output document, which must satisfy the output schema. Bruno et al. [9] extend the XQuery language with transformation operators for the transformation of XML data. They show that using XQuery as a transformation language can be more manageable and easier than using XSLT, as the complexity of XSLT lies in the generation of template rules.
2.3.3 Discussion
The existing XML transformation approaches mentioned in this section have the
following limitations: (1) not many XML transformation approaches have addressed the
problem of transformation using XML documents that are based on XML schemas;
and (2) to the best of our knowledge, none of the existing XML transformation approaches
attempts to convert more than one XML schema definition of similar structure to the
same target document at the same time.

8 www.w3.org/TR/xquery/

Having said that, there exists some
work [51, 59] in the area of schema integration which can be used to resolve structural
conflicts, such as nesting discrepancies and backward path representations, when integrating
XML sources into a mediated schema. However, these works do not go further and apply
the mediated schema in a transformation application. This thesis aims to address the
above limitations. To simultaneously translate a large number of source documents into
the same target document, this research proposes an XML transformation approach that
utilizes a structure-only clustering method as a pre-processing stage to group the source
documents into clusters of similar structure. The XSLT language is used instead
of other existing manipulation languages because it is the standard and most commonly
used language for XML transformation.
2.4 Summary
This chapter has reviewed the literature on XML clustering and XML transformation.
A number of different clustering approaches based on different data models, similarity
measures and clustering algorithms have been analysed. Furthermore, the chapter has
outlined the drawbacks of the existing approaches and the gap in the literature that this
research tries to fill.
The limitations of the current approaches discussed in this chapter have guided the
research in this thesis. The next chapter describes the clustering methods proposed in
this research.
Chapter 3
The Proposed Clustering Methods
This chapter describes the clustering methods which have been proposed in this research
to investigate the first key hypothesis of this thesis: that clustering methods
utilizing both the content and structure of XML documents produce a better
clustering solution, in terms of accuracy, than clustering methods utilizing
only the content or only the structure of XML documents.
The proposed clustering methods are divided into two types. The first type of clustering
is the structure-only type. This type of clustering utilizes only the structure of the XML
documents. The second type is the content and structure-based clustering. It utilizes both
the content and structure of the XML documents.
This chapter begins with an overview of the proposed clustering methods. The methods
are then described in detail according to their data modelling and data similarity tasks.
A hybrid clustering algorithm, utilizing the clustering methods for the partitioning of the
XML documents, is introduced later in the chapter.
3.1 The Proposed Clustering Methods: Overview
Figure 3.1 is an overview of all the proposed clustering methods. The input for the
clustering methods is a collection of XML documents. The clustering methods proposed
in this thesis are classified into two types: structure-based clustering, and content and
structure-based clustering.
There are two structure-only clustering methods. The first method is XML clustering
based on a Tree model (XCTree). This method utilizes a tree model and a tree similarity
measure (TSim) to compute the degree of similarity between XML documents. The second
method is XML clustering based on a path model (XCPath). A path model and a path
similarity measure (CPSim) are defined and used for the grouping of XML documents by
the XCPath method.
In addition, there are two content and structure-based clustering methods. The first method
is XML clustering based on the linear combination of the structural and content similarity
measures (XCLComb). This method uses a linear combination measure (LCSim) to com-
bine the similarity values from a structure measure and a content measure for the overall
document similarity. The second method is XML clustering based on a text-path model
(XCTPath). It is a non-linear method which uses text-paths for representation of both
the structure and content of the XML documents. A text-path vector similarity measure
(TPVSim) is defined to compute the similarity between two sets of text-paths.
The proposed clustering methods use the same clustering algorithm called hybrid clus-
tering to group the XML documents into k number of clusters. The rest of this chapter
explains each of the proposed clustering methods in more detail.
[Figure: a collection of XML Documents feeds into data modelling and data similarity stages — XCTree (tree model, TSim), XCPath (path model, CPSim), XCLComb (tree and text vector models, LCSim) and XCTPath (text-path vector model, TPVSim) — and a hybrid clustering stage then produces clusters C1, C2, ..., Ck.]
Figure 3.1: An overview of the proposed clustering methods.
3.2 The Structure-Only Clustering Methods
The structure of XML documents is used to annotate their content, which distinguishes
XML documents from normal text documents. XML structure is very flexible, since it
can be defined by the user; thus, the same information may not be annotated with the
same structure. There are many applications in which structure-only clustering methods
can be utilized, such as schema integration, data warehousing and
message exchange.
With many applications utilizing the structure for the clustering of XML documents,
this thesis proposes two structure-only clustering methods: the XCTree and the XCPath.
They differ according to the underlying data model and similarity measure. The XCTree
represents the structure of XML documents using a tree model, whereas the XCPath
represents the structure using a path model. In this research, the path and tree models
are used for the representation of the structure of XML documents because they are the
most commonly used models and are less complex than the graph model. The rest of
this section describes the two structure-only clustering methods in terms of their data
modelling and data similarity.
3.2.1 The XCTree Method
The XCTree method is one of the two structure-only clustering methods proposed
in this research. The method groups the XML documents using a tree model to capture
the structure embedded in the XML documents. A new tree similarity measure is then
defined in order to compute the degree of similarity between XML documents.
3.2.1.1 The Tree Model
To capture the structure of the XML documents, the XCTree method uses a tree model
called the summary tree structure. The summary tree structure is encoded in the depth-first
string tree encoding format [13] and is based on a rooted label tree structure. A rooted
label tree T is defined as T = (V, E, L, r), where V is the set of nodes in T, E is
a set of edges, L is a set of node labels, and r is the
root node. If (ni, nj) ∈ E and ni ≠ nj, then (ni, nj) is an edge in which the node ni
is the parent of the node nj. A rooted label tree has the following properties: (1) there
is exactly one r, where r ∈ V and r has no parent; (2) every node, except r, has exactly
one parent; and (3) every node in V is reachable via edges from r.
An example of the rooted label tree structure is shown in Figure 3.2(a). Nodes that
do not have any child nodes, or that contain only a text node, are called leaf nodes. Nodes
that contain other nodes are referred to as internal nodes. This thesis focuses on the
labels of the element nodes and the element attributes, as they are the most important
components in the structure of XML documents. Attributes are modelled and treated in
the same way as element leaf nodes.
[Figure 3.2(a): a tree rooted at company with children address, cname and personnel; personnel contains two person nodes with name and address children.]
company address -1 cname -1 personnel person name -1 address -1 -1 -1
(b)
Figure 3.2: An example of a tree structure (a) and its corresponding summary tree structure in depth-first string tree encoding format (b).
Definition 1. A summary tree structure is a tree structure that records only the unique
nodes. Two nodes that have the same label and the same type (leaf or internal)
are replaced by a single occurrence in the summary tree structure. For a summary
tree structure T consisting of only the root node r, the depth-first string of T is S(T) = lr -1,
where lr is the label of r. Every node is followed by a "-1" to represent backtracking. For a T
with more nodes, let the children of r be r1, r2, ..., rk; the depth-first string of T is then
S(T) = lr S(r1) S(r2) ... S(rk) -1.
Figure 3.2 shows an example of a rooted label tree structure in (a) and its corresponding
summary tree structure in (b). Notice that the node person and its children only appear
once in the summary tree structure. The summary tree structure does not keep the
occurrence information (cardinality) of the nodes. Utilizing the occurrence information
of the nodes in the XCTree method might cause two similar documents to have a low
similarity value [16]. For instance, consider two documents with exactly the same structure,
where one document repeats a node many times and the other contains the same node
only once. In this scenario, taking the occurrence information of the nodes into
consideration would yield a lower structure similarity between these two documents than
using the summary tree structure, which ignores the occurrence information of the
nodes.
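As an illustration, the summary tree construction and the depth-first string encoding of Definition 1 can be sketched in Python. The dictionary-based tree representation below is an assumption made for illustration, not the thesis implementation.

```python
# A node is a dict: {"label": str, "children": [node, ...]}.
# summarize() merges sibling nodes with the same label and the same
# type (leaf vs internal), as required by Definition 1; encode()
# produces the depth-first string with "-1" backtracking markers.

def summarize(node):
    groups, order = {}, []
    for child in node["children"]:
        key = (child["label"], len(child["children"]) > 0)
        if key not in groups:
            groups[key] = {"label": child["label"], "children": []}
            order.append(key)
        # collect the children of duplicate siblings so they merge too
        groups[key]["children"].extend(child["children"])
    return {"label": node["label"],
            "children": [summarize(groups[k]) for k in order]}

def encode(node):
    return node["label"] + "".join(
        " " + encode(c) for c in node["children"]) + " -1"

leaf = lambda label: {"label": label, "children": []}
company = {"label": "company", "children": [
    leaf("address"), leaf("cname"),
    {"label": "personnel", "children": [
        {"label": "person", "children": [leaf("name"), leaf("address")]},
        {"label": "person", "children": [leaf("address")]},
    ]},
]}

print(encode(summarize(company)))
# company address -1 cname -1 personnel person name -1 address -1 -1 -1 -1
# (the final -1 closes the root, per Definition 1)
```

Running this on the tree of Figure 3.2(a) merges the repeated person nodes and reproduces the depth-first string of Figure 3.2(b), with an additional trailing -1 closing the root as Definition 1 prescribes.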
3.2.1.2 The Tree Similarity Measure: TSim
Based on the summary tree structure, a new measure called the Tree Similarity (TSim) is
proposed to calculate the similarity between two summary tree structures which is defined
as follows:
TSim(tx, ty) = max(SimTreeMatching(tx), SimTreeMatching(ty)) (3.1)
SimTreeMatching(tx) = ( sum_{i=1}^{|tx|} nodeSim[i] ) / |tx|   (3.2)
The TSim measure is the best similarity value of the two SimTreeMatching values between
trees tx and ty. The SimTreeMatching(tx) is calculated by computing a treeMatching
algorithm from source tree tx to target tree ty. The output of the algorithm is an array
called the nodeSim which contains the similarity values of the nodes in tx that match
with the nodes in ty. The SimTreeMatching(tx) is the sum of the similarity values in
the nodeSim divided by the number of nodes in tx. The SimTreeMatching(ty), on the
other hand, is calculated by computing the treeMatching algorithm from source tree ty
to target tree tx in the same way.
The detailed treeMatching algorithm is shown in Algorithm 1. The algorithm is not
symmetric, as the tree matching from source tree tx to target tree ty differs from the
matching from source tree ty to target tree tx. The similarity value SimTreeMatching(tx)
can therefore be different from the similarity value of SimTreeMatching(ty). The treeMatching
algorithm starts the tree matching at the first node i in a source tree to the first node j in
a target tree and works its way down the source tree. If the label labeli of i is not equal
to the label labelj of j, the algorithm moves to the next node j + + in the target tree
structure and starts the node matching with i.
When labeli equals labelj , the similarity value similarity i of i with j is calculated by taking
Algorithm 1 treeMatching
Input: Source tree tx, target tree ty, node similarity array nodeSim;
Output: nodeSim;
1. while node i ∈ tree tx /*starting with the first node in tx*/
2. double similarity i=0;
3. while node j ∈ tree ty /*starting with the first node in ty*/
4. if labeli and labelj are the same
5. similarity i = lower(leveli,levelj)/higher(leveli,levelj);
6. treeMatching(subTreei, subTreej , nodeSim);
7. for each node s ∈ subTreei
8. add the similarity value of s ∈ nodeSim to similarity i;
9. reset the similarity value of s ∈ nodeSim to zero;
10. end for
11. if similarity i is larger than the similarity value of i ∈ nodeSim
12. set similarity value of i ∈ nodeSim to similarity i;
13. end if
14. process the next sibling of j ∈ ty;
15. else process the next node j++ ∈ ty;
16. end if
17. end while
18. if i finds a match with any node ∈ ty
19. process the next sibling of i ∈ tx;
20. else process the next node i++ ∈ tx;
21. end if
22. end while
23. return nodeSim;
into account the node levels in the tree structure (Line 5 in Algorithm 1). If the matching
nodes are at the same level, a maximum similarity value of 1 is assigned. Otherwise a
penalty value is assigned according to the difference in level. The root node is in the first
level of a tree structure and its immediate children are in the second level and so on. The
penalty value is calculated by considering the lower level of the two node levels divided
by the higher level of the two nodes. For instance, given two nodes with the same label,
one node is at level 3 and the other is at level 2. The node similarity value of the two
nodes is 0.66 (2/3). The node similarity value of two matching nodes is stored in the
array nodeSim. The nodeSim contains the similarity values of the nodes in the source tree
which find a match with the nodes in the target tree. Therefore, the length of nodeSim
is equal to the number of nodes in the source tree.
Each time labeli is equal to labelj , the treeMatching algorithm starts again for the children
(referred to as subTree in Algorithm 1) of i and j. If either i or j does not have any children,
the treeMatching algorithm starts the node matching of i to the next sibling of j. If j
does not have any sibling, the treeMatching algorithm starts the node matching of i
to the next sibling of j’s ancestor. When the treeMatching algorithm finishes the node
matching for the children of i and j, the similarity values of the children of i are stored
in the nodeSim. The sum of the similarity values of the children of i is added to the
similarity i values of i. The similarity values of the children of i in the nodeSim are reset
to zero. If the similarityi is larger than the similarity value of i in the nodeSim, the
similarity value of i in the nodeSim is set to similarityi.
After the treeMatching algorithm finishes the node matching for i to the nodes in the
target tree and i finds a match with any of the nodes in the target tree, the treeMatching
algorithm starts the node matching for the next sibling of i if there is any, otherwise it
moves to the next sibling of i’s ancestor. However, after finishing the node matching for i
and no match is found, the algorithm starts the node matching for the next node i++ in
the source tree structure. The treeMatching algorithm ends when the algorithm reaches
the end of the source tree structure.
The treeMatching algorithm can discover structural conflicts such as nesting discrepancies.
For example, consider the two paths movie/title and actor/movie/title: these two
paths are similar, but because of the nesting they may not match exactly. The treeMatching
algorithm can resolve this type of conflict because it continues to the
next node if the first node in the hierarchy does not have a match. Moreover, because the
treeMatching algorithm performs the matching both from source tree tx to target tree ty
and from source tree ty to target tree tx, it can discover more structural conflicts.
Although the treeMatching algorithm may not accurately discover structural conflicts
such as backward path representations (for example, the paths title/movie and movie/title),
it still produces a similarity value greater than 0 between such paths, such as 0.5.
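The rolled-up bookkeeping of Algorithm 1 can be sketched as a simplified recursive procedure. The Python sketch below, with trees as nested dictionaries, is an illustrative approximation rather than the thesis implementation: it matches each source node against every target node instead of following the document-order pointer of Algorithm 1, but it reproduces the 3/5 and 4/5 totals worked out in Example 3 below.

```python
# node: {"label": str, "children": [node, ...]}

def match_score(s, t, sl, tl):
    """Score of source node s matched at target node t (same label),
    including the best matches of s's children inside t's subtree."""
    score = min(sl, tl) / max(sl, tl)      # level penalty (Line 5)
    for c in s["children"]:
        score += best_in(c, sl + 1, t["children"], tl + 1)
    return score

def best_in(s, sl, targets, tl):
    """Best rolled-up score of s against any node in the target forest."""
    best = 0.0
    for t in targets:
        if t["label"] == s["label"]:
            best = max(best, match_score(s, t, sl, tl))
        best = max(best, best_in(s, sl, t["children"], tl + 1))
    return best

def total_score(s, sl, target_root):
    b = best_in(s, sl, [target_root], 1)
    if b > 0:                  # s matched: keep its rolled-up value
        return b
    # s is unmatched: its children are matched independently
    return sum(total_score(c, sl + 1, target_root) for c in s["children"])

def size(n):
    return 1 + sum(size(c) for c in n["children"])

def sim_tree_matching(src, tgt):
    return total_score(src, 1, tgt) / size(src)

def tsim(tx, ty):
    return max(sim_tree_matching(tx, ty), sim_tree_matching(ty, tx))

leaf = lambda l: {"label": l, "children": []}
tx = {"label": "personnel", "children": [
    {"label": "person", "children": [
        {"label": "name", "children": [leaf("firstName"), leaf("lastName")]}]}]}
ty = {"label": "personnel", "children": [
    leaf("person"),
    {"label": "person", "children": [leaf("name"), leaf("address")]}]}

print(sim_tree_matching(tx, ty), sim_tree_matching(ty, tx))  # 0.6 0.8
```

On the trees of Figures 3.3 and 3.4, the sketch yields SimTreeMatching(tx) = 3/5 and SimTreeMatching(ty) = 4/5, so TSim = 0.8.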
Example 3. To understand the treeMatching algorithm further, consider the matching
between trees tx and ty as given in Figures 3.3 and 3.4. The numbered arrows
in the figures show the sequence in which the treeMatching algorithm progresses.
Using the example in Figure 3.3, at Step 1, the labels of the two root nodes are the
same. The treeMatching algorithm then processes the children of the two root nodes.
At Step 2, the node person in tx is compared with the first node person in ty and their
labels are the same. However, the algorithm does not process further because that node person in
ty does not have any children. At Step 3, the node person is compared to the next
node person in ty. As the two nodes match, the treeMatching algorithm processes
the children of the nodes. The nodes that do not have any arrows pointing in or out
in Figures 3.3 and 3.4 have not been processed by the treeMatching algorithm. The
output similarity value from the treeMatching(tx, ty) algorithm for the example in
Figure 3.3 is three. Even though the treeMatching(tx, ty) run has four matches, the node person
of the source tree tx matches twice with nodes in the target tree ty; therefore, only
the best match similarity value is used. The best match value of a node is the highest
sum of the similarity values of the node's descendants; in other words, the best match
of a node is the one with the most matches among the node's descendants. The same
process is repeated for treeMatching(ty, tx), which yields the value of 4. Finally,
SimTreeMatching(tx) = 3/5 and SimTreeMatching(ty) = 4/5; the maximum of these
two SimTreeMatching values is the TSim value for tx and ty.
[Figure: tree tx = personnel with a single person child whose name node has children firstName and lastName; tree ty = personnel with two person children, the second having children name and address. Numbered arrows (1-5) mark the order in which the treeMatching algorithm visits the nodes.]
Figure 3.3: An example of the treeMatching algorithm from tx to ty.
[Figure: the same trees tx and ty as in Figure 3.3, with numbered arrows (1-5) marking the matching sequence in the opposite direction, from ty to tx.]
Figure 3.4: An example of the treeMatching algorithm from ty to tx.
3.2.2 The XCPath Method
The second structure-only clustering method is the XCPath method. This method employs
a path model to capture the structure of the XML documents. A path similarity measure
called CPSim (Common Path Similarity) is defined to compute the degree of similarity
between XML documents using the path model.
3.2.2.1 The Path Model
The XCPath method represents the structure of an XML document using a set of complete
paths. A complete path contains the labels of the nodes from the root to the leaf node.
The complete paths can be extracted from the summary tree structure. Based on the
example in Figure 3.2, the summary tree structure can be broken down into the following
complete paths:
company/address,
company/cname,
company/personnel/person/name,
company/personnel/person/address
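The extraction of complete paths from a summary tree can be sketched as follows; the dictionary-based tree representation is an illustrative assumption carried over from the earlier sketches, not the thesis implementation.

```python
def complete_paths(node, prefix=""):
    """Extract root-to-leaf label paths from a summary tree
    (nodes are dicts: {"label": str, "children": [...]})."""
    path = f"{prefix}/{node['label']}" if prefix else node["label"]
    if not node["children"]:
        return [path]
    paths = []
    for child in node["children"]:
        paths.extend(complete_paths(child, path))
    return paths

leaf = lambda l: {"label": l, "children": []}
company = {"label": "company", "children": [
    leaf("address"), leaf("cname"),
    {"label": "personnel", "children": [
        {"label": "person", "children": [leaf("name"), leaf("address")]}]}]}

print(complete_paths(company))
# ['company/address', 'company/cname',
#  'company/personnel/person/name', 'company/personnel/person/address']
```

Applied to the summary tree of Figure 3.2, this yields exactly the four complete paths listed above.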
3.2.2.2 The Path Similarity Measure: CPSim
The similarity between two documents dx and dy, represented by their sets of paths Px
and Py, is calculated using the CPSim measure which is defined as follows:
CPSim(Px, Py) = ( sum_{i=1}^{|Px|} max_{j=1..|Py|} PSim(pi, pj) ) / max(|Px|, |Py|)   (3.3)

PSim(pi, pj) = max(CNC(pi, pj), CNC(pj, pi)) / max(|pi|, |pj|)   (3.4)
where |Px| and |Py| are the number of paths in Px and Py, respectively. The CPSim is
the sum of the best path similarity from the PSim measure for all paths in dx with the
paths in dy, divided by the maximum number of paths in the two sets Px and Py. Only
path similarity values from PSim that exceed a path threshold are considered in the
CPSim measure. The path threshold determines the lowest path similarity value that two
paths must have to be considered a matching pair. The path threshold is a user-defined
value ranging from 0 to 1, where 1 is the highest value and indicates that the
structures of two paths match exactly.
The PSim of paths pi and pj is the better of the two CNC (Common Node Coefficient)
values divided by the maximum number of nodes in the two paths. The CNC is the
number of common nodes - that is, the number of nodes having the same label - counted
while respecting the hierarchical order of the nodes in the paths. The CNC algorithm
works in a similar way to the treeMatching algorithm; the difference is that the CNC
algorithm finds the common nodes between two paths starting from the leaf node.
However, the CNC algorithm is more time-consuming than the treeMatching algorithm,
since it operates between two paths in which the ancestors of the leaf nodes need to be
revisited a number of times. The aim of the CNC algorithm is to find corresponding
leaf nodes by considering the node labels as well as their ancestors. This algorithm is
appropriate for a schema matching system where all corresponding leaf nodes between
two data sources need to be identified.
The CNC algorithm is detailed in Algorithm 2. It starts the matching of the two paths
at the leaf node. Each time a node in the source path finds a match with a node in the
target path, the two nodes' parents are processed next; otherwise, the CNC algorithm
processes the current unmatched node in the source path against the parent of the
unmatched node in the target path. The algorithm continues until all the nodes in the
source path find a match or the target path reaches the root node. A match in CNC
occurs when the labels of two nodes are the same; each time a match is found, the
similarity value between the source path and the target path is incremented by 1. The
CNC algorithm does not process the ancestors of a node in the source path if the node
cannot find a match with the nodes in the target path.
Also similar to the treeMatching algorithm, the CNC algorithm is not symmetric and it
can discover structural conflicts such as nesting discrepancies. Like the treeMatching,
Algorithm 2 CNC
Input: Paths pi and pj;
Output: Int similarity;
1. int similarity = 0;
2. int z = 0;
3. for(int t = 0; t < |pi|; t++)
4. while z < |pj|
5. if(nt == nz)
6. similarity+=1;
7. z++;
8. break from ’while’ loop;
9. else
10. z++;
11. end if
12. end while
13. end for
14. return similarity;
the CNC algorithm may not accurately discover structural conflicts such as backward path
representations; however, it produces a similarity value greater than 0 because it
can still discover at least one element that is the same in the backward path representations.
Consider the examples in Figure 3.5. The CNC algorithm starts at the leaf node (the
node on the right-hand side is the leaf node). In Figure 3.5(a), the leaf node name of the
source path py is compared with the leaf node lastName of the target path px. The numbered
arrows in the figure show the sequence in which the matching process in the CNC algorithm
is executed. The output from the CNC algorithm for example (a) in Figure 3.5 is 4, and
the output for example (b) is 0.
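A Python sketch of CNC (following Algorithm 2), together with the PSim and CPSim measures of Equations 3.3 and 3.4, reproduces these outputs. Representing paths as plain lists of labels, root first, is an illustrative assumption.

```python
def cnc(p_src, p_tgt):
    """Common Node Coefficient: count label matches between two paths
    (lists of labels, root first), scanning from the leaf upward."""
    src, tgt = list(reversed(p_src)), list(reversed(p_tgt))
    sim, z = 0, 0
    for label in src:
        while z < len(tgt):
            if label == tgt[z]:
                sim += 1
                z += 1
                break
            z += 1            # keep climbing the target path
    return sim

def psim(pi, pj):
    # best of the two matching directions, normalized (Eq. 3.4)
    return max(cnc(pi, pj), cnc(pj, pi)) / max(len(pi), len(pj))

def cpsim(Px, Py, threshold=0.0):
    # sum of best per-path similarities above the threshold (Eq. 3.3)
    total = 0.0
    for pi in Px:
        best = max((psim(pi, pj) for pj in Py), default=0.0)
        if best > threshold:
            total += best
    return total / max(len(Px), len(Py))

px = ["company", "personnel", "person", "name", "lastName"]
py = ["company", "personnel", "person", "name"]
print(cnc(py, px), cnc(px, py))   # 4 0  (the outputs of Figure 3.5)
print(psim(px, py))               # 0.8
```

With py as the source path every node finds a match (CNC = 4); with px as the source, the unmatched leaf lastName exhausts the target path and no further matches occur (CNC = 0), matching examples (a) and (b).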
[Figure: two CNC matching examples between the paths px = company/personnel/person/name/lastName and py = company/personnel/person/name; the numbered arrows mark the matching sequence, in (a) with py as the source path and in (b) with px as the source path.]
Figure 3.5: CNC matching
3.3 The Content and Structure-based Clustering Methods
The structure-only clustering methods described in the previous section utilize only the
structure of the XML documents. However, for the clustering of XML documents, the
content of the documents can also play an important role. For instance, documents having
the same structure might not contain the same content and vice versa. A good example
is that of two journal articles that have the same structure but different content; one is
about health science and the other is about data mining. This example shows that the
clustering based on the structure-only information might not produce a desirable content
and structure-based clustering solution.
Therefore, this section introduces two methods which utilize both structure and content
for the clustering of XML documents. The first method is the XCLComb, which utilizes a
linear measure to combine the structure similarity value and the content similarity value
to compute the overall document similarity; the structure and the content are represented
using different data models. The second method is the XCTPath, which represents the
content and structure using the same data model; it is a non-linear method for the
clustering of the XML documents using both the structure and the content.
3.3.1 The XCLComb Method
Not many clustering methods can be applied efficiently to both homogeneous and
heterogeneous XML document collections. A homogeneous collection generally does
not vary drastically in terms of structure; mostly, it varies in terms of content.
Different types of collections need different ways of measuring document similarity.
For instance, documents from a homogeneous collection can be differentiated better by
their content than by their structure. A heterogeneous collection, on the other hand,
differs in terms of both structure and content. Given these characteristics of homogeneous
and heterogeneous collections, it is not easy to propose an approach that works efficiently
with both types of collections.
In order to have a clustering method that can be applied to homogeneous as well as
heterogeneous collections, this thesis proposes the XCLComb method, which uses different
data models and similarity measures for the content and the structure. The similarity values
of the content and the structure are calculated separately, and these values are then combined
with different weightings to adjust the relative importance of the content similarity value and
the structure similarity value. For instance, homogeneous collections will assign a higher
weight to the content similarity value than to the structure similarity value.
3.3.1.1 The Tree Model and The Text Vector Model
The XCLComb method represents the structure of XML documents using the tree model
employed by the XCTree method, whereas the content of XML documents is based on a
text vector model.
The content of a document dj is represented using a text vector tvj = (w1,j, w2,j, ..., wm,j),
where m is the number of terms in the XML document collection from which document dj
is drawn and wi,j is the TF-IDF weighting of term ti in document dj, defined as:

wi,j = TFi · log( |D| / |{d ∈ D : ti ∈ d}| )   (3.5)

where TFi is the term frequency of term ti in document dj divided by the total number of
term occurrences in document dj, log(|D| / |{d ∈ D : ti ∈ d}|) is the inverse document
frequency (IDF), |D| is the total number of documents in the collection D, and
|{d ∈ D : ti ∈ d}| is the number of documents in D containing the term ti.
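The TF-IDF weighting of Equation 3.5 can be sketched as follows; the toy two-document collection is made up for illustration.

```python
import math

def tfidf_vectors(docs):
    """docs: list of documents, each a list of terms.
    Returns one {term: weight} map per document (Equation 3.5)."""
    N = len(docs)
    df = {}                                  # document frequency per term
    for terms in docs:
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for terms in docs:
        total = len(terms)                   # total term occurrences in dj
        vectors.append({t: (terms.count(t) / total) * math.log(N / df[t])
                        for t in set(terms)})
    return vectors

docs = [["xml", "tree", "xml"], ["xml", "path"]]
vectors = tfidf_vectors(docs)
# "xml" occurs in every document, so its IDF (and weight) is 0;
# "tree" in the first document gets (1/3) * log(2/1).
```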
3.3.1.2 The Linear Similarity Measure: LCSim
Given two documents dx and dy, the document similarity employed by the
XCLComb method is a Linear Combination of the structural and content Similarity (LCSim)
values, defined as follows:

LCSim(dx, dy) = TSim(tx, ty) × (1 − λ) + TVSim(tvx, tvy) × λ   (3.6)

TVSim(tvx, tvy) = (tvx^T Uk Uk^T tvy) / ( |Uk^T tvx| |Uk^T tvy| )   (3.7)

where λ is a user-defined weighting value ranging from 0 to 1, and tx and ty are the
summary tree structures of the documents dx and dy. The λ can be adjusted depending
on the importance of the content and structure in the input collection. The TVSim (Text
Vector-based Similarity) is a cosine measure using a kernel matrix Uk which is constructed
from LSI [15]. The construction of Uk is discussed later in this section.
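The combination in Equations 3.6 and 3.7 can be sketched as follows. The small 3×2 matrix Uk below is a made-up stand-in for an LSI-derived kernel, purely for illustration.

```python
import math

def kernel_cosine(vx, vy, Uk):
    """Cosine similarity of vectors vx, vy after projection by the
    kernel: (vx^T Uk Uk^T vy) / (|Uk^T vx| |Uk^T vy|)  (Eq. 3.7)."""
    k = len(Uk[0])
    def project(v):
        return [sum(Uk[i][j] * v[i] for i in range(len(v)))
                for j in range(k)]
    px, py = project(vx), project(vy)
    dot = sum(a * b for a, b in zip(px, py))
    norm = (math.sqrt(sum(a * a for a in px)) *
            math.sqrt(sum(b * b for b in py)))
    return dot / norm if norm else 0.0

def lcsim(tsim, tvsim, lam):
    """Linear combination of structure and content similarity (Eq. 3.6)."""
    return tsim * (1 - lam) + tvsim * lam

Uk = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # toy 3-term, rank-2 kernel
tvx, tvy = [0.2, 0.0, 0.7], [0.1, 0.3, 0.5]
content = kernel_cosine(tvx, tvy, Uk)
overall = lcsim(0.8, content, lam=0.5)       # structure TSim assumed 0.8
```

With λ = 0.5 the structure and content similarities contribute equally; λ closer to 1 favours the content similarity, as suggested for homogeneous collections.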
3.3.2 The XCTPath Method
Unlike the XCLComb method, the XCTPath method groups the XML documents using
one data model, called the text-path model. Two reasons for using one data model to
represent both the structure and the content are: (1) there is no weighting
value λ to adjust; and (2) it produces a more meaningful content and structure-based
clustering solution, since the relationships between the structure and the content are maintained.
3.3.2.1 The Text-Path Vector Model
To represent the structure and the content of XML documents, the XCTPath method uses
text paths. A text path contains structure along with its content, in a way similar
to Yao et al. [83]. Given a collection of XML documents D = {d1, d2, ..., dn}, a set of text
paths TPV = {tp1, tp2, ..., tpm} is extracted from D after stop-word removal and
stemming [58] are performed on the content.
Definition 2. A text path is a partial path or a complete path along with a term that
occurs under the leaf node or under a descendant leaf node of the path. A text path
always starts with the root node and ends with a term.
The text paths of document dj are represented using a vector tpvj = (w1,j, w2,j, ..., wm,j),
where wi,j is the TF-IDF weighting of the text path tpi in document dj, and
m is the number of text paths in the document collection D. Text paths that represent
the content and structure of an XML document can occur in many different path lengths.
The length of the text path can be adjusted by the user to include the desired number of
ancestors for a term. For instance, when the length of a text path increases, the structure
plays a more important role; when it decreases, the content plays a more important role.
A text path with a length of 1 contains only the root node and a term from one of the
root node's descendant leaf nodes; a text path with a length of 2 contains the root node,
one of its immediate child nodes, and a term from a descendant leaf node of that child,
and so forth.
A document whose text paths have a maximum length of 3 therefore contains complete
paths of length less than or equal to 3 along with their content, and partial paths that
contain only the first 3 ancestors of a term, starting from the root node. This approach
is similar to Yao et al. [84]. Take the example of the XML document in Figure 3.6. With
a maximum text-path length of 3, the document contains the following text paths:
conf/id/IE06, conf/title/Conference, conf/title/Knowledge,
conf/title/Discovery, conf/title/Data, conf/title/mining, conf/title/KDD,
<?xml version="1.0"?>
<!DOCTYPE conf SYSTEM "conf.dtd">
<conf id="IE06">
  <title>The 16th ACM SIGKDD Conference on Knowledge Discovery and Data mining (KDD-2010)</title>
  <year>2010</year>
  <editor>
    <person>
      <name>Peter Gavin</name>
      <email>[email protected]</email>
      <phone>61-9828712</phone>
    </person>
  </editor>
  <paper>
    <title>Mining the structure for XML document clustering</title>
    <author>
      <person>
        <name>Susan Smith</name>
        <email>[email protected]</email>
      </person>
    </author>
    <reference>
      <paper>
        <title>A Survey of XML Similarity Measures</title>
        <author>
          <person>
            <name>David MacDonald</name>
            <email>[email protected]</email>
          </person>
        </author>
      </paper>
    </reference>
  </paper>
</conf>
Figure 3.6: An example of a conference XML document
conf/year/2010, conf/editor/person/Peter, conf/editor/person/Gavin, etc.
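As an illustrative sketch (not the thesis implementation; the function name, the whitespace tokenization, and the omission of attribute values are assumptions), the text-path extraction described above can be written with Python's `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

def text_paths(xml_string, max_len=3):
    """Collect text paths whose tag prefix is truncated to max_len ancestors.

    A text path pairs the first max_len tags on a root-to-leaf path with each
    term in the leaf text. Tokenization is plain whitespace splitting; the
    full content pre-processing of Section 4.2 is omitted here.
    """
    root = ET.fromstring(xml_string)
    paths = []

    def walk(node, ancestors):
        tags = ancestors + [node.tag]
        if node.text and node.text.strip():
            prefix = "/".join(tags[:max_len])   # keep only max_len ancestors
            for term in node.text.split():
                paths.append(f"{prefix}/{term}")
        for child in node:
            walk(child, tags)

    walk(root, [])
    return paths

doc = "<conf><title>Knowledge Discovery</title><year>2010</year></conf>"
print(text_paths(doc))
# ['conf/title/Knowledge', 'conf/title/Discovery', 'conf/year/2010']
```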
3.3.2.2 The Non-Linear Measure: TPVSim
The XCTPath method uses a measure called Text Path Vector-based Similarity (TPVSim)
for the document similarity which is defined as follows:
TPVSim(tpv_x, tpv_y) = (tpv_x^T U_k U_k^T tpv_y) / (|U_k^T tpv_x| |U_k^T tpv_y|)    (3.8)
where Uk is the kernel matrix constructed from LSI [15]. Different input XML document
collections will have a different Uk and the Uk in this method is different from the Uk in
the XCLComb method. The next section describes the construction of the Uk in more
detail.
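A minimal sketch of how Equation 3.8 could be computed, assuming U_k is already available as a dense matrix (the toy U_k and vectors below are illustrative, not from the thesis):

```python
import numpy as np

def tpv_sim(tpv_x, tpv_y, Uk):
    """Text Path Vector-based Similarity (Equation 3.8): the cosine of two
    text-path vectors after projection into the LSI semantic space Uk."""
    px = Uk.T @ tpv_x          # project into the k latent dimensions
    py = Uk.T @ tpv_y
    return float(px @ py) / (np.linalg.norm(px) * np.linalg.norm(py))

# Toy example: 3 text-path features reduced to 2 latent dimensions.
Uk = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 1.0]])
x = np.array([1.0, 2.0, 0.0])
y = np.array([0.0, 0.0, 3.0])
print(tpv_sim(x, x, Uk))   # identical vectors give a similarity close to 1.0
print(tpv_sim(x, y, Uk))
```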
3.3.3 The Kernel Construction Approach
To calculate the degree of similarity between text vectors or text-path vectors, a kernel
is used. A kernel is constructed using the Latent Semantic Indexing (LSI) [15]. LSI
can construct a semantic space wherein terms and documents that are closely associated
are placed beside one another, which reflects major associative patterns in the data and
ignores less important influence patterns.
The construction of the kernel is expensive in terms of memory usage (refer to section
2.2.2.1 of Chapter 2 that gives the background on semantic kernel construction) since
it needs to compute the Singular Value Decomposition (SVD). Therefore, this thesis intro-
duces a reduction method called XML Dimensional Document Reduction (XDDR) to
reduce the document dimensionality of a feature-document matrix X_{m×n} to X_{m×n′}, where
a feature is either a term or a text path, m is the number of features, n is the number of doc-
uments in an input XML document collection, and n′ is the reduced number of documents,
which is smaller than n. Each item in the matrix is the frequency of a feature occurring in a
document. This thesis tries to preserve the term dimensionality rather than the document
dimensionality for the grouping of XML documents because the document dimensionality
might not be important in finding the associations between terms.
Algorithm 3 describes the algorithm of the XDDR method. Before the XDDR method is
executed, the input document collection D is first partitioned using one of the structure-
only clustering methods, the XCTree method or the XCPath method proposed in this
thesis. The clustering solution generated from the structure-only clustering method is
then processed by the XDDR method as follows. Let the structure-only clustering solution
be a collection of clusters SC = {sc1, sc2, ..., sck}, where (1) sci = {d1, d2, ..., dn′′} with
n′′ < n, (2) sc1 ∪ sc2 ∪ ... ∪ sck = {d1, d2, ..., dn}, where n is the number of documents
in document collection D, and (3) |sci| <= |sci+1|, i.e. clusters in SC are sorted in ascending order
according to the number of documents that they contain (Line 5 of Algorithm 3). Clusters
containing the smaller number of documents are processed before the larger sized clusters.
Let Ψ be the number of documents to be selected for the current cluster and η be the
number of documents to be selected for each cluster in SC. If the number of documents
in sci, denoted by |sci|, is equal to or less than Ψ, then the documents belonging to sci
are added to a new document collection D′. If |sci| is less than Ψ, the unused portion
of Ψ, denoted by remNum, is distributed evenly across the remaining
unprocessed clusters in SC; Ψ and η are adjusted to account for remNum
(Lines 14 to 25). For the clusters where |sci| > Ψ, the document importance of each
document in cluster sci is calculated. The document importance (DI) of a document in a
cluster is measured as:
Algorithm 3 The XDDR Algorithm
Input: structure-only clustering solution SC = sc1, sc2, ..., sck;
       user-defined number of dimensional document space r for matrix X;
Output: document collection D′ = d1, d2, ..., dn′;
1.  /*η is the number of selected documents for each cluster*/
2.  int η = r/|SC|;
3.  /*Ψ is the number of selected documents for a current cluster*/
4.  int Ψ = η;
5.  sort the clusters in SC in ascending order according to the number
    of documents in the clusters;
6.  document collection D′ = empty;
7.  for each cluster sci ∈ SC
8.    if |sci| > Ψ
9.      calculate the document importance DI for each document ∈ sci;
10.     add Ψ-1 documents with the highest DI to D′;
11.     merge the content of the left-over documents into
        a new document d′;
12.     add d′ to D′;
13.     Ψ = η;
14.   else
15.     add the documents ∈ sci to D′;
16.     /*the unselected portion of Ψ from sci is to be distributed across
17.     the remaining clusters that have not been processed yet*/
18.     int remNum = Ψ − |sci|;
19.     if (remNum > 0)
20.       int distrNum = remNum/the number of clusters left in SC;
21.       /*adjust η and Ψ*/
22.       η = η + distrNum;
23.       Ψ = η + (remNum − (distrNum × the number of clusters left in SC));
24.     end if
25.   end if
26. end for
27. /*document collection D′ = d1, d2, ..., dn′ where n′ <= r < n, where n
28. is the number of documents in the input document collection D*/
29. return D′;
DI(dj) = (Σ_{i=1}^{m′} w_{i,j}) / sqrt(Σ_{i=1}^{m′} w_{i,j}^2)    (3.9)
where m′ is the number of distinct terms extracted from dj and w_{i,j} is the Tf×IDf weighting
of term ti in document dj. Refer to Equation 3.5 for more detail of the Tf×IDf weighting.
The Tf is the ratio of the number of times term ti appears in document dj
to the total number of term occurrences in document dj; the IDf_i is obtained by dividing
the number of documents in sci by the number of documents containing term ti, and then
taking the logarithm of that quotient. A high weight results from a high term frequency
in a given document combined with a low document frequency of the term in cluster sci.
Documents with higher DI values are added to D′ (Line 10 of Algorithm 3); the content
of documents with lower DI values is merged into a new document d′ (Line 11). The
number of documents merged into d′ is equal to |sci| − (Ψ−1).
The output of the XDDR method is the document collection D′ = d1, d2, ..., dn′, where
n′ is less than the number of documents n in the clustering solution SC. D′ is
then converted into a feature-document matrix X_{m×n′}. SVD is then performed on this
matrix to obtain a kernel matrix Uk, which is used in the TVSim measure (Equation 3.7)
or the TPVSim measure (Equation 3.8). Refer to Section 2.2.2.1 of Chapter 2 for more
detail of the SVD method.
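The document-selection step of XDDR, together with the DI measure of Equation 3.9, can be sketched as follows. This is a simplified, hypothetical rendering of Algorithm 3: the even redistribution of the leftover quota across remaining clusters is elided, and the data layout (a dict of clusters holding (document id, Tf×IDf weights) pairs) is an assumption.

```python
import math

def doc_importance(weights):
    """DI (Equation 3.9): sum of a document's Tf-IDf weights divided by
    their Euclidean norm."""
    return sum(weights) / math.sqrt(sum(w * w for w in weights))

def xddr(clusters, r):
    """Reduce a structure-only clustering to about r representative documents.

    For each cluster processed in ascending size order, either keep all its
    documents (small cluster) or keep the top DI-ranked documents and merge
    the rest into one pseudo-document (large cluster)."""
    reduced = []
    eta = max(1, r // len(clusters))            # per-cluster quota
    for docs in sorted(clusters.values(), key=len):   # smaller clusters first
        if len(docs) <= eta:
            reduced.extend(d for d, _ in docs)
        else:
            ranked = sorted(docs, key=lambda d: doc_importance(d[1]), reverse=True)
            reduced.extend(d for d, _ in ranked[:eta - 1])
            # Merge the left-over documents into one pseudo-document.
            merged = "+".join(d for d, _ in ranked[eta - 1:])
            reduced.append(merged)
    return reduced

clusters = {0: [("d1", [1, 1]), ("d2", [3, 4]), ("d3", [0.1, 0.2])],
            1: [("d4", [1, 0])]}
print(xddr(clusters, r=4))  # ['d4', 'd1', 'd2+d3']
```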
3.4 The Hybrid Clustering
The previous sections describe the data model and data similarity for the clustering meth-
ods which have been developed in this research. This section describes the clustering
algorithm employed by the clustering methods. The clustering algorithm proposed in
this thesis is a hybrid clustering algorithm which consists of three stages: incremental
clustering, iteration, and pair-wise clustering. The overview of the algorithm is shown in
Figure 3.7.
[Figure: XML documents flow through data modelling and data similarity into the hybrid
clustering algorithm, whose three stages (incremental clustering, iteration, and pair-wise
clustering) each produce clusters C1, C2, ..., Ck.]
Figure 3.7: The hybrid XML clustering approach overview
It is a hybrid approach as it combines two types of clustering. The first is incremental
clustering, which is used at the beginning to group a large input XML collection into an
intermediate number of clusters. The second is the partitioning method [30], a type of
hierarchical clustering, which uses the pair-wise matrix
as input. The pair-wise matrix is obtained by calculating the data similarity between all
possible pairs of cluster representations generated from the incremental clustering. The
hybrid clustering algorithm also has an iteration stage after the incremental clustering,
which reassigns each document in the input collection to the cluster, among those generated
in the incremental clustering stage, with which it has the maximum data similarity value.
The iteration addresses the sensitivity of the incremental clustering to the input document
ordering. The proposed algorithm utilizes the two types of clustering in order to address
the drawbacks of each: the accuracy problem in the incremental clustering, and the
scalability problem in the pair-wise clustering. The
algorithm of the hybrid clustering is outlined in Algorithm 4. The three stages of the
hybrid clustering are described further in this section.
3.4.1 The Incremental Clustering Stage
The first stage of the proposed clustering algorithm is the incremental clustering (from
Line 2 to 17 of Algorithm 4). This clustering generates the number of clusters at run-time.
The incremental clustering begins with no cluster in the clustering solution C. Therefore,
the first document in the collection D makes a new cluster ci in C and becomes the cluster
representation ri of cluster ci. When there are clusters in the clustering solution C, the
documents in the collection D are compared with the clusters as follows. A document
is compared with the cluster representations using data similarity which is the similarity
measure employed by the proposed clustering methods (refer to Figure 3.1). If the best
data similarity value between a document and a cluster exceeds a user-defined clustering
Algorithm 4 The hybrid clustering algorithm
Input: Document collection D = d1, d2, ..., dn, user-defined number of clusters β,
       clustering threshold α;
Output: clustering solution C = c1, c2, ..., ck;
1.  /*Incremental Clustering*/
2.  for each document dj in dataset D
3.    if clustering solution C is empty
4.      create a new cluster ci in C;
5.      assign dj to ci;
6.      make dj the cluster representation ri of ci;
7.    else
8.      for each cluster ci in clustering solution C
9.        compute the data similarity between dj and ri;
10.     end for
11.     if the highest data similarity value exceeds or equals α
12.       assign dj to the cluster ci having the maximum data similarity value;
13.       change ri if applicable;
14.     else execute steps 4 to 6;
15.     end if
16.   end if
17. end for
18.
19. /*Iteration*/
20. for each document dj in document collection D
21.   for each cluster ci in clustering solution C
22.     compute the data similarity between dj and ri;
23.   end for
24.   assign dj to the cluster ci having the maximum data similarity value;
25. end for
26.
27. if |C| > β
28.   /*Pair-wise Clustering*/
29.   generate a pair-wise matrix by computing the data similarity
30.     between all pairs of cluster representations in C;
31.   perform partitioning clustering on the pair-wise matrix;
32.   reassign the documents to new clusters based on the clustering result of the
      partitioning clustering;
33.   return C = c1, c2, ..., cβ;
34. else return C = c1, c2, ..., ck where k <= β;
35. end if
threshold α, the document is assigned to that cluster; otherwise the document makes a new
cluster in C and becomes the cluster representation of the new cluster. The clustering
process continues until all the documents in the document collection D are grouped into
clusters. The clustering threshold α is defined by the user and is a value that determines
the degree of similarity that a document should have with a cluster in order to assign that
document to the cluster. The clustering threshold value is between 0 and 1, where 1 is the
highest, which indicates that the structure of a document is an exact match or a subset
of the cluster representation.
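The incremental stage described above (Lines 2 to 17 of Algorithm 4) can be sketched as follows. This is a minimal, hypothetical rendering: the data similarity is passed in as a function, the cluster representation is simply the first document of each cluster, and the "change ri if applicable" step is omitted.

```python
def incremental_cluster(docs, sim, alpha):
    """Assign each document to the most similar existing cluster if the best
    similarity reaches threshold alpha; otherwise start a new cluster whose
    representation is the document itself (first document representation)."""
    reps, clusters = [], []
    for d in docs:
        scores = [sim(d, r) for r in reps]
        if scores and max(scores) >= alpha:
            clusters[scores.index(max(scores))].append(d)
        else:
            reps.append(d)       # the new document becomes the representation
            clusters.append([d])
    return reps, clusters

# Toy 1-D "documents": similarity decays with distance.
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
reps, clusters = incremental_cluster([1, 2, 10, 11], sim, alpha=0.5)
print(clusters)  # [[1, 2], [10, 11]]
```

A higher alpha forces more, smaller clusters; a lower alpha merges more documents into existing clusters, which is the sensitivity the threshold controls.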
The incremental clustering requires a representation for the clusters in order to compare
the clustering objects with the new input objects. The selection of the cluster represen-
tation is important as it determines the accuracy of the clustering solution. There are
two types of cluster representations employed by the hybrid clustering: the common path
representation and the first document representation.
The common path representation. The common paths are the paths shared by all the
documents that exist in the cluster. The common path representation is mainly used by the
XCPath method as the method is based on the path model. The term “common” indicates
the degree of similarity between the paths that exceed a user-defined path threshold which
is described in Section 3.2.2.2. The initial cluster representation is based on the common
paths between the first two documents in a cluster. The cluster representation is expanded
by adding paths whose PSim (Equation 3.4) values exceed the path threshold, if those
paths do not already exist in the cluster representation. For a cluster with only one
document, the cluster representation is simply the set of paths of that document.
Since the common path representation is a path model similar to that of the XCPath
method, the data similarity between the set of paths representing a document and the
set of paths representing a cluster is computed using the CPSim measure (Equation 3.3).
For a cluster containing only one document, the paths of that document serve as the
cluster representation and Equation 3.3 is applied directly. If there is more than one
document in a cluster, the common paths of those documents are used as the cluster
representation, and Equation 3.3 is altered to calculate the similarity between a new
document and such a cluster representation. The new measure is defined below:
CPSim(dx, ry) = (Σ_{i=1}^{|dx|} max_{j=1..|ry|} PSim(p_i, p_j)) / |dx|    (3.10)
This measure differs from Equation 3.3 in that, instead of dividing the sum of the PSim
values by the maximum of the number of paths in the document and in the cluster
representation, it divides by the number of paths in the document. The reason is that if
the number of paths in the cluster representation ry is large and the document is a sub-tree
of the cluster representation, the CPSim value produced by Equation 3.3 will be low.
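Equation 3.10 can be sketched as follows. The real PSim is the path similarity of Equation 3.4; the prefix-overlap `psim` below is only a stand-in for illustration.

```python
def cp_sim(doc_paths, rep_paths, psim):
    """CPSim (Equation 3.10): average, over the document's paths, of the best
    PSim match against the cluster representation's paths. Dividing by the
    document's own path count keeps a sub-tree document from being penalized
    by a large cluster representation."""
    if not doc_paths:
        return 0.0
    return sum(max(psim(p, q) for q in rep_paths) for p in doc_paths) / len(doc_paths)

def psim(p, q):
    # Toy path similarity: fraction of leading path steps the two paths share.
    a, b = p.split("/"), q.split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))

print(cp_sim(["conf/title", "conf/year"],
             ["conf/title", "conf/editor/person"], psim))  # (1.0 + 0.5) / 2 = 0.75
```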
The first document representation. The first document representation uses the features
(structure and/or content) of the document that creates a new cluster in the clustering
solution C as the cluster representation, instead of using a common feature such as the
common path representation. The first document representation is used mainly because
extracting the common tree structure from all of the documents in a cluster is
more complicated than to extract the common paths based only on the path model. The
time taken to perform such a task can slow down the clustering process. The idea of
using the documents that make new clusters as the cluster representations is based on the
assumption that a document that makes a new cluster contains some feature that differs
from the existing cluster representations, and that this feature can be used to cluster
new documents with similar features. The first document representation
is employed by the XCTree, XCLComb, and XCTPath methods for the clustering of
XML documents. As the cluster representation is also a document, the data similarity
employed by the clustering methods can be used to calculate the degree of similarity
between a document and a cluster representation.
3.4.2 The Iteration Stage
After the incremental clustering, the clustering process executes an iteration stage, which
reassigns the documents according to the current cluster representations. The iteration
stage does not need the clustering threshold because each document is simply assigned
to the cluster with the highest data similarity value. This stage is important because the
clustering solution generated in the incremental clustering is sensitive to the input ordering.
The iteration stage allows the input documents that were clustered early in the incremental
clustering stage to be compared with the clusters that were generated later. Throughout
this stage, there is no alteration to the cluster representations.
3.4.3 The Pair-wise Clustering Stage
After the iteration, if the generated number of clusters exceeds the user-defined number
of clusters β, then pair-wise clustering is executed. Each pair of cluster representations
in the clustering solution C is compared to calculate the data similarity value, which
generates a pair-wise similarity matrix. This matrix is then used as the input to a
partitioning method [30] for grouping the cluster representations so that the number of
clusters generated equals β. The partitioning clustering method [30] first divides the
matrix into two groups, and then one of these two groups is chosen to be divided further.
The process is repeated until the number of divisions equals the number
of user-defined clusters.
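A divisive split of this kind can be sketched as follows. This is an illustrative sketch only; the actual partitioning method of [30] may choose groups and split them differently. Here each split seeds the two halves with the least-similar pair and assigns every member to the nearer seed.

```python
def bisecting_partition(sim, items, beta):
    """Repeatedly split the largest group in two until beta groups remain.
    sim[i][j] is the pairwise similarity between cluster representations."""
    groups = [list(items)]
    while len(groups) < beta:
        g = max(groups, key=len)
        groups.remove(g)
        # Seed the split with the least-similar pair inside the group.
        a, b = min(((i, j) for i in g for j in g if i < j),
                   key=lambda p: sim[p[0]][p[1]])
        g1 = [i for i in g if sim[i][a] >= sim[i][b]]
        g2 = [i for i in g if i not in g1]
        groups += [g1, g2]
    return groups

# Toy similarity matrix over four cluster representations.
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
print(bisecting_partition(sim, [0, 1, 2, 3], beta=2))  # [[0, 1], [2, 3]]
```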
3.5 Summary
In summary, this chapter has described the clustering methods that have been proposed in
this research. The clustering methods are divided into two types: structure-only clustering,
and content and structure-based clustering. There are two structure-only clustering methods
which are based on the structure of the XML documents. These two methods vary in
terms of data modelling and data similarity. In addition, two content and structure-
based clustering methods are also developed. The two methods are different in terms of
how the content and structure are utilized for the document similarity. The chapter also
introduces a hybrid clustering algorithm which is employed by the proposed clustering
methods. The hybrid clustering algorithm consists of three stages: incremental clustering,
iteration, and pair-wise clustering. The hybrid clustering algorithm uses two different types
of clustering algorithms to group the XML documents. The hybrid clustering algorithm
aims to improve the scalability of the pair-wise clustering by performing clustering on
cluster representations instead of on the whole of the input documents. Also, to improve
the accuracy of the incremental clustering, the hybrid clustering algorithm includes an
iteration stage in order to reduce the sensitivity of the document ordering. The next
chapter is the empirical evaluation of the proposed clustering methods.
Chapter 4

Empirical Evaluation of the Clustering Methods
The previous chapter introduced four clustering methods for the clustering of XML
documents. Two methods are based on the structure-only information, and the other two
methods are based on the content and structure of the XML documents. In this chapter the four
clustering methods are evaluated and analysed.
This chapter describes the XML data collections which have been used for evaluating the
proposed clustering methods, the pre-processing of the data collection, the evaluation met-
rics, and the experimental results. In conclusion, there is the discussion and comparison
of all the clustering methods.
4.1 Data Collection
The proposed clustering methods presented in Chapter 3 focus on the clustering of XML
documents. Table 4.1 shows the XML document collections which have been used for eval-
uating the proposed clustering methods in this thesis. They are a mixture of homogeneous
and heterogeneous collections. The homogeneous collections have XML documents con-
forming to the same structural definition. On the other hand, heterogeneous collections
consist of XML documents conforming to more than one structural definition. All of them
are real life collections. Refer to the Appendix for the detail of the schema definitions of
these collections:
Niagara: This collection is derived from the Niagara Institution for Information
Retrieval System1 testing data. It is a mixture of XML documents conforming
to different schema definitions, as shown in Table 4.2.
Publication: This collection is derived from the Heterogeneous Track in INEX
2005 [17]. This collection relates to the publication domain. The documents are
a subset of four different sources, each of which has a different schema definition:
Berkeley, Computer Science, HCI Bibliography, and DBPUB Bibliography.
DBLP: This collection is also derived from the Heterogeneous Track in INEX 2005 [17].
Even though the collection is derived from the Heterogeneous Track in INEX, in this
thesis the DBLP is treated as a homogeneous collection because the XML documents
in this collection conform to one schema definition. The structure and the content
features of the documents in this collection are small; therefore only a subset of the
1http://www.cs.wisc.edu/niagara/data.html
whole collection is used in the experiments to test the performance of the proposed
clustering methods.
IEEE: This collection comes from the INEX Document Mining Track in 2006 [17].
The IEEE corpus is composed of 12,000 scientific articles from IEEE journals from
year 2002 to 2005. In the 2006 INEX document mining track, a total of 6054
documents were used as a testing collection for the clustering task [17].
Table 4.1: Data collections for XML clustering

XML collection   # documents   # classes   Collection type
Niagara          5289          4           heterogeneous
Publication      460           22          heterogeneous
DBLP             4910          8           homogeneous
IEEE             6054          18          homogeneous
Table 4.2 describes the document classification of the XML document collections. The
table shows that the documents in the Niagara and DBLP are not evenly distributed
across the classes. Some classes only have one or two documents. In the IEEE collection,
on the other hand, the documents are evenly distributed across the classes.
4.2 Data Pre-Processing
This section looks at the pre-processing of the XML document collections shown in
Table 4.1. Two features of the XML documents need to be extracted: the structure and
the content. To extract the features of an XML document, SAX parsing technology is used,
as it is faster than DOM parsing. SAX parses the elements in an XML document one
by one, starting from the root element. The structure is extracted and modelled as described
in Chapter 3.
Table 4.2: The classification of the data collections for XML clustering
(classification: number of documents)

Niagara:
  Movie 37; Actor 37; Department 19; Course 2; Report 1; Automobile 208;
  Bibliography 16; Profile 11; Personal 12; Quote 15; Hospitality message 24;
  Travel 10; Order 10; Auction data 4; Appointment 2; Document page 3;
  Linux How-to documents 12; Bookstore 2; Shake 20; Club 12;
  Catalogue record 1; Medicine Citation 1; Nutrition 37

Publication:
  Berkeley 698; dbpub Bibliography 364; Computer Science 2878;
  HCI Bibliography 1349

DBLP:
  books 1076; conference 2065; journals 1634; miscellaneous 2; persons 11;
  phd 62; technical report 27; world wide web 33

IEEE:
  IEEE Annals of the History of Computing 156; IEEE Computer Graphics and
  Applications 345; Computer 963; IEEE Computational Science and
  Engineering 286; IEEE Design and Test of Computers 273; IEEE Expert 351;
  IEEE Internet Computing 266; IT Professional 133; IEEE Micro 284;
  IEEE MultiMedia 235; IEEE Parallel and Distributed Technology 192;
  IEEE Software 460; IEEE Transactions on Computers 516;
  IEEE Transactions on Parallel and Distributed Systems 396;
  IEEE Transactions on Visualization and Computer Graphics 120;
  IEEE Transactions on Knowledge and Data Engineering 291;
  IEEE Transactions on Pattern Analysis and Machine Intelligence 482;
  IEEE Transactions on Software Engineering 305
The content of the documents is pre-processed as follows:
1. The text of the element nodes and attributes nodes are extracted;
2. The text is tokenized by spaces;
3. The numbers and special symbols are removed;
4. Common words known as stop-words such as the, and, their, my, etc. are removed;
5. The terms are stemmed; and
6. The terms in which their length is lesser than three characters are removed.
The pre-processing of the content is important as it helps remove insignificant terms and
improve the clustering based on the term feature.
Word Removal. In pre-processing, the text in the content of element and attribute nodes
is extracted. Stop-words are removed from the term collection. Stop-words are words such
as articles, prepositions, etc. that are common and have no significant meaning. Also
words that have fewer than three characters are removed, since such short words are
considered insignificant. Special symbols such as !, @, #, %, etc. are also
discarded.
Stemming. Stemming refers to a process of reducing words to their ‘stems’, e.g. ‘banking’
to ‘bank’, ‘flooded’ to ‘flood’. Such reduction maps a series of related terms to a single
common concept.
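The six pre-processing steps can be sketched as follows. This is an illustrative sketch only: the stop-word list is abbreviated, and the crude suffix stripping stands in for a real stemmer (such as Porter's), which the thesis does not name.

```python
import re

STOP_WORDS = {"the", "and", "their", "my", "for", "are"}  # abbreviated stop list

def stem(term):
    # Crude suffix stripping standing in for a real stemming algorithm.
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    """Steps 2-6 of the content pre-processing: tokenize on spaces, drop
    numbers and symbols, drop stop-words and terms shorter than 3 characters,
    then stem."""
    terms = []
    for token in text.lower().split():
        token = re.sub(r"[^a-z]", "", token)   # remove numbers and symbols
        if len(token) < 3 or token in STOP_WORDS:
            continue
        terms.append(stem(token))
    return terms

print(preprocess("Mining the structure for XML document clustering!"))
# ['min', 'structure', 'xml', 'document', 'cluster']
```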
Table 4.3 shows the detail of the pre-processed XML document collections, where # stands
for ‘the number of’. The maximum level and minimum level in Table 4.3 refer to the
number of levels in the hierarchical structure of the XML documents in the data collections.
Based on the pre-processing data in Table 4.3, the structures of the Niagara and IEEE
collections are larger and more complex than those of the Publication and DBLP collections.
The DBLP collection has a relatively small structure with a maximum level of four. In
the experiments, the attribute nodes in the IEEE collection are not processed or used in
the clustering of the documents. Previous experiments [72] have shown that attributes
of element nodes are not important for the clustering of XML documents; moreover, the
inclusion of attributes can lower the accuracy of the clustering solution.
Based on the number of terms after the pre-processing of the content in the document col-
lections, on average each document in the Niagara collection contains around one thousand
eight hundred and eighty two terms; each document in the Publication collection contains
around one hundred terms; each document in the DBLP collection contains around twenty
three terms; and each document in the IEEE collection contains around two thousand nine
hundred and twenty two terms. The documents in the IEEE collection are larger in their
content than the other collections. Not only is the structure of the DBLP collection small
but its term collection is small as well.
Even though the XML document collections used for the evaluation of the proposed clus-
tering methods are not very large in terms of documents, nevertheless the collections have
the following characteristics which make them good candidates for the evaluation of the
proposed clustering methods: (1) the collections are homogeneous as well as heterogeneous;
(2) the distribution of the documents across classes varies; (3) the collections
are real-life documents in XML format; and (4) The documents in the collections vary in
structure and content.
Table 4.3: Details of the pre-processed data collections

                  Niagara   Publication   DBLP     IEEE
#internal nodes   100682    48908         9820     922246
#leaf nodes       383810    122835        41195    2823062
#attributes       6067      54805         11288    -
#complete paths   390870    181589        52498    3156564
maximum level     16        6             4        19
minimum level     2         3             2        2
#terms            865846    532913        116960   17692610
#distinct terms   35826     40588         22259    224099
4.3 Evaluation Metric
To evaluate the proposed clustering methods, a number of evaluation metrics are used in
this thesis. The evaluation metrics are for evaluating the quality of the clustering solutions.
Usually, clustering is an off-line process; therefore the accuracy of the clustering solution
is more important than the speed of the clustering methods.
There are a number of evaluation metrics available for document clustering. This thesis
uses three commonly used evaluation metrics to evaluate the quality of a clustering solu-
tion. They are Purity, Normalized Mutual Information (NMI), and F1-score [14]. Another
commonly used metric is the Entropy. The calculation of the NMI measure also considers
the Entropy metric, therefore the Entropy metric is not used directly in this research.
The evaluation metrics are used to calculate the external quality of the clustering solution
based on the comparison of cluster classes to known external classes. The values of the
evaluation metrics range from 0 to 1, where 1 is a perfect clustering solution.
4.3.1 Purity
Given a set of clusters Ω = {s1, s2, ..., sK} and a set of classes C = {c1, c2, ..., cR}, the
purity (P) [49] of a cluster sk is defined as:

P(sk) = max_r(n_k^r) / n_k    (4.1)
where n_k^r is the number of documents of class r in cluster k and n_k
is the number of documents in cluster k. The purity of the clustering solution Ω can
be calculated based on micro-averaging purity (micro purity) and macro-averaging purity
(macro purity). They are defined as:
micro purity(Ω) = (Σ_{k=1}^{K} P(sk) · n_k) / (Σ_{k=1}^{K} n_k)    (4.2)

macro purity(Ω) = (Σ_{k=1}^{K} P(sk)) / R    (4.3)
The micro purity of the clustering solution Ω is obtained as a weighted average of the
individual cluster purities, whereas the macro purity is an unweighted average over them.
The difference is that the micro purity is more concerned with how the documents are
grouped in the clustering solution than with the number of true classes that the clustering
solution has discovered. The problem with the purity metric is that, as the
number of clusters increases, the purity score will continue
to improve until it reaches a perfect score when the number of clusters equals the number
of documents. Therefore, under the purity metric, a clustering solution with more clusters
tends to score higher than one with fewer clusters.
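Equations 4.1 to 4.3 can be computed as follows. This is an illustrative sketch; the cluster and label layout (lists of document ids, a dict mapping ids to classes) is an assumption. Note that the macro purity here divides by the number of classes R, following Equation 4.3.

```python
from collections import Counter

def micro_macro_purity(clusters, labels):
    """Purity (Equations 4.1-4.3). `clusters` is a list of lists of document
    ids; `labels` maps a document id to its true class."""
    purities, weighted, total = [], 0.0, 0
    for docs in clusters:
        counts = Counter(labels[d] for d in docs)
        p = max(counts.values()) / len(docs)   # Equation 4.1
        purities.append(p)
        weighted += p * len(docs)
        total += len(docs)
    n_classes = len(set(labels.values()))      # R in Equation 4.3
    return weighted / total, sum(purities) / n_classes

labels = {"a": 0, "b": 0, "c": 1, "d": 1}
micro, macro = micro_macro_purity([["a", "b", "c"], ["d"]], labels)
print(micro, macro)  # 0.75 and about 0.833
```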
4.3.2 Normalized Mutual Information
The Normalized Mutual Information (NMI) is considered an improvement over the
purity metric. The NMI metric balances the quality of the clustering solution against the
number of clusters. It is defined as:

NMI(Ω, C) = I(Ω; C) / ([H(Ω) + H(C)] / 2)    (4.4)
where I is the mutual information, with the probabilities estimated by maximum likelihood;
it is defined as:

I(Ω; C) = Σ_k Σ_r Pro(sk ∩ cr) log [ Pro(sk ∩ cr) / (Pro(sk) Pro(cr)) ]    (4.5)

        = Σ_k Σ_r (|sk ∩ cr| / N) log [ N |sk ∩ cr| / (|sk| |cr|) ]    (4.6)

where Pro(sk), Pro(cr), and Pro(sk ∩ cr) are the probabilities of a document being in
cluster k, in class r, and in the intersection of k and r, respectively.
H is the entropy, defined as:

H(Ω) = −Σ_k Pro(sk) log Pro(sk)    (4.7)

     = −Σ_k (|sk| / N) log (|sk| / N)    (4.8)
Refer to Christopher et al. [14] for more details of this metric. The mutual information
metric used alone suffers the same problem as the purity metric; however, normalizing it
by the entropy fixes this problem, since the entropy tends to increase with the number of
clusters. For example, H(Ω) reaches its maximum log N for K = N, which ensures the NMI is low for
K = N, where N is the number of documents in a data collection.
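Equations 4.4 to 4.8 can be computed as follows, using the same illustrative cluster and label layout as before:

```python
import math
from collections import Counter

def nmi(clusters, labels):
    """NMI (Equation 4.4): mutual information between clusters and classes,
    normalized by the average of the two entropies."""
    n = sum(len(c) for c in clusters)
    class_counts = Counter(labels.values())
    mi = 0.0
    for docs in clusters:                       # Equations 4.5/4.6
        overlap = Counter(labels[d] for d in docs)
        for cls, cnt in overlap.items():
            mi += (cnt / n) * math.log(n * cnt / (len(docs) * class_counts[cls]))
    # Equations 4.7/4.8 for the cluster and class entropies.
    h_omega = -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
    h_c = -sum((v / n) * math.log(v / n) for v in class_counts.values())
    return mi / ((h_omega + h_c) / 2)

labels = {"a": 0, "b": 0, "c": 1, "d": 1}
print(nmi([["a", "b"], ["c", "d"]], labels))  # 1.0 for a perfect clustering
```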
4.3.3 F1-Score
Another evaluation metric is the F1-Score (F1) [17]. The difference between F1 and NMI is
that NMI measures how many documents in a class are discovered by a cluster, whereas
F1 measures both how many documents are correctly grouped together in a cluster
and how many documents are misclassified. The F1-Score of a clustering
solution is calculated using recall and precision, each of which can be calculated using
micro-averaging or macro-averaging.
Given a class cr and a cluster sk, the true positives (TPr) are defined as the number of
documents of class cr that appear in cluster sk; the false positives (FPr) are defined as the
number of documents not in class cr that appear in cluster sk; and the false negatives
(FNr) are defined as the number of documents of class cr that do not appear in cluster
sk. The precision and recall for the micro-averaging F1 (micro F1) are defined as:
precision = Σ_{r=1}^{R} TPr / Σ_{r=1}^{R} (TPr + FPr)    (4.9)

recall = Σ_{r=1}^{R} TPr / Σ_{r=1}^{R} (TPr + FNr)    (4.10)
The precision and recall for the macro-averaging F1 (macro F1) are defined as:

precision = (1/R) Σ_{r=1}^{R} [ TPr / (TPr + FPr) ]    (4.11)

recall = (1/R) Σ_{r=1}^{R} [ TPr / (TPr + FNr) ]    (4.12)
Precision measures how many of the documents grouped into a cluster actually belong
there, whereas recall measures how many of the documents that belong together are
captured by the cluster. Based on the recall and precision, F1 is defined as:

F1 = 2 (precision × recall) / (precision + recall)   (4.13)
To obtain the micro-averaging F1 value, the recall and precision in equations 4.9 and 4.10
are used in the F1. On the other hand, to obtain the macro-averaging F1 value, the recall
and precision in equations 4.11 and 4.12 are used. The micro-averaging F1 measures the
quality of the clusters in a clustering solution without considering the number of document
classes in the calculation of the recall and precision. The macro-averaging F1, on the other
hand, measures the quality of the overall clustering solution, taking the number of
document classes into consideration when calculating the recall and precision.
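Given per-class counts of true positives, false positives and false negatives, the micro- and macro-averaged F1 scores can be sketched as follows. This is a generic illustration of equations 4.9 to 4.13, not the implementation used in the experiments:

```python
def f1(p, r):
    """Eq. 4.13: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    """counts: list of (TP_r, FP_r, FN_r) tuples, one per class r.
    Micro-averaging (Eq. 4.9/4.10) pools counts over all classes;
    macro-averaging (Eq. 4.11/4.12) averages the per-class ratios."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1(tp / (tp + fp), tp / (tp + fn))

    R = len(counts)
    macro_p = sum(t / (t + f) if t + f else 0.0 for t, f, _ in counts) / R
    macro_r = sum(t / (t + f) if t + f else 0.0 for t, _, f in counts) / R
    return micro, f1(macro_p, macro_r)

micro, macro = micro_macro_f1([(5, 1, 2), (3, 2, 1)])
```

Note that micro-averaging weights every document equally, so large classes dominate, while macro-averaging weights every class equally, which matches the discussion above about taking the number of document classes into consideration.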
4.4 Benchmarks
The experiments for the proposed clustering methods are carried out to evaluate the
following:
The structure-only clustering methods
– The sensitivity of the methods to the clustering threshold and the path threshold.
– The scalability of the XCTree and XCPath methods in terms of processing time,
measured in seconds.
– The comparison of the clustering solutions of the XCTree and XCPath methods
through the three different stages of the hybrid clustering algorithm: incremental
clustering, incremental clustering plus iteration, and pair-wise clustering.
– The comparison of the XCTree and XCPath methods with the XCLS method [46]
and the Zhang & Shasha tree edit distance [89]. XCLS extends a transactional data
clustering algorithm to the clustering of XML documents. In contrast, the Zhang &
Shasha tree edit distance computes the cost of transforming one tree into another
tree, and uses this cost to compute the similarity between trees. The XCLS method
and the Zhang & Shasha tree edit distance are chosen for the methods comparison in
this section because XCLS is a fast method that outperforms many methods based
on incremental clustering, whereas the Zhang & Shasha tree edit distance belongs
to the family of tree edit distance methods based on the tree structure model.
The content and structure-based clustering methods
– The sensitivity of the reduced document dimension k for the kernel matrix Uk and
the reduced number of documents n′ for the feature-document matrix Xm×n′ .
– The sensitivity of the λ in the XCLComb method and the sensitivity of the path
length in the XCTPath method.
– The comparison of the XCLComb method, where the weight for the content
similarity value is one (content-only clustering solution), with the CLUTO repeated
bisection algorithm [30] using TF-IDF weighting and with the CLUTO repeated
bisection algorithm using BM25 weighting on the content-only information.
– The comparison between the XCLComb method and the XCTPath method, that
is, between the linear combination and the non-linear combination of the structure
and content of XML documents.
The comparison of all the clustering methods, which also includes the XCLS method [46],
the Zhang & Shasha tree edit distance method [89], and the CLUTO repeated bisection
method [30].
4.5 Results of Experiments
This section evaluates the performance of the clustering methods proposed in this thesis.
Firstly, the structure-only clustering methods are evaluated and analysed. Then the
results of the experiments on the content and structure-based clustering methods are
presented. Finally, the section ends with a comparative evaluation of both the
structure-only clustering methods and the content and structure-based clustering methods.
4.5.1 Analysing the Structure-only Clustering Methods
This section presents the results and evaluations of the two structure-only clustering meth-
ods: XCTree and XCPath. It evaluates the effect of the different clustering thresholds,
the scalability, the effect of the path thresholds on the XCPath method, the clustering
solution at the different stages of the hybrid clustering algorithm, and the comparison of
the methods.
4.5.1.1 Clustering Threshold
The hybrid clustering algorithm presented in Chapter 3 Section 3.3 is affected by a
clustering threshold used in the incremental clustering stage. Figure 4.1 shows the effect of
the different clustering thresholds on the XCTree and XCPath methods. The result of the
XCPath method is based on the path threshold of 0.7. The results in Figure 4.1 are based
on the clustering solutions after the iteration stage, because for some collections the number
of clusters generated by the XCTree and XCPath methods in the incremental clustering
stage is less than the user-defined number of clusters. It can be ascertained
from the results in Figure 4.1 that the clustering algorithm performs better with a high
clustering threshold. The reason is that a high clustering threshold maximizes the closeness
of the documents in the same cluster: it increases the similarity of the documents within a
cluster (intra-cluster similarity) and decreases the similarity of the documents in different
clusters (inter-cluster similarity). The results show that the XCTree method (Figure 4.1(a))
is affected by the clustering threshold more than the XCPath method (Figure 4.1(b)),
because the XCPath method is based on the common path representation, whereas the
XCTree method is based on the first document representation. The common path
representation consists of all the common structures of the documents held within a cluster;
therefore the ordering of the input documents in a collection does not affect the XCPath
method much. With the common path representation, the XCPath method can achieve a
perfect macro F1 value for the Niagara collection when the number of clusters is not refined
down to twenty-two clusters. This shows that documents in different classes of the Niagara
collection share distinct common structures.
[Plots: macro F1 vs. clustering threshold (0.1 to 0.9) for the Niagara, Publication, DBLP and IEEE collections. (a) XCTree (b) XCPath]
Figure 4.1: The effect of the clustering threshold on the XCTree and XCPath methods.
4.5.1.2 Scalability
With respect to scalability, the XCTree method is more efficient than the XCPath method:
it computes the structural similarity using the tree structure rather than the individual
paths in a tree, as shown in Figure 4.2. The time taken to process the IEEE collection
is longer than for the other collections as its structure is much larger. It is significantly
expensive to process the IEEE collection when the clustering threshold is around 0.7 and
above. From 0.7 upward, the number of clusters of the IEEE collection generated at the
incremental clustering stage grows from 11 to 65, and to 1,399 at the clustering threshold
of 0.9, as shown in Table 4.4 for the XCTree method. As the number of clusters increases,
so does the time taken to compute them. Since the time taken to run the XCPath method
on the IEEE collection from 0.7 upward is significantly long, no result is shown in
Figure 4.2 or Table 4.4. The DBLP collection takes less time to process than the Niagara
collection even though it has more documents. The reason is that the structure of the
documents in the DBLP collection is small; therefore, the number of clusters generated by
the clustering method is fewer when compared to the Niagara collection. The XCTree
method generates fewer clusters than the XCPath method on the same document
collections, as shown in Table 4.4. The XCPath method works with paths and therefore has
more features to consider in calculating the structural similarity between documents than
the XCTree method.
[Plots: processing time (sec) vs. clustering threshold (0.1 to 0.9) for the XCTree and XCPath methods. (a) Niagara (b) Publication (c) DBLP (d) IEEE]
Figure 4.2: The processing time of the structure-only clustering methods.
Table 4.4: The number of clusters generated at the incremental clustering stage with different clustering thresholds.

                       0.1    0.3    0.5    0.7    0.9
XCTree   Niagara        16     23     26     34     53
         Publication     3      3      5     12     56
         DBLP            1      1      1      4     15
         IEEE            1      4     11     65   1399
XCPath   Niagara        29     40     48     51     66
         Publication     8     12     18     44    198
         DBLP            8      8      8     14     14
         IEEE            1     13    104      -      -
4.5.1.3 Path Threshold
The XCTree method is affected only by the clustering threshold, whereas the XCPath
method is affected by the path threshold as well as the clustering threshold. The path
threshold is the lowest similarity value between two paths for these paths to be considered
similar.
The effect of the path threshold is shown in Figure 4.3, using the clustering threshold of 0.9.
The path threshold was not analysed on the IEEE collection because there is no result at
the clustering threshold of 0.9, as mentioned in Section 4.5.1.2; the XCPath method takes a
significantly long time to run on the IEEE collection. From the figures, the optimum path
threshold is around 0.7, since it is more flexible than a path threshold of 0.9. A path
threshold of 0.9 is equivalent to requiring that two paths be exactly the same for them to be
considered common paths. The path threshold at the clustering threshold of 0.9 performs
differently on the Niagara and Publication collections. The path threshold does not affect
the clustering solutions of the Niagara collection much because the Niagara collection
performs equally well with a low clustering threshold, as shown in Figure 4.1. On the other
hand, the clustering solutions of the Publication collection improve significantly after the
path threshold of 0.5, showing that the documents from different classes in the Publication
collection are much closer than the documents in the Niagara collection.

The DBLP collection does not change much with the different path thresholds at the
clustering threshold of 0.9, as shown in Figure 4.3. Therefore, the effect of the path
threshold on the DBLP collection is analysed further using different clustering thresholds,
as shown in Figure 4.4. The results show the effect of the different path thresholds at the
different clustering thresholds (CT) using the macro F1 and NMI metrics. With a clustering
threshold of 0.5 and over, the path threshold does not affect the clustering solution of the
DBLP collection at all. This happens because the structure of the documents in the DBLP
collection is small and the structure of the documents from different classes is closely
related, especially since the path matching begins at the leaf nodes (refer to the
[Plots: accuracy (micro purity, macro purity, micro F1, macro F1, NMI) vs. path threshold (0.1 to 0.9). (a) Niagara (b) Publication (c) DBLP]
Figure 4.3: The effect of the path thresholds with the clustering threshold of 0.9.
Appendix for the schema definition of the DBLP collection). Therefore, using a low path
threshold with a high clustering threshold still produces the same clustering solution as a
high path threshold. For instance, when the structure of the documents from different
classes is closely related, the lowest path similarity value between the document structures
from the classes may be around 0.8. In this scenario, the clustering solution with a path
threshold of 0.1 will be the same as the clustering solution with a path threshold of 0.8 at
the same clustering threshold.
[Plots for the DBLP collection: (a) macro F1 and (b) NMI vs. path threshold (0.1 to 0.9) at clustering thresholds CT-0.1 to CT-0.9.]
Figure 4.4: The effect of the path threshold with different clustering thresholds on the XCPath method.
4.5.1.4 Three Stages of the Hybrid Clustering Algorithm
Figures 4.5 and 4.6 show the results of the clustering solutions at the three different stages
of the hybrid clustering algorithm for the XCTree and XCPath methods, respectively.
There are three stages in the hybrid clustering algorithm: incremental clustering, iteration
and pair-wise clustering. A number of observations can be made from the results in
Figures 4.5 and 4.6. The first observation is that, for most evaluation metrics and most of
the data collections, with an unrestricted number of clusters generated at the incremental
clustering stage, the accuracy of the clustering solution is much better than when the
number of clusters is refined down to the user-defined number of clusters. The results in
Figures 4.5 and 4.6 show that the incremental clustering is able to discover the natural
groupings that exist in the data collection. However, forcing the clusters down to a
required number of clusters produces a less effective clustering solution than that of the
incremental clustering stage.
The second observation is that the clustering solutions produced using the incremental
clustering with the iteration stage do not improve much. This shows that, for these data
collections, the first document representation employed by the XCTree method and the
common path representation employed by the XCPath method are not affected much by
the document ordering in the data collections. Therefore, there is little difference between
the clustering solutions generated from the incremental clustering stage and those
generated from incremental clustering plus the iteration stage. Nevertheless, there is a
small improvement of the clustering solutions on the DBLP collection, shown in
Figure 4.5(c). This improvement shows that the iteration stage is useful for documents in
which the structures from different classes are highly related to one another, making the
incremental clustering sensitive to the ordering of the input documents. There is no
improvement on the DBLP collection using the XCPath method because, with the common
path representation, the iteration is not required: the representation is a global
representation of the common paths of the documents within a cluster. In terms of time
complexity, the incremental clustering alone is O(n log n), but with the iteration it becomes
O((n log n) × 2); therefore, the iteration stage may not be needed in the hybrid clustering
algorithm, especially for the common path representation.
The final observation concerns the NMI measure, which takes the entropy of the clustering
solution into account in its calculation. Except for the clustering solution on the IEEE
collection in the XCTree method and the Niagara collection in the XCPath method, the
NMI tends to be higher than the other evaluation metrics in the final results of the
structure-only clustering methods. The entropy term in the NMI measure tends to increase
as the number of clusters increases. The NMI value is low for the IEEE collection in the
final solution of the XCTree method because the clustering solutions generated in the
previous stages (incremental clustering and iteration) are poor, and the structural
relationships between the documents in the IEEE collection overlap so much between
clusters that the entropy value in the NMI metric for the final solution of the XCTree
method is still very high. The same applies to the Niagara collection: even with the perfect
solution generated at the incremental clustering and the iteration stages, the entropy in
the NMI measure is high when the number of clusters is refined down to the user-defined
number of clusters in the final clustering solution of the XCPath method.
[Plots: accuracy of the five evaluation metrics (micro purity, macro purity, micro F1, macro F1, NMI) at the three stages: incremental clustering; incremental clustering + iteration; XCTree. (a) Niagara (b) Publication (c) DBLP (d) IEEE]
Figure 4.5: The accuracy of the clustering solution at the three stages of the XCTree method at the clustering threshold of 0.9.
[Plots: accuracy of the five evaluation metrics at the three stages: incremental clustering; incremental clustering + iteration; XCPath. (a) Niagara (b) Publication (c) DBLP]
Figure 4.6: The accuracy of the clustering solution at the three stages of the XCPath method at the clustering threshold of 0.9 and the path threshold of 0.7.
4.5.1.5 Methods Comparison
The accuracy of the structure-only clustering methods proposed in this thesis is compared
to that of two other structure-only clustering methods, namely XCLS [46] and the Zhang &
Shasha tree edit distance [89]. Generating a pair-wise similarity matrix using the Zhang &
Shasha tree edit distance is time consuming; thus, in the experiments, the hybrid clustering
algorithm proposed in this thesis uses the Zhang & Shasha tree edit distance to generate
the clustering solution. The accuracy of the results is shown in Figure 4.7. The results of
the XCPath method are based on the path threshold of 0.7. The XCPath, XCTree and
Zhang & Shasha methods all use the clustering threshold of 0.9. From the results, the
Zhang & Shasha tree edit distance method performs the worst of the methods. The reason
might be that the Zhang & Shasha tree edit distance is not well suited to incremental
clustering. The XCTree method performs consistently better than the other methods on
most data collections.
[Plots: accuracy of the five evaluation metrics for XPATHClust, XTREEClust, XCLS and ZhangShasha_distance. (a) Niagara (b) Publication (c) DBLP (d) IEEE]
Figure 4.7: The comparison of different structure-only clustering methods.
In terms of scalability, the XCLS method is faster than the XCPath, XCTree, and Zhang &
Shasha tree edit distance methods. It takes less than one second to cluster collections such
as the Niagara and DBLP, and less than two minutes for collections such as the Publication
and IEEE, with any clustering threshold. Even with a high clustering threshold of 0.9, the
number of clusters generated by XCLS for the IEEE collection is fewer than the eighteen
classes, suggesting that using the summary representation for the clusters tends to draw
documents to the cluster with a large representation. Since XCLS uses the global structure
of the documents within a cluster as the cluster representation, the method tends to be
faster than the other clustering methods because it creates far fewer clusters, especially for
the IEEE collection.
The hybrid clustering algorithm takes longer to compute the Zhang & Shasha tree edit
distance measure for document clustering than the measures employed by the XCTree and
XCPath methods. The structure-only clustering methods proposed in this thesis and the
Zhang & Shasha tree edit distance are slower than the XCLS method; however, they can
exploit the structure of the XML documents in more detail and therefore tend to create
more clusters. The proposed structure-only clustering methods and the Zhang & Shasha
method are therefore more applicable to applications such as schema matching and XML
data integration, whereas XCLS is more useful in the information retrieval area where
speed is important.
4.5.2 Analysing the Content and Structure-based Clustering Methods
In addition to the structure-only clustering methods, this thesis also proposes two content
and structure-based clustering methods: the first is a linear combination of the structure
and content measures and the second is a non-linear combined method.
4.5.2.1 Kernel
The content in the XCLComb method and the text paths utilized by the XCTPath method
are calculated using the kernel described in Chapter 4. The kernel is sensitive to the
selection of the k value, which is the reduced dimensional document space for the kernel
constructed by the SVD method. Figure 4.8 shows the sensitivity of the different k values
against the reduced number of documents n′ for the feature-document matrix X. The
results in Figure 4.8 are generated by the XCLComb method with λ equal to 1
(content-only measure), using the clustering threshold of 0.9 for the DBLP and Publication
collections and the clustering threshold of 0.7 for the IEEE collection. Here, the analysis is
based on the NMI and macro F1 values. Based on the results in Figure 4.8, the optimal
reduced number of documents for matrix X is 1500 with k values between 600 and 800.
The Publication collection performs better with more documents selected and a high k
value. As for the IEEE collection, it performs best from 1000 documents upward with k
values between 600 and 800. As for the DBLP collection, its performance patterns are
more irregular than those of the other two collections. The best result is with 2500
documents and a k value of 800, or with 1500 documents and a k of 1000. For all k values,
the performance increases at the reduced number of 1000 documents, decreases as the
number of documents increases further, and then improves again at 2500 documents. The
k value of 200 gives the worst result for most reduced numbers of documents for matrix X
and for all collections. Figure 4.8 does not show the results for the Niagara collection
because the collection is too small to analyse; however, the best k value for the Niagara
collection is also around 800, with a reduced number of 300 documents. Based on the
results in Figure 4.8, for all the experiments presented in this section the XCLComb and
XCTPath methods use the kernels with a k value of 800 and a reduced number of
documents of 1500 for the IEEE and Publication collections, a k value of 800 and a reduced
number of documents of 2500 for the DBLP collection, and a k value of 800 and a reduced
number of 300 documents for the Niagara collection.
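As an illustration of the kind of construction involved, the following sketch builds a rank-k document kernel from a feature-document matrix via a truncated SVD, in the style of latent semantic analysis. The exact construction of the semantic kernel Uk in this thesis is defined in the earlier chapters and may differ; reduced_kernel is a hypothetical helper name:

```python
import numpy as np

def reduced_kernel(X, k):
    """Build a document-document kernel from a feature-document matrix X
    (m features x n' documents) using a rank-k truncated SVD.  A generic
    LSA-style sketch, not necessarily the thesis' exact kernel."""
    # X ~ U_k S_k V_k^T, keeping only the top-k singular triplets
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k].T * s[:k]   # documents projected into the k-dim latent space
    return Vk @ Vk.T        # kernel: pairwise inner products in latent space

rng = np.random.default_rng(0)
X = rng.random((50, 10))    # 50 features, 10 documents
K = reduced_kernel(X, k=5)
print(K.shape)              # (10, 10)
```

The kernel entry K[i, j] then plays the role of a content similarity between documents i and j that captures term associations beyond exact term overlap.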
[Plots: macro F1 and NMI vs. reduced number of documents (500 to 2500) for k = 200, 400, 600, 800, 1000. (a, b) Publication (c, d) DBLP (e, f) IEEE]
Figure 4.8: The sensitivity of the k value on the kernel.
4.5.2.2 Weighting in the XCLComb Method
The XCLComb method combines the structure measure and the content measure using the
weight λ, which is in the range of 0 to 1. A higher λ value indicates that more importance
is given to the content measure, and vice versa. When λ equals 0, the clustering is based
solely on the structure measure; when λ equals 1, it is based solely on the content measure.
Figure 4.9 shows the effect of λ on the data collections. For the Publication collection, λ
does not make much impact after 0.1. This means that even though the Publication
collection is a heterogeneous collection, the inclusion of the content also plays a role in
finding the true grouping of the documents; however, the collection does not require a high
weighting for the content similarity value to obtain a good clustering solution. For the
Niagara and DBLP collections, the accuracy of the clustering solutions declines when λ is
high, showing that the content similarity value is not as important as the structural
similarity value. This decline in accuracy shows that the structure of the documents in the
Niagara collection is more important than the content, which is expected because the
Niagara collection is a heterogeneous collection. On the other hand, even though the DBLP
collection is a homogeneous collection, the inclusion of the content degrades its
performance, showing that the structure is more distinguishable in this collection than the
content. For the IEEE collection the impact is the inverse: increasing λ improves the
clustering solution. This is expected because the IEEE collection is a homogeneous
collection where the content of the documents is more distinguishable than the structure.
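Assuming the linear weighting combination takes the common form (1 − λ) · structure + λ · content, which is consistent with λ = 1 being the content-only measure, the combination can be sketched as follows (the exact XCLComb formula is defined in the earlier chapters):

```python
def combined_similarity(struct_sim, content_sim, lam):
    """Linear weighting combination assumed to be of the form
    (1 - lam) * structure + lam * content, so lam = 0 uses only the
    structural similarity and lam = 1 only the content similarity."""
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * struct_sim + lam * content_sim

print(combined_similarity(0.8, 0.2, 0.0))  # structure only -> 0.8
print(combined_similarity(0.8, 0.2, 1.0))  # content only -> 0.2
```

Under this form, the per-collection behaviour above amounts to tuning λ toward 0 for collections whose structure is more discriminative (Niagara, DBLP) and toward 1 where the content is (IEEE).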
[Plots: accuracy of the five evaluation metrics vs. λ (0.1 to 0.9). (a) Niagara (b) Publication (c) DBLP (d) IEEE]
Figure 4.9: The effect of the λ on the XCLComb method.
4.5.2.3 Content-Only Comparison
Figure 4.10 shows the performance of the different content-only clustering methods. The
XCLComb method uses the clustering threshold of 0.9 and a λ of 1 (content-only measure).
The clustering solution is compared to the repeated bisections (rbr) method in CLUTO [30]
with different term weightings, TF-IDF and BM25. Based on the results in Figure 4.10, for
most data collections the results generated by the XCLComb method outperform the
clustering solutions generated by the CLUTO method using TF-IDF weighting; however,
the CLUTO method with BM25 weighting outperforms the XCLComb method on the
IEEE collection. The XCLComb method also uses TF-IDF weighting but, unlike the
CLUTO method, the term features of the documents are measured using a kernel, which
can better learn the associations between the term concepts.
The IEEE collection, on the other hand, performs better using BM25 weighting, indicating
that the documents in the IEEE collection vary greatly in document length with regard to
the number of terms. The difference between the BM25 and TF-IDF weightings is that
BM25 weighting has two tuning parameters, b and k1, to tune the impact of document
length and/or the term frequency. Therefore, the IEEE collection works better with the
BM25 weighting than the other collections do. For collections such as the Publication and
DBLP collections, all methods perform almost the same, showing that the term collections
of the documents from different classes have distinct term concepts. For the Niagara
collection, the clustering solution of CLUTO using BM25 weighting performs the worst,
which highlights that the term collections of the documents in different classes of the
Niagara collection vary in term concepts but the lengths of the documents with regard to
the number of terms do not vary much. Therefore, using the kernel works better for the
Niagara collection.
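For illustration, the two term-weighting schemes can be sketched as follows. This uses a common textbook form of BM25 with its tuning parameters k1 and b, which may differ in minor details from the exact variant used in the experiments:

```python
from math import log

def tf_idf(tf, df, n_docs):
    """Plain TF-IDF weight for a term with frequency tf in a document,
    appearing in df of n_docs documents."""
    return tf * log(n_docs / df)

def bm25(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 weight.  k1 tunes term-frequency saturation and b tunes the
    document-length normalization, which is why BM25 copes better with
    collections whose document lengths vary widely, such as IEEE."""
    idf = log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf * (k1 + 1) / (tf + k1 * (1.0 - b + b * doc_len / avg_len))
    return idf * norm
```

With b > 0, the same term frequency contributes less in a document much longer than average, whereas TF-IDF grows linearly with tf regardless of document length.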
4.5.2.4 Path Length in the XCTPath Method
Figure 4.11 shows the results of the XCTPath method with various path lengths, where
TPath 1 in Figure 4.11 means the length of the text path is 1, containing only the root
node and a term, TPath 2 means the length of the text path is 2, and so on. The results of
the XCTPath method are based on the clustering threshold of 0.9. The results on the
IEEE collection again highlight that the BM25 weighting is the best for the IEEE
[Plots: accuracy of the five evaluation metrics for XCLComb (λ=1) and CLUTO content-only with TF-IDF and BM25 weightings. (a) Niagara (b) Publication (c) DBLP (d) IEEE]
Figure 4.10: The comparison of the different content clustering methods.
collection. The result of the XCTPath method on the IEEE collection is lower than that of
CLUTO with the BM25 weighting, showing that the representation of the structure and
content as text paths produces many unrelated concepts which cannot be discovered using
a kernel. The clustering solutions generated using a kernel by the XCTPath method again
outperform the CLUTO method with TF-IDF weighting for most collections. Similar to
the results of the XCLComb method, the results for the Niagara and DBLP collections
improve when the path length increases, showing that including more structure in the text
paths improves the clustering solutions.
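To illustrate the text-path representation, the following sketch extracts (path, term) features in which the path keeps the first `length` element tags on the route from the root to a term, so that a length of 1 keeps only the root node, as described above. This is a hypothetical reconstruction; the thesis' exact tokenization and path construction may differ:

```python
import xml.etree.ElementTree as ET

def text_paths(xml_string, length):
    """Extract (path, term) features from an XML document, truncating
    each path to its first `length` element tags from the root.  A
    sketch of the text-path idea, not the thesis' implementation."""
    root = ET.fromstring(xml_string)
    feats = []

    def walk(node, tags):
        tags = tags + [node.tag]
        if node.text and node.text.strip():
            prefix = "/".join(tags[:length])       # truncated path
            for term in node.text.split():
                feats.append((prefix, term.lower()))
        for child in node:
            walk(child, tags)

    walk(root, [])
    return feats

doc = "<article><title>XML clustering</title><year>2010</year></article>"
print(text_paths(doc, 1))
# [('article', 'xml'), ('article', 'clustering'), ('article', '2010')]
print(text_paths(doc, 2))
# [('article/title', 'xml'), ('article/title', 'clustering'), ('article/year', '2010')]
```

Increasing the path length makes the features more structure-specific, which matches the observed improvement on the Niagara and DBLP collections.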
[Plots: NMI vs. text path length (TPath_1 to TPath_4) for XCTPath, Cluto_bm25 and Cluto_tfidf. (a) Niagara (b) Publication (c) DBLP (d) IEEE]
Figure 4.11: The comparison of the different path lengths of the XCTPath method.
4.5.2.5 Content and Structure-based Methods Comparison.
Figure 4.12 presents the results of the XCTPath and XCLComb methods alongside the Cluto method with the TFIDF and BM25 weightings. The results of the XCTPath and the Cluto methods are based on a text path length of 2 (TPath 2). The results are based on the inclusion of both the structure and the content measures. The clustering solutions generated by the XCLComb method using different λ settings outperform the other methods. The results in Figure 4.12 highlight that the relationships between the structure and content cannot be measured effectively in one data model for document clustering. The XCTPath method with a path length of 2 is the worst of all the methods in the Publication collection. For the Publication collection, combining the structure and content in text paths results in many unrelated feature concepts among which the kernel cannot discover any associations between the text paths.
[Figure: four bar charts, panels (a) Niagara, (b) Publication, (c) DBLP and (d) IEEE, plotting micro purity, macro purity, micro F1, macro F1 and NMI for XCTPath_TPath_2, XCLComb (λ = 0.3, 0.9, 0.1 and 0.9 respectively), CLUTO_tfidf_TPath_2 and CLUTO_BM25_TPath_2.]
Figure 4.12: The comparison of the clustering methods utilizing the semantic kernel.
4.6 Discussion
The previous section presented the experimental results of the structure-only clustering methods and of the content and structure-based clustering methods. Figures 4.13, 4.14, 4.15 and 4.16 display the best results of all clustering methods.
Based on the overall results, a number of findings have been obtained. The first finding is that, for the structure-only clustering methods, clustering based on the tree model and tree similarity by the XCTree method is better than the XCPath method, which is based on the path model and path similarity. However, the XCPath method is not sensitive to the clustering threshold, since it uses the common path representation, which is the global path structure representation of the documents in a cluster. The XCTree method uses the first document representation, which is sensitive to the input document ordering; the method therefore performs best with a high clustering threshold.
The second finding is that in homogeneous collections such as the IEEE collection, the grouping of the documents is based mainly on the content, since the documents conform to the schema definition and the classification of the documents is based on the content. The DBLP collection, on the other hand, is also homogeneous, but since the classification of its documents is based on structure, the structure-only clustering methods such as the XCTree, XCPath and XCLS methods outperform the content-based clustering solutions. For heterogeneous collections such as the Niagara and Publication collections, the structure and the content both play an important role. However, the structure is more distinguishable in heterogeneous collections than in homogeneous collections. The results of the Niagara collection in Figure 4.13 and the results of the Publication collection in Figure 4.14 highlight that the content-only clustering solutions cannot outperform the structure-only clustering solutions or the content and structure-based clustering solutions.
The third finding is that the clustering solutions generated by the XCLComb method outperform those generated by the XCTPath method, since the XCLComb method allows the user the flexibility to adjust the λ weighting of the content and structural similarity values; the content and structure of the document are calculated using two different data models. Even though the XCTPath method also allows users the flexibility to adjust the path length, using one data model to represent both the structure and content creates many unrelated concepts that cannot be discovered efficiently by the XCTPath method when a large path length is used.
The clustering methods proposed in this thesis, and the experiments conducted in this chapter, were designed to investigate the first hypothesis of this thesis: that utilizing both the structure and the content of XML documents can produce a better clustering solution than either the content-only or the structure-only clustering solutions. The results of the experiments in this chapter verify this hypothesis. The results generated by the XCLComb method outperform the clustering solutions generated by the other clustering methods, except for the IEEE collection, in which there is no association between the structure and the content at all.
In terms of scalability, the construction of the semantic kernel is expensive in both time and memory consumption. However, the kernel is useful for clustering algorithms such as incremental clustering, in which the input documents are compared to the cluster representations.
4.7 Summary
To summarize, this chapter has evaluated the clustering methods proposed in the previous chapter with two types of XML data collections: homogeneous and heterogeneous
[Figure: bar chart plotting micro purity, macro purity, micro F1, macro F1 and NMI for XCTPath_TPath_2, XCLComb_λ=0.3, XCLComb_λ=1, CLUTO_tfidf_TPath_2, CLUTO_BM25_TPath_2, CLUTO_tfidf_Content_Only, CLUTO_BM25_Content_Only, XCPath, XCTree and XCLS.]
Figure 4.13: The comparison of all methods on the Niagara collection.
[Figure: bar chart plotting micro purity, macro purity, micro F1, macro F1 and NMI for XCTPath_TPath_2, XCLComb_λ=0.9, XCLComb_λ=1, CLUTO_tfidf_TPath_2, CLUTO_BM25_TPath_2, CLUTO_tfidf_Content_Only, CLUTO_BM25_Content_Only, XCPath, XCTree and XCLS.]
Figure 4.14: The comparison of all methods on the Publication collection.
collections. The evaluation of the clustering methods is based on a number of evaluation metrics. The clustering methods are evaluated in terms of accuracy, as well as the scalability of the hybrid clustering algorithm. This chapter has also analysed the following parameters for the proposed clustering methods described in Chapter 3: (1) the sensitivity of the
[Figure: bar chart plotting micro purity, macro purity, micro F1, macro F1 and NMI for XCTPath_TPath_2, XCLComb_λ=0.1, XCLComb_λ=1, CLUTO_tfidf_TPath_2, CLUTO_BM25_TPath_2, CLUTO_tfidf_Content_Only, CLUTO_BM25_Content_Only, XCPath, XCTree and XCLS.]
Figure 4.15: The comparison of all methods on the DBLP collection.
[Figure: bar chart plotting micro purity, macro purity, micro F1, macro F1 and NMI for XCTPath_TPath_2, XCLComb_λ=0.9, XCLComb_λ=1, CLUTO_tfidf_TPath_2, CLUTO_BM25_TPath_2, CLUTO_tfidf_Content_Only, CLUTO_BM25_Content_Only, XCPath, XCTree and XCLS.]
Figure 4.16: The comparison of all methods on the IEEE collection.
clustering threshold to the hybrid clustering algorithm; (2) the scalability of the hybrid clustering algorithm; (3) the sensitivity of the path threshold of the XCPath method; (4) the sensitivity of the k value on the kernel; and (5) the effect of λ in the XCLComb method. The chapter ends with a discussion and analysis of the experimental results from all clustering
methods.
Chapter 5
XML Transformation Approach
The previous chapter has evaluated and analysed the results of the clustering methods proposed in this thesis. The clustering solutions generated by the proposed clustering methods can be utilized in applications such as XML integration and information retrieval. To further the research in this thesis, one of the clustering methods, namely the XCTree method, is modified and utilized as a pre-processing stage in an XML transformation approach, which is discussed in this chapter. The aim of this chapter is to investigate the second key hypothesis of this thesis: that XML clustering based on the structural information of XML documents can improve the transformation process in terms of time and accuracy for the conversion of more than two source documents into the same target document.
In this chapter, an XML transformation approach is proposed to convert data from a collection of XML documents to another structure format. Unlike other transformation approaches, the proposed approach does not use schema definitions; instead, it uses the summary tree structures of the input source XML documents for generating XSLT scripts. Dealing with large input XML documents can be complex; therefore, the proposed approach first applies the XCTree method to the input source XML documents before entering the actual transformation process. The input source XML documents are grouped into a number of clusters, where each cluster has a global structure summary of the documents within it. The global structure summary acts as a source document which is then used as an input to the transformation process. The proposed XML transformation approach creates an XSLT script for every cluster in the clustering solution. The source documents can use the XSLT script associated with their cluster for the conversion.
This chapter begins with an overview of the stages in the XML transformation approach. The XML transformation approach has four stages: pre-processing, element matching, transformation operator, and XSLT script generator. After explaining the four stages of the XML transformation approach, this chapter then evaluates and analyses the approach.
5.1 The XML Transformation Approach: Overview
The focus of the proposed XML transformation approach is to transform large XML documents into the same target document more efficiently. To simplify the structure integration process of the documents in this research, the transformation problem is addressed using the structure of XML documents from homogeneous collections. Homogeneous collections have documents sharing the same or a similar structure definition. The rationale is that if the transformation approach works for homogeneous collections, then it would also work for heterogeneous collections, since it is easier there to distinguish the documents with similar structures.
The proposed approach is an XML clustering-based transformation (XCTrans). Figure 5.1 illustrates the stages in the XCTrans approach. The input to XCTrans is a collection of XML documents known as source documents. In this thesis, homogeneous collections are used as input collections. Before performing the transformation process, a clustering algorithm is first applied to the input source documents to group them based on their common structures. The XCTrans approach modifies the XCTree method for the grouping of the input source documents. The XCTree method is used rather than the XCPath method because the tree model utilized by the XCTree method preserves the sibling relationships between nodes. Furthermore, the tree model can be easily broken down into paths, which are commonly used in the schema matching stage for finding corresponding nodes between a source and a target.
After the clustering of the source documents, each cluster is represented by a global summary structure of the documents within the cluster. The global summary structures of the clusters in the clustering solution are then used as input to the schema matching stage. The schema matching stage, or element matching stage, is then executed between the global summary structures and a target structure document. The element mapping results generated from the schema matching process are then used in the XSLT script generator stage to create XSLT scripts. Each cluster will have an associated XSLT script for converting the content of the documents in the cluster to the target structure. The generation of the XSLT script in this thesis is inspired by the work of Boukottaya et al. [8].
[Figure: flow diagram — the source XML document collection passes through the clustering process to form clusters C1, C2, …, Cn; schema matching between each cluster's summary and the target document produces the element mapping result, which drives the transformation operators and XSLT scripts; the transformation processor then produces the target XML documents.]
Figure 5.1: The XCTrans approach.
5.2 Pre-processing
The first stage of the XCTrans approach is the pre-processing. This stage involves the
clustering of the source documents into a number of clusters. The XCTree method is
utilized by the XCTrans for the clustering of the source documents. The summary tree
structure in the XCTree method is extended with addition of the quantifiers for the nodes.
For example, take the structure tree c in Figure 5.2, the depth-first string format of c
will be company(1,1) address(1,1) cname(1,1) -1 personnel(1,1) person(1,2) name(1,1)
first(1,1) last(1,1) -1 -1 -1 -1, where each node is associated with two numbers between
the brackets. The first number indicates the minimum occurrence that the node appears
under its parent node, and the second number indicates the maximum occurrence that
the node appears under its parent node.
The cluster representation of the XCTree method is also modified for it to be utilized in the XCTrans approach. Instead of using the first document representation, it uses a summary tree structure called the global summary tree structure, which consists of all the structures of the documents in a cluster. An example of the global summary tree structure is shown in Figure 5.3, which is extracted from the collection of documents in Figure 5.2. Each element in the global summary structure is associated with two numbers. These numbers are used to identify the quantifier, or cardinality operator, of the element node. For example, in a DTD schema definition, the + quantifier indicates that an element can appear in its parent content model one or more times, * indicates zero or more times, and ? indicates zero or one time. The first number appearing with an element in the summary structure indicates the number of times the element appears in the documents of the cluster. This number helps to identify the existence of quantifiers in the elements. To identify the minimum occurrence of an element node, the first number is divided by the number of documents in the cluster. If the result is less than 1 then the element is optional; otherwise it must exist at least once. The second number indicates the maximum number of times the element appears under its parent in any of the documents in the cluster. If this number is greater than 1 then the element can occur multiple times. For example, the person(3,2) node in Figure 5.3 has the two numbers 3 and 2 associated with it. These numbers indicate that the person node has the + quantifier because (1) the division of the first number by the number of documents in the cluster yields a value equal to 1, and (2) the second number is larger than 1.
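The quantifier inference just described can be sketched as follows; the function name is an assumption, and the mapping to the DTD quantifiers follows the rules above (minimum occurrence from count/n_docs, repetition from the maximum occurrence):

```python
# Sketch of the quantifier inference for a node annotated (count, max_occ) in
# the global summary tree of a cluster containing n_docs documents:
#   count / n_docs < 1  -> the node is optional (minimum occurrence 0)
#   max_occ > 1         -> the node may repeat under its parent
# DTD quantifiers: none (exactly once), ? (zero or one), * (zero or more),
# + (one or more).

def infer_quantifier(count, max_occ, n_docs):
    optional = count / n_docs < 1
    repeats = max_occ > 1
    if optional and repeats:
        return "*"
    if optional:
        return "?"
    if repeats:
        return "+"
    return "none"
```

For the person(3,2) node in a three-document cluster this yields +, and for email(1,1) it yields ?, matching the discussion above.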
Definition 3. A cluster representation of a cluster is a global summary tree structure. A global summary tree structure is the integrated tree structure of the summary tree structures which currently exist in the cluster. The integration of the summary tree structures is simply the union of the nodes at the same level. If there exist a node ni and a node nj on the same level with ni = nj, that is, their node labels and node types are the same, only one node is presented in the global summary tree structure.
For a document to be assigned to a cluster, the data similarity (Equation 3.1) between the document and the cluster has to exceed a clustering threshold β, and the union of the document structure and the cluster representation must not exceed an integration threshold δ. The integration threshold δ is a value to control and determine whether the structures of two summary trees should be integrated or not. The integration threshold is calculated by considering a number η, defined by the user, between 1 and 2. The δ is defined as the total number of nodes of a summary tree structure (of a document) and a global summary tree structure (a cluster representation) divided by η. For instance, let η be 1.3, the number of nodes in a summary tree structure be 10 and the number of nodes in a cluster representation be 12. If the number of nodes in the integrated structure of the two structures is 15, the two structures can be integrated, since 15 does not exceed the integration threshold δ of about 17 nodes ((10 + 12)/1.3 ≈ 17).
If the maximum data similarity value between a document and a cluster exceeds the clustering threshold and the union structure of the document structure and the cluster representation does not exceed the integration threshold, the document is assigned to the cluster and the union structure becomes the new cluster representation of the cluster. On the other hand, if the maximum data similarity value between the document and a cluster does not exceed the clustering threshold, or the union structure exceeds the integration threshold, the document is assigned to a new cluster. Each time a document creates a new cluster in a clustering solution, the structure of that document becomes the cluster representation for comparing and grouping new input documents.
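The assignment rule can be sketched minimally as follows, assuming a precomputed data similarity value (the thesis's Equation 3.1 is not reimplemented here) and node counts for the document, cluster and union structures; all names are illustrative:

```python
# Sketch of the cluster-assignment decision: a document joins a cluster only if
# its data similarity exceeds the clustering threshold beta AND the size of the
# union (integrated) structure stays within the integration threshold
# delta = (|document tree| + |cluster tree|) / eta, with user-defined eta in [1, 2].

def integration_threshold(doc_nodes, cluster_nodes, eta):
    return (doc_nodes + cluster_nodes) / eta

def should_assign(similarity, beta, union_nodes, doc_nodes, cluster_nodes, eta):
    if similarity <= beta:
        return False                # not similar enough: start a new cluster
    delta = integration_threshold(doc_nodes, cluster_nodes, eta)
    return union_nodes <= delta     # union must not exceed the threshold
```

With the worked example above (η = 1.3, trees of 10 and 12 nodes, union of 15 nodes), δ ≈ 16.9, so the structures are integrated.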
[Figure: three document structure trees, each rooted at company with children address, cname and personnel; personnel contains one or two person nodes, each with a name node containing first and last, and in one tree the person node also has an email child.]
Figure 5.2: An example of source document structures in the same cluster.
[Figure: global summary tree — company(3,1) with children address(3,1), cname(3,1) and personnel(3,1); personnel contains person(3,2) with children name(3,1) and email(1,1); name contains first(3,1) and last(3,1).]
Figure 5.3: An example of a source summary structure format.
5.3 Element Matching
Before executing the element matching stage (or schema matching), the structures of the cluster representations and the target structure document need to be processed. Let Figure 5.4 be an example of a target structure and Figure 5.3 be the global summary structure of a cluster. The input target structure document can be in DTD or XML Schema definition format. These schema definitions are converted and also represented in
[Figure: target structure tree — company with children address, cname and personnel; address contains street, city, postal and state; personnel contains person (with a + quantifier), whose children include name and an optional (?) element.]
Figure 5.4: An example of a target structure definition represented in a tree format.
Table 5.1: Quantifier mapping between XSD and DTD

Quantifier Operator   minOccurs   maxOccurs    No. of Child Element(s)
none                  1           1            once and only once
?                     0           1            zero or one
*                     0           unbounded    zero or more
+                     1           unbounded    one or more
a tree structure like the one shown in Figure 5.4. The mapping of the different quantifiers in DTD and XML Schema definitions is shown in Table 5.1. If there is no quantifier indicated for a particular element, then that element exists once and only once under its parent content model.
For the element matching process, the tree structures of a target document and a cluster representation are broken down into collections of paths. Each path in the path collection contains the elements from the root to the element containing leaf nodes. It also contains the set of leaf nodes belonging to the path, as in the example below:
Source:
p1: company/cname, address
p2: company/personnel/person/name/first, last
p3: company/personnel/person/email

Target:
p1: company/address/street, city, state, postal
p2: company/cname
p3: company/personnel/person/name, email
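The decomposition into such path collections — one entry per element that directly contains leaf nodes, holding the path from the root plus its leaf children — can be sketched as follows; the nested-dictionary tree encoding and function name are illustrative assumptions:

```python
# Sketch of breaking a structure tree into (path, leaf set) entries.
# A tree is encoded as nested dicts; a leaf is a label mapping to an empty dict.

def extract_paths(tree, prefix=()):
    """Return [(root-to-element path, sorted leaf children), ...]."""
    paths = []
    for label, subtree in tree.items():
        path = prefix + (label,)
        leaves = sorted(k for k, v in subtree.items() if not v)
        if leaves:
            paths.append(("/".join(path), leaves))
        for k, v in subtree.items():
            if v:  # recurse only into non-leaf children
                paths.extend(extract_paths({k: v}, path))
    return paths
```

Run on the source structure of Figure 5.3 (without quantifiers), this produces the three source entries listed above.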
After the paths are extracted, they are used in the element matching stage. The element matching stage finds the corresponding elements between the global summary structure of a cluster and the target document structure. It is divided into two stages: (1) discovery of corresponding leaf elements and (2) discovery of all corresponding elements between a source and a target structure.
5.3.1 Discovery of Corresponding Leaf Elements
Before finding corresponding leaf elements between the input source and target structures, path similarity is first calculated between the extracted path collections. Let px and py be the two sets of nodes that exist in the paths. The path similarity measure, defined in Equation 5.1, between px and py is twice the intersection of the two sets of nodes divided by the total number of nodes in px and py. For example, the pathSim between p1 (company/cname, address) in the source structure and p2 (company/cname) in the target structure is (2 × 2)/(3 + 2), which equals 0.8.
pathSim(px, py) = (2 × |px ∩ py|) / (|px| + |py|)    (5.1)
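Equation 5.1 is a Dice coefficient over the two node sets and can be sketched directly; representing each path as a list of node labels is an illustrative choice:

```python
# Path similarity (Equation 5.1): twice the number of shared node labels over
# the total number of nodes in the two paths.

def path_sim(px, py):
    shared = len(set(px) & set(py))
    return 2 * shared / (len(px) + len(py))
```

For the worked example, path_sim(["company", "cname", "address"], ["company", "cname"]) gives 0.8.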
The leaf elements of the target are compared with the leaf elements of the source if their path similarity value exceeds a path threshold φ, defined by the user in the range 0 to 1, where 1 means an exact match. The leaf similarity measure is defined as follows:
leafSim(ei, ej) = γ labelSim(ei, ej) + µ ancestorSim(ei, ej) + ω leafSiblingSim(ei, ej)    (5.2)
where the weightings γ, µ and ω are defined by the users to adjust the importance of the similarity measures. To compute the leaf similarity measure, the following similarities are used:
labelSim(ei, ej), where element ei ∈ px and element ej ∈ py: This measures the name similarity of the leaf elements ei and ej. We use the n-gram method as defined in Equation 5.3, where A is the number of unique n-grams in the first element name, B is the number of unique n-grams in the second, and C is the number of unique n-grams common to the two names. For example, let the two element names be company and company1. If we apply an exact string matching method then the labelSim will be 0. Using the n-gram method, the 2-grams (di-grams) for company are co, om, mp, pa, an, ny. The labelSim between ei and ej will be (2 × 6)/(6 + 7) ≈ 0.92.
labelSim = 2C / (A + B)    (5.3)
ancestorSim(ei, ej): There are two different similarity measures in ancestorSim. One counts the common ancestors of ei and ej, without considering the hierarchical order of the ancestors, divided by the maximum number of ancestors of ei and ej; this is denoted nonLevelSim. The other counts the common ancestors of ei and ej occurring at the same hierarchical level, divided by the maximum number of ancestors of ei and ej; this is denoted levelSim. The average of these two measures becomes the ancestorSim value. To find the common ancestors of two leaf elements, an intersection operator is applied to the ancestor sets. For example, let the ancestors of ei and ej be company/personnel/person/name and company/personnel/name/person respectively; the nonLevelSim will be 1 (4/4) and the levelSim will be 0.5 (2/4). The ancestorSim of these two elements is the average of the nonLevelSim and levelSim similarities, which is equal to 0.75.
leafSiblingSim(ei, ej): This similarity measure counts the number of sibling elements that ei and ej have in common. The number of common siblings, which is the intersection of the sibling sets of ei and ej, is multiplied by 2, and the similarity is normalized by the total number of sibling elements of ei and ej. If the two leaf elements have no leaf siblings then the leafSiblingSim is 1. For example, let the siblings of ei and ej be name, email and email respectively; the leafSiblingSim of ei and ej is 0.67 ((1 × 2)/(2 + 1)).
All the above similarity measures range from 0 to 1, where 1 is an exact match. The higher the similarity values, the higher the chance that the elements ei and ej are a corresponding pair.
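The three component measures and their weighted combination (Equation 5.2) can be sketched as follows; the equal default weights are an illustrative choice, since γ, µ and ω are user-defined, and all names are assumptions made here:

```python
# Sketches of labelSim (Equation 5.3), ancestorSim, leafSiblingSim and their
# weighted combination leafSim (Equation 5.2).

def ngrams(name, n=2):
    return {name[i:i + n] for i in range(len(name) - n + 1)}

def label_sim(a, b, n=2):
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def ancestor_sim(anc_i, anc_j):
    longest = max(len(anc_i), len(anc_j))
    non_level = len(set(anc_i) & set(anc_j)) / longest            # order-free overlap
    level = sum(a == b for a, b in zip(anc_i, anc_j)) / longest   # same-level overlap
    return (non_level + level) / 2

def leaf_sibling_sim(sib_i, sib_j):
    if not sib_i and not sib_j:
        return 1.0                                # no leaf siblings on either side
    common = len(set(sib_i) & set(sib_j))
    return 2 * common / (len(sib_i) + len(sib_j))

def leaf_sim(label_pair, anc_pair, sib_pair, gamma=1/3, mu=1/3, omega=1/3):
    return (gamma * label_sim(*label_pair)
            + mu * ancestor_sim(*anc_pair)
            + omega * leaf_sibling_sim(*sib_pair))
```

The worked examples above are reproduced: company vs. company1 gives 12/13 ≈ 0.92, the reordered ancestor paths give 0.75, and the sibling sets {name, email} vs. {email} give 2/3.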
5.3.2 Discovery of All Corresponding Elements
After the leafSim is calculated for all leaf elements from the target structure against the source structure, the matching result of the leaf elements can be confirmed by the user; in the absence of any user approval, the best pair is selected for each leaf element in the target document. The next step of the element matching process is to discover all corresponding elements between the target structure and the source structure. This is done automatically. For example, Table 5.2 shows the result of the corresponding leaf elements. This result is then used in this stage to find all corresponding elements, which can then be used for the generation of the XSLT script. Figure 5.5 shows the algorithm for finding all corresponding elements. Let elemMapSet store the mapping pairs. For each mapping pair in Table 5.2, the following steps are applied:
For each element in a path px of the target document, starting from the root element:
– If there exists an element on the same level in the path py of the source structure and the two are of the same element type (i.e. complex or leaf element), generate a mapping pair between the two. The element of the target document is stored with the source element in elemMapSet if it is not already there. The source element is stored as a relative path, that is, the path containing the elements from the root to the mapping element.
– If the element of the target document is not a leaf element and no match can be found in the source document on the same level because |px| > |py|, then it is stored in elemMapSet with a Null value.
– If the element in px is a leaf element, then the element is mapped to the leaf element in the source and the mapping result is stored in elemMapSet.
For example, consider the mapping path pair company/address/street and company/address of the target and the source structure respectively. The corresponding elements generated from the mapping pair are: company -> company, address -> Null, and street -> company/address. Even though both paths contain the address element on the same level, one is a leaf element and the other is a complex element.
Table 5.2: The leaf element mapping result

Target Doc                        Source Doc
company/address/street            company/address
company/address/city              company/address
company/address/state             company/address
company/address/postal            company/address
company/cname                     company/cname
company/personnel/person/name     company/personnel/person/name/first
company/personnel/person/name     company/personnel/person/name/last
company/personnel/person/email    company/personnel/person/email
5.4 Transformation Operator
Before generating the XSLT script, a transformation operator should be identified for each mapping pair discovered in the element mapping process. Three transformation operators are considered important in the proposed XML transformation approach: the connect, join and split operators.
Figure 5.5: Element mapping algorithm

Input: Set completePathMapSet // contains complete path mapping pairs
Output: Set elemMapSet // contains element mapping pairs

1.  Set elemMapSet = null;
2.  String map = null;
3.  for each mapping ∈ completePathMapSet
4.    Set T = getTargetPathElem();
5.    Set S = getSourcePathElem();
6.    for (j = 1; j <= T.length; j++)
7.      if (j < S.length || (j == T.length && T.length == S.length))
8.        map = T[j] -> S[1...j];
9.        if (!elemMapSet.contains(map))
10.         elemMapSet.add(map);
11.       end if;
12.     else if (j >= S.length)
13.       map = T[T.length] -> S[1...S.length];
14.       if (!elemMapSet.contains(map))
15.         elemMapSet.add(map);
16.       end if;
17.       for k = j to T.length − 1
18.         map = T[k] -> Null;
19.         if (!elemMapSet.contains(map))
20.           elemMapSet.add(map);
21.         end if;
22.       end for;
23.       break;
24.     else if (j == T.length)
25.       map = T[T.length] -> S[j...S.length];
26.       if (!elemMapSet.contains(map))
27.         elemMapSet.add(map);
28.       end if;
29.       break;
30.     end if;
31.   end for;
32. end for;
connect: t = connect(s). This copies the content from a source element s to a target element t with no modification to the structure of the source document. This operator is used for one-to-one mapping results where no modification is necessary.
join: t = join(s). This joins the content of two or more elements in the source document structure into one element in the target document structure. This operator is required for one-to-many mapping relationships.
split: t = split(s). This splits the content of an element in the source document structure into two or more elements in the target. This operator is applied to many-to-one relationships.
Table 5.3 shows the corresponding elements between the target and the source structure. Each corresponding element pair is assigned a transformation operator. The assignment of the transformation operator is done automatically. A connect operator is assigned to an element of the target document when its mapping pair occurs only once in elemMapSet. If an element of the target document occurs more than once in elemMapSet with the same source path, then a split operator is assigned to the mapping pair of the element. A join operator is assigned to an element of the target document when the element occurs several times in elemMapSet with different source paths. The elements of the target document which do not match any of the elements in the source structure are assigned a null operator.
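A sketch of the automatic operator assignment over (target element, source path) pairs follows. Note one interpretive assumption: the split rule is read here as "one source path feeding several target elements", which is how Table 5.3 assigns split to street, city, state and postal; all names are illustrative:

```python
# Sketch of operator assignment from a list of (target_element, source_path)
# pairs; a missing source match is recorded as None.
#   null:    target has no source match
#   join:    one target fed by several different source paths
#   split:   one source path feeding several target elements (see note above)
#   connect: a plain one-to-one mapping

def assign_operators(elem_map_set):
    by_target = {}
    source_fanout = {}
    for t, s in elem_map_set:
        by_target.setdefault(t, []).append(s)
        if s is not None:
            source_fanout[s] = source_fanout.get(s, 0) + 1
    ops = {}
    for t, sources in by_target.items():
        if sources == [None]:
            ops[t] = "null"
        elif len(sources) > 1:
            ops[t] = "join"
        elif source_fanout[sources[0]] > 1:
            ops[t] = "split"
        else:
            ops[t] = "connect"
    return ops
```

Applied to the mapping pairs behind Table 5.3, this reproduces connect for company, null for address, split for street, and join for name.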
Using the mapping result established in Table 5.3, for each mapping pair the following information is stored: ID, source access path list, target element list, child mapping list, child mapping type list, quantifier, transformation operator, and condition. Their detail
Table 5.3: Transformation operators for corresponding elements

Target Doc                   Source Access Paths                    Transformation Operator
company                      company                                connect
address                      null                                   null
street, city, state, postal  company/address                        split
cname                        company/cname                          connect
personnel                    company/personnel                      connect
person                       company/personnel/person               connect
name                         company/personnel/person/name/first,   join
                             company/personnel/person/name/last
email                        company/personnel/person/email         connect
is as follows:
ID - each mapping result in Table 5.3 has a unique ID. The mapping result of the root element has an ID of 1. A child's ID is prefixed with its parent's ID. For example, if element company has the ID 1, then its first child mapping result has the ID 1.1, its second child mapping result has the ID 1.2, and so on. In this way, the mapping results can be processed following the hierarchical structure of the target document, in which the mapping result of the root element serves as a starting point for processing the mapping results.
source access path list - a list containing all the source access paths that match the elements in the target element list.
target element list - a list containing all the target elements that match the elements in the source access path list.
child mapping list - a list containing all the IDs of the target child mapping results.
ID,source access path list,target element list,Child mapping list,Child Mapping Type list,quantifier,transformation,condition
1,company,company,1.1,1.2,1.3,null,one-to-one,one-to-one,none,connect,null
1.1,,address,1.1.1,many-to-one,none,null,null
1.1.1,company/address,street,city,state,postal,,,none,split,null
1.2,company/cname,cname,,,none,connect,null
1.3,company/personnel,personnel,1.3.1,one-to-one,none,connect,null
1.3.1,company/personnel/person,person,1.3.1.1,1.3.1.2,one-to-many,one-to-one,+,connect,null
1.3.1.1,company/personnel/person/first,company/personnel/person/last,name,,,none,join,null
1.3.1.2,company/personnel/person/email,email,,,?,join,null
Figure 5.6: Element mapping result.
child mapping type list - a list containing all the types of the child mapping results,
i.e. one-to-one, one-to-many or many-to-one.
quantifier - if the quantifier is none then the mapping pair occurs once and only
once under its parent node. A quantifier such as ? (optional) is treated in the
same way as a none quantifier.
transformation operator - the transformation operator of the mapping result, such as
split, connect, join or null. The null operator indicates that there is no mapping for
the target element.
condition - this information indicates any special condition required for the mapping
result to be possible.
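The hierarchical ID scheme can be illustrated with a short sketch. The following Python fragment (an illustration, not the thesis implementation) models each mapping result as a dictionary with the fields listed above, and shows how the dotted IDs induce the hierarchical processing order:

```python
# Illustrative sketch: mapping results as dictionaries, ordered by their
# hierarchical dotted IDs so that parents are processed before children.

def id_key(mapping_id):
    """Turn a dotted ID such as '1.3.1.2' into a sortable tuple (1, 3, 1, 2)."""
    return tuple(int(part) for part in mapping_id.split("."))

# Hypothetical mapping results, modelled on Table 5.3 / Figure 5.6.
results = [
    {"id": "1.3.1", "source_paths": ["company/personnel/person"],
     "targets": ["person"], "children": ["1.3.1.1", "1.3.1.2"],
     "quantifier": "+", "transformation": "connect", "condition": None},
    {"id": "1", "source_paths": ["company"], "targets": ["company"],
     "children": ["1.1", "1.2", "1.3"], "quantifier": "none",
     "transformation": "connect", "condition": None},
]

# Sorting by id_key yields the hierarchical (document) order: parents first.
ordered = sorted(results, key=lambda r: id_key(r["id"]))
print([r["id"] for r in ordered])  # ['1', '1.3.1']
```

Note that sorting on the numeric tuples rather than on the raw ID strings keeps, for instance, 1.2 before 1.10.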
For example, the mapping results in Table 5.3 are processed into the format shown in
Figure 5.6. Figure 5.6 shows the final mapping result between the elements in the
target document and the source documents. This result is used and processed by the
next stage, the XSLT script generator stage.
5.5 XSLT Script Generator
The XSLT script generator stage is inspired by the work of Boukottaya et al. [8]. An
XSLT (http://www.w3.org/TR/xslt) program relies on XPath expressions to navigate a
source document. There are two techniques for generating the XSLT script: push and
pull. Push means emitting output whenever some conditions are satisfied by the nodes
(elements) in the source document. The pull technique usually refers to the process
that walks through an output template and retrieves data from the nodes in the source
document. An example of the push technique is the use of "match" and "apply-templates"
to generate the output by further processing all the children of the matched node. An
example of the pull technique is the use of "select" to query the source instance and
extract the value of the selected source node. In the proposed XML transformation
approach, both techniques are used in the generation of the XSLT script. An XSLT
template generally takes the following form:
<xsl:template match=pattern name=qname priority=number mode=qname>
  construction rules which possibly call/apply other templates
</xsl:template>
Three kinds of XSLT templates are used here: pattern templates, mode templates and
named templates. The pattern templates do not need a name or mode attribute. They
can be called by an xsl:apply-templates element without a mode or name attribute.
Similarly, the mode templates can be called by an xsl:apply-templates element, but
with a mode attribute. The mode templates can be used to enforce a particular
construction phase by restricting processing to a set of templates that will be
called during that phase. Lastly, the named templates give the flexibility to call
a specific template whenever necessary at any construction phase. They can be called
by an xsl:call-template element with a matching name attribute.
Using the format of the mapping result shown in Figure 5.6, the XSLT script is
generated in the following steps. An example of the output XSLT script is shown in
Figure 5.7:
1. Initializing the translation - take the first mapping element, which is the mapping
of the root element. It is assumed that the root element always has a one-to-one
mapping relationship. Once the root element mapping result is located, the generation
of the template rules can begin.
2. Traverse the mapping result shown in Figure 5.6 in a depth-first manner, i.e.,
process the mapping element results in hierarchical order according to their IDs.
(a) generate a construction template for the current mapping element pair.
(b) for each child mapping element, adjust the above template to insert more
construction or apply-template rules where necessary.
(c) add the templates to the XSLT stylesheet.
(d) if there are more mapping element pairs to be processed, loop back to
2(a).
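The traversal in step 2 can be sketched as follows. This is an illustrative Python outline, not the actual generator: each mapping result is assumed to carry its child mapping list, and a placeholder record is emitted per mapping pair instead of a real XSLT template.

```python
# Minimal sketch of the depth-first generation loop: starting from the root
# mapping result (ID '1'), recurse through the child mapping lists and emit
# one construction record per mapping pair, in hierarchical order.

def generate(results, current_id, emitted):
    result = results[current_id]
    # (a) generate a construction template for the current mapping pair;
    # here we only record the target element instead of real XSLT.
    emitted.append((current_id, result["target"]))
    # (b)+(d) recurse into every child mapping result, depth-first.
    for child_id in result["children"]:
        generate(results, child_id, emitted)
    return emitted

# Hypothetical mapping results keyed by ID, echoing Figure 5.6's hierarchy.
results = {
    "1":   {"target": "company", "children": ["1.1", "1.2"]},
    "1.1": {"target": "address", "children": []},
    "1.2": {"target": "cname",   "children": []},
}

print(generate(results, "1", []))
# [('1', 'company'), ('1.1', 'address'), ('1.2', 'cname')]
```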
<xsl:transform
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="company">
<company>
<xsl:apply-templates select="address"/>
<xsl:apply-templates select="cname"/>
<xsl:apply-templates select="personnel"/>
</company>
</xsl:template>
<xsl:template match="address">
<address>
<street>
<xsl:value-of select = "substring-before(.,',')"/>
</street>
<city>
<xsl:value-of select="substring-before(substring-after(.,','),',')"/>
</city>
<state>
<xsl:value-of select="substring-before(substring-after(substring-after(.,','),','),',')"/>
</state>
<postal>
<xsl:value-of select ="substring-after(substring-after(substring-after(.,','),','),',')"/>
</postal>
</address>
</xsl:template>
<xsl:template match="cname">
<cname>
<xsl:value-of select="."/>
</cname>
</xsl:template>
<xsl:template match="personnel">
<personnel>
<xsl:for-each select="person">
<person>
<xsl:call-template name="person-trans"/>
</person>
</xsl:for-each>
</personnel>
</xsl:template>
<xsl:template name="person-trans">
<name>
<xsl:value-of select="name/first"/>
<xsl:text> </xsl:text>
<xsl:value-of select="name/last"/>
</name>
<email>
<xsl:value-of select="email"/>
</email>
</xsl:template>
</xsl:transform>
Figure 5.7: An example of an XSLT Script.
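The nested substring-before and substring-after calls in the address template of Figure 5.7 decompose a comma-separated address value. Assuming a value with exactly four comma-separated fields, the decomposition is equivalent to the following Python sketch (illustrative only; the address value is hypothetical):

```python
# Equivalent of the XSLT address template: substring-before(.,',') yields the
# street, the nested substring-after/substring-before pairs yield city and
# state, and the triple substring-after yields everything after the third
# comma (the postal code). In Python this is a split on ',' capped at 4 parts.

def split_address(value):
    street, city, state, postal = value.split(",", 3)
    return {"street": street, "city": city, "state": state, "postal": postal}

print(split_address("12 Main St,Brisbane,QLD,4000"))
# {'street': '12 Main St', 'city': 'Brisbane', 'state': 'QLD', 'postal': '4000'}
```

As in the XSLT version, any commas after the third one remain part of the last field.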
The element matching, transformation operator and XSLT script generator stages
are executed for each global summary structure generated in the pre-processing stage.
Eventually, each cluster has an XSLT script associated with it, which is used by an
XSLT processor (e.g., Saxon, http://saxon.sourceforge.net/) to convert the input
source documents into the target document format.
5.6 Results of Experiments
The experiments are carried out to measure the performance of the proposed XML
transformation approach in comparison to the traditional way of transforming XML data.
Furthermore, the accuracy of the element matching employed by the proposed approach is
also evaluated.
5.6.1 Data Collection
The details of the input source data collections used in the experiments are shown
in Table 5.4. The input source data collections are derived from the XML document
collections that have been used to evaluate the clustering methods proposed in this thesis
(Table 4.2). The Movie and the Bibliography collections are homogeneous collections.
The DBLP collection is a homogeneous collection; however, it has the characteristics of a
heterogeneous collection, containing 8 different structure formats for books, conference,
journals, MS, persons, PhD, Tr and WWW, as shown in Table 4.2. The structures of the
input source collections are not large; however, they are chosen for the evaluation of the
proposed XML transformation approach because small structures make it easier to analyse and
evaluate the performance of the XCTrans approach.
Table 5.4: Data collections for XML transformation

Collection      No. of Docs    No. of Hierarchical Levels    No. of Distinct Elements
DBLP            4910           4                             32
Movie           37             4                             12
Bibliography    16             5                             14
The target DTD documents for each data collection are manually defined. Refer to the
Appendix for the source DTD documents of the data collections in Table 5.4 and the
target DTD documents used for testing the proposed transformation approach.
5.6.2 Evaluation Metric
In the evaluation of the XCTrans approach, the time taken to transform the input source
collections into the target format using the XSLT scripts generated by XCTrans is
measured in seconds. For the evaluation of the element matching in the XCTrans
approach, the recall and precision measures are used. Let A be the set of correct
matches identified by a human, and C be the set of mappings generated by the automatic
matching system. Precision is the ratio between the number of correct mappings
generated by the system and the total number of mappings in C. It indicates how many
incorrect mappings have been discovered by the element matching stage.
precision = |C ∩ A| / |C|    (5.4)
Recall is the ratio between the number of correct mappings generated by the matching
system and the total number of correct mappings (i.e., the mappings identified by a
human). It gives an indication of how many correct mappings are missed by the matcher.
recall = |C ∩ A| / |A|    (5.5)
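Equations (5.4) and (5.5) can be expressed directly as set operations. The following Python sketch illustrates the computation; the element mappings used here are hypothetical, not the thesis test data.

```python
# Precision and recall as set operations: A is the set of correct
# (human-identified) matches, C the set of system-generated mappings.

def precision(C, A):
    return len(C & A) / len(C)

def recall(C, A):
    return len(C & A) / len(A)

# Hypothetical mappings: pairs of (source element, target element).
A = {("author", "author"), ("title", "title"), ("year", "year")}
C = {("author", "author"), ("title", "title"), ("year", "journal")}

print(precision(C, A))  # 2 correct out of 3 generated mappings -> 0.666...
print(recall(C, A))     # 2 of the 3 correct mappings found    -> 0.666...
```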
5.6.3 Scalability
Figure 5.8 compares the transformation processing time using the XSLT scripts generated
by the XCTrans approach with the processing time of the One-Script method. The One-Script
method is the traditional way of performing transformation, in which each source data
collection is associated with a single XSLT script. The XSLT script is generated using
the data collection's DTD schema definition (refer to the Appendix). The XCTrans
approach, on the other hand, has many XSLT scripts for transforming the input source
data collections. The number of generated XSLT scripts corresponds to the number of
clusters produced by executing the XCTree method on the input source collections. For
example, after grouping the DBLP collection based on document structure, 8 different
clusters might be produced by the XCTree method; therefore, eight
XSLT scripts are produced by the proposed approach. From the results in Figure 5.8,
it can be seen that, in terms of speed, the XSLT scripts generated by the XCTrans
approach perform better on the DBLP collection, which has more documents and larger
document structures than the Movie and the Bibliography collections. For collections
with small structures, such as the Movie and Bibliography collections, the performance
of the XCTrans approach is equivalent to, or slightly worse than, that of the
One-Script method due to the searching and loading of the different XSLT scripts.
[Bar chart: processing time (sec) for each data collection (dblp, movies, bibliography), comparing XCTrans and the One-Script Method.]
Figure 5.8: XML transformation processing time on the datasets.
Figure 5.9 displays the difference in the transformation processing time in relation to the
size of the DBLP collection. The graph in Figure 5.9 shows that for the DBLP collection,
the larger the collection size, the greater is the difference in the processing time between
the XCTrans and the One-Script method. The whole collection of the DBLP is around
157120 documents as discussed in Chapter 4 Section 4.1.
[Line chart: processing time (sec) versus number of documents, comparing XCTrans and the One-Script Method.]
Figure 5.9: The processing time in seconds in relation to the number of documents in the DBLP collection.
5.6. Results of Experiments 148
Furthermore, experiments are carried out to test the processing time of the XCTrans
approach with different numbers of generated clusters (scripts). Figure 5.10 shows
the processing time in seconds in relation to the number of clusters generated on the
DBLP collection. From the results, it can be ascertained that there exists an optimal
number of clusters (scripts) at which the best performance is reached; however, when
the number of clusters goes beyond this optimum, the performance of the transformation
process degrades because of the extra cost of indexing and retrieving the scripts.
[Line chart: processing time (sec) for 1 to 14 generated clusters, XCTrans.]
Figure 5.10: The processing time in seconds with different numbers of clusters on the DBLP collection.
5.6.4 Element Mapping
The accuracy of the mapping results generated by the element matching process is
compared with that of the Similarity Flooding (SF) method [42]. The SF approach is
based on labelled graphs, which are used in an iterative fixpoint computation whose
results tell us which nodes in one graph are similar to nodes in the second graph. For
computing the similarities, it relies on the intuition that elements of two distinct
models are similar when they occur in similar contexts, i.e., when their adjacent
elements are similar. In other words, a part of the similarity of two elements
propagates to their respective neighbours. Figures 5.11 and 5.12 show the recall and
precision results of the employed element matching process and the SF method,
respectively. The XCTrans outperforms the SF method. The string matching measures
combined with a propagation process employed by the SF method are not flexible enough
to achieve better results.
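The propagation intuition behind SF can be sketched briefly. The following Python fragment is a highly simplified illustration of that intuition only, not the actual fixpoint algorithm of [42]; the two tiny schema graphs and seed similarities are hypothetical.

```python
# Simplified sketch of similarity propagation: node-pair similarities are
# repeatedly increased by the similarity of their adjacent (parent) pairs,
# then normalized by the maximum value, for a fixed number of rounds.

def propagate(edges_a, edges_b, sim, rounds=10):
    for _ in range(rounds):
        nxt = dict(sim)
        for (a1, a2) in edges_a:
            for (b1, b2) in edges_b:
                # similarity of the pair (a1, b1) flows to the child pair (a2, b2)
                nxt[(a2, b2)] = nxt.get((a2, b2), 0.0) + sim.get((a1, b1), 0.0)
        top = max(nxt.values())
        sim = {pair: value / top for pair, value in nxt.items()}
    return sim

# Two tiny schema graphs: company -> name  versus  firm -> title.
edges_a = [("company", "name")]
edges_b = [("firm", "title")]
seed = {("company", "firm"): 1.0, ("name", "title"): 0.5}
result = propagate(edges_a, edges_b, seed)
# ("name", "title") ends up with the highest similarity, because the
# similarity of its parent pair ("company", "firm") flows into it.
```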
[Bar chart: recall for each data collection (DBLP, Movies, Bibliography), comparing XCTrans and SF Matching.]
Figure 5.11: The mapping accuracy based on the recall measure.
[Bar chart: precision for each data collection (DBLP, Movies, Bibliography), comparing XCTrans and SF Matching.]
Figure 5.12: The mapping accuracy based on the precision measure.
5.7 Discussion
The evaluation of the XCTrans approach in this thesis is not extensive; the main focus
of the thesis is to investigate the usability of XML clustering in XML transformation.
From the results of the experiments, the XSLT scripts generated by the XCTrans
approach perform better on input data collections, such as the DBLP collection, that
are large in the number of documents and complex in structure. As the number of
documents in the input source data collections increases, so does the difference in
the processing time between the XCTrans and the One-Script method, as shown in
Figure 5.9.
Even though the input source collections are not very large in the number of documents
or complex in structure, the XCTrans still shows an improvement in the transformation
process. If this approach can be modified and used on heterogeneous collections, it is
believed that the approach can be very useful. This has been shown to be true for the
DBLP collection, which is a homogeneous collection but has the characteristics of a
heterogeneous collection.
5.8 Chapter Summary
In this chapter, the XCTree method has been utilized in the proposed XML transforma-
tion approach for translating large input source XML document collections into a target
structure. Firstly, the XCTree method is applied to the input source data collections to
reduce the complexity of integrating the structures of all the input source documents.
Each cluster has a global summary structure representing the documents in the cluster.
The global summary structure acts as a source structure, which is used as input to the
schema matching stage for the generation of the transformation script. The experiments
show an improvement in the performance of the XCTrans approach in terms of the
processing time and the accuracy of the element mapping.
Chapter 6

Conclusion
As the popularity of XML data increases, so does the amount of XML data on the
Web. With the increasing amount of XML data, there is a necessity to better manage
and analyse large collections of XML data. XML clustering plays an important role in
the better management of XML data. There are still many open problems in the XML
clustering task, such as the respective roles of the structure and the content of the
XML documents in clustering. Therefore, the first main research question of this thesis
is: Can the accuracy of the clustering solution be improved by using both the structure
and content of XML documents?
In response to the first question, this thesis has proposed a number of clustering methods
for the clustering of XML documents using the structure-only information and using both
the content and structure of documents. The results of the experiments verify that for most
data collections, the clustering solutions using both the structure and content, especially
for the linear combination of the structure and content, outperform the results of the
structure-only clustering and the content-only clustering.
The existing transformation approaches discussed in Chapter 2 address the transfor-
mation problem only between one source and one target at a time. However, performing
the transformation process separately for many sources sharing similar characteristics
can be time-consuming. Therefore, the second research question in this thesis is: Given
a collection of source XML documents and a target document, can the grouping of the
source documents into small sets of similar structures improve the processing time and
accuracy of the XML transformation?
In response to the second research question, an XML transformation approach has been
proposed in this thesis that incorporates the clustering process as a pre-processing
stage, so that many source documents can be converted into the same target document
more efficiently. This confirms the hypothesis that XML clustering based
on the structural information of XML documents can improve the transformation process
in terms of time and accuracy for the conversion of more than two source documents.
6.1 Summary of Findings
For the proposed clustering methods, a number of findings have been made which have
been discussed in Chapter 4 in the discussion section. To summarize, the results of the
experiments which have been conducted on the proposed clustering methods illustrate
that the proposed clustering methods perform differently on different types of XML data
collections. For homogeneous collections such as the IEEE collection, the content-only
clustering solutions outperform the structure-only clustering solutions and the content and
structure-based clustering solutions. However, for the classification of documents from
homogeneous collections based on structure, such as the DBLP collection, the structure
also plays an important role. As for the heterogeneous collections, both the structure-only
clustering solutions and the content and structure-based clustering solutions outperform
the content-only clustering solutions. The results of the XCLComb method, which
linearly combines the structural similarity value and the content similarity value,
outperform the other methods for most collections used in this thesis. Thus, the first
hypothesis of this thesis is verified.
In addition, the findings of the proposed XML transformation approach have been dis-
cussed in Chapter 5. To summarize, using a structure-only clustering method such as
the XCTree method as the pre-processing stage improves the transformation process for
converting many source documents into the same target document at the same time.
Since the XML transformation approach uses a global representation of the source
documents in each cluster as the source structure, the errors of the schema matching
(element matching) process in the XML transformation are also reduced.
6.2 Summary of Contributions
This thesis provides an overview of XML data, XML clustering and XML transformation.
Based on the literature review of current work, a number of XML clustering methods, as
well as a novel XML transformation approach have been proposed in this research. The
main contributions are summarised below.
- Developed clustering methods to deal with both homogeneous and heterogeneous
collections.
- Combined structure and content to improve the quality of the clustering solution on
both homogeneous and heterogeneous collections.
- Proposed clustering methods to assist the schema matching process in data integra-
tion applications, as well as in XML transformation applications.
- Proposed a novel XML transformation approach that incorporates the XML clustering
process to improve the transformation of more than one XML document into the same
target document.
6.3 Limitations and Future Work
Several extensions can be made to improve the current proposed methods in the future.
- Extend the clustering methods so that they can be applied to the clustering of XML
schema definition data. The current clustering methods only address the problem of
XML clustering at the document level; however, the methods can easily be extended
to the clustering of XML schema definition data.
- Improve the similarity measure using external sources such as WordNet [22] to learn
the synonyms between tag names and content. The current methods do not use any
external sources for finding synonyms between tag names or content.
- Extend the proposed XML transformation approach to converting a collection of XML
schema definition data. The current evaluation of the XML transformation approach
is not sufficient to verify the second hypothesis of this thesis; therefore, an
extensive evaluation of the XML transformation is still required on more complex
source data collections. At the moment, the XML transformation approach addresses
only the problem of transforming more than one source XML document. The performance
of the transformation process improves, but not significantly. However, if the
proposed transformation system is applied to XML schema definitions, the performance
of the transformation process might improve significantly, although more work will
need to be done on the schema matching process.
Chapter 7

Appendix
7.1 DTD Definitions of the Data Collections for XML Clus-
tering Methods
This section contains the schema definitions for the data collections used to evaluate
the proposed clustering methods in this thesis. Since the schema definitions of the
data collections are very long, Figures 7.1, 7.2, 7.3 and 7.4 show only a portion of
each.
7.2 DTD definitions for the XML Transformation Approach
This section contains the DTD definitions of the source data collections and the
target DTD definitions used for the evaluation of the XML transformation approach
proposed in this thesis. Figures 7.9 and 7.10 show only a portion of the source
…
<!ELEMENT article (fno, doi?, fm, bdy, bm?)>
<!ELEMENT fno (#PCDATA)> <!-- article ID (no entity references) -->
<!ATTLIST fno fid NMTOKEN #IMPLIED>
<!ELEMENT doi (#PCDATA)>
<!-- ============ -->
<!-- FRONT MATTER -->
<!-- ============ -->
<!ELEMENT fm (hdr?, (edinfo|au|tig|pubfm|abs|edintro|kwd|fig|figw)*)>
<!-- ++++++ -->
<!-- HEADER -->
<!-- ++++++ -->
<!ELEMENT hdr (fig?, hdr1, hdr2)>
<!ELEMENT hdr1 (#PCDATA|crt|obi|pdt|pp|ti)*>
<!ELEMENT hdr2 (#PCDATA|crt|obi|pdt|pp|ti)*>
…
Figure 7.1: An example of the IEEE article DTD definition
...
<!ELEMENT USMARC (Leader, Directry, VarFlds)>
<!ATTLIST USMARC Material (BK|AM|CF|MP|MU|VM|SE|AU) "BK"
id CDATA #IMPLIED>
<!ELEMENT Leader (LRL, RecStat, RecType, BibLevel, UCP, IndCount, SFCount,
BaseAddr, EncLevel, DscCatFm, LinkRec, EntryMap)>
<!ELEMENT Directry (#PCDATA)>
<!ELEMENT VarFlds (VarCFlds, VarDFlds)>
<!ELEMENT LRL (#PCDATA)>
<!ELEMENT RecStat (#PCDATA)>
<!ELEMENT RecType (#PCDATA)>
<!ELEMENT BibLevel (#PCDATA)>
<!ELEMENT UCP (#PCDATA)>
<!ELEMENT IndCount (#PCDATA)>
<!ELEMENT SFCount (#PCDATA)>
<!ELEMENT BaseAddr (#PCDATA)>
<!ELEMENT EncLevel (#PCDATA)>
<!ELEMENT DscCatFm (#PCDATA)>
<!ELEMENT LinkRec (#PCDATA)>
<!ELEMENT EntryMap (FLength, SCharPos, IDLength, EMUCP)>
<!ELEMENT FLength (#PCDATA)>
<!ELEMENT SCharPos (#PCDATA)>
<!ELEMENT IDLength (#PCDATA)>
<!ELEMENT EMUCP (#PCDATA)>
...
Figure 7.2: An example of the Berkeley article DTD definition
<!ELEMENT entry (article|book|booklet|manual|manuscript|phdthesis|mastersthesis|proceedings|
inproceedings|incollection|inbook|techreport|unpublished|misc)+>
<!ATTLIST entry
id NMTOKEN #REQUIRED>
<!ELEMENT article ((author | altauthor |title |year |journal |conference |volume |number |
mrnumber |govnumber |pages |note |contents |copyright |price |annote |titletranslation |
keywords |free-terms |general-terms |abstract |reviewer |classification-codes |
subject-descriptors |language |links | doi | url |entrydate |key |issn |institution |
provider |english |ideanresearch.com |mixed)+)>
<!ATTLIST article
id NMTOKEN #REQUIRED
provider (LEABIB|CP|DBLP|TCCSB) #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
<!ELEMENT book ((author | altauthor |editor |translation |title |year |publisher |conference |volume |
series |address |edition |note |govnumber |mrnumber |contents |copyright |price |annote |
titletranslation |keywords |free-terms |general-terms |abstract |reviewer |classification-codes |
subject-descriptors |language |entrydate |key |links | doi | url |isbns |institution |provider |
english |ideanresearch.com |mixed)+)>
<!ATTLIST book
id NMTOKEN #REQUIRED
provider CDATA #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
...
Figure 7.3: An example of the HCI article DTD definition
<!ELEMENT bibliography (
article|book|dissertation|proceedings|inproceedings|incollection|techreport|misc)+>
<!ELEMENT article (
author?,title,year?,journal?,volume?,number?,pages?,note?,titletranslation?,keywords?,abstract?,
reviewer?,classification?,language?,links?,issns?,affiliation?,provider?,mixed?)>
<!ATTLIST article
id NMTOKEN #REQUIRED
provider (LEABIB|CP|DBLP|TCCSB) #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
<!ELEMENT book (author?,editor?,title,year?,publisher?,volume?,series?,address?,edition?,note?,
titletranslation?,keywords?,abstract?,reviewer?,classification?,language?,links?,isbns?,
affiliation?,provider?,mixed?)>
<!ATTLIST book
id NMTOKEN #REQUIRED
provider CDATA #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
<!ELEMENT dissertation (author?,title,year?,school?,address?,month?,note?,titletranslation?,keywords?,
abstract?,reviewer?,classification?,language?,links?,isbns?,affiliation?,provider?,mixed?)>
<!ATTLIST dissertation
id NMTOKEN #REQUIRED
provider CDATA #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
…
Figure 7.4: An example of the DBLP article DTD definition
<?xml encoding="ISO-8859-1"?>
<!ELEMENT bib (vendor)*>
<!ELEMENT vendor (name, email, phone?, book*)>
<!ATTLIST vendor id ID #REQUIRED>
<!ELEMENT book (title, publisher?, year?, author+, price)>
<!ELEMENT author (firstname?, lastname)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT price (#PCDATA)>
Figure 7.5: The source Bibliography article DTD definition
<?xml encoding="ISO-8859-1"?>
<!ELEMENT bib (vendor)*>
<!ELEMENT vendor (name, email, book*)>
<!ATTLIST vendor id ID #REQUIRED>
<!ELEMENT book (title, publisher?, year?, author+)>
<!ELEMENT author (name)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT year (#PCDATA)>
Figure 7.6: The target Bibliography article DTD definition
DTD definition and target DTD definition of the DBLP collection, respectively.
<?xml encoding="ISO-8859-1"?>
<!ELEMENT W4F_DOC (Movie)>
<!ELEMENT Movie (Title,Year,Directed_By,Genres,Cast)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Year (#PCDATA)>
<!ELEMENT Directed_By (Director)*>
<!ELEMENT Director (#PCDATA)>
<!ELEMENT Genres (Genre)*>
<!ELEMENT Genre (#PCDATA)>
<!ELEMENT Cast (Actor)*>
<!ELEMENT Actor (FirstName,LastName)>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
Figure 7.7: The source Movies DTD definition
<?xml encoding="ISO-8859-1"?>
<!ELEMENT Movie (Title,Year,Directed_By,Genres,Cast)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Year (#PCDATA)>
<!ELEMENT Directed_By (Director)*>
<!ELEMENT Director (FirstName, LastName)>
<!ELEMENT Genres (Genre)*>
<!ELEMENT Genre (#PCDATA)>
<!ELEMENT Cast (Actor)*>
<!ELEMENT Actor (FirstName,LastName)>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
Figure 7.8: The target Movies DTD definition
<!ELEMENT bibliography (
article|book|dissertation|proceedings|inproceedings|incollection|techreport|misc)+>
<!ELEMENT article (
author?,title,year?,journal?,volume?,number?,pages?,note?,titletranslation?,keywords?,abstract?,
reviewer?,classification?,language?,links?,issns?,affiliation?,provider?,mixed?)>
<!ATTLIST article
id NMTOKEN #REQUIRED
provider (LEABIB|CP|DBLP|TCCSB) #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
<!ELEMENT book (author?,editor?,title,year?,publisher?,volume?,series?,address?,edition?,note?,
titletranslation?,keywords?,abstract?,reviewer?,classification?,language?,links?,isbns?,
affiliation?,provider?,mixed?)>
<!ATTLIST book
id NMTOKEN #REQUIRED
provider CDATA #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
<!ELEMENT dissertation (author?,title,year?,school?,address?,month?,note?,titletranslation?,keywords?,
abstract?,reviewer?,classification?,language?,links?,isbns?,affiliation?,provider?,mixed?)>
<!ATTLIST dissertation
id NMTOKEN #REQUIRED
provider CDATA #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
…
Figure 7.9: A portion of the source DBLP DTD definition
<!ELEMENT bibliography (type,
author?,editor?,title,year?,publisher?,volume?,series?,address?,edition?,note?,
titletranslation?,keywords?,abstract?,reviewer?,classification?,language?,links?,isbns?,
affiliation?,provider?,mixed?)>
<!ELEMENT type (#PCDATA)>
…
Figure 7.10: A portion of the target DBLP DTD definition
Publications from this Thesis
1. Tien Tran, Sangeetha Kutty and Richi Nayak. Utilizing the Structure and Con-
tent Information for XML Document Clustering. In Shlomo Geva, Jaap Kamps,
and Andrew Trotman, editors, Advances in Focused Retrieval, pages 460-468, 2009.
Springer Berlin / Heidelberg.
2. Tien Tran, Richi Nayak, and Peter Bruza (2008). Combining structure and content
similarities for xml document clustering. In: Proceedings of the 7th Australasian
data mining conference (AusDM). Adelaide, Australia.
3. Tien Tran, Richi Nayak, and Peter Bruza. Document Clustering Using Incremental
and Pairwise Approaches. In Norbert Fuhr, Jaap Kamps, Mounia Lalmas, and
Andrew Trotman, editors, Focused Access to XML Documents, pages 222-233, 2008.
Springer Berlin / Heidelberg.
4. Tien Tran, Richi Nayak, and Peter Bruza. Evaluating the Performance of XML
Document Clustering by Structure Only. In Norbert Fuhr, Mounia Lalmas, and
Andrew Trotman, editors, Comparative Evaluation of XML Information Retrieval
Systems, pages 473-484, 2007. Springer Berlin / Heidelberg.
Bibliography
[1] Xsl transformations (xslt) 2.0. http://www.w3.org/TR/xslt20/, 2002.
[2] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to
Semistructured Data and XML. Morgan Kaufmann, San Francisco, California, 2000.
[3] Charu C. Aggarwal, Na Ta, Jianyong Wang, Jianhua Feng, and Mohammed Zaki.
Xproj: a framework for projected structural clustering of xml documents. In KDD
’07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 46–55, San Jose, California, USA, 2007.
[4] Mohamad Alishahi, Mahmoud Naghibzadeh, and Baharak Shakeri Aski. Tag name
structure-based clustering of xml documents. International Journal of Computer and
Electrical Engineering, 2(1):1793–8163, 2010.
[5] Alsayed Algergawy, Richi Nayak, and Gunter Saake. Element similarity measures in xml
schema matching. Information Sciences, 189(24):4975–4998, 2010.
[6] Panagiotis Antonellis, Christos Makris, and Nikos Tsirakis. XEdge: clustering ho-
mogeneous and heterogeneous XML documents using edge summaries. In Proceedings
of the 2008 ACM symposium on Applied computing, Fortaleza, Ceara, Brazil, 2008.
[7] R. Baeza-Yates and G. Navarro. Integrating contents and structure in text retrieval.
ACM SIGMOD, 25(1), 1996.
[8] Aida Boukottaya, Christine Vanoirbeek, Federica Paganelli, and Omar Abou Khaled.
Automating xml document transformations: A conceptual modelling based approach.
In Sven Hartmann and John F. Roddick, editors, First Asia-Pacific Conference on
Conceptual Modelling, pages 81–90, Dunedin, New Zealand, January 2004.
[9] Emmanuel Bruno, Jacques Le Maitre, and Elisabeth Murisasco. Extending XQuery
with transformation operators. In ACM Symposium on Document Engineering, Greno-
ble, France, 2003.
[10] S. Cha. Comprehensive survey on distance/similarity measures between probability
density functions. International Journal of Mathematical Models and Methods in
Applied Sciences, 1(4):300–307, 2007.
[11] S. Chawathe. Comparing hierarchical data in external memory. In Twenty-fifth Int.
Conf. on Very Large Data Bases, pages 90–101, 1999.
[12] S. S. Chawathe and H. Garcia-Molina. Meaningful change detection in structured
data. In Proceedings of the 1997 ACM SIGMOD International Conference on Management
of Data (SIGMOD), pages 26–37, New York, USA, 1997.
[13] Yun Chi, Richard R. Muntz, Siegfried Nijssen, and Joost N. Kok. Frequent subtree
mining - an overview. Fundamenta Informaticae, 66(1-2):161–198, 2004.
[14] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to
Information Retrieval. Cambridge University Press, 1st edition, 2008.
[15] N. Cristianini, J. Shawe-Taylor, and H. Lodhi. Latent semantic kernels. Journal of
Intelligent Information Systems (JIIS), 18(2), 2002.
[16] Theodore Dalamagas, Tao Cheng, Klaas-Jan Winkel, and Timos Sellis. A methodol-
ogy for clustering XML documents by structure. Information Systems, 31(3):187–228,
2006.
[17] L. Denoyer, P. Gallinari, and Anne-Marie Vercoustre. Report on the XML mining
track at INEX 2005 and INEX 2006. In INEX 2006, pages 432–443, Dagstuhl Castle,
Germany, 2006.
[18] H. H. Do and E. Rahm. COMA - a system for flexible combination of schema matching
approaches. In 28th VLDB, Hong Kong, China, August 2002.
[19] A. Doan, P. Domingos, and A. Y. Halevy. Reconciling schemas of disparate sources:
a machine-learning approach. In ACM SIGMOD, Santa Barbara, California, United
States, 2001.
[20] Carina Friedrich Dorneles, Rodrigo Goncalves, and Ronaldo dos Santos Mello. Ap-
proximate data instance matching: a survey. Knowledge and Information Systems,
pages 1–21, 2010.
[21] A. Doucet and H. Ahonen-Myka. Naive clustering of a large XML document collection.
In INEX Annual ERCIM Workshop, pages 81–88, 2002.
[22] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[23] S. Geva. K-tree: A height balanced tree structured vector quantizer. In IEEE Neural
Networks for Signal Processing Workshop 2000 (NNSP-2000), Sydney, December 2000.
[24] F. Giunchiglia, P. Shvaiko, and M. Yatskevich. S-Match: An algorithm and an imple-
mentation of semantic matching. In European Semantic Web Symposium (ESWS), 2004.
[25] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann,
San Diego, USA, 2001.
[26] Z. Huang. A fast clustering algorithm to cluster very large categorical data sets in data
mining. In SIGMOD Workshop on Research Issues on Data Mining and Knowledge
Discovery, 1997.
[27] Jeong Hee Hwang and Keun Ho Ryu. Clustering and retrieval of XML documents
by structure. In Osvaldo Gervasi, Marina Gavrilova, Vipin Kumar, Antonio Laganà,
Heow Lee, Youngsong Mun, David Taniar, and Chih Tan, editors, Computational
Science and Its Applications - ICCSA 2005, volume 3481 of Lecture Notes in Computer
Science, pages 925–935. Springer Berlin / Heidelberg, 2005.
[28] A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applica-
tions. Neural Networks, 13(4-5):411–430, 2000.
[29] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing
Surveys (CSUR), 31(3):264–323, 1999.
[30] George Karypis. CLUTO - software for clustering high-dimensional datasets. Karypis
Lab, University of Minnesota.
[31] Eila Kuikka, Paula Leinonen, and Martti Penttonen. Towards automating of doc-
ument structure transformations. In ACM Symposium on Document Engineering,
pages 103–110, McLean, Virginia, USA, 2002.
[32] Sangeetha Kutty, Richi Nayak, and Yuefeng Li. HCX: An efficient hybrid clustering
approach for XML documents. In DocEng ’09: Proceedings of the 9th ACM Symposium
on Document Engineering, pages 94–97, Munich, Germany, 2009.
[33] Sangeetha Kutty, Tien Tran, Richi Nayak, and Yuefeng Li. Clustering XML docu-
ments using closed frequent subtrees: A structural similarity approach. In Norbert
Fuhr, Jaap Kamps, Mounia Lalmas, and Andrew Trotman, editors, Focused Access to
XML Documents, volume 4862 of Lecture Notes in Computer Science, pages 183–194.
Springer Berlin / Heidelberg, 2008.
[34] T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent semantic
analysis. Discourse Processes, 25:259–284, 1998.
[35] J. W. Lee and S. S. Park. Finding maximal similar paths between XML documents
using sequential patterns. In ADVIS, pages 96–106, Izmir, Turkey, October 2004.
[36] Jun-Seung Lee and Kyong-Ho Lee. Computing simple and complex matchings be-
tween XML schemas for transforming XML documents. Information and Software
Technology, 48(9):937–946, September 2006.
[37] M. L. Lee, L. H. Yang, W. Hsu, and X. Yang. XClust: Clustering XML schemas
for effective integration. In 11th ACM International Conference on Information and
Knowledge Management (CIKM ’02), Virginia, November 2002.
[38] Ho-pong Leung, Fu-lai Chung, S. C. F. Chan, and R. Luk. XML document clustering
using common XPath. In International Workshop on Challenges in Web Information
Retrieval and Integration (WIRI ’05), pages 91–96, 2005.
[39] Zhiwei Lin, Hui Wang, S. McClean, and Haiying Wang. All common embedded
subtrees for clustering XML documents by structure. In International Conference on
Machine Learning and Cybernetics, volume 1, pages 13–18, 2009.
[40] Jianghui Liu, J. T. L. Wang, W. Hsu, and K. G. Herbert. XML clustering by principal
component analysis. In 16th IEEE International Conference on Tools with Artificial
Intelligence (ICTAI 2004), pages 658–662, 2004.
[41] J. Madhavan, P. A. Bernstein, and E. Rahm. Generic schema matching with Cupid.
In 27th VLDB, Roma, Italy, 2001.
[42] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: a versatile graph
matching algorithm and its application to schema matching. In 18th ICDE, 2002.
[43] N. K. Nagwani and A. Bhansali. Clustering homogeneous XML documents using
weighted similarities on XML attributes. In 2010 IEEE 2nd International Advance
Computing Conference (IACC), pages 369–372, Patiala, 2010.
[44] R. Nayak and W. Iryadi. XML schema clustering with semantic and hierarchical
similarity measures. Knowledge-Based Systems, 20(4):336–349, 2007.
[45] R. Nayak and T. Tran. A progressive clustering algorithm to group the XML data by
structural and semantic similarity. IJPRAI, 21(3):1–21, 2007.
[46] R. Nayak and S. Xu. XCLS: A fast and effective clustering algorithm for heterogeneous
XML documents. In PAKDD 2006, Singapore, 2006.
[47] Richi Nayak. Fast and effective clustering of XML data using structural information.
Knowledge and Information Systems, 14(2):197–215, 2008.
[48] Richi Nayak and Wina Iryadi. XMine: A methodology for mining XML structure. In
Xiaofang Zhou, Jianzhong Li, Heng Tao Shen, Masaru Kitsuregawa, and Yanchun Zhang,
editors, Frontiers of WWW Research and Development - APWeb 2006, volume 3841
of Lecture Notes in Computer Science, pages 786–792. Springer Berlin / Heidelberg,
2006.
[49] Richi Nayak, Christopher M. De Vries, Sangeetha Kutty, Shlomo Geva, Ludovic De-
noyer, and Patrick Gallinari. Overview of the INEX 2009 XML mining track: Clustering
and classification of XML documents. In Shlomo Geva, Jaap Kamps, and Andrew
Trotman, editors, Focused Retrieval and Evaluation, volume 6203 of Lecture Notes in
Computer Science, pages 366–378. Springer Berlin / Heidelberg, 2010.
[50] Richi Nayak and F. B. Xia. Automatic integration of heterogeneous XML schemas.
In Int. Conf. on Information Integration and Web-based Applications and Services,
pages 427–437, Jakarta, Indonesia, 2004.
[51] H.-Q. Nguyen, D. Taniar, J. W. Rahayu, and K. Nguyen. Double-layered schema
integration of heterogeneous XML sources. Journal of Systems and Software, 84(1):63–76,
2011.
[52] A. Nierman and H. V. Jagadish. Evaluating structural similarity in XML documents.
In Fifth International Workshop on the Web and Databases (WebDB 2002), Wisconsin,
USA, 2002.
[53] K. Ono, T. Koyanagi, M. Abe, and M. Hori. XSLT stylesheet generation by example
with WYSIWYG editing. In 2002 International Symposium on Applications and the
Internet, Nara, Japan, March 2002.
[54] T. Pankowski. Specifying transformations for XML data. In Pre-Conference Workshop
of VLDB, Berlin, 2003.
[55] Tadeusz Pankowski. A high-level language for specifying XML data transformations.
In A. Benczúr, J. Demetrovics, and G. Gottlob, editors, ADBIS, pages 159–172,
Budapest, Hungary, 2004.
[56] K. Pearson. On lines and planes of closest fit to systems of points in space. Philo-
sophical Magazine, 2(6):559–572, 1901.
[57] M. Peltier, J. Bézivin, and G. Guillaume. MTrans: A general framework based on XSLT
for model transformations. In WTUML, Genova, Italy, 2001.
[58] M.F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[59] K. Saleem, Z. Bellahsene, and E. Hunt. PORSCHE: Performance oriented schema medi-
ation. Information Systems, 33(7-8):637–657, 2008.
[60] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing.
Communications of the ACM, 18(11):613–620, 1975.
[61] Dong-Hoon Shin and Kyong-Ho Lee. Towards the faster transformation of XML doc-
uments. Journal of Information Science, 32:261–276, 2006.
[62] Pavel Shvaiko and Jérôme Euzenat. A survey of schema-based matching approaches.
Journal on Data Semantics IV, pages 146–171, 2005.
[63] Marko Smiljanic, Maurice van Keulen, and Willem Jonker. Using element clustering
to increase the efficiency of XML schema matching. In 2nd International Workshop on
Challenges in Web Information Retrieval and Integration (WIRI ’06), pages 95–104,
April 2006.
[64] Ian Stuart. XML Schema, a brief introduction, 2004.
[65] Hong Su, Harumi Kuno, and Elke A. Rundensteiner. Automating the transformation
of XML documents. In ACM Symposium on Document Engineering, 2001.
[66] Hong Su, Harumi Kuno, and Elke A. Rundensteiner. Automating the transformation
of XML documents. In ACM Symposium on Document Engineering, 2001.
[67] X. Tang and F.W. Tompa. Specifying transformations for structured documents. In
International Workshop on the Web and Databases, 2001.
[68] Joe Tekli, Richard Chbeir, and Kokou Yetongnon. A hybrid approach for XML sim-
ilarity. In Jan van Leeuwen, Giuseppe Italiano, Wiebe van der Hoek, Christoph
Meinel, Harald Sack, and František Plášil, editors, SOFSEM 2007: Theory and Prac-
tice of Computer Science, volume 4362 of Lecture Notes in Computer Science, pages
783–795. Springer Berlin / Heidelberg, 2007.
[69] A. Termier, M.-C. Rousset, and M. Sebag. TreeFinder: A first step towards XML data
mining. In IEEE International Conference on Data Mining, 2002.
[70] A. Theobald and G. Weikum. The index-based XXL search engine for querying XML
data with relevance ranking. In Proceedings of the EDBT Conference, 2002.
[71] Tien Tran, Sangeetha Kutty, and Richi Nayak. Utilizing the structure and content
information for XML document clustering. In Shlomo Geva, Jaap Kamps, and Andrew
Trotman, editors, Advances in Focused Retrieval, volume 5631 of Lecture Notes in
Computer Science, pages 460–468. Springer Berlin / Heidelberg, 2009.
[72] Tien Tran and Richi Nayak. Evaluating the performance of XML document clus-
tering by structure only. In 5th International Workshop of the Initiative for the Eval-
uation of XML Retrieval, INEX, pages 473–484, Dagstuhl Castle, Germany, 2006.
[73] Tien Tran, Richi Nayak, and Peter Bruza. Combining structure and content similar-
ities for XML document clustering. In Proceedings of the 7th Australasian Data Mining
Conference (AusDM), pages 219–226, Adelaide, Australia, 2008.
[74] Athena Vakali, Jaroslav Pokorný, and Theodore Dalamagas. An overview of web data
clustering practices. In Wolfgang Lindner, Marco Mesiti, Can Türker, Yannis Tz-
itzikas, and Athena Vakali, editors, Current Trends in Database Technology - EDBT
2004 Workshops, volume 3268 of Lecture Notes in Computer Science, pages 500–501.
Springer Berlin / Heidelberg, 2005.
[75] Christopher M. De Vries and Shlomo Geva. Document clustering with k-tree. In
Shlomo Geva, Jaap Kamps, and Andrew Trotman, editors, Advances in Focused Re-
trieval, volume 5631 of Lecture Notes in Computer Science, pages 420–431. Springer
Berlin / Heidelberg, 2009.
[76] R. Wagner and M. Fischer. The string-to-string correction problem. Journal of the
ACM, 21(1):168–173, 1974.
[77] S. Waworuntu and J. Bailey. XSLTGen: A system for automatically generating XML
transformations via semantic mappings. In 23rd International Conference on Con-
ceptual Modeling (ER 2004), 2004.
[78] Erik Wüstner, Thorsten Hotzel, and Peter Buxmann. Converting business documents:
A classification of problems and solutions using XML/XSLT. In WECWIS, California,
USA, 2002.
[79] L. Xu and D. W. Embley. Discovering direct and indirect matches for schema ele-
ments. In 8th International Conference on Database Systems for Advanced Applica-
tions, 2003.
[80] Jianwu Yang, W. K. Cheung, and Xiaoou Chen. Learning the kernel matrix for XML
document clustering. In IEEE International Conference on e-Technology, e-Commerce
and e-Service, 2005.
[81] J.W. Yang and X.O. Chen. A semi-structured document model for text mining.
Journal of Computer Science and Technology, 17(5):603–610, 2002.
[82] Y. Yang, X. Guan, and J. You. CLOPE: A fast and effective clustering algorithm
for transaction data. In 8th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2002.
[83] Jin Yao and Nadia Zerida. Rare patterns to improve path-based clustering. In 6th
International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX
2007, Dagstuhl Castle, Germany, 2007.
[84] Jin Yao and Nadia Zerida. Rare patterns to improve path-based clustering of Wikipedia
articles. In 6th International Workshop of the Initiative for the Evaluation of XML
Retrieval, INEX 2007, Dagstuhl Castle, Germany, December 17–19, 2007.
[85] Guo Yongming, Chen Dehua, and Le Jiagin. Clustering XML documents by combining
content and structure. In International Symposium on Information Science and
Engineering (ISISE ’08), volume 1, pages 583–587, Shanghai, 2008.
[86] J. Yoo, V. Raghavan, and L. Kerschberg. BitCube: Clustering and statistical anal-
ysis for XML documents. In Thirteenth International Conference on Scientific and
Statistical Database Management, Fairfax, Virginia, 2001.
[87] Jin-sha Yuan, Xin-ye Li, and Li-na Ma. An improved XML document clustering using
path feature. In Fifth International Conference on Fuzzy Systems and Knowledge
Discovery (FSKD ’08), volume 2, pages 400–404, Shandong, 2008.
[88] M. J. Zaki. Efficiently mining frequent trees in a forest. In SIGKDD, 2002.
[89] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between
trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989.