XML Clustering and Its Application to XML Transformation · 2013-11-13
XML Clustering and Its Application to XML
Transformation
Tien Tran
Discipline of Computer Science
Faculty of Science and Technology at
Queensland University of Technology
Brisbane, Australia
Principal Supervisor: Dr. Richi Nayak
Associate Supervisor: Professor Peter Bruza
Abstract
The continuous growth of XML data poses a great concern in the area of XML data
management. The need for processing large amounts of XML data brings complications
to many applications, such as information retrieval, data integration and many others.
One way of simplifying this problem is to break the massive amount of data into smaller
groups by application of clustering techniques. However, XML clustering is an intricate
task that may involve the processing of both the structure and the content of XML data
in order to identify similar XML data.
This research presents four clustering methods, two methods utilizing the structure of
XML documents and the other two utilizing both the structure and the content. The two
structural clustering methods have different data models. One is based on a path model and the other on a tree model. These methods employ rigid similarity measures which aim to identify corresponding elements between documents with different or similar underlying structures.
The two clustering methods that utilize both the structural and content information vary
in terms of how the structure and content similarity are combined. One clustering method
calculates the document similarity by using a linear weighting combination strategy of
structure and content similarities. The content similarity in this clustering method is
based on a semantic kernel. The other method calculates the distance between documents
by a non-linear combination of the structure and content of XML documents using a
semantic kernel.
Empirical analysis shows that the structure-only clustering method based on the tree
model is more scalable than the structure-only clustering method based on the path model
as the tree similarity measure for the tree model does not need to visit the parents of an
element many times. Experimental results also show that the clustering methods perform
better with the inclusion of the content information on most test document collections.
To further the research, the structural clustering method based on the tree model is extended and employed in XML transformation. The results from the experiments show that the proposed transformation process is faster than the traditional transformation system that translates and converts the source XML documents sequentially. Also, the schema matching process of XML transformation produces a better matching result in a shorter time.
Table of Contents
Abstract ii
List of Figures viii
List of Tables x
Statement of Original Authorship xi
Acknowledgements xii
Chapter 1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Research Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 2 Background and Related Work 13
2.1 XML Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 XML Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Structure-based Clustering . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1.1 Tree-based Approaches . . . . . . . . . . . . . . . . . . . . 22
2.2.1.2 Path-based Approaches . . . . . . . . . . . . . . . . . . . . 26
2.2.1.3 Graph-based Approaches . . . . . . . . . . . . . . . . . . . 28
2.2.2 Content-based Clustering . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.2.1 Feature Reduction . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2.2 Semantic Kernel . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.3 Content and Structure-based Clustering . . . . . . . . . . . . . . . . 35
2.2.3.1 Non-Linear Approaches . . . . . . . . . . . . . . . . . . . . 36
2.2.3.2 Linear Approaches . . . . . . . . . . . . . . . . . . . . . . . 38
2.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3 XML Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.1 Schema Matching Approaches . . . . . . . . . . . . . . . . . . . . . . 42
2.3.1.1 Schema-Matching Systems . . . . . . . . . . . . . . . . . . 44
2.3.1.2 Schema Matching for XML Clustering . . . . . . . . . . . . 46
2.3.1.3 Schema Matching for Transformation Approaches . . . . . 47
2.3.2 Transformation Approaches . . . . . . . . . . . . . . . . . . . . . . . 48
2.3.2.1 XSLT for XML Transformation . . . . . . . . . . . . . . . 49
2.3.2.2 Other Manipulation Languages for XML transformation . . 50
2.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Chapter 3 The Proposed Clustering Methods 53
3.1 The Proposed Clustering Methods: Overview . . . . . . . . . . . . . . . . . 54
3.2 The Structure-Only Clustering Methods . . . . . . . . . . . . . . . . . . . . 55
3.2.1 The XCTree Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1.1 The Tree Model . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1.2 The Tree Similarity Measure: TSim . . . . . . . . . . . . . 58
3.2.2 The XCPath Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.2.1 The Path Model . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.2.2 The Path Similarity Measure: CPSim . . . . . . . . . . . . 65
3.3 The Content and Structure-based Clustering Methods . . . . . . . . . . . . 68
3.3.1 The XCLComb Method . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3.1.1 The Tree Model and The Text Vector Model . . . . . . . . 70
3.3.1.2 The Linear Similarity Measure: LCSim . . . . . . . . . . . 70
3.3.2 The XCTPath Method . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.2.1 The Text-Path Vector Model . . . . . . . . . . . . . . . . . 71
3.3.2.2 The Non-Linear Measure: TPVSim . . . . . . . . . . . . . 73
3.3.3 The Kernel Construction Approach . . . . . . . . . . . . . . . . . . . 74
3.4 The Hybrid Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4.1 The Incremental Clustering Stage . . . . . . . . . . . . . . . . . . . 79
3.4.2 The Iteration Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4.3 The Pair-wise Clustering Stage . . . . . . . . . . . . . . . . . . . . . 84
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Chapter 4 Empirical Evaluation of the Clustering Methods 86
4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2 Data Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.3 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3.1 Purity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.3.2 Normalized Mutual Information . . . . . . . . . . . . . . . . . . . . 94
4.3.3 F1-Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.5 Results of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.5.1 Analysing the Structure-only Clustering Methods . . . . . . . . . . . 99
4.5.1.1 Clustering Threshold . . . . . . . . . . . . . . . . . . . . . 99
4.5.1.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5.1.3 Path Threshold . . . . . . . . . . . . . . . . . . . . . . . . 102
4.5.1.4 Three Stages of the Hybrid Clustering Algorithm . . . . . 105
4.5.1.5 Methods Comparison . . . . . . . . . . . . . . . . . . . . . 108
4.5.2 Analysing the Content and Structure-based Clustering Methods . . 110
4.5.2.1 Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5.2.2 Weighting in the XCLComb method. . . . . . . . . . . . . 113
4.5.2.3 Content-Only Comparison . . . . . . . . . . . . . . . . . . 114
4.5.2.4 Path Length in the XCTPath Method . . . . . . . . . . . . 115
4.5.2.5 Content and Structure-based Methods Comparison. . . . . 117
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Chapter 5 XML Transformation Approach 124
5.1 The XML Transformation Approach: Overview . . . . . . . . . . . . . . . . 125
5.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.3 Element Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.3.1 Discovery of Corresponding Leaf Elements . . . . . . . . . . . . . . . 132
5.3.2 Discovery of All Corresponding Elements . . . . . . . . . . . . . . . 135
5.4 Transformation Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.5 XSLT Script Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.6 Results of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.6.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.6.2 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.6.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.6.4 Element Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Chapter 6 Conclusion 152
6.1 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.2 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.3 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Chapter 7 Appendix 157
7.1 DTD Definitions of the Data Collections for XML Clustering Methods . . . 157
7.2 DTD definitions for the XML Transformation Approach . . . . . . . . . . . 157
Publications 164
Bibliography 165
List of Figures
1.1 The current approach for XML transformation process. . . . . . . . . . . . 7
1.2 The proposed approach for XML transformation process. . . . . . . . . . . 8
2.1 The classification of XML data. . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 An example of a conference XML document . . . . . . . . . . . . . . . . . . 16
2.3 An example of a conference DTD definition . . . . . . . . . . . . . . . . . . 16
2.4 An example of a conference XSD definition . . . . . . . . . . . . . . . . . . 17
2.5 A generic XML data clustering process . . . . . . . . . . . . . . . . . . . . . 19
2.6 Tree representation of the XML document structure . . . . . . . . . . . . . 23
2.7 Complete paths extracted from the tree model in Figure 2.6. . . . . . . . . 26
2.8 Graph representation of an XML definition . . . . . . . . . . . . . . . . . . 29
2.9 Bipartite graph representation . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.10 The VSM model with term frequency . . . . . . . . . . . . . . . . . . . . . 32
2.11 The classification of the XML clustering approaches for XML data. . . . . . 40
2.12 The transformation process for XML data. . . . . . . . . . . . . . . . . . . . 42
3.1 An overview of the proposed clustering methods. . . . . . . . . . . . . . . . 55
3.2 An example of a tree structure (a) and its corresponding summary tree
structure in depth-first string tree encoding format (b). . . . . . . . . . . . 57
3.3 An example of the treeMatching algorithm from tx to ty. . . . . . . . . . . . 63
3.4 An example of the treeMatching algorithm from ty to tx. . . . . . . . . . . . 64
3.5 CNC matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.6 An example of a conference XML document . . . . . . . . . . . . . . . . . . 73
3.7 The hybrid XML clustering approach overview . . . . . . . . . . . . . . . . 78
4.1 The effect of the clustering threshold on the XCTree and XCPath methods. 100
4.2 The processing time of the structure-only clustering methods. . . . . . . . . 102
4.3 The effect of the path thresholds with the clustering threshold of 0.9. . . . . 104
4.4 The effect of the path threshold with different clustering thresholds on the
XCPath method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5 The accuracy of the clustering solution at the three stages of the XCTree
method at clustering threshold 0.9. . . . . . . . . . . . . . . . . . . . . . . . 107
4.6 The accuracy of the clustering solution at the three stages of the XCPath
method at the clustering threshold of 0.9 and path threshold of 0.7. . . . . 108
4.7 The comparison of different structure-only clustering methods. . . . . . . . 109
4.8 The sensitivity of the k value on the kernel. . . . . . . . . . . . . . . . . . . 112
4.9 The effect of the lambda of the XCLComb method. . . . . . . . . . . . . . . 114
4.10 The comparison of the different content clustering methods. . . . . . . . . . 116
4.11 The comparison of the different path length of the XCTPath method. . . . 117
4.12 The comparison of the clustering methods utilizing semantic kernel. . . . . 118
4.13 The comparison of all methods on the Niagara collection. . . . . . . . . . . 121
4.14 The comparison of all methods on the Publication collection. . . . . . . . . 121
4.15 The comparison of all methods on the DBLP collection. . . . . . . . . . . . 122
4.16 The comparison of all methods on the IEEE collection. . . . . . . . . . . . . 122
5.1 The XCTrans approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.2 An example of source document structures in the same cluster. . . . . . . . 130
5.3 An example of a source summary structure format. . . . . . . . . . . . . . . 130
5.4 An example of a target structure definition represented in a tree formats. . 131
5.5 Element mapping algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.6 Element mapping result. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.7 An example of an XSLT Script. . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.8 XML transformation process time on the dataset. . . . . . . . . . . . . . . . 147
5.9 The processing time in seconds in relation to the number of documents in
the DBLP collection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.10 The processing time in seconds with the different numbers of clusters on
the DBLP collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.11 The mapping accuracy based on recall measure. . . . . . . . . . . . . . . . . 149
5.12 The mapping accuracy based on precision measure. . . . . . . . . . . . . . . 149
7.1 An example of the IEEE article DTD definition . . . . . . . . . . . . . . . . 158
7.2 An example of the Berkeley article DTD definition . . . . . . . . . . . . . . 158
7.3 An example of the HCI article DTD definition . . . . . . . . . . . . . . . . 159
7.4 An example of the DBLP article DTD definition . . . . . . . . . . . . . . . 160
7.5 The source Bibliography article DTD definition . . . . . . . . . . . . . . . . 161
7.6 The target Bibliography article DTD definition . . . . . . . . . . . . . . . . 161
7.7 The source Movies DTD definition . . . . . . . . . . . . . . . . . . . . . . . 162
7.8 The target Movies DTD definition . . . . . . . . . . . . . . . . . . . . . . . 162
7.9 A portion of the source DBLP DTD definition . . . . . . . . . . . . . . . . 163
7.10 A portion of the target DBLP DTD definition . . . . . . . . . . . . . . . . . 163
List of Tables
2.1 An overview of the structure-only clustering approaches . . . . . . . . . . . 41
2.2 An overview of the content and structure-based clustering approaches . . . 41
4.1 Data collections for XML clustering . . . . . . . . . . . . . . . . . . . . . . 88
4.2 The classification of the data collections for XML clustering . . . . . . . . . 89
4.3 Details of the pre-processed data collections . . . . . . . . . . . . . . . . . . 92
4.4 The number of clusters generated at the incremental clustering stage with
different clustering thresholds. . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.1 Quantifier mapping between XSD and DTD . . . . . . . . . . . . . . . . . . 131
5.2 The leaf element mapping result . . . . . . . . . . . . . . . . . . . . . . . . 136
5.3 Transformation operators for corresponding elements . . . . . . . . . . . . . 139
5.4 Data collections for XML transformation . . . . . . . . . . . . . . . . . . . . 145
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements
for an award at this or any other higher education institution. To the best of my knowledge
and belief, the thesis contains no material previously published or written by another
person except where due reference is made.
Signature:
Date:
Acknowledgements
I would like to express my sincere gratitude and deep appreciation to my principal supervisor, Dr. Richi Nayak. This thesis would not have been possible without the guidance, encouragement and support of my principal supervisor.
I also thank my associate supervisor, Prof. Peter Bruza, for his valuable time and support.
Special thanks to Helen for her careful proofreading of most chapters in this thesis in a
very short time.
I would like to acknowledge the High Performance Computing and Research Support (HPC) services at Queensland University of Technology (QUT) for providing the supercomputer account on which most of my clustering methods were run.
I am indebted to my many colleagues, especially to Sangeetha Kutty, for their cheer and
support throughout the course of my PhD research.
Finally, I would like to thank my family and friends for their encouragement and support.
Chapter 1 Introduction
The growth in the amount of XML data is inevitable as an increasing number of organizations are starting to take advantage of the Web for data distribution [2, 74]. Consequently, the need for better managing and analysing large collections of XML data is indisputable.
For better management, many researchers have focused their attention on the clustering of XML data [2]. Clustering is a data mining technique for grouping objects into smaller groups according to their feature commonality [25]. XML clustering has played a crucial role in many application domains, such as information retrieval, data integration, document classification, Web mining and query processing [37, 5, 70]. Therefore, the first part of the thesis explores methods for the clustering of XML documents. The first key hypothesis of this research is that clustering methods utilizing both the content and structural information of XML documents produce a more accurate clustering solution than clustering methods utilizing only the content or only the structure of XML documents.
The second part of the research is to use XML clustering for XML transformation. XML transformation is the process of converting the structural representation of an XML document (the source document) into another given structure (the target document). One problem with this process is that the generation of a transformation script is time consuming [31, 55, 54, 65, 67]. For example, if there are ten source documents which need to be converted into the same target document, then the transformation process has to be executed ten times. However, if among these source documents there are some similar substructures, such as in publication articles, then these substructures are repeatedly processed by the schema-matching process when finding corresponding structures in the target document. One way of reducing the work of the schema matching process is by integrating the structures of these documents into a global summary structure, known as schema integration [37]. Integration of similar substructures raises the problem of document heterogeneity, which occurs when documents that are semantically the same have different structures and element names. One way of simplifying the document heterogeneity of large amounts of XML documents is through a clustering process which groups the XML documents based on their similar data and/or structure [37].
Thus, the second key hypothesis of this research is that XML clustering based on the structural information of XML documents can improve the transformation process in terms of time and accuracy for the conversion of more than two source documents into the same target document.
To clearly understand the objective of this research, this chapter outlines the motivation, research questions, aims and contributions of the research. The chapter concludes with the structure of this thesis.
1.1 Motivation
XML1 has become a popular data exchange language due to its flexibility of allowing users
to define their own XML schema definitions. However, such flexibility gives rise to the
problem of document heterogeneity because each organization or application can create its
own XML data according to specific requirements. The document heterogeneity problem
constantly appears in the area of XML transformation [8]. XML transformation is an
important process for data distribution and message exchange of XML over the Web. For
instance in e-business, different companies may have different structures (schemas) for
representing the same information such as the invoice data. In order for the companies
to process the invoice data sent by their suppliers, a transformation process is necessary.
The transformation process is used to extract the invoice data, which is represented in
the structural format of the suppliers, and to store the data in structures that the application systems of the companies can process.
With the continuous growth of XML data on the Web, the problem of document heterogeneity becomes more difficult to manage. To simplify the document heterogeneity problem in large XML data collections, a process such as clustering is used. XML clustering, or clustering in general, is the task of partitioning large amounts of data or objects into small groups of data with similar characteristics [25]. The clustering process is useful for many applications such as schema/data integration, data warehousing, information retrieval, etc. [2, 74]
The following is a discussion of the background of XML clustering, XML transformation,
1http://www.w3.org/XML/
and the related issues for a better understanding of this research.
XML Clustering
In general, there are three tasks in the clustering process: data modelling, data similarity,
and data partitioning [25]. In order to cluster the XML data, a data model such as the
Vector Space Model (VSM) [60] or tree-based model is employed to capture the semantic
content and/or structural relationships in the XML data. Based on the data model, a
data similarity measure is defined to calculate the distance between the data instances.
Finally, based on the similarity value, a clustering algorithm can be applied to group XML
data. For example, if the data is represented as a tree then the tree edit distance [89, 12,
52, 68, 16] can be used to measure the data distance. However, if a data model such as the
VSM model is used then similarity measures such as the cosine or Euclidean distance [10]
can be used. After data similarity is defined, clustering algorithms such as hierarchical or
partitioned clustering can be used to group the XML data.
Due to the popularity of XML in document representation, a myriad of XML clustering methods can be found in the literature. However, many existing clustering approaches [84, 86, 80, 21] have not been able to efficiently combine the structure and content for the clustering of XML documents. Approaches such as Yang et al. [80] and Yoo et al. [86] use complex models to represent the structural and content information. Such approaches consume an inordinate amount of memory space. On the other hand, the approaches of Yao et al. [84] and Doucet et al. [21] have utilized the VSM model to combine the structural and content information contained within XML documents. These latter approaches are less complex than the former; however, they may suffer a loss of accuracy because only one dimension is used to represent both the structural and content information.
This thesis presents two clustering methods which utilize both the content and structure of XML documents. The first clustering method uses two different models and similarity measures for the content and structure. For the content, a semantic kernel is used, whereas for the structure, a tree model is used with a tree similarity measure. The term ‘content’ in this thesis refers to the data of the XML documents, which does not include the elements defined in the schema definitions. In terms of memory space, the first clustering method requires less memory than approaches such as Yang et al. [80] and Yoo et al. [86] because the use of two separate models for the content and structure is easier to process. The document similarity is ascertained by linearly combining the content similarity value and the structural similarity value with different weightings. The linear combination measure is applicable to homogeneous as well as heterogeneous collections. Homogeneous collections contain XML documents conforming to the same schema definition, whereas heterogeneous collections contain XML documents conforming to different schema definitions.
The second proposed method represents the content and structural information together
as a collection of text paths similar to Yao et al. [84]. However, instead of using the VSM
model, the proposed method calculates the document similarity using a semantic kernel.
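As a generic illustration of the semantic-kernel idea (not the thesis's specific kernel construction, which is developed in Chapter 3), document similarity can be computed as k(d1, d2) = d1^T S d2, where S encodes pairwise term similarity. The terms and the values in S below are invented for the example.

```python
# Toy semantic-kernel similarity: k(d1, d2) = d1^T * S * d2, where S is a
# term-by-term semantic similarity matrix.  Terms and S values are invented.
terms = ["car", "automobile", "xml"]
S = [
    [1.0, 0.9, 0.0],   # "car" is semantically close to "automobile"
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
d1 = [1, 0, 0]   # a document mentioning "car"
d2 = [0, 1, 0]   # a document mentioning "automobile"

def kernel(a, b):
    return sum(a[i] * S[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

plain_dot = sum(x * y for x, y in zip(d1, d2))   # 0.0: no shared terms
semantic = kernel(d1, d2)                        # 0.9: related terms matched

print(plain_dot, semantic)
```

Because S relates "car" to "automobile", the kernel assigns a non-zero similarity to documents that share no literal terms, which a plain dot product on the VSM misses.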
XML transformation
The two most important tasks in the XML transformation process, which converts the data from the source document format to the target document format, are (1) the generation of mapping rules by finding the matching elements between the source document and the target document, known as schema matching, and (2) the generation of a script that processes these mapping rules with a manipulation transformation language such as the eXtensible Stylesheet Language Transformation (XSLT)2.
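In code, the two tasks can be sketched as a toy Python stand-in: a dictionary plays the role of the mapping rules that schema matching would produce, and a small loop plays the role of the generated script. The element names and mapping rules here are invented for illustration; a real system would emit an XSLT stylesheet rather than manipulate trees directly.

```python
import xml.etree.ElementTree as ET

# (1) Mapping rules, as schema matching would produce them: each source
# element name is paired with its corresponding target element name.
# These names are hypothetical, chosen only for the example.
mapping = {"writer": "author", "heading": "title", "yr": "year"}

source = ET.fromstring(
    "<paper><writer>T. Tran</writer>"
    "<heading>XML Clustering</heading><yr>2009</yr></paper>"
)

# (2) Script generation/execution: in practice an XSLT script is
# generated from the mapping rules; here we apply them directly,
# rebuilding the document under the target structure.
target = ET.Element("article")
for child in source:
    ET.SubElement(target, mapping[child.tag]).text = child.text

print(ET.tostring(target, encoding="unicode"))
```

The output is the same data re-expressed in the target structure, which is exactly what the generated XSLT script achieves for every source document at once.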
XML transformation is a complicated task. For example, XML designers can define their own tags; therefore, XML documents representing the same information may not share the same structure. Moreover, the XML language contains cardinality operators, defined in the schema, which determine how many instances of an element type are permitted in an XML document. These operators mean that XML documents derived from the same schema may contain structures of varied lengths.
Not many existing XML transformation approaches [31, 55, 54, 65, 67] address the problem of XML transformation using multiple XML documents. Furthermore, most XML transformation approaches only address the transformation problem between one source document and one target document. The problem of dealing with multiple XML sources has been addressed by researchers [51, 59] in the area of schema integration to resolve structural conflicts such as nesting discrepancies and backward path representations; however, the work in the area of schema integration has not gone further to apply the mediated schema in the XML transformation application.
Schema integration is desirable in situations where the target document changes regularly, since the transformation process between the source documents and each new target document would otherwise need to be executed repeatedly.
2www.w3.org/TR/xslt
Figure 1.1: The current approach for XML transformation process.
To solve the above problem, schema integration (or structural integration) can be performed on the source documents to create a global summary structure of the structures in
the source documents. In this case, the schema matching in the transformation process only needs to be performed between the global summary structure and the target document. However, schema integration is a complex task when the source documents are very different in structure. Thus, a task such as clustering can be employed which first groups the source documents into similar structures before performing schema integration. For instance, Figure 1.2 illustrates how the five input source documents are processed in the proposed approach. Let us assume, based on the five input source documents, that three clusters can be formed according to their structural similarity. The concept of schema integration can then be applied by simply combining the structure of the source documents held within each cluster into a global summary structure. The global summary structure acts like a source document definition which can then be used in the transformation process; thus, with four different target documents, the transformation process needs to be executed only twelve times (three global summary structures against four targets) rather than twenty (five sources against four targets). The clustering process needs to be executed only once if the source documents do not change. The saving in time would be significant if there were a large number of source documents that needed to be transformed.
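The saving can be quantified with a simple count of transformation executions, taking the numbers from Figures 1.1 and 1.2 (five sources, three clusters, four targets):

```python
sources, n_clusters, targets = 5, 3, 4

# Traditional approach: every source document is matched and transformed
# against every target document.
without_clustering = sources * targets      # 5 * 4 = 20 executions

# Proposed approach: each cluster's global summary structure is matched
# against every target; the one-off clustering cost is amortized.
with_clustering = n_clusters * targets      # 3 * 4 = 12 executions

print(without_clustering, with_clustering)
```

The gap widens as the number of source documents grows while the number of distinct structures (clusters) stays small.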
Figure 1.2: The proposed approach for XML transformation process.
1.2 Research Questions
Due to the existing limitations of the XML clustering and transformation processes, this thesis addresses two main questions:
1. Can the accuracy of the clustering solution be improved by using both the structure
and content of XML documents?
2. Given a collection of source XML documents and a target document, can the grouping of the source documents into small sets of similar structures improve the processing time and accuracy of the XML transformation?
The first question responds to the first key hypothesis of this research, which is to study the impact on the clustering solution of using the structure as well as the content of XML documents. The second question responds to the second key hypothesis, which is to explore whether structural clustering can improve the accuracy of the schema matching task as well as the running time of the transformation process.
1.3 Research Aim
The first objective of this research, in response to the first research question, is to develop a
number of clustering methods by utilizing the different features, structure and/or content,
of XML documents in the clustering process. These clustering methods are then analysed
to see the impact of the different features on the clustering solutions.
The second objective, corresponding to the second research question, is to develop an XML
transformation approach which utilizes structural clustering as a pre-processing stage. The
structural clustering is expected to reduce the complexity in the structural integration of
the source documents and in the generation of the transformation script for transforming
multiple source documents into a target document simultaneously.
1.4 Contributions
This research has the following contributions.
1. A hybrid clustering algorithm has been proposed which utilizes both the partitioning
clustering and hierarchical clustering process. The proposed clustering algorithm
aims to balance the drawbacks of these two existing processes. Empirical results show
that the proposed clustering algorithm has been able to improve the scalability of
the pair-wise clustering and to improve the accuracy of the clustering solution in the
incremental clustering.
2. A number of clustering methods have been developed for the grouping of the XML
documents: two structure-only clustering methods and two structural and content
clustering methods. The two structure-only clustering methods are based on two
different data models, the tree model and the path model. Two structural similarity
measures based on the tree model and the path model have been included in this
thesis. For the two structural and content clustering methods, the first clustering
method is based on a linear combination of the structural similarity, defined for
the tree model, and the content similarity, using a semantic kernel, with different
weightings. The second method is based on text paths, that is, paths which also carry their content information, measured using a semantic kernel; this is a non-linear combination of the content and structure. In the experimental results, the clustering methods that utilize both the structure and content of XML documents perform better than the structure-only clustering methods.
3. A transformation approach has been proposed which employs one of the structure-
only clustering methods for the pre-processing stage. The proposed approach can be
used for the conversion of more than two XML source documents to another XML
structure. After the grouping of the source documents, the structure of the source
documents in each group is then integrated (or combined) into a global summary
tree structure. Each group has a global summary tree structure which is used in
the schema matching process. Results show that by using the clustering process,
this approach can improve the scalability as well as the accuracy in comparison to
the traditional XML transformation system for the conversion of multiple source
documents into the same target document.
1.5 Thesis Structure
The following is an overview of this thesis. It is broken into the following chapters:
Chapter 2: Background and Related Work
This chapter begins with a brief description of the XML data and its structure. Next
is the background knowledge and related work of XML clustering followed by the back-
ground knowledge and related work of XML transformation. This chapter provides the
fundamental information for the rest of the chapters in this thesis.
Chapter 3: The Proposed Clustering Methods
This chapter describes the two structure-only clustering methods and the two structural
and content clustering methods which have been proposed in this thesis. This chapter
defines the different data models and similarity measures employed by the proposed clus-
tering methods. Each data model has a different similarity measure. This chapter also
introduces a new clustering algorithm which is used by the proposed clustering methods
for the grouping of XML documents.
Chapter 4: Empirical Evaluation of the Clustering Methods
This chapter empirically analyses the clustering methods on different data collections
and evaluation metrics. It compares all the proposed clustering methods which have
been developed in this thesis. The chapter starts with the evaluation of the structure-
only clustering methods. Following that is the evaluation of the structural and content
clustering methods. Finally, there is a discussion and comparison of all the proposed
clustering methods in this research.
Chapter 5: The XML Transformation Approach
This chapter investigates a solution to the second research question of XML transfor-
mation. A transformation approach has been developed which incorporates a clustering
process to improve the transformation process dealing with a collection of input source
documents. A number of experiments have been conducted to analyze the performance
of the proposed approach in terms of scalability. The chapter also compares the quality of the proposed element mapping against another existing element-mapping technique.
Chapter 6: Conclusion
This chapter concludes the thesis with a summary discussion of the obtained results
throughout the course of this research. It also includes the limitations of this thesis
and work that needs to be done in the future.
Chapter 2
Background and Related Work
This chapter discusses the background knowledge and the related work of XML clustering
and XML transformation. To understand why traditional clustering methods for text
documents are not sufficient in the grouping of XML data, this chapter begins with an
introduction to the XML data. Following that is the related work of XML clustering.
The existing XML clustering approaches are addressed according to the structure-only
approaches, content-only approaches, and structure and content approaches. After the
discussion of XML clustering, this chapter continues with the related work in the area of XML transformation, which includes the schema matching approaches and the transformation
languages for XML data. This chapter concludes by addressing the limitations of the
related work and how some of these limitations are approached in this research.
2.1 XML Data
Over the past decade, XML, the eXtensible Markup Language (http://www.w3.org/XML/), has become the standard for
data distribution and message exchange over the Internet and among various organizations
and computing applications [2, 74]. It is an extensible language because it allows users to
define their own markup symbols and to define the structure for representing the XML
data. It is a meta-language which can be used to define other new mark-up languages as
well. The common uses of XML include:
Information Identification - user defined mark-ups have meaningful names which can
be used to identify the text content of a document;
Information Storage - XML can be used to store textual information across any
platform and application;
Information Structure - any kind of hierarchical structure can be defined for storing
any data whether it is simple or complex in structure;
Publishing - a style language such as XSL, the eXtensible Stylesheet Language (http://www.w3.org/Style/XSL/),
can be used to publish the data of an XML document to another format such as
HTML for web viewing, PDF for electronic paper viewing and many others; and
Web Services - it provides a common language for inter-process communication. The
majority of web services such as weather services, e-commerce sites, blog newsfeeds,
and thousands of other data-exchange services use XML for data management and
transmission.
Figure 2.1: The classification of XML data.
Figure 2.1 illustrates the different categories of XML data. The two forms of XML data
are documents (Figure 2.2) and schemas (Figures 2.3 and 2.4). XML schema data contains the grammar that restricts the syntax and structure of accompanying XML documents. The
two most popular languages for defining an XML schema are Document Type Definition
(DTD) (Figure 2.3) and XML-Schema Definition (XSD, http://www.w3.org/XML/Schema) (Figure 2.4). XSD is an enhancement of DTD with additional features such as namespace support and
more data types. Many documents can conform to the same XML schema definition.
There are two types of XML document collection: (1) a collection that contains documents
conforming to the same schema definition is called a homogeneous collection; and (2) a
collection that contains documents conforming to different schema definitions is called a
heterogeneous collection.
<?xml version="1.0"?>
<!DOCTYPE conf SYSTEM "conf.dtd">
<conf id="IE06">
<title> The 16th ACM SIGKDD Conference on Knowledge Discovery and Data mining (KDD-2010)</title>
<year> 2010 </year>
<editor>
<person>
<name>Peter Gavin</name>
<email>[email protected]</email>
<phone>61-9828712</phone>
</person>
</editor>
<paper>
<title>Mining the structure for XML document clustering</title>
<author>
<person>
<name>Susan Smith</name>
<email>[email protected]</email>
</person>
</author>
<reference>
<paper>
<title>A Survey of XML Similarity Measures</title>
<author>
<person>
<name>David MacDonald</name>
<email>[email protected]</email>
</person>
</author>
</paper>
</reference>
</paper>
</conf>
Figure 2.2: An example of a conference XML document
<!ELEMENT conf (title, year, editor?, paper*)>
<!ATTLIST conf id ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT editor (person*)>
<!ATTLIST editor eids IDREFS #IMPLIED>
<!ELEMENT paper (title, author, references?)>
<!ELEMENT author (person*)>
<!ELEMENT person (name, email, phone?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT references (paper*)>
Figure 2.3: An example of a conference DTD definition
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.conferences.org" xmlns="http://www.conferences.org"
elementFormDefault="qualified">
<xsd:element name="conf">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="title" minOccurs="1" maxOccurs= "1"/>
<xsd:element ref="year" minOccurs="1" maxOccurs= "1"/>
<xsd:element ref="editor" minOccurs="0" maxOccurs= "unbounded"/>
<xsd:element ref="paper" minOccurs="1" maxOccurs= "unbounded"/>
</xsd:sequence>
<xsd:attribute ref="id" use="required"/>
</xsd:complexType>
</xsd:element>
<xsd:element name="editor">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="person" minOccurs="1" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="paper">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="title" minOccurs="1" maxOccurs="1"/>
<xsd:element ref="author" minOccurs="1" maxOccurs="1"/>
<xsd:element ref="references" minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="author">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="person" minOccurs="1" maxOccurs="unbounded"/>
</xsd:sequence> </xsd:complexType> </xsd:element>
<xsd:element name="person">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="name" minOccurs="1" maxOccurs="1"/>
<xsd:element ref="email" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="references">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="paper" minOccurs="1" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:attribute name="id" type="xsd:string"/>
<xsd:element name="title" type="xsd:string"/>
<xsd:element name="year" type="xsd:string"/>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="email" type="xsd:string"/>
</xsd:schema>
Figure 2.4: An example of a conference XSD definition
An example of an XML document is given in Figure 2.2. An XML document contains two main types of information: (a) the markup and (b) the content. The main markup which defines the logical components of an XML document is called an element, e.g., title, year, person, etc. A
markup construct that begins with “<” and ends with “>” is a tag. The text between the
start-tag and end-tag of an element is the content. Another component worth mentioning
is the attributes of an element which exists within the start-tag of the element, e.g., the
id attribute of the element conf.
There are many relationships that can exist between elements [89, 12, 68]. A child-parent relationship occurs when an element is contained within another element and they are only one level apart. Consider the document in Figure 2.2 for example: the element title is the child of the element conf, and the element conf is the parent of the element title. A sibling relationship exists when two elements have the same parent; for instance, the elements name, email and phone are siblings. When an element is contained within another element, they have a descendant-ancestor relationship regardless of whether a child-parent relationship holds. For instance, all the elements that exist within the start-tag and end-tag of the element conf, such as title, year, editor, person and paper, are its descendants, and the element conf is their ancestor.
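These relationships can be inspected programmatically. The following is a minimal sketch using Python's standard-library ElementTree parser on a trimmed version of the Figure 2.2 document (the name and email values here are placeholders, not the thesis's own data):

```python
import xml.etree.ElementTree as ET

# A trimmed version of the conference document in Figure 2.2
# (name and email are placeholder values).
doc = """
<conf id="IE06">
  <title>KDD-2010</title>
  <year>2010</year>
  <editor>
    <person><name>P. Gavin</name><email>pg@example.org</email></person>
  </editor>
</conf>
"""

root = ET.fromstring(doc)

# Child-parent: title, year and editor are immediate children of conf.
children = [child.tag for child in root]
print(children)        # ['title', 'year', 'editor']

# Sibling: name and email share the same parent, person.
person = root.find("editor/person")
siblings = [child.tag for child in person]
print(siblings)        # ['name', 'email']

# Descendant-ancestor: every element reachable from conf is a descendant.
descendants = [elem.tag for elem in root.iter() if elem is not root]
print(descendants)     # ['title', 'year', 'editor', 'person', 'name', 'email']
```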
The relationships between elements within an XML document define its structure. The
structure of an XML document is either ill-formed or well-formed [64]. An ill-formed
document does not follow the XML syntax, i.e., has no XML declaration statement or
no ending tag. However, a well-formed document follows the XML syntax which has the
following properties: has one root element; has unique opening and closing tags; and has
tags that are properly nested. A well-formed document that also conforms to its schema definition is known as a valid document; that is, the document contains no constructs that are not permitted by the schema definition.
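Well-formedness can be checked with any XML parser, since parsing fails exactly when the syntax rules above are violated. A small Python sketch (checking validity against a DTD or XSD requires an external validator and is not shown):

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the text follows XML syntax, i.e. it parses."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<conf><title>KDD</title></conf>"))  # True: properly nested
print(is_well_formed("<conf><title>KDD</conf>"))          # False: no matching end-tag
print(is_well_formed("<title/><year/>"))                  # False: more than one root
```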
2.2 XML Clustering
With the structure embedded in the XML data, both the data and structure are important
for the process of XML clustering in order to obtain a good clustering solution. Clustering
is a data mining task that groups or segments a collection of objects into subsets or
“clusters” that share similar characteristics [25].
Figure 2.5: A generic XML data clustering process
Figure 2.5 illustrates a generic clustering process. The clustering process consists of three
main tasks: data modelling, data similarity, and data partitioning [25].
Data Modelling
The input data is represented using a common data model that can capture the semantic
and/or structure information inherent in the input data collection. Some of the most
popular models for XML data are the tree-based model, graph-based model, vector-based
model, and path-based model.
Data Similarity
After the data modelling task, the data similarity task applies the most appropriate measure to compute the degree of similarity between objects in the data collection, utilizing the data model. The selection of the measure depends on the data model; for instance, if the tree-based model is used, then a measure such as the tree edit distance is commonly employed [52, 68, 16].
Data Partitioning
Once the data model and data similarity measure are determined for the input data collection, the next step is to choose a clustering algorithm that can partition the data taking similarity
into consideration. The two most popular types of clustering algorithms are incremental
clustering and pair-wise clustering.
Incremental Clustering - A simple incremental clustering segments the input data collection as follows: (1) the first data in the collection becomes the centroid for the first cluster; (2) the second data in the collection is compared with the existing cluster centroid (or cluster representation) using a similarity measure; and (3) the second data initiates a new cluster and becomes that cluster's centroid if the degree of similarity between it and the existing cluster centroid is not greater than a clustering threshold value; otherwise, it joins the existing cluster. The clustering threshold is the lowest possible value of similarity required to join two objects in one cluster. This value is determined by the user. Each subsequent object in the collection is processed in the same way as the second data. The clustering solution of this method is sensitive to the order of the input data collection. Incremental clustering is a type of partitioning clustering [74, 29].
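The steps above can be sketched as follows, here over numeric feature vectors with cosine similarity; as a simplification, the first object of each cluster serves as its centroid (real variants may update the centroid as objects join):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def incremental_cluster(objects, threshold):
    """Single-pass clustering: an object joins the most similar existing
    cluster, or starts a new one when no centroid similarity exceeds the
    user-chosen clustering threshold."""
    centroids, clusters = [], []
    for obj in objects:
        sims = [cosine(obj, c) for c in centroids]
        if sims and max(sims) > threshold:
            clusters[sims.index(max(sims))].append(obj)
        else:
            centroids.append(obj)   # the object initiates a new cluster
            clusters.append([obj])
    return clusters

data = [(1, 0, 0), (0.9, 0.1, 0), (0, 1, 0), (0, 0.9, 0.2)]
print(len(incremental_cluster(data, threshold=0.8)))  # 2 clusters
```

Raising the threshold splits the data into more clusters, which illustrates why the threshold choice is left to the user.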
Another popular partitioning clustering method is the k-means method. Given a set of objects (o1, o2, ..., on), where each object is a d-dimensional real vector, k-means clustering aims to partition the objects into k sets, where k < n. In this method, the number of clusters is pre-defined. K-means clustering is often preferred over hierarchical clustering as it is faster and easier to implement. One of the drawbacks of k-means clustering is the selection of k. The time complexity of the partitioning methods with only one pass through the input data collection is O(n log n), where n is the number of input data in the collection.
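A minimal k-means sketch over 2-dimensional points follows; for determinism it seeds the centroids with the first k points, whereas real implementations use random or k-means++ initialization:

```python
def kmeans(points, k, iters=20):
    """Plain k-means: repeatedly assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster.
    Simplification: the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Recompute means; keep the old centroid if a cluster emptied out.
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]: the two blobs separate
```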
Pair-wise Clustering - Pair-wise clustering partitions the input data collection based on a similarity matrix, which is obtained by calculating the similarity between all possible pairs of input data in the collection using a similarity measure. This clustering is a type of hierarchical agglomerative clustering method [29]. One of the best known hierarchical methods is the single link method. The single link method operates by joining, at each step, the two most similar objects, either between two input data or between an input data and an existing cluster. The time complexity of hierarchical clustering is at least O(n^2), where n is the number of input objects. Therefore, this method is limited to smaller collections.
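The single link method can be sketched as follows, with a naive implementation over one-dimensional objects and a caller-supplied distance function:

```python
def single_link(objects, distance, k):
    """Naive single-link agglomerative clustering: repeatedly merge the
    two clusters whose closest members are nearest, until k clusters
    remain.  The pair-wise distances make this at least quadratic in the
    number of objects, which is why it suits smaller collections."""
    clusters = [[o] for o in objects]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

dist = lambda a, b: abs(a - b)
print(single_link([1, 2, 10, 11, 12, 50], dist, k=3))
# [[1, 2], [10, 11, 12], [50]]
```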
The rest of this section discusses the existing XML clustering approaches which are cat-
egorized into structure-based, content-based, and structure and content-based clustering
approaches according to the features of the XML documents used.
2.2.1 Structure-based Clustering
There are many data models that can represent the structure of XML data. This section
will present the approaches according to the data models that they use for XML clustering, such as the tree, path and graph models.
2.2.1.1 Tree-based Approaches
The most popular model for representing the structure of XML data is a tree model [89,
12, 68]. A tree is denoted as T = (V, v0, E, f), where V is the set of nodes, v0 is the
root node which does not have incoming edges, E is the set of edges, and f is a mapping
function f : E → V ×V . In a tree model, the components such as elements and attributes
that define the structure of an XML document are referred to as nodes. There are many
different nodes in an XML tree structure such as the element nodes, data (or text) nodes,
comment nodes, a document node and many others. The edges are the child-parent
relationships between the nodes in the tree. An example of the rooted labelled tree model corresponding to the XML document in Figure 2.2 is shown in Figure 2.6. The figure shows the relationships between the nodes in the tree structure. The immediate children of the conf node are title, year, editor, and paper. The children of conf's immediate children are its descendants. The dotted line is the attribute id of the conf node.
There are a number of similarity measures for the tree-based approaches, namely, the
tree edit distance, frequent subtree mining, and level similarity. The tree edit distance
calculates the minimum cost (or distance) of transforming from one tree structure to
Figure 2.6: Tree representation of the XML document structure
another. The frequent subtree mining is the extraction of the most common sub-structures
existing in a collection of tree structures. The trees are clustered based on these sub-
structures. Finally, there are similarity measures based on level similarity, which take into account the levels of the nodes; level similarity is based on the assumption that associated nodes should appear at the same level. In the following sub-sections, the
approaches based on these similarity measures are discussed in more detail.
Tree Edit Distance
As XML documents can be easily modelled as a tree, many researchers [89, 12, 52, 68, 16]
have adapted the tree edit distance for finding the distance between trees. Tree edit distance is usually based on dynamic programming techniques for string-to-string corrections [76]. An edit script is a sequence of tree edit operations, such as insert node, delete node and replace node, that transforms one tree into another tree. The tree edit distance between two trees is the minimum among the costs of all possible tree edit
sequences.
The approach of Zhang and Shasha [89] allows the edit operations to be performed anywhere in a tree. The complexity of this approach is O(|t1||t2| depth(t1) depth(t2)), where t1 and t2 are two trees. The approaches of Nierman et al. [52] and Tekli et al. [68] have expanded the work of Chawathe [11], which restricts insertion and deletion to the leaf nodes only. Nierman et al. [52] introduce two new operations, insert tree and delete tree, to allow insertion and deletion of whole sub-trees. The complexity of the latter approaches is O(|N|D), where |N| is the total number of nodes in the two trees and D is the number of misaligned nodes. On the other hand, Tekli et al. [68] extended the edit operations to measure the semantics of the labels of nodes, which also takes into consideration the depth of the nodes in a tree. The work of Dalamagas [16] claims that real XML documents tend to have many repeated nodes, which affect the performance of the tree edit algorithms. The authors introduce a summary tree structure in which the repeated nesting nodes are reduced (or removed) from the rooted labelled trees.
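The string-to-string correction that tree edit distance builds on can be sketched as a dynamic programming recurrence; shown here on plain label sequences rather than trees, so it illustrates the edit-script idea but is not a full tree edit distance:

```python
def edit_distance(s, t, cost_ins=1, cost_del=1, cost_rep=1):
    """Dynamic programming for the minimum-cost edit script
    (insert, delete, replace) turning sequence s into sequence t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * cost_del            # delete everything in s
    for j in range(1, n + 1):
        d[0][j] = j * cost_ins            # insert everything in t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            rep = 0 if s[i - 1] == t[j - 1] else cost_rep
            d[i][j] = min(d[i - 1][j] + cost_del,
                          d[i][j - 1] + cost_ins,
                          d[i - 1][j - 1] + rep)
    return d[m][n]

# Distance between two root-to-leaf label sequences: one replacement.
print(edit_distance(["conf", "editor", "person", "name"],
                    ["conf", "author", "person", "name"]))   # 1
```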
Frequent Subtree Mining
Computing the edit distance between each pair of trees is expensive for XML clustering; therefore other approaches [69, 88, 33, 32, 39] have been developed that extract and mine subtrees from the whole tree structure. The approach of Termier et al. [69] first clusters the trees based on the occurrence of the same pairs of labels in the ancestor relation using the Apriori algorithm. After the trees are clustered, a maximal common tree is computed to measure the commonality of each cluster to all the trees. This algorithm cannot find all frequent patterns in the subtrees of the ordered labelled tree. To fill the gap, Zaki [88] proposed an algorithm to discover all subtrees in a forest (meaning a large collection of ordered trees). Other methods, such as Kutty et al. [33, 32] and Lin et al. [39], have extended frequent sub-tree mining for finding common sub-trees. The output of the common sub-trees is used in the clustering of XML documents. These researchers claim that clustering the XML documents by extracting the subtrees, rather than using the entire structure of the datasets, is more efficient in terms of scalability and accuracy.
Level Similarity
XCLS [47] extends transactional data clustering algorithms such as CLOPE [82] to XML documents by defining a new concept called the level similarity. The level similarity measures the structural similarity between two objects (tree-tree, cluster-cluster, tree-cluster) by considering their common items in the corresponding levels and giving different weights to different levels. Unlike other approaches that are based on pair-wise similarity between two trees, XCLS computes the level similarities between a tree and the existing clusters, and moves the tree to the cluster which has the maximum level similarity with the tree. Using this approach, the computation time is reduced significantly. The limitation of the XCLS approach is that it does not preserve the child-parent relationship or the sibling relationship. Thus, XCLS+ [4] addresses this limitation of XCLS by using edges rather than nodes only. Another study which extends XCLS further is XEdge [6]. XEdge not only uses the edges for the node representation but also extends the clustering algorithm by using the k-means algorithm. It claims that it can cluster both homogeneous and heterogeneous XML collections.
2.2.1.2 Path-based Approaches
In recent years, a great number of approaches have represented the XML data by breaking down the tree structure into paths. A path model represents the structure of XML
documents as a collection of paths (or transactions as used in database communities [7]).
An XML path can be of two types: complete path and partial path. A complete path
contains the nodes from the root to the leaf node in sequence order. Consider the example
tree model in Figure 2.6, the corresponding complete paths of the tree model are shown
in Figure 2.7. The number of complete paths is equal to the number of leaf nodes in the
tree model.
conf/title, conf/year, conf/id,
conf/editor/person/name, conf/editor/person/email, conf/editor/person/phone,
conf/paper/title, conf/paper/author/person/name, conf/paper/author/person/email,
conf/paper/reference/paper/title, conf/paper/reference/paper/author/person/name,
conf/paper/reference/paper/author/person/email
Figure 2.7: Complete paths extracted from the tree model in Figure 2.6.
A partial path contains the nodes from node m to node n in sequence order in which
node m is the ancestor of node n, and nodes m and n appear in the same complete
path. A complete path can have many partial paths. For example, the complete path conf/paper/reference/paper/author/person/name can have the following partial paths of varied lengths: conf/paper, conf/paper/reference, paper/reference, paper/reference/paper/author/person/name, etc.
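Complete paths such as those in Figure 2.7 can be extracted with a depth-first walk over the document tree. A sketch using Python's standard-library ElementTree on a trimmed document:

```python
import xml.etree.ElementTree as ET

def complete_paths(elem, prefix=""):
    """Collect one root-to-leaf tag path per leaf node."""
    path = f"{prefix}/{elem.tag}" if prefix else elem.tag
    children = list(elem)
    if not children:
        return [path]
    paths = []
    for child in children:
        paths.extend(complete_paths(child, path))
    return paths

doc = ET.fromstring(
    "<conf><title>KDD</title>"
    "<editor><person><name>A</name><email>a@x</email></person></editor></conf>"
)
print(complete_paths(doc))
# ['conf/title', 'conf/editor/person/name', 'conf/editor/person/email']
```

As noted above, the number of complete paths equals the number of leaf nodes.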
Similarity measures for a path model can be categorized into sequential pattern mining
and schema matching techniques. A sequential pattern mining technique is similar to a subtree mining approach; however, it extracts all the common paths of documents. These common
paths are used for clustering of XML data. On the other hand, a schema matching tech-
nique finds corresponding elements between two schema definitions. It has been employed
by a few researchers in calculating the distance between paths. Research relating to these
similarity measures is described in the following sub-sections.
Sequential Pattern Mining
For finding the common paths between documents, many techniques [38, 35, 27, 48, 3] have
incorporated the idea of sequential pattern mining to extract the frequent paths from XML
documents. Considering an XML document as a transaction and paths of documents as
items of the transaction, these techniques find the complete set of frequent sequences from
the set of paths.
Techniques such as Leung et al. [38] and Lee et al. [35] have utilized the idea of sequential pattern mining to extract common paths from a collection of XML trees to measure the structural similarity. However, these methods did not go further in terms of clustering. Hwang and Ryu [27] take a step further: they use sequential pattern mining to extract the frequent paths from the XML documents, treating an XML document as a transaction and the frequent structures of documents as the items of the transaction. Hwang and Ryu [27] then use CLOPE [82] as well as the notion of Large Items [26], clustering methods for transactional data, to cluster a collection of XML documents. XMine [48], by contrast, uses sequential pattern mining to infer the similarity between elements. This approach is for clustering and modelling the relationship between DTD schema definitions using a pair-wise schema distance matrix. On the other hand, XProj [3] uses the frequent sub-structures as the cluster representation for clustering the XML documents using the k-means algorithm.
Schema Matching
Besides using sequential pattern mining, some researchers have employed schema match-
ing techniques in finding the similarity between paths. Schema matching is the process
of finding corresponding elements between two schema definitions. The clustering methods [37, 50, 48, 45] which employ the schema matching technique are generally used in data integration applications. These clustering methods adopt a complex measure for determining similarity between the XML structures as well as the leaf nodes of two XML data. They calculate not only the similarity between the element names of XML nodes but also other properties of XML nodes, such as the data type and constraints. Above all, what these clustering methods try to measure is the similarity between the leaf nodes.
2.2.1.3 Graph-based Approaches
Often the tree-based model is used for representing XML documents, whereas a graph-based model is more suitable for representing an XML schema definition, to show the acyclic relationships of elements in the schema. An example of a graph model is shown in Figure 2.8, which corresponds to the conference DTD definition in Figure 2.3. In contrast to the tree representation, the graph representation in the figure also shows the cardinality
Figure 2.8: Graph representation of an XML definition
operators. Unlike the tree representation, the person node in the graph has two parents, which are the editor node and the author node. A graph can be defined as a triple (V, E, f), where V represents the set of vertices and E represents an edge set with a mapping function f : E → V × V. The vertices are the elements in the schema and the edge set consists of the links that connect the vertices, representing parent-child relationships.
Chawathe [12] computes the distance between XML documents using the concept of edge cover with a bipartite graph. A bipartite graph G is defined as G = (U, V, E), where U and V are two disjoint sets of nodes such that every edge connects a node in U to one in V and no edge connects nodes in the same set. An example of a bipartite graph is shown in Figure 2.9. In the figure, U and V stand for two different
Figure 2.9: Bipartite graph representation
XML documents and the dots represent the nodes in the XML documents. Chawathe's approach [12] establishes a bipartite graph by representing one tree structure as U and the other tree structure as V; an operation is then defined to convert a node from one tree to the other. Once all the possible edges for transforming the nodes in U into the nodes in V are established, the approach calculates the set of edges that connects all the nodes between the two graphs at the lowest possible cost. This is similar to the tree edit distance approaches.
A recent work of Yuan et al. [87] also employs the bipartite graph model to map common
paths between XML documents, where U now is a set of documents and V is a set of
paths. Documents that are closely related should have the most common paths shown
in the bipartite graph. It uses the Jaccard coefficient to compute the similarity between documents dx and dy, which is defined as:

Sim(dx, dy) = |N(dx) ∩ N(dy)| / |N(dx) ∪ N(dy)|     (2.1)

where N(dx) and N(dy) are the sets of paths contained by documents dx and dy respectively. Based on the Jaccard coefficient, a pair-wise similarity matrix is generated for the clustering of XML documents.
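Equation 2.1 is straightforward to compute once each document is reduced to its set of paths. A sketch (the example paths are illustrative):

```python
def jaccard(paths_x, paths_y):
    """Jaccard coefficient of Equation 2.1 over two sets of paths."""
    x, y = set(paths_x), set(paths_y)
    if not x | y:
        return 0.0
    return len(x & y) / len(x | y)

dx = ["conf/title", "conf/year", "conf/editor/person/name"]
dy = ["conf/title", "conf/year", "conf/paper/title"]
print(jaccard(dx, dy))   # 0.5: 2 shared paths out of 4 distinct
```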
2.2.2 Content-based Clustering
The most popular model for representing text content is the Vector Space Model (VSM) [60].
It is widely used in information retrieval, information filtering, indexing and relevancy
rankings. In the VSM model, the data content of XML data is broken down into a set of index terms. A document di is represented as a vector di = (t1, t2, ..., tm), where m is the number of unique index terms in the input data collection. An example of the VSM model for a collection of input data is shown in Figure 2.10. The vector-based model can represent the terms by their frequency (as seen in Figure 2.10), by a binary value (1 or 0, where 1 means that the feature exists in the document and 0 that it does not), or by weights. There are
several ways to compute the weights of features. A popular scheme is term frequency-
inverse document frequency(TF-IDF) weighting. For XML data, the “terms” refer to as
“feature”. Thus, the weight vector for document di is di = (w1i, w2i, ..., wmi) is defined
by:
wti,dj = tfi · log|D|
|ti ∈ |D||(2.2)
tfi is the term frequency of term ti in input data dj divided by the total number of term
frequencies in dj and log |D||ti∈|D|| is inverse document frequency. |D| is the total number
of input data in the collection;|ti ∈ |D|| is the number of input data D containing the
term ti. Another weight is Okapi-BM25 similar to the TF-IDF weighting that is employed
in XML clustering [75]. It has two tuning parameters which are K1 and b. K1 influences
the effect of the term frequency, whereas b affect the influence of the document length.
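A minimal sketch of TF-IDF weighting over tokenised input data (the tiny corpus below is invented for illustration):

```python
import math

def tfidf_weights(docs):
    """TF-IDF weights (equation 2.2) for a list of tokenised documents.

    tf is the term frequency divided by the document length; idf is
    log(|D| / number of documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        total = len(doc)
        weights.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return weights

docs = [["xml", "tree", "xml"], ["xml", "path"], ["graph", "path"]]
w = tfidf_weights(docs)
# "xml" occurs in 2 of the 3 documents, so its idf factor is log(3/2).
```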
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14
d1 1 0 2 0 0 0 0 5 0 0 3 0 2 2
d2 2 2 0 0 2 1 0 0 8 7 0 7 0 0
d3 2 3 0 9 0 6 4 0 0 0 0 0 0 0
d4 3 0 0 5 3 2 1 3 0 0 0 0 0 0
Figure 2.10: The VSM model with term frequency
The Okapi BM25 weighting for a given term ti in document dj is defined as:

w(ti, dj) = (CFW × tf(i) × (K1 + 1)) / (K1 × ((1 − b) + (b × NDLj)) + tf(i))    (2.3)

where CFW is the collection frequency weight for term ti, calculated as log(|D|) − log(ni) with ni the number of input data containing ti, and NDLj is the normalized document length for document dj, NDLj = |dj| / avg(|dj|), where |dj| is the length of document dj in terms of words and avg(|dj|) is the average document length in the text collection from which documents are drawn.
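Equation 2.3 can be sketched as follows; the parameter values and the form of CFW (log of the collection size over the document frequency) follow one common BM25 formulation and are illustrative assumptions, not values from the cited work:

```python
import math

def bm25_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 weight for one term in one document (equation 2.3).

    cfw is the collection frequency weight, here log(|D|) - log(document
    frequency); ndl is the normalized document length; k1 and b are the
    two tuning parameters discussed in the text.
    """
    cfw = math.log(n_docs) - math.log(doc_freq)
    ndl = doc_len / avg_doc_len
    return (cfw * tf * (k1 + 1)) / (k1 * ((1 - b) + b * ndl) + tf)

# A term occurring 3 times in a slightly long document, in a collection of
# 1000 documents of which 50 contain the term (all numbers hypothetical).
w = bm25_weight(tf=3, doc_len=120, avg_doc_len=100, n_docs=1000, doc_freq=50)
```

Raising K1 strengthens the effect of tf; raising b penalises long documents more strongly.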
After the modelling of the content and weightings, there are a number of similarity mea-
sures for finding the distance between instances of the vector-based model. Two of the
most commonly used measures are cosine and Euclidean. Other distances are discussed
in Cha’s survey paper [10] such as Jaccard, Manhattan and many others. The cosine and
Euclidean between two vectors, dx and dy, are defined as:
Cosine(dx, dy) =dx · dy
∥dx∥ ∥dy∥(2.4)
Euclidean(dx, dy) =
√√√√ m∑i=1
(dx,i − dy,i)2 (2.5)
where m is the number of terms in the XML document collection. The difference between the two is that the Euclidean measure takes into account the magnitude of the vectors, whereas the cosine measure considers only the angle between the two vectors.
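The contrast between equations 2.4 and 2.5 is easy to see on two vectors that point in the same direction but differ in length (toy values):

```python
import math

def cosine(dx, dy):
    """Cosine similarity (equation 2.4): angle only, magnitude ignored."""
    dot = sum(a * b for a, b in zip(dx, dy))
    nx = math.sqrt(sum(a * a for a in dx))
    ny = math.sqrt(sum(b * b for b in dy))
    return dot / (nx * ny)

def euclidean(dx, dy):
    """Euclidean distance (equation 2.5): sensitive to magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(dx, dy)))

# dy is dx scaled by 2: cosine reports identical direction, Euclidean does not.
dx, dy = [1.0, 2.0, 0.0], [2.0, 4.0, 0.0]
print(cosine(dx, dy))     # 1.0
print(euclidean(dx, dy))  # sqrt(1 + 4) = sqrt(5)
```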
The above similarity measures for the VSM model can be used for clustering of XML
documents using partitioning clustering or hierarchical clustering.
2.2.2.1 Feature Reduction
Clustering of XML data can be expensive due to the large number of terms present in the documents and the presence of many outliers or insignificant terms. Therefore, a number of XML clustering approaches [28, 40, 71] use feature reduction methods such as ICA (independent component analysis), PCA (principal component analysis) [56] or LSI (Latent Semantic Indexing) [34].
Given a dataset of XML documents d1, d2, ..., dn, an original term-document matrix X of size m × n can be derived, where m and n are the number of unique terms (or mark-up tags, or paths) and the number of documents in the dataset respectively. The LSI method applies Singular Value Decomposition (SVD) to the term-document matrix, which is decomposed into three matrices (equation 2.6), where U and V have orthonormal columns of left and right singular vectors respectively and S is a diagonal matrix of singular values ordered in decreasing magnitude:

X = U S V^T.    (2.6)
The SVD process optimally approximates matrix X in a k-dimensional document space, where k < n, by keeping the k largest singular values and setting the rest to zero. Matrix Uk of size m × k and matrix Vk of size n × k are retained along with the k × k singular value matrix Sk (equation 2.7):

X ≈ Uk Sk Vk^T.    (2.7)
The difference between the PCA and ICA methods is that PCA maximises the variance, and the projections onto its basis vectors are mixtures of the underlying sources, whereas ICA finds basis vectors onto which the projections are statistically independent; ICA can thus be seen as an extension of the PCA method. In ICA, the independent components can be derived from:

S(k×n) = W(k×m) · X(m×n)    (2.8)

where W, known as the unmixing matrix, is the inverse of matrix Uk. The independent components S(k×n) are used to represent the new document collection matrix. Using the reduced document collection matrix, a clustering algorithm such as K-means is used to cluster the document collection.
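The LSI reduction of equations 2.6 and 2.7 can be sketched with NumPy on a toy term-document matrix (the matrix values are invented):

```python
import numpy as np

# A toy term-document matrix X (m = 4 terms, n = 4 documents).
X = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 3., 1.],
              [0., 0., 1., 2.]])

# Full SVD: X = U S V^T (equation 2.6); s holds singular values in
# decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values (equation 2.7).
k = 2
Uk, Sk, VkT = U[:, :k], np.diag(s[:k]), Vt[:k, :]

X_k = Uk @ Sk @ VkT       # rank-k approximation of X
docs_reduced = Sk @ VkT   # each column: one document in the k-dim latent space
```

The columns of `docs_reduced` can then be fed to K-means in place of the original m-dimensional term vectors.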
2.2.2.2 Semantic Kernel
Term weightings alone are not sufficient for learning the latent relationships between terms that may be important in document similarity. Some approaches [37, 44, 50, 45]
have used a semantic dictionary such as WordNet [22] to measure the synonym sense of
keywords. However, using WordNet to find the synonym of words is expensive in terms of
processing time. A few works [80, 71] have utilized the idea of kernels for XML clustering.
The work of Yang et al. [80] builds kernels for learning the terms of the documents in their
true groups. The kernel is an m × m kernel matrix which captures both the similarity
between a pair of XML elements as well as the contribution of the pair to the overall
document similarity. A small entry in the kernel means that the corresponding two XML elements are considered semantically unrelated, so the same words appearing in the two elements should not contribute to the overall similarity, and vice versa. This kernel is then used for the clustering of the documents. Building the kernel is a supervised learning approach. The proposed
method in this thesis [71] presented in Chapter 3, on the other hand, builds a semantic
kernel based on latent semantic indexing [15]. For example, given two vectors, dx and
dy, the closeness of the semantic similarity of the two is measured as the cosine similarity
using the Uk generated from the LSI method defined in equation 2.7:
Sim(dx, dy) = (dx^T Uk Uk^T dy) / (∥Uk^T dx∥ ∥Uk^T dy∥)    (2.9)
A semantic kernel can be applied to unknown data and is much more flexible than using
the WordNet dictionary.
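Equation 2.9 is the cosine similarity of the two document vectors after projecting them through Uk; a small sketch with an artificial Uk (not one learned from real data):

```python
import numpy as np

def lsi_similarity(dx, dy, Uk):
    """Cosine similarity of two document vectors in the latent space
    spanned by the columns of Uk (equation 2.9)."""
    px, py = Uk.T @ dx, Uk.T @ dy
    return float(px @ py / (np.linalg.norm(px) * np.linalg.norm(py)))

# Toy projection: Uk keeps only the first two of four term dimensions, so
# the two vectors below, which differ only outside those dimensions,
# project onto the same latent direction.
Uk = np.eye(4)[:, :2]
sim = lsi_similarity(np.array([1., 2., 0., 5.]),
                     np.array([2., 4., 1., 0.]), Uk)
```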
2.2.3 Content and Structure-based Clustering
There are two ways to combine the content and structure of XML documents for XML
clustering - linear and non-linear. Non-linear approaches combine the content and the
structure of XML data in one model. On the other hand, linear approaches calculate the
similarity values of the content and structure separately. These values are then combined
with weightings to calculate the document similarity.
2.2.3.1 Non-Linear Approaches
A model for representing the content and structure of XML data together is the Structural
Link Vector Model (SLVM) [81]. SLVM represents both the structure and the content information of XML documents using vector linking, so that the structure and the content features are not forced into one vector space model. The SLVM model of an XML document dx is a document feature matrix ∆x ∈ R^(n×m), given as

∆x = [∆x(1), ∆x(2), ..., ∆x(m)]

where m is the number of distinct XML elements and ∆x(i) ∈ R^n is the TF-IDF feature vector representing the ith XML element, whose jth component (for j = 1 to n) is TF(tj, dx, ei) · IDF(tj), where TF(tj, dx, ei) is the frequency of the term tj in the element ei of dx. The basic similarity between two such document models can be measured using the cosine measure.
The SLVM is employed by Yang et al. [80] to represent documents as vectors of terms,
structures, and neighbouring documents. Yang et al. [80] use a kernel matrix to calculate
the document similarity:
Sim(dx, dy) = Σ_{i=1..n} dx(i)^T · Me · dy(i)

where Me is an m × m kernel matrix which captures the semantic similarity between pairs of elements.
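The kernel-weighted sum above can be sketched as follows; the matrices are hypothetical toy values, not learned kernels from the cited work:

```python
import numpy as np

def slvm_kernel_similarity(Dx, Dy, Me):
    """Document similarity Sim(dx, dy) = sum_i dx(i)^T . Me . dy(i).

    Dx and Dy hold one feature vector per summation index (one row each);
    Me is the m x m element-similarity kernel.
    """
    return float(sum(Dx[i] @ Me @ Dy[i] for i in range(Dx.shape[0])))

Dx = np.array([[1., 0.], [0., 2.]])
Dy = np.array([[2., 1.], [1., 1.]])
# With the identity kernel, the measure reduces to a plain inner product.
sim = slvm_kernel_similarity(Dx, Dy, np.eye(2))
```

A non-diagonal Me lets words that occur in different but semantically related elements still contribute to the similarity.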
Another method is given by Yoo et al. [86], which models the relationships of documents, paths and terms in a 3-dimensional matrix called BitCube. An XML document dx is defined in the BitCube model by BC(dx) = [(dx, p1, v1), (dx, p2, v2), ..., (dx, pn, vm)], where pi is a path in dx, vj is a word in the content of pi, and (dx, pi, vj) is 1 if the word vj appears under the path pi in dx and 0 otherwise. The approach simply uses the popularity of common features to cluster the documents in order to optimize query operations. The distance between two documents (or a document and a query) is defined through the Hamming distance:

Sim(dx, dy) = |xOR(BC(dx), BC(dy))|    (2.10)
where xOR is a bit-wise exclusive OR operator applied on the representations of the two
documents in the BitCube.
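Representing each document's BitCube slice as the set of its 1-bits, the XOR of two bit vectors is simply the symmetric difference of the two sets; a toy sketch with invented (path, word) pairs:

```python
def hamming_distance(bc_x, bc_y):
    """Bitwise-XOR (Hamming) distance (equation 2.10).

    Each BitCube slice is represented as the set of (path, word) pairs
    whose bit is 1; XOR counts positions set in exactly one of the two,
    i.e. the size of the symmetric difference.
    """
    return len(bc_x ^ bc_y)

# Hypothetical (path, word) occurrences for two documents.
bc_dx = {("/book/title", "xml"), ("/book/author", "smith")}
bc_dy = {("/book/title", "xml"), ("/book/author", "jones")}
print(hamming_distance(bc_dx, bc_dy))  # 2
```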
The methods discussed above are complex and might not be scalable enough to handle large amounts of XML documents. Therefore, other approaches [21, 84, 85, 32], most of which are from the INitiative for the Evaluation of XML Retrieval (INEX)4, have utilized the simple VSM model.
4http://inex.is.informatik.uni-duisburg.de/
INEX contains an XML Mining Track which allows participants to compare their methods and results on the classification and clustering of XML documents. In the first attempt in INEX, by Doucet et al. [21], the content is treated as a set of words and the structure as a set of node labels. These two features, the content and the structure, are combined in one vector space model. This approach is known as a naive approach, as it is a simple way of combining the structure and content; the relationships between the features are lost in the representation. Other recent attempts are by Yao et al. [84] and Yongming et al. [85]. Instead of representing the features of the content
and the structure separately, they incorporate the content and structure into a collection
of term-paths. Each term-path contains the labels from an element node to a term in
a text node in which the element node is the ancestor of the text node. HCX [32], on
the other hand, takes a different approach. It first extracts frequent subtrees from XML
documents in a collection. These frequent subtrees are then used to extract the content
and only the content that appears in the frequent subtrees is extracted from the XML
documents. The content is then represented in the VSM model for the clustering of the
XML documents. This approach produces a better clustering solution than Yao et al. [84].
However, the drawback of the HCX method is that it may miss the true classes that have
only one or two documents which are unique in structure.
2.2.3.2 Linear Approaches
Using a non-linear computation might degrade the performance of the clustering depending on the nature of the XML documents; for example, the Wikipedia collections are more distinguishable by their data content than by their structure. Measuring the structural and the content similarity of such XML documents using one data model might therefore degrade the accuracy of the clustering process. Thus, linear approaches [85, 73, 43] have also been proposed, which linearly combine different similarity measures to find the similarity between XML documents. Yongming et al. [85] use different vector space models to represent the content and structure separately. The structure
feature is a collection of complete paths, and the distance measure is the product between two vectors. The document similarity is a weighted combination of the structural similarity value and the content similarity value. A clustering method proposed in this thesis takes a similar approach; however, instead of using vectors for the structure, the similarity between leaf nodes is calculated by finding the common ancestors of the leaf nodes [73]. Later, Nagwani et al. [43] employed the path similarity measure for comparing paths, as seen in the work of Tran et al. [73]; they also add one more attribute to the document similarity, namely the similarity between style-sheets.
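The linear combination these approaches use can be sketched in one line; the weight name `alpha` and the example values are hypothetical, since each cited approach chooses its weights empirically:

```python
def combined_similarity(structure_sim, content_sim, alpha=0.5):
    """Linear weighted combination of structure and content similarity.

    alpha in [0, 1] trades structure against content: alpha = 1 uses
    structure only, alpha = 0 content only.
    """
    return alpha * structure_sim + (1 - alpha) * content_sim

# A pair of documents with high structural but modest content overlap.
print(combined_similarity(0.8, 0.4, alpha=0.7))  # 0.7*0.8 + 0.3*0.4 = 0.68
```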
2.2.4 Discussion
Figure 2.11 outlines the different XML clustering approaches which have been discussed
so far. The last level in the diagram shows the different types of similarity methods.
The second last level shows the data models which are used to represent the content and
structure of the XML data.
Tables 2.1 and 2.2 compare the different structure-only clustering approaches, and the
content and structure clustering approaches, respectively. From the literature review, the
following limitations can be ascertained from the existing XML clustering approaches.
The first limitation is that not many clustering approaches have efficiently combined the
structure and content measures for the clustering of XML documents. Depending on the
type of XML document collections, combining both the structure and content measures
can sometimes degrade the quality of the clustering solutions. The second limitation is that the two clustering strategies, incremental and pair-wise clustering, both have drawbacks.
[Figure 2.11 presents a taxonomy of the XML clustering approaches. Structure-only approaches use tree, graph or path models, with similarity methods such as tree edit distance, frequent sub-tree mining, level similarity, graph distance, sequential pattern mining and schema matching. Content-only approaches use the VSM model with vector-based distances (e.g. cosine and Euclidean), feature reduction and semantic kernels. Combined structure-and-content approaches use the VSM, SLVM and BitCube models, the last with the Hamming distance.]

Figure 2.11: The classification of the XML clustering approaches for XML data.
Pair-wise clustering is more expensive in terms of memory and computational time when dealing with large collections. In contrast, incremental clustering can deal with large document collections but suffers from poor accuracy due to its dependence on the input ordering. In this research, a number of clustering methods have been proposed that utilize both the content and the structure. The thesis also proposes a clustering algorithm that balances the scalability problem of pair-wise clustering against the accuracy of incremental clustering.
2.3 XML Transformation
Figure 2.12 illustrates the basics of an XML transformation process using the eXtensible
Stylesheet Language Transformation (XSLT)5 language. Before executing the transforma-
tion process, corresponding nodes between a source schema definition and a target schema
5www.w3.org/TR/xslt
2.3. XML Transformation 41
Table 2.1: An overview of the structure-only clustering approaches

Method Name           | Data Model      | Similarity Measure                | Clustering Method
Nierman et al. [52]   | tree            | tree edit distance                | hierarchical
Dalamagas et al. [16] | summary tree    | tree edit distance                | hierarchical
Kutty et al. [32]     | tree            | frequent sub-tree mining          | k-means
Lin et al. [39]       | tree            | frequent tree mining              | hierarchical
XCLS [47]             | tree            | level similarity                  | partitioning
XCLS+ [4]             | tree            | level similarity                  | partitioning
XEdge [6]             | tree            | level similarity                  | k-means
Hwang and Ryu [27]    | paths           | sequential pattern mining         | items clustering
XProj [3]             | paths           | sequential pattern mining         | k-means
XClust [37]           | paths           | schema matching (path similarity) | hierarchical
XMine [48]            | paths           | sequential pattern mining         | hierarchical
PCXSS [45]            | paths           | path similarity                   | partitioning
Yuan et al. [87]      | bipartite graph | Jaccard measure                   | hierarchical
Table 2.2: An overview of the content and structure-based clustering approaches

Method Name         | Data Model                    | Approach Type | Similarity Measure             | Cluster Algorithm
Doucet et al. [21]  | term-paths, VSM model         | non-linear    | Euclidean                      | k-means
Vries et al. [75]   | terms and links, vector-based | non-linear    | Euclidean                      | k-tree [23]
Nagwani et al. [43] | terms, complete paths         | linear        | Euclidean on similarity matrix | k-means
Kutty et al. [32]   | terms, VSM                    | non-linear    | Cosine                         | partitioning
Yao et al. [84]     | term-paths, VSM               | non-linear    | Cosine                         | partitioning
Yang et al. [80]    | semantic kernel, SLVM         | non-linear    | Euclidean                      | hierarchical
Yoo et al. [86]     | BitCube                       | non-linear    | Hamming                        | partitioning
definition are determined through a schema-matching process. Schema matching is a process of determining a set of correspondences that identify similar elements in two different schemas. The result of the schema-matching process is the element mappings between the target and source documents.

[Figure 2.12 depicts the transformation pipeline: a source schema and a target schema are fed into a schema-matching process that produces element mapping results; a transformation operation turns these mappings into transformation rules; an XSLT script generator creates XSLT scripts from the rules; and a transformation processor applies the scripts to the source XML documents to produce the target XML documents.]

Figure 2.12: The transformation process for XML data.

The transformation operation is the process of assigning
different transformation operators to the different mapping relationships. The result is used by the XSLT script generator to create XSLT transformation script(s). A transformation processor, in this case an XSLT processor, then uses the generated script(s) to convert XML documents that conform to the source definition format into the target definition format.
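The mapping-to-script step can be illustrated with a toy generator; this is a hypothetical sketch (the mapping pairs and the one-rule-per-mapping strategy are invented for illustration), not the generator proposed in this thesis:

```python
def generate_xslt(mappings):
    """Generate a minimal XSLT script from element mappings.

    mappings is a list of (source_path, target_element) pairs, as might be
    produced by a schema-matching step; each pair becomes one value-copying
    rule inside a single root template.
    """
    rules = "\n".join(
        f'    <{target}><xsl:value-of select="{source}"/></{target}>'
        for source, target in mappings
    )
    return (
        '<xsl:stylesheet version="1.0" '
        'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">\n'
        '  <xsl:template match="/">\n'
        f'{rules}\n'
        '  </xsl:template>\n'
        '</xsl:stylesheet>'
    )

script = generate_xslt([("/book/title", "name"), ("/book/author", "creator")])
```

A real generator would also handle cardinality, nesting and indirect matchings, as discussed in the following sections.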
The following sections discuss the related work in the area of schema matching and XML
transformation.
2.3.1 Schema Matching Approaches
Schema matching is the process of finding corresponding elements between two XML data
objects. It is a crucial step in XML transformation as well as in many other applications
such as schema integration, data integration, electronic commerce, data warehousing, and
semantic query processing and optimization.
The input to a schema-matching system is two XML data objects; XML data here refers to both XML documents and document schemas. A schema-matching approach will
usually first model the XML data into a representation, such as a tree structure, that can capture the semantics and structure of the XML data. Then the elements of the data objects are compared using different matchers. According to Smiljanic et al. [63], element matchers can be divided into two groups depending on the type of information used to compute element similarity: localized matchers and structure matchers. Localized matchers compute element similarity by considering properties such as element names, element types or instance values [18]. Structure matchers, on the other hand, compute element similarity by considering the structural properties of elements, such as the relationships between elements in the hierarchical level [41]. There
are a number of challenges associated with schema matching:

- Schemas developed for different applications are heterogeneous in nature, i.e. although the data they describe are semantically similar, the structure and the employed syntax might differ significantly.

- To resolve schematic and semantic conflicts, schema matching often relies on element names, element datatypes, structure definitions, integrity constraints, and data values.
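As a toy illustration of a localized matcher, the sketch below compares element names only, using stdlib string similarity and an arbitrary threshold; it is a hypothetical example, not any of the systems surveyed below:

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """A minimal localized matcher: string similarity of element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_elements(source_names, target_names, threshold=0.6):
    """Pair each source element with its best-scoring target element,
    keeping only pairs above the similarity threshold."""
    mappings = []
    for s in source_names:
        best = max(target_names, key=lambda t: name_similarity(s, t))
        if name_similarity(s, best) >= threshold:
            mappings.append((s, best))
    return mappings

# Hypothetical element names from a source and a target schema.
src = ["bookTitle", "authorName", "price"]
tgt = ["title", "author_name", "cost"]
print(match_elements(src, tgt))
```

A structure matcher would additionally weigh the positions of the elements in the schema hierarchy, as the systems below do.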
A number of surveys have been conducted about this matching problem by researchers
such as Algergawy et al. [5], Shvaiko and Euzenat [62], and Dorneles et al. [20]. The
rest of this section discusses some of the most popular schema-matching systems, and
the schema-matching approaches for XML clustering process and for XML transformation
application.
2.3.1.1 Schema-Matching Systems
There are a number of schema-matching systems and approaches available, some of the
most popular ones are discussed below.
COMA. COMA [18] is a composite approach that uses different matchers, simple and hybrid, to find corresponding elements. These results are then combined to arrive at the final results determining the degree of similarity between the elements. The COMA
results show that the use of different matchers produces a more accurate result than using
only a single matcher. Schemas are transformed to rooted directed acyclic graphs on which
all match algorithms operate. Furthermore, each schema element is uniquely identified by
its complete path from the root of the schema graph to the corresponding node.
Cupid. This system [41] uses multiple criteria to perform element matching. In particular, it combines an element name matcher with a structural element matcher to derive the element similarity coefficient based on the match criteria of their components, thereby emphasizing linguistic, structural and context-dependent similarity. Cupid is biased towards the similarity of the leaf elements, based on the assumption that much of the schema semantics is captured in the leaf elements rather than in the internal structure. Hence, this technique fails to distinguish the varying element contexts that are commonly defined in an XSD schema; for example, Cupid will fail to distinguish between element contexts such as book/name and book/author/name.
Similarity Flooding. Similarity Flooding (SF) [42] takes a different approach. It uses graphs to represent the schemas in order to carry auxiliary similarity information from one graph into the other, and finds corresponding elements using these graphs. The SF approach computes the similarity between two nodes based on the assumption that two elements are similar if their adjacent elements are similar. This technique is not appropriate for comparing schemas from heterogeneous domains, as the structure of the graphs will be totally different.
S-Match. The S-Match [24] is a schema-based matching system which takes two graph-
like structures (e.g. XML schemas or ontologies) and returns semantic relationships be-
tween the nodes of the graphs that correspond semantically to each other. The relation-
ships are determined by analysing the meaning (concepts, not labels) of the elements and
the structures of schemas/ontologies. In particular, labels at nodes, written in natural
language, are translated into propositional formulas which explicitly identify the label’s
intended meaning. This allows for a translation of the matching problem into a proposi-
tional unsatisfiability problem, which can then be efficiently resolved using state of the art
propositional satisfiability deciders. S-Match was designed and developed as a platform for
semantic matching, namely a highly modular system with the core of computing semantic
relations where single components can be plugged, unplugged or suitably customized.
Doan et al. [19]. This matching system is similar to COMA [18] in that it uses a composite approach to combine different matchers, and it uses machine learning for element mappings. In addition, it extends machine learning techniques by introducing a novel learner that exploits the hierarchical structure of the XML data to improve the matching results. A drawback of this technique is that it depends heavily on the user at the training stage: initially, users have to provide some semantic mappings between the input schemas and mediated schemas, and these mappings are then further refined during the training stage.
Xu et al. [79]. This approach finds direct as well as indirect matchings between a source and a target schema. It is based on the assumption that both source and target schemas can be described using rooted conceptual-model graphs and that each element node is associated with a data value or object identifier. The technique can identify indirect matchings between elements from two schemas using structure matching and data value characteristics techniques.
2.3.1.2 Schema Matching for XML Clustering
A few research studies [37, 50, 45, 48] have discussed the schema matching concept in the clustering of XML schema definitions. XClust [37] introduces a complex computational technique to compute the element similarity between schemas of XML data by considering the semantics, immediate descendants and leaf-context information. The main focus of this approach is to cluster DTD schemas into similar groups in order to facilitate the schema integration process. Unlike XClust, Nayak et al. [50, 45] propose a method to find element mappings between XSDs. These methods introduce a rigid function called NCN (number of common nodes) to measure the similarity between leaf nodes using node paths. The drawback of this function is that it does not compute a similarity value when the leaf element of one path does not match the leaf element of another path. Similar to Cupid [41], this approach fails to distinguish between the varying element contexts that are commonly defined in XSD schema definitions. XMine [48], on the other hand, computes a complex schema matching for DTDs. It measures the structural similarity between DTDs by finding the maximal similar paths between schemas using sequential frequent mining. This approach generates a similarity matrix between XML trees and then uses the hierarchical clustering algorithm [30] to perform clustering based on the similarity matrix.
2.3.1.3 Schema Matching for Transformation Approaches
Approaches such as Su et al. [66], Boukottaya et al. [8] and Lee et al. [36] are designed specifically for the transformation of XML documents. Su et al. [66] propose a schema matching approach for the XML Schema language. It represents each schema as a schema graph that captures the schema properties, with nodes and edges representing different relationships between elements within the schema (i.e. containment, of-property and association relationships) and constraints (i.e. ordered composition, exclusive disjunction and referential constraints). For the matching, it considers linguistic, data type and type hierarchy matchings. Besides semantic matching, it also considers structural matching, based on relaxation matching that allows paths to be matched even when nodes are not embedded in the same manner or in the same order: two elements within each path may be matched even if they are not identical, provided their linguistic similarity exceeds a fixed threshold. Instead of generating similarity scores between source and target schemas, this approach uses the schema graph to discover matching nodes and edges, and the transformation operators necessary for the transformation between source and target schemas. Similarly, Boukottaya et al. [8] do not produce similarity scores between source and target schema nodes. The authors suggest using conceptual modelling to model the XML schemas; they represent each schema through two views, a semantic view and a logical view, and the matching process is executed on these views. Lee et al. [36], on the other hand, introduce schema matching based on domain ontology updates. The ontology used in this approach is dynamically updated with user feedback from previous matching results. The proposed ontology is represented by a set of trees in which nodes and edges correspond to concepts and relationships respectively. There are two steps in the schema matching. First, it creates preliminary matchings between leaf nodes based on the domain ontology, lexical similarity and data type similarity; this step creates many-to-many matchings. The second step therefore extracts the final (one-to-one) matchings using path similarities.
2.3.2 Transformation Approaches
The schema matching process is one of the stages in XML transformation: it is used to find corresponding nodes between two XML data objects. After corresponding nodes are found, they are used to generate a transformation script. One of the widely used transformation languages for XML data is the eXtensible Stylesheet Language Transformation (XSLT) [1]. An XSLT program, called a stylesheet, is composed of one or more transformation rules called templates that recursively operate on a single input document. Transformation rules in XSLT are guarded by XPath expressions. XPath6 uses path expressions to select nodes or sets of nodes in an XML document; it operates on XML documents using a tree-based model, navigating through the elements and attributes of an XML document.
6http://www.w3.org/TR/xpath20/
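The path expressions that guard XSLT templates can be illustrated with Python's standard xml.etree.ElementTree, which supports a limited XPath subset; the document and element names below are invented for the sketch:

```python
import xml.etree.ElementTree as ET

# A toy source document.
doc = ET.fromstring(
    "<books>"
    "<book><title>XML Clustering</title><author>Tran</author></book>"
    "<book><title>Data Mining</title><author>Nayak</author></book>"
    "</books>"
)

# Select nodes with path expressions, much as an XSLT template's match
# pattern or xsl:value-of select attribute would.
titles = [t.text for t in doc.findall("./book/title")]
first_author = doc.find("./book[1]/author").text
print(titles, first_author)
```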
2.3.2.1 XSLT for XML Transformation
The majority of existing research [31, 55, 54, 65, 67] translates each matching relationship between nodes into an XSLT template, resulting in a script with many templates. Shin et al. [61] state that XSLT scripts with many templates slow down the transformation process when the script has to be applied repeatedly to a large volume of XML documents. Therefore, the Shin et al. [61] approach generates an XSLT script where the number of templates is proportional to the number of matches between recursive nodes, regardless of the number of matching relationships between internal nodes. A recursive node is a node that contains a reference to its corresponding ancestor. This approach focuses on cardinality operators (defining how many instances of an element type are permitted in a document), since cardinality operators cause XML documents from the same schema to have different structures/representations. The disadvantage of this approach is that the generated XSLT script cannot be re-used to transform XML documents of other XML schemas with similar structures.
On the other hand, Wustner et al. [78] suggest that processing XSLT on the content of the XML documents, instead of on the XML structure, may improve the accuracy of the XML transformation: using the content, some structural problems that cannot be solved by simply transforming DTDs or XSDs can easily be resolved. Approaches such as [77, 53] propose new methods to generate the XSLT script automatically. Given a source XML document and a desired output XML/HTML document, an XSLT stylesheet is automatically generated to transform the source into the output. The generated stylesheet contains rules needed to transform the source document into the output document and
can also be applied to other source documents having the same structure.
2.3.2.2 Other Manipulation Languages for XML transformation
Since the XSLT language has a number of disadvantages, new manipulation languages for XML transformation have been proposed in recent years. Streaming Transformations for XML (STX)7 provides a template-based XML transformation language that operates on streams of SAX (Simple API for XML) events. Unlike XSLT, where the input documents and the result tree need to be built in memory, STX adopts some of the concepts of XSLT but uses SAX as the underlying interface to the XML documents, so that they do not need to be stored in memory. This approach can be used to process large XML files more efficiently.
Approaches such as MTRANS [57] state that writing a long XSLT program is painful, as it requires a good understanding of the XML specification. Thus, the MTRANS language is developed at an abstraction level above XSLT, where XML documents are modelled as class diagrams. It uses UML (Unified Modeling Language) to transform one class diagram (the source document) into another (the target document). Some approaches [54, 55] go in a similar direction, developing a high-level language that is used to specify XML data transformations. They are based on the tree-based model and use XPath expressions. The authors in [54] use an unranked tree transducer approach for XML transformation. XDTrans [55] specifies transformations by means of rules which involve XPath expressions, node variables and non-terminal symbols denoting fragments of a constructed result. These two approaches are developed at an abstraction level in
7http://www.pair.com/lisovsky/transform/stx/
which the transformation rules generated by them can be easily transformed to XSLT for
XML transformation.
Along with many other XML specifications, XQuery8 has also been introduced by the W3C for querying XML documents. Both XSLT and XQuery use XPath expressions to navigate XML documents. Even though the main purpose of XQuery is to query XML documents, it also has the functionality to manipulate and transform XML data [9, 55] into another required XML format. When XQuery is used for XML transformation, the mapping specifications are translated into appropriate XQuery queries over the input document; the result of the query is the expected output document, which must satisfy the output schema. Bruno et al. [9] extend the XQuery language with transformation operators for the transformation of XML data. They show that using XQuery as a transformation language can be more manageable and easier than using XSLT, as the complexity of XSLT lies in the generation of template rules.
2.3.3 Discussion
The existing XML transformation approaches mentioned in this section have the
following limitations: (1) not many XML transformation approaches have addressed the
problem of transformation using XML documents that are based on XML schemas;
and (2) to the best of our knowledge, none of the existing XML transformation approaches
attempts to convert more than one XML schema definition of similar structure to the
same target document at the same time.

8 www.w3.org/TR/xquery/

Having said that, there exists some
work [51, 59] in the area of schema integration which can be used to resolve structural
conflicts, such as nesting discrepancies and backward path representations, when integrating
XML sources into a mediated schema. However, these works do not go further and apply
the mediated schema in a transformation application. This thesis aims to address the
above limitations. To simultaneously translate a large number of source documents into
the same target document, this research proposes an XML transformation approach that
utilizes a structure-only clustering method as a pre-processing stage to group the source
documents into clusters of similar structure. The XSLT language is used instead
of other existing manipulation languages because it is the standard and most commonly
used language for XML transformation.
2.4 Summary
This chapter has reviewed the literature on XML clustering and XML transformation.
A number of different clustering approaches based on different data models, similarity
measures and clustering algorithms have been analysed. Furthermore, the chapter has
outlined the drawbacks of the existing approaches and the gap in the literature that this
research tries to fill.
The limitations of the current approaches discussed in this chapter have guided the
research in this thesis. The next chapter describes the clustering methods proposed in
this research.
Chapter 3
The Proposed Clustering Methods
This chapter describes the clustering methods which have been proposed in this research
to investigate the first key hypothesis of this thesis: that clustering methods
utilizing both the content and structure of XML documents produce a better
clustering solution, in terms of accuracy, than clustering methods utilizing
only the content or only the structure of XML documents.
The proposed clustering methods are divided into two types. The first type of clustering
is the structure-only type. This type of clustering utilizes only the structure of the XML
documents. The second type is the content and structure-based clustering. It utilizes both
the content and structure of the XML documents.
This chapter begins with an overview of the proposed clustering methods. The methods
are then described in detail according to their data modelling and data similarity tasks.
A hybrid clustering algorithm, utilizing the clustering methods for the partitioning of the
XML documents, is introduced later in the chapter.
3.1 The Proposed Clustering Methods: Overview
Figure 3.1 is an overview of all the proposed clustering methods. The input for the
clustering methods is a collection of XML documents. The clustering methods proposed
in this thesis are classified into two types: structure-based clustering, and content and
structure-based clustering.
There are two structure-only clustering methods. The first method is XML clustering
based on a Tree model (XCTree). This method utilizes a tree model and a tree similarity
measure (TSim) to compute the degree of similarity between XML documents. The second
method is XML clustering based on a path model (XCPath). A path model and a path
similarity measure (CPSim) are defined and used for the grouping of XML documents by
the XCPath method.
In addition, there are two content and structure-based clustering methods. The first method
is XML clustering based on the linear combination of the structural and content similarity
measures (XCLComb). This method uses a linear combination measure (LCSim) to com-
bine the similarity values from a structure measure and a content measure for the overall
document similarity. The second method is XML clustering based on a text-path model
(XCTPath). It is a non-linear method which uses text-paths for representation of both
the structure and content of the XML documents. A text-path vector similarity measure
(TPVSim) is defined to compute the similarity between two sets of text-paths.
The proposed clustering methods use the same clustering algorithm called hybrid clus-
tering to group the XML documents into k number of clusters. The rest of this chapter
explains each of the proposed clustering methods in more detail.
[Figure: a collection of XML Documents feeds into data modelling and data similarity stages — XCTree (tree model, TSim), XCPath (path model, CPSim), XCLComb (tree and text vector models, LCSim) and XCTPath (text-path vector model, TPVSim) — and a hybrid clustering stage then produces clusters C1, C2, ..., Ck.]
Figure 3.1: An overview of the proposed clustering methods.
3.2 The Structure-Only Clustering Methods
The structure of XML documents is used to annotate their content, which distinguishes
XML documents from normal text documents. XML structure is very flexible, since it
can be defined by the user; thus, the same information may not be annotated with the
same structure. There are many applications in which structure-only clustering methods
can be utilized, such as schema integration, data warehousing and
message exchange.
With many applications utilizing the structure for the clustering of XML documents,
this thesis proposes two structure-only clustering methods: the XCTree and the XCPath.
They differ according to the underlying data model and similarity measure. The XCTree
represents the structure of XML documents using a tree model, whereas the XCPath
represents the structure using a path model. In this research, the path and tree models
are used for the representation of the structure of XML documents because they are the
most commonly used models and are less complex than the graph model. The rest of
this section describes the two structure-only clustering methods in terms of their data
modelling and data similarity.
3.2.1 The XCTree Method
The XCTree method is one of the two structure-only clustering methods proposed
in this research. The method groups the XML documents using a tree model to capture
the structure embedded in the XML documents. A new tree similarity measure is then
defined in order to compute the degree of similarity between XML documents.
3.2.1.1 The Tree Model
To capture the structure of the XML documents, the XCTree method uses a tree model
called the summary tree structure. The summary tree structure is encoded in the depth-first
string tree encoding format [13] and is based on a rooted label tree structure. A rooted
label tree T is defined as T = (V, E, L, r), where V is the set of nodes in T, E is
a set of edges, L is a set of node labels, and r is the
root node. If (ni, nj) ∈ E and ni ≠ nj, then (ni, nj) is an edge in which the node ni
is the parent of the node nj. A rooted label tree has the following properties: (1) there
is exactly one r, where r ∈ V and r has no parent; (2) every node, except r, has exactly
one parent; and (3) every node in V is reachable via edges from r.
An example of the rooted label tree structure is shown in Figure 3.2(a). Nodes that
do not have any child nodes, or that contain only a text node, are called leaf nodes. Nodes
that contain other nodes are referred to as internal nodes. This thesis focuses on the
labels of the element nodes and the element attributes, as they are the most important
components in the structure of XML documents. Attributes are modelled and treated in
the same way as element leaf nodes.
[Figure 3.2(a): a tree rooted at company with children address, cname and personnel; personnel contains two person nodes with name and address children.]
company address -1 cname -1 personnel person name -1 address -1 -1 -1
(b)
Figure 3.2: An example of a tree structure (a) and its corresponding summary tree structure in depth-first string tree encoding format (b).
Definition 1. A summary tree structure is a tree structure that records only the unique
nodes. Two nodes that have the same label and the same type (leaf or internal)
are replaced by a single occurrence in the summary tree structure. For a summary
tree structure T consisting of only the root node r, the depth-first string of T is S(T) = lr -1,
where lr is the label of r. Every node is followed by a "-1" to represent backtracking. For a T
with more nodes, let the children of r be r1, r2, ..., rk; the depth-first string of T is then
S(T) = lr S(r1) S(r2) ... S(rk) -1.
Figure 3.2 shows an example of a rooted label tree structure in (a) and its corresponding
summary tree structure in (b). Notice that the node person and its children only appear
once in the summary tree structure. The summary tree structure does not keep the
occurrence information (cardinality) of the nodes. Utilizing the occurrence information
of the nodes in the XCTree method might cause two similar documents to have a low
similarity value [16]. For instance, consider two documents with exactly the same structure,
where one document repeats a node many times and the other contains the same node
only once. In this scenario, taking the occurrence information of the nodes into
consideration would yield a lower structure similarity between these two documents than
using the summary tree structure, which ignores the occurrence information of the
nodes.
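As an illustration, the summary tree construction and the depth-first string encoding of Definition 1 can be sketched in Python. The dictionary-based tree representation below is an assumption made for illustration, not the thesis implementation.

```python
# A node is a dict: {"label": str, "children": [node, ...]}.
# summarize() merges sibling nodes with the same label and the same
# type (leaf vs internal), as required by Definition 1; encode()
# produces the depth-first string with "-1" backtracking markers.

def summarize(node):
    groups, order = {}, []
    for child in node["children"]:
        key = (child["label"], len(child["children"]) > 0)
        if key not in groups:
            groups[key] = {"label": child["label"], "children": []}
            order.append(key)
        # collect the children of duplicate siblings so they merge too
        groups[key]["children"].extend(child["children"])
    return {"label": node["label"],
            "children": [summarize(groups[k]) for k in order]}

def encode(node):
    return node["label"] + "".join(
        " " + encode(c) for c in node["children"]) + " -1"

leaf = lambda label: {"label": label, "children": []}
company = {"label": "company", "children": [
    leaf("address"), leaf("cname"),
    {"label": "personnel", "children": [
        {"label": "person", "children": [leaf("name"), leaf("address")]},
        {"label": "person", "children": [leaf("address")]},
    ]},
]}

print(encode(summarize(company)))
# company address -1 cname -1 personnel person name -1 address -1 -1 -1 -1
# (the final -1 closes the root, per Definition 1)
```

Running this on the tree of Figure 3.2(a) merges the repeated person nodes and reproduces the depth-first string of Figure 3.2(b), with an additional trailing -1 closing the root as Definition 1 prescribes.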
3.2.1.2 The Tree Similarity Measure: TSim
Based on the summary tree structure, a new measure called the Tree Similarity (TSim) is
proposed to calculate the similarity between two summary tree structures which is defined
as follows:
TSim(tx, ty) = max(SimTreeMatching(tx), SimTreeMatching(ty)) (3.1)
SimTreeMatching(tx) = ( sum_{i=1}^{|tx|} nodeSim[i] ) / |tx|   (3.2)
The TSim measure is the best similarity value of the two SimTreeMatching values between
trees tx and ty. The SimTreeMatching(tx) is calculated by computing a treeMatching
algorithm from source tree tx to target tree ty. The output of the algorithm is an array
called the nodeSim which contains the similarity values of the nodes in tx that match
with the nodes in ty. The SimTreeMatching(tx) is the sum of the similarity values in
the nodeSim divided by the number of nodes in tx. The SimTreeMatching(ty), on the
other hand, is calculated by computing the treeMatching algorithm from source tree ty
to target tree tx in the same way.
The detailed treeMatching algorithm is shown in Algorithm 1. The algorithm is not
symmetric, as the tree matching from source tree tx to target tree ty differs from the
matching from source tree ty to target tree tx. The similarity value SimTreeMatching(tx)
can therefore be different from the similarity value of SimTreeMatching(ty). The treeMatching
algorithm starts the tree matching at the first node i in a source tree to the first node j in
a target tree and works its way down the source tree. If the label labeli of i is not equal
to the label labelj of j, the algorithm moves to the next node j + + in the target tree
structure and starts the node matching with i.
When labeli equals labelj , the similarity value similarity i of i with j is calculated by taking
Algorithm 1 treeMatching
Input: Source tree tx, target tree ty, node similarity array nodeSim;
Output: nodeSim;
1. while node i ∈ tree tx /*starting with the first node in tx*/
2. double similarity i=0;
3. while node j ∈ tree ty /*starting with the first node in ty*/
4. if labeli and labelj are the same
5. similarity i = lower(leveli,levelj)/higher(leveli,levelj);
6. treeMatching(subTreei, subTreej , nodeSim);
7. for each node s ∈ subTreei
8. add the similarity value of s ∈ nodeSim to similarity i;
9. reset the similarity value of s ∈ nodeSim to zero;
10. end for
11. if similarity i is larger than the similarity value of i ∈ nodeSim
12. set similarity value of i ∈ nodeSim to similarity i;
13. end if
14. process the next sibling of j ∈ ty;
15. else process the next node j++ ∈ ty;
16. end if
17. end while
18. if i finds a match with any node ∈ ty
19. process the next sibling of i ∈ tx;
20. else process the next node i++ ∈ tx;
21. end if
22. end while
23. return nodeSim;
into account the node levels in the tree structure (Line 5 in Algorithm 1). If the matching
nodes are at the same level, a maximum similarity value of 1 is assigned. Otherwise a
penalty value is assigned according to the difference in level. The root node is in the first
level of a tree structure and its immediate children are in the second level and so on. The
penalty value is calculated by considering the lower level of the two node levels divided
by the higher level of the two nodes. For instance, given two nodes with the same label,
one node is at level 3 and the other is at level 2. The node similarity value of the two
nodes is 0.66 (2/3). The node similarity value of two matching nodes is stored in the
array nodeSim. The nodeSim contains the similarity values of the nodes in the source tree
which find a match with the nodes in the target tree. Therefore, the length of nodeSim
is equal to the number of nodes in the source tree.
Each time labeli is equal to labelj , the treeMatching algorithm starts again for the children
(referred to as subTree in Algorithm 1) of i and j. If either i or j does not have any children,
the treeMatching algorithm starts the node matching of i to the next sibling of j. If j
does not have any sibling, the treeMatching algorithm starts the node matching of i
to the next sibling of j’s ancestor. When the treeMatching algorithm finishes the node
matching for the children of i and j, the similarity values of the children of i are stored
in the nodeSim. The sum of the similarity values of the children of i is added to the
similarity i values of i. The similarity values of the children of i in the nodeSim are reset
to zero. If the similarityi is larger than the similarity value of i in the nodeSim, the
similarity value of i in the nodeSim is set to similarityi.
After the treeMatching algorithm finishes the node matching for i to the nodes in the
target tree and i finds a match with any of the nodes in the target tree, the treeMatching
algorithm starts the node matching for the next sibling of i if there is any, otherwise it
moves to the next sibling of i’s ancestor. However, after finishing the node matching for i
and no match is found, the algorithm starts the node matching for the next node i++ in
the source tree structure. The treeMatching algorithm ends when the algorithm reaches
the end of the source tree structure.
The treeMatching algorithm can discover structural conflicts such as nesting discrepancies.
For example, consider the two paths movie/title and actor/movie/title: these two
paths are similar, but because of the nesting they may not match exactly. The treeMatching
algorithm can resolve this type of conflict because it continues to the
next node if the first node in the hierarchy does not have a match. Moreover, because the
treeMatching algorithm performs the matching both from source tree tx to target tree ty
and from source tree ty to target tree tx, it can discover more structural conflicts.
Although the treeMatching algorithm may not accurately discover structural conflicts
such as backward path representations (for example, the paths title/movie and movie/title),
it still produces a similarity value greater than 0 between such paths, such as 0.5.
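The rolled-up bookkeeping of Algorithm 1 can be sketched as a simplified recursive procedure. The Python sketch below, with trees as nested dictionaries, is an illustrative approximation rather than the thesis implementation: it matches each source node against every target node instead of following the document-order pointer of Algorithm 1, but it reproduces the 3/5 and 4/5 totals worked out in Example 3 below.

```python
# node: {"label": str, "children": [node, ...]}

def match_score(s, t, sl, tl):
    """Score of source node s matched at target node t (same label),
    including the best matches of s's children inside t's subtree."""
    score = min(sl, tl) / max(sl, tl)      # level penalty (Line 5)
    for c in s["children"]:
        score += best_in(c, sl + 1, t["children"], tl + 1)
    return score

def best_in(s, sl, targets, tl):
    """Best rolled-up score of s against any node in the target forest."""
    best = 0.0
    for t in targets:
        if t["label"] == s["label"]:
            best = max(best, match_score(s, t, sl, tl))
        best = max(best, best_in(s, sl, t["children"], tl + 1))
    return best

def total_score(s, sl, target_root):
    b = best_in(s, sl, [target_root], 1)
    if b > 0:                  # s matched: keep its rolled-up value
        return b
    # s is unmatched: its children are matched independently
    return sum(total_score(c, sl + 1, target_root) for c in s["children"])

def size(n):
    return 1 + sum(size(c) for c in n["children"])

def sim_tree_matching(src, tgt):
    return total_score(src, 1, tgt) / size(src)

def tsim(tx, ty):
    return max(sim_tree_matching(tx, ty), sim_tree_matching(ty, tx))

leaf = lambda l: {"label": l, "children": []}
tx = {"label": "personnel", "children": [
    {"label": "person", "children": [
        {"label": "name", "children": [leaf("firstName"), leaf("lastName")]}]}]}
ty = {"label": "personnel", "children": [
    leaf("person"),
    {"label": "person", "children": [leaf("name"), leaf("address")]}]}

print(sim_tree_matching(tx, ty), sim_tree_matching(ty, tx))  # 0.6 0.8
```

On the trees of Figures 3.3 and 3.4, the sketch yields SimTreeMatching(tx) = 3/5 and SimTreeMatching(ty) = 4/5, so TSim = 0.8.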
Example 3. To understand the treeMatching algorithm further, consider the matching
between trees tx and ty as given in Figures 3.3 and 3.4. The numbered arrows
in the figures show the sequence in which the treeMatching algorithm progresses.
Using the example in Figure 3.3, at Step 1, the labels of the two root nodes are the
same. The treeMatching algorithm then processes the children of the two root nodes.
At Step 2, the node person in tx is compared with the first node person in ty and their
labels are the same. However, the algorithm does not process further because that node person in
ty does not have any children. At Step 3, the node person is compared to the next
node person in ty. As the two nodes match, the treeMatching algorithm processes
the children of the nodes. The nodes that do not have any arrows pointing in or out
in Figures 3.3 and 3.4 have not been processed by the treeMatching algorithm. The
output similarity value from the treeMatching(tx, ty) algorithm for the example in
Figure 3.3 is three. Even though the treeMatching(tx, ty) run has four matches, the node person
of the source tree tx matches twice with nodes in the target tree ty; therefore, only
the best match similarity value is used. The best match value of a node is the highest
sum of the similarity values of the node's descendants; in other words, the best match
of a node is the one with the most matches among the node's descendants. The same
process is repeated for treeMatching(ty, tx), which yields the value of 4. Finally,
SimTreeMatching(tx) = 3/5 and SimTreeMatching(ty) = 4/5; the maximum of these
two SimTreeMatching values is the TSim value for tx and ty.
[Figure: tree tx = personnel with a single person child whose name node has children firstName and lastName; tree ty = personnel with two person children, the second having children name and address. Numbered arrows (1-5) mark the order in which the treeMatching algorithm visits the nodes.]
Figure 3.3: An example of the treeMatching algorithm from tx to ty.
[Figure: the same trees tx and ty as in Figure 3.3, with numbered arrows (1-5) marking the matching sequence in the opposite direction, from ty to tx.]
Figure 3.4: An example of the treeMatching algorithm from ty to tx.
3.2.2 The XCPath Method
The second structure-only clustering method is the XCPath method. This method employs
a path model to capture the structure of the XML documents. A path similarity measure
called CPSim (Common Path Similarity) is defined to compute the degree of similarity
between XML documents using the path model.
3.2.2.1 The Path Model
The XCPath method represents the structure of an XML document using a set of complete
paths. A complete path contains the labels of the nodes from the root to the leaf node.
The complete paths can be extracted from the summary tree structure. Based on the
example in Figure 3.2, the summary tree structure can be broken down into the following
complete paths:
company/address,
company/cname,
company/personnel/person/name,
company/personnel/person/address
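The extraction of complete paths from a summary tree can be sketched as follows; the dictionary-based tree representation is an illustrative assumption carried over from the earlier sketches, not the thesis implementation.

```python
def complete_paths(node, prefix=""):
    """Extract root-to-leaf label paths from a summary tree
    (nodes are dicts: {"label": str, "children": [...]})."""
    path = f"{prefix}/{node['label']}" if prefix else node["label"]
    if not node["children"]:
        return [path]
    paths = []
    for child in node["children"]:
        paths.extend(complete_paths(child, path))
    return paths

leaf = lambda l: {"label": l, "children": []}
company = {"label": "company", "children": [
    leaf("address"), leaf("cname"),
    {"label": "personnel", "children": [
        {"label": "person", "children": [leaf("name"), leaf("address")]}]}]}

print(complete_paths(company))
# ['company/address', 'company/cname',
#  'company/personnel/person/name', 'company/personnel/person/address']
```

Applied to the summary tree of Figure 3.2, this yields exactly the four complete paths listed above.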
3.2.2.2 The Path Similarity Measure: CPSim
The similarity between two documents dx and dy, represented by their sets of paths Px
and Py, is calculated using the CPSim measure which is defined as follows:
CPSim(Px, Py) = ( sum_{i=1}^{|Px|} max_{j=1..|Py|} PSim(pi, pj) ) / max(|Px|, |Py|)   (3.3)

PSim(pi, pj) = max(CNC(pi, pj), CNC(pj, pi)) / max(|pi|, |pj|)   (3.4)
where |Px| and |Py| are the number of paths in Px and Py, respectively. The CPSim is
the sum of the best path similarity from the PSim measure for all paths in dx with the
paths in dy, divided by the maximum number of paths in the two sets Px and Py. Only
path similarity values from PSim that exceed a path threshold are considered in the
CPSim measure. The path threshold determines the lowest path similarity value that two
paths must have to be considered a matching pair. The path threshold is a user-defined
value ranging from 0 to 1, where 1 is the highest value and indicates that the
structures of two paths match exactly.
The PSim of paths pi and pj is the better of the two CNC (Common Node Coefficient)
values divided by the maximum number of nodes in the two paths. The CNC is the
number of common nodes - that is, the number of nodes having the same label - counted
while respecting the hierarchical order of the nodes in the paths. The CNC algorithm
works in a similar way to the treeMatching algorithm; the difference is that the CNC
algorithm finds the common nodes between two paths starting from the leaf node.
However, the CNC algorithm is more time-consuming than the treeMatching algorithm,
since it operates between two paths in which the ancestors of the leaf nodes need to be
revisited a number of times. The aim of the CNC algorithm is to find corresponding
leaf nodes by considering the node labels as well as their ancestors. This algorithm is
appropriate for a schema matching system where all corresponding leaf nodes between
two data sources need to be identified.
The CNC algorithm is detailed in Algorithm 2. It starts the matching of the two paths
at the leaf node. Each time a node in the source path finds a match with a node in the
target path, the two nodes' parents are processed next; otherwise, the CNC algorithm
processes the current unmatched node in the source path against the parent of the
unmatched node in the target path. The algorithm continues until all the nodes in the
source path find a match or the target path reaches the root node. A match in CNC
occurs when the labels of two nodes are the same; each time a match is found, the
similarity value between the source path and the target path is incremented by 1. The
CNC algorithm does not process the ancestors of a node in the source path if the node
cannot find a match with the nodes in the target path.
Also similar to the treeMatching algorithm, the CNC algorithm is not symmetric and it
can discover structural conflicts such as nesting discrepancies. Like the treeMatching,
Algorithm 2 CNC
Input: Paths pi and pj;
Output: Int similarity;
1. int similarity = 0;
2. int z = 0;
3. for(int t = 0; t < |pi|; t++)
4. while z < |pj|
5. if(nt == nz)
6. similarity+=1;
7. z++;
8. break from ’while’ loop;
9. else
10. z++;
11. end if
12. end while
13. end for
14. return similarity;
the CNC algorithm may not accurately discover structural conflicts such as backward path
representations; however, it produces a similarity value greater than 0 because it
can still discover at least one element that is the same in the backward path representations.
Consider the examples in Figure 3.5. The CNC algorithm starts at the leaf node (the
node on the right-hand side is the leaf node). In Figure 3.5(a), the leaf node name of the
source path py is compared with the leaf node lastName of the target path px. The numbered
arrows in the figure show the sequence in which the matching process in the CNC algorithm
is executed. The output from the CNC algorithm for example (a) in Figure 3.5 is 4, and
the output for example (b) is 0.
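A Python sketch of CNC (following Algorithm 2), together with the PSim and CPSim measures of Equations 3.3 and 3.4, reproduces these outputs. Representing paths as plain lists of labels, root first, is an illustrative assumption.

```python
def cnc(p_src, p_tgt):
    """Common Node Coefficient: count label matches between two paths
    (lists of labels, root first), scanning from the leaf upward."""
    src, tgt = list(reversed(p_src)), list(reversed(p_tgt))
    sim, z = 0, 0
    for label in src:
        while z < len(tgt):
            if label == tgt[z]:
                sim += 1
                z += 1
                break
            z += 1            # keep climbing the target path
    return sim

def psim(pi, pj):
    # best of the two matching directions, normalized (Eq. 3.4)
    return max(cnc(pi, pj), cnc(pj, pi)) / max(len(pi), len(pj))

def cpsim(Px, Py, threshold=0.0):
    # sum of best per-path similarities above the threshold (Eq. 3.3)
    total = 0.0
    for pi in Px:
        best = max((psim(pi, pj) for pj in Py), default=0.0)
        if best > threshold:
            total += best
    return total / max(len(Px), len(Py))

px = ["company", "personnel", "person", "name", "lastName"]
py = ["company", "personnel", "person", "name"]
print(cnc(py, px), cnc(px, py))   # 4 0  (the outputs of Figure 3.5)
print(psim(px, py))               # 0.8
```

With py as the source path every node finds a match (CNC = 4); with px as the source, the unmatched leaf lastName exhausts the target path and no further matches occur (CNC = 0), matching examples (a) and (b).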
[Figure: two CNC matching examples between the paths px = company/personnel/person/name/lastName and py = company/personnel/person/name; the numbered arrows mark the matching sequence, in (a) with py as the source path and in (b) with px as the source path.]
Figure 3.5: CNC matching
3.3 The Content and Structure-based Clustering Methods
The structure-only clustering methods described in the previous section utilize only the
structure of the XML documents. However, for the clustering of XML documents, the
content of the documents can also play an important role. For instance, documents having
the same structure might not contain the same content and vice versa. A good example
is that of two journal articles that have the same structure but different content; one is
about health science and the other is about data mining. This example shows that the
clustering based on the structure-only information might not produce a desirable content
and structure-based clustering solution.
Therefore, this section introduces two methods which utilize both structure and content
for the clustering of XML documents. The first method is the XCLComb, which utilizes a
linear measure to combine the structure similarity value and the content similarity value
to compute the overall document similarity; the structure and the content are represented
using different data models. The second method is the XCTPath, which represents the
content and structure using the same data model; it is a non-linear method for the
clustering of the XML documents using both the structure and the content.
3.3.1 The XCLComb Method
Not many clustering methods can be applied efficiently to both homogeneous and
heterogeneous XML document collections. A homogeneous collection generally does
not vary drastically in terms of structure; mostly, it varies in terms of content.
Different types of collections need different ways of measuring document similarity.
For instance, documents from a homogeneous collection can be differentiated better by
their content than by their structure. A heterogeneous collection, on the other hand,
differs in terms of both structure and content. Given these characteristics of homogeneous
and heterogeneous collections, it is not easy to propose an approach that works efficiently
with both types of collections.
In order to have a clustering method that can be applied to homogeneous as well as
heterogeneous collections, this thesis proposes the XCLComb method, which uses different
data models and similarity measures for the content and the structure. The similarity values
of the content and the structure are calculated separately, and these values are then combined
with different weightings to adjust the relative importance of the content similarity value and
the structure similarity value. For instance, homogeneous collections will assign a higher
weight to the content similarity value than to the structure similarity value.
3.3.1.1 The Tree Model and The Text Vector Model
The XCLComb method represents the structure of XML documents using the tree model
employed by the XCTree method, whereas the content of XML documents is based on a
text vector model.
The content of a document dj is represented using a text vector tvj = (w1,j, w2,j, ..., wm,j),
where m is the number of terms in the XML document collection from which document dj
is drawn and wi,j is the TF-IDF weighting of term ti in document dj, defined as:

wi,j = TFi · log( |D| / |{d ∈ D : ti ∈ d}| )   (3.5)

where TFi is the term frequency of term ti in document dj divided by the total number of
term occurrences in document dj, log(|D| / |{d ∈ D : ti ∈ d}|) is the inverse document
frequency (IDF), |D| is the total number of documents in the collection D, and
|{d ∈ D : ti ∈ d}| is the number of documents in D containing the term ti.
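The TF-IDF weighting of Equation 3.5 can be sketched as follows; the toy two-document collection is made up for illustration.

```python
import math

def tfidf_vectors(docs):
    """docs: list of documents, each a list of terms.
    Returns one {term: weight} map per document (Equation 3.5)."""
    N = len(docs)
    df = {}                                  # document frequency per term
    for terms in docs:
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for terms in docs:
        total = len(terms)                   # total term occurrences in dj
        vectors.append({t: (terms.count(t) / total) * math.log(N / df[t])
                        for t in set(terms)})
    return vectors

docs = [["xml", "tree", "xml"], ["xml", "path"]]
vectors = tfidf_vectors(docs)
# "xml" occurs in every document, so its IDF (and weight) is 0;
# "tree" in the first document gets (1/3) * log(2/1).
```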
3.3.1.2 The Linear Similarity Measure: LCSim
Given two documents dx and dy, the document similarity employed by the
XCLComb method is a Linear Combination of the structural and content Similarity (LCSim)
values, defined as follows:

LCSim(dx, dy) = TSim(tx, ty) × (1 − λ) + TVSim(tvx, tvy) × λ   (3.6)

TVSim(tvx, tvy) = (tvx^T Uk Uk^T tvy) / ( |Uk^T tvx| |Uk^T tvy| )   (3.7)

where λ is a user-defined weighting value ranging from 0 to 1, and tx and ty are the
summary tree structures of the documents dx and dy. The λ can be adjusted depending
on the importance of the content and structure in the input collection. The TVSim (Text
Vector-based Similarity) is a cosine measure using a kernel matrix Uk which is constructed
from LSI [15]. The construction of Uk is discussed later in this section.
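The combination in Equations 3.6 and 3.7 can be sketched as follows. The small 3×2 matrix Uk below is a made-up stand-in for an LSI-derived kernel, purely for illustration.

```python
import math

def kernel_cosine(vx, vy, Uk):
    """Cosine similarity of vectors vx, vy after projection by the
    kernel: (vx^T Uk Uk^T vy) / (|Uk^T vx| |Uk^T vy|)  (Eq. 3.7)."""
    k = len(Uk[0])
    def project(v):
        return [sum(Uk[i][j] * v[i] for i in range(len(v)))
                for j in range(k)]
    px, py = project(vx), project(vy)
    dot = sum(a * b for a, b in zip(px, py))
    norm = (math.sqrt(sum(a * a for a in px)) *
            math.sqrt(sum(b * b for b in py)))
    return dot / norm if norm else 0.0

def lcsim(tsim, tvsim, lam):
    """Linear combination of structure and content similarity (Eq. 3.6)."""
    return tsim * (1 - lam) + tvsim * lam

Uk = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # toy 3-term, rank-2 kernel
tvx, tvy = [0.2, 0.0, 0.7], [0.1, 0.3, 0.5]
content = kernel_cosine(tvx, tvy, Uk)
overall = lcsim(0.8, content, lam=0.5)       # structure TSim assumed 0.8
```

With λ = 0.5 the structure and content similarities contribute equally; λ closer to 1 favours the content similarity, as suggested for homogeneous collections.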
3.3.2 The XCTPath Method
Unlike the XCLComb method, the XCTPath method groups the XML documents using
one data model, called the text-path model. Two reasons for using one data model to
represent both the structure and the content are: (1) there is no weighting
value λ to adjust; and (2) it produces a more meaningful content and structure-based
clustering solution, since the relationships between the structure and the content are maintained.
3.3.2.1 The Text-Path Vector Model
To represent the structure and the content of XML documents, the XCTPath method uses
text paths. A text path contains structure along with its content, in a way similar
to Yao et al. [83]. Given a collection of XML documents D = {d1, d2, ..., dn}, a set of text
paths TPV = {tp1, tp2, ..., tpm} is extracted from D after stop-word removal and
stemming [58] are performed on the content.
Definition 2. A text path is a partial path or a complete path along with a term that
occurs under the leaf node or under a descendant leaf node of the path. A text path
always starts with the root node and ends with a term.
The text paths of document dj are represented using a vector tpvj = (w1,j, w2,j, ..., wm,j),
where wi,j is the TF-IDF weighting of the text path tpi in document dj, and
m is the number of text paths in the document collection D. Text paths that represent
the content and structure of an XML document can occur in many different path lengths.
The length of the text path can be adjusted by the user to include the desired number of
ancestors for a term. For instance, when the length of a text path increases, the structure
plays a more important role; when it decreases, the content plays a more important role.
A text path with a length of 1 contains only the root node and a term from one of the
root node's descendant leaf nodes; a text path with a length of 2 contains the root node,
one of its immediate child nodes, and a term from a descendant leaf node of that child,
and so forth.
A document whose text paths have a maximum length of 3 therefore contains complete
paths of length less than or equal to 3 along with their content, and partial paths that
contain only the first 3 ancestors of a term, starting from the root node. This approach
is similar to Yao et al. [84]. Take the example of the XML document in Figure 3.6. With
a maximum text-path length of 3, the document contains the following text paths:
conf/id/IE06, conf/title/Conference, conf/title/Knowledge,
conf/title/Discovery, conf/title/Data, conf/title/mining, conf/title/KDD,
<?xml version="1.0"?>
<!DOCTYPE conf SYSTEM "conf.dtd">
<conf id="IE06">
  <title>The 16th ACM SIGKDD Conference on Knowledge Discovery and Data mining (KDD-2010)</title>
  <year>2010</year>
  <editor>
    <person>
      <name>Peter Gavin</name>
      <email>[email protected]</email>
      <phone>61-9828712</phone>
    </person>
  </editor>
  <paper>
    <title>Mining the structure for XML document clustering</title>
    <author>
      <person>
        <name>Susan Smith</name>
        <email>[email protected]</email>
      </person>
    </author>
    <reference>
      <paper>
        <title>A Survey of XML Similarity Measures</title>
        <author>
          <person>
            <name>David MacDonald</name>
            <email>[email protected]</email>
          </person>
        </author>
      </paper>
    </reference>
  </paper>
</conf>
Figure 3.6: An example of a conference XML document
conf/year/2010, conf/editor/person/Peter, conf/editor/person/Gavin, etc.
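As an illustrative sketch (not the thesis implementation; the function name, the whitespace tokenization, and the omission of attribute values are assumptions), the text-path extraction described above can be written with Python's `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

def text_paths(xml_string, max_len=3):
    """Collect text paths whose tag prefix is truncated to max_len ancestors.

    A text path pairs the first max_len tags on a root-to-leaf path with each
    term in the leaf text. Tokenization is plain whitespace splitting; the
    full content pre-processing of Section 4.2 is omitted here.
    """
    root = ET.fromstring(xml_string)
    paths = []

    def walk(node, ancestors):
        tags = ancestors + [node.tag]
        if node.text and node.text.strip():
            prefix = "/".join(tags[:max_len])   # keep only max_len ancestors
            for term in node.text.split():
                paths.append(f"{prefix}/{term}")
        for child in node:
            walk(child, tags)

    walk(root, [])
    return paths

doc = "<conf><title>Knowledge Discovery</title><year>2010</year></conf>"
print(text_paths(doc))
# ['conf/title/Knowledge', 'conf/title/Discovery', 'conf/year/2010']
```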
3.3.2.2 The Non-Linear Measure: TPVSim
The XCTPath method uses a measure called Text Path Vector-based Similarity (TPVSim)
for the document similarity which is defined as follows:
TPVSim(tpv_x, tpv_y) = (tpv_x^T U_k U_k^T tpv_y) / (|U_k^T tpv_x| |U_k^T tpv_y|)    (3.8)
where Uk is the kernel matrix constructed from LSI [15]. Different input XML document
collections will have a different Uk and the Uk in this method is different from the Uk in
the XCLComb method. The next section describes the construction of the Uk in more
detail.
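A minimal sketch of how Equation 3.8 could be computed, assuming U_k is already available as a dense matrix (the toy U_k and vectors below are illustrative, not from the thesis):

```python
import numpy as np

def tpv_sim(tpv_x, tpv_y, Uk):
    """Text Path Vector-based Similarity (Equation 3.8): the cosine of two
    text-path vectors after projection into the LSI semantic space Uk."""
    px = Uk.T @ tpv_x          # project into the k latent dimensions
    py = Uk.T @ tpv_y
    return float(px @ py) / (np.linalg.norm(px) * np.linalg.norm(py))

# Toy example: 3 text-path features reduced to 2 latent dimensions.
Uk = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 1.0]])
x = np.array([1.0, 2.0, 0.0])
y = np.array([0.0, 0.0, 3.0])
print(tpv_sim(x, x, Uk))   # identical vectors give a similarity close to 1.0
print(tpv_sim(x, y, Uk))
```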
3.3.3 The Kernel Construction Approach
To calculate the degree of similarity between text vectors or text-path vectors, a kernel
is used. A kernel is constructed using the Latent Semantic Indexing (LSI) [15]. LSI
can construct a semantic space wherein terms and documents that are closely associated
are placed beside one another, which reflects major associative patterns in the data and
ignores less important influence patterns.
The construction of the kernel is expensive in terms of memory usage (refer to section
2.2.2.1 of Chapter 2 that gives the background on semantic kernel construction) since
it needs to compute the Singular Value Decomposition (SVD). Therefore, this thesis intro-
duces a reduction method called XML Dimensional Document Reduction (XDDR) to
reduce the document dimensionality of a feature-document matrix X_{m×n} to X_{m×n′}, where
a feature is either a term or a text path, m is the number of features, n is the number of doc-
uments in an input XML document collection, and n′ is the reduced number of documents,
which is smaller than n. Each item in the matrix is the frequency of a feature occurring in a
document. This thesis tries to preserve the term dimensionality rather than the document
dimensionality for the grouping of XML documents because the document dimensionality
might not be important in finding the associations between terms.
Algorithm 3 describes the algorithm of the XDDR method. Before the XDDR method is
executed, the input document collection D is first partitioned using one of the structure-
only clustering methods, the XCTree method or the XCPath method proposed in this
thesis. The clustering solution generated from the structure-only clustering method is
then processed by the XDDR method as follows. Let the structure-only clustering solution
be a collection of clusters SC = {sc1, sc2, ..., sck}, where (1) sci = {d1, d2, ..., dn′′} with
n′′ < n, (2) sc1 ∪ sc2 ∪ ... ∪ sck = {d1, d2, ..., dn}, where n is the number of documents
in document collection D, and (3) |sci| <= |sci+1|, i.e. clusters in SC are sorted in ascending order
according to the number of documents that they contain (Line 5 of Algorithm 3). Clusters
containing the smaller number of documents are processed before the larger sized clusters.
Let Ψ be the number of documents to be selected for the current cluster and η be the
number of documents to be selected for each cluster in SC. If the number of documents
in sci, denoted by |sci|, is equal to or less than Ψ, then the documents belonging to sci
are added to a new document collection D′. If |sci| is less than Ψ, the unused portion
of Ψ, denoted by remNum, is distributed evenly across the remaining
unprocessed clusters in SC; Ψ and η are adjusted to account for remNum
(Lines 14 to 25). For the clusters where |sci| > Ψ, the document importance of each
document in cluster sci is calculated. The document importance (DI) of a document in a
cluster is measured as:
Algorithm 3 The XDDR Algorithm
Input: structure-only clustering solution SC = sc1, sc2, ..., sck;
       user-defined number of dimensional document space r for matrix X;
Output: document collection D′ = d1, d2, ..., dn′;
1.  /*η is the number of selected documents for each cluster*/
2.  int η = r/|SC|;
3.  /*Ψ is the number of selected documents for a current cluster*/
4.  int Ψ = η;
5.  sort the clusters in SC in ascending order according to the number
    of documents in the clusters;
6.  document collection D′ = empty;
7.  for each cluster sci ∈ SC
8.    if |sci| > Ψ
9.      calculate the document importance DI for each document ∈ sci;
10.     add Ψ-1 documents with the highest DI to D′;
11.     merge the content of the left-over documents into
        a new document d′;
12.     add d′ to D′;
13.     Ψ = η;
14.   else
15.     add the documents ∈ sci to D′;
16.     /*the unselected portion of Ψ from sci is to be distributed across
17.     the remaining clusters that have not been processed yet*/
18.     int remNum = Ψ − |sci|;
19.     if (remNum > 0)
20.       int distrNum = remNum/the number of clusters left in SC;
21.       /*adjust η and Ψ*/
22.       η = η + distrNum;
23.       Ψ = η + (remNum − (distrNum × the number of clusters left in SC));
24.     end if
25.   end if
26. end for
27. /*document collection D′ = d1, d2, ..., dn′ where n′ <= r < n, where n
28. is the number of documents in the input document collection D*/
29. return D′;
DI(dj) = (Σ_{i=1}^{m′} w_{i,j}) / sqrt(Σ_{i=1}^{m′} w_{i,j}^2)    (3.9)
where m′ is the number of distinct terms extracted from dj and w_{i,j} is the Tf×IDf weighting
of term ti in document dj. Refer to Equation 3.5 for more detail of the Tf×IDf weighting.
The Tf is the ratio of the number of times term ti appears in document dj
to the total number of term occurrences in document dj; the IDf_i is obtained by dividing
the number of documents in sci by the number of documents containing term ti, and then
taking the logarithm of that quotient. A high weight results from a high term frequency
in a given document combined with a low document frequency of the term in cluster sci.
Documents with higher DI values are added to D′ (Line 10 of Algorithm 3); the content
of documents with lower DI values is merged into a new document d′ (Line 11). The
number of documents merged into d′ is equal to |sci| − (Ψ−1).
The output of the XDDR method is the document collection D′ = d1, d2, ..., dn′, where
n′ is less than the number of documents n in the clustering solution SC. D′ is
then converted into a feature-document matrix X_{m×n′}. SVD is then performed on this
matrix to obtain a kernel matrix Uk, which is used in the TVSim measure (Equation 3.7)
or the TPVSim measure (Equation 3.8). Refer to Section 2.2.2.1 of Chapter 2 for more
detail of the SVD method.
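The document-selection step of XDDR, together with the DI measure of Equation 3.9, can be sketched as follows. This is a simplified, hypothetical rendering of Algorithm 3: the even redistribution of the leftover quota across remaining clusters is elided, and the data layout (a dict of clusters holding (document id, Tf×IDf weights) pairs) is an assumption.

```python
import math

def doc_importance(weights):
    """DI (Equation 3.9): sum of a document's Tf-IDf weights divided by
    their Euclidean norm."""
    return sum(weights) / math.sqrt(sum(w * w for w in weights))

def xddr(clusters, r):
    """Reduce a structure-only clustering to about r representative documents.

    For each cluster processed in ascending size order, either keep all its
    documents (small cluster) or keep the top DI-ranked documents and merge
    the rest into one pseudo-document (large cluster)."""
    reduced = []
    eta = max(1, r // len(clusters))            # per-cluster quota
    for docs in sorted(clusters.values(), key=len):   # smaller clusters first
        if len(docs) <= eta:
            reduced.extend(d for d, _ in docs)
        else:
            ranked = sorted(docs, key=lambda d: doc_importance(d[1]), reverse=True)
            reduced.extend(d for d, _ in ranked[:eta - 1])
            # Merge the left-over documents into one pseudo-document.
            merged = "+".join(d for d, _ in ranked[eta - 1:])
            reduced.append(merged)
    return reduced

clusters = {0: [("d1", [1, 1]), ("d2", [3, 4]), ("d3", [0.1, 0.2])],
            1: [("d4", [1, 0])]}
print(xddr(clusters, r=4))  # ['d4', 'd1', 'd2+d3']
```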
3.4 The Hybrid Clustering
The previous sections describe the data model and data similarity for the clustering meth-
ods which have been developed in this research. This section describes the clustering
algorithm employed by the clustering methods. The clustering algorithm proposed in
this thesis is a hybrid clustering algorithm which consists of three stages: incremental
clustering, iteration, and pair-wise clustering. The overview of the algorithm is shown in
Figure 3.7.
[Figure: XML documents flow through data modelling and data similarity into the hybrid
clustering algorithm, whose three stages (incremental clustering, iteration, and pair-wise
clustering) each produce clusters C1, C2, ..., Ck.]
Figure 3.7: The hybrid XML clustering approach overview
It is a hybrid approach as it combines two types of clustering. The first is incremental
clustering, which is used at the beginning to group a large input XML collection into an
intermediate number of clusters. The second is the partitioning method [30], a type of
hierarchical clustering, which uses the pair-wise matrix
as input. The pair-wise matrix is obtained by calculating the data similarity between all
possible pairs of cluster representations generated from the incremental clustering. The
hybrid clustering algorithm also has an iteration stage after the incremental clustering,
which reassigns each document in the input collection to the cluster, among those generated
in the incremental clustering stage, with which it has the maximum data similarity value.
The iteration addresses the sensitivity of the incremental clustering to the input document
ordering. The proposed algorithm utilizes the two types of clustering in order to address
the drawbacks of each: the accuracy problem in the incremental clustering, and the
scalability problem in the pair-wise clustering. The
algorithm of the hybrid clustering is outlined in Algorithm 4. The three stages of the
hybrid clustering are described further in this section.
3.4.1 The Incremental Clustering Stage
The first stage of the proposed clustering algorithm is the incremental clustering (from
Line 2 to 17 of Algorithm 4). This clustering generates the number of clusters at run-time.
The incremental clustering begins with no cluster in the clustering solution C. Therefore,
the first document in the collection D makes a new cluster ci in C and becomes the cluster
representation ri of cluster ci. When there are clusters in the clustering solution C, the
documents in the collection D are compared with the clusters as follows. A document
is compared with the cluster representations using data similarity which is the similarity
measure employed by the proposed clustering methods (refer to Figure 3.1). If the best
data similarity value between a document and a cluster exceeds a user-defined clustering
Algorithm 4 The hybrid clustering algorithm
Input: Document collection D = d1, d2, ..., dn, user-defined number of clusters β,
       clustering threshold α;
Output: clustering solution C = c1, c2, ..., ck;
1.  /*Incremental Clustering*/
2.  for each document dj in dataset D
3.    if clustering solution C is empty
4.      create a new cluster ci in C;
5.      assign dj to ci;
6.      make dj the cluster representation ri of ci;
7.    else
8.      for each cluster ci in clustering solution C
9.        compute the data similarity between dj and ri;
10.     end for
11.     if the highest data similarity value exceeds or equals α
12.       assign dj to the cluster ci having the maximum data similarity value;
13.       change ri if applicable;
14.     else execute steps 4 to 6;
15.     end if
16.   end if
17. end for
18.
19. /*Iteration*/
20. for each document dj in document collection D
21.   for each cluster ci in clustering solution C
22.     compute the data similarity between dj and ri;
23.   end for
24.   assign dj to the cluster ci having the maximum data similarity value;
25. end for
26.
27. if |C| > β
28.   /*Pair-wise Clustering*/
29.   generate a pair-wise matrix by computing the data similarity
30.     between all pairs of cluster representations in C;
31.   perform partitioning clustering on the pair-wise matrix;
32.   reassign the documents to new clusters based on the clustering result of the
      partitioning clustering;
33.   return C = c1, c2, ..., cβ;
34. else return C = c1, c2, ..., ck where k <= β;
35. end if
threshold α, the document is assigned to that cluster; otherwise the document makes a new
cluster in C and becomes the cluster representation of the new cluster. The clustering
process continues until all the documents in the document collection D are grouped into
clusters. The clustering threshold α is defined by the user and is a value that determines
the degree of similarity that a document should have with a cluster in order to assign that
document to the cluster. The clustering threshold value is between 0 and 1, where 1 is the
highest, which indicates that the structure of a document is an exact match or a subset
of the cluster representation.
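The incremental stage described above (Lines 2 to 17 of Algorithm 4) can be sketched as follows. This is a minimal, hypothetical rendering: the data similarity is passed in as a function, the cluster representation is simply the first document of each cluster, and the "change ri if applicable" step is omitted.

```python
def incremental_cluster(docs, sim, alpha):
    """Assign each document to the most similar existing cluster if the best
    similarity reaches threshold alpha; otherwise start a new cluster whose
    representation is the document itself (first document representation)."""
    reps, clusters = [], []
    for d in docs:
        scores = [sim(d, r) for r in reps]
        if scores and max(scores) >= alpha:
            clusters[scores.index(max(scores))].append(d)
        else:
            reps.append(d)       # the new document becomes the representation
            clusters.append([d])
    return reps, clusters

# Toy 1-D "documents": similarity decays with distance.
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
reps, clusters = incremental_cluster([1, 2, 10, 11], sim, alpha=0.5)
print(clusters)  # [[1, 2], [10, 11]]
```

A higher alpha forces more, smaller clusters; a lower alpha merges more documents into existing clusters, which is the sensitivity the threshold controls.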
The incremental clustering requires a representation for the clusters in order to compare
the clustering objects with the new input objects. The selection of the cluster represen-
tation is important as it determines the accuracy of the clustering solution. There are
two types of cluster representations employed by the hybrid clustering: the common path
representation and the first document representation.
The common path representation. The common paths are the paths shared by all the
documents that exist in the cluster. The common path representation is mainly used by the
XCPath method as the method is based on the path model. The term “common” indicates
the degree of similarity between the paths that exceed a user-defined path threshold which
is described in Section 3.2.2.2. The initial cluster representation is based on the common
paths between the first two documents in a cluster. The cluster representation is expanded
by adding paths whose PSim (Equation 3.4) values exceed the path threshold, if those
paths do not already exist in the cluster representation. For a cluster with only one
document, the cluster representation is simply the set of paths of that document.
Since the common path representation is a path model similar to that of the XCPath
method, the data similarity between the set of paths representing a document and the
set of paths representing a cluster is computed using the CPSim measure (Equation 3.3).
For a cluster containing only one document, the paths of that document serve as the
cluster representation and Equation 3.3 is applied directly. If there is more than one
document in a cluster, the common paths of those documents are used as the cluster
representation, and Equation 3.3 is altered to calculate the similarity between a new
document and such a cluster representation. The new measure is defined below:
CPSim(dx, ry) = (Σ_{i=1}^{|dx|} max_{j=1..|ry|} PSim(p_i, p_j)) / |dx|    (3.10)
This measure differs from Equation 3.3 in that, instead of dividing the sum of the PSim
values by the maximum of the number of paths in the document and in the cluster
representation, it divides by the number of paths in the document. The reason is that if
the number of paths in the cluster representation ry is large and the document is a sub-tree
of the cluster representation, the CPSim value produced by Equation 3.3 will be low.
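Equation 3.10 can be sketched as follows. The real PSim is the path similarity of Equation 3.4; the prefix-overlap `psim` below is only a stand-in for illustration.

```python
def cp_sim(doc_paths, rep_paths, psim):
    """CPSim (Equation 3.10): average, over the document's paths, of the best
    PSim match against the cluster representation's paths. Dividing by the
    document's own path count keeps a sub-tree document from being penalized
    by a large cluster representation."""
    if not doc_paths:
        return 0.0
    return sum(max(psim(p, q) for q in rep_paths) for p in doc_paths) / len(doc_paths)

def psim(p, q):
    # Toy path similarity: fraction of leading path steps the two paths share.
    a, b = p.split("/"), q.split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))

print(cp_sim(["conf/title", "conf/year"],
             ["conf/title", "conf/editor/person"], psim))  # (1.0 + 0.5) / 2 = 0.75
```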
The first document representation. The first document representation uses the features
(structure and/or content) of the document that creates a new cluster in the clustering
solution C as the cluster representation, instead of using a common feature such as the
common path representation. The first document representation is used mainly because
extracting the common tree structure from all of the documents in a cluster is
more complicated than to extract the common paths based only on the path model. The
time taken to perform such a task can slow down the clustering process. The idea of
using the documents that make new clusters as the cluster representations is based on the
assumption that a document that makes a new cluster contains some feature that differs
from the existing cluster representations, and that this feature can be used to cluster
new documents with similar features. The first document representation
is employed by the XCTree, XCLComb, and XCTPath methods for the clustering of
XML documents. As the cluster representation is also a document, the data similarity
employed by the clustering methods can be used to calculate the degree of similarity
between a document and a cluster representation.
3.4.2 The Iteration Stage
After the incremental clustering, the clustering process executes an iteration stage, which
reassigns the documents according to the current cluster representations. The iteration
stage does not need the clustering threshold because each document is simply assigned
to the cluster with the highest data similarity value. This stage is important because the
clustering solution generated in the incremental clustering is sensitive to the input ordering.
The iteration stage allows the input documents that were clustered early in the incremental
clustering stage to be compared with the clusters that were generated later. Throughout
this stage, there is no alteration to the cluster representations.
3.4.3 The Pair-wise Clustering Stage
After the iteration, if the generated number of clusters exceeds the user-defined number
of clusters β, then pair-wise clustering is executed. Each pair of cluster representations
in the clustering solution C is compared to calculate the data similarity value, which
generates a pair-wise similarity matrix. This matrix is then used as the input to a
partitioning method [30] for grouping the cluster representations so that the number of
clusters generated equals β. The partitioning clustering method [30] first divides the
matrix into two groups, and then one of these two groups is chosen to be divided further.
The process is repeated until the number of divisions equals the number
of user-defined clusters.
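A divisive split of this kind can be sketched as follows. This is an illustrative sketch only; the actual partitioning method of [30] may choose groups and split them differently. Here each split seeds the two halves with the least-similar pair and assigns every member to the nearer seed.

```python
def bisecting_partition(sim, items, beta):
    """Repeatedly split the largest group in two until beta groups remain.
    sim[i][j] is the pairwise similarity between cluster representations."""
    groups = [list(items)]
    while len(groups) < beta:
        g = max(groups, key=len)
        groups.remove(g)
        # Seed the split with the least-similar pair inside the group.
        a, b = min(((i, j) for i in g for j in g if i < j),
                   key=lambda p: sim[p[0]][p[1]])
        g1 = [i for i in g if sim[i][a] >= sim[i][b]]
        g2 = [i for i in g if i not in g1]
        groups += [g1, g2]
    return groups

# Toy similarity matrix over four cluster representations.
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
print(bisecting_partition(sim, [0, 1, 2, 3], beta=2))  # [[0, 1], [2, 3]]
```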
3.5 Summary
In summary, this chapter has described the clustering methods that have been proposed in
this research. The clustering methods are divided into two types: structure-only clustering,
and content and structure-based clustering. There are two structure-only clustering methods
which are based on the structure of the XML documents. These two methods vary in
terms of data modelling and data similarity. In addition, two content and structure-
based clustering methods are also developed. The two methods are different in terms of
how the content and structure are utilized for the document similarity. The chapter also
introduces a hybrid clustering algorithm which is employed by the proposed clustering
methods. The hybrid clustering algorithm consists of three stages: incremental clustering,
iteration, and pair-wise clustering. The hybrid clustering algorithm uses two different types
of clustering algorithms to group the XML documents. The hybrid clustering algorithm
aims to improve the scalability of the pair-wise clustering by performing clustering on
cluster representations instead of on the whole of the input documents. Also, to improve
the accuracy of the incremental clustering, the hybrid clustering algorithm includes an
iteration stage in order to reduce the sensitivity of the document ordering. The next
chapter is the empirical evaluation of the proposed clustering methods.
Chapter 4

Empirical Evaluation of the Clustering Methods
The previous chapter introduced four clustering methods for the clustering of XML
documents. Two methods are based on the structure-only information, and the other two
methods are based on the content and structure of the XML documents. In this chapter the four
clustering methods are evaluated and analysed.
This chapter describes the XML data collections which have been used for evaluating the
proposed clustering methods, the pre-processing of the data collection, the evaluation met-
rics, and the experimental results. In conclusion, there is the discussion and comparison
of all the clustering methods.
4.1 Data Collection
The proposed clustering methods presented in Chapter 3 focus on the clustering of XML
documents. Table 4.1 shows the XML document collections which have been used for eval-
uating the proposed clustering methods in this thesis. They are a mixture of homogeneous
and heterogeneous collections. The homogeneous collections have XML documents con-
forming to the same structural definition. On the other hand, heterogeneous collections
consist of XML documents conforming to more than one structural definition. All of them
are real life collections. Refer to the Appendix for the detail of the schema definitions of
these collections:
Niagara: This collection is derived from the Niagara Institution for Information
Retrieval System1 testing data. It is a mixture of XML documents conforming
to different schema definitions, as shown in Table 4.2.
Publication: This collection is derived from the Heterogeneous Track in INEX
2005 [17]. This collection relates to the publication domain. The documents are
a subset of four different sources, each of which has a different schema definition:
Berkeley, Computer Science, HCI Bibliography, and DBPUB Bibliography.
DBLP: This collection is also derived from the Heterogeneous Track in INEX 2005 [17].
Even though the collection is derived from the Heterogeneous Track in INEX, in this
thesis the DBLP is treated as a homogeneous collection because the XML documents
in this collection conform to one schema definition. The structure and the content
features of the documents in this collection are small; therefore only a subset of the
1http://www.cs.wisc.edu/niagara/data.html
whole collection is used in the experiments to test the performance of the proposed
clustering methods.
IEEE: This collection comes from the INEX Document Mining Track in 2006 [17].
The IEEE corpus is composed of 12,000 scientific articles from IEEE journals from
year 2002 to 2005. In the 2006 INEX document mining track, a total of 6054
documents were used as a testing collection for the clustering task [17].
Table 4.1: Data collections for XML clustering

XML collection   # documents   # classes   Collection type
Niagara          5289          4           heterogeneous
Publication      460           22          heterogeneous
DBLP             4910          8           homogeneous
IEEE             6054          18          homogeneous
Table 4.2 describes the document classification of the XML document collections. The
table shows that the documents in the Niagara and DBLP are not evenly distributed
across the classes. Some classes only have one or two documents. In the IEEE collection,
on the other hand, the documents are evenly distributed across the classes.
4.2 Data Pre-Processing
This section looks at the pre-processing of the XML document collections shown in
Table 4.1. Two features of the XML documents need to be extracted: the structure and
the content. To extract the features of an XML document, SAX parsing technology is used,
as it is faster than DOM parsing. SAX parses the elements in an XML document one
by one, starting from the root element. The structure is extracted and modelled as described
in Chapter 3.
Table 4.2: The classification of the data collections for XML clustering
(classification: number of documents)

Niagara:
  Movie 37; Actor 37; Department 19; Course 2; Report 1; Automobile 208;
  Bibliography 16; Profile 11; Personal 12; Quote 15; Hospitality message 24;
  Travel 10; Order 10; Auction data 4; Appointment 2; Document page 3;
  Linux How-to documents 12; Bookstore 2; Shake 20; Club 12;
  Catalogue record 1; Medicine Citation 1; Nutrition 37

Publication:
  Berkeley 698; dbpub Bibliography 364; Computer Science 2878;
  HCI Bibliography 1349

DBLP:
  books 1076; conference 2065; journals 1634; miscellaneous 2; persons 11;
  phd 62; technical report 27; world wide web 33

IEEE:
  IEEE Annals of the History of Computing 156; IEEE Computer Graphics and
  Applications 345; Computer 963; IEEE Computational Science and
  Engineering 286; IEEE Design and Test of Computers 273; IEEE Expert 351;
  IEEE Internet Computing 266; IT Professional 133; IEEE Micro 284;
  IEEE MultiMedia 235; IEEE Parallel and Distributed Technology 192;
  IEEE Software 460; IEEE Transactions on Computers 516;
  IEEE Transactions on Parallel and Distributed Systems 396;
  IEEE Transactions on Visualization and Computer Graphics 120;
  IEEE Transactions on Knowledge and Data Engineering 291;
  IEEE Transactions on Pattern Analysis and Machine Intelligence 482;
  IEEE Transactions on Software Engineering 305
The content of the documents is pre-processed as follows:
1. The text of the element nodes and attributes nodes are extracted;
2. The text is tokenized by spaces;
3. The numbers and special symbols are removed;
4. Common words known as stop-words such as the, and, their, my, etc. are removed;
5. The terms are stemmed; and
6. The terms in which their length is lesser than three characters are removed.
The pre-processing of the content is important as it helps remove insignificant terms and
improve the clustering based on the term feature.
Word Removal. In pre-processing, the text in the content of element and attribute nodes
is extracted. Stop-words are removed from the term collection. Stop-words are words such
as articles, prepositions, etc. that are common and have no significant meaning. Also
words that have fewer than three characters are removed, since such short words are
considered insignificant. Special symbols such as !, @, #, %, etc. are also
discarded.
Stemming. Stemming refers to a process of reducing words to their ‘stems’, e.g. ‘banking’
to ‘bank’, ‘flooded’ to ‘flood’. Such reduction maps a series of related terms to a single
common concept.
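The six pre-processing steps can be sketched as follows. This is an illustrative sketch only: the stop-word list is abbreviated, and the crude suffix stripping stands in for a real stemmer (such as Porter's), which the thesis does not name.

```python
import re

STOP_WORDS = {"the", "and", "their", "my", "for", "are"}  # abbreviated stop list

def stem(term):
    # Crude suffix stripping standing in for a real stemming algorithm.
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    """Steps 2-6 of the content pre-processing: tokenize on spaces, drop
    numbers and symbols, drop stop-words and terms shorter than 3 characters,
    then stem."""
    terms = []
    for token in text.lower().split():
        token = re.sub(r"[^a-z]", "", token)   # remove numbers and symbols
        if len(token) < 3 or token in STOP_WORDS:
            continue
        terms.append(stem(token))
    return terms

print(preprocess("Mining the structure for XML document clustering!"))
# ['min', 'structure', 'xml', 'document', 'cluster']
```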
Table 4.3 shows the detail of the pre-processed XML document collections, where # stands
for ‘the number of’. The maximum level and minimum level in Table 4.3 refer to the
number of levels in the hierarchical structure of the XML documents in the data collections.
Based on the pre-processing data in Table 4.3, the structures of the Niagara and IEEE
collections are larger and more complex than those of the Publication and DBLP collections.
The DBLP collection has a relatively small structure with a maximum level of four. In
the experiments, the attribute nodes in the IEEE collection are not processed or used in
the clustering of the documents. Previous experiments [72] have shown that attributes
of element nodes are not important for the clustering of XML documents; moreover, the
inclusion of attributes can lower the accuracy of the clustering solution.
Based on the number of terms after the pre-processing of the content in the document col-
lections, on average each document in the Niagara collection contains around one thousand
eight hundred and eighty two terms; each document in the Publication collection contains
around one hundred terms; each document in the DBLP collection contains around twenty
three terms; and each document in the IEEE collection contains around two thousand nine
hundred and twenty two terms. The documents in the IEEE collection are larger in their
content than the other collections. Not only is the structure of the DBLP collection small
but its term collection is small as well.
Even though the XML document collections used for the evaluation of the proposed clus-
tering methods are not very large in terms of documents, nevertheless the collections have
the following characteristics which make them good candidates for the evaluation of the
proposed clustering methods: (1) the collections are homogeneous as well as heterogeneous;
(2) the distribution of the documents across classes varies; (3) the collections
are real-life documents in XML format; and (4) The documents in the collections vary in
structure and content.
Table 4.3: Details of the pre-processed data collections

                  Niagara   Publication   DBLP     IEEE
#internal nodes   100682    48908         9820     922246
#leaf nodes       383810    122835        41195    2823062
#attributes       6067      54805         11288    -
#complete paths   390870    181589        52498    3156564
maximum level     16        6             4        19
minimum level     2         3             2        2
#terms            865846    532913        116960   17692610
#distinct terms   35826     40588         22259    224099
4.3 Evaluation Metric
To evaluate the proposed clustering methods, a number of evaluation metrics are used in
this thesis. The evaluation metrics are for evaluating the quality of the clustering solutions.
Usually, clustering is an off-line process; therefore the accuracy of the clustering solution
is more important than the speed of the clustering methods.
There are a number of evaluation metrics available for document clustering. This thesis
uses three commonly used evaluation metrics to evaluate the quality of a clustering solu-
tion. They are Purity, Normalized Mutual Information (NMI), and F1-score [14]. Another
commonly used metric is the Entropy. The calculation of the NMI measure also considers
the Entropy metric, therefore the Entropy metric is not used directly in this research.
The evaluation metrics are used to calculate the external quality of the clustering solution
based on the comparison of cluster classes to known external classes. The values of the
evaluation metrics range from 0 to 1, where 1 is a perfect clustering solution.
4.3.1 Purity
Given a set of clusters Ω = {s1, s2, ..., sK} and a set of classes C = {c1, c2, ..., cR}, the
purity (P) [49] of a cluster sk is defined as:

P(sk) = max_r(n_k^r) / n_k    (4.1)
where n_k^r is the number of documents of class r in cluster k and n_k
is the number of documents in cluster k. The purity of the clustering solution Ω can
be calculated based on micro-averaging purity (micro purity) and macro-averaging purity
(macro purity). They are defined as:
micro purity(Ω) = (Σ_{k=1}^{K} P(sk) · n_k) / (Σ_{k=1}^{K} n_k)    (4.2)

macro purity(Ω) = (Σ_{k=1}^{K} P(sk)) / R    (4.3)
The micro purity of the clustering solution Ω is obtained as a weighted average of the
individual cluster purities, whereas the macro purity is an unweighted average over them.
The difference is that the micro purity is more concerned with how the documents are
grouped in the clustering solution than with the number of true classes that the clustering
solution has discovered. The problem with the purity metric is that, as the
number of clusters increases, the purity score will continue
to improve until it reaches a perfect score when the number of clusters equals the number
of documents. Therefore, under the purity metric, a clustering solution with more clusters
tends to score higher than one with fewer clusters.
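Equations 4.1 to 4.3 can be computed as follows. This is an illustrative sketch; the cluster and label layout (lists of document ids, a dict mapping ids to classes) is an assumption. Note that the macro purity here divides by the number of classes R, following Equation 4.3.

```python
from collections import Counter

def micro_macro_purity(clusters, labels):
    """Purity (Equations 4.1-4.3). `clusters` is a list of lists of document
    ids; `labels` maps a document id to its true class."""
    purities, weighted, total = [], 0.0, 0
    for docs in clusters:
        counts = Counter(labels[d] for d in docs)
        p = max(counts.values()) / len(docs)   # Equation 4.1
        purities.append(p)
        weighted += p * len(docs)
        total += len(docs)
    n_classes = len(set(labels.values()))      # R in Equation 4.3
    return weighted / total, sum(purities) / n_classes

labels = {"a": 0, "b": 0, "c": 1, "d": 1}
micro, macro = micro_macro_purity([["a", "b", "c"], ["d"]], labels)
print(micro, macro)  # 0.75 and about 0.833
```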
4.3.2 Normalized Mutual Information
The Normalized Mutual Information (NMI) is considered an improvement over the
purity metric. The NMI metric balances the quality of the clustering solution against the
number of clusters. It is defined as:

NMI(Ω, C) = I(Ω; C) / ([H(Ω) + H(C)] / 2)    (4.4)
where I is the mutual information, with the probabilities estimated by maximum likelihood;
it is defined as:

I(Ω; C) = Σ_k Σ_r Pro(sk ∩ cr) log [ Pro(sk ∩ cr) / (Pro(sk) Pro(cr)) ]    (4.5)

        = Σ_k Σ_r (|sk ∩ cr| / N) log [ N |sk ∩ cr| / (|sk| |cr|) ]    (4.6)

where Pro(sk), Pro(cr), and Pro(sk ∩ cr) are the probabilities of a document being in
cluster k, in class r, and in the intersection of k and r, respectively.
H is the entropy, defined as:

H(Ω) = −Σ_k Pro(sk) log Pro(sk)    (4.7)

     = −Σ_k (|sk| / N) log (|sk| / N)    (4.8)
Refer to Christopher et al. [14] for more details of this metric. The mutual information
metric used alone suffers the same problem as the purity metric; however, normalizing it
by the entropy fixes this problem, since the entropy tends to increase with the number of
clusters. For example, H(Ω) reaches its maximum log N for K = N, which ensures the NMI is low for
K = N, where N is the number of documents in a data collection.
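Equations 4.4 to 4.8 can be computed as follows, using the same illustrative cluster and label layout as before:

```python
import math
from collections import Counter

def nmi(clusters, labels):
    """NMI (Equation 4.4): mutual information between clusters and classes,
    normalized by the average of the two entropies."""
    n = sum(len(c) for c in clusters)
    class_counts = Counter(labels.values())
    mi = 0.0
    for docs in clusters:                       # Equations 4.5/4.6
        overlap = Counter(labels[d] for d in docs)
        for cls, cnt in overlap.items():
            mi += (cnt / n) * math.log(n * cnt / (len(docs) * class_counts[cls]))
    # Equations 4.7/4.8 for the cluster and class entropies.
    h_omega = -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
    h_c = -sum((v / n) * math.log(v / n) for v in class_counts.values())
    return mi / ((h_omega + h_c) / 2)

labels = {"a": 0, "b": 0, "c": 1, "d": 1}
print(nmi([["a", "b"], ["c", "d"]], labels))  # 1.0 for a perfect clustering
```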
4.3.3 F1-Score
Another evaluation metric is the F1-Score (F1) [17]. The difference between F1 and NMI is
that NMI measures how many documents in a class are discovered by a cluster, whereas
F1 measures both how many documents are correctly grouped together in a cluster
and how many documents are misclassified. The F1-Score of a clustering
solution is calculated using recall and precision, each of which can be calculated using
micro-averaging or macro-averaging.
Given a class cr and a cluster sk, the true positives (TPr) are defined as the number of
documents of class cr that appear in cluster sk; the false positives (FPr) are defined as the
number of documents not in class cr that appear in cluster sk; and the false negatives
(FNr) are defined as the number of documents of class cr that do not appear in cluster
sk. The precision and recall for the micro-averaging F1 (micro F1) are defined as:
precision = Σ_{r=1}^{R} TPr / Σ_{r=1}^{R} (TPr + FPr)    (4.9)

recall = Σ_{r=1}^{R} TPr / Σ_{r=1}^{R} (TPr + FNr)    (4.10)
The precision and recall for the macro-averaging F1 (macro F1) are defined as:

precision = (1/R) Σ_{r=1}^{R} [ TPr / (TPr + FPr) ]    (4.11)

recall = (1/R) Σ_{r=1}^{R} [ TPr / (TPr + FNr) ]    (4.12)
Precision measures how many of the documents grouped into a cluster actually belong
there, whereas recall measures how many of the documents that belong together are
captured by the cluster. Based on the recall and precision, F1 is defined as:

F1 = 2 (precision × recall) / (precision + recall)   (4.13)
To obtain the micro-averaging F1 value, the recall and precision in equations 4.9 and 4.10
are used in the F1. On the other hand, to obtain the macro-averaging F1 value, the recall
and precision in equations 4.11 and 4.12 are used. The micro-averaging F1 measures the
quality of the clusters in a clustering solution without considering the number of document
classes in the calculation of the recall and precision. The macro-averaging F1, on the other
hand, measures the quality of the overall clustering solution, taking the number of
document classes into consideration when calculating the recall and precision.
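Given per-class counts of true positives, false positives and false negatives, the micro- and macro-averaged F1 scores can be sketched as follows. This is a generic illustration of equations 4.9 to 4.13, not the implementation used in the experiments:

```python
def f1(p, r):
    """Eq. 4.13: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    """counts: list of (TP_r, FP_r, FN_r) tuples, one per class r.
    Micro-averaging (Eq. 4.9/4.10) pools counts over all classes;
    macro-averaging (Eq. 4.11/4.12) averages the per-class ratios."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1(tp / (tp + fp), tp / (tp + fn))

    R = len(counts)
    macro_p = sum(t / (t + f) if t + f else 0.0 for t, f, _ in counts) / R
    macro_r = sum(t / (t + f) if t + f else 0.0 for t, _, f in counts) / R
    return micro, f1(macro_p, macro_r)

micro, macro = micro_macro_f1([(5, 1, 2), (3, 2, 1)])
```

Note that micro-averaging weights every document equally, so large classes dominate, while macro-averaging weights every class equally, which matches the discussion above about taking the number of document classes into consideration.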
4.4 Benchmarks
The experiments for the proposed clustering methods are carried out to evaluate the
following:
The structure-only clustering methods
– The sensitivity of the methods to the clustering threshold and the path threshold.
– The scalability of the XCTree and XCPath methods in terms of processing time,
measured in seconds.
– The comparison of the clustering solutions of the XCTree and XCPath methods
through the three different stages of the hybrid clustering algorithm: incremental
clustering, incremental clustering plus iteration, and pair-wise clustering.
– The comparison of the XCTree and XCPath methods with the XCLS method [46]
and the Zhang & Shasha tree edit distance [89]. XCLS extends a transactional data
clustering algorithm to the clustering of XML documents. In contrast, the Zhang &
Shasha tree edit distance computes the cost of transforming one tree into another
tree, and uses this cost to compute the similarity between trees. The XCLS method
and the Zhang & Shasha tree edit distance are chosen for the methods comparison in
this section because XCLS is a fast method that outperforms many methods based
on incremental clustering, whereas the Zhang & Shasha tree edit distance belongs
to the family of tree edit distance methods based on the tree structure model.
The content and structure-based clustering methods
– The sensitivity of the reduced document dimension k for the kernel matrix Uk and
the reduced number of documents n′ for the feature-document matrix Xm×n′ .
– The sensitivity of the λ in the XCLComb method and the sensitivity of the path
length in the XCTPath method.
– The comparison of the XCLComb method, where the weight for the content
similarity value is one (content-only clustering solution), with the CLUTO repeated
bisection algorithm [30] using TF-IDF weighting and with the CLUTO repeated
bisection algorithm using BM25 weighting on the content-only information.
– The comparison between the XCLComb method and the XCTPath method, that
is, between the linear combination and the non-linear combination of the structure
and content of XML documents.
The comparison of all the clustering methods, which also includes the XCLS method [46],
the Zhang & Shasha tree edit distance method [89], and the CLUTO repeated bisection
method [30].
4.5 Results of Experiments
This section evaluates the performance of the clustering methods proposed in this thesis.
Firstly, the structure-only clustering methods are evaluated and analysed. Then the
results of the experiments on the content and structure-based clustering methods are
presented. Finally, the section ends with a comparative evaluation of both the
structure-only clustering methods and the content and structure-based clustering methods.
4.5.1 Analysing the Structure-only Clustering Methods
This section presents the results and evaluations of the two structure-only clustering meth-
ods: XCTree and XCPath. It evaluates the effect of the different clustering thresholds,
the scalability, the effect of the path thresholds on the XCPath method, the clustering
solution at the different stages of the hybrid clustering algorithm, and the comparison of
the methods.
4.5.1.1 Clustering Threshold
The hybrid clustering algorithm presented in Chapter 3 Section 3.3 is affected by a
clustering threshold used in the incremental clustering stage. Figure 4.1 shows the effect of
the different clustering thresholds on the XCTree and XCPath methods. The result of the
XCPath method is based on the path threshold of 0.7. The results in Figure 4.1 are based
on the clustering solutions after the iteration stage, because for some collections the number
of clusters generated by the XCTree and XCPath methods in the incremental clustering
stage is less than the user-defined number of clusters. It can be ascertained
from the results in Figure 4.1 that the clustering algorithm performs better with a high
clustering threshold. The reason is that a high clustering threshold maximizes the closeness
of the documents in the same cluster: it increases the similarity of the documents within a
cluster (intra-cluster similarity) and decreases the similarity of the documents in different
clusters (inter-cluster similarity). The results show that the XCTree method (Figure 4.1(a))
is affected by the clustering threshold more than the XCPath method (Figure 4.1(b)),
because the XCPath method is based on the common path representation, whereas the
XCTree method is based on the first document representation. The common path
representation consists of all the common structures of the documents held within a cluster;
therefore the ordering of the input documents in a collection does not affect the XCPath
method much. With the common path representation, the XCPath method can achieve a
perfect macro F1 value for the Niagara collection when the number of clusters is not refined
down to twenty-two clusters. This shows that documents in different classes of the Niagara
collection share distinct common structures.
[Plots: macro F1 vs. clustering threshold (0.1 to 0.9) for the Niagara, Publication, DBLP and IEEE collections. (a) XCTree (b) XCPath]
Figure 4.1: The effect of the clustering threshold on the XCTree and XCPath methods.
4.5.1.2 Scalability
With respect to scalability, the XCTree method is more efficient than the XCPath method:
it computes the structural similarity using the tree structure rather than the individual
paths in a tree, as shown in Figure 4.2. The time taken to process the IEEE collection
is longer than for the other collections as its structure is much larger. It is significantly
expensive to process the IEEE collection when the clustering threshold is around 0.7 and
above. From 0.7 upward, the number of clusters of the IEEE collection generated at the
incremental clustering stage grows from 11 to 65, and to 1,399 at the clustering threshold
of 0.9, as shown in Table 4.4 for the XCTree method. As the number of clusters increases,
so does the time taken to compute them. Since the time taken to run the XCPath method
on the IEEE collection from 0.7 upward is significantly long, no result is shown in
Figure 4.2 or Table 4.4. The DBLP collection takes less time to process than the Niagara
collection even though it has more documents. The reason is that the structure of the
documents in the DBLP collection is small; therefore, the number of clusters generated by
the clustering method is fewer when compared to the Niagara collection. The XCTree
method generates fewer clusters than the XCPath method on the same document
collections, as shown in Table 4.4. The XCPath method works with paths and therefore has
more features to consider in calculating the structural similarity between documents than
the XCTree method.
[Plots: processing time (sec) vs. clustering threshold (0.1 to 0.9) for the XCTree and XCPath methods. (a) Niagara (b) Publication (c) DBLP (d) IEEE]
Figure 4.2: The processing time of the structure-only clustering methods.
Table 4.4: The number of clusters generated at the incremental clustering stage with different clustering thresholds.

                       0.1    0.3    0.5    0.7    0.9
XCTree   Niagara        16     23     26     34     53
         Publication     3      3      5     12     56
         DBLP            1      1      1      4     15
         IEEE            1      4     11     65   1399
XCPath   Niagara        29     40     48     51     66
         Publication     8     12     18     44    198
         DBLP            8      8      8     14     14
         IEEE            1     13    104      -      -
4.5.1.3 Path Threshold
The XCTree method is affected only by the clustering threshold, whereas the XCPath
method is affected by the path threshold as well as the clustering threshold. The path
threshold is the lowest similarity value between two paths for these paths to be considered
similar.
The effect of the path threshold is shown in Figure 4.3, using the clustering threshold of 0.9.
The path threshold was not analysed on the IEEE collection because there is no result at
the clustering threshold of 0.9, as mentioned in Section 4.5.1.2; the XCPath method takes a
significantly long time to run on the IEEE collection. From the figures, the optimum path
threshold is around 0.7, since it is more flexible than a path threshold of 0.9. A path
threshold of 0.9 is equivalent to requiring that two paths be exactly the same for them to be
considered common paths. The path threshold at the clustering threshold of 0.9 performs
differently on the Niagara and Publication collections. The path threshold does not affect
the clustering solutions of the Niagara collection much because the Niagara collection
performs equally well with a low clustering threshold, as shown in Figure 4.1. On the other
hand, the clustering solutions of the Publication collection improve significantly after the
path threshold of 0.5, showing that the documents from different classes in the Publication
collection are much closer than the documents in the Niagara collection.

The DBLP collection does not change much with the different path thresholds at the
clustering threshold of 0.9, as shown in Figure 4.3. Therefore, the effect of the path
threshold on the DBLP collection is analysed further using different clustering thresholds,
as shown in Figure 4.4. The results show the effect of the different path thresholds at the
different clustering thresholds (CT) using the macro F1 and NMI metrics. With a clustering
threshold of 0.5 and over, the path threshold does not affect the clustering solution of the
DBLP collection at all. This happens because the structure of the documents in the DBLP
collection is small and the structure of the documents from different classes is closely
related, especially since the path matching begins at the leaf nodes (refer to the
[Plots: accuracy (micro purity, macro purity, micro F1, macro F1, NMI) vs. path threshold (0.1 to 0.9). (a) Niagara (b) Publication (c) DBLP]
Figure 4.3: The effect of the path thresholds with the clustering threshold of 0.9.
Appendix for the schema definition of the DBLP collection). Therefore, using a low path
threshold with a high clustering threshold still produces the same clustering solution as a
high path threshold. For instance, when the structure of the documents from different
classes is closely related, the lowest path similarity value between the document structures
from the classes may be around 0.8. In this scenario, the clustering solution with a path
threshold of 0.1 will be the same as the clustering solution with a path threshold of 0.8 at
the same clustering threshold.
[Plots for the DBLP collection: (a) macro F1 and (b) NMI vs. path threshold (0.1 to 0.9) at clustering thresholds CT-0.1 to CT-0.9.]
Figure 4.4: The effect of the path threshold with different clustering thresholds on the XCPath method.
4.5.1.4 Three Stages of the Hybrid Clustering Algorithm
Figures 4.5 and 4.6 show the results of the clustering solutions at the three different stages
of the hybrid clustering algorithm for the XCTree and XCPath methods, respectively.
There are three stages in the hybrid clustering algorithm: incremental clustering, iteration
and pair-wise clustering. A number of observations can be made from the results in
Figures 4.5 and 4.6. The first observation is that, for most evaluation metrics and most of
the data collections, with an unrestricted number of clusters generated at the incremental
clustering stage, the accuracy of the clustering solution is much better than when the
number of clusters is refined down to the user-defined number of clusters. The results in
Figures 4.5 and 4.6 show that the incremental clustering is able to discover the natural
groupings that exist in the data collection. However, forcing the clusters down to a
required number of clusters produces a less effective clustering solution than that of the
incremental clustering stage.
The second observation is that the clustering solutions produced using the incremental
clustering with the iteration stage do not improve much. This shows that, for these data
collections, the first document representation employed by the XCTree method and the
common path representation employed by the XCPath method are not affected much by
the document ordering in the data collections. Therefore, there is little difference between
the clustering solutions generated from the incremental clustering stage and those
generated from incremental clustering plus the iteration stage. Nevertheless, there is a
small improvement of the clustering solutions on the DBLP collection, shown in
Figure 4.5(c). This improvement shows that the iteration stage is useful for documents in
which the structures from different classes are highly related to one another, making the
incremental clustering sensitive to the ordering of the input documents. There is no
improvement on the DBLP collection using the XCPath method because, with the common
path representation, the iteration is not required: the representation is a global
representation of the common paths of the documents within a cluster. In terms of time
complexity, the incremental clustering alone is O(n log n), but with the iteration it becomes
O((n log n) × 2); therefore, the iteration stage may not be needed in the hybrid clustering
algorithm, especially for the common path representation.
The final observation concerns the NMI measure, which takes the entropy of the clustering
solution into account in its calculation. Except for the clustering solution on the IEEE
collection in the XCTree method and the Niagara collection in the XCPath method, the
NMI tends to be higher than the other evaluation metrics in the final results of the
structure-only clustering methods. The entropy term in the NMI measure tends to increase
as the number of clusters increases. The NMI value is low for the IEEE collection in the
final solution of the XCTree method because the clustering solutions generated in the
previous stages (incremental clustering and iteration) are poor, and the structural
relationships between the documents in the IEEE collection overlap so much between
clusters that the entropy value in the NMI metric for the final solution of the XCTree
method is still very high. The same applies to the Niagara collection: even with the perfect
solution generated at the incremental clustering and the iteration stages, the entropy in
the NMI measure is high when the number of clusters is refined down to the user-defined
number of clusters in the final clustering solution of the XCPath method.
[Plots: accuracy of the five evaluation metrics (micro purity, macro purity, micro F1, macro F1, NMI) at the three stages: incremental clustering; incremental clustering + iteration; XCTree. (a) Niagara (b) Publication (c) DBLP (d) IEEE]
Figure 4.5: The accuracy of the clustering solution at the three stages of the XCTree method at the clustering threshold of 0.9.
[Plots: accuracy of the five evaluation metrics at the three stages: incremental clustering; incremental clustering + iteration; XCPath. (a) Niagara (b) Publication (c) DBLP]
Figure 4.6: The accuracy of the clustering solution at the three stages of the XCPath method at the clustering threshold of 0.9 and the path threshold of 0.7.
4.5.1.5 Methods Comparison
The accuracy of the structure-only clustering methods proposed in this thesis is compared
to that of two other structure-only clustering methods, namely XCLS [46] and the Zhang &
Shasha tree edit distance [89]. Generating a pair-wise similarity matrix using the Zhang &
Shasha tree edit distance is time consuming; thus, in the experiments, the hybrid clustering
algorithm proposed in this thesis uses the Zhang & Shasha tree edit distance to generate
the clustering solution. The accuracy of the results is shown in Figure 4.7. The results of
the XCPath method are based on the path threshold of 0.7. The XCPath, XCTree and
Zhang & Shasha methods all use the clustering threshold of 0.9. From the results, the
Zhang & Shasha tree edit distance method performs the worst of the methods. The reason
might be that the Zhang & Shasha tree edit distance is not well suited to incremental
clustering. The XCTree method performs consistently better than the other methods on
most data collections.
[Plots: accuracy of the five evaluation metrics for XPATHClust, XTREEClust, XCLS and ZhangShasha_distance. (a) Niagara (b) Publication (c) DBLP (d) IEEE]
Figure 4.7: The comparison of different structure-only clustering methods.
In terms of scalability, the XCLS method is faster than the XCPath, XCTree, and Zhang &
Shasha tree edit distance methods. It takes less than one second to cluster collections such
as the Niagara and DBLP, and less than two minutes for collections such as the Publication
and IEEE, with any clustering threshold. Even with a high clustering threshold of 0.9, the
number of clusters generated by XCLS for the IEEE collection is fewer than the eighteen
classes, suggesting that using the summary representation for the clusters tends to draw
documents to the cluster with a large representation. Since XCLS uses the global structure
of the documents within a cluster as the cluster representation, the method tends to be
faster than the other clustering methods because it creates far fewer clusters, especially for
the IEEE collection.
The hybrid clustering algorithm takes longer to compute the Zhang & Shasha tree edit
distance measure for document clustering than the measures employed by the XCTree and
XCPath methods. The structure-only clustering methods proposed in this thesis and the
Zhang & Shasha tree edit distance are slower than the XCLS method; however, they can
exploit the structure of the XML documents in more detail and therefore tend to create
more clusters. The proposed structure-only clustering methods and the Zhang & Shasha
method are therefore more applicable to applications such as schema matching and XML
data integration, whereas XCLS is more useful in the information retrieval area where
speed is important.
4.5.2 Analysing the Content and Structure-based Clustering Methods
In addition to the structure-only clustering methods, this thesis also proposes two content
and structure-based clustering methods: the first is a linear combination of the structure
and content measures and the second is a non-linear combined method.
4.5.2.1 Kernel
The content in the XCLComb method and the text paths utilized by the XCTPath method
are calculated using the kernel described in Chapter 4. The kernel is sensitive to the
selection of the k value, which is the reduced dimensional document space for the kernel
constructed by the SVD method. Figure 4.8 shows the sensitivity of the different k values
against the reduced number of documents n′ for the feature-document matrix X. The
results in Figure 4.8 are generated by the XCLComb method with λ equal to 1
(content-only measure), using the clustering threshold of 0.9 for the DBLP and Publication
collections and the clustering threshold of 0.7 for the IEEE collection. Here, the analysis is
based on the NMI and macro F1 values. Based on the results in Figure 4.8, the optimal
reduced number of documents for matrix X is 1500 with k values between 600 and 800.
The Publication collection performs better with more documents selected and a high k
value. As for the IEEE collection, it performs best from 1000 documents upward with k
values between 600 and 800. As for the DBLP collection, its performance patterns are
more irregular than those of the other two collections. The best result is with 2500
documents and a k value of 800, or with 1500 documents and a k of 1000. For all k values,
the performance increases at the reduced number of 1000 documents, decreases as the
number of documents increases further, and then improves again at 2500 documents. The
k value of 200 gives the worst result for most reduced numbers of documents for matrix X
and for all collections. Figure 4.8 does not show the results for the Niagara collection
because the collection is too small to analyse; however, the best k value for the Niagara
collection is also around 800, with a reduced number of 300 documents. Based on the
results in Figure 4.8, for all the experiments presented in this section the XCLComb and
XCTPath methods use the kernels with a k value of 800 and a reduced number of
documents of 1500 for the IEEE and Publication collections, a k value of 800 and a reduced
number of documents of 2500 for the DBLP collection, and a k value of 800 and a reduced
number of 300 documents for the Niagara collection.
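As an illustration of the kind of construction involved, the following sketch builds a rank-k document kernel from a feature-document matrix via a truncated SVD, in the style of latent semantic analysis. The exact construction of the semantic kernel Uk in this thesis is defined in the earlier chapters and may differ; reduced_kernel is a hypothetical helper name:

```python
import numpy as np

def reduced_kernel(X, k):
    """Build a document-document kernel from a feature-document matrix X
    (m features x n' documents) using a rank-k truncated SVD.  A generic
    LSA-style sketch, not necessarily the thesis' exact kernel."""
    # X ~ U_k S_k V_k^T, keeping only the top-k singular triplets
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k].T * s[:k]   # documents projected into the k-dim latent space
    return Vk @ Vk.T        # kernel: pairwise inner products in latent space

rng = np.random.default_rng(0)
X = rng.random((50, 10))    # 50 features, 10 documents
K = reduced_kernel(X, k=5)
print(K.shape)              # (10, 10)
```

The kernel entry K[i, j] then plays the role of a content similarity between documents i and j that captures term associations beyond exact term overlap.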
[Plots: macro F1 and NMI vs. reduced number of documents (500 to 2500) for k = 200, 400, 600, 800, 1000. (a, b) Publication (c, d) DBLP (e, f) IEEE]
Figure 4.8: The sensitivity of the k value on the kernel.
4.5.2.2 Weighting in the XCLComb Method
The XCLComb method combines the structure measure and the content measure using the
weight λ, which is in the range of 0 to 1. A higher λ value indicates that more importance
is given to the content measure, and vice versa. When λ equals 0, the clustering is based
solely on the structure measure; when λ equals 1, it is based solely on the content measure.
Figure 4.9 shows the effect of λ on the data collections. For the Publication collection, λ
does not make much impact after 0.1. This means that even though the Publication
collection is a heterogeneous collection, the inclusion of the content also plays a role in
finding the true grouping of the documents; however, the collection does not require a high
weighting for the content similarity value to obtain a good clustering solution. For the
Niagara and DBLP collections, the accuracy of the clustering solutions declines when λ is
high, showing that the content similarity value is not as important as the structural
similarity value. This decline in accuracy shows that the structure of the documents in the
Niagara collection is more important than the content, which is expected because the
Niagara collection is a heterogeneous collection. On the other hand, even though the DBLP
collection is a homogeneous collection, the inclusion of the content degrades its
performance, showing that the structure is more distinguishable in this collection than the
content. For the IEEE collection the impact is the inverse: increasing λ improves the
clustering solution. This is expected because the IEEE collection is a homogeneous
collection where the content of the documents is more distinguishable than the structure.
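Assuming the linear weighting combination takes the common form (1 − λ) · structure + λ · content, which is consistent with λ = 1 being the content-only measure, the combination can be sketched as follows (the exact XCLComb formula is defined in the earlier chapters):

```python
def combined_similarity(struct_sim, content_sim, lam):
    """Linear weighting combination assumed to be of the form
    (1 - lam) * structure + lam * content, so lam = 0 uses only the
    structural similarity and lam = 1 only the content similarity."""
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * struct_sim + lam * content_sim

print(combined_similarity(0.8, 0.2, 0.0))  # structure only -> 0.8
print(combined_similarity(0.8, 0.2, 1.0))  # content only -> 0.2
```

Under this form, the per-collection behaviour above amounts to tuning λ toward 0 for collections whose structure is more discriminative (Niagara, DBLP) and toward 1 where the content is (IEEE).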
[Plots: accuracy of the five evaluation metrics vs. λ (0.1 to 0.9). (a) Niagara (b) Publication (c) DBLP (d) IEEE]
Figure 4.9: The effect of the λ on the XCLComb method.
4.5.2.3 Content-Only Comparison
Figure 4.10 shows the performance of the different content-only clustering methods. The
XCLComb method uses the clustering threshold of 0.9 and a λ of 1 (content-only measure).
The clustering solution is compared to the repeated bisections (rbr) method in CLUTO [30]
with different term weightings, TF-IDF and BM25. Based on the results in Figure 4.10, for
most data collections the results generated by the XCLComb method outperform the
clustering solutions generated by the CLUTO method using TF-IDF weighting; however,
the CLUTO method with BM25 weighting outperforms the XCLComb method on the
IEEE collection. The XCLComb method also uses TF-IDF weighting but, unlike the
CLUTO method, the term features of the documents are measured using a kernel, which
can better learn the associations between the term concepts.
The IEEE collection, on the other hand, performs better using BM25 weighting, indicating
that the documents in the IEEE collection vary greatly in document length with regard to
the number of terms. The difference between the BM25 and TF-IDF weightings is that
BM25 weighting has two tuning parameters, b and k1, to tune the impact of document
length and/or the term frequency. Therefore, the IEEE collection works better with the
BM25 weighting than the other collections do. For collections such as the Publication and
DBLP collections, all methods perform almost the same, showing that the term collections
of the documents from different classes have distinct term concepts. For the Niagara
collection, the clustering solution of CLUTO using BM25 weighting performs the worst,
which highlights that the term collections of the documents in different classes of the
Niagara collection vary in term concepts but the lengths of the documents with regard to
the number of terms do not vary much. Therefore, using the kernel works better for the
Niagara collection.
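For illustration, the two term-weighting schemes can be sketched as follows. This uses a common textbook form of BM25 with its tuning parameters k1 and b, which may differ in minor details from the exact variant used in the experiments:

```python
from math import log

def tf_idf(tf, df, n_docs):
    """Plain TF-IDF weight for a term with frequency tf in a document,
    appearing in df of n_docs documents."""
    return tf * log(n_docs / df)

def bm25(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 weight.  k1 tunes term-frequency saturation and b tunes the
    document-length normalization, which is why BM25 copes better with
    collections whose document lengths vary widely, such as IEEE."""
    idf = log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf * (k1 + 1) / (tf + k1 * (1.0 - b + b * doc_len / avg_len))
    return idf * norm
```

With b > 0, the same term frequency contributes less in a document much longer than average, whereas TF-IDF grows linearly with tf regardless of document length.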
4.5.2.4 Path Length in the XCTPath Method
Figure 4.11 shows the results of the XCTPath method with various path lengths, where
TPath 1 in Figure 4.11 means the length of the text path is 1, containing only the root
node and a term, TPath 2 means the length of the text path is 2, and so on. The results of
the XCTPath method are based on the clustering threshold of 0.9. The results on the
IEEE collection again highlight that the BM25 weighting is the best for the IEEE
[Plots: accuracy of the five evaluation metrics for XCLComb (λ=1) and CLUTO content-only with TF-IDF and BM25 weightings. (a) Niagara (b) Publication (c) DBLP (d) IEEE]
Figure 4.10: The comparison of the different content clustering methods.
collection. The result of the XCTPath method on the IEEE collection is lower than that of
CLUTO with the BM25 weighting, showing that the representation of the structure and
content as text paths produces many unrelated concepts which cannot be discovered using
a kernel. The clustering solutions generated using a kernel by the XCTPath method again
outperform the CLUTO method with TF-IDF weighting for most collections. Similar to
the results of the XCLComb method, the results for the Niagara and DBLP collections
improve when the path length increases, showing that including more structure in the text
paths improves the clustering solutions.
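To illustrate the text-path representation, the following sketch extracts (path, term) features in which the path keeps the first `length` element tags on the route from the root to a term, so that a length of 1 keeps only the root node, as described above. This is a hypothetical reconstruction; the thesis' exact tokenization and path construction may differ:

```python
import xml.etree.ElementTree as ET

def text_paths(xml_string, length):
    """Extract (path, term) features from an XML document, truncating
    each path to its first `length` element tags from the root.  A
    sketch of the text-path idea, not the thesis' implementation."""
    root = ET.fromstring(xml_string)
    feats = []

    def walk(node, tags):
        tags = tags + [node.tag]
        if node.text and node.text.strip():
            prefix = "/".join(tags[:length])       # truncated path
            for term in node.text.split():
                feats.append((prefix, term.lower()))
        for child in node:
            walk(child, tags)

    walk(root, [])
    return feats

doc = "<article><title>XML clustering</title><year>2010</year></article>"
print(text_paths(doc, 1))
# [('article', 'xml'), ('article', 'clustering'), ('article', '2010')]
print(text_paths(doc, 2))
# [('article/title', 'xml'), ('article/title', 'clustering'), ('article/year', '2010')]
```

Increasing the path length makes the features more structure-specific, which matches the observed improvement on the Niagara and DBLP collections.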
[Plots: NMI vs. text path length (TPath_1 to TPath_4) for XCTPath, Cluto_bm25 and Cluto_tfidf. (a) Niagara (b) Publication (c) DBLP (d) IEEE]
Figure 4.11: The comparison of the different path lengths of the XCTPath method.
4.5.2.5 Content and Structure-based Methods Comparison.
Figure 4.12 presents the results of the XCTPath and XCLComb methods alongside the Cluto method with the TFIDF and BM25 weightings. The results of the XCTPath and the Cluto methods are based on a text path length of 2 (TPath 2). The results are based on the inclusion of both the structure and the content measures. The clustering solutions generated by the XCLComb method using different λ settings outperform the other methods. The results in Figure 4.12 highlight that the relationships between the structure and content cannot be measured effectively in one data model for document clustering. The XCTPath method with a path length of 2 is the worst of all the methods in the Publication collection. For the Publication collection, combining the structure and content in text paths results in many unrelated feature concepts among which the kernel cannot discover any associations between the text paths.
[Figure: four bar charts, panels (a) Niagara, (b) Publication, (c) DBLP and (d) IEEE, plotting micro purity, macro purity, micro F1, macro F1 and NMI for XCTPath_TPath_2, XCLComb (λ = 0.3, 0.9, 0.1 and 0.9 respectively), CLUTO_tfidf_TPath_2 and CLUTO_BM25_TPath_2.]
Figure 4.12: The comparison of the clustering methods utilizing the semantic kernel.
4.6 Discussion
The previous section presented the experimental results of the structure-only clustering methods and of the content and structure-based clustering methods. Figures 4.13, 4.14, 4.15 and 4.16 display the best results of all clustering methods.
Based on the overall results, a number of findings have been obtained. The first finding is that, for the structure-only clustering methods, clustering based on the tree model and tree similarity by the XCTree method is better than the XCPath method, which is based on the path model and path similarity. However, the XCPath method is not sensitive to the clustering threshold, since it uses the common path representation, which is the global path structure representation of the documents in a cluster. The XCTree method uses the first document representation, which is sensitive to the input document ordering; the method therefore performs best with a high clustering threshold.
The second finding is that in homogeneous collections such as the IEEE collection, the grouping of the documents is based mainly on the content, since the documents conform to the schema definition and the classification of the documents is based on the content. The DBLP collection, on the other hand, is also homogeneous, but since the classification of its documents is based on structure, the structure-only clustering methods such as the XCTree, XCPath and XCLS methods outperform the content-based clustering solutions. For heterogeneous collections such as the Niagara and Publication collections, the structure and the content both play an important role. However, the structure is more distinguishable in heterogeneous collections than in homogeneous collections. The results of the Niagara collection in Figure 4.13 and the results of the Publication collection in Figure 4.14 highlight that the content-only clustering solutions cannot outperform the structure-only clustering solutions or the content and structure-based clustering solutions.
The third finding is that the clustering solutions generated by the XCLComb method outperform those generated by the XCTPath method, since the XCLComb method allows the user the flexibility to adjust the λ weighting of the content and structural similarity values; the content and structure of the document are calculated using two different data models. Even though the XCTPath method also allows users the flexibility to adjust the path length, using one data model to represent both the structure and content creates many unrelated concepts that cannot be discovered efficiently by the XCTPath method when a large path length is used.
The clustering methods proposed in this thesis, and the experiments conducted in this chapter, were designed to investigate the first hypothesis of this thesis: that utilizing both the structure and the content of XML documents can produce a better clustering solution than either the content-only or the structure-only clustering solutions. The results of the experiments in this chapter verify this hypothesis. The results generated by the XCLComb method outperform the clustering solutions generated by the other clustering methods, except for the IEEE collection, in which there is no association between the structure and the content at all.
In terms of scalability, the construction of the semantic kernel is expensive in both time and memory consumption. However, the kernel is useful for clustering algorithms such as incremental clustering, in which the input documents are compared to the cluster representations.
4.7 Summary
To summarize, this chapter has evaluated the clustering methods proposed in the previous chapter with two types of XML data collections: homogeneous and heterogeneous
[Figure: bar chart plotting micro purity, macro purity, micro F1, macro F1 and NMI for XCTPath_TPath_2, XCLComb_λ=0.3, XCLComb_λ=1, CLUTO_tfidf_TPath_2, CLUTO_BM25_TPath_2, CLUTO_tfidf_Content_Only, CLUTO_BM25_Content_Only, XCPath, XCTree and XCLS.]
Figure 4.13: The comparison of all methods on the Niagara collection.
[Figure: bar chart plotting micro purity, macro purity, micro F1, macro F1 and NMI for XCTPath_TPath_2, XCLComb_λ=0.9, XCLComb_λ=1, CLUTO_tfidf_TPath_2, CLUTO_BM25_TPath_2, CLUTO_tfidf_Content_Only, CLUTO_BM25_Content_Only, XCPath, XCTree and XCLS.]
Figure 4.14: The comparison of all methods on the Publication collection.
collections. The evaluation of the clustering methods is based on a number of evaluation metrics. The clustering methods are evaluated in terms of accuracy, as well as the scalability of the hybrid clustering algorithm. This chapter has also analysed the following parameters for the proposed clustering methods described in Chapter 3: (1) the sensitivity of the
[Figure: bar chart plotting micro purity, macro purity, micro F1, macro F1 and NMI for XCTPath_TPath_2, XCLComb_λ=0.1, XCLComb_λ=1, CLUTO_tfidf_TPath_2, CLUTO_BM25_TPath_2, CLUTO_tfidf_Content_Only, CLUTO_BM25_Content_Only, XCPath, XCTree and XCLS.]
Figure 4.15: The comparison of all methods on the DBLP collection.
[Figure: bar chart plotting micro purity, macro purity, micro F1, macro F1 and NMI for XCTPath_TPath_2, XCLComb_λ=0.9, XCLComb_λ=1, CLUTO_tfidf_TPath_2, CLUTO_BM25_TPath_2, CLUTO_tfidf_Content_Only, CLUTO_BM25_Content_Only, XCPath, XCTree and XCLS.]
Figure 4.16: The comparison of all methods on the IEEE collection.
clustering threshold to the hybrid clustering algorithm; (2) the scalability of the hybrid clustering algorithm; (3) the sensitivity of the path threshold of the XCPath method; (4) the sensitivity of the k value on the kernel; and (5) the effect of λ in the XCLComb method. The chapter ends with a discussion and analysis of the experimental results from all clustering
methods.
Chapter 5
XML Transformation Approach
The previous chapter has evaluated and analysed the results of the clustering methods proposed in this thesis. The clustering solutions generated by the proposed clustering methods can be utilized in applications such as XML integration and information retrieval. To further the research in this thesis, one of the clustering methods, namely the XCTree method, is modified and utilized as a pre-processing stage in an XML transformation approach, which is discussed in this chapter. The aim of this chapter is to investigate the second key hypothesis of this thesis: that XML clustering based on the structural information of XML documents can improve the transformation process in terms of time and accuracy for the conversion of more than two source documents into the same target document.
In this chapter, an XML transformation approach is proposed to convert data from a collection of XML documents to another structure format. Unlike other transformation approaches, the proposed approach does not use schema definitions; instead, it uses the summary tree structures of the input source XML documents for generating XSLT scripts. Dealing with large input XML documents can be complex; therefore, the proposed approach first applies the XCTree method to the input source XML documents before entering the actual transformation process. The input source XML documents are grouped into a number of clusters, where each cluster has a global structure summary of the documents within it. The global structure summary acts as a source document which is then used as an input to the transformation process. The proposed XML transformation approach creates an XSLT script for every cluster in the clustering solution. The source documents can use the XSLT script associated with their cluster for the conversion.
This chapter begins with an overview of the stages in the XML transformation approach. The XML transformation approach has four stages: pre-processing, element matching, transformation operator, and XSLT script generator. After explaining the four stages of the XML transformation approach, this chapter then evaluates and analyses the approach.
5.1 The XML Transformation Approach: Overview
The focus of the proposed XML transformation approach is to transform large XML documents into the same target document more efficiently. To simplify the structure integration process of the documents in this research, the transformation problem is addressed using the structure of XML documents from homogeneous collections. Homogeneous collections have documents sharing the same or a similar structure definition. The rationale is that if the transformation approach works for homogeneous collections, then it would also work for heterogeneous collections, since it is easier there to distinguish the documents with similar structures.
The proposed approach is an XML clustering-based transformation (XCTrans). Figure 5.1 illustrates the stages in the XCTrans approach. The input to XCTrans is a collection of XML documents known as source documents. In this thesis, homogeneous collections are used as input collections. Before performing the transformation process, a clustering algorithm is first applied to the input source documents to group them based on their common structures. The XCTrans approach modifies the XCTree method for the grouping of the input source documents. The XCTree method is used rather than the XCPath method because the tree model utilized by the XCTree method preserves the sibling relationships between nodes. Furthermore, the tree model can be easily broken down into paths, which are commonly used in the schema matching stage for finding corresponding nodes between a source and a target.
After the clustering of the source documents, each cluster is represented by a global summary structure of the documents within the cluster. The global summary structures of the clusters in the clustering solution are then used as input to the schema matching stage. The schema matching stage, or element matching stage, is then executed between the global summary structures and a target structure document. The element mapping results generated from the schema matching process are then used in the XSLT script generator stage to create XSLT scripts. Each cluster will have an associated XSLT script for converting the content of the documents in the cluster to the target structure. The generation of the XSLT script in this thesis is inspired by the work of Boukottaya et al. [8].
[Figure: flow diagram — the source XML document collection passes through the clustering process to form clusters C1, C2, …, Cn; schema matching between each cluster's summary and the target document produces the element mapping result, which drives the transformation operators and XSLT scripts; the transformation processor then produces the target XML documents.]
Figure 5.1: The XCTrans approach.
5.2 Pre-processing
The first stage of the XCTrans approach is the pre-processing. This stage involves the
clustering of the source documents into a number of clusters. The XCTree method is
utilized by the XCTrans for the clustering of the source documents. The summary tree
structure in the XCTree method is extended with addition of the quantifiers for the nodes.
For example, take the structure tree c in Figure 5.2, the depth-first string format of c
will be company(1,1) address(1,1) cname(1,1) -1 personnel(1,1) person(1,2) name(1,1)
first(1,1) last(1,1) -1 -1 -1 -1, where each node is associated with two numbers between
the brackets. The first number indicates the minimum occurrence that the node appears
under its parent node, and the second number indicates the maximum occurrence that
the node appears under its parent node.
The cluster representation of the XCTree method is also modified for it to be utilized in the XCTrans approach. Instead of using the first document representation, it uses a summary tree structure called the global summary tree structure, which consists of all the structures of the documents in a cluster. An example of the global summary tree structure is shown in Figure 5.3, which is extracted from the collection of documents in Figure 5.2. Each element in the global summary structure is associated with two numbers. These numbers are used to identify the quantifier, or cardinality operator, of the element node. For example, in a DTD schema definition, the + quantifier indicates that an element can appear in its parent content model one or more times, * indicates zero or more times, and ? indicates zero or one time. The first number appearing with an element in the summary structure indicates the number of times the element appears in the documents of the cluster. This number helps to identify the existence of quantifiers in the elements. To identify the minimum occurrence of an element node, the first number is divided by the number of documents in the cluster. If the result is less than 1 then the element is optional; otherwise it must exist at least once. The second number indicates the maximum number of times the element appears under its parent in any of the documents in the cluster. If this number is greater than 1 then the element can occur multiple times. For example, the person(3,2) node in Figure 5.3 has the two numbers 3 and 2 associated with it. These numbers indicate that the person node has the + quantifier because (1) the division of the first number by the number of documents in the cluster yields a value equal to 1, and (2) the second number is larger than 1.
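The quantifier inference just described can be sketched as follows; the function name is an assumption, and the mapping to the DTD quantifiers follows the rules above (minimum occurrence from count/n_docs, repetition from the maximum occurrence):

```python
# Sketch of the quantifier inference for a node annotated (count, max_occ) in
# the global summary tree of a cluster containing n_docs documents:
#   count / n_docs < 1  -> the node is optional (minimum occurrence 0)
#   max_occ > 1         -> the node may repeat under its parent
# DTD quantifiers: none (exactly once), ? (zero or one), * (zero or more),
# + (one or more).

def infer_quantifier(count, max_occ, n_docs):
    optional = count / n_docs < 1
    repeats = max_occ > 1
    if optional and repeats:
        return "*"
    if optional:
        return "?"
    if repeats:
        return "+"
    return "none"
```

For the person(3,2) node in a three-document cluster this yields +, and for email(1,1) it yields ?, matching the discussion above.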
Definition 3. A cluster representation of a cluster is a global summary tree structure. A global summary tree structure is the integrated tree structure of the summary tree structures which currently exist in the cluster. The integration of the summary tree structures is simply the union of the nodes at the same level. If there exist a node ni and a node nj on the same level with ni = nj, that is, their node labels and node types are the same, only one node is presented in the global summary tree structure.
For a document to be assigned to a cluster, the data similarity (Equation 3.1) between the document and the cluster has to exceed a clustering threshold β, and the union of the document structure and the cluster representation must not exceed an integration threshold δ. The integration threshold δ is a value to control and determine whether the structures of two summary trees should be integrated or not. The integration threshold is calculated by considering a number η, defined by the user, between 1 and 2. The δ is defined as the total number of nodes of a summary tree structure (of a document) and a global summary tree structure (a cluster representation) divided by η. For instance, let η be 1.3, the number of nodes in a summary tree structure be 10 and the number of nodes in a cluster representation be 12. If the number of nodes in the integrated structure of the two structures is 15, the two structures can be integrated, since 15 does not exceed the integration threshold δ of about 17 nodes ((10 + 12)/1.3 ≈ 17).
If the maximum data similarity value between a document and a cluster exceeds the clustering threshold and the union structure of the document structure and the cluster representation does not exceed the integration threshold, the document is assigned to the cluster and the union structure becomes the new cluster representation of the cluster. On the other hand, if the maximum data similarity value between the document and a cluster does not exceed the clustering threshold, or the union structure exceeds the integration threshold, the document is assigned to a new cluster. Each time a document creates a new cluster in a clustering solution, the structure of that document becomes the cluster representation for comparing and grouping new input documents.
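The assignment rule can be sketched minimally as follows, assuming a precomputed data similarity value (the thesis's Equation 3.1 is not reimplemented here) and node counts for the document, cluster and union structures; all names are illustrative:

```python
# Sketch of the cluster-assignment decision: a document joins a cluster only if
# its data similarity exceeds the clustering threshold beta AND the size of the
# union (integrated) structure stays within the integration threshold
# delta = (|document tree| + |cluster tree|) / eta, with user-defined eta in [1, 2].

def integration_threshold(doc_nodes, cluster_nodes, eta):
    return (doc_nodes + cluster_nodes) / eta

def should_assign(similarity, beta, union_nodes, doc_nodes, cluster_nodes, eta):
    if similarity <= beta:
        return False                # not similar enough: start a new cluster
    delta = integration_threshold(doc_nodes, cluster_nodes, eta)
    return union_nodes <= delta     # union must not exceed the threshold
```

With the worked example above (η = 1.3, trees of 10 and 12 nodes, union of 15 nodes), δ ≈ 16.9, so the structures are integrated.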
[Figure: three document structure trees, each rooted at company with children address, cname and personnel; personnel contains one or two person nodes, each with a name node containing first and last, and in one tree the person node also has an email child.]
Figure 5.2: An example of source document structures in the same cluster.
[Figure: global summary tree — company(3,1) with children address(3,1), cname(3,1) and personnel(3,1); personnel contains person(3,2) with children name(3,1) and email(1,1); name contains first(3,1) and last(3,1).]
Figure 5.3: An example of a source summary structure format.
5.3 Element Matching
Before executing the element matching stage (or schema matching), the structures of the cluster representations and the target structure document need to be processed. Let Figure 5.4 be an example of a target structure and Figure 5.3 be the global summary structure of a cluster. The input target structure document can be in DTD or XML Schema definition format. These schema definitions are converted and also represented in
[Figure: target structure tree — company with children address, cname and personnel; address contains street, city, postal and state; personnel contains person (with a + quantifier), whose children include name and an optional (?) element.]
Figure 5.4: An example of a target structure definition represented in a tree format.
Table 5.1: Quantifier mapping between XSD and DTD

Quantifier Operator   minOccurs   maxOccurs    No. of Child Element(s)
none                  1           1            once and only once
?                     0           1            zero or one
*                     0           unbounded    zero or more
+                     1           unbounded    one or more
a tree structure like the one shown in Figure 5.4. The mapping of the different quantifiers in DTD and XML Schema definitions is shown in Table 5.1. If there is no quantifier indicated for a particular element, then that element exists once and only once under its parent content model.
For the element matching process, the tree structures of a target document and a cluster representation are broken down into collections of paths. Each path in the path collection contains the elements from the root to the element containing leaf nodes. It also contains the set of leaf nodes belonging to the path, as in the example below:
Source:
p1: company/cname, address
p2: company/personnel/person/name/first, last
p3: company/personnel/person/email

Target:
p1: company/address/street, city, state, postal
p2: company/cname
p3: company/personnel/person/name, email
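The decomposition into such path collections — one entry per element that directly contains leaf nodes, holding the path from the root plus its leaf children — can be sketched as follows; the nested-dictionary tree encoding and function name are illustrative assumptions:

```python
# Sketch of breaking a structure tree into (path, leaf set) entries.
# A tree is encoded as nested dicts; a leaf is a label mapping to an empty dict.

def extract_paths(tree, prefix=()):
    """Return [(root-to-element path, sorted leaf children), ...]."""
    paths = []
    for label, subtree in tree.items():
        path = prefix + (label,)
        leaves = sorted(k for k, v in subtree.items() if not v)
        if leaves:
            paths.append(("/".join(path), leaves))
        for k, v in subtree.items():
            if v:  # recurse only into non-leaf children
                paths.extend(extract_paths({k: v}, path))
    return paths
```

Run on the source structure of Figure 5.3 (without quantifiers), this produces the three source entries listed above.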
After the paths are extracted, they are used in the element matching stage. The element matching stage finds the corresponding elements between the global summary structure of a cluster and the target document structure. It is divided into two stages: (1) discovery of corresponding leaf elements and (2) discovery of all corresponding elements between a source and a target structure.
5.3.1 Discovery of Corresponding Leaf Elements
Before finding corresponding leaf elements between the input source and target structures, path similarity is first calculated between the extracted path collections. Let px and py be the two sets of nodes that exist in the paths. The path similarity measure, defined in Equation 5.1, between px and py is twice the intersection of the two sets of nodes divided by the total number of nodes in px and py. For example, the pathSim between p1 (company/cname, address) in the source structure and p2 (company/cname) in the target structure is (2 × 2)/(3 + 2), which equals 0.8.
pathSim(px, py) = (2 × |px ∩ py|) / (|px| + |py|)    (5.1)
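Equation 5.1 is a Dice coefficient over the two node sets and can be sketched directly; representing each path as a list of node labels is an illustrative choice:

```python
# Path similarity (Equation 5.1): twice the number of shared node labels over
# the total number of nodes in the two paths.

def path_sim(px, py):
    shared = len(set(px) & set(py))
    return 2 * shared / (len(px) + len(py))
```

For the worked example, path_sim(["company", "cname", "address"], ["company", "cname"]) gives 0.8.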
The leaf elements of the target are compared with the leaf elements of the source if their path similarity value exceeds a path threshold φ, defined by the user in the range 0 to 1, where 1 means an exact match. The leaf similarity measure is defined as follows:
leafSim(ei, ej) = γ labelSim(ei, ej) + µ ancestorSim(ei, ej) + ω leafSiblingSim(ei, ej)    (5.2)
where the weightings γ, µ and ω are defined by the users to adjust the importance of the similarity measures. To compute the leaf similarity measure, the following similarities are used:
labelSim(ei, ej), where element ei ∈ px and element ej ∈ py: This measures the name similarity of the leaf elements ei and ej. We use the n-gram method as defined in Equation 5.3, where A is the number of unique n-grams in the first element name, B is the number of unique n-grams in the second, and C is the number of unique n-grams common to the two names. For example, let the two element names be company and company1. If we apply an exact string matching method then the labelSim will be 0. Using the n-gram method, the 2-grams (di-grams) for company are co, om, mp, pa, an, ny. The labelSim between ei and ej will be (2 × 6)/(6 + 7) ≈ 0.92.
labelSim = 2C / (A + B)    (5.3)
ancestorSim(ei, ej): There are two different similarity measures in ancestorSim. One counts the common ancestors of ei and ej, without considering the hierarchical order of the ancestors, divided by the maximum number of ancestors of ei and ej; this is denoted nonLevelSim. The other counts the common ancestors of ei and ej occurring at the same hierarchical level, divided by the maximum number of ancestors of ei and ej; this is denoted levelSim. The average of these two measures becomes the ancestorSim value. To find the common ancestors of two leaf elements, an intersection operator is applied to the ancestor sets. For example, let the ancestors of ei and ej be company/personnel/person/name and company/personnel/name/person respectively; the nonLevelSim will be 1 (4/4) and the levelSim will be 0.5 (2/4). The ancestorSim of these two elements is the average of the nonLevelSim and levelSim similarities, which is equal to 0.75.
leafSiblingSim(ei, ej): This similarity measure counts the number of sibling elements that ei and ej have in common. The number of common siblings, which is the intersection of the sibling sets of ei and ej, is multiplied by 2, and the similarity is normalized by the total number of sibling elements of ei and ej. If the two leaf elements have no leaf siblings then the leafSiblingSim is 1. For example, let the siblings of ei and ej be name, email and email respectively; the leafSiblingSim of ei and ej is 0.67 ((1 × 2)/(2 + 1)).
All the above similarity measures range from 0 to 1, where 1 is an exact match. The higher the similarity values, the higher the chance that the elements ei and ej are a corresponding pair.
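The three component measures and their weighted combination (Equation 5.2) can be sketched as follows; the equal default weights are an illustrative choice, since γ, µ and ω are user-defined, and all names are assumptions made here:

```python
# Sketches of labelSim (Equation 5.3), ancestorSim, leafSiblingSim and their
# weighted combination leafSim (Equation 5.2).

def ngrams(name, n=2):
    return {name[i:i + n] for i in range(len(name) - n + 1)}

def label_sim(a, b, n=2):
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def ancestor_sim(anc_i, anc_j):
    longest = max(len(anc_i), len(anc_j))
    non_level = len(set(anc_i) & set(anc_j)) / longest            # order-free overlap
    level = sum(a == b for a, b in zip(anc_i, anc_j)) / longest   # same-level overlap
    return (non_level + level) / 2

def leaf_sibling_sim(sib_i, sib_j):
    if not sib_i and not sib_j:
        return 1.0                                # no leaf siblings on either side
    common = len(set(sib_i) & set(sib_j))
    return 2 * common / (len(sib_i) + len(sib_j))

def leaf_sim(label_pair, anc_pair, sib_pair, gamma=1/3, mu=1/3, omega=1/3):
    return (gamma * label_sim(*label_pair)
            + mu * ancestor_sim(*anc_pair)
            + omega * leaf_sibling_sim(*sib_pair))
```

The worked examples above are reproduced: company vs. company1 gives 12/13 ≈ 0.92, the reordered ancestor paths give 0.75, and the sibling sets {name, email} vs. {email} give 2/3.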
5.3.2 Discovery of All Corresponding Elements
After the leafSim is calculated for all leaf elements from the target structure against the source structure, the matching result of the leaf elements can be confirmed by the user; in the absence of any user approval, the best pair is selected for each leaf element in the target document. The next step of the element matching process is to discover all corresponding elements between the target structure and the source structure. This is done automatically. For example, Table 5.2 shows the result of the corresponding leaf elements. This result is then used in this stage to find all corresponding elements, which can then be used for the generation of the XSLT script. Figure 5.5 shows the algorithm for finding all corresponding elements. Let elemMapSet store the mapping pairs. For each mapping pair in Table 5.2, the following steps are applied:
For each element in a path px of the target document, starting from the root element:
– If there exists an element on the same level in the path py of the source structure and the two are of the same element type (i.e. complex or leaf element), generate a mapping pair between the two. The element of the target document is stored with the source element in elemMapSet if it is not already there. The source element is stored as a relative path, that is, the path containing the elements from the root to the mapping element.
– If the element of the target document is not a leaf element and no match can be found in the source document on the same level because |px| > |py|, then it is stored in elemMapSet with a Null value.
– If the element in px is a leaf element, then the element is mapped to the leaf element in the source and the mapping result is stored in elemMapSet.
For example, consider the mapping path pair company/address/street and company/address of the target and the source structure respectively. The corresponding elements generated from the mapping pair are: company -> company, address -> Null, and street -> company/address. Even though both paths contain the address element on the same level, one is a leaf element and the other is a complex element.
Table 5.2: The leaf element mapping result

Target Doc                        Source Doc
company/address/street            company/address
company/address/city              company/address
company/address/state             company/address
company/address/postal            company/address
company/cname                     company/cname
company/personnel/person/name     company/personnel/person/name/first
company/personnel/person/name     company/personnel/person/name/last
company/personnel/person/email    company/personnel/person/email
5.4 Transformation Operator
Before generating the XSLT script, a transformation operator should be identified for each mapping pair discovered in the element mapping process. Three transformation operators are considered important in the proposed XML transformation approach: the connect, join and split operators.
Figure 5.5: Element mapping algorithm

Input: Set completePathMapSet // contains complete path mapping pairs
Output: Set elemMapSet // contains element mapping pairs

1.  Set elemMapSet = null;
2.  String map = null;
3.  for each mapping ∈ completePathMapSet
4.    Set T = getTargetPathElem();
5.    Set S = getSourcePathElem();
6.    for (j = 1; j <= T.length; j++)
7.      if (j < S.length || (j == T.length && T.length == S.length))
8.        map = T[j] -> S[1...j];
9.        if (!elemMapSet.contains(map))
10.         elemMapSet.add(map);
11.       end if;
12.     else if (j >= S.length)
13.       map = T[T.length] -> S[1...S.length];
14.       if (!elemMapSet.contains(map))
15.         elemMapSet.add(map);
16.       end if;
17.       for k = j to T.length − 1
18.         map = T[k] -> Null;
19.         if (!elemMapSet.contains(map))
20.           elemMapSet.add(map);
21.         end if;
22.       end for;
23.       break;
24.     else if (j == T.length)
25.       map = T[T.length] -> S[j...S.length];
26.       if (!elemMapSet.contains(map))
27.         elemMapSet.add(map);
28.       end if;
29.       break;
30.     end if;
31.   end for;
32. end for;
connect: t = connect(s). This copies the content from a source element s to a target element t with no modification to the structure of the source document. This operator is used for one-to-one mapping results where no modification is necessary.
join: t = join(s). This joins the content of two or more elements in the source document structure into one element in the target document structure. This operator is required for one-to-many mapping relationships.
split: t = split(s). This splits the content of an element in the source document structure into two or more elements in the target. This operator is applied to many-to-one relationships.
Table 5.3 shows the corresponding elements between the target and the source structure. Each corresponding element pair is assigned a transformation operator. The assignment of the transformation operator is done automatically. A connect operator is assigned to an element of the target document when its mapping pair occurs only once in elemMapSet. If an element of the target document occurs more than once in elemMapSet with the same source path, then a split operator is assigned to the mapping pair of the element. A join operator is assigned to an element of the target document when the element occurs several times in elemMapSet with different source paths. The elements of the target document which do not match any of the elements in the source structure are assigned a null operator.
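A sketch of the automatic operator assignment over (target element, source path) pairs follows. Note one interpretive assumption: the split rule is read here as "one source path feeding several target elements", which is how Table 5.3 assigns split to street, city, state and postal; all names are illustrative:

```python
# Sketch of operator assignment from a list of (target_element, source_path)
# pairs; a missing source match is recorded as None.
#   null:    target has no source match
#   join:    one target fed by several different source paths
#   split:   one source path feeding several target elements (see note above)
#   connect: a plain one-to-one mapping

def assign_operators(elem_map_set):
    by_target = {}
    source_fanout = {}
    for t, s in elem_map_set:
        by_target.setdefault(t, []).append(s)
        if s is not None:
            source_fanout[s] = source_fanout.get(s, 0) + 1
    ops = {}
    for t, sources in by_target.items():
        if sources == [None]:
            ops[t] = "null"
        elif len(sources) > 1:
            ops[t] = "join"
        elif source_fanout[sources[0]] > 1:
            ops[t] = "split"
        else:
            ops[t] = "connect"
    return ops
```

Applied to the mapping pairs behind Table 5.3, this reproduces connect for company, null for address, split for street, and join for name.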
Using the mapping result established in Table 5.3, for each mapping pair the following information is stored: ID, source access path list, target element list, child mapping list, child mapping type list, quantifier, transformation operator, and condition. Their detail
Table 5.3: Transformation operators for corresponding elements

Target Doc                   Source Access Paths                    Transformation Operator
company                      company                                connect
address                      null                                   null
street, city, state, postal  company/address                        split
cname                        company/cname                          connect
personnel                    company/personnel                      connect
person                       company/personnel/person               connect
name                         company/personnel/person/name/first,   join
                             company/personnel/person/name/last
email                        company/personnel/person/email         connect
is as follows:
ID - each mapping result in Table 5.3 has a unique ID. The mapping result of the root element has an ID of 1. A child's ID is prefixed with its parent's ID. For example, if element company has the ID 1, then its first child mapping result has the ID 1.1, its second child mapping result has the ID 1.2, and so on. In this way, the mapping results can be processed following the hierarchical structure of the target document, in which the mapping result of the root element serves as a starting point for processing the mapping results.
source access path list - a list containing all the source access paths that match the elements in the target element list.
target element list - a list containing all the target elements that match the elements in the source access path list.
child mapping list - a list containing all the IDs of the target child mapping results.
ID,source access path list,target element list,Child mapping list,Child Mapping Type list,quantifier,transformation,condition
1,company,company,1.1,1.2,1.3,null,one-to-one,one-to-one,none,connect,null
1.1,,address,1.1.1,many-to-one,none,null,null
1.1.1,company/address,street,city,state,postal,,,none,split,null
1.2,company/cname,cname,,,none,connect,null
1.3,company/personnel,personnel,1.3.1,one-to-one,none,connect,null
1.3.1,company/personnel/person,person,1.3.1.1,1.3.1.2,one-to-many,one-to-one,+,connect,null
1.3.1.1,company/personnel/person/first,company/personnel/person/last,name,,,none,join,null
1.3.1.2,company/personnel/person/email,email,,,?,join,null
Figure 5.6: Element mapping result.
child mapping type list - a list containing all the types of the child mapping results,
i.e. one-to-one, one-to-many or many-to-one.
quantifier - if the quantifier is none then the mapping pair occurs once and only
once under its parent node. A quantifier such as ? (optional) is treated in the
same way as a none quantifier.
transformation operator - the transformation operator of the mapping result, such as
split, connect, join or null. The null operator indicates that there is no mapping for
the target element.
condition - this information indicates any special condition required for the mapping
result to be possible.
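The hierarchical ID scheme can be illustrated with a short sketch. The following Python fragment (an illustration, not the thesis implementation) models each mapping result as a dictionary with the fields listed above, and shows how the dotted IDs induce the hierarchical processing order:

```python
# Illustrative sketch: mapping results as dictionaries, ordered by their
# hierarchical dotted IDs so that parents are processed before children.

def id_key(mapping_id):
    """Turn a dotted ID such as '1.3.1.2' into a sortable tuple (1, 3, 1, 2)."""
    return tuple(int(part) for part in mapping_id.split("."))

# Hypothetical mapping results, modelled on Table 5.3 / Figure 5.6.
results = [
    {"id": "1.3.1", "source_paths": ["company/personnel/person"],
     "targets": ["person"], "children": ["1.3.1.1", "1.3.1.2"],
     "quantifier": "+", "transformation": "connect", "condition": None},
    {"id": "1", "source_paths": ["company"], "targets": ["company"],
     "children": ["1.1", "1.2", "1.3"], "quantifier": "none",
     "transformation": "connect", "condition": None},
]

# Sorting by id_key yields the hierarchical (document) order: parents first.
ordered = sorted(results, key=lambda r: id_key(r["id"]))
print([r["id"] for r in ordered])  # ['1', '1.3.1']
```

Note that sorting on the numeric tuples rather than on the raw ID strings keeps, for instance, 1.2 before 1.10.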
For example, the mapping results in Table 5.3 are processed into the format shown in
Figure 5.6. Figure 5.6 shows the final mapping result between the elements in the
target document and the source documents. This result is used and processed by the
next stage, the XSLT script generator stage.
5.5 XSLT Script Generator
The XSLT script generator stage is inspired by the work of Boukottaya et al. [8]. An
XSLT (http://www.w3.org/TR/xslt) program relies on XPath expressions to navigate a
source document. There are two techniques for generating the XSLT script: push and
pull. Push means emitting output whenever some conditions are satisfied by the nodes
(elements) in the source document. The pull technique usually refers to the process
that walks through an output template and retrieves data from the nodes in the source
document. An example of the push technique is the use of "match" and "apply-templates"
to generate the output by further processing all the children of the matched node. An
example of the pull technique is the use of "select" to query the source instance and
extract the value of the selected source node. In the proposed XML transformation
approach, both techniques are used in the generation of the XSLT script. An XSLT
template generally takes the following form:
<xsl:template match=pattern name=qname priority=number mode=qname>
  construction rules which possibly call/apply other templates
</xsl:template>
Three kinds of XSLT templates are used here: pattern templates, mode templates and
named templates. The pattern templates do not need a name or mode attribute. They
can be called by an xsl:apply-templates element without a mode or name attribute.
Similarly, the mode templates can be called by an xsl:apply-templates element, but
with a mode attribute. The mode templates can be used to enforce a particular
construction phase by restricting processing to a set of templates that will be
called during that phase. Lastly, the named templates give the flexibility to call
a specific template whenever necessary at any construction phase. They can be called
by an xsl:call-template element with a matching name attribute.
Using the format of the mapping result shown in Figure 5.6, the XSLT script is
generated in the following steps. An example of the output XSLT script is shown in
Figure 5.7:
1. Initializing the translation - take the first mapping element, which is the mapping
of the root element. It is assumed that the root element always has a one-to-one
mapping relationship. Once the root element mapping result is located, the generation
of the template rules can begin.
2. Traverse the mapping result shown in Figure 5.6 in a depth-first manner, i.e.,
process the mapping element results in hierarchical order according to their IDs.
(a) generate a construction template for the current mapping element pair.
(b) for each child mapping element, adjust the above template to insert more
construction or apply-template rules where necessary.
(c) add the templates to the XSLT stylesheet.
(d) if there are more mapping element pairs to be processed, loop back to
2(a).
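The traversal in step 2 can be sketched as follows. This is an illustrative Python outline, not the actual generator: each mapping result is assumed to carry its child mapping list, and a placeholder record is emitted per mapping pair instead of a real XSLT template.

```python
# Minimal sketch of the depth-first generation loop: starting from the root
# mapping result (ID '1'), recurse through the child mapping lists and emit
# one construction record per mapping pair, in hierarchical order.

def generate(results, current_id, emitted):
    result = results[current_id]
    # (a) generate a construction template for the current mapping pair;
    # here we only record the target element instead of real XSLT.
    emitted.append((current_id, result["target"]))
    # (b)+(d) recurse into every child mapping result, depth-first.
    for child_id in result["children"]:
        generate(results, child_id, emitted)
    return emitted

# Hypothetical mapping results keyed by ID, echoing Figure 5.6's hierarchy.
results = {
    "1":   {"target": "company", "children": ["1.1", "1.2"]},
    "1.1": {"target": "address", "children": []},
    "1.2": {"target": "cname",   "children": []},
}

print(generate(results, "1", []))
# [('1', 'company'), ('1.1', 'address'), ('1.2', 'cname')]
```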
<xsl:transform
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="company">
<company>
<xsl:apply-templates select="address"/>
<xsl:apply-templates select="cname"/>
<xsl:apply-templates select="personnel"/>
</company>
</xsl:template>
<xsl:template match="address">
<address>
<street>
<xsl:value-of select = "substring-before(.,',')"/>
</street>
<city>
<xsl:value-of select="substring-before(substring-after(.,','),',')"/>
</city>
<state>
<xsl:value-of select="substring-before(substring-after(substring-after(.,','),','),',')"/>
</state>
<postal>
<xsl:value-of select ="substring-after(substring-after(substring-after(.,','),','),',')"/>
</postal>
</address>
</xsl:template>
<xsl:template match="cname">
<cname>
<xsl:value-of select="."/>
</cname>
</xsl:template>
<xsl:template match="personnel">
<personnel>
<xsl:for-each select="person">
<person>
<xsl:call-template name="person-trans"/>
</person>
</xsl:for-each>
</personnel>
</xsl:template>
<xsl:template name="person-trans">
<name>
<xsl:value-of select="name/first"/>
<xsl:text> </xsl:text>
<xsl:value-of select="name/last"/>
</name>
<email>
<xsl:value-of select="email"/>
</email>
</xsl:template>
</xsl:transform>
Figure 5.7: An example of an XSLT Script.
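The nested substring-before and substring-after calls in the address template of Figure 5.7 decompose a comma-separated address value. Assuming a value with exactly four comma-separated fields, the decomposition is equivalent to the following Python sketch (illustrative only; the address value is hypothetical):

```python
# Equivalent of the XSLT address template: substring-before(.,',') yields the
# street, the nested substring-after/substring-before pairs yield city and
# state, and the triple substring-after yields everything after the third
# comma (the postal code). In Python this is a split on ',' capped at 4 parts.

def split_address(value):
    street, city, state, postal = value.split(",", 3)
    return {"street": street, "city": city, "state": state, "postal": postal}

print(split_address("12 Main St,Brisbane,QLD,4000"))
# {'street': '12 Main St', 'city': 'Brisbane', 'state': 'QLD', 'postal': '4000'}
```

As in the XSLT version, any commas after the third one remain part of the last field.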
The element matching, transformation operator and XSLT script generator stages
are executed for each global summary structure generated in the pre-processing stage.
Eventually, each cluster has an XSLT script associated with it, which is used by an
XSLT processor (e.g., Saxon, http://saxon.sourceforge.net/) to convert the input
source documents into the target document format.
5.6 Results of Experiments
The experiments are carried out to measure the performance of the proposed XML
transformation approach in comparison to the traditional way of transforming XML data.
Furthermore, the accuracy of the element matching employed by the proposed approach is
also evaluated.
5.6.1 Data Collection
The details of the input source data collections used in the experiments are shown
in Table 5.4. The input source data collections are derived from the XML document
collections that have been used to evaluate the clustering methods proposed in this thesis
(Table 4.2). The Movie and the Bibliography collections are homogeneous collections.
The DBLP collection is a homogeneous collection; however, it has the characteristics of a
heterogeneous collection, containing 8 different structure formats for books, conference,
journals, MS, persons, PhD, Tr and WWW, as shown in Table 4.2. The structures of the
input source collections are not large; however, they are chosen for the evaluation of the
proposed XML transformation approach because small structures make it easier to analyse and
evaluate the performance of the XCTrans approach.
Table 5.4: Data collections for XML transformation

Collection      No. of Docs    No. of Hierarchical Levels    No. of Distinct Elements
DBLP            4910           4                             32
Movie           37             4                             12
Bibliography    16             5                             14
The target DTD documents for each data collection are manually defined. Refer to the
Appendix for the source DTD documents of the data collections in Table 5.4 and the
target DTD documents used for testing the proposed transformation approach.
5.6.2 Evaluation Metric
In the evaluation of the XCTrans approach, the time taken to transform the input source
collections into the target format using the XSLT scripts generated by XCTrans is
measured in seconds. For the evaluation of the element matching in the XCTrans
approach, the recall and precision measures are used. Let A be the set of correct
matches identified by a human, and C be the set of mappings generated by the automatic
matching system. Precision is the ratio between the number of correct mappings
generated by the system and the total number of mappings in C. It indicates how many
incorrect mappings have been discovered by the element matching stage.
precision = |C ∩ A| / |C|    (5.4)
Recall is the ratio between the number of correct mappings generated by the matching
system and the total number of correct mappings (i.e., the mappings identified by a
human). It gives an indication of how many correct mappings are missed by the matcher.
recall = |C ∩ A| / |A|    (5.5)
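Equations (5.4) and (5.5) can be expressed directly as set operations. The following Python sketch illustrates the computation; the element mappings used here are hypothetical, not the thesis test data.

```python
# Precision and recall as set operations: A is the set of correct
# (human-identified) matches, C the set of system-generated mappings.

def precision(C, A):
    return len(C & A) / len(C)

def recall(C, A):
    return len(C & A) / len(A)

# Hypothetical mappings: pairs of (source element, target element).
A = {("author", "author"), ("title", "title"), ("year", "year")}
C = {("author", "author"), ("title", "title"), ("year", "journal")}

print(precision(C, A))  # 2 correct out of 3 generated mappings -> 0.666...
print(recall(C, A))     # 2 of the 3 correct mappings found    -> 0.666...
```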
5.6.3 Scalability
Figure 5.8 compares the transformation processing time using the XSLT scripts generated
by the XCTrans approach with the processing time of the One-Script method. The One-Script
method is the traditional way of performing transformation, in which each source data
collection is associated with a single XSLT script. The XSLT script is generated using
the data collection's DTD schema definition (refer to the Appendix). The XCTrans
approach, on the other hand, has many XSLT scripts for transforming the input source
data collections. The number of generated XSLT scripts corresponds to the number of
clusters produced by executing the XCTree method on the input source collections. For
example, after grouping the DBLP collection based on document structure, 8 different
clusters might be produced by the XCTree method; therefore, eight
XSLT scripts are produced by the proposed approach. From the results in Figure 5.8,
it can be seen that, in terms of speed, the XSLT scripts generated by the XCTrans
approach perform better on the DBLP collection, which has more documents and larger
document structures than the Movie and the Bibliography collections. For collections
with small structures, such as the Movie and Bibliography collections, the performance
of the XCTrans approach is equivalent to, or slightly worse than, that of the
One-Script method due to the searching and loading of the different XSLT scripts.
[Bar chart: processing time (sec) for each data collection (dblp, movies, bibliography), comparing XCTrans and the One-Script Method.]
Figure 5.8: XML transformation processing time on the datasets.
Figure 5.9 displays the difference in the transformation processing time in relation to the
size of the DBLP collection. The graph in Figure 5.9 shows that for the DBLP collection,
the larger the collection size, the greater is the difference in the processing time between
the XCTrans and the One-Script method. The whole collection of the DBLP is around
157120 documents as discussed in Chapter 4 Section 4.1.
[Line chart: processing time (sec) versus number of documents, comparing XCTrans and the One-Script Method.]
Figure 5.9: The processing time in seconds in relation to the number of documents in the DBLP collection.
5.6. Results of Experiments 148
Furthermore, experiments are carried out to test the processing time of the XCTrans
approach with different numbers of generated clusters (scripts). Figure 5.10 shows
the processing time in seconds in relation to the number of clusters generated on the
DBLP collection. From the results, it can be ascertained that there exists an optimal
number of clusters (scripts) at which the best performance is reached; however, when
the number of clusters goes beyond this optimum, the performance of the transformation
process degrades because of the extra cost of indexing and retrieving the scripts.
[Line chart: processing time (sec) for 1 to 14 generated clusters, XCTrans.]
Figure 5.10: The processing time in seconds with different numbers of clusters on the DBLP collection.
5.6.4 Element Mapping
The accuracy of the mapping results generated by the element matching process is
compared with that of the Similarity Flooding (SF) method [42]. The SF approach is
based on labelled graphs, which are used in an iterative fixpoint computation whose
results tell us which nodes in one graph are similar to nodes in the second graph. For
computing the similarities, it relies on the intuition that elements of two distinct
models are similar when they occur in similar contexts, i.e., when their adjacent
elements are similar. In other words, a part of the similarity of two elements
propagates to their respective neighbours. Figures 5.11 and 5.12 show the recall and
precision results of the employed element matching process and the SF method,
respectively. The XCTrans outperforms the SF method. The string matching measures
combined with a propagation process employed by the SF method are not flexible enough
to achieve better results.
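The propagation intuition behind SF can be sketched briefly. The following Python fragment is a highly simplified illustration of that intuition only, not the actual fixpoint algorithm of [42]; the two tiny schema graphs and seed similarities are hypothetical.

```python
# Simplified sketch of similarity propagation: node-pair similarities are
# repeatedly increased by the similarity of their adjacent (parent) pairs,
# then normalized by the maximum value, for a fixed number of rounds.

def propagate(edges_a, edges_b, sim, rounds=10):
    for _ in range(rounds):
        nxt = dict(sim)
        for (a1, a2) in edges_a:
            for (b1, b2) in edges_b:
                # similarity of the pair (a1, b1) flows to the child pair (a2, b2)
                nxt[(a2, b2)] = nxt.get((a2, b2), 0.0) + sim.get((a1, b1), 0.0)
        top = max(nxt.values())
        sim = {pair: value / top for pair, value in nxt.items()}
    return sim

# Two tiny schema graphs: company -> name  versus  firm -> title.
edges_a = [("company", "name")]
edges_b = [("firm", "title")]
seed = {("company", "firm"): 1.0, ("name", "title"): 0.5}
result = propagate(edges_a, edges_b, seed)
# ("name", "title") ends up with the highest similarity, because the
# similarity of its parent pair ("company", "firm") flows into it.
```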
[Bar chart: recall for each data collection (DBLP, Movies, Bibliography), comparing XCTrans and SF Matching.]
Figure 5.11: The mapping accuracy based on the recall measure.
[Bar chart: precision for each data collection (DBLP, Movies, Bibliography), comparing XCTrans and SF Matching.]
Figure 5.12: The mapping accuracy based on the precision measure.
5.7 Discussion
The evaluation of the XCTrans approach in this thesis is not extensive; the main focus
of the thesis is to investigate the usability of XML clustering in XML transformation.
From the results of the experiments, the XSLT scripts generated by the XCTrans
approach perform better on input data collections, such as the DBLP collection, that
are large in the number of documents and complex in structure. As the number of
documents in the input source data collections increases, so does the difference in
the processing time between the XCTrans and the One-Script method, as shown in
Figure 5.9.
Even though the input source collections are not very large in the number of documents
or complex in structure, the XCTrans still shows an improvement in the transformation
process. If this approach can be modified and used on heterogeneous collections, it is
believed that the approach can be very useful. This has been shown to be true for the
DBLP collection, which is a homogeneous collection but has the characteristics of a
heterogeneous collection.
5.8 Chapter Summary
In this chapter, the XCTree method has been utilized in the proposed XML transforma-
tion approach for translating large input source XML document collections into a target
structure. Firstly, the XCTree method is applied to the input source data collections to
reduce the complexity of integrating the structures of all the input source documents.
Each cluster has a global summary structure representing the documents in the cluster.
The global summary structure acts as a source structure, which is used as input to the
schema matching stage for the generation of the transformation script. The experiments
show an improvement in the performance of the XCTrans approach in terms of the
processing time and the accuracy of the element mapping.
Chapter 6

Conclusion
As the popularity of XML data increases, so does the amount of XML data on the
Web. With the increasing amount of XML data, there is a necessity to better manage
and analyse large collections of XML data. XML clustering plays an important role in
the better management of XML data. There are still many open problems in the XML
clustering task, such as the respective roles of the structure and the content of the
XML documents in clustering. Therefore, the first main research question of this thesis
is: Can the accuracy of the clustering solution be improved by using both the structure
and content of XML documents?
In response to the first question, this thesis has proposed a number of clustering methods
for the clustering of XML documents using the structure-only information and using both
the content and structure of documents. The results of the experiments verify that for most
data collections, the clustering solutions using both the structure and content, especially
for the linear combination of the structure and content, outperform the results of the
structure-only clustering and the content-only clustering.
The existing transformation approaches discussed in Chapter 2 address the transfor-
mation problem only between one source and one target at a time. However, performing
the transformation process separately for many sources sharing similar characteristics
can be time-consuming. Therefore, the second research question in this thesis is: Given
a collection of source XML documents and a target document, can the grouping of the
source documents into small sets of similar structures improve the processing time and
accuracy of the XML transformation?
In response to the second research question, an XML transformation approach has been
proposed in this thesis that incorporates the clustering process as a pre-processing
stage, so that many source documents can be converted into the same target document
more efficiently. This confirms the hypothesis that XML clustering based
on the structural information of XML documents can improve the transformation process
in terms of time and accuracy for the conversion of more than two source documents.
6.1 Summary of Findings
For the proposed clustering methods, a number of findings have been made which have
been discussed in Chapter 4 in the discussion section. To summarize, the results of the
experiments which have been conducted on the proposed clustering methods illustrate
that the proposed clustering methods perform differently on different types of XML data
collections. For homogeneous collections such as the IEEE collection, the content-only
clustering solutions outperform the structure-only clustering solutions and the content and
structure-based clustering solutions. However, for the classification of documents from
homogeneous collections based on structure, such as the DBLP collection, the structure
also plays an important role. As for the heterogeneous collections, both the structure-only
clustering solutions and the content and structure-based clustering solutions outperform
the content-only clustering solutions. The results of the XCLComb method, which
linearly combines the structural similarity value and the content similarity value,
outperform the other methods for most collections used in this thesis. Thus, the first
hypothesis of this thesis is verified.
In addition, the findings of the proposed XML transformation approach have been dis-
cussed in Chapter 5. To summarize, using a structure-only clustering method such as
the XCTree method as the pre-processing stage improves the transformation process for
converting many source documents into the same target document at the same time.
Since the XML transformation approach uses a global representation of the source
documents in each cluster as the source structure, the errors of the schema matching
(element matching) process in the XML transformation are also reduced.
6.2 Summary of Contributions
This thesis provides an overview of XML data, XML clustering and XML transformation.
Based on the literature review of current work, a number of XML clustering methods, as
well as a novel XML transformation approach have been proposed in this research. The
main contributions are summarised below.
- Developed clustering methods to deal with both homogeneous and heterogeneous
collections.
- Combined structure and content to improve the quality of the clustering solution on
both homogeneous and heterogeneous collections.
- Proposed clustering methods to assist the schema matching process in data integra-
tion applications, as well as in XML transformation applications.
- Proposed a novel XML transformation approach that incorporates the XML clustering
process to improve the transformation of more than one XML document into the same
target document.
6.3 Limitations and Future Work
Several extensions can be made to improve the current proposed methods in the future.
- Extend the clustering methods so that they can be applied to the clustering of XML
schema definition data. The current clustering methods only address the problem of
XML clustering at the document level; however, the methods can easily be extended
to the clustering of XML schema definition data.
- Improve the similarity measure using external sources such as WordNet [22] to learn
the synonyms between tag names and content. The current methods do not use any
external sources for finding synonyms between tag names or content.
- Extend the proposed XML transformation approach to converting a collection of XML
schema definition data. The current evaluation of the XML transformation approach
is not sufficient to verify the second hypothesis of this thesis; therefore, an
extensive evaluation of the XML transformation is still required on more complex
source data collections. At the moment, the XML transformation approach addresses
only the problem of transforming more than one source XML document. The performance
of the transformation process improves, but not significantly. However, if the
proposed transformation system is applied to XML schema definitions, the performance
of the transformation process might improve significantly, although more work will
need to be done on the schema matching process.
Chapter 7

Appendix
7.1 DTD Definitions of the Data Collections for XML Clus-
tering Methods
This section contains the schema definitions for the data collections used to evaluate
the proposed clustering methods in this thesis. Since the schema definitions of the
data collections are very long, Figures 7.1, 7.2, 7.3 and 7.4 show only a portion of
each.
7.2 DTD definitions for the XML Transformation Approach
This section contains the DTD definitions of the source data collections and the
target DTD definitions used for the evaluation of the XML transformation approach
proposed in this thesis. Figures 7.9 and 7.10 show only a portion of the source
…
<!ELEMENT article (fno, doi?, fm, bdy, bm?)>
<!ELEMENT fno (#PCDATA)> <!-- article ID (no entity references) -->
<!ATTLIST fno fid NMTOKEN #IMPLIED>
<!ELEMENT doi (#PCDATA)>
<!-- ============ -->
<!-- FRONT MATTER -->
<!-- ============ -->
<!ELEMENT fm (hdr?, (edinfo|au|tig|pubfm|abs|edintro|kwd|fig|figw)*)>
<!-- ++++++ -->
<!-- HEADER -->
<!-- ++++++ -->
<!ELEMENT hdr (fig?, hdr1, hdr2)>
<!ELEMENT hdr1 (#PCDATA|crt|obi|pdt|pp|ti)*>
<!ELEMENT hdr2 (#PCDATA|crt|obi|pdt|pp|ti)*>
…
Figure 7.1: An example of the IEEE article DTD definition
...
<!ELEMENT USMARC (Leader, Directry, VarFlds)>
<!ATTLIST USMARC Material (BK|AM|CF|MP|MU|VM|SE|AU) "BK"
id CDATA #IMPLIED>
<!ELEMENT Leader (LRL, RecStat, RecType, BibLevel, UCP, IndCount, SFCount,
BaseAddr, EncLevel, DscCatFm, LinkRec, EntryMap)>
<!ELEMENT Directry (#PCDATA)>
<!ELEMENT VarFlds (VarCFlds, VarDFlds)>
<!ELEMENT LRL (#PCDATA)>
<!ELEMENT RecStat (#PCDATA)>
<!ELEMENT RecType (#PCDATA)>
<!ELEMENT BibLevel (#PCDATA)>
<!ELEMENT UCP (#PCDATA)>
<!ELEMENT IndCount (#PCDATA)>
<!ELEMENT SFCount (#PCDATA)>
<!ELEMENT BaseAddr (#PCDATA)>
<!ELEMENT EncLevel (#PCDATA)>
<!ELEMENT DscCatFm (#PCDATA)>
<!ELEMENT LinkRec (#PCDATA)>
<!ELEMENT EntryMap (FLength, SCharPos, IDLength, EMUCP)>
<!ELEMENT FLength (#PCDATA)>
<!ELEMENT SCharPos (#PCDATA)>
<!ELEMENT IDLength (#PCDATA)>
<!ELEMENT EMUCP (#PCDATA)>
...
Figure 7.2: An example of the Berkeley article DTD definition
<!ELEMENT entry (article|book|booklet|manual|manuscript|phdthesis|mastersthesis|proceedings|
inproceedings|incollection|inbook|techreport|unpublished|misc)+>
<!ATTLIST entry
id NMTOKEN #REQUIRED>
<!ELEMENT article ((author | altauthor |title |year |journal |conference |volume |number |
mrnumber |govnumber |pages |note |contents |copyright |price |annote |titletranslation |
keywords |free-terms |general-terms |abstract |reviewer |classification-codes |
subject-descriptors |language |links | doi | url |entrydate |key |issn |institution |
provider |english |ideanresearch.com |mixed)+)>
<!ATTLIST article
id NMTOKEN #REQUIRED
provider (LEABIB|CP|DBLP|TCCSB) #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
<!ELEMENT book ((author | altauthor |editor |translation |title |year |publisher |conference |volume |
series |address |edition |note |govnumber |mrnumber |contents |copyright |price |annote |
titletranslation |keywords |free-terms |general-terms |abstract |reviewer |classification-codes |
subject-descriptors |language |entrydate |key |links | doi | url |isbns |institution |provider |
english |ideanresearch.com |mixed)+)>
<!ATTLIST book
id NMTOKEN #REQUIRED
provider CDATA #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
...
Figure 7.3: An example of the HCI article DTD definition
<!ELEMENT bibliography (
article|book|dissertation|proceedings|inproceedings|incollection|techreport|misc)+>
<!ELEMENT article (
author?,title,year?,journal?,volume?,number?,pages?,note?,titletranslation?,keywords?,abstract?,
reviewer?,classification?,language?,links?,issns?,affiliation?,provider?,mixed?)>
<!ATTLIST article
id NMTOKEN #REQUIRED
provider (LEABIB|CP|DBLP|TCCSB) #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
<!ELEMENT book (author?,editor?,title,year?,publisher?,volume?,series?,address?,edition?,note?,
titletranslation?,keywords?,abstract?,reviewer?,classification?,language?,links?,isbns?,
affiliation?,provider?,mixed?)>
<!ATTLIST book
id NMTOKEN #REQUIRED
provider CDATA #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
<!ELEMENT dissertation (author?,title,year?,school?,address?,month?,note?,titletranslation?,keywords?,
abstract?,reviewer?,classification?,language?,links?,isbns?,affiliation?,provider?,mixed?)>
<!ATTLIST dissertation
id NMTOKEN #REQUIRED
provider CDATA #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
…
Figure 7.4: An example of the DBLP article DTD definition
<?xml encoding="ISO-8859-1"?>
<!ELEMENT bib (vendor)*>
<!ELEMENT vendor (name, email, phone?, book*)>
<!ATTLIST vendor id ID #REQUIRED>
<!ELEMENT book (title, publisher?, year?, author+, price)>
<!ELEMENT author (firstname?, lastname)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT price (#PCDATA)>
Figure 7.5: The source Bibliography article DTD definition
<?xml encoding="ISO-8859-1"?>
<!ELEMENT bib (vendor)*>
<!ELEMENT vendor (name, email, book*)>
<!ATTLIST vendor id ID #REQUIRED>
<!ELEMENT book (title, publisher?, year?, author+)>
<!ELEMENT author (name)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT year (#PCDATA)>
Figure 7.6: The target Bibliography article DTD definition
DTD definition and target DTD definition of the DBLP collection, respectively.
<?xml encoding="ISO-8859-1"?>
<!ELEMENT W4F_DOC (Movie)>
<!ELEMENT Movie (Title,Year,Directed_By,Genres,Cast)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Year (#PCDATA)>
<!ELEMENT Directed_By (Director)*>
<!ELEMENT Director (#PCDATA)>
<!ELEMENT Genres (Genre)*>
<!ELEMENT Genre (#PCDATA)>
<!ELEMENT Cast (Actor)*>
<!ELEMENT Actor (FirstName,LastName)>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
Figure 7.7: The source Movies DTD definition
<?xml encoding="ISO-8859-1"?>
<!ELEMENT Movie (Title,Year,Directed_By,Genres,Cast)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Year (#PCDATA)>
<!ELEMENT Directed_By (Director)*>
<!ELEMENT Director (FirstName, LastName)>
<!ELEMENT Genres (Genre)*>
<!ELEMENT Genre (#PCDATA)>
<!ELEMENT Cast (Actor)*>
<!ELEMENT Actor (FirstName,LastName)>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
Figure 7.8: The target Movies DTD definition
<!ELEMENT bibliography (
article|book|dissertation|proceedings|inproceedings|incollection|techreport|misc)+>
<!ELEMENT article (
author?,title,year?,journal?,volume?,number?,pages?,note?,titletranslation?,keywords?,abstract?,
reviewer?,classification?,language?,links?,issns?,affiliation?,provider?,mixed?)>
<!ATTLIST article
id NMTOKEN #REQUIRED
provider (LEABIB|CP|DBLP|TCCSB) #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
<!ELEMENT book (author?,editor?,title,year?,publisher?,volume?,series?,address?,edition?,note?,
titletranslation?,keywords?,abstract?,reviewer?,classification?,language?,links?,isbns?,
affiliation?,provider?,mixed?)>
<!ATTLIST book
id NMTOKEN #REQUIRED
provider CDATA #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
<!ELEMENT dissertation (author?,title,year?,school?,address?,month?,note?,titletranslation?,keywords?,
abstract?,reviewer?,classification?,language?,links?,isbns?,affiliation?,provider?,mixed?)>
<!ATTLIST dissertation
id NMTOKEN #REQUIRED
provider CDATA #IMPLIED
mdate CDATA #REQUIRED
date CDATA #IMPLIED>
…
Figure 7.9: A portion of the source DBLP DTD definition
<!ELEMENT bibliography (type,
author?,editor?,title,year?,publisher?,volume?,series?,address?,edition?,note?,
titletranslation?,keywords?,abstract?,reviewer?,classification?,language?,links?,isbns?,
affiliation?,provider?,mixed?)>
<!ELEMENT type (#PCDATA)>
…
Figure 7.10: A portion of the target DBLP DTD definition
Publications from this Thesis
1. Tien Tran, Sangeetha Kutty and Richi Nayak. Utilizing the Structure and Con-
tent Information for XML Document Clustering. In Shlomo Geva, Jaap Kamps,
and Andrew Trotman, editors, Advances in Focused Retrieval, pages 460-468, 2009.
Springer Berlin / Heidelberg.
2. Tien Tran, Richi Nayak, and Peter Bruza (2008). Combining structure and content
similarities for xml document clustering. In: Proceedings of the 7th Australasian
data mining conference (AusDM). Adelaide, Australia.
3. Tien Tran, Richi Nayak, and Peter Bruza. Document Clustering Using Incremental
and Pairwise Approaches. In Norbert Fuhr, Jaap Kamps, Mounia Lalmas, and
Andrew Trotman, editors, Focused Access to XML Documents, pages 222-233, 2008.
Springer Berlin / Heidelberg.
4. Tien Tran, Richi Nayak, and Peter Bruza. Evaluating the Performance of XML
Document Clustering by Structure Only. In Norbert Fuhr, Mounia Lalmas, and
Andrew Trotman, editors, Comparative Evaluation of XML Information Retrieval
Systems, pages 473-484, 2007. Springer Berlin / Heidelberg.
Bibliography
[1] Xsl transformations (xslt) 2.0. http://www.w3.org/TR/xslt20/, 2002.
[2] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to
Semistructured Data and XML. Morgan Kaufmann, San Francisco, California, 2000.
[3] Charu C. Aggarwal, Na Ta, Jianyong Wang, Jianhua Feng, and Mohammed Zaki.
Xproj: a framework for projected structural clustering of xml documents. In KDD
’07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 46–55, San Jose, California, USA, 2007.
[4] Mohamad Alishahi, Mahmoud Naghibzadeh, and Baharak Shakeri Aski. Tag name
structure-based clustering of xml documents. International Journal of Computer and
Electrical Engineering, 2(1):1793–8163, 2010.
[5] Alsayed Algergawy, Richi Nayak, and Gunter Saake. Element similarity measures in xml
schema matching. Information Sciences, 189(24):4975–4998, 2010.
[6] Panagiotis Antonellis, Christos Makris, and Nikos Tsirakis. XEdge: clustering ho-
mogeneous and heterogeneous XML documents using edge summaries. In Proceedings
of the 2008 ACM symposium on Applied computing, Fortaleza, Ceara, Brazil, 2008.
[7] R. Baeza-Yates and G. Navarro. Integrating contents and structure in text retrieval.
ACM SIGMOD, 25(1), 1996.
[8] Aida Boukottaya, Christine Vanoirbeek, Federica Paganelli, and Omar Abou Khaled.
Automating xml document transformations: A conceptual modelling based approach.
In Sven Hartmann and John F. Roddick, editors, First Asia-Pacific Conference on
Conceptual Modelling, pages 81–90, Dunedin, New Zealand, January 2004.
[9] Emmanuel Bruno, Jacques Le Maitre, and Elisabeth Murisasco. Extending XQuery
with transformation operators. In ACM Symposium on Document Engineering, Greno-
ble, France, 2003.
[10] S. Cha. Comprehensive survey on distance/similarity measures between probability
density functions. International Journal of Mathematical Models and Methods in
Applied Sciences, 1(4):300–307, 2007.
[11] S. Chawathe. Comparing hierarchical data in external memory. In Twenty-fifth Int.
Conf. on Very Large Data Bases, pages 90–101, 1999.
[12] S. S. Chawathe and H. Garcia-Molina. Meaningful change detection in structured
data. In Proceedings of the 1997 ACM SIGMOD International Conference on Management
of Data (SIGMOD), pages 26–37, New York, USA, 1997.
[13] Yun Chi, Richard R. Muntz, Siegfried Nijssen, and Joost N. Kok. Frequent subtree
mining - an overview. Fundamenta Informaticae, 66(1-2):161–198, 2004.
[14] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to
Information Retrieval. Cambridge University Press, 1st edition, 2008.
[15] N. Cristianini, J. Shawe-Taylor, and H. Lodhi. Latent semantic kernels. Journal of
Intelligent Information Systems (JIIS), 18(2), 2002.
[16] Theodore Dalamagas, Tao Cheng, Klaas-Jan Winkel, and Timos Sellis. A methodol-
ogy for clustering XML documents by structure. Information Systems, 31(3):187–228,
2006.
[17] L. Denoyer, P. Gallinari, and Anne-Marie Vercoustre. Report on the XML mining
track at INEX 2005 and INEX 2006. In INEX 2006, pages 432–443, Dagstuhl Castle,
Germany, 2006.
[18] H. H. Do and E. Rahm. COMA - a system for flexible combination of schema matching
approaches. In 28th VLDB, Hong Kong, China, August 2002.
[19] A. Doan, P. Domingos, and A. Y. Halevy. Reconciling schemas of disparate sources:
a machine-learning approach. In ACM SIGMOD, Santa Barbara, California, United
States, 2001.
[20] Carina Friedrich Dorneles, Rodrigo Goncalves, and Ronaldo dos Santos Mello. Ap-
proximate data instance matching: a survey. Knowledge and Information Systems,
pages 1–21, 2010.
[21] A. Doucet and H. Ahonen-Myka. Naive clustering of a large XML document collection.
In INEX Annual ERCIM Workshop, pages 81–88, 2002.
[22] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[23] S. Geva. K-tree: A height balanced tree structured vector quantizer. In IEEE Neural
Networks for Signal Processing Workshop 2000 (NNSP-2000), Sydney, December 2000.
[24] F. Giunchiglia, P. Shvaiko, and M. Yatskevich. S-Match: An algorithm and an imple-
mentation of semantic matching. In European Semantic Web Symposium (ESWS), 2004.
[25] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann,
San Diego, USA, 2001.
[26] Z. Huang. A fast clustering algorithm to cluster very large categorical data sets in data
mining. In SIGMOD Workshop on Research Issues on Data Mining and Knowledge
Discovery, 1997.
[27] Jeong Hee Hwang and Keun Ho Ryu. Clustering and retrieval of XML documents
by structure. In Osvaldo Gervasi, Marina Gavrilova, Vipin Kumar, Antonio Laganà,
Heow Lee, Youngsong Mun, David Taniar, and Chih Tan, editors, Computational
Science and Its Applications - ICCSA 2005, volume 3481 of Lecture Notes in Computer
Science, pages 925–935. Springer Berlin / Heidelberg, 2005.
[28] A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applica-
tions. Neural Networks, 13(4-5):411–430, 2000.
[29] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing
Surveys (CSUR), 31(3):264–323, 1999.
[30] George Karypis. CLUTO - software for clustering high-dimensional datasets. Karypis
Lab, University of Minnesota.
[31] Eila Kuikka, Paula Leinonen, and Martti Penttonen. Towards automating of doc-
ument structure transformations. In ACM Symposium on Document Engineering,
pages 103–110, McLean, Virginia, USA, 2002.
[32] Sangeetha Kutty, Richi Nayak, and Yuefeng Li. HCX: An efficient hybrid clustering
approach for XML documents. In DocEng ’09: Proceedings of the 9th ACM Symposium
on Document Engineering, pages 94–97, Munich, Germany, 2009.
[33] Sangeetha Kutty, Tien Tran, Richi Nayak, and Yuefeng Li. Clustering XML docu-
ments using closed frequent subtrees: A structural similarity approach. In Norbert
Fuhr, Jaap Kamps, Mounia Lalmas, and Andrew Trotman, editors, Focused Access to
XML Documents, volume 4862 of Lecture Notes in Computer Science, pages 183–194.
Springer Berlin / Heidelberg, 2008.
[34] T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent semantic
analysis. Discourse Processes, 25:259–284, 1998.
[35] J. W. Lee and S. S. Park. Finding maximal similar paths between XML documents
using sequential patterns. In ADVIS, pages 96–106, Izmir, Turkey, October 2004.
[36] Jun-Seung Lee and Kyong-Ho Lee. Computing simple and complex matchings be-
tween XML schemas for transforming XML documents. Information and Software
Technology, 48(9):937–946, September 2006.
[37] M. L. Lee, L. H. Yang, W. Hsu, and X. Yang. XClust: Clustering XML schemas
for effective integration. In 11th ACM International Conference on Information and
Knowledge Management (CIKM ’02), Virginia, November 2002.
[38] Ho-pong Leung, Fu-lai Chung, S. C. F. Chan, and R. Luk. XML document clustering
using common XPath. In International Workshop on Challenges in Web Information
Retrieval and Integration (WIRI ’05), pages 91–96, 2005.
[39] Zhiwei Lin, Hui Wang, S. McClean, and Haiying Wang. All common embedded
subtrees for clustering XML documents by structure. In International Conference on
Machine Learning and Cybernetics, volume 1, pages 13–18, 2009.
[40] Jianghui Liu, J. T. L. Wang, W. Hsu, and K. G. Herbert. XML clustering by principal
component analysis. In 16th IEEE International Conference on Tools with Artificial
Intelligence (ICTAI 2004), pages 658–662, 2004.
[41] J. Madhavan, P. A. Bernstein, and E. Rahm. Generic schema matching with Cupid.
In 27th VLDB, Roma, Italy, 2001.
[42] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: a versatile graph
matching algorithm and its application to schema matching. In 18th ICDE, 2002.
[43] N. K. Nagwani and A. Bhansali. Clustering homogeneous XML documents using
weighted similarities on XML attributes. In 2010 IEEE 2nd International Advance
Computing Conference (IACC), pages 369–372, Patiala, 2010.
[44] R. Nayak and W. Iryadi. XML schema clustering with semantic and hierarchical
similarity measures. Knowledge-Based Systems, 20(4):336–349, 2007.
[45] R. Nayak and T. Tran. A progressive clustering algorithm to group the XML data by
structural and semantic similarity. IJPRAI, 21(3):1–21, 2007.
[46] R. Nayak and S. Xu. XCLS: A fast and effective clustering algorithm for heterogeneous
XML documents. In PAKDD 2006, Singapore, 2006.
[47] Richi Nayak. Fast and effective clustering of XML data using structural information.
Knowledge and Information Systems, 14(2):197–215, 2008.
[48] Richi Nayak and Wina Iryadi. XMine: A methodology for mining XML structure. In
Xiaofang Zhou, Jianzhong Li, Heng Tao Shen, Masaru Kitsuregawa, and Yanchun Zhang,
editors, Frontiers of WWW Research and Development - APWeb 2006, volume 3841
of Lecture Notes in Computer Science, pages 786–792. Springer Berlin / Heidelberg,
2006.
[49] Richi Nayak, Christopher M. De Vries, Sangeetha Kutty, Shlomo Geva, Ludovic De-
noyer, and Patrick Gallinari. Overview of the INEX 2009 XML mining track: Clustering
and classification of XML documents. In Shlomo Geva, Jaap Kamps, and Andrew
Trotman, editors, Focused Retrieval and Evaluation, volume 6203 of Lecture Notes in
Computer Science, pages 366–378. Springer Berlin / Heidelberg, 2010.
[50] Richi Nayak and F. B. Xia. Automatic integration of heterogeneous XML schemas.
In Int. Conf. on Information Integration and Web-based Applications and Services,
pages 427–437, Jakarta, Indonesia, 2004.
[51] H.-Q. Nguyen, D. Taniar, J. W. Rahayu, and K. Nguyen. Double-layered schema
integration of heterogeneous XML sources. Journal of Systems and Software, 84(1):63–76,
2011.
[52] A. Nierman and H. V. Jagadish. Evaluating structural similarity in XML documents.
In Fifth International Workshop on the Web and Databases (WebDB 2002), Wisconsin,
USA, 2002.
[53] K. Ono, T. Koyanagi, M. Abe, and M. Hori. XSLT stylesheet generation by example
with WYSIWYG editing. In 2002 International Symposium on Applications and the
Internet, Nara, Japan, March 2002.
[54] T. Pankowski. Specifying transformations for XML data. In Pre-Conference Workshop
of VLDB, Berlin, 2003.
[55] Tadeusz Pankowski. A high-level language for specifying XML data transformations.
In A. Benczúr, J. Demetrovics, and G. Gottlob, editors, ADBIS, pages 159–172,
Budapest, Hungary, 2004.
[56] K. Pearson. On lines and planes of closest fit to systems of points in space. Philo-
sophical Magazine, 2(6):559–572, 1901.
[57] M. Peltier, J. Bézivin, and G. Guillaume. MTrans: A general framework based on XSLT
for model transformations. In WTUML, Genova, Italy, 2001.
[58] M.F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[59] K. Saleem, Z. Bellahsene, and E. Hunt. PORSCHE: Performance oriented schema medi-
ation. Information Systems, 33(7-8):637–657, 2008.
[60] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing.
Communications of the ACM, 18(11):613–620, 1975.
[61] Dong-Hoon Shin and Kyong-Ho Lee. Towards the faster transformation of XML doc-
uments. Journal of Information Science, 32:261–276, 2006.
[62] Pavel Shvaiko and Jérôme Euzenat. A survey of schema-based matching approaches.
Journal on Data Semantics IV, pages 146–171, 2005.
[63] Marko Smiljanic, Maurice van Keulen, and Willem Jonker. Using element clustering
to increase the efficiency of XML schema matching. In 2nd International Workshop on
Challenges in Web Information Retrieval and Integration (WIRI ’06), pages 95–104,
April 2006.
[64] Ian Stuart. XML Schema, a brief introduction, 2004.
[65] Hong Su, Harumi Kuno, and Elke A. Rundensteiner. Automating the transformation
of XML documents. In ACM Symposium on Document Engineering, 2001.
[66] Hong Su, Harumi Kuno, and Elke A. Rundensteiner. Automating the transformation
of XML documents. In ACM Symposium on Document Engineering, 2001.
[67] X. Tang and F.W. Tompa. Specifying transformations for structured documents. In
International Workshop on the Web and Databases, 2001.
[68] Joe Tekli, Richard Chbeir, and Kokou Yetongnon. A hybrid approach for XML sim-
ilarity. In Jan van Leeuwen, Giuseppe Italiano, Wiebe van der Hoek, Christoph
Meinel, Harald Sack, and František Plášil, editors, SOFSEM 2007: Theory and Prac-
tice of Computer Science, volume 4362 of Lecture Notes in Computer Science, pages
783–795. Springer Berlin / Heidelberg, 2007.
[69] A. Termier, M.-C. Rousset, and M. Sebag. TreeFinder: A first step towards XML data
mining. In IEEE International Conference on Data Mining, 2002.
[70] A. Theobald and G. Weikum. The index-based XXL search engine for querying XML
data with relevance ranking. In Proceedings of the EDBT Conference, 2002.
[71] Tien Tran, Sangeetha Kutty, and Richi Nayak. Utilizing the structure and content
information for XML document clustering. In Shlomo Geva, Jaap Kamps, and Andrew
Trotman, editors, Advances in Focused Retrieval, volume 5631 of Lecture Notes in
Computer Science, pages 460–468. Springer Berlin / Heidelberg, 2009.
[72] Tien Tran and Richi Nayak. Evaluating the performance of XML document clus-
tering by structure only. In 5th International Workshop of the Initiative for the Eval-
uation of XML Retrieval, INEX, pages 473–484, Dagstuhl Castle, Germany, 2006.
[73] Tien Tran, Richi Nayak, and Peter Bruza. Combining structure and content similar-
ities for XML document clustering. In Proceedings of the 7th Australasian Data Mining
Conference (AusDM), pages 219–226, Adelaide, Australia, 2008.
[74] Athena Vakali, Jaroslav Pokorný, and Theodore Dalamagas. An overview of web data
clustering practices. In Wolfgang Lindner, Marco Mesiti, Can Türker, Yannis Tz-
itzikas, and Athena Vakali, editors, Current Trends in Database Technology - EDBT
2004 Workshops, volume 3268 of Lecture Notes in Computer Science, pages 500–501.
Springer Berlin / Heidelberg, 2005.
[75] Christopher M. De Vries and Shlomo Geva. Document clustering with k-tree. In
Shlomo Geva, Jaap Kamps, and Andrew Trotman, editors, Advances in Focused Re-
trieval, volume 5631 of Lecture Notes in Computer Science, pages 420–431. Springer
Berlin / Heidelberg, 2009.
[76] R. Wagner and M. Fischer. The string-to-string correction problem. Journal of the
ACM, 21(1):168–173, 1974.
[77] S. Waworuntu and J. Bailey. XSLTGen: A system for automatically generating XML
transformations via semantic mappings. In 23rd International Conference on Con-
ceptual Modeling (ER 2004), 2004.
[78] Erik Wüstner, Thorsten Hotzel, and Peter Buxmann. Converting business documents:
A classification of problems and solutions using XML/XSLT. In WECWIS, California,
USA, 2002.
[79] L. Xu and D. W. Embley. Discovering direct and indirect matches for schema ele-
ments. In 8th International Conference on Database Systems for Advanced Applica-
tions, 2003.
[80] Jianwu Yang, W. K. Cheung, and Xiaoou Chen. Learning the kernel matrix for XML
document clustering. In IEEE International Conference on e-Technology, e-Commerce
and e-Service, 2005.
[81] J.W. Yang and X.O. Chen. A semi-structured document model for text mining.
Journal of Computer Science and Technology, 17(5):603–610, 2002.
[82] Y. Yang, X. Guan, and J. You. CLOPE: A fast and effective clustering algorithm
for transaction data. In 8th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2002.
[83] Jin Yao and Nadia Zerida. Rare patterns to improve path-based clustering. In 6th
International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX
2007, Dagstuhl Castle, Germany, 2007.
[84] Jin Yao and Nadia Zerida. Rare patterns to improve path-based clustering of Wikipedia
articles. In 6th International Workshop of the Initiative for the Evaluation of XML
Retrieval, INEX 2007, Dagstuhl Castle, Germany, December 17–19, 2007.
[85] Guo Yongming, Chen Dehua, and Le Jiagin. Clustering XML documents by combining
content and structure. In International Symposium on Information Science and
Engineering (ISISE ’08), volume 1, pages 583–587, Shanghai, 2008.
[86] J. Yoo, V. Raghavan, and L. Kerschberg. BitCube: Clustering and statistical anal-
ysis for XML documents. In Thirteenth International Conference on Scientific and
Statistical Database Management, Fairfax, Virginia, 2001.
[87] Jin-sha Yuan, Xin-ye Li, and Li-na Ma. An improved XML document clustering using
path feature. In Fifth International Conference on Fuzzy Systems and Knowledge
Discovery (FSKD ’08), volume 2, pages 400–404, Shandong, 2008.
[88] M. J. Zaki. Efficiently mining frequent trees in a forest. In SIGKDD, 2002.
[89] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between
trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989.