Clustering with Multiviewpoint-Based Similarity Measure
Duc Thang Nguyen, Lihui Chen, Senior Member, IEEE, and Chee Keong Chan
Abstract: All clustering methods have to assume some cluster relationship among the data objects that they are applied on. Similarity
between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multiviewpoint-based similarity
measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is
that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints, which are objects
assumed to not be in the same cluster with the two objects being measured. Using multiple viewpoints, more informative assessment
of similarity could be achieved. Theoretical analysis and empirical study are conducted to support this claim. Two criterion functions for
document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that
use other popular similarity measures on various document collections to verify the advantages of our proposal.
Index Terms: Document clustering, text mining, similarity measure.
1 INTRODUCTION
CLUSTERING is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data, and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year. They are proposed for very distinct research fields, and developed using totally different techniques and approaches. Nevertheless, according to a recent study [1], more than half a century after it was introduced, the simple algorithm k-means still remains one of the top 10 data mining algorithms. It is the most frequently used partitional clustering algorithm in practice. Another recent scientific discussion [2] states that k-means is the favorite algorithm that practitioners in the related fields choose to use. Admittedly, k-means has more than a few basic drawbacks, such as sensitivity to initialization and to cluster size, and its performance can be worse than other state-of-the-art algorithms in many domains. In spite of that, its simplicity, understandability, and scalability are the reasons for its tremendous popularity. An algorithm with adequate performance and usability in most application scenarios could be preferable to one with better performance in some cases but limited usage due to high complexity. While offering reasonable results, k-means is fast and easy to combine with other methods in larger systems.
A common approach to the clustering problem is to treat it as an optimization process. An optimal partition is found by optimizing a particular function of similarity (or distance) among data. Basically, there is an implicit assumption that the true intrinsic structure of data could be correctly described by the similarity formula defined and embedded in the clustering criterion function. Hence, the effectiveness of clustering algorithms under this approach depends on the appropriateness of the similarity measure to the data at hand. For instance, the original k-means has a sum-of-squared-error objective function that uses euclidean distance. In a very sparse and high-dimensional domain like text documents, spherical k-means, which uses cosine similarity (CS) instead of euclidean distance as the measure, is deemed to be more suitable [3], [4].

In [5], Banerjee et al. showed that euclidean distance was indeed one particular form of a class of distance measures called Bregman divergences. They proposed the Bregman hard-clustering algorithm, in which any kind of Bregman divergence could be applied. Kullback-Leibler divergence was a special case of Bregman divergences that was said to give good clustering results on document data sets. Kullback-Leibler divergence is a good example of a nonsymmetric measure. Also on the topic of capturing dissimilarity in data, Pekalska et al. [6] found that the discriminative power of some distance measures could increase when their noneuclidean and nonmetric attributes were increased. They concluded that noneuclidean and nonmetric measures could be informative for statistical learning of data. In [7], Pelillo even argued that the symmetry and nonnegativity assumption of similarity measures was actually a limitation of current state-of-the-art clustering approaches. Simultaneously, clustering still requires more robust dissimilarity or similarity measures; recent works such as [8] illustrate this need.

The work in this paper is motivated by investigations from the above and similar research findings. It appears to us that the nature of the similarity measure plays a very important role in the success or failure of a clustering method. Our first objective is to derive a novel method for measuring similarity between data objects in sparse and high-dimensional domains, particularly text documents. From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their
988 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 6, JUNE 2012
The authors are with the Division of Information Engineering, School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Nanyang Avenue, Republic of Singapore 639798. E-mail: [email protected], {elhchen, eckchan}@ntu.edu.sg.

Manuscript received 16 July 2010; revised 25 Feb. 2011; accepted 10 Mar. 2011; published online 22 Mar. 2011. Recommended for acceptance by M. Ester. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2010-07-0398. Digital Object Identifier no. 10.1109/TKDE.2011.86.

1041-4347/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society.
respective clustering algorithms, which are fast and scalable like k-means, but are also capable of providing high-quality and consistent performance.
The remainder of this paper is organized as follows: In Section 2, we review related literature on similarity and clustering of documents. We then present our proposal for a document similarity measure in Section 3.1. It is followed by two criterion functions for document clustering and their optimization algorithms in Section 4. Extensive experiments on real-world benchmark data sets are presented and discussed in Sections 5 and 6. Finally, conclusions and potential future work are given in Section 7.
2 RELATED WORK
First of all, Table 1 summarizes the basic notations that will be used extensively throughout this paper to represent documents and related concepts. Each document in a corpus corresponds to an m-dimensional vector d, where m is the total number of terms that the document corpus has. Document vectors are often subjected to some weighting scheme, such as the standard Term Frequency-Inverse Document Frequency (TF-IDF), and normalized to have unit length.
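As a concrete illustration of this preprocessing, the following sketch (our own Python, not from the paper; function and variable names are ours) builds TF-IDF weighted, unit-length document vectors from tokenized documents:

```python
import math

def tfidf_unit_vectors(corpus_terms, vocabulary):
    """Build TF-IDF weighted, unit-length document vectors.

    corpus_terms: list of documents, each a list of terms.
    vocabulary: ordered list of the m distinct terms of the corpus.
    """
    n = len(corpus_terms)
    # Document frequency of each term.
    df = {t: sum(1 for doc in corpus_terms if t in doc) for t in vocabulary}
    vectors = []
    for doc in corpus_terms:
        # Term frequency times inverse document frequency.
        vec = [doc.count(t) * math.log(n / df[t]) if df[t] else 0.0
               for t in vocabulary]
        # Normalize to unit length, as assumed throughout the paper.
        norm = math.sqrt(sum(w * w for w in vec))
        vectors.append([w / norm for w in vec] if norm > 0 else vec)
    return vectors
```

Real systems would also apply stop-word removal and stemming before this step, and smooth the IDF term; the sketch keeps only the weighting and normalization described above.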
The principal definition of clustering is to arrange data objects into separate clusters such that the intracluster similarity as well as the intercluster dissimilarity is maximized. The problem formulation itself implies that some form of measurement is needed to determine such similarity or dissimilarity. There are many state-of-the-art clustering approaches that do not employ any specific form of measurement, for instance, probabilistic model-based methods [9], nonnegative matrix factorization [10], information-theoretic coclustering [11], and so on. In this paper, though, we primarily focus on methods that do utilize a specific measure. In the literature, euclidean distance is one of the most popular measures:
$$\mathrm{Dist}(d_i, d_j) = \| d_i - d_j \|. \qquad (1)$$
It is used in the traditional k-means algorithm. The objective of k-means is to minimize the euclidean distance between objects of a cluster and that cluster's centroid:

$$\min \sum_{r=1}^{k} \sum_{d_i \in S_r} \| d_i - C_r \|^2. \qquad (2)$$
However, for data in a sparse and high-dimensional space, such as that in document clustering, cosine similarity is more widely used. It is also a popular similarity score in text mining and information retrieval [12]. Particularly, the similarity of two document vectors d_i and d_j, Sim(d_i, d_j), is defined as the cosine of the angle between them. For unit vectors, this equals their inner product:

$$\mathrm{Sim}(d_i, d_j) = \cos(d_i, d_j) = d_i^t d_j. \qquad (3)$$
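For unit vectors, the measures in (1) and (3) are tightly related; the short numerical sketch below (our own code) checks the identity ||d_i − d_j||² = 2(1 − cos(d_i, d_j)):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

di = unit([1.0, 2.0, 0.0])
dj = unit([2.0, 1.0, 1.0])

cos_sim = dot(di, dj)                       # Sim(di, dj) as in (3)
sq_dist = sum((a - b) ** 2 for a, b in zip(di, dj))

# For unit vectors, squared euclidean distance and cosine similarity are
# linked by ||di - dj||^2 = 2 (1 - cos(di, dj)), so on unit vectors the two
# measures rank pairs identically (in opposite order).
assert abs(sq_dist - 2.0 * (1.0 - cos_sim)) < 1e-12
```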
The cosine measure is used in a variant of k-means called spherical k-means [3]. While k-means aims to minimize euclidean distance, spherical k-means intends to maximize the cosine similarity between documents in a cluster and that cluster's centroid:

$$\max \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{d_i^t C_r}{\| C_r \|}. \qquad (4)$$
The major difference between euclidean distance and cosine similarity, and therefore between k-means and spherical k-means, is that the former focuses on vector magnitudes, while the latter emphasizes vector directions. Besides direct application in spherical k-means, the cosine of document vectors is also widely used in many other document clustering methods as a core similarity measurement. The min-max cut graph-based spectral method is an example [13]. In the graph partitioning approach, the document corpus is considered as a graph G = (V, E), where each document is a vertex in V and each edge in E has a weight equal to the similarity between a pair of vertices. The min-max cut algorithm tries to minimize the criterion function
$$\min \sum_{r=1}^{k} \frac{\mathrm{Sim}(S_r, S \setminus S_r)}{\mathrm{Sim}(S_r, S_r)}, \qquad (5)$$

$$\text{where } \mathrm{Sim}(S_q, S_r)\Big|_{1 \le q, r \le k} = \sum_{d_i \in S_q,\, d_j \in S_r} \mathrm{Sim}(d_i, d_j),$$
and when the cosine as in (3) is used, minimizing the criterion in (5) is equivalent to

$$\min \sum_{r=1}^{k} \frac{D_r^t D}{\| D_r \|^2}. \qquad (6)$$
There are many other graph partitioning methods with different cutting strategies and criterion functions, such as Average Weight [14] and Normalized Cut [15], all of which have been successfully applied for document clustering using cosine as the pairwise similarity score [16], [17]. In [18], an empirical study was conducted to compare a variety of criterion functions for document clustering.
Another popular graph-based clustering technique is implemented in a software package called CLUTO [19]. This method first models the documents with a nearest neighbor graph, and then splits the graph into clusters using a min-cut algorithm. Besides the cosine measure, the extended Jaccard coefficient can also be used in this method to represent similarity between nearest documents. Given nonunit document vectors u_i, u_j (d_i = u_i/\|u_i\|, d_j = u_j/\|u_j\|), their extended Jaccard coefficient is

$$\mathrm{Sim}_{eJacc}(u_i, u_j) = \frac{u_i^t u_j}{\| u_i \|^2 + \| u_j \|^2 - u_i^t u_j}. \qquad (7)$$
TABLE 1. Notations.
Compared with euclidean distance and cosine similarity, the extended Jaccard coefficient takes into account both the magnitude and the direction of the document vectors. If the documents are instead represented by their corresponding unit vectors, this measure has the same effect as cosine similarity. In [20], Strehl et al. compared four measures: euclidean, cosine, Pearson correlation, and extended Jaccard, and concluded that cosine and extended Jaccard are the best ones on web documents.
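The reduction to cosine on unit vectors can be checked numerically. In the sketch below (our own code), eJacc of unit vectors equals cos/(2 − cos), a monotone function of cosine, which is why on unit vectors it ranks document pairs the same way cosine does:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def ejacc(ui, uj):
    # Extended Jaccard coefficient, eq. (7): defined on nonunit vectors.
    uij = dot(ui, uj)
    return uij / (dot(ui, ui) + dot(uj, uj) - uij)

ui, uj = [3.0, 4.0], [6.0, 8.0]   # parallel directions, different magnitudes
di, dj = unit(ui), unit(uj)

cos = dot(di, dj)
# On unit vectors, eJacc = cos / (2 - cos): a monotone function of cosine.
assert abs(ejacc(di, dj) - cos / (2.0 - cos)) < 1e-12
```

On the nonunit originals, however, ejacc(ui, uj) = 2/3 even though their cosine is 1, showing how magnitude enters the measure.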
In nearest neighbor graph clustering methods, such as CLUTO's graph method above, the concept of similarity is somewhat different from the previously discussed methods. Two documents may have a certain value of cosine similarity, but if neither of them is in the other one's neighborhood, they have no connection between them. In such a case, some context-based knowledge or relativeness property is already taken into account when considering similarity. Recently, Ahmad and Dey [21] proposed a method to compute distance between two categorical values of an attribute based on their relationship with all other attributes. Subsequently, Ienco et al. [22] introduced a similar context-based distance learning method for categorical data. However, for a given attribute, they only selected a relevant subset of attributes from the whole attribute set to use as the context for calculating distance between its two values.
More related to text data, there are phrase-based and concept-based document similarities. Lakkaraju et al. [23] employed a conceptual tree-similarity measure to identify similar documents. This method requires representing documents as concept trees with the help of a classifier. For clustering, Chim and Deng [24] proposed a phrase-based document similarity by combining the suffix tree model and the vector space model. They then used a Hierarchical Agglomerative Clustering algorithm to perform the clustering task. However, a drawback of this approach is its high computational complexity due to the need to build the suffix tree and calculate pairwise similarities explicitly before clustering. There are also measures designed specifically for capturing structural similarity among XML documents [25]. They are essentially different from the document-content measures that are discussed in this paper.
In general, cosine similarity still remains the most popular measure because of its simple interpretation and easy computation, though its effectiveness is yet fairly limited. In the following sections, we propose a novel way to evaluate similarity between documents, and consequently formulate new criterion functions for document clustering.
3 MULTIVIEWPOINT-BASED SIMILARITY
3.1 Our Novel Similarity Measure
The cosine similarity in (3) can be expressed in the following form without changing its meaning:

$$\mathrm{Sim}(d_i, d_j) = \cos(d_i - 0, d_j - 0) = (d_i - 0)^t (d_j - 0), \qquad (8)$$

where 0 is the vector that represents the origin point.
According to this formula, the measure takes 0 as the one and only reference point. The similarity between two documents d_i and d_j is determined w.r.t. the angle between the two points when looking from the origin.

To construct a new concept of similarity, it is possible to use more than just one point of reference. We may have a more accurate assessment of how close or distant a pair of points is, if we look at them from many different viewpoints. From a third point d_h, the directions and distances to d_i and d_j are indicated, respectively, by the difference vectors (d_i - d_h) and (d_j - d_h). By standing at various reference points d_h to view d_i, d_j and working on their difference vectors, we define similarity between the two documents as

$$\mathrm{Sim}(d_i, d_j)\Big|_{d_i, d_j \in S_r} = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}(d_i - d_h, d_j - d_h). \qquad (9)$$
As described by the above equation, similarity of two documents d_i and d_j, given that they are in the same cluster, is defined as the average of similarities measured relatively from the views of all other documents outside that cluster. What is interesting is that the similarity here is defined in a close relation to the clustering problem. A presumption of cluster memberships has been made prior to the measure. The two objects to be measured must be in the same cluster, while the points from where to establish this measurement must be outside of the cluster. We call this proposal the Multiviewpoint-based Similarity, or MVS. From this point onwards, we will denote the proposed similarity measure between two document vectors d_i and d_j by MVS(d_i, d_j | d_i, d_j ∈ S_r), or occasionally MVS(d_i, d_j) for short.
The final form of MVS in (9) depends on the particular formulation of the individual similarities within the sum. If the relative similarity is defined by the dot product of the difference vectors, we have

$$\mathrm{MVS}(d_i, d_j \mid d_i, d_j \in S_r) = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h)$$
$$= \frac{1}{n - n_r} \sum_{d_h} \cos(d_i - d_h, d_j - d_h) \, \| d_i - d_h \| \, \| d_j - d_h \|. \qquad (10)$$
The similarity between two points d_i and d_j inside cluster S_r, viewed from a point d_h outside this cluster, is equal to the product of the cosine of the angle between d_i and d_j looking from d_h and the euclidean distances from d_h to these two points. This definition is based on the assumption that d_h is not in the same cluster as d_i and d_j. The smaller the distances \|d_i - d_h\| and \|d_j - d_h\| are, the higher the chance that d_h is in fact in the same cluster as d_i and d_j, and the similarity based on d_h should also be small to reflect this potential. Therefore, through these distances, (10) also provides a measure of intercluster dissimilarity, given that points d_i and d_j belong to cluster S_r, whereas d_h belongs to another cluster. The overall similarity between d_i and d_j is determined by taking the average over all the viewpoints not belonging to cluster S_r. It is possible to argue that while most of these viewpoints are useful, there may be some of them giving misleading information, just as may happen with the origin point. However, given a large enough number of viewpoints and their variety, it is reasonable to assume that the majority of them will be useful. Hence, the effect of misleading viewpoints is constrained and reduced by the averaging step. It can be seen that this method offers
a more informative assessment of similarity than the single origin point-based similarity measure.
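Definition (10) can be implemented directly. The sketch below (our own code; the toy vectors are ours) averages the dot products of the difference vectors over the viewpoints outside the cluster:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def mvs(di, dj, viewpoints):
    """MVS of di and dj, eq. (10): average over all viewpoints dh outside
    the cluster of di and dj of (di - dh)^t (dj - dh)."""
    return sum(dot(sub(di, dh), sub(dj, dh))
               for dh in viewpoints) / len(viewpoints)

# Two orthogonal unit vectors: their cosine similarity is 0, yet seen from
# viewpoints lying on the opposite side of the origin they appear close.
di, dj = [1.0, 0.0], [0.0, 1.0]
outside = [[-1.0, 0.0], [0.0, -1.0]]
assert mvs(di, dj, outside) == 2.0
```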
3.2 Analysis and Practical Examples of MVS
In this section, we present an analytical study to show that the proposed MVS could be a very effective similarity measure for data clustering. In order to demonstrate its advantages, MVS is compared with cosine similarity on how well they reflect the true group structure in document collections. First, expanding (10), we have
$$\mathrm{MVS}(d_i, d_j \mid d_i, d_j \in S_r) = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \left( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h \right)$$
$$= d_i^t d_j - \frac{1}{n - n_r} d_i^t \sum_{d_h} d_h - \frac{1}{n - n_r} d_j^t \sum_{d_h} d_h + 1, \quad \text{since } \| d_h \| = 1,$$
$$= d_i^t d_j - \frac{1}{n - n_r} d_i^t D_{S \setminus S_r} - \frac{1}{n - n_r} d_j^t D_{S \setminus S_r} + 1$$
$$= d_i^t d_j - d_i^t C_{S \setminus S_r} - d_j^t C_{S \setminus S_r} + 1, \qquad (11)$$

where $D_{S \setminus S_r} = \sum_{d_h \in S \setminus S_r} d_h$ is the composite vector of all the documents outside cluster r, called the outer composite w.r.t. cluster r, and $C_{S \setminus S_r} = D_{S \setminus S_r} / (n - n_r)$ is the outer centroid w.r.t. cluster r, $\forall r = 1, \ldots, k$. From (11), when comparing two pairwise similarities MVS(d_i, d_j) and MVS(d_i, d_l),
document d_j is more similar to document d_i than the other document d_l is, if and only if
$$d_i^t d_j - d_j^t C_{S \setminus S_r} > d_i^t d_l - d_l^t C_{S \setminus S_r}$$
$$\Leftrightarrow \; \cos(d_i, d_j) - \cos(d_j, C_{S \setminus S_r}) \| C_{S \setminus S_r} \| > \cos(d_i, d_l) - \cos(d_l, C_{S \setminus S_r}) \| C_{S \setminus S_r} \|. \qquad (12)$$
From this condition, it is seen that even when d_l is considered closer to d_i in terms of CS, i.e., cos(d_i, d_j) ≤ cos(d_i, d_l), d_l can still be regarded as less similar to d_i based on MVS if, on the contrary, it is sufficiently closer to the outer centroid C_{S\S_r} than d_j is. This is intuitively reasonable, since the closer d_l is to C_{S\S_r}, the greater the chance it actually belongs to another cluster rather than S_r and is, therefore, less similar to d_i. For this reason, MVS brings to the table an additional useful measure compared with CS.
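The algebra behind (11) can be checked numerically against the averaged definition (10); the sketch below (our own code, with random unit vectors) confirms the two agree:

```python
import math, random

random.seed(7)

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def sub(u, v): return [a - b for a, b in zip(u, v)]
def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

m = 5                                        # toy dimensionality
rand_unit = lambda: unit([random.random() for _ in range(m)])
di, dj = rand_unit(), rand_unit()
outside = [rand_unit() for _ in range(8)]    # viewpoints dh, all unit length

# Definition (10): average of (di - dh)^t (dj - dh) over viewpoints.
lhs = sum(dot(sub(di, dh), sub(dj, dh)) for dh in outside) / len(outside)

# Closed form (11): di^t dj - di^t C - dj^t C + 1, C the outer centroid.
C = [sum(dh[t] for dh in outside) / len(outside) for t in range(m)]
rhs = dot(di, dj) - dot(di, C) - dot(dj, C) + 1.0

assert abs(lhs - rhs) < 1e-9
```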
To further justify the above proposal and analysis, we carried out a validity test for MVS and CS. The purpose of this test is to check how well a similarity measure coincides with the true class labels. It is based on one principle: if a similarity measure is appropriate for the clustering problem, then for any document in the corpus, the documents closest to it under this measure should be in the same cluster as it.

The validity test is designed as follows: For each type of similarity measure, a similarity matrix A = {a_ij}_{n×n} is created. For CS, this is simple, as a_ij = d_i^t d_j. The procedure for building the MVS matrix is described in Fig. 1.

First, the outer composite w.r.t. each class is determined. Then, for each row a_i of A, i = 1, ..., n, if the pair of documents d_i and d_j, j = 1, ..., n, are in the same class, a_ij is calculated as in line 10 of Fig. 1. Otherwise, d_j is assumed to be in d_i's class, and a_ij is calculated as in line 12 of Fig. 1. After matrix A is formed, the procedure in Fig. 2 is used to get its validity score.

For each document d_i corresponding to row a_i of A, we select the q_r documents closest to d_i. The value of q_r is chosen relative to the size of the class r that contains d_i, i.e., q_r = percentage × n_r, where percentage ∈ (0, 1]. Then, validity w.r.t. d_i is calculated as the fraction of these q_r documents having the same class label as d_i, as in line 12 of Fig. 2. The final validity is determined by averaging over all the rows of A, as in line 14 of Fig. 2. It is clear that the validity score is bounded within 0 and 1. The higher the validity score a similarity measure has, the more suitable it should be for the clustering task.
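The scoring procedure of Fig. 2 can be sketched as follows (our own reading of the procedure; the exact rounding of q_r in the paper's implementation may differ):

```python
def validity_score(A, labels, percentage):
    """Validity score of a similarity matrix A w.r.t. true labels (cf. Fig. 2).

    For each document i, rank all other documents by similarity a_ij, take
    the top q_r of them (q_r = percentage * size of i's class), and measure
    the fraction sharing i's label; the score averages this over all i.
    """
    n = len(A)
    class_size = {}
    for lab in labels:
        class_size[lab] = class_size.get(lab, 0) + 1
    total = 0.0
    for i in range(n):
        qr = max(1, int(percentage * class_size[labels[i]]))
        # Rank all other documents by similarity to document i.
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: A[i][j], reverse=True)
        hits = sum(1 for j in ranked[:qr] if labels[j] == labels[i])
        total += hits / qr
    return total / n
```

A similarity matrix that perfectly separates the classes yields a score of 1.0.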
Two real-world document data sets are used as examples in this validity test. The first is reuters7, a subset of the famous collection Reuters-21578 Distribution 1.0 of Reuters newswire articles.^1 Reuters-21578 is one of the most widely
Fig. 1. Procedure: Build MVS similarity matrix.
Fig. 2. Procedure: Get validity score.
1. http://www.daviddlewis.com/resources/testcollections/reuters21578/.
used test collections for text categorization. In our validity test, we selected 2,500 documents from the largest seven categories: acq, crude, interest, earn, money-fx, ship, and trade to form reuters7. Some of the documents may appear in more than one category. The second data set is k1b, a collection of 2,340 webpages from the Yahoo! subject hierarchy, including six topics: health, entertainment, sport, politics, tech, and business. It was created from a past study in information retrieval called WebAce [26], and is now available with the CLUTO toolkit [19].

The two data sets were preprocessed by stop-word removal and stemming. Moreover, we removed words that appear in fewer than two documents or in more than 99.5 percent of the total number of documents. Finally, the documents were weighted by TF-IDF and normalized to unit vectors. The full characteristics of reuters7 and k1b are presented in Fig. 3.
Fig. 4 shows the validity scores of CS and MVS on the two data sets relative to the parameter percentage, whose value is set at 0.001, 0.01, 0.05, 0.1, 0.2, ..., 1.0. According to Fig. 4, MVS is clearly better than CS for both data sets in this validity test. For example, with the k1b data set at percentage = 1.0, the MVS validity score is 0.80, while that of CS is only 0.67. This indicates that, on average, when we pick any document and consider its neighborhood of size equal to its true class size, only 67 percent of that document's neighbors based on CS actually belong to its class. If based on MVS, the number of valid neighbors increases to 80 percent. The validity test has illustrated the potential advantage of the new multiviewpoint-based similarity measure compared to the cosine measure.
4 MULTIVIEWPOINT-BASED CLUSTERING
4.1 Two Clustering Criterion Functions I_R and I_V

Having defined our similarity measure, we now formulate our clustering criterion functions. The first function, called I_R, is the cluster size-weighted sum of average pairwise similarities of documents in the same cluster. First, let us express this sum in a general form by a function F:
$$F = \sum_{r=1}^{k} n_r \left[ \frac{1}{n_r^2} \sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) \right]. \qquad (13)$$
We would like to transform this objective function into a suitable form that facilitates the optimization procedure, so that it can be performed in a simple, fast, and effective way. According to (10),
$$\sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) = \sum_{d_i, d_j \in S_r} \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h)$$
$$= \frac{1}{n - n_r} \sum_{d_i, d_j} \sum_{d_h} \left( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h \right).$$
Since

$$\sum_{d_i \in S_r} d_i = \sum_{d_j \in S_r} d_j = D_r, \quad \sum_{d_h \in S \setminus S_r} d_h = D - D_r \quad \text{and} \quad \| d_h \| = 1,$$
we have

$$\sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) = \sum_{d_i, d_j \in S_r} d_i^t d_j - \frac{2 n_r}{n - n_r} \sum_{d_i \in S_r} d_i^t \sum_{d_h \in S \setminus S_r} d_h + n_r^2$$
$$= \| D_r \|^2 - \frac{2 n_r}{n - n_r} D_r^t (D - D_r) + n_r^2$$
$$= \frac{n + n_r}{n - n_r} \| D_r \|^2 - \frac{2 n_r}{n - n_r} D_r^t D + n_r^2.$$
Substituting into (13), we get

$$F = \sum_{r=1}^{k} \frac{1}{n_r} \left[ \frac{n + n_r}{n - n_r} \| D_r \|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right] + n.$$
Because n is constant, maximizing F is equivalent to maximizing the reduced function F̂:

$$\hat{F} = \sum_{r=1}^{k} \frac{1}{n_r} \left[ \frac{n + n_r}{n - n_r} \| D_r \|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right]. \qquad (14)$$
Comparing the criterion in (14) with the min-max cut in (5), both functions contain the two terms: \|D_r\|^2 (an intracluster similarity
Fig. 3. Characteristics of reuters7 and k1b data sets.
Fig. 4. CS and MVS validity test.
measure) and D_r^t D (an intercluster similarity measure). Nonetheless, while the objective of min-max cut is to minimize the inverse ratio between these two terms, our aim here is to maximize their weighted difference. In (14), this difference term is determined for each cluster, weighted by the inverse of the cluster's size, and then summed up over all the clusters. One problem is that this formulation is expected to be quite sensitive to cluster size. From the formulation of COSA [27], a widely known subspace clustering algorithm, we have learned that it is desirable to have a set of weight factors {λ_r}_{r=1}^{k} to regulate the distribution of the cluster sizes in clustering solutions. Hence, we integrate λ_r into the expression in (14) to have it become
$$F_{\lambda} = \sum_{r=1}^{k} \frac{\lambda_r}{n_r} \left[ \frac{n + n_r}{n - n_r} \| D_r \|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right]. \qquad (15)$$
In common practice, {λ_r}_{r=1}^{k} are often taken to be simple functions of the respective cluster sizes {n_r}_{r=1}^{k} [28]. Let us use a parameter α, called the regulating factor, which has some constant value (α ∈ [0, 1]), and let λ_r = n_r^α in (15). The final form of our criterion function I_R is

$$I_R = \sum_{r=1}^{k} \frac{1}{n_r^{1-\alpha}} \left[ \frac{n + n_r}{n - n_r} \| D_r \|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right]. \qquad (16)$$
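For reference, (16) can be evaluated directly from the cluster composite vectors; below is a minimal sketch (our own code, assuming unit document vectors as in the paper; names are ours):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def criterion_IR(clusters, alpha):
    """Evaluate I_R, eq. (16), for a partition given as a list of clusters,
    each cluster being a list of unit document vectors."""
    docs = [d for c in clusters for d in c]
    n, m = len(docs), len(docs[0])
    D = [sum(d[t] for d in docs) for t in range(m)]      # corpus composite
    total = 0.0
    for c in clusters:
        nr = len(c)
        Dr = [sum(d[t] for d in c) for t in range(m)]    # cluster composite
        w = (n + nr) / (n - nr)
        total += (w * dot(Dr, Dr) - (w - 1.0) * dot(Dr, D)) / nr ** (1.0 - alpha)
    return total
```

On a toy corpus, a well-separated partition scores higher than a mixed one, which is what the criterion intends to reward.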
In the empirical study of Section 5.4, it appears that I_R's performance is not very sensitive to the value of α: the criterion function yields relatively good clustering results for α ∈ (0, 1).

In the formulation of I_R, a cluster's quality is measured by the average pairwise similarity between documents within that cluster. However, such an approach can lead to sensitivity to the size and tightness of the clusters. With CS, for example, the pairwise similarity of documents in a sparse cluster is usually smaller than that in a dense cluster. Though not as clearly as with CS, the same effect may still hinder MVS-based clustering if pairwise similarity is used. To prevent this, an alternative approach is to consider the similarity between each document vector and its cluster's centroid instead. This is expressed in the objective function G:
$$G = \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}\!\left( d_i - d_h, \; \frac{C_r}{\| C_r \|} - d_h \right)$$
$$= \sum_{r=1}^{k} \frac{1}{n - n_r} \sum_{d_i \in S_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t \left( \frac{C_r}{\| C_r \|} - d_h \right). \qquad (17)$$
Similar to the formulation of I_R, we would like to express this objective in a simple form that we can optimize more easily. Expanding the vector dot product, we get
$$\sum_{d_i \in S_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t \left( \frac{C_r}{\| C_r \|} - d_h \right) = \sum_{d_i} \sum_{d_h} \left( d_i^t \frac{C_r}{\| C_r \|} - d_i^t d_h - d_h^t \frac{C_r}{\| C_r \|} + d_h^t d_h \right)$$
$$= (n - n_r) \frac{D_r^t D_r}{\| D_r \|} - D_r^t (D - D_r) - n_r \frac{(D - D_r)^t D_r}{\| D_r \|} + n_r (n - n_r), \quad \text{since } \frac{C_r}{\| C_r \|} = \frac{D_r}{\| D_r \|},$$
$$= n \| D_r \| + \| D_r \|^2 - \frac{\| D_r \| + n_r}{\| D_r \|} D_r^t D + n_r (n - n_r).$$
Substituting the above into (17), we have

$$G = \sum_{r=1}^{k} \left[ \frac{n + \| D_r \|}{n - n_r} \| D_r \| - \left( \frac{n + \| D_r \|}{n - n_r} - 1 \right) \frac{D_r^t D}{\| D_r \|} \right] + n.$$
Again, we can eliminate n because it is a constant. Maximizing G is equivalent to maximizing I_V below:

$$I_V = \sum_{r=1}^{k} \left[ \frac{n + \| D_r \|}{n - n_r} \| D_r \| - \left( \frac{n + \| D_r \|}{n - n_r} - 1 \right) \frac{D_r^t D}{\| D_r \|} \right]. \qquad (18)$$
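Like I_R, the criterion (18) can be evaluated directly from cluster composites; a minimal sketch (our own code, assuming unit document vectors):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def criterion_IV(clusters):
    """Evaluate I_V, eq. (18): for each cluster, the weighted difference of
    ||D_r|| (intracluster term) and D_r^t D / ||D_r|| (intercluster term)."""
    docs = [d for c in clusters for d in c]
    n, m = len(docs), len(docs[0])
    D = [sum(d[t] for d in docs) for t in range(m)]      # corpus composite
    total = 0.0
    for c in clusters:
        nr = len(c)
        Dr = [sum(d[t] for d in c) for t in range(m)]    # cluster composite
        norm_Dr = math.sqrt(dot(Dr, Dr))
        w = (n + norm_Dr) / (n - nr)
        total += w * norm_Dr - (w - 1.0) * dot(Dr, D) / norm_Dr
    return total
```

As with I_R, on a toy corpus a well-separated partition scores higher than a mixed one.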
I_V calculates the weighted difference between the two terms \|D_r\| and D_r^t D / \|D_r\|, which again represent an intracluster similarity measure and an intercluster similarity measure, respectively. The first term is actually equivalent to an element of the sum in the spherical k-means objective function in (4); the second one is similar to an element of the sum in the min-max cut criterion in (6), but with \|D_r\| as the scaling factor instead of \|D_r\|^2. We have presented our clustering criterion functions I_R and I_V in their simple forms. Next, we show how to perform clustering by using a greedy algorithm to optimize these functions.
4.2 Optimization Algorithm and Complexity
We denote our clustering framework by MVSC, meaning Clustering with Multiviewpoint-based Similarity. Subsequently, we have MVSC-IR and MVSC-IV, which are MVSC with criterion functions I_R and I_V, respectively. The main goal is to perform document clustering by optimizing I_R in (16) and I_V in (18). For this purpose, the incremental k-way algorithm [18], [29], a sequential version of k-means, is employed. Considering that the expression of I_V in (18) depends only on n_r and D_r, r = 1, ..., k, I_V can be written in a general form

$$I_V = \sum_{r=1}^{k} I_r(n_r, D_r), \qquad (19)$$

where I_r(n_r, D_r) corresponds to the objective value of cluster r. The same applies to I_R. With this general form, the incremental optimization algorithm, which has two major steps, Initialization and Refinement, is described in Fig. 5.
At Initialization, k arbitrary documents are selected to be the seeds from which initial partitions are formed. Refinement is a procedure that consists of a number of iterations. During each iteration, the n documents are visited one by one in a totally random order. Each document is checked to see whether moving it to another cluster improves the objective function. If yes, the document is moved to the
cluster that leads to the highest improvement. If no cluster is better than the current cluster, the document is not moved. The clustering process terminates when an iteration completes without any documents being moved to new clusters. Unlike the traditional k-means, this algorithm is a stepwise optimal procedure. While k-means only updates after all n documents have been reassigned, the incremental clustering algorithm updates immediately whenever a document is moved to a new cluster. Since every move, when it happens, increases the objective function value, convergence to a local optimum is guaranteed.
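The refinement loop can be sketched generically; any objective expressible over (n_r, D_r) as in (19) can be plugged in. This sketch (our own code, not the paper's implementation) uses the spherical k-means-style objective Σ_r ||D_r|| for brevity, and recomputes the objective per candidate move instead of using the efficient incremental updates the paper relies on:

```python
import math, random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sum_composite_norms(docs, labels, k):
    """A simple plug-in objective: sum over clusters of ||D_r||."""
    m = len(docs[0])
    total = 0.0
    for r in range(k):
        Dr = [0.0] * m
        for d, lab in zip(docs, labels):
            if lab == r:
                Dr = [a + b for a, b in zip(Dr, d)]
        total += math.sqrt(dot(Dr, Dr))
    return total

def incremental_clustering(docs, k, objective, seed=0, max_iter=50):
    """Incremental k-way refinement (cf. Fig. 5): visit documents in random
    order, move each to the cluster giving the largest objective gain, and
    stop when a full pass makes no move."""
    rng = random.Random(seed)
    labels = [i % k for i in range(len(docs))]   # simple balanced seeding
    rng.shuffle(labels)
    for _ in range(max_iter):
        moved = False
        order = list(range(len(docs)))
        rng.shuffle(order)
        for i in order:
            current = labels[i]
            best, best_val = current, objective(docs, labels, k)
            for r in range(k):
                if r == current:
                    continue
                labels[i] = r                     # try the move
                val = objective(docs, labels, k)
                if val > best_val:
                    best, best_val = r, val
            labels[i] = best                      # keep the best assignment
            moved = moved or best != current
        if not moved:
            break
    return labels
```

Each accepted move strictly increases the objective, so the loop terminates at a local optimum, mirroring the convergence argument above.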
During the optimization procedure, in each iteration, the main sources of computational cost are:

. Searching for optimum clusters to move individual documents to: O(nz · k).
. Updating composite vectors as a result of such moves: O(m · k).

where nz is the total number of nonzero entries in all document vectors. Our clustering approach is partitional and incremental; therefore, computing a similarity matrix is not needed at all. If t denotes the number of iterations the algorithm takes, then, since nz is often several tens of times larger than m in the document domain, the computational complexity required for clustering with I_R and I_V is O(nz · k · t).
5 PERFORMANCE EVALUATION OF MVSC
To verify the advantages of our proposed methods, we evaluate their performance in experiments on document data. The objective of this section is to compare MVSC-IR and MVSC-IV with existing algorithms that also use specific similarity measures and criterion functions for document clustering. The similarity measures to be compared include euclidean distance, cosine similarity, and the extended Jaccard coefficient.
5.1 Document Collections
The data corpora that we used for experiments consist of 20 benchmark document data sets. Besides reuters7 and k1b, which have been described in detail earlier, we included another 18 text collections so that the examination of the clustering methods is more thorough and exhaustive. Similar to k1b, these data sets are provided together with CLUTO by the toolkit's authors [19]. They had been used for experimental testing in previous papers, and their source and origin had also been described in detail [30], [31]. Table 2 summarizes their characteristics. The corpora present a diversity of size, number of classes, and class balance. They were all preprocessed by standard procedures, including stop-word removal, stemming, removal of too rare as well as too frequent words, TF-IDF weighting, and normalization.
5.2 Experimental Setup and Evaluation
To demonstrate how well MVSCs can perform, we compare them with five other clustering methods on the 20 data sets in Table 2. In summary, the seven clustering algorithms are:
. MVSC-IR: MVSC using criterion function I_R
. MVSC-IV: MVSC using criterion function I_V
. k-means: standard k-means with euclidean distance
. Spkmeans: spherical k-means with CS
. graphCS: CLUTO's graph method with CS
. graphEJ: CLUTO's graph method with extended Jaccard
. MMC: spectral min-max cut algorithm [13].
Our MVSC-IR and MVSC-IV programs are implemented in Java. The regulating factor α in I_R is always set at 0.3 during the experiments; we observed that this is one of the most appropriate values. A study of MVSC-IR's performance relative to different α values is presented in a later section. The other algorithms are provided by the C library interface which is available freely with the CLUTO toolkit [19]. For
Fig. 5. Algorithm: Incremental clustering.
TABLE 2: Document Data Sets
each data set, the cluster number is predefined to equal the number of true classes, i.e., k = c.
None of the above algorithms is guaranteed to find the global optimum, and all of them are initialization-dependent. Hence, for each method, we performed clustering several times with randomly initialized values and chose the best trial in terms of the corresponding objective function value. In all the experiments, each test run consisted of 10 trials. Moreover, the result reported here on each data set by a particular clustering method is the average of 10 test runs.
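This best-of-several-trials protocol can be sketched as a generic wrapper. This is a hypothetical illustration: `cluster` and `objective` stand in for any of the seven algorithms and their criterion functions (assumed here to be maximized).

```python
import random

# Sketch of the evaluation protocol described above: each test run performs
# several randomly initialized trials and keeps the trial whose criterion
# (objective) value is best.
def best_of_trials(cluster, objective, data, k, n_trials=10, seed=0):
    rng = random.Random(seed)
    best_labels, best_obj = None, float("-inf")
    for _ in range(n_trials):
        labels = cluster(data, k, rng)   # one randomly initialized trial
        obj = objective(data, labels)    # criterion function value
        if obj > best_obj:
            best_labels, best_obj = labels, obj
    return best_labels, best_obj
```

Reported figures are then the average over 10 such test runs.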
After a test run, the clustering solution is evaluated by comparing the documents' assigned labels with their true labels provided by the corpus. Three types of external evaluation metric are used to assess clustering performance: FScore, Normalized Mutual Information (NMI), and Accuracy. FScore is an equally weighted combination of the precision (P) and recall (R) values used in information retrieval. Given a clustering solution, FScore is determined as
$$\mathrm{FScore} = \sum_{i=1}^{k} \frac{n_i}{n}\,\max_{j}\{F_{i,j}\},$$

$$\text{where } F_{i,j} = \frac{2\,P_{i,j}\,R_{i,j}}{P_{i,j}+R_{i,j}}, \quad P_{i,j} = \frac{n_{i,j}}{n_j}, \quad R_{i,j} = \frac{n_{i,j}}{n_i},$$
where n_i denotes the number of documents in class i, n_j the number of documents assigned to cluster j, and n_{i,j} the number of documents shared by class i and cluster j. From another aspect, NMI measures the information that the true class partition and the cluster assignment share: it measures how much knowing the clusters helps us know the classes:
$$\mathrm{NMI} = \frac{\displaystyle\sum_{i=1}^{k}\sum_{j=1}^{k} n_{i,j}\,\log\frac{n\, n_{i,j}}{n_i\, n_j}}{\sqrt{\Big(\sum_{i=1}^{k} n_i \log\frac{n_i}{n}\Big)\Big(\sum_{j=1}^{k} n_j \log\frac{n_j}{n}\Big)}}.$$
Finally, Accuracy measures the fraction of documents that are correctly labeled, assuming a one-to-one correspondence between true classes and assigned clusters. Let q denote any possible permutation of the index set {1, ..., k}; Accuracy is calculated by
$$\mathrm{Accuracy} = \frac{1}{n}\,\max_{q}\,\sum_{i=1}^{k} n_{i,q(i)}.$$
The best mapping q to determine Accuracy can be found by the Hungarian algorithm.2 For all three metrics, the range is from 0 to 1, and a greater value indicates a better clustering solution.
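For illustration, the three metrics above can be implemented directly from the class/cluster contingency counts. In this sketch a brute-force search over cluster-label permutations stands in for the Hungarian algorithm, so it is practical only for small k; all function names are illustrative.

```python
import math
from collections import Counter
from itertools import permutations

# Illustrative implementations of the three external metrics defined above.
def fscore(classes, clusters):
    n = len(classes)
    nij = Counter(zip(classes, clusters))       # contingency counts n_{i,j}
    ni, nj = Counter(classes), Counter(clusters)
    total = 0.0
    for i in ni:
        best = 0.0
        for j in nj:
            p = nij[(i, j)] / nj[j]             # precision P_{i,j}
            r = nij[(i, j)] / ni[i]             # recall R_{i,j}
            if p + r > 0:
                best = max(best, 2 * p * r / (p + r))
        total += ni[i] / n * best
    return total

def nmi(classes, clusters):
    n = len(classes)
    nij = Counter(zip(classes, clusters))
    ni, nj = Counter(classes), Counter(clusters)
    mi = sum(c * math.log(n * c / (ni[i] * nj[j])) for (i, j), c in nij.items())
    hi = sum(c * math.log(c / n) for c in ni.values())
    hj = sum(c * math.log(c / n) for c in nj.values())
    return mi / math.sqrt(hi * hj)

def accuracy(classes, clusters):
    # Assumes the number of clusters equals the number of classes (k = c),
    # as in the experiments; brute force replaces the Hungarian algorithm.
    nij = Counter(zip(classes, clusters))
    ks, js = sorted(set(classes)), sorted(set(clusters))
    return max(sum(nij[(i, q)] for i, q in zip(ks, perm))
               for perm in permutations(js)) / len(classes)
```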
5.3 Results
Fig. 6 shows the Accuracy of the seven clustering algorithms on the 20 text collections. Presented in a different way, clustering results based on FScore and NMI are reported in Tables 3 and 4, respectively. For each data set in a row, the value in bold and underlined is the best result, while the value in bold only is the second best.
It can be observed that MVSC-IR and MVSC-IV perform consistently well. In Fig. 6, on 19 of the 20 data sets (all except reviews), either both or one of the MVSC approaches are among the top two algorithms. The next most consistent performer is Spkmeans. The other algorithms may work well on certain data sets: for example, graphEJ yields an outstanding result on classic, and graphCS and MMC are good on reviews. But they do not fare very well on the rest of the collections.
To provide statistical justification for the clustering performance comparisons, we also carried out statistical significance tests. Each of MVSC-IR and MVSC-IV was paired up with one of the remaining algorithms for a paired t-test [32]. Given two paired sets X and Y of N measured values, the null hypothesis of the test is that the differences between X and Y come from a population with mean 0. The alternative hypothesis is that the paired sets differ from each other in a significant way. In our experiments, these tests were done based on the evaluation values obtained on the 20 data sets. The typical 5 percent significance level was used. For example, considering the pair (MVSC-IR, k-means), from Table 3 it is seen that MVSC-IR dominates k-means w.r.t. FScore. If the paired t-test returns a p-value smaller than 0.05, we reject the null hypothesis and say that the dominance is significant. Otherwise, the null hypothesis holds and the comparison is considered insignificant.
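The test statistic itself is straightforward to compute. The following sketch returns only the paired t statistic; the p-value would then come from a t distribution with N - 1 degrees of freedom (in practice one would use a statistics library such as scipy.stats.ttest_rel).

```python
import math

# Sketch of the paired t-test used above. Given paired scores x and y
# (one pair per data set), the statistic tests whether their mean
# difference is zero.
def paired_t_statistic(x, y):
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```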
The outcomes of the paired t-tests are presented in Table 5. As the tests show, the advantage of MVSC-IR and MVSC-IV over the other methods is statistically significant. A special case is the graphEJ algorithm. On the one hand, MVSC-IR is not significantly better than graphEJ based on FScore or NMI. On the other hand, even when MVSC-IR and MVSC-IV test as clearly better than graphEJ, the p-values, although smaller than 0.05, can still be considered relatively large. The reason is that, as observed before, graphEJ's results on the classic data set are very different from those of the other algorithms. While interesting, these values can be considered outliers, and including them in the statistical tests would greatly affect the outcomes. Hence, we also report in Table 5 the tests in which classic was excluded and only results on the other 19 data sets were used. Under this circumstance, both MVSC-IR and MVSC-IV outperform graphEJ significantly, with good p-values.
5.4 Effect of α on MVSC-IR's Performance

It is known that criterion-function-based partitional clustering methods can be sensitive to cluster size and balance. In the formulation of I_R in (16), there exists a parameter α, called the regulating factor, α ∈ [0, 1]. To examine how the choice of α could affect MVSC-IR's performance, we evaluated MVSC-IR with values of α from 0 to 1, at 0.1 increments. The assessment was done based on the clustering results in NMI, FScore, and Accuracy, each averaged over all 20 given data sets. Since the evaluation metrics for different data sets can differ greatly from one another, simply taking the average over all the data sets would not be very meaningful. Hence, we employed the method used in [18] to transform the metrics into relative metrics before averaging. On a particular document collection S, the relative
2. http://en.wikipedia.org/wiki/Hungarian_algorithm.
FScore measure of MVSC-IR with α = α_i is determined as follows:
$$\mathrm{relative\_FScore}(I_R;\, S, \alpha_i) = \frac{\max_{j}\{\mathrm{FScore}(I_R;\, S, \alpha_j)\}}{\mathrm{FScore}(I_R;\, S, \alpha_i)},$$
where α_i, α_j ∈ {0.0, 0.1, ..., 1.0} and FScore(I_R; S, α_i) is the FScore result on data set S obtained by MVSC-IR with α = α_i. The same transformation was applied to NMI and Accuracy to yield relative_NMI and relative_Accuracy, respectively. MVSC-IR performs best with a setting α_i if its relative measure has a value of 1. Otherwise its relative measure is greater than 1; the larger this value is, the worse MVSC-IR with α_i performs in comparison with other settings of α. Finally, the average relative measures were calculated over
all the data sets to present the overall performance.

Fig. 7 shows the plot of average relative FScore, NMI, and Accuracy w.r.t. different values of α. In a broad view, MVSC-IR performs worst at the extreme values of α (0 and 1), and tends to improve when α is set to some intermediate value between 0 and 1. Based on our experimental study, MVSC-IR always produces results within 5 percent of the best case, regarding any type of evaluation metric, with α from 0.2 to 0.8.
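The relative-metric transformation described above can be sketched as follows. The naming is illustrative: `scores_by_alpha` maps each α setting to the raw FScore obtained on one data set.

```python
# Sketch of the relative-metric transformation: the raw FScore at each
# alpha is replaced by the ratio of the best FScore over all alpha
# settings to the FScore at that alpha, so 1.0 marks the best setting
# and larger values mark progressively worse ones.
def relative_metric(scores_by_alpha):
    best = max(scores_by_alpha.values())
    return {alpha: best / score for alpha, score in scores_by_alpha.items()}
```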
6 MVSC AS REFINEMENT FOR k-MEANS
From the analysis of (12) in Section 3.2, MVS provides an additional criterion for measuring the similarity among documents compared with CS. Alternatively, MVS can be considered a refinement of CS, and consequently the MVSC algorithms refinements of spherical k-means, which uses CS. To further investigate the appropriateness and effectiveness of MVS and its clustering algorithms, we carried out another set of experiments in which solutions obtained by Spkmeans were further optimized by MVSC-IR and MVSC-IV. The rationale for doing so is that if the final solutions by MVSC-IR and MVSC-IV are better than the
Fig. 6. Clustering results in Accuracy. Left-to-right in legend corresponds to left-to-right in the plot.
intermediate ones obtained by Spkmeans, MVS is indeed good for the clustering problem. These experiments reveal more clearly whether MVS actually improves clustering performance compared with CS.
In the previous section, the MVSC algorithms were compared against existing algorithms that are closely related to them, i.e., ones that also employ similarity measures and criterion functions. In this section, we make use of the extended experiments to further compare MVSC with a different type of clustering approach, the NMF methods [10], which do not use any form of explicitly defined similarity measure for documents.
6.1 TDT2 and Reuters-21578 Collections
For variety and thoroughness, in this empirical study we used two new document corpora, described in Table 6: TDT2 and Reuters-21578. The original TDT2 corpus,3 which consists of 11,201 documents in 96 topics (i.e., classes), has been one of the most standard sets for document clustering purposes. We used a subcollection of this corpus which contains 10,021 documents in the largest 56 topics. The Reuters-21578 Distribution 1.0 has been mentioned earlier in this paper. The original corpus consists of 21,578 documents
TABLE 4: Clustering Results in NMI

TABLE 3: Clustering Results in FScore
3. http://www.nist.gov/speech/tests/tdt/tdt98/index.html.
in 135 topics. We used a subcollection of 8,213 documents from the largest 41 topics. The same two document
collections had been used in the paper of the NMF methods
[10]. Documents that appear in two or more topics were
removed, and the remaining documents were preprocessed
in the same way as in Section 5.1.
6.2 Experiments and Results
The following clustering methods:
. Spkmeans: spherical k-means
. rMVSC-IR: refinement of Spkmeans by MVSC-IR
. rMVSC-IV: refinement of Spkmeans by MVSC-IV
. MVSC-IR: normal MVSC using criterion I_R
. MVSC-IV: normal MVSC using criterion I_V,
and two new document clustering approaches that do not
use any particular form of similarity measure:
. NMF: Nonnegative Matrix Factorization method
. NMF-NCW: Normalized Cut Weighted NMF
were involved in the performance comparison. When used as a refinement for Spkmeans, the algorithms rMVSC-IR and rMVSC-IV worked directly on the output solution of Spkmeans. The cluster assignment produced by Spkmeans was used as initialization for both rMVSC-IR and rMVSC-IV. We also investigated the performance of the original MVSC-IR and MVSC-IV further on the new data sets. Besides, it would be interesting to see how they and their Spkmeans-initialized versions fare against each other. What is more, two well-known document clustering approaches based on nonnegative matrix factorization, NMF and NMF-NCW [10], are also included for comparison with our algorithms, which use the explicit MVS measure.
During the experiments, each of the two corpora in Table 6 was used to create six different test cases, each of which corresponded to a distinct number of topics used (c = 5, ..., 10). For each test case, c topics were randomly selected from the corpus and their documents were mixed together to form a test set. This selection was repeated 50 times so that each test case had 50 different test sets. The average performance of the clustering algorithms with k = c was calculated over these 50 test sets. This experimental setup is inspired by similar experiments conducted in the NMF paper [10]. Furthermore, similar to the previous experimental setup in Section 5.2, each algorithm (including NMF and NMF-NCW) actually considered 10 trials on any test set before using the solution with the best obtainable objective function value as its final output.
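The test-set sampling protocol described above can be sketched as follows. This is a hypothetical illustration: `corpus`, mapping each topic to its list of documents, is an assumed structure, not from the paper.

```python
import random

# Sketch of the test-set construction described above: for each test case,
# c topics are drawn at random and their documents pooled; this is repeated
# 50 times to give 50 test sets per test case.
def make_test_sets(corpus, c, n_sets=50, seed=0):
    rng = random.Random(seed)
    topics = sorted(corpus)
    test_sets = []
    for _ in range(n_sets):
        chosen = rng.sample(topics, c)                  # c random topics
        test_sets.append([(t, d) for t in chosen for d in corpus[t]])
    return test_sets
```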
The clustering results on TDT2 and Reuters-21578 are shown in Tables 7 and 8, respectively. For each test case in a column, the value in bold and underlined is the best among the results returned by the algorithms, while the value in bold only is the second best. From the tables, several observations can be made. First, MVSC-IR and MVSC-IV continue to show that they are good clustering algorithms by frequently outperforming the other methods. They are always the best in every test case of TDT2.
Fig. 7. MVSC-IR's performance with respect to α.
TABLE 6: TDT2 and Reuters-21578 Document Corpora

TABLE 5: Statistical Significance of Comparisons Based on Paired t-Tests with 5 Percent Significance Level
Compared with NMF-NCW, they are better in almost all cases; the only exception is Reuters-21578 with k = 5, where NMF-NCW is the best based on Accuracy.
The second observation, which is also the main objective of this empirical study, is that by applying MVSC to refine the output of spherical k-means, clustering solutions are improved significantly. Both rMVSC-IR and rMVSC-IV lead to higher NMI and Accuracy than Spkmeans in all cases. Interestingly, there are many circumstances where Spkmeans' result is worse than those of the NMF clustering methods, but after being refined by the MVSCs, it becomes better. To give a more descriptive picture of the improvements, we refer to the radar charts in Fig. 8. The figure shows details of a particular test case where k = 5. Recall that a test case consists of 50 different test sets. The charts display the result on each test set, including the accuracy result
TABLE 7: Clustering Results on TDT2

TABLE 8: Clustering Results on Reuters-21578

Fig. 8. Accuracies on the 50 test sets (in sorted order of Spkmeans) in the test case k = 5.
obtained by Spkmeans, and the results after refinement by MVSC, namely rMVSC-IR and rMVSC-IV. For effective visualization, they are sorted in ascending order of the accuracies obtained by Spkmeans (clockwise). As the patterns in both Figs. 8a and 8b reveal, improvement in accuracy is most likely attainable by rMVSC-IR and rMVSC-IV. Many of the improvements come with a considerably large margin, especially when the original accuracy obtained by Spkmeans is low. There are only a few exceptions where, after refinement, accuracy becomes worse. Nevertheless, the decreases in such cases are small.
Finally, it is also interesting to notice from Tables 7 and 8 that MVSC preceded by spherical k-means does not necessarily yield better clustering results than MVSC with random initialization. There are only a small number of cases in the two tables where rMVSC is found to be better than MVSC. This phenomenon, however, is understandable. Given a locally optimal solution returned by spherical k-means, the rMVSC algorithms, as a refinement method, would be constrained by this local optimum itself and, hence, their search space might be restricted. The original MVSC algorithms, on the other hand, are not subject to this constraint, and are able to follow the search trajectory of their objective function from the beginning. Hence, while the performance improvement obtained by refining spherical k-means results with MVSC proves the appropriateness of MVS and its criterion functions for document clustering, this observation in fact only reaffirms its potential.
7 CONCLUSIONS AND FUTURE WORK
In this paper, we propose a Multiviewpoint-based Similarity measuring method, named MVS. Theoretical analysis and empirical examples show that MVS is potentially more suitable for text documents than the popular cosine similarity. Based on MVS, two criterion functions, I_R and I_V, and their respective clustering algorithms, MVSC-IR and MVSC-IV, have been introduced. Compared with other state-of-the-art clustering methods that use different types of similarity measure, on a large number of document data sets and under different evaluation metrics, the proposed algorithms show that they can provide significantly improved clustering performance.
The key contribution of this paper is the fundamental concept of similarity measure from multiple viewpoints. Future methods could make use of the same principle, but define alternative forms for the relative similarity in (10), or use methods other than averaging to combine the relative similarities from the different viewpoints. Besides, this paper focuses on partitional clustering of documents. In the future, it would also be possible to apply the proposed criterion functions to hierarchical clustering algorithms. Finally, we have shown the application of MVS and its clustering algorithms to text data. It would be interesting to explore how they work on other types of sparse and high-dimensional data.
REFERENCES

[1] X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.-H. Zhou, M. Steinbach, D.J. Hand, and D. Steinberg, "Top 10 Algorithms in Data Mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37, 2007.
[2] I. Guyon, U.V. Luxburg, and R.C. Williamson, "Clustering: Science or Art?," Proc. NIPS Workshop Clustering Theory, 2009.
[3] I. Dhillon and D. Modha, "Concept Decompositions for Large Sparse Text Data Using Clustering," Machine Learning, vol. 42, nos. 1/2, pp. 143-175, Jan. 2001.
[4] S. Zhong, "Efficient Online Spherical K-means Clustering," Proc. IEEE Int'l Joint Conf. Neural Networks (IJCNN), pp. 3180-3185, 2005.
[5] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman Divergences," J. Machine Learning Research, vol. 6, pp. 1705-1749, Oct. 2005.
[6] E. Pekalska, A. Harol, R.P.W. Duin, B. Spillmann, and H. Bunke, "Non-Euclidean or Non-Metric Measures Can Be Informative," Structural, Syntactic, and Statistical Pattern Recognition, vol. 4109, pp. 871-880, 2006.
[7] M. Pelillo, "What Is a Cluster? Perspectives from Game Theory," Proc. NIPS Workshop Clustering Theory, 2009.
[8] D. Lee and J. Lee, "Dynamic Dissimilarity Measure for Support-Based Clustering," IEEE Trans. Knowledge and Data Eng., vol. 22, no. 6, pp. 900-905, June 2010.
[9] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Clustering on the Unit Hypersphere Using Von Mises-Fisher Distributions," J. Machine Learning Research, vol. 6, pp. 1345-1382, Sept. 2005.
[10] W. Xu, X. Liu, and Y. Gong, "Document Clustering Based on Non-Negative Matrix Factorization," Proc. 26th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 267-273, 2003.
[11] I.S. Dhillon, S. Mallela, and D.S. Modha, "Information-Theoretic Co-Clustering," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 89-98, 2003.
[12] C.D. Manning, P. Raghavan, and H. Schutze, An Introduction to Information Retrieval. Cambridge Univ. Press, 2009.
[13] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, "A Min-Max Cut Algorithm for Graph Partitioning and Data Clustering," Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 107-114, 2001.
[14] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, "Spectral Relaxation for K-Means Clustering," Proc. Neural Information Processing Systems (NIPS), pp. 1057-1064, 2001.
[15] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[16] I.S. Dhillon, "Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning," Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 269-274, 2001.
[17] Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis. Springer-Verlag, 2007.
[18] Y. Zhao and G. Karypis, "Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering," Machine Learning, vol. 55, no. 3, pp. 311-331, June 2004.
[19] G. Karypis, "CLUTO: A Clustering Toolkit," technical report, Dept. of Computer Science, Univ. of Minnesota, http://glaros.dtc.umn.edu/~gkhome/views/cluto, 2003.
[20] A. Strehl, J. Ghosh, and R. Mooney, "Impact of Similarity Measures on Web-Page Clustering," Proc. 17th Nat'l Conf. Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI), pp. 58-64, July 2000.
[21] A. Ahmad and L. Dey, "A Method to Compute Distance Between Two Categorical Values of Same Attribute in Unsupervised Learning for Categorical Data Set," Pattern Recognition Letters, vol. 28, no. 1, pp. 110-118, 2007.
[22] D. Ienco, R.G. Pensa, and R. Meo, "Context-Based Distance Learning for Categorical Data Clustering," Proc. Eighth Int'l Symp. Intelligent Data Analysis (IDA), pp. 83-94, 2009.
[23] P. Lakkaraju, S. Gauch, and M. Speretta, "Document Similarity Based on Concept Tree Distance," Proc. 19th ACM Conf. Hypertext and Hypermedia, pp. 127-132, 2008.
[24] H. Chim and X. Deng, "Efficient Phrase-Based Document Similarity for Clustering," IEEE Trans. Knowledge and Data Eng., vol. 20, no. 9, pp. 1217-1229, Sept. 2008.
[25] S. Flesca, G. Manco, E. Masciari, L. Pontieri, and A. Pugliese, "Fast Detection of XML Structural Similarity," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 2, pp. 160-175, Feb. 2005.
[26] E.-H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "WebACE: A Web Agent for Document Categorization and Exploration," Proc. Second Int'l Conf. Autonomous Agents (AGENTS '98), pp. 408-415, 1998.
[27] J. Friedman and J. Meulman, "Clustering Objects on Subsets of Attributes," J. Royal Statistical Soc. Series B, vol. 66, no. 4, pp. 815-839, 2004.
[28] L. Hubert, P. Arabie, and J. Meulman, Combinatorial Data Analysis: Optimization by Dynamic Programming. SIAM, 2001.
[29] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. John Wiley & Sons, 2001.
[30] S. Zhong and J. Ghosh, "A Comparative Study of Generative Models for Document Clustering," Proc. SIAM Int'l Conf. Data Mining Workshop Clustering High Dimensional Data and Its Applications, 2003.
[31] Y. Zhao and G. Karypis, "Criterion Functions for Document Clustering: Experiments and Analysis," technical report, Dept. of Computer Science, Univ. of Minnesota, 2002.
[32] T.M. Mitchell, Machine Learning. McGraw-Hill, 1997.
Duc Thang Nguyen received the BEng degree in electrical and electronic engineering from Nanyang Technological University, Singapore, where he is also working toward the PhD degree in the Division of Information Engineering. Currently, he is an operations planning analyst at PSA Corporation Ltd., Singapore. His research interests include algorithms, information retrieval, data mining, optimizations, and operations research.
Lihui Chen received the BEng degree in computer science and engineering from Zhejiang University, China, and the PhD degree in computational science from the University of St. Andrews, United Kingdom. Currently, she is an associate professor in the Division of Information Engineering at Nanyang Technological University in Singapore. Her research interests include machine learning algorithms and applications, data mining, and web intelligence. She has published more than 70 refereed papers in international journals and conferences in these areas. She is a senior member of the IEEE, and a member of the IEEE Computational Intelligence Society.
Chee Keong Chan received the BEng degree from the National University of Singapore in electrical and electronic engineering, the MSc degree and DIC in computing from Imperial College, University of London, and the PhD degree from Nanyang Technological University. Upon graduation from NUS, he worked as an R&D engineer in Philips Singapore for several years. Currently, he is an associate professor in the Information Engineering Division, lecturing in subjects related to computer systems, artificial intelligence, software engineering, and cyber security. Through his years in NTU, he has published numerous research papers in conferences, journals, books, and book chapters. He has also provided numerous consultations to industry. His current research interests include data mining (text and solar radiation data mining), evolutionary algorithms (scheduling and games), and renewable energy.