TRANSCRIPT
18/04/23 Nikos Hourdakis, MSc Thesis 1
Design and Evaluation of Clustering Approaches for Large Document Collections, The “BIC-Means” Method
Nikolaos Hourdakis
Technical University of Crete, Department of Electronic and Computer Engineering
Motivation
Large document collections arise in many applications: digital libraries, the Web.
There is growing interest in methods for more effective management of information: abstraction, browsing, classification, retrieval.
Clustering is a means for achieving better organization of information: the data space is partitioned into groups of entities with similar content.
Outline
Background: state-of-the-art clustering approaches (partitional, hierarchical methods)
K-Means and its variants: Incremental K-Means, Bisecting Incremental K-Means
Proposed method: BIC-Means, Bisecting Incremental K-Means using BIC as the stopping criterion
Evaluation of clustering methods
Application in Information Retrieval
Hierarchical Clustering (1/3)
Nested sequence of clusters. Two approaches:
A. Agglomerative: Starting from singleton clusters, recursively merges the two most similar clusters until there is only one cluster.
B. Divisive (e.g., Bisecting K-Means): Starting with all documents in the same root cluster, iteratively splits each cluster into K clusters.
Hierarchical Clustering – Example (2/3)
[Figure: example dendrogram over points 1-7, showing the nested sequence of clusters.]
Hierarchical Clustering (3/3)
Organization and browsing of large document collections call for hierarchical clustering, but agglomerative clustering has quadratic time complexity.
Prohibitive for large data sets.
Partitional Clustering
We focus on partitional clustering: K-Means, Incremental K-Means, Bisecting K-Means.
At least as good as hierarchical methods.
Low complexity, O(KN): faster than hierarchical clustering for large document collections.
K-Means
1. Randomly select K centroids
2. Repeat ITER times or until the centroids do not change:
a) Assign each instance to the cluster whose centroid is closest.
b) Re-compute the cluster centroids.
Generates a flat partition of K Clusters (K must be known in advance).
Centroid is the mean of a group of instances.
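As a concrete illustration of the two-step loop above, a minimal sketch in Python (the function name kmeans and the tuple-based point representation are illustrative, not the thesis implementation):

```python
import random

def kmeans(points, k, iters=20):
    """Plain K-Means over a list of M-dimensional tuples.
    Returns (centroids, assignment)."""
    centroids = random.sample(points, k)          # 1. randomly select K centroids
    assignment = [0] * len(points)
    for _ in range(iters):                        # 2. repeat ITER times or until stable
        changed = False
        # a) assign each instance to the cluster whose centroid is closest
        for i, p in enumerate(points):
            c = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            if c != assignment[i]:
                assignment[i], changed = c, True
        # b) re-compute each centroid as the mean of its cluster's instances
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
        if not changed:                           # centroids no longer change: stop
            break
    return centroids, assignment
```

Note that K must be chosen in advance and the result depends on the random initial centroids, exactly the weaknesses discussed on the following slides.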
K-Means Example
[Figure: K-Means example; points in the plane assigned to three centroids marked C, initial positions marked x.]
K-Means demo (1/7): http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
K-Means demo (2/7)-(7/7): screenshot sequence stepping through the iterations (no text content).
Comments
No proof of convergence. Converges to a local minimum of the distortion measure (average of the squared distance of the points from their nearest centroids):
$\sum_{i=1}^{N} \| x_i - \mu_{(i)} \|^2$
Too slow for practical databases.
K-Means is fully deterministic once the initial centroids are selected; a bad choice of initial centroids leads to poor clusters.
Incremental K-Means (IK)
In K-Means, new centroids are computed after each iteration (after all documents have been examined).
In Incremental K-Means, each cluster centroid is updated as soon as a document is assigned to a cluster:
$C' = \frac{S \cdot C + d}{S + 1}$
where $C$ is the current centroid, $S$ the current cluster size, and $d$ the newly assigned document.
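The update rule can be read as a running mean; a small sketch (incremental_update is a hypothetical helper, not the thesis code):

```python
def incremental_update(centroid, size, doc):
    """One Incremental K-Means centroid update:
    C' = (S*C + d) / (S + 1), i.e. the running mean after
    adding document d to a cluster of current size S."""
    return tuple((size * c + x) / (size + 1) for c, x in zip(centroid, doc))

# folding documents in one at a time reproduces the batch mean
docs = [(0.0, 0.0), (2.0, 2.0), (4.0, 8.0)]
centroid, size = docs[0], 1
for d in docs[1:]:
    centroid = incremental_update(centroid, size, d)
    size += 1
# centroid is now the mean of all three vectors: (2.0, 10/3)
```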
Comments
Not as sensitive as K-Means to the selection of the initial centroids.
Converges much faster in general.
Bisecting IK-Means (1/4)
A hierarchical clustering solution is produced by recursively applying Incremental K-Means to a document collection.
The documents are initially partitioned into two clusters.
The algorithm iteratively selects and bisects each one of the leaf clusters until singleton clusters are reached.
Bisecting IK-means (2/4)
Input: (d1, d2, …, dN). Output: hierarchy of clusters.
1. Place all documents in one cluster C.
2. Apply Incremental K-Means to split C into K = 2 clusters C1, C2 (leaf clusters).
3. Iteratively split each leaf cluster Ci until K clusters or singleton clusters are produced at the leaves.
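The control flow of the exhaustive bisecting loop can be sketched as follows; split stands in for any 2-way clusterer such as Incremental K-Means with K = 2 (bisect_all and halve are illustrative names, not the thesis code):

```python
def bisect_all(docs, split):
    """Exhaustive bisecting driver: keep splitting leaf clusters
    until only singletons remain.  `split` is any 2-way clusterer
    (e.g. Incremental K-Means with K=2), passed in by the caller."""
    leaves, queue = [], [list(docs)]
    while queue:
        cluster = queue.pop()
        if len(cluster) <= 1:            # singleton leaf: cannot be split further
            leaves.append(cluster)
            continue
        c1, c2 = split(cluster)          # step 2: bisect the selected leaf
        queue.extend([c1, c2])           # step 3: both halves become leaves to revisit
    return leaves

# toy 2-way splitter standing in for Incremental K-Means (K=2)
def halve(cluster):
    s = sorted(cluster)
    return s[:len(s) // 2], s[len(s) // 2:]
```

Because every leaf is split until singletons, the driver always does the maximum amount of work, which motivates the stopping criterion introduced next.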
Bisecting IK-Means (3/4)
The algorithm is exhaustive, terminating at singleton clusters (unless K is known).
Terminating at singleton clusters is time consuming; singleton clusters are meaningless; intermediate clusters are more likely to correspond to real classes.
There is no criterion for stopping bisections before singleton clusters are reached.
Bayesian Information Criterion (BIC) (1/3)
To prevent over-splitting we define a strategy to stop the Bisecting algorithm when meaningful clusters are reached.
Bayesian Information Criterion (BIC) or Schwarz Criterion [Schwarz 1978].
X-Means [Pelleg and Moore, 2000] used BIC for estimating the best K in a given range of values.
Bayesian Information Criterion (BIC) (2/3)
In this work, we suggest using BIC as the splitting criterion of a cluster in order to decide whether a cluster should split or not.
It measures the improvement of the cluster structure between a cluster and its two children clusters.
We compute the BIC score of a cluster and of its two children clusters.
Bayesian Information Criterion (BIC) (3/3)
If the BIC score of the produced children clusters is less than the BIC score of their parent cluster we do not accept the split. We keep the parent cluster as it is.
Otherwise, we accept the split and the algorithm proceeds similarly to lower levels.
Example
The BIC score of the parent cluster is less than the BIC score of the generated cluster structure, so we accept the bisection.
Parent cluster C: BIC(K=1) = 1980. Two resulting clusters C1, C2: BIC(K=2) = 2245.
[Figure: cluster C bisected into children C1 and C2.]
Computing BIC
The BIC score of a data collection is defined as (Kass and Wasserman, 1995):
$BIC(M_j) = \hat{l}_j(D) - \frac{p_j}{2} \log R$
where $\hat{l}_j(D)$ is the log-likelihood of the data set D, $p_j = M \cdot K + 1$ is a function of the number of independent parameters, and R is the number of points.
Log-likelihood
Given a cluster of points that produces a Gaussian distribution N(μ, σ²), the log-likelihood is the probability that a neighborhood of data points follows this distribution.
The log-likelihood of the data can be considered a measure of the cohesiveness of a cluster: it estimates how close the points of the cluster are to the centroid.
Parameters pj
Sometimes, due to the complexity of the data (many dimensions or many data points), the data may follow other distributions.
We penalize the log-likelihood by a function of the number of independent parameters ($\frac{p_j}{2} \log R$).
Notation
μj: coordinates of the j-th centroid
μ(i): centroid nearest to the i-th data point
D: input set of data points
Dj: set of data points that have μj as their closest centroid
R = |D| and Rj = |Dj|
M: the number of dimensions
Mj: family of alternative models (different models correspond to different clustering solutions)
BIC scores the models and chooses the best among the K models.
Computing BIC (1/3)
To compute the log-likelihood of the data we need the parameters of the Gaussian for the data.
Maximum likelihood estimate (MLE) of the variance (under the identical spherical Gaussian assumption):
$\hat{\sigma}^2 = \frac{1}{R-K} \sum_i \left( x_i - \mu_{(i)} \right)^2$
Computing BIC (2/3)
Probability of point $x_i$: Gaussian with the estimated $\hat{\sigma}$ and mean the cluster centroid nearest to $x_i$:
$\hat{P}(x_i) = \frac{R_{(i)}}{R} \cdot \frac{1}{\sqrt{2\pi}\,\hat{\sigma}^{M}} \exp\!\left( -\frac{\|x_i - \mu_{(i)}\|^2}{2\hat{\sigma}^2} \right)$
Log-likelihood of the data:
$l(D) = \log \prod_i \hat{P}(x_i) = \sum_i \left( \log \frac{1}{\sqrt{2\pi}\,\hat{\sigma}^{M}} - \frac{\|x_i - \mu_{(i)}\|^2}{2\hat{\sigma}^2} + \log \frac{R_{(i)}}{R} \right)$
Computing BIC (3/3)
Focusing on the set $D_n$ of points which belong to centroid $n$:
$l(D_n) = -\frac{R_n}{2}\log(2\pi) - \frac{R_n M}{2}\log(\hat{\sigma}^2) - \frac{R_n - K}{2} + R_n \log R_n - R_n \log R$
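The formulas of the last three slides combine into a direct computation of the BIC of a candidate clustering. A sketch under the identical spherical Gaussian assumption; bic_score and its list-of-clusters input format are assumptions for illustration, not the thesis code:

```python
import math

def bic_score(clusters):
    """BIC of a clustering: sum of per-cluster log-likelihoods l(D_n)
    minus the (p_j / 2) * log R penalty, with p_j = M*K + 1."""
    K = len(clusters)
    R = sum(len(c) for c in clusters)
    M = len(clusters[0][0])                       # dimensionality of the points
    centroids = [tuple(sum(xs) / len(c) for xs in zip(*c)) for c in clusters]
    # pooled MLE variance: sigma^2 = 1/(R-K) * sum_i ||x_i - mu_(i)||^2
    sq = sum(sum((a - b) ** 2 for a, b in zip(x, mu))
             for c, mu in zip(clusters, centroids) for x in c)
    sigma2 = sq / (R - K)
    ll = 0.0
    for c in clusters:                            # per-cluster term l(D_n)
        Rn = len(c)
        ll += (-Rn / 2 * math.log(2 * math.pi)
               - Rn * M / 2 * math.log(sigma2)
               - (Rn - K) / 2
               + Rn * math.log(Rn)
               - Rn * math.log(R))
    p = M * K + 1                                 # number of independent parameters
    return ll - p / 2 * math.log(R)
```

For two well-separated groups, scoring them as two clusters should beat scoring the merged set as one cluster, which is exactly the split test BIC-Means applies.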
Proposed Method: BIC-Means (1/2)
BIC-Means: Bisecting InCremental K-Means clustering incorporating BIC as the stopping criterion.
BIC performs a splitting test at each leaf cluster to prevent it from over-splitting.
BIC-Means does not terminate at singleton clusters; it terminates when there are no separable clusters according to BIC.
Proposed Method: BIC-Means (2/2)
Combines the strengths of partitional and hierarchical clustering methods:
Hierarchical clustering output
Low complexity (O(N·K))
Good clustering quality
Produces meaningful clusters at the leaves
BIC-Means Algorithm
Input: S = (d1, d2, …, dn), all data in one cluster.
Output: a hierarchy of clusters.
1. Place all documents in one cluster C.
2. Apply Incremental K-Means to split C into C1, C2.
3. Compute BIC for C and for C1, C2:
   I. If BIC(C) < BIC(C1, C2), put C1, C2 in the queue.
   II. Otherwise do not split C.
4. Repeat steps 2 and 3 until there are no separable leaf clusters in the queue according to BIC.
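The algorithm above can be sketched as a queue-driven loop; split and bic are caller-supplied stand-ins for Incremental K-Means (K = 2) and the BIC score (bic_means is an illustrative name, not the thesis code):

```python
def bic_means(docs, split, bic):
    """BIC-Means driver: bisect a leaf only if the two children score
    a higher BIC than their parent; otherwise keep the parent as a leaf."""
    leaves, queue = [], [list(docs)]
    while queue:
        c = queue.pop()
        if len(c) < 2:                    # nothing left to bisect
            leaves.append(c)
            continue
        c1, c2 = split(c)                 # step 2: tentative bisection
        if bic([c1, c2]) > bic([c]):      # step 3: accept only an improving split
            queue.extend([c1, c2])
        else:
            leaves.append(c)              # split rejected: c is a final leaf
    return leaves
```

With a BIC-style score, a split that does not improve the model is rejected, which is how over-splitting down to singletons is avoided.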
Evaluation
Evaluation of document clustering algorithms on two data sets: OHSUMED (233,445 Medline documents) and Reuters-21578 (21,578 documents).
Application of clustering to information retrieval: evaluation of several cluster-based retrieval strategies, compared with retrieval by exhaustive search on OHSUMED.
F-Measure
How good the clusters approximate the data classes. The F-measure for cluster C and class T is defined as:
$F_{TC} = \frac{2PR}{P+R}$, where $R = \frac{N_{TC}}{N_T}$ and $P = \frac{N_{TC}}{N_C}$
($N_{TC}$: number of documents of class T in cluster C; $N_T$, $N_C$: sizes of T and C).
The F-measure of a class T is the maximum value it achieves over all clusters C: $F_T = \max_C F_{TC}$.
The F-measure of the clustering solution is the mean of $F_T$ over all classes.
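The definition above translates directly into code; a small sketch where clusters and classes are given as sets of document ids (an assumed representation):

```python
def f_measure(clusters, classes):
    """Clustering F-measure: for each class take the best F over all
    clusters, then average over classes."""
    total = 0.0
    for T in classes:
        best = 0.0
        for C in clusters:
            n = len(T & C)                  # N_TC: docs of class T inside cluster C
            if n:
                P, R = n / len(C), n / len(T)
                best = max(best, 2 * P * R / (P + R))
        total += best                       # F_T = max over clusters
    return total / len(classes)             # mean over all classes
```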
Comparison of Clustering Algorithms
[Chart: comparison of K-Means, Incremental K-Means and Bisecting Incremental K-Means on the Reuters1 and OHSUMED1 data sets; average F-Measure over 10 trials.]
Evaluation of Incremental K-Means
[Chart: Incremental K-Means on Reuters1; average F-Measure (10 trials) for 1-4 iterations of center adjustment.]
MeSH Representation of Documents
We use MeSH terms for describing medical documents (OHSUMED).
Each document is represented by a vector of MeSH terms (multi-word terms instead of single word terms).
Leads to a more compact representation (each vector contains fewer terms, about 20).
Sequential approach to extract MeSH terms from OHSUMED documents.
Bisecting Incremental K-Means – Clustering Quality
[Chart: Bisecting Incremental K-Means on OHSUMED2, MeSH terms vs. single-word terms representation; average F-Measure over 10 trials.]
Speed of Clustering
[Chart: Bisecting Incremental K-Means on OHSUMED2, average clustering time: single-word term representation 97.6 min vs. MeSH term representation 14 min.]
Evaluation of BIC-Means
[Chart: comparison of BIC-Means and Bisecting Incremental K-Means; average F-Measure (10 trials) on the OHSUMED2, Reuters1 and Reuters2 data sets.]
Speed of Clustering
[Chart: comparison of BIC-Means and Bisecting Incremental K-Means; average clustering time (min) on the OHSUMED2, Reuters1 and Reuters2 data sets.]
Comments
BIC-Means is much faster than Bisecting Incremental K-Means: it is not an exhaustive algorithm.
It achieves approximately the same F-Measure as the exhaustive Bisecting approach.
It is therefore better suited for clustering large document collections.
Application of Clustering to Information Retrieval
We demonstrate that it is possible to reduce the size of the search (and therefore the retrieval response time) on large data sets (OHSUMED).
BIC-Means is applied on the entire OHSUMED collection; each document is represented by MeSH terms.
We chose 61 queries from the original OHSUMED query set developed by Hersh et al.; each OHSUMED document has been judged for relevance to a query.
Query – Document Similarity
Similarity is defined as the cosine of the angle between document (or query) vectors:
$Sim(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1|\,|d_2|} = \frac{\sum_{i=1}^{M} w_{i,d_1} w_{i,d_2}}{\sqrt{\sum_{i=1}^{M} w_{i,d_1}^2}\,\sqrt{\sum_{i=1}^{M} w_{i,d_2}^2}}$
[Figure: angle θ between vectors d1 and d2.]
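A minimal sketch of this similarity for sparse term-weight vectors (the dict-of-weights representation is an assumption for illustration):

```python
import math

def cosine(d1, d2):
    """Cosine of the angle between two sparse term-weight vectors,
    given as {term: weight} dicts; terms missing from a vector have weight 0."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

The same function scores query-document pairs and query-centroid pairs, which is how the cluster-based retrieval methods below rank clusters.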
Information Retrieval Methods
Method 1: Search the M clusters closest to the query (compute similarity between cluster centroid and query).
Method 2: Search the M clusters closest to the query; each cluster is represented by the 20 most frequent terms of its centroid.
Method 3: Search the M clusters whose centroids contain the terms of the query.
Method 1: Search the M clusters closest to the query (compute similarity between cluster centroid and query).
[Chart: precision-recall curves for the top 1, 3, 10, 30, 50, 100 and 150 clusters, compared against exhaustive search.]
Method 2: Search the M clusters closest to the query; each cluster is represented by the 20 most frequent terms of its centroid.
[Chart: precision-recall curves for the top 10, 50, 100 and 150 clusters with 20-term centroids, compared against the top 150 clusters with full centroids and exhaustive search.]
Method 3: Search the M clusters containing the terms of the query.
[Chart: precision-recall curves for the top 15, 30 and 50 clusters and for all clusters whose centroids contain all query terms, compared against exhaustive search.]
Size of Search
"Avg Num ber of Docum ents searched over the 61 queries"Retrieval Strategy: Retrieve the clusters w hich contain all the MeSH Query
Term s in the ir Centroid.
0
35000
70000
105000
140000
175000
210000
245000
VSM AllClusters Top_50Clusters Top_30Clusters Top_15Clusters
Search Strategy
Nu
m O
F D
ocs
18/04/23 Nikos Hourdakis, MSc Thesis 54
Comments
Best cluster-based retrieval strategy: retrieve only the clusters which contain all the MeSH query terms in their centroid vector (Method 3), then search the documents contained in the retrieved clusters and order them by similarity to the query.
Advantages: searches only 30% of all OHSUMED documents (233,445 docs), as opposed to exhaustive searching, and is almost as effective as retrieval by exhaustive searching (searching without clustering).
Conclusions (1/2)
We implemented and evaluated various partitional clustering techniques: Incremental K-Means and Bisecting Incremental K-Means (the exhaustive approach).
BIC-Means incorporates BIC as a stopping criterion to prevent the clustering from over-splitting, and produces meaningful clusters at the leaves.
Conclusions (2/2)
BIC-Means: much faster than Bisecting Incremental K-Means, as effective as the exhaustive Bisecting approach, and better suited for clustering large document collections.
Cluster-based retrieval strategies reduce the size of the search; the best proposed retrieval method is as effective as exhaustive searching (searching without clustering).
Future Work
Evaluation using more, or application-specific, data sets.
Examine additional cluster-based retrieval strategies (top-down, bottom-up).
Clustering and browsing on Medline.
Clustering dynamic document collections.
Semantic similarity methods in document clustering.
References
Nikos Hourdakis, Michalis Argyriou, Euripides G.M. Petrakis, Evangelos Milios, "Hierarchical Clustering in Medical Document Collections: the BIC-Means Method", Journal of Digital Information Management (JDIM), Vol. 8, No. 2, pp. 71-77, April 2010.
Dan Pelleg, Andrew Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters", Proc. of the 17th Intern. Conf. on Machine Learning (ICML), 2000, pp. 727-734.
Thank you!!!
Questions?