
Text Document Clustering and Similarity Measures

By: Pranjal Singh (10511), Mohit Sharma (11434)

Department of Computer Science & Engineering

o Problem and the Approach

o Document Representation

o Metrics and Similarity measures

o Clustering Algorithms

o Evaluation

o Work done so far

o Work Remaining

The Problem

The ever-increasing volume of text documents has brought challenges for their effective and efficient organization.

Clustering organizes a large quantity of unordered data into a small number of meaningful and coherent clusters.

No single similarity measure or clustering algorithm outperforms all others in all domains.

Approach at a high level

Similarity measures quantify how similar or different two documents are

In this work we run clustering algorithms with five different similarity measures and compare their effectiveness on different text domains.

We are also evaluating cluster quality based on purity and entropy measures.






Document Representation

We are using the ‘bag of words’ model.

Each word corresponds to a dimension in the resulting data space.

Each document then becomes a vector consisting of non-negative values on each dimension.

Document Representation

Let D = {d1, d2, d3, d4, ...} be the set of documents and T = {t1, t2, t3, t4, ..., tm} be the set of unique terms in D.

A document d is then represented as an m-dimensional vector, td = (tf(d, t1), ..., tf(d, tm)), where tf(d, t) is the frequency of term t in document d.

Pre-processing of documents is necessary to make the computations more efficient. More on this later.
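A minimal sketch of the bag-of-words representation described above, assuming whitespace tokenization and lower-casing (neither is specified on the slides). The function name `term_frequency_vectors` is hypothetical.

```python
from collections import Counter


def term_frequency_vectors(documents):
    """Return (terms, vectors): the sorted vocabulary T and one tf vector per document."""
    # Assumption: tokens are lower-cased, whitespace-separated words.
    tokenized = [doc.lower().split() for doc in documents]
    # T = set of unique terms in D, sorted so that each term has a fixed dimension.
    terms = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # td = (tf(d, t1), ..., tf(d, tm)); a Counter returns 0 for absent terms.
        vectors.append([counts[t] for t in terms])
    return terms, vectors


docs = ["the cat sat", "the dog sat on the mat"]
terms, vectors = term_frequency_vectors(docs)
```

Every vector has length m = |T| and only non-negative entries, which is what the similarity measures on the following slides rely on.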






Metric

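Not every distance measure is a metric.

To qualify as a metric, a distance measure d must satisfy the following four conditions for any points x, y and z:

d(x, y) ≥ 0 (non-negativity)

d(x, y) = 0 if and only if x = y (identity)

d(x, y) = d(y, x) (symmetry)

d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)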

Euclidean Distance

Given two documents da and db represented by their term vectors ta and tb respectively, the Euclidean distance is given by:

$$D_E(t_a, t_b) = \left( \sum_{t=1}^{m} \left| w_{t,a} - w_{t,b} \right|^2 \right)^{1/2}$$

where DE = distance between the vectors ta and tb

wt,a and wt,b are the weights given by the tfidf values, i.e. wt,a = tfidf(da, t)

tfidf(d, t) = tf(d, t) * log(|D| / df(t))

|D| = number of documents

df(t) = number of documents in which the term t appears.
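The tf-idf weighting and the standard Euclidean distance can be sketched as follows. This is an illustration only: the function names are hypothetical, the natural logarithm is assumed, and a term that appears in no document is given weight 0.

```python
import math


def tfidf_matrix(tf_vectors):
    """Turn raw tf vectors into tf-idf weights: tf(d, t) * log(|D| / df(t))."""
    n_docs = len(tf_vectors)
    n_terms = len(tf_vectors[0])
    # df(t): number of documents in which term t appears at least once.
    df = [sum(1 for vec in tf_vectors if vec[j] > 0) for j in range(n_terms)]
    return [
        [vec[j] * math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_terms)]
        for vec in tf_vectors
    ]


def euclidean_distance(wa, wb):
    """Square root of the summed squared differences between two weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(wa, wb)))


tf = [[1, 0, 1], [0, 2, 1]]
weights = tfidf_matrix(tf)
distance = euclidean_distance(weights[0], weights[1])
```

Note that a term occurring in every document gets weight log(1) = 0, so it does not contribute to the distance.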

Cosine Similarity

Quantifies the correlation between vectors ta and tb as the cosine of the angle between them in m-dimensional space:

$$SIM_C(t_a, t_b) = \frac{t_a \cdot t_b}{|t_a| \, |t_b|}$$

where SIMC = cosine similarity

ta and tb are vectors containing the weights corresponding to each dimension.

Bounded in [0, 1], because the weights are non-negative, and independent of document length.
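A short sketch of cosine similarity on two weight vectors. The function name is hypothetical, and returning 0 for an all-zero vector is an assumed convention, since the angle is undefined in that case.

```python
import math


def cosine_similarity(ta, tb):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(ta, tb))
    norm_a = math.sqrt(sum(x * x for x in ta))
    norm_b = math.sqrt(sum(y * y for y in tb))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Doubling every weight (a document concatenated with itself) leaves the angle unchanged.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
no_shared_terms = cosine_similarity([1, 0], [0, 1])
```

The first call illustrates independence of document length; the second shows that documents with no shared terms get similarity 0.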

Mahalanobis Distance

Differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant.

$$d_{st} = \sqrt{(x_s - x_t) \, C^{-1} \, (x_s - x_t)^{T}}$$

where

dst = distance between the vectors xs and xt

xs and xt are row vectors of weights given by the tfidf values of the two documents

C is the covariance matrix of the data set
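A dependency-free sketch of the Mahalanobis distance. All function names are hypothetical. Instead of forming the inverse of C explicitly, it solves C z = (xs - xt) by Gaussian elimination, which gives the same value. One practical caveat: when there are more terms than documents the sample covariance matrix is singular, and this sketch then raises an error rather than guessing a remedy.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the data set (one row per document)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(m)]
        for i in range(m)
    ]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def mahalanobis_distance(xs, xt, cov):
    """sqrt((xs - xt) C^-1 (xs - xt)^T), computed via a linear solve."""
    diff = [a - b for a, b in zip(xs, xt)]
    z = solve(cov, diff)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))
```

With the identity matrix as C the result reduces to the Euclidean distance; a larger variance along one dimension shrinks the contribution of differences along that dimension.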

Jaccard Coefficient

Measures similarity as the size of the intersection divided by the size of the union of the objects.

For text documents, the Jaccard coefficient compares the sum weight of shared terms to the sum weight of terms that are present in either of the two documents but are not shared.
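For weighted term vectors, one common way to realize this is the extended Jaccard (Tanimoto) coefficient, dot(ta, tb) / (|ta|^2 + |tb|^2 - dot(ta, tb)). The sketch below assumes that form; the slides do not show which formula was used, and the function name and the zero-vector convention are likewise assumptions.

```python
def jaccard_coefficient(ta, tb):
    """Extended Jaccard (Tanimoto) coefficient for real-valued weight vectors."""
    dot = sum(x * y for x, y in zip(ta, tb))
    denom = sum(x * x for x in ta) + sum(y * y for y in tb) - dot
    if denom == 0:
        return 0.0  # assumption: two empty documents are treated as dissimilar
    return dot / denom


identical = jaccard_coefficient([1, 2], [1, 2])
disjoint = jaccard_coefficient([1, 0], [0, 1])
```

The value is 1 for identical vectors and 0 for documents that share no terms.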

Pearson Correlation

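Takes many different forms; we are using the following formula in our work:

$$SIM_P(t_a, t_b) = \frac{m \sum_{t=1}^{m} w_{t,a} \, w_{t,b} - TF_a \, TF_b}{\sqrt{\left[ m \sum_{t=1}^{m} w_{t,a}^2 - TF_a^2 \right] \left[ m \sum_{t=1}^{m} w_{t,b}^2 - TF_b^2 \right]}}$$

where TF_a = \sum_{t=1}^{m} w_{t,a} and TF_b = \sum_{t=1}^{m} w_{t,b}.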

Ranges over [-1, 1].

In subsequent experiments we use the corresponding distance measure, which is DP = 1 − SIMP when SIMP ≥ 0 and DP = |SIMP| when SIMP < 0.
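A sketch of the Pearson similarity and the derived distance DP. It uses the mean-centered form of the correlation coefficient, which is algebraically equivalent to its other common forms. The function names are hypothetical, and returning 0 for a constant vector (zero variance) is an assumed convention.

```python
import math


def pearson_similarity(ta, tb):
    """Pearson correlation coefficient between two weight vectors."""
    m = len(ta)
    mean_a = sum(ta) / m
    mean_b = sum(tb) / m
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(ta, tb))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in ta) * sum((y - mean_b) ** 2 for y in tb)
    )
    if den == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return num / den


def pearson_distance(ta, tb):
    """DP = 1 - SIMP when SIMP >= 0, and |SIMP| when SIMP < 0."""
    sim = pearson_similarity(ta, tb)
    return 1 - sim if sim >= 0 else abs(sim)
```

Under this definition, perfectly correlated vectors are at distance 0 and perfectly anti-correlated vectors are at distance 1.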






Clustering Algorithms

Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters.

Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
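The agglomerative ("bottom up") procedure can be sketched as below. This is a naive illustration, not the implementation used in this work: it assumes single linkage (distance between the closest pair of members) and Euclidean distance, and it stops when k clusters remain. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, k, dist=euclidean):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    # Each observation starts in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > max(k, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
result = agglomerative(points, 2)
```

Any of the distance measures from the earlier slides can be passed as `dist`. Rescanning all cluster pairs on every merge is what makes the naive version expensive.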

Clustering Algorithms

K-means aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a representative of the cluster.
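A minimal sketch of the standard iterative k-means heuristic (Lloyd's algorithm). Two simplifications are assumed here for the sake of a reproducible example: the first k points serve as initial centroids, and squared Euclidean distance decides which mean is nearest. The function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Return (assignment, centroids) after alternating assign and update steps."""
    # Assumption: deterministic initialization with the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no observation changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


points = [[0, 0], [10, 10], [0, 1], [10, 11]]
assignment, centroids = k_means(points, 2)
```

Practical implementations usually differ in how they initialize the centroids and in how they handle empty clusters.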

Comparisons

Hierarchical Algorithms:

Agglomerative: O(n^3) for the naive algorithm

Divisive: O(2^n) with an exhaustive search over splits

K-means Algorithms:

Various implementations with different heuristics. Each iteration runs in polynomial time.






Datasets

20 news: news articles on different topics

Classic: abstracts of scientific papers

Hitech: San Jose newspaper articles

tr41: from the TREC collection of articles

wap: web page collection

webkb: another web page dataset

r0: standard cluster testing database.

Evaluation

Evaluating and contrasting cluster quality objectively is a difficult task in itself.

In practice, manually assigned category labels are usually used as the baseline criterion for evaluating clusters.

As a result, the clusters, which are generated in an unsupervised way, are compared to the pre-defined category structure, which is normally created by human experts.

This kind of evaluation assumes that the objective of clustering is to replicate human thinking, so a clustering solution is good if the clusters are consistent with the manually created categories.

Evaluation

Purity Measure:

Measures the coherence of a cluster, i.e. the degree to which a cluster contains documents from a single category.

For an ideal cluster, which only contains documents from a single category, the purity value is 1. In general, the higher the purity value, the better the quality of the cluster.
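A short sketch of the purity computation, taking the category labels of the documents in each cluster as input. Purity of one cluster is the share of its documents that belong to its dominant category. Weighting each cluster by its size for the overall score is a common convention and an assumption here, as are the function names.

```python
from collections import Counter


def cluster_purity(labels):
    """Fraction of documents in the cluster that belong to its dominant category."""
    return max(Counter(labels).values()) / len(labels)


def overall_purity(clusters):
    """Size-weighted average of the per-cluster purities (assumed convention)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * cluster_purity(c) for c in clusters) / total


one_cluster = cluster_purity(["sports", "sports", "politics", "sports"])
whole_solution = overall_purity([["sports", "sports"], ["politics", "sports"]])
```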

Evaluation

Entropy Measure:

The entropy measure evaluates the distribution of categories in a given cluster.

The entropy measure is more comprehensive than purity because, rather than just considering the number of objects in and not in the dominant category of a cluster, it considers the overall distribution of all the categories in that cluster.
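A sketch of the entropy computation over the category labels in a cluster. Dividing by log(c), where c is the number of categories in the data set, is assumed so that the value lies in [0, 1]; the slides do not state the normalization. Lower entropy means a better cluster. The function names are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(labels, n_categories):
    """Normalized entropy of the category distribution in one cluster (needs c > 1)."""
    size = len(labels)
    total = 0.0
    for count in Counter(labels).values():
        p = count / size  # share of the cluster that belongs to this category
        total -= p * math.log(p)
    # Assumption: normalize by log(c) so that a uniform mix of all categories scores 1.
    return total / math.log(n_categories)


def overall_entropy(clusters, n_categories):
    """Size-weighted average of the per-cluster entropies (assumed convention)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) * cluster_entropy(c, n_categories) for c in clusters) / n
```

For example, a cluster holding documents of one category only has entropy 0, while an even split between two categories (with c = 2) has entropy 1.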

Experiments

We plan to use both hierarchical and k-means clustering with all the mentioned similarity measures on different datasets from different domains.

We shall then use the purity and entropy measures to assess the quality of the clusters that the two clustering algorithms give with the 5 similarity measures.

We hope to critique the effectiveness of the similarity measures based on cluster quality.

Past Work and Results

The paper by Anna Huang [3] claims the following results for similar experiments.

We hope to do better owing to the use of stemming and better feature selection by PCA.






Work done

• Decided on and obtained the datasets. Created one ourselves. Manual labeling had to be done for some documents.

• Repeatedly pruned the documents before making idf matrices for the datasets. Removed words below a certain threshold frequency and also irrelevant high-frequency words.

• Wrote code for creating the tf(d,t) and tfidf(d,t) matrices from the documents and saving them to mat files.
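The pruning step could look roughly like the sketch below. It is an illustration under stated assumptions: frequency is measured as document frequency, and the thresholds `min_df` and `max_df_ratio` are hypothetical parameters, not the values used in this work.

```python
def prune_terms(terms, tf_vectors, min_df=2, max_df_ratio=0.9):
    """Drop terms that appear in too few or in too many documents."""
    n_docs = len(tf_vectors)
    keep = []
    for j in range(len(terms)):
        # Document frequency of term j.
        df = sum(1 for vec in tf_vectors if vec[j] > 0)
        # Keep a term only if it is neither too rare nor too common.
        if df >= min_df and df / n_docs <= max_df_ratio:
            keep.append(j)
    pruned_terms = [terms[j] for j in keep]
    pruned_vectors = [[vec[j] for j in keep] for vec in tf_vectors]
    return pruned_terms, pruned_vectors


terms = ["a", "b", "c"]
tf = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 0, 0]]
pruned_terms, pruned_vectors = prune_terms(terms, tf)
```

In this toy input "a" occurs in every document and "c" in only one, so only "b" survives. Pruning this way also reduces the dimension m of every document vector.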

Work done

Tried clustering algorithms on small datasets in MATLAB. Trying to figure out a way to make them run faster for a sparse matrix.

Integrating the five similarity measures with the clustering algorithm; the default is Euclidean distance.






Work Remaining

Optimization of the code for better running times.

Cluster analysis using the purity and entropy measures.

Quantified results for the effectiveness of the different similarity measures used.

If time permits, we would like to improve upon the document representation. Stemming and some semantic knowledge look promising for improving cluster coherence.

References

[1] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.

[2] J. M. Neuhaus and J. D. Kalbfleisch. Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics, 54(2):638–645, Jun. 1998.

[3] Anna Huang. Similarity Measures for text document clustering.

[4] B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.

Datasets:

[1] CLUTO package : http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download

[2] 20 news : http://qwone.com/~jason/20Newsgroups/




