text clustering hongning wang cs@uva. today’s lecture clustering of text documents – problem...

31
Text Clustering Hongning Wang CS@UVa

Upload: louisa-marshall

Post on 19-Jan-2016

227 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

Text Clustering

Hongning WangCS@UVa

Page 2: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 2

Today’s lecture

• Clustering of text documents– Problem overview• Applications

– Distance metrics– Two basic categories of clustering algorithms– Evaluation metrics

CS@UVa

Page 3: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 3

Clustering v.s. Classification

• Assigning documents to its corresponding categories

CS@UVa

How to label it?

Page 4: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 4

Clustering problem in general

• Discover “natural structure” of data– What is the criterion? – How to identify them?– How many clusters?

CS@UVa

Page 5: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 5

Clustering problem in general

• Clustering - the process of grouping a set of objects into clusters of similar objects– Basic criteria• high intra-class similarity• low inter-class similarity

– No (little) supervision signal about the underlying clustering structure

– Need similarity/distance as guidance to form clusters

CS@UVa

Page 6: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 6

What is the “natural grouping”?

CS@UVa

Clustering is very subjective!Distance metric is important!

group by gender group by source of ability group by costume

Page 7: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS6501: Text Mining 7

Clustering in text mining

CS@UVa

Access Mining

Organization

Filterinformation

Discover knowledge

Add Structure/Annotations

Serve for IR applications

Based on NLP/ML techniques

Sub-area of DM research

Text clustering

Page 8: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 8

Applications of text clustering

• Organize document collections– Automatically identify

hierarchical/topical relation among documents

CS@UVa

Page 9: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 9

Applications of text clustering

• Grouping search results– Organize documents by

topics– Facilitate user browsing

CS@UVa

http://search.carrot2.org/stable/search

Page 10: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 10

Applications of text clustering

• Topic modeling– Grouping words into topics

CS@UVa

Will be discussed later separately

Page 11: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 11

Distance metric

• Basic properties– Positive separation

• i.f.f.,

– Symmetry

– Triangle inequality

CS@UVa

Page 12: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 12

Typical distance metric

• Minkowski metric

• When , it is Euclidean distance

• Cosine metric

• when ,

CS@UVa

Page 13: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 13

Typical distance metric

• Edit distance– Count the minimum number of operations

required to transform one string into the other• Possible operations: insertion, deletion and

replacement

CS@UVa

Can be efficiently solved by dynamic programming

Page 14: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 14

Typical distance metric

• Edit distance– Count the minimum number of operations

required to transform one string into the other• Possible operations: insertion, deletion and

replacement

– Extent to distance between sentences• Word similarity as cost of replacement

– “terrible” -> “bad”: low cost– “terrible” -> “terrific”: high cost

• Preserving word order in distance computation

CS@UVa

Lexicon or distributional semantics

Page 15: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 15

Clustering algorithms

• Partitional clustering algorithms– Partition the instances into different groups– Flat structure• Need to specify the number of classes in advance

CS@UVa

Page 16: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 16

Clustering algorithms

• Typical partitional clustering algorithms– k-means clustering• Partition data by its closest mean

CS@UVa

Page 17: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 17

Clustering algorithms

• Typical partitional clustering algorithms– k-means clustering• Partition data by its closest mean

– Gaussian Mixture Model• Consider variance within the cluster as well

CS@UVa

Page 18: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 18

Clustering algorithms

• Hierarchical clustering algorithms– Create a hierarchical decomposition of objects– Rich internal structure• No need to specify the number of clusters• Can be used to organize objects

CS@UVa

Page 19: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 19

Clustering algorithms

• Typical hierarchical clustering algorithms– Bottom-up agglomerative clustering• Start with individual objects as separated clusters• Repeatedly merge closest pair of clusters

CS@UVa

Most typical usage: gene sequence analysis

Page 20: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 20

Clustering algorithms

• Typical hierarchical clustering algorithms– Top-down divisive clustering• Start with all data as one cluster• Repeatedly splitting the remaining clusters into two

CS@UVa

Page 21: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 21

Desirable properties of clustering algorithms

• Scalability– Both in time and space

• Ability to deal with various types of data– No/less assumption about input data– Minimal requirement about domain knowledge

• Interpretability and usability

CS@UVa

Page 22: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 22

Cluster validation

• Criteria to determine whether the clusters are meaningful– Internal validation• Stability and coherence

– External validation• Match with known categories

CS@UVa

Page 23: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 23

Internal validation

• Coherence– Inter-cluster similarity v.s. intra-cluster similarity– Davies–Bouldin index

– where is total number of clusters, is average distance of all elements in cluster , is the distance between cluster centroid and .

CS@UVa

We prefer smaller DB-index!

Evaluate every pair of clusters

Page 24: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 24

Internal validation

• Coherence– Inter-cluster similarity v.s. intra-cluster similarity– Dunn index

– Worst situation analysis

• Limitation– No indication of actual application’s performance– Bias towards a specific type of clustering algorithm

if that algorithm is designed to optimize similar metric

CS@UVa

We prefer larger D-index!

Page 25: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 25

External validation

• Given class label on each instance– Purity: correctly clustered documents in each

cluster

– where is a set of documents in cluster , and is a set of documents in cluster

CS@UVa

𝑝𝑢𝑟𝑖𝑡𝑦 (Ω,𝐶 )=¿117

(5+4+3 )

Not a good metric if we assign each document into a single cluster

Required, might need extra cost

Page 26: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 26

External validation

• Given class label on each instance– Normalized mutual information (NMI)

– where , and

• Indicate the increase of knowledge about classes when we know the clustering results

CS@UVa

Normalization by entropy will penalize too many clusters

Page 27: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 27

External validation

• Given class label on each instance– Rand index• Idea: we want to assign two documents to the same

cluster if and only if they are from the same class

CS@UVa

TP FP

FN TN Over every pair of documents in the collection

Essentially it is like classification accuracy

Page 28: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 28

External validation

• Given class label on each instance– Rand index

CS@UVa

20 20

24 72

𝑇𝑃=(52)+(42)+(32)+(22)=20𝑇𝑃+𝐹𝑃=(62)+(62)+(52)=40

Page 29: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 29

External validation

• Given class label on each instance– Precision/Recall/F-measure• Based on the contingency table, we can also define

precision/recall/F-measure of clustering quality

CS@UVa

TP FP

FN TN

Page 30: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 30

What you should know

• Unsupervised natural of clustering problem– Distance metric is essential to determine the

clustering results• Two basic categories of clustering algorithms– Partitional clustering– Hierarchical clustering

• Clustering evaluation– Internal v.s. external

CS@UVa

Page 31: Text Clustering Hongning Wang CS@UVa. Today’s lecture Clustering of text documents – Problem overview Applications – Distance metrics – Two basic categories

CS 6501: Text Mining 31

Today’s reading

• Introduction to Information Retrieval– Chapter 16: Flat clustering• 16.2 Problem statement• 16.3 Evaluation of clustering

CS@UVa