Information Retrieval / Search Engine Technology (8)

Prof. Dragomir R. Radev

http://tangra.si.umich.edu/clair/ir09

SET/IR – W/S 2009

…13. Clustering…

Clustering

• Exclusive/overlapping clusters
• Hierarchical/flat clusters
• The cluster hypothesis
  – Documents in the same cluster are relevant to the same query
  – How do we use it in practice?

Representations for document clustering

• Typically: vector-based
  – Words: “cat”, “dog”, etc.
  – Features: document length, author name, etc.

• Each document is represented as a vector in an n-dimensional space

• Similar documents appear nearby in the vector space (distance measures are needed)
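As a minimal sketch of the vector representation, the snippet below builds bag-of-words vectors and compares them with cosine similarity. The raw term counts, the whitespace tokenizer, and the choice of cosine as the measure are assumptions made for illustration; the slides only say that some distance measure is needed.

```python
import math
from collections import Counter

def to_vector(text):
    """Bag-of-words vector: term -> raw count (an assumed weighting)."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical documents: d1 and d2 share all their words, d3 shares none.
d1 = to_vector("the cat chased the dog")
d2 = to_vector("the dog chased the cat")
d3 = to_vector("stock markets fell sharply")

print(cosine_similarity(d1, d2))  # close to 1: same bag of words
print(cosine_similarity(d1, d3))  # 0.0: no terms in common
```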

Scatter-gather

• Introduced by Cutting, Karger, and Pedersen

• Iterative process
  – Show terms for each cluster
  – User picks some of them
  – System produces new clusters

• Example:
  – http://www.ischool.berkeley.edu/~hearst/images/sg-example1.html

k-means

• Iteratively determine which cluster a point belongs to, then adjust the cluster centroid, then repeat

• Needed: small number k of desired clusters

• Hard decisions
• Example: Weka

k-means

initialize cluster centroids to arbitrary vectors
while further improvement is possible do
    for each document d do
        find the cluster c whose centroid is closest to d
        assign d to cluster c
    end for
    for each cluster c do
        recompute the centroid of cluster c based on its documents
    end for
end while

K-means (cont’d)

• In practice (to avoid suboptimal clusters), run hierarchical agglomerative clustering on a sample of size sqrt(N) and then use the resulting clusters as seeds for k-means.

Example

• Cluster the following vectors into two groups:
  – A = <1,6>
  – B = <2,2>
  – C = <4,0>
  – D = <3,3>
  – E = <2,5>
  – F = <2,1>
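One way to work through this example is to code the k-means loop directly and run it on the six vectors. The sketch below assumes squared Euclidean distance and picks A and C as the initial centroids; both choices are assumptions made here for reproducibility, since the slides leave initialization arbitrary, and other seeds may give other clusters.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centroids, max_iterations=100):
    """Alternate assignment and centroid updates until assignments stop changing."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its closest centroid (hard decision).
        new_assignment = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no further improvement is possible
        assignment = new_assignment
        # Update step: recompute each centroid from its current members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids

vectors = {"A": (1, 6), "B": (2, 2), "C": (4, 0), "D": (3, 3), "E": (2, 5), "F": (2, 1)}
# Assumed seeds: A and C.
assignment, centroids = kmeans(list(vectors.values()), [vectors["A"], vectors["C"]])
for name, cluster in zip(vectors, assignment):
    print(name, cluster)
```

With these seeds the loop stabilizes after one update, grouping A and E together and B, C, D, F together.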

Weka

• A general environment for machine learning (e.g. for classification and clustering)

• Book by Witten and Frank
• www.cs.waikato.ac.nz/ml/weka
• cd /data2/tools/weka-3-4-7
• export CLASSPATH=$CLASSPATH:./weka.jar
• java weka.clusterers.SimpleKMeans -t ~/e.arff
• java weka.clusterers.SimpleKMeans -p 1-2 -t ~/e.arff

Demos

• http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
• http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means
• http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster
• http://www.cc.gatech.edu/~dellaert/html/software.html
• http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf
• http://www.ece.neu.edu/groups/rpl/projects/kmeans/

Probability and likelihood

$$L(\theta) = p(X \mid \theta) = \prod_i p(x_i \mid \theta)$$

Example:

What is θ in this case?
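As one illustration of the likelihood as a product over independent observations, the sketch below evaluates it for coin flips. The Bernoulli model, the four hypothetical flips, and the grid search over the parameter are all assumptions made here; they are not the example from the original slide.

```python
def likelihood(theta, observations):
    """Product over i of p(x_i | theta) for Bernoulli observations x_i in {0, 1}."""
    result = 1.0
    for x in observations:
        result *= theta if x == 1 else (1.0 - theta)
    return result

flips = [1, 1, 0, 1]  # hypothetical data: three heads, one tail

# Coarse grid search for the parameter value that maximizes the likelihood.
candidates = [i / 100 for i in range(1, 100)]
best_theta = max(candidates, key=lambda t: likelihood(t, flips))

print(likelihood(0.5, flips))  # 0.0625
print(best_theta)              # 0.75, the observed fraction of heads
```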

Bayesian formulation

Posterior ∝ likelihood × prior
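Written out with Bayes' rule, taking X as the observed data and θ as the model parameters (symbols chosen here to match the likelihood notation), the proportionality holds because the denominator p(X) does not depend on θ:

```latex
p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} \;\propto\; p(X \mid \theta)\, p(\theta)
```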

E-M algorithms

[Dempster et al. 77]

• Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. Given a model of data generation and data with some missing values, EM alternately uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the generative parameters giving estimates for the missing values.

[McCallum & Nigam 98]

E-M algorithm

• Initialize probability model
• Repeat
  – E-step: use the best available current classifier to classify some datapoints
  – M-step: modify the classifier based on the classes produced by the E-step
• Until convergence

Soft clustering method
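The loop above can be sketched for one concrete model. The example below assumes a one-dimensional mixture of two Gaussians, a fixed number of iterations in place of a convergence test, and hypothetical data and starting values; the slides do not commit to a particular model. The E-step computes soft memberships (responsibilities) and the M-step re-estimates the parameters from them.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture; returns parameters and soft assignments."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a degenerate component
    return means, variances, weights, resp

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]  # hypothetical points around 1 and 5
means, variances, weights, resp = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print([round(m, 2) for m in means])
```

Unlike k-means, each point keeps a probability of belonging to every cluster, which is what makes this a soft clustering method.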

EM example

Figure from Chris Bishop

Demos

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/mixture.html

http://lcn.epfl.ch/tutorial/english/gaussian/html/

http://www.cs.cmu.edu/~alad/em/

http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html

http://people.csail.mit.edu/mcollins/papers/wpeII.4.ps

“Online” centroid method

Centroid method

$$\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$$

Online centroid-based clustering

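[Figure: each incoming document is compared with the existing centroids. sim ≥ T: add to the most similar cluster; sim < T: start a new cluster]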
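A minimal single-pass sketch of threshold-based online clustering follows. It assumes cosine similarity over sparse term vectors, a centroid equal to the mean of the member vectors, and a hypothetical threshold of 0.4; the document vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term -> weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def online_cluster(docs, threshold):
    """Process documents once, in order; join the best cluster or open a new one."""
    clusters = []  # each cluster: {"members": [doc indices], "sum": summed term weights}
    for i, doc in enumerate(docs):
        best, best_sim = None, -1.0
        for cluster in clusters:
            size = len(cluster["members"])
            centroid = {t: w / size for t, w in cluster["sum"].items()}
            sim = cosine(doc, centroid)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # sim >= T: add the document and update the running sum for the centroid.
            best["members"].append(i)
            for t, w in doc.items():
                best["sum"][t] = best["sum"].get(t, 0.0) + w
        else:
            # sim < T: the document starts a new cluster.
            clusters.append({"members": [i], "sum": dict(doc)})
    return clusters

docs = [  # hypothetical term vectors
    {"space": 1.0, "shuttle": 1.0},
    {"space": 1.0, "nasa": 1.0},
    {"microsoft": 1.0, "windows": 1.0},
    {"microsoft": 1.0, "justice": 1.0},
]
clusters = online_cluster(docs, threshold=0.4)
print([c["members"] for c in clusters])  # [[0, 1], [2, 3]]
```

Because each document is seen only once, the result depends on the order in which documents arrive.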

Sample centroids

C 00022 (N=44) (10000): diana 1.93, princess 1.52
C 00025 (N=19) (10000): albanians 3.00
C 00026 (N=10) (10000): universe 1.50, expansion 1.00, bang 0.90
C 10007 (N=11) (10000): crashes 1.00, safety 0.55, transportation 0.55, drivers 0.45, board 0.36, flight 0.27, buckle 0.27, pittsburgh 0.18, graduating 0.18, automobile 0.18
C 00035 (N=22) (10000): airlines 1.45, finnair 0.45
C 00031 (N=34) (10000): el 1.85, nino 1.56
C 00008 (N=113) (10000): space 1.98, shuttle 1.17, station 0.75, nasa 0.51, columbia 0.37, mission 0.33, mir 0.30, astronauts 0.14, steering 0.11, safely 0.07
C 10062 (N=161): microsoft 3.24, justice 0.93, department 0.88, windows 0.98, corp 0.61, software 0.57, ellison 0.07, hatch 0.06, netscape 0.04, metcalfe 0.02

Evaluation of clustering

• Formal definition
• Objective function
• Purity (considering the majority class in each cluster)
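Purity can be sketched as follows: count the majority class in each cluster, sum those counts, and divide by the number of objects. The 17-object labelling used here (three clusters, three classes named x, o, and d) is an assumed layout for illustration, not data given on this slide.

```python
from collections import Counter

def purity(clusters, classes):
    """Fraction of objects that belong to the majority class of their cluster."""
    by_cluster = {}
    for cluster_id, class_label in zip(clusters, classes):
        by_cluster.setdefault(cluster_id, Counter())[class_label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(clusters)

# Assumed layout: cluster 0 = 5 x + 1 o, cluster 1 = 1 x + 4 o + 1 d, cluster 2 = 2 x + 3 d.
clusters = [0] * 6 + [1] * 6 + [2] * 5
classes = ["x"] * 5 + ["o"] + ["x"] + ["o"] * 4 + ["d"] + ["x"] * 2 + ["d"] * 3

print(round(purity(clusters, classes), 2))  # (5 + 4 + 3) / 17 ≈ 0.71
```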

RAND index

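• Accuracy when preserving object-object relationships.
• RI = (TP + TN) / (TP + FP + FN + TN)
• In the example:

$$TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40$$

$$TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20$$

$$FP = 40 - 20 = 20$$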

RAND index

                     Same cluster    Different clusters
Same class           TP = 20         FN = 24
Different classes    FP = 20         TN = 72

RI = (20 + 72) / (20 + 20 + 24 + 72) ≈ 0.68
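The pair counts can be checked by enumerating all object pairs. The labelling below is an assumption: 17 objects in clusters of sizes 6, 6, and 5, with class names x, o, and d invented here and arranged so that the pair counts come out to the values in the example.

```python
from itertools import combinations

def rand_index(clusters, classes):
    """Count pair decisions and return (TP, FP, FN, TN, RI)."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1  # correctly placed together
        elif same_cluster:
            fp += 1  # placed together, but different classes
        elif same_class:
            fn += 1  # same class, but split across clusters
        else:
            tn += 1  # correctly kept apart
    return tp, fp, fn, tn, (tp + tn) / (tp + fp + fn + tn)

# Assumed layout: cluster 0 = 5 x + 1 o, cluster 1 = 1 x + 4 o + 1 d, cluster 2 = 2 x + 3 d.
clusters = [0] * 6 + [1] * 6 + [2] * 5
classes = ["x"] * 5 + ["o"] + ["x"] + ["o"] * 4 + ["d"] + ["x"] * 2 + ["d"] * 3

tp, fp, fn, tn, ri = rand_index(clusters, classes)
print(tp, fp, fn, tn, round(ri, 2))  # 20 20 24 72 0.68
```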

Hierarchical clustering methods

• Single-linkage
  – One common pair is sufficient
  – Disadvantages: long chains
• Complete-linkage
  – All pairs have to match
  – Disadvantages: too conservative
• Average-linkage
• Demo
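The three criteria differ only in how they combine the pairwise similarities between the members of two clusters. The sketch below makes that explicit; the similarity values and document names are hypothetical.

```python
def single_link(sims):
    """One sufficiently similar pair is enough: take the best pair."""
    return max(sims)

def complete_link(sims):
    """All pairs have to match: take the worst pair."""
    return min(sims)

def average_link(sims):
    """Compromise: average over all cross-cluster pairs."""
    return sum(sims) / len(sims)

def cross_similarities(cluster_a, cluster_b, sim):
    """Similarities of every pair with one member in each cluster."""
    return [sim[frozenset((a, b))] for a in cluster_a for b in cluster_b]

# Hypothetical pairwise similarities between documents of two clusters.
sim = {
    frozenset(("d1", "d3")): 0.9,
    frozenset(("d1", "d4")): 0.2,
    frozenset(("d2", "d3")): 0.4,
    frozenset(("d2", "d4")): 0.1,
}
sims = cross_similarities(["d1", "d2"], ["d3", "d4"], sim)
print(single_link(sims), complete_link(sims), average_link(sims))
```

Here single linkage rates the two clusters as very similar (0.9) on the strength of one pair, while complete linkage rates them as dissimilar (0.1).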

Non-hierarchical methods

• Also known as flat clustering
• Centroid method (online)
• K-means
• Expectation maximization

Hierarchical clustering

[Figure: eight numbered points, with 1 2 5 6 in one row and 3 4 7 8 in the other]

Single link produces straggly clusters (e.g., ((12)(56)))

Hierarchical agglomerative clustering

Dendrograms

http://odur.let.rug.nl/~kleiweg/clustering/clustering.html

/data2/tools/clustering

E.g., language similarity:

Clustering using dendrograms

REPEAT
    Compute pairwise similarities
    Identify closest pair
    Merge pair into single node
UNTIL only one node left

Q: What is the equivalent Venn diagram representation?

Example: cluster the following sentences:

A B C B A
A D C C A D E
C D E F C D A
E F G F D A
A C D A B A
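One possible way to work this exercise is to run the merge loop on the sentences in code. The sketch below assumes that the run of letters splits into five sentences, that each sentence is a bag of letter counts compared by cosine similarity, and that clusters are merged by single linkage (`linkage=max`); passing `min` would give complete linkage instead. None of these choices is prescribed by the slides.

```python
import math
from collections import Counter

sentences = ["A B C B A", "A D C C A D E", "C D E F C D A", "E F G F D A", "A C D A B A"]
vectors = [Counter(s.split()) for s in sentences]

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

def agglomerate(vectors, linkage=max):
    """Repeatedly merge the closest pair of clusters until one node is left."""
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Compute pairwise cluster similarities and identify the closest pair.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                score = linkage(sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        score, a, b = best
        # Merge the pair into a single node and record the step.
        merges.append((clusters[a], clusters[b], round(score, 3)))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

merges = agglomerate(vectors)
for left, right, score in merges:
    print(left, right, score)
```

Under these assumptions the second and third sentences merge first, then the first and fifth; the list of merges read bottom-up is the dendrogram.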

Paper reading

• Mark Newman paper “The structure and function of complex networks” (sections I, II, III, IV, VI, VII, and VIIIa)