Clustering Algorithms
• k-means
• Hierarchic Agglomerative Clustering (HAC)
• …
• BIRCH
• Association Rule Hypergraph Partitioning (ARHP)
• Categorical clustering (CACTUS, STIRR)
• …
• STC
• QDC
Hierarchical clustering
Given a set of N items to be clustered and an N×N distance (or similarity) matrix:
1. Start by assigning each item to its own cluster.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
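The four steps above can be sketched directly in Python. This is an illustrative, naive implementation over a precomputed distance matrix, using single-link (minimum) distance between clusters; the function name `hac` is my own:

```python
# Naive agglomerative clustering over an N x N distance matrix.
# Illustrative sketch: single-link (minimum) inter-cluster distance.

def hac(dist):
    """dist[i][j] = distance between items i and j. Returns merge history."""
    n = len(dist)
    # 1. Each item starts in its own cluster.
    clusters = [[i] for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # 2. Find the closest (most similar) pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # 3. Merge them; inter-cluster distances are recomputed each round.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    # 4. Stop when everything is in a single cluster of size N.
    return merges
```

The returned merge history (which clusters merged, at what distance) is exactly the information a dendrogram visualizes.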
Iwona Białynicka-Birula - Clustering Web Search Results
Agglomerative hierarchical clustering
Clustering result: dendrogram
AHC variants
• Various ways of calculating cluster similarity:
  • single-link (minimum)
  • complete-link (maximum)
  • group-average (average)
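The three linkage criteria differ only in how the pairwise distances between the two clusters' members are aggregated. A minimal illustration (function names are my own):

```python
# Three ways of scoring the distance between two clusters,
# given the list of pairwise distances between their members.

def single_link(pairs):    # minimum pairwise distance
    return min(pairs)

def complete_link(pairs):  # maximum pairwise distance
    return max(pairs)

def group_average(pairs):  # average pairwise distance
    return sum(pairs) / len(pairs)

def pairwise(dist, A, B):
    """All distances between members of cluster A and cluster B."""
    return [dist[i][j] for i in A for j in B]
```

Single-link tends to produce long "chained" clusters, complete-link compact ones, and group-average sits between the two.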
Data Clustering: K-means
Partitional clustering; initial number of clusters k.
K-means
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
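The loop above can be sketched for points in Euclidean space. This is a minimal illustration: the seeding simply takes the first k points, which is only one of several common initialization choices, and the function name `kmeans` is my own:

```python
import math

def kmeans(points, k, iters=100):
    # 1. Pick K initial centroids (here: the first k input points).
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # 2. Assign each object to the group with the closest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[j].append(p)
        # 3. Recalculate each centroid as the mean of its group
        #    (an empty group keeps its old centroid).
        new = [[sum(x) / len(g) for x in zip(*g)] if g else centroids[j]
               for j, g in enumerate(groups)]
        # 4. Stop when the centroids no longer move.
        if new == centroids:
            break
        centroids = new
    return centroids, groups
```

Because the result depends on the initial centroids, practical implementations usually run the algorithm several times with different seedings and keep the best result.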
Example by Andrew W. Moore
Iwona Białynicka-Birula - Clustering Web Search Results
K-means clustering (k=3)
Single-pass clustering (with a similarity threshold)
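Single-pass (leader) clustering assigns each item to the first existing cluster whose centroid is within a given threshold, and otherwise starts a new cluster, so the data is scanned only once. A minimal sketch under that reading, assuming Euclidean distance on point vectors (names are my own):

```python
import math

def single_pass(points, threshold):
    """Assign each point to the first cluster whose centroid is within
    `threshold`; otherwise start a new cluster with that point."""
    clusters = []  # each cluster: {"centroid": [...], "members": [...]}
    for p in points:
        placed = False
        for c in clusters:
            if math.dist(p, c["centroid"]) <= threshold:
                c["members"].append(p)
                # Update the centroid as the mean of the members.
                m = c["members"]
                c["centroid"] = [sum(x) / len(m) for x in zip(*m)]
                placed = True
                break
        if not placed:
            clusters.append({"centroid": list(p), "members": [p]})
    return clusters
```

Unlike k-means, the number of clusters is not fixed in advance; it is controlled indirectly by the threshold, and the result depends on the input order.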
Document Clustering: k-means
• k-means: distance-based flat clustering
• Advantages:
  • linear time complexity
  • works relatively well in low-dimensional space
• Drawbacks:
  • distance computation in high-dimensional space
  • the centroid vector may not summarize the cluster's documents well
  • the initial k clusters affect the quality of the result

0. Input: D := {d1, d2, …, dn}; k := the number of clusters
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d from the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute that cluster's centroid
6. Until the centroids don't change
7. Output: k clusters of documents
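For document vectors, the similarity in step 4 is typically cosine similarity on term-frequency vectors (an assumption here, since the slide does not name the measure). A minimal sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; define its similarity as 0.
    return dot / (nu * nv) if nu and nv else 0.0
```

Because cosine compares directions rather than magnitudes, it is less sensitive to document length than Euclidean distance, which partly addresses the high-dimension drawback listed above.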
Document Clustering: HAC
• Hierarchic Agglomerative Clustering (HAC): distance-based hierarchical clustering
• Advantages:
  • produces better-quality clusters
  • works relatively well in low-dimensional space
• Drawbacks:
  • distance computation in high-dimensional space
  • quadratic time complexity

0. Input: D := {d1, d2, …, dn}
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3.   Merge the two most similar clusters, K and L, to form a new cluster KL
4.   Compute the similarities between KL and each of the remaining clusters and update SIM[i,j]
5. Until there is a single cluster (or the specified number of clusters)
6. Output: dendrogram of clusters