Clustering Algorithms
• k-means
• Hierarchic Agglomerative Clustering (HAC)
• …
• BIRCH
• Association Rule Hypergraph Partitioning (ARHP)
• Categorical clustering (CACTUS, STIRR)
• …
• STC
• QDC
Hierarchical clustering
Given a set of N items to be clustered and an N×N distance (or similarity) matrix:
1. Start by assigning each item to its own cluster.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
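The four steps above can be sketched directly in Python. This is an illustrative, naive implementation over a precomputed distance matrix, using single-link (minimum) distance between clusters; the function name `hac` is my own:

```python
# Naive agglomerative clustering over an N x N distance matrix.
# Illustrative sketch: single-link (minimum) inter-cluster distance.

def hac(dist):
    """dist[i][j] = distance between items i and j. Returns merge history."""
    n = len(dist)
    # 1. Each item starts in its own cluster.
    clusters = [[i] for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # 2. Find the closest (most similar) pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # 3. Merge them; inter-cluster distances are recomputed each round.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    # 4. Stop when everything is in a single cluster of size N.
    return merges
```

The returned merge history (which clusters merged, at what distance) is exactly the information a dendrogram visualizes.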
Iwona Białynicka-Birula - Clustering Web Search Results
Agglomerative hierarchical clustering
Clustering result: dendrogram
AHC variants
• Various ways of calculating cluster similarity:
  • single-link (minimum)
  • complete-link (maximum)
  • group-average (average)
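The three linkage criteria differ only in how the pairwise distances between the two clusters' members are aggregated. A minimal illustration (function names are my own):

```python
# Three ways of scoring the distance between two clusters,
# given the list of pairwise distances between their members.

def single_link(pairs):    # minimum pairwise distance
    return min(pairs)

def complete_link(pairs):  # maximum pairwise distance
    return max(pairs)

def group_average(pairs):  # average pairwise distance
    return sum(pairs) / len(pairs)

def pairwise(dist, A, B):
    """All distances between members of cluster A and cluster B."""
    return [dist[i][j] for i in A for j in B]
```

Single-link tends to produce long "chained" clusters, complete-link compact ones, and group-average sits between the two.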
Data Clustering: K-means
Partitional clustering; initial number of clusters k.
K-means
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
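The loop above can be sketched for points in Euclidean space. This is a minimal illustration: the seeding simply takes the first k points, which is only one of several common initialization choices, and the function name `kmeans` is my own:

```python
import math

def kmeans(points, k, iters=100):
    # 1. Pick K initial centroids (here: the first k input points).
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # 2. Assign each object to the group with the closest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[j].append(p)
        # 3. Recalculate each centroid as the mean of its group
        #    (an empty group keeps its old centroid).
        new = [[sum(x) / len(g) for x in zip(*g)] if g else centroids[j]
               for j, g in enumerate(groups)]
        # 4. Stop when the centroids no longer move.
        if new == centroids:
            break
        centroids = new
    return centroids, groups
```

Because the result depends on the initial centroids, practical implementations usually run the algorithm several times with different seedings and keep the best result.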
Example by Andrew W. Moore
Iwona Białynicka-Birula - Clustering Web Search Results
K-means clustering (k=3)
Single-pass clustering (with a similarity threshold)
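Single-pass (leader) clustering assigns each item to the first existing cluster whose centroid is within a given threshold, and otherwise starts a new cluster, so the data is scanned only once. A minimal sketch under that reading, assuming Euclidean distance on point vectors (names are my own):

```python
import math

def single_pass(points, threshold):
    """Assign each point to the first cluster whose centroid is within
    `threshold`; otherwise start a new cluster with that point."""
    clusters = []  # each cluster: {"centroid": [...], "members": [...]}
    for p in points:
        placed = False
        for c in clusters:
            if math.dist(p, c["centroid"]) <= threshold:
                c["members"].append(p)
                # Update the centroid as the mean of the members.
                m = c["members"]
                c["centroid"] = [sum(x) / len(m) for x in zip(*m)]
                placed = True
                break
        if not placed:
            clusters.append({"centroid": list(p), "members": [p]})
    return clusters
```

Unlike k-means, the number of clusters is not fixed in advance; it is controlled indirectly by the threshold, and the result depends on the input order.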
Document Clustering: k-means
• k-means: distance-based flat clustering
• Advantages:
  • linear time complexity
  • works relatively well in low-dimensional space
• Drawbacks:
  • distance computation in high-dimensional space
  • the centroid vector may not summarize the cluster's documents well
  • the initial k clusters affect the quality of the result

0. Input: D := {d1, d2, …, dn}; k := the number of clusters
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d from the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute that cluster's centroid
6. Until the centroids don't change
7. Output: k clusters of documents
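For document vectors, the similarity in step 4 is typically cosine similarity on term-frequency vectors (an assumption here, since the slide does not name the measure). A minimal sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; define its similarity as 0.
    return dot / (nu * nv) if nu and nv else 0.0
```

Because cosine compares directions rather than magnitudes, it is less sensitive to document length than Euclidean distance, which partly addresses the high-dimension drawback listed above.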
Document Clustering: HAC
• Hierarchic Agglomerative Clustering (HAC): distance-based hierarchical clustering
• Advantages:
  • produces better-quality clusters
  • works relatively well in low-dimensional space
• Drawbacks:
  • distance computation in high-dimensional space
  • quadratic time complexity

0. Input: D := {d1, d2, …, dn}
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3.   Merge the two most similar clusters, K and L, to form a new cluster KL
4.   Compute the similarities between KL and each of the remaining clusters and update SIM[i,j]
5. Until there is a single cluster (or the specified number of clusters)
6. Output: dendrogram of clusters