CS583, Bing Liu, UIC 1
Clustering
Clustering is a technique for finding similarity groups in data, called clusters. That is, it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.
Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning.
For historical reasons, clustering is often considered synonymous with unsupervised learning. In fact, association rule mining is also unsupervised.
This chapter focuses on clustering.
Aspects of clustering
A clustering algorithm
  Partitional clustering
  Hierarchical clustering
  …
A distance (similarity, or dissimilarity) function
Clustering quality
  Inter-cluster distance maximized
  Intra-cluster distance minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.
K-means clustering
K-means is a partitional clustering algorithm.
Let the set of data points (or instances) be $D = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, where $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ir})$ is a vector in a real-valued space $X \subseteq \mathbb{R}^r$, and $r$ is the number of attributes (dimensions) in the data.
The k-means algorithm partitions the given data into k clusters. Each cluster has a cluster center, called the centroid. k is specified by the user.
K-means algorithm
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
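The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes squared Euclidean distance, uses "no re-assignments" as the convergence criterion, and the names `kmeans`, `squared_distance`, `mean_point`, `max_iter` and `seed` are all hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(data, k, max_iter=100, seed=0):
    """Return (centroids, labels) for the points in `data`."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in data
        ]
        # Step 4: stop when no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return centroids, labels


# Two well-separated groups of three points each (made-up data).
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(labels)
```

Which group receives label 0 depends on the random seeds, so only the grouping, not the label values, is meaningful.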
Stopping/convergence criterion
1. no (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE),

$$SSE = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} \operatorname{dist}(\mathbf{x}, \mathbf{m}_j)^2 \qquad (1)$$

where $C_j$ is the jth cluster, $\mathbf{m}_j$ is the centroid of cluster $C_j$ (the mean vector of all the data points in $C_j$), and $\operatorname{dist}(\mathbf{x}, \mathbf{m}_j)$ is the distance between data point $\mathbf{x}$ and centroid $\mathbf{m}_j$.
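Equation (1) can be evaluated directly once the clusters are known. The sketch below assumes Euclidean distance for dist and computes each centroid as the mean of its cluster. The function name `sse` and the sample data are illustrative only.

```python
def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean.

    `clusters` is a list of clusters; each cluster is a non-empty list of
    equal-length numeric tuples.
    """
    total = 0.0
    for points in clusters:
        n = len(points)
        # m_j: the mean vector of all data points in cluster C_j
        centroid = [sum(coords) / n for coords in zip(*points)]
        for x in points:
            total += sum((xi - mi) ** 2 for xi, mi in zip(x, centroid))
    return total


clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
print(sse(clusters))  # 4.0: each point is at distance 1 from its centroid
```

To use criterion 3, a k-means loop would compare this value between consecutive iterations and stop once the decrease falls below a chosen tolerance.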
[Figure: An example of k-means clustering]
An example distance function
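A common distance function for k-means in Euclidean space, assumed here purely as an illustration, is the Euclidean distance between a data point and a centroid. The function name `euclidean_distance` is hypothetical.

```python
import math


def euclidean_distance(x, m):
    """Euclidean distance between data point x and centroid m."""
    if len(x) != len(m):
        raise ValueError("x and m must have the same number of attributes")
    return math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, m)))


print(euclidean_distance((1.0, 2.0), (4.0, 6.0)))  # 5.0
```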
Strengths of k-means
Simple: easy to understand and to implement.
Efficient: the time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
Since both k and t are usually small compared with n, k-means is considered a linear algorithm.
K-means is the most popular clustering algorithm.
Note that it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.
Weaknesses of k-means
The algorithm is only applicable if the mean is defined. For categorical data, use k-modes, where the centroid is represented by the most frequent values.
The user needs to specify k.
The algorithm is sensitive to outliers.
  Outliers are data points that are very far away from other data points.
  Outliers could be errors in the data recording or some special data points with very different values.
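The sensitivity to outliers follows from the centroid being a mean. The small sketch below uses made-up data and a hypothetical helper `mean_point` to show how a single distant point drags the centroid away from the rest of the cluster.

```python
def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (50.0, 50.0)

print(mean_point(cluster))              # (1.5, 1.5)
print(mean_point(cluster + [outlier]))  # (11.2, 11.2)
```

With the outlier included, the centroid lies far from every one of the four original points.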
[Figure: Weaknesses of k-means: Problems with outliers]
Weaknesses of k-means: To deal with outliers
One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points. To be safe, we may want to monitor these possible outliers over a few iterations and then decide whether to remove them.
Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small. The rest of the data points are then assigned to the clusters by distance or similarity comparison, or by classification.
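The first method can be sketched as a filter that flags points unusually far from their centroid. The source does not prescribe a threshold, so the median-based cutoff, the default `factor`, and the name `flag_outliers` are all assumptions made for illustration.

```python
import math
import statistics


def flag_outliers(points, centroid, factor=3.0):
    """Return the points lying more than `factor` times the median
    point-to-centroid distance away from `centroid`.

    The median-based rule and the default factor are illustrative
    assumptions, not a threshold taken from the source.
    """
    distances = [math.dist(x, centroid) for x in points]
    cutoff = factor * statistics.median(distances)
    return [x for x, d in zip(points, distances) if d > cutoff]


points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (50.0, 50.0)]
centroid = (11.2, 11.2)  # mean of the five points above
print(flag_outliers(points, centroid))  # [(50.0, 50.0)]
```

This is a single pass. Following the advice above, one would run such a check over a few iterations and remove only the points that are flagged repeatedly.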
Weaknesses of k-means (cont …)
The algorithm is sensitive to initial seeds.
Weaknesses of k-means (cont …)
If we use different seeds, we can get good results.
There are some methods to help choose good seeds.
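One simple way to reduce the dependence on seeds, which is not necessarily a method the slides have in mind, is to run k-means several times from different random seeds and keep the run with the lowest SSE. The sketch below assumes squared Euclidean distance. The names `kmeans_once`, `best_of_restarts` and `restarts`, and the sample data, are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_once(data, k, rng, max_iter=100):
    """One k-means run from random seeds; returns (sse, centroids, labels)."""
    centroids = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(x, centroids[j]))
                      for x in data]
        if new_labels == labels:  # no re-assignments: converged
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    sse = sum(sq_dist(x, centroids[lab]) for x, lab in zip(data, labels))
    return sse, centroids, labels


def best_of_restarts(data, k, restarts=10, seed=0):
    """Run k-means `restarts` times and return the run with the lowest SSE."""
    rng = random.Random(seed)
    return min((kmeans_once(data, k, rng) for _ in range(restarts)),
               key=lambda run: run[0])


# Three small groups of points (made-up data).
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
        (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
best_sse, best_centroids, best_labels = best_of_restarts(data, k=3)
print(best_sse, best_labels)
```

Restarts do not guarantee the global optimum. They only make a poor local optimum less likely to be the final answer.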
Weaknesses of k-means (cont …)
The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
K-means summary
Despite its weaknesses, k-means is still the most popular algorithm because of its simplicity and efficiency, and because other clustering algorithms have their own lists of weaknesses.
There is no clear evidence that any other clustering algorithm performs better in general, although some may be more suitable for specific types of data or applications.
Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!