Fakultät für Elektrotechnik und Informatik
Institut für Verteilte Systeme
AG Intelligente Systeme - Data Mining group
Data Mining I
Summer semester 2019
Lecture 11: Clustering – 2: Hierarchical clustering
Lectures: Prof. Dr. Eirini Ntoutsi
TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar
Clustering topics covered in DM1
1. Partitioning-based clustering
kMeans, kMedoids
2. Density-based clustering
DBSCAN
3. Grid-based clustering
4. Hierarchical clustering
Diana, Agnes, BIRCH, ROCK, CHAMELEON
5. Clustering evaluation
Data Mining I @SS19: Clustering 3
Outline
Hierarchical clustering
Bisecting k-Means
An overview of clustering
Homework/tutorial
Things you should know from this lecture
Hierarchical-based clustering
Produces a set of nested clusters organized as a hierarchical tree
Can also be visualized as a dendrogram
A tree-like diagram that records the sequence of merges or splits and the cluster memberships
The height at which two clusters are merged in the dendrogram reflects their distance
An instance can belong to multiple clusters, one at each level of the hierarchy
Within a single level, though, the assignment is still hard
[Figure: six points shown as nested clusters and as a dendrogram; dendrogram leaf order 1, 3, 2, 5, 4, 6; vertical axis shows distance from 0 to 0.2.]
Strengths of Hierarchical Clustering
No need to assume any particular number of clusters in advance
A clustering can be obtained by ‘cutting’ the dendrogram at the proper level
Cutting based on distance (e.g., I want clusters merged at distance ≤ 0.1)
Cutting based on the number of clusters (e.g., I want 2 clusters)
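The two ways of cutting can be sketched in a few lines of Python. This is only an illustration: the merge encoding (a list of `(a, b, height)` tuples, with new cluster ids numbered after the points) and the merge heights are assumptions made for this sketch, not values taken from the slides. Points are numbered from 0 here.

```python
def cut_dendrogram(merges, n_points, max_distance=None, n_clusters=None):
    """Turn a dendrogram into a flat clustering.

    merges is a list of (a, b, height) tuples sorted by non-decreasing height.
    Ids 0..n_points-1 are the original points; the k-th merge creates the
    new cluster id n_points + k (a hypothetical encoding for this sketch).
    """
    members = {i: [i] for i in range(n_points)}
    for k, (a, b, height) in enumerate(merges):
        # Stop at the cut: either the next merge is too high ...
        if max_distance is not None and height > max_distance:
            break
        # ... or the requested number of clusters has been reached.
        if n_clusters is not None and len(members) <= n_clusters:
            break
        members[n_points + k] = members.pop(a) + members.pop(b)
    return sorted(sorted(m) for m in members.values())


# Hypothetical merge heights for six points (not the values from the slides).
merges = [(2, 5, 0.04), (1, 4, 0.07), (6, 7, 0.10), (8, 3, 0.14), (9, 0, 0.20)]

by_distance = cut_dendrogram(merges, 6, max_distance=0.1)
by_count = cut_dendrogram(merges, 6, n_clusters=2)
print(by_distance)  # [[0], [1, 2, 4, 5], [3]]
print(by_count)     # [[0], [1, 2, 3, 4, 5]]
```

Cutting by distance relies on the merge heights being non-decreasing, which holds for single link, complete link and group average.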
[Figure: dendrogram with leaf order 3, 6, 4, 1, 2, 5; vertical axis from 0 to 0.25.]
Applications of hierarchical clustering 1/3
The dendrogram of clusters may correspond to meaningful taxonomies
Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Source: http://currents.plos.org/treeoflife/article/the-tree-of-life-and-a-new-classification-of-bony-fishes/
Applications of hierarchical clustering 2/3
The dendrogram of clusters may correspond to meaningful taxonomies
Dendrogram showing hierarchical clustering of tissue gene expression data with colours denoting tissues.
Source: http://genomicsclass.github.io/book/pages/clustering_and_heatmaps.html
Applications of hierarchical clustering 3/3
The dendrogram of clusters may correspond to meaningful taxonomies
USArrests dataset: statistics in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973.
Source: https://uc-r.github.io/hc_clustering
Hierarchical vs Partitioning
[Figure: four points p1 to p4 shown as a partitioning clustering, as nested clusters, and as a dendrogram.]
Hierarchical clustering algorithms typically have local objectives
Partitioning algorithms typically have global objectives, e.g., k-Means
Hierarchical clustering methods
Two main types of hierarchical clustering
Agglomerative or AGNES (Agglomerative Nesting):
Bottom-up approach
Start with the points as individual clusters
At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
Divisive or DIANA (Divisive analysis):
Top-down approach
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a single point (or there are k clusters)
Merge or split one cluster at a time
[Figure: two dendrograms with leaf order 1, 3, 2, 5, 4, 6; vertical axis from 0 to 0.2.]
Hierarchical clustering methods
Traditional hierarchical algorithms use a similarity or distance matrix to decide which cluster to split or merge next
The distance/similarity function employed depends on the application
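As a minimal sketch of what such a proximity matrix looks like, the snippet below builds the full matrix of pairwise distances. Euclidean distance and the three sample points are assumptions made for illustration; any other distance function could be substituted.

```python
import math


def proximity_matrix(points):
    """Return the full symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            # The matrix is symmetric, so each distance is stored twice.
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical 2D points, chosen only for illustration.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = proximity_matrix(points)
for row in D:
    print(row)
```

Only the upper (or lower) triangle carries information, which is why implementations often store N(N-1)/2 values instead of the full N × N matrix.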
[Figure: points p1, p2, p3, …, p12 and their proximity matrix.]
Agglomerative clustering algorithm
Most popular hierarchical clustering technique
Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.     Merge the two closest clusters
5.     Update the proximity matrix
6. Until only a single cluster remains
Key operation: the computation of the proximity of two clusters
Different approaches (single link, complete link, …) lead to different algorithms
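The basic algorithm can be sketched as follows. This is a deliberately naive version, written for readability: instead of storing and updating a proximity matrix, it recomputes cluster distances at every step, and the cluster distance is passed in as a function (single link is used as the default here purely as an assumption). The 1-dimensional data set is hypothetical.

```python
import math


def single_link(ci, cj, dist):
    """Distance between the two closest objects of ci and cj."""
    return min(dist(x, y) for x in ci for y in cj)


def agglomerative(points, cluster_distance=single_link, dist=math.dist):
    """Return the merge history as (cluster_a, cluster_b, height) tuples."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    history = []
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        # Replace the two clusters by their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


# Hypothetical 1-dimensional data, written as 1-tuples.
points = [(1.0,), (2.0,), (4.0,), (7.0,), (11.0,)]
history = agglomerative(points)
heights = [h for _, _, h in history]
print(heights)  # [1.0, 2.0, 3.0, 4.0]
```

The merge history is exactly the information a dendrogram displays: which clusters were merged, and at which height.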
Starting situation
Start with clusters of individual points and a proximity matrix
[Figure: individual points p1, p2, p3, …, p12 and their proximity matrix.]
Intermediate situation I
After some merging steps, we have some clusters
[Figure: clusters C1 to C5 and the proximity matrix between them.]
Intermediate situation II
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1 to C5 and the proximity matrix between them.]
Merging
Two major questions for merging
How do we identify the closest pair of clusters to be merged?
How do we update the proximity matrix?
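For single link, complete link and group average, the new row of the proximity matrix can be derived from the two rows being merged, without going back to the individual points. The sketch below illustrates this; the table layout (a dict keyed by pairs of cluster labels), the function name, and the distances are assumptions made for the example.

```python
def update_proximity(prox, sizes, a, b, linkage="single"):
    """Return a new proximity table after merging clusters a and b.

    prox maps frozenset({u, v}) to the distance between clusters u and v;
    sizes maps each cluster label to its number of points.
    """
    merged = f"{a}U{b}"
    others = [c for c in sizes if c not in (a, b)]
    # Keep all entries that do not involve a or b.
    new_prox = {pair: d for pair, d in prox.items()
                if a not in pair and b not in pair}
    for c in others:
        d_a = prox[frozenset((a, c))]
        d_b = prox[frozenset((b, c))]
        if linkage == "single":
            d = min(d_a, d_b)
        elif linkage == "complete":
            d = max(d_a, d_b)
        elif linkage == "average":
            # Size-weighted mean of the two old distances.
            d = (sizes[a] * d_a + sizes[b] * d_b) / (sizes[a] + sizes[b])
        else:
            raise ValueError(f"unknown linkage: {linkage}")
        new_prox[frozenset((merged, c))] = d
    new_sizes = {c: sizes[c] for c in others}
    new_sizes[merged] = sizes[a] + sizes[b]
    return new_prox, new_sizes


# Hypothetical distances and cluster sizes, for illustration only.
sizes = {"C1": 2, "C2": 3, "C5": 1}
prox = {
    frozenset(("C1", "C2")): 4.0,
    frozenset(("C1", "C5")): 6.0,
    frozenset(("C2", "C5")): 1.0,
}
for linkage in ("single", "complete", "average"):
    new_prox, new_sizes = update_proximity(prox, sizes, "C2", "C5", linkage)
    print(linkage, new_prox[frozenset(("C2UC5", "C1"))])
```

With these numbers, the distance from the merged cluster to C1 is 4.0 under single link, 6.0 under complete link and 4.5 under group average.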
[Figure: clusters C1, C3, C4 and the merged cluster C2 U C5; the proximity-matrix entries for C2 U C5 are marked with “?”.]
Distance between clusters
Each cluster is a set of points
How do we compare two sets of points/clusters?
A variety of different methods
Single link (or MIN)
Complete link (or MAX)
Group average
Distance between centroids
Distance between medoids
Other methods driven by an objective function
Ward’s Method uses squared error
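To make the difference between the first four methods concrete, the sketch below computes each of them for the same pair of clusters. The function names, the use of Euclidean distance, and the two small clusters are assumptions made for this example.

```python
import math


def centroid(points):
    """Component-wise mean of a list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def single_link(ci, cj):
    return min(math.dist(x, y) for x in ci for y in cj)


def complete_link(ci, cj):
    return max(math.dist(x, y) for x in ci for y in cj)


def group_average(ci, cj):
    total = sum(math.dist(x, y) for x in ci for y in cj)
    return total / (len(ci) * len(cj))


def centroid_distance(ci, cj):
    return math.dist(centroid(ci), centroid(cj))


# Two hypothetical clusters of 2D points.
ci = [(0.0, 0.0), (1.0, 0.0)]
cj = [(3.0, 0.0), (3.0, 4.0)]

for f in (single_link, complete_link, group_average, centroid_distance):
    print(f.__name__, round(f(ci, cj), 3))
```

For these clusters the four pairwise distances are 3, 5, 2 and √20, so single link gives 2, complete link gives 5, and group average and centroid distance fall in between.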
Distance between clusters: Single link distance or MIN
Single link (or MIN) distance between Ci and Cj is the minimum distance between any object in Ci and any object in Cj, i.e.,
$$dis_{sl}(C_i, C_j) = \min_{x, y} \{ d(x, y) \mid x \in C_i, y \in C_j \}$$
i.e., the distance is defined by the two closest objects (shortest edge)
Distance between clusters: Complete link or MAX
Complete link (or MAX) distance between Ci and Cj is the maximum distance between any object in Ci and any object in Cj, i.e.,
$$dis_{cl}(C_i, C_j) = \max_{x, y} \{ d(x, y) \mid x \in C_i, y \in C_j \}$$
i.e., the distance is defined by the two most dissimilar objects (longest edge)
Distance between clusters: Group average
Group average distance between Ci and Cj is the average distance between any object in Ci and any object in Cj, i.e.,
$$dis_{avg}(C_i, C_j) = \frac{\sum_{x \in C_i,\, y \in C_j} d(x, y)}{|C_i| \cdot |C_j|}$$
Distance between clusters: Centroid distance
Centroid distance between Ci and Cj is the distance between the centroid ci of Ci and the centroid cj of Cj, i.e.,
$$dis_{centroids}(C_i, C_j) = d(c_i, c_j)$$
Centroid of a cluster
$$c_m = \frac{1}{n} \sum_{i=1}^{n} p_i$$
Example
[Figure: dataset of six 2D points and its Euclidean distance matrix.]
Back to the pseudocode of the agglomerative clustering algorithm
Pseudocode of the algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.     Merge the two closest clusters
5.     Update the proximity matrix
6. Until only a single cluster remains
Single link distance or MIN agglomerative clustering algorithm
Similarity of two clusters is based on the most similar (closest) pair of objects
Determined by one pair of points
[Figure: nested clusters and dendrogram obtained with single link for the six example points; dendrogram leaf order 3, 6, 2, 5, 4, 1; vertical axis from 0 to 0.2.]
$$dis_{sl}(C_i, C_j) = \min_{x, y} \{ d(x, y) \mid x \in C_i, y \in C_j \}$$
Short break (5’)
Given the following 1-dimensional dataset, build a hierarchical agglomerative clustering using single-link distance
Single link distance (MIN): strengths
Can discover clusters of arbitrary shapes
[Figure: original points and the two clusters found.]
Single link distance (MIN): limitations
[Figure: original points and the two clusters found.]
Sensitive to noise and outliers
DBSCAN can be viewed as a robust variant of single link distance
It excludes noisy points between clusters to avoid undesirable chaining effects.
Single link distance (MIN): limitations
Produces long, elongated clusters (chain-like clusters)
Complete link distance or MAX agglomerative clustering algorithm
Similarity of two clusters is based on the least similar (most distant) pair of objects
Determined by one pair of points
[Figure: nested clusters and dendrogram obtained with complete link for the six example points; dendrogram leaf order 3, 6, 4, 1, 2, 5; vertical axis from 0 to 0.4.]
$$dis_{cl}(C_i, C_j) = \max_{x, y} \{ d(x, y) \mid x \in C_i, y \in C_j \}$$
Complete link distance (MAX): strengths
[Figure: original points and the two clusters found.]
Less susceptible to noise and outliers compared to MIN
Complete link distance (MAX): limitations
Because each merge tries to keep the diameter of the resulting cluster small, it tends to create clusters of similar diameter
If some natural clusters are larger than others, it tends to break the large clusters
[Figure: original points and the two clusters found.]
Short break (5’)
Given the following 1-dimensional dataset, build a hierarchical agglomerative clustering using complete-link distance
(Group) Average-link distance agglomerative clustering algorithm
Proximity of two clusters is the average of pairwise distances between objects in the two clusters.
Determined by all pairs of points in the two clusters
[Figure: nested clusters and dendrogram obtained with group average for the six example points; dendrogram leaf order 3, 6, 4, 2, 5, 1; vertical axis from 0 to 0.25.]
$$dis_{avg}(C_i, C_j) = \frac{\sum_{x \in C_i,\, y \in C_j} d(x, y)}{|C_i| \cdot |C_j|}$$
(Group) Average-link distance: strengths and limitations
Compromise between Single and Complete Link
Strengths
Less susceptible to noise and outliers
Limitations
Biased towards spherical clusters
Centroid-link distance agglomerative clustering algorithm
The distance between two clusters is the distance between their corresponding centroids
$$dis_{centroids}(C_i, C_j) = d(c_i, c_j)$$
Difference to other measures (often considered bad): the possibility of inversions
Two clusters that are merged at step k might be more similar than the pair of clusters merged at step k-1
For the other methods, the distance between merged clusters never decreases from one step to the next
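A small numeric example makes the inversion visible. The three points below are hypothetical and were picked so that they form an almost equilateral triangle; the helper name `centroid` is likewise an assumption of this sketch.

```python
import math


def centroid(points):
    """Component-wise mean of a list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


# Three hypothetical points forming an almost equilateral triangle.
a, b, c = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# Step 1: with singleton clusters, centroid distance equals point distance,
# so the closest pair of points is merged first.
pairs = {
    ("a", "b"): math.dist(a, b),
    ("a", "c"): math.dist(a, c),
    ("b", "c"): math.dist(b, c),
}
first_pair = min(pairs, key=pairs.get)
first_height = pairs[first_pair]

# Step 2: merge {a, b} with {c}, measured between centroids.
second_height = math.dist(centroid([a, b]), c)

print(first_pair, first_height)       # ('a', 'b') 2.0
print(second_height < first_height)   # True: the second merge is lower
```

The centroid of {a, b} lies at (1, 0), only about 1.8 away from c, so the second merge happens at a smaller distance than the first one. In a dendrogram this shows up as a merge drawn below the merge it builds on.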
Ward’s method
Ward’s method or Ward's minimum variance method
Clusters are represented by centroids
The proximity between two clusters is measured in terms of the increase in SSE (sum of squared error) that results from merging the two clusters
At each step, merge the pair of clusters that leads to the minimum increase in total within-cluster variance after merging.
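The increase in SSE can be computed directly from its definition, or through a known closed form that only needs the cluster sizes and centroids. The sketch below checks that both agree on a small hypothetical example; the function names are assumptions of this sketch.

```python
import math


def centroid(points):
    """Component-wise mean of a list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def sse(points):
    """Sum of squared distances of the points to their centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points)


def ward_increase(ci, cj):
    """Increase in total SSE caused by merging ci and cj (by definition)."""
    return sse(ci + cj) - sse(ci) - sse(cj)


def ward_increase_closed_form(ci, cj):
    """Same quantity: |Ci||Cj| / (|Ci| + |Cj|) * d(ci, cj)^2."""
    ni, nj = len(ci), len(cj)
    return ni * nj / (ni + nj) * math.dist(centroid(ci), centroid(cj)) ** 2


# Two hypothetical clusters.
ci = [(0.0, 0.0), (2.0, 0.0)]
cj = [(5.0, 0.0)]

print(ward_increase(ci, cj))
print(ward_increase_closed_form(ci, cj))
```

Both calls give 32/3 ≈ 10.67 here. The closed form shows why Ward’s method is related to centroid distance: it is the squared centroid distance, weighted by the cluster sizes.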
[Figure: nested clusters obtained with Ward’s method for the six example points.]
Ward’s method (cont.)
Ward’s method is similar to k-Means: it tries to minimize the sum of squared distances of points from their cluster centroids, but greedily at each merge step rather than globally
Less susceptible to noise and outliers
Biased towards spherical clusters
Comparison of the different methods
[Figure: nested clusters obtained for the six example points with single link (MIN), complete link (MAX), group average, and Ward’s method.]
Hierarchical methods: complexity
O(N²) space to store the proximity matrix
N is the number of points.
O(N³) time in most of the cases
There are N steps, and at each step the proximity matrix, of size N², must be updated and searched
Complexity can be reduced to O(N² log N) time for some approaches using appropriate data structures
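A rough count supports the cubic bound. Assuming the naive scheme, a step with m clusters scans the m(m-1)/2 entries of the current proximity matrix to find the closest pair. Summing over all merge steps, as m shrinks from N to 2:

```latex
\sum_{m=2}^{N} \frac{m(m-1)}{2} \;=\; \frac{N^{3} - N}{6} \;=\; O(N^{3})
```

This counts only the search for the closest pair; updating the matrix after a merge adds a lower-order O(m) term per step for the linkage rules that have a simple update formula.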
Hierarchical clustering: overview
No prior knowledge of the number of clusters is required
Produces a hierarchy of clusters, not a flat clustering
A single clustering can be obtained from the dendrogram
No backtracking: Merging decisions are final
Once a decision is made to combine two clusters, it cannot be undone
Lack of a global objective function
Decisions are local, at each step
No objective function is directly minimized
Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers
Breaking large clusters
Difficulty handling clusters of different sizes and non-convex shapes
Inefficiency, especially for large datasets
Bisecting k-Means
Hybrid method, combines k-Means and hierarchical clustering
Idea: first split the set of points into two clusters, select one of these clusters for further splitting, and so on, until k clusters remain.
Pseudocode:
Which cluster to split?
The one with the largest SSE (the worst one)
Based on SSE and size
…
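The idea can be sketched in Python as follows. This is not the pseudocode from the slide: it is a minimal illustration that always splits the cluster with the largest SSE and uses a basic 2-means step with a deterministic initialisation (the first point and the point farthest from it), which is an assumption made to keep the example reproducible. Practical implementations usually try several random bisections and keep the best one. The data set is hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def sse(points):
    """Sum of squared distances of the points to their centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points)


def two_means(points, iterations=20):
    """Split points into two clusters with a basic 2-means loop."""
    # Deterministic initialisation (an assumption of this sketch).
    c1 = points[0]
    c2 = max(points, key=lambda p: math.dist(p, c1))
    for _ in range(iterations):
        left = [p for p in points if math.dist(p, c1) <= math.dist(p, c2)]
        right = [p for p in points if math.dist(p, c1) > math.dist(p, c2)]
        if not left or not right:
            break
        new_c1, new_c2 = centroid(left), centroid(right)
        if (new_c1, new_c2) == (c1, c2):
            break  # assignments are stable
        c1, c2 = new_c1, new_c2
    return left, right


def bisecting_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        left, right = two_means(worst)
        if not left or not right:
            break  # cannot split further (e.g. duplicate points)
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters


# Hypothetical data with three well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0)]
result = bisecting_kmeans(points, 3)
for cluster in result:
    print(cluster)
```

The sequence of splits forms a binary tree, which is why the method can also be read as a top-down (divisive) hierarchical clustering.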
Bisecting k-Means
An example
An overview of clustering
Intuitively, a cluster is a set of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
Cluster analysis: Find similarities between data according to the characteristics found in the data and group similar data objects into clusters
Key points in clustering
Similarity/ distance function
Learning algorithm
An unsupervised learning task
No prior information on the number of clusters or on the characteristics of these clusters
Important DM task: as a stand-alone tool or as a preprocessing step
A large number of algorithms
Partitioning methods
Hierarchical methods
Density-based methods
Model-based methods
….
Homework/tutorial
Homework
Use the ELKI data mining tool to experiment with clustering algorithms: http://elki.dbs.ifi.lmu.de/
Or Python/Weka (more limited w.r.t. clustering)
Readings:
Tan P.-N., Steinbach M., Kumar V., Introduction to Data Mining, Chapter 8.
Data Clustering: A Review, https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf
Nando de Freitas YouTube video: https://www.youtube.com/watch?v=voN8omBe2r4
Things you should know from this lecture
Hierarchical clustering basics
Agglomerative approach
Similarity measures between clusters
Bisecting kMeans