Hierarchical Clustering
Mehta Ishani
130040701003
What is Clustering in Data Mining?
Cluster: a collection of data objects that
- are "similar" to one another and thus can be treated collectively as one group,
- but, as a collection, are sufficiently different from other groups.
Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
Distance or Similarity Measures: Measuring Distance
In order to group similar items, we need a way to measure the distance between objects (e.g., records). Note: distance is the inverse of similarity. Measures are often based on the representation of objects as "feature vectors".
An Employee DB:
ID  Gender  Age  Salary
1   F       27    19,000
2   M       51    64,000
3   M       52   100,000
4   F       33    55,000
5   M       45    45,000

Term Frequencies for Documents:
      T1  T2  T3  T4  T5  T6
Doc1   0   4   0   0   0   2
Doc2   3   1   4   3   1   2
Doc3   3   0   0   0   3   0
Doc4   0   1   0   3   0   0
Doc5   2   2   2   3   1   4
Which objects are more similar?
Distance or Similarity Measures
Common Distance Measures:
For feature vectors $X = \langle x_1, x_2, \ldots, x_n \rangle$ and $Y = \langle y_1, y_2, \ldots, y_n \rangle$:

Manhattan distance:
$dist(X, Y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_n - y_n|$

Euclidean distance:
$dist(X, Y) = \sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}$

Cosine similarity:
$sim(X, Y) = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$, with $dist(X, Y) = 1 - sim(X, Y)$

Note: these can be normalized to make values fall between 0 and 1.
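As a sketch, the three measures in plain Python (the function names are ours; the optional weights parameter anticipates the weighted variant on the next slide):

```python
import math

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y, weights=None):
    """Root of the (optionally weighted) sum of squared differences."""
    w = weights if weights is not None else [1.0] * len(x)
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

def cosine_sim(x, y):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# The Doc1 and Doc2 term-frequency vectors from the table above
doc1 = [0, 4, 0, 0, 0, 2]
doc2 = [3, 1, 4, 3, 1, 2]
print(manhattan(doc1, doc2))              # 14
print(round(euclidean(doc1, doc2), 2))    # 6.63
print(round(cosine_sim(doc1, doc2), 2))   # 0.28
```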
Distance or Similarity Measures: Weighting Attributes
In some cases we want some attributes to count more than others: associate a weight $w_i$ with each attribute when calculating distance, e.g.,
$dist(X, Y) = \sqrt{w_1 (x_1 - y_1)^2 + \cdots + w_n (x_n - y_n)^2}$
Nominal (categorical) attributes: can use simple matching (distance = 0 if the values match, 1 otherwise), or convert each nominal attribute to a set of binary attributes and then use the usual distance measure. If all attributes are nominal, we can normalize by dividing the number of matches by the total number of attributes.
Normalization: we want values to fall between 0 and 1 (other variations are possible):
$x_i' = \dfrac{x_i - \min_i}{\max_i - \min_i}$
Distance or Similarity Measures
Example
Range for salary: 100,000 − 19,000 = 81,000; range for age: 52 − 27 = 25. Gender is a nominal attribute, encoded here as F = 1, M = 0 with simple matching.

Applying min-max normalization to the employee table above gives:
ID  Gender  Age   Salary
1   1       0.00  0.00
2   0       0.96  0.56
3   0       1.00  1.00
4   1       0.24  0.44
5   0       0.72  0.32

dist(ID2, ID3) = SQRT( 0 + (0.04)² + (0.44)² ) ≈ 0.44
dist(ID2, ID4) = SQRT( 1 + (0.72)² + (0.12)² ) ≈ 1.24
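A minimal sketch that applies the min-max normalization and simple matching rules to the employee table (the helper names raw, norm, and dist are ours):

```python
import math

# Employee records from the table above: (gender, age, salary), F=1 / M=0
raw = {1: (1, 27, 19000), 2: (0, 51, 64000), 3: (0, 52, 100000),
       4: (1, 33, 55000), 5: (0, 45, 45000)}

ages = [r[1] for r in raw.values()]
sals = [r[2] for r in raw.values()]

def norm(v, lo, hi):
    """Min-max normalization: map v into [0, 1]."""
    return (v - lo) / (hi - lo)

def dist(i, j):
    g = 0 if raw[i][0] == raw[j][0] else 1       # simple matching on gender
    a = norm(raw[i][1], min(ages), max(ages)) - norm(raw[j][1], min(ages), max(ages))
    s = norm(raw[i][2], min(sals), max(sals)) - norm(raw[j][2], min(sals), max(sals))
    return math.sqrt(g ** 2 + a ** 2 + s ** 2)

print(round(dist(2, 3), 2))   # 0.45 (the slide's 0.44 rounds intermediate values first)
print(round(dist(2, 4), 2))   # 1.24
```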
Domain-Specific Distance Functions
For some data sets, we may need to use specialized distance functions:
- we may want a single attribute or a selected group of attributes to be used in the computation of distance (the same problem as "feature selection")
- we may want to use special properties of one or more attributes in the data
- natural distance functions may exist in the data
Example: Zip Codes
distzip(A, B) = 0, if the zip codes are identical
distzip(A, B) = 0.1, if the first 3 digits are identical
distzip(A, B) = 0.5, if the first digits are identical
distzip(A, B) = 1, if the first digits are different

Example: Customer Solicitation
distsolicit(A, B) = 0, if both A and B responded
distsolicit(A, B) = 0.1, if both A and B were chosen but did not respond
distsolicit(A, B) = 0.5, if both A and B were chosen, but only one responded
distsolicit(A, B) = 1, if one was chosen, but the other was not
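A direct transcription of the zip-code rules into Python (the function name dist_zip simply follows the slide's notation; the sample zip codes are illustrative):

```python
def dist_zip(a: str, b: str) -> float:
    """Domain-specific distance between two 5-digit zip codes."""
    if a == b:
        return 0.0
    if a[:3] == b[:3]:
        return 0.1
    if a[0] == b[0]:
        return 0.5
    return 1.0

print(dist_zip("60614", "60614"))  # 0.0  identical
print(dist_zip("60614", "60611"))  # 0.1  first 3 digits identical
print(dist_zip("60614", "61801"))  # 0.5  first digits identical
print(dist_zip("60614", "90210"))  # 1.0  first digits different
```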
Similarity (Distance) Matrix
Based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values, where the (i, j) entry is the distance (similarity) between items i and j: $d_{ij} = $ similarity (or distance) of $D_i$ to $D_j$.
Note that $d_{ij} = d_{ji}$ (i.e., the matrix is symmetric), so we only need the lower triangle of the matrix. The diagonal is all 1's (similarity) or all 0's (distance).
$$\begin{array}{c|cccc}
       & I_1    & I_2    & \cdots & I_n    \\ \hline
I_1    &        & d_{12} & \cdots & d_{1n} \\
I_2    & d_{21} &        & \cdots & d_{2n} \\
\vdots &        &        & \ddots &        \\
I_n    & d_{n1} & d_{n2} & \cdots &
\end{array}$$
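As an illustration, SciPy's pdist/squareform pair builds such a matrix; a sketch using the normalized employee vectors from the earlier example (Euclidean is pdist's default metric):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Normalized (gender, age, salary) rows for ID1..ID5 from the earlier example
X = np.array([[1, 0.00, 0.00],
              [0, 0.96, 0.56],
              [0, 1.00, 1.00],
              [1, 0.24, 0.44],
              [0, 0.72, 0.32]])

# pdist returns the condensed lower triangle; squareform expands it to n x n
D = squareform(pdist(X))
print(np.round(D, 2))           # symmetric, with an all-zero diagonal
```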
Example: Term Similarities in Documents
      T1  T2  T3  T4  T5  T6  T7  T8
Doc1   0   4   0   0   0   2   1   3
Doc2   3   1   4   3   1   2   0   1
Doc3   3   0   0   0   3   0   3   0
Doc4   0   1   0   3   0   0   2   0
Doc5   2   2   2   3   1   4   0   2

$sim(T_i, T_j) = \sum_{k=1}^{N} (w_{ik} \times w_{jk})$, where $N$ is the number of documents and $w_{ik}$ is the frequency of term $T_i$ in document $k$.

Term-Term Similarity Matrix:
     T1  T2  T3  T4  T5  T6  T7
T2    7
T3   16   8
T4   15  12  18
T5   14   3   6   6
T6   14  18  16  18   6
T7    9   6   0   6   9   2
T8    7  17   8   9   3  16   3
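Since the sum is just a dot product over documents, the entire term-term matrix is $A^T A$; a numpy sketch on the table above:

```python
import numpy as np

# Document-term matrix from the table above (rows = Doc1..Doc5, cols = T1..T8)
A = np.array([[0, 4, 0, 0, 0, 2, 1, 3],
              [3, 1, 4, 3, 1, 2, 0, 1],
              [3, 0, 0, 0, 3, 0, 3, 0],
              [0, 1, 0, 3, 0, 0, 2, 0],
              [2, 2, 2, 3, 1, 4, 0, 2]])

# sim(Ti, Tj) = sum over documents of w_ik * w_jk, i.e. A^T A
S = A.T @ A
print(S[0, 1])   # 7  -> sim(T1, T2)
print(S[1, 5])   # 18 -> sim(T2, T6)
```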
Similarity (Distance) Thresholds
A similarity (distance) threshold may be used to mark pairs that are "sufficiently" similar. Using a threshold value of 10 on the similarity matrix from the previous example:

Similarity values:
     T1  T2  T3  T4  T5  T6  T7
T2    7
T3   16   8
T4   15  12  18
T5   14   3   6   6
T6   14  18  16  18   6
T7    9   6   0   6   9   2
T8    7  17   8   9   3  16   3

Thresholded (1 = similarity of at least 10):
     T1  T2  T3  T4  T5  T6  T7
T2    0
T3    1   0
T4    1   1   1
T5    1   0   0   0
T6    1   1   1   1   0
T7    0   0   0   0   0   0
T8    0   1   0   0   0   1   0
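The same thresholding in numpy, as a quick check (using ≥ is our assumption; no pair in this example equals exactly 10, so the choice does not matter here):

```python
import numpy as np

A = np.array([[0, 4, 0, 0, 0, 2, 1, 3],
              [3, 1, 4, 3, 1, 2, 0, 1],
              [3, 0, 0, 0, 3, 0, 3, 0],
              [0, 1, 0, 3, 0, 0, 2, 0],
              [2, 2, 2, 3, 1, 4, 0, 2]])
S = A.T @ A                      # term-term similarities, as before

B = (S >= 10).astype(int)        # 1 marks a "sufficiently similar" pair
np.fill_diagonal(B, 0)           # ignore a term's similarity to itself
print(B)
```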
Graph Representation
The similarity matrix can be visualized as an undirected graph: each item is represented by a node, and edges represent the fact that two items are similar (a 1 in the thresholded similarity matrix).

     T1  T2  T3  T4  T5  T6  T7
T2    0
T3    1   0
T4    1   1   1
T5    1   0   0   0
T6    1   1   1   1   0
T7    0   0   0   0   0   0
T8    0   1   0   0   0   1   0

[Figure: graph with nodes T1-T8; an edge joins each pair marked 1 above, e.g. T1-T3, T3-T4, T4-T6; T7 is isolated.]
If no threshold is used, the matrix can be represented as a weighted graph.
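A plain-Python sketch that lists the graph's edges from the thresholded matrix (the term labels and variable names are ours; without the threshold, S[i, j] itself would serve as the weight of each edge):

```python
import numpy as np
from itertools import combinations

A = np.array([[0, 4, 0, 0, 0, 2, 1, 3],
              [3, 1, 4, 3, 1, 2, 0, 1],
              [3, 0, 0, 0, 3, 0, 3, 0],
              [0, 1, 0, 3, 0, 0, 2, 0],
              [2, 2, 2, 3, 1, 4, 0, 2]])
B = ((A.T @ A) >= 10).astype(int)          # thresholded similarity matrix
terms = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"]

# One edge for every pair of terms marked 1 (combinations skips i == j)
edges = [(terms[i], terms[j])
         for i, j in combinations(range(8), 2) if B[i, j]]
print(edges)
```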
Clustering Methodologies
Two general methodologies:
- Partitioning-based algorithms: divide a set of N items into K clusters (top-down)
- Hierarchical algorithms:
  - agglomerative: pairs of items or clusters are successively linked to produce larger clusters
  - divisive: start with the whole set as one cluster and successively divide sets into smaller partitions
Hierarchical Clustering
Uses the distance matrix as the clustering criterion.
[Figure: five objects a, b, c, d, e. Agglomerative (AGNES) proceeds from Step 0 to Step 4, forming {a, b}, then {d, e}, then {c, d, e}, and finally {a, b, c, d, e}; divisive (DIANA) traverses the same steps in reverse, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Uses the dissimilarity matrix
Merges the nodes that have the least dissimilarity
Proceeds in a non-descending fashion
Eventually all nodes belong to the same cluster
[Figure: three scatter plots (axes 0-10) showing nearby points being merged into successively larger clusters.]
Algorithmic steps for Agglomerative Hierarchical Clustering
Let X = {x1, x2, x3, ..., xn} be the set of data points.
(1) Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
(2) Find the least-distance pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
(3) Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
(4) Update the distance matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The distance between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
(5) If all the data points are in one cluster, stop; else repeat from step (2).
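The min update rule in step (4) is exactly single linkage, so SciPy's linkage function reproduces these steps; a small sketch on illustrative random points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((6, 2))          # six illustrative 2-D data points

# method="single" applies exactly the update rule of step (4):
# d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)])
Z = linkage(pdist(X), method="single")

# Each row of Z records one merge: [cluster i, cluster j, level L(m), size]
print(np.round(Z, 3))
```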
A Dendrogram Shows How the Clusters are Merged Hierarchically
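For instance, SciPy can draw the dendrogram for a linkage like the one computed above (a sketch; the point labels x0..x5 are ours):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((6, 2))                       # the same six toy points
Z = linkage(pdist(X), method="single")

# Leaves are the data points; the height of each join is its level L(m)
dendrogram(Z, labels=[f"x{i}" for i in range(6)])
plt.ylabel("merge level L(m)")
plt.show()
```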
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., S-PLUS
Inverse order of AGNES
Eventually each node forms a cluster on its own
[Figure: three scatter plots (axes 0-10) showing one cluster being successively split until each point stands alone.]
Algorithmic steps for Divisive Hierarchical Clustering
1. Start with one cluster that contains all samples.
2. Calculate the diameter of each cluster (the maximal distance between samples in the cluster). Choose the cluster C with maximal diameter to split.
3. Find the most dissimilar sample x in cluster C and let x depart from C to form a new independent cluster N (cluster C no longer includes x). Assign all remaining members of cluster C to the candidate set MC.
4. Calculate the similarity from each member of MC to clusters C and N, and let the member with the highest similarity move to its more similar cluster (C or N). Update the members of C and N.
5. Repeat step 4 until the members of clusters C and N do not change.
6. Repeat steps 2-5 until the number of clusters equals the number of samples, or a number specified by the user.
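A rough Python sketch of one split (steps 3-5); diana_split is a hypothetical helper name, not Kaufmann and Rousseeuw's exact formulation:

```python
import numpy as np

def diana_split(D, members):
    """One DIANA split: peel a splinter cluster N off cluster C = members,
    given a full pairwise-distance matrix D (a sketch, not a library API)."""
    C = list(members)
    # Step 3: the sample with the largest average distance to the rest
    # is the most dissimilar; it founds the new cluster N.
    avg = [np.mean([D[i, j] for j in C if j != i]) for i in C]
    N = [C.pop(int(np.argmax(avg)))]
    # Steps 4-5: keep moving members that sit closer to N, until stable.
    moved = True
    while moved and len(C) > 1:
        moved = False
        for i in list(C):
            if len(C) == 1:
                break
            d_C = np.mean([D[i, j] for j in C if j != i])
            d_N = np.mean([D[i, j] for j in N])
            if d_N < d_C:            # i is more similar to the splinter group
                C.remove(i)
                N.append(i)
                moved = True
    return C, N
```

A full DIANA would wrap this in a loop that always splits the cluster of maximal diameter (step 2) until every sample stands alone (step 6).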
Pros and Cons
Advantages:
1) No a priori information about the number of clusters is required.
2) Easy to implement, and gives the best results in some cases.
Disadvantages:
1) The algorithm can never undo what was done previously: merges and splits are final.
2) Time complexity of at least O(n² log n) is required, where n is the number of data points.
3) Depending on the distance measure chosen for merging, different algorithms can suffer from one or more of the following: i) sensitivity to noise and outliers; ii) breaking large clusters; iii) difficulty handling different-sized clusters and convex shapes.
4) No objective function is directly minimized.
5) It is sometimes difficult to identify the correct number of clusters from the dendrogram.
Questions?