
Finding Clusters

Types of Clustering Approaches:

Linkage Based, e.g. Hierarchical Clustering

Clustering by Partitioning, e.g. k-Means

Density Based Clustering, e.g. DBScan

Grid Based Clustering

Compendium slides for “Guide to Intelligent Data Analysis”, Springer 2011. © Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn and Iris Adä


Hierarchical Clustering


Hierarchical clustering

[Figure: two-dimensional scatter plot, both axes from –3 to 3; legend: Iris setosa, Iris versicolor, Iris virginica.]

In the two-dimensional MDS (Sammon mapping) representation of the Iris data set, two clusters can be identified. (The colours, indicating the species of the flowers, are ignored here.)


Hierarchical clustering builds clusters step by step.

Usually a bottom-up strategy is applied: each data object is first considered a separate cluster, and then clusters that are close to each other are joined step by step. This approach is called agglomerative hierarchical clustering.

In contrast to agglomerative hierarchical clustering, divisive hierarchical clustering starts with the whole data set as a single cluster and then divides clusters step by step into smaller clusters.

In order to decide which data objects should belong to the same cluster, a (dis-)similarity measure is needed.

Note: We do not need to have access to features; all that is needed for hierarchical clustering is an $n \times n$ matrix $[d_{i,j}]$, where $d_{i,j}$ is the (dis-)similarity of data objects $i$ and $j$. ($n$ is the number of data objects.)


Hierarchical clustering: Dissimilarity matrix

The dissimilarity matrix $[d_{i,j}]$ should at least satisfy the following conditions.

$d_{i,j} \geq 0$, i.e. dissimilarity cannot be negative.

$d_{i,i} = 0$, i.e. each data object is completely similar to itself.

$d_{i,j} = d_{j,i}$, i.e. data object $i$ is (dis-)similar to data object $j$ to the same degree as data object $j$ is (dis-)similar to data object $i$.

It is often useful if the dissimilarity is a (pseudo-)metric, satisfying also the triangle inequality $d_{i,k} \leq d_{i,j} + d_{j,k}$.
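The conditions above can be checked mechanically for a given matrix. The following is a minimal sketch, assuming the matrix is a list of lists; the function name `check_dissimilarity_matrix` and the tolerance `tol` are illustrative choices, not part of the slides.

```python
def check_dissimilarity_matrix(d, tol=1e-12):
    """Report which of the listed conditions the square matrix d satisfies."""
    idx = range(len(d))
    return {
        "non_negative": all(d[i][j] >= 0 for i in idx for j in idx),
        "zero_diagonal": all(abs(d[i][i]) <= tol for i in idx),
        "symmetric": all(abs(d[i][j] - d[j][i]) <= tol for i in idx for j in idx),
        "triangle": all(d[i][k] <= d[i][j] + d[j][k] + tol
                        for i in idx for j in idx for k in idx),
    }

# Absolute differences of the 1-D values 2, 12, 16 form a metric.
values = [2, 12, 16]
d_metric = [[abs(a - b) for b in values] for a in values]
# Squared differences keep the first three conditions but violate
# the triangle inequality (196 > 100 + 16).
d_squared = [[(a - b) ** 2 for b in values] for a in values]

print(check_dissimilarity_matrix(d_metric))
print(check_dissimilarity_matrix(d_squared))
```

The squared-difference matrix illustrates why the triangle inequality is listed separately: a dissimilarity can be non-negative, symmetric and zero on the diagonal without being a metric.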


Agglomerative hierarchical clustering: Algorithm

Input: $n \times n$ dissimilarity matrix $[d_{i,j}]$.

1 Start with $n$ clusters; each data object forms a single cluster.

2 Reduce the number of clusters by joining those two clusters that are most similar (least dissimilar).

3 Repeat step 2 until there is only one cluster left containing all data objects.
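The three steps can be sketched directly on a dissimilarity matrix. This is a naive illustration, not an efficient implementation. The function name `agglomerative`, the `linkage` parameter (here `min` for single linkage or `max` for complete linkage) and the 4×4 example matrix are assumptions made for this sketch.

```python
def agglomerative(d, linkage=min):
    """Naive agglomerative clustering on a dissimilarity matrix d.

    linkage=min gives single linkage, linkage=max complete linkage.
    Returns the merges as (cluster_a, cluster_b, dissimilarity) tuples.
    """
    clusters = [(i,) for i in range(len(d))]       # step 1: n singleton clusters
    merges = []
    while len(clusters) > 1:                       # step 3: repeat until one cluster is left
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = linkage(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best                          # step 2: join the least dissimilar pair
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical dissimilarity matrix for four data objects.
d = [[0, 1, 4, 5],
     [1, 0, 3, 6],
     [4, 3, 0, 2],
     [5, 6, 2, 0]]

for merge in agglomerative(d, linkage=min):
    print(merge)
```

Only the matrix is used, never the underlying features, which matches the note that an $n \times n$ dissimilarity matrix is all that hierarchical clustering needs.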


Measuring dissimilarity between clusters

The dissimilarity between two clusters containing only one data object each is simply the dissimilarity of the two data objects specified in the dissimilarity matrix $[d_{i,j}]$.

But how do we compute the dissimilarity between clusters that contain more than one data object?


Centroid: Distance between the centroids (mean value vectors) of the two clusters. (Requires that we can compute the mean vector!)

Average Linkage: Average dissimilarity between all pairs of points of the two clusters.

Single Linkage: Dissimilarity between the two most similar data objects of the two clusters.

Complete Linkage: Dissimilarity between the two most dissimilar data objects of the two clusters.
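The four measures can be compared on a small example. The sketch below assumes numerical points and Euclidean distance; the function names and the two example clusters are illustrative assumptions, not taken from the slides.

```python
from itertools import product
from math import dist

def centroid(cluster):
    """Mean value vector of a cluster given as a list of coordinate tuples."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def cluster_dissimilarities(c1, c2):
    """Euclidean versions of the four measures for two clusters of points."""
    pairwise = [dist(p, q) for p, q in product(c1, c2)]
    return {
        "centroid": dist(centroid(c1), centroid(c2)),
        "average": sum(pairwise) / len(pairwise),
        "single": min(pairwise),
        "complete": max(pairwise),
    }

# Hypothetical clusters in the plane.
c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]
result = cluster_dissimilarities(c1, c2)
print(result)
```

Only the centroid measure needs the coordinates themselves; the other three could be computed from pairwise dissimilarities alone.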


Single linkage can “follow chains” in the data (may be desirable in certain applications).

Complete linkage leads to very compact clusters.

Average linkage also tends clearly towards compact clusters.


[Figure with two panels: single linkage, complete linkage.]


Ward’s method

Another strategy for merging clusters.

In contrast to single, complete or average linkage, it takes the number of data objects in each cluster into account.


The updated dissimilarity between the newly formed cluster $C \cup C'$ and the cluster $C''$ is computed in the following way.

single linkage: $d'(C \cup C', C'') = \min\{d'(C, C''),\, d'(C', C'')\}$

complete linkage: $d'(C \cup C', C'') = \max\{d'(C, C''),\, d'(C', C'')\}$

average linkage: $d'(C \cup C', C'') = \dfrac{|C|\, d'(C, C'') + |C'|\, d'(C', C'')}{|C| + |C'|}$

Ward: $d'(C \cup C', C'') = \dfrac{(|C| + |C''|)\, d'(C, C'') + (|C'| + |C''|)\, d'(C', C'') - |C''|\, d'(C, C')}{|C| + |C'| + |C''|}$

centroid: $d'(C \cup C', C'') = \dfrac{1}{|C \cup C'|\, |C''|} \sum_{x \in C \cup C'} \sum_{y \in C''} d(x, y)$

Note on the centroid method: if metric, usually the mean vector needs to be computed!
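The first four update rules only need the old dissimilarities and the cluster sizes, so they can be written as one small function. This is a sketch; the function name `updated_dissimilarity` and its parameter names are illustrative assumptions.

```python
def updated_dissimilarity(method, d_c, d_c1, d_cc1, n_c, n_c1, n_c2):
    """Dissimilarity between the merged cluster C ∪ C' and a third cluster C''.

    d_c   = d'(C, C''),  d_c1 = d'(C', C''),  d_cc1 = d'(C, C')
    n_c, n_c1, n_c2 = |C|, |C'|, |C''|
    """
    if method == "single":
        return min(d_c, d_c1)
    if method == "complete":
        return max(d_c, d_c1)
    if method == "average":
        return (n_c * d_c + n_c1 * d_c1) / (n_c + n_c1)
    if method == "ward":
        return ((n_c + n_c2) * d_c + (n_c1 + n_c2) * d_c1 - n_c2 * d_cc1) / (n_c + n_c1 + n_c2)
    raise ValueError(f"unknown method: {method}")

# Hypothetical values: d'(C, C'') = 16, d'(C', C'') = 4, d'(C, C') = 4.
for method in ("single", "complete", "average", "ward"):
    print(method, updated_dissimilarity(method, 16, 4, 4, n_c=1, n_c1=3, n_c2=1))
```

Because only the previous step's values are used, the dissimilarity matrix can be updated in place after each merge without revisiting the individual data objects.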


Dendrograms

The cluster merging process arranges the data points in a binary tree.

Draw the data tuples at the bottom or on the left (equally spaced if they are multi-dimensional).

Draw a connection between clusters that are merged, with the distance to the data points representing the distance between the clusters.


Hierarchical clustering

Example

Clustering of the 1-dimensional data set $\{2, 12, 16, 25, 29, 45\}$.

All three approaches to measure the distance between clusters lead to different dendrograms.
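The merge order for this data set can be reproduced with a few lines of code. The sketch below is a naive implementation written for this example; the function names are illustrative, and ties are broken by list position, which is an arbitrary choice.

```python
def merge_sequence(values, cluster_distance):
    """Return the merges (cluster_a, cluster_b, distance) for 1-D data."""
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        pairs = [(cluster_distance(a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        dist, i, j = min(pairs)                    # closest pair of clusters
        merges.append((sorted(clusters[i]), sorted(clusters[j]), dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

def single(a, b):
    return min(abs(x - y) for x in a for y in b)

def complete(a, b):
    return max(abs(x - y) for x in a for y in b)

def centroid(a, b):
    return abs(sum(a) / len(a) - sum(b) / len(b))

data = [2, 12, 16, 25, 29, 45]
for name, fn in (("centroid", centroid), ("single", single), ("complete", complete)):
    print(name)
    for merge in merge_sequence(data, fn):
        print("  ", merge)
```

All three variants first join 12 with 16 and 25 with 29. They differ from the third merge on: single linkage joins the two pairs, while centroid and complete linkage first attach 2 to the pair 12, 16 and then continue differently.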


[Figure: three dendrograms, labelled Centroid, Single linkage and Complete linkage.]


Choosing the right clusters

Simplest approach:
Specify a minimum desired distance between clusters.
Stop merging clusters if the closest two clusters are farther apart than this distance.

Visual approach:
Merge clusters until all data points are combined into one cluster.
Draw the dendrogram and find a good cut level.
Advantage: the cut need not be strictly horizontal.

More sophisticated approaches:
Analyze the sequence of distances in the merging process.
Try to find a step in which the distance between the two clusters merged is considerably larger than the distance of the previous step.
Several heuristic criteria exist for this step selection.
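One simple way to look for such a step is to take the largest jump between consecutive merge distances. The sketch below shows only this one heuristic, chosen here for illustration; the slides do not prescribe a particular criterion, and the function name is an assumption. It expects at least two merge distances.

```python
def clusters_from_largest_jump(merge_distances):
    """Choose the number of clusters from the largest jump in merge distances.

    merge_distances[t] is the distance at which merge t (0-based) happened;
    n data points produce n - 1 merges.
    Returns (number_of_clusters, last_accepted_merge_distance).
    """
    n = len(merge_distances) + 1
    jumps = [later - earlier
             for earlier, later in zip(merge_distances, merge_distances[1:])]
    t = max(range(len(jumps)), key=jumps.__getitem__)  # jump between merge t and t + 1
    merges_kept = t + 1                                 # stop before the expensive merge
    return n - merges_kept, merge_distances[t]

# Hypothetical sequence of merge distances for six data points.
print(clusters_from_largest_jump([1, 1, 2, 10, 11]))
```

In the example the distance jumps from 2 to 10 at the fourth merge, so merging stops after three merges and three clusters remain.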


Heatmaps

A heatmap combines

a dendrogram resulting from clustering the data,

a dendrogram resulting from clustering the attributes and

colours to indicate the values of the attributes.


Example: Heatmap and dendrogram

[Figure: heatmap with dendrogram for ten data objects (rows 1 to 10) and the attributes x and y; colour key with histogram (value against count).]


[Figure: heatmap with dendrogram for 150 data objects and four attributes (columns 1 to 4); colour key with histogram (value against count).]


[Figure: heatmap with dendrogram for two attributes (columns 1 and 2); colour key with histogram (value against count).]


Iris Data: Heatmap and dendrogram

[Figure: heatmap with dendrogram of the Iris data for the attributes sw, sl, pl and pw; colour key with histogram (value against count).]


Divisive hierarchical clustering

The top-down approach of divisive hierarchical clustering is rarely used.

In agglomerative clustering the minimum of the pairwise dissimilarities has to be determined, leading to a quadratic complexity in each step (quadratic in the number of clusters still present in the corresponding step).

In divisive clustering, for each cluster all possible splits would have to be considered.

In the first step, there are $2^{n-1} - 1$ possible splits, where $n$ is the number of data objects.
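The count $2^{n-1} - 1$ can be confirmed by brute-force enumeration for small $n$. The sketch below is only a sanity check, and the function name is an illustrative assumption.

```python
from itertools import combinations

def count_bipartitions(n):
    """Count the splits of n labelled objects into two non-empty clusters."""
    objects = range(n)
    splits = set()
    for size in range(1, n):
        for part in combinations(objects, size):
            rest = tuple(o for o in objects if o not in part)
            splits.add(frozenset([part, rest]))   # {A, B} and {B, A} are the same split
    return len(splits)

for n in range(2, 9):
    print(n, count_bipartitions(n), 2 ** (n - 1) - 1)
```

The enumeration itself grows exponentially with $n$, which is the reason divisive clustering cannot consider all splits for realistic data sets.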


What is Similarity?


How to cluster these objects?


Clustering example

[Three figure-only slides.]

Scaling


The previous three slides show the same data set.

In the second slide, the unit on the x-axis was changed to centi-units.

In the third slide, the unit on the y-axis was changed to centi-units.

Clusters should not depend on the measurement unit!

Therefore, some kind of normalisation (see the chapter on data preparation) should be carried out before clustering.
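The effect of the measurement unit can be seen on a tiny example. The sketch below uses z-score normalisation, which is only one possible choice; the data points and function names are hypothetical.

```python
from math import dist
from statistics import mean, pstdev

def z_scores(column):
    """Centre a column to mean 0 and scale it to standard deviation 1."""
    m, s = mean(column), pstdev(column)
    return [(v - m) / s for v in column]

def normalise(points):
    """Apply the z-score transformation to every attribute (column)."""
    return list(zip(*(z_scores(col) for col in zip(*points))))

def nearest_neighbour(points, index):
    """Index of the point closest (Euclidean) to points[index]."""
    others = [k for k in range(len(points)) if k != index]
    return min(others, key=lambda k: dist(points[index], points[k]))

# Hypothetical data; 'rescaled' measures x in centi-units.
points = [(0.0, 0.0), (1.0, 0.5), (0.5, 2.0), (3.0, 3.0)]
rescaled = [(100 * x, y) for x, y in points]

print(nearest_neighbour(points, 0), nearest_neighbour(rescaled, 0))
print(nearest_neighbour(normalise(points), 0), nearest_neighbour(normalise(rescaled), 0))
```

Without normalisation the nearest neighbour of the first point changes when the unit of x changes; after normalisation both versions of the data give the same answer.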


Complex Similarities: An Example

A few Adrenalin-like drug candidates:

[Figure: molecular structures labelled Adrenalin, (B), (C), (D) and (E).]


Similarity: Polarity


Dissimilarity: Hydrophobic / Hydrophilic


Similar to Adrenalin... but some cross the blood-brain barrier

[Figure: molecular structures of Adrenalin, Amphetamin (Speed), Ephedrin, Dopamin and MDMA (Ecstasy).]


Similarity Measures


Notion of (dis-)similarity: Numerical attributes

Various choices for dissimilarities between two numerical vectors:

[Figure: panels labelled Manhattan, Pearson, Tschebyschew and Euclidean.]

Minkowski ($L_p$): $d_p(x, y) = \sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}$

Euclidean ($L_2$): $d_E(x, y) = \sqrt{(x_1 - y_1)^2 + \ldots + (x_n - y_n)^2}$

Manhattan ($L_1$): $d_M(x, y) = |x_1 - y_1| + \ldots + |x_n - y_n|$

Tschebyschew ($L_\infty$): $d_\infty(x, y) = \max\{|x_1 - y_1|, \ldots, |x_n - y_n|\}$

Cosine: $d_C(x, y) = 1 - \dfrac{x^\top y}{\|x\| \|y\|}$

Tanimoto: $d_T(x, y) = \dfrac{x^\top y}{\|x\|^2 + \|y\|^2 - x^\top y}$ (in this form the value is 1 for $x = y$, i.e. it is the Tanimoto similarity coefficient)

Pearson: Euclidean distance of the z-score transformed $x$, $y$
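Most of these dissimilarities are one-liners. The sketch below covers the Minkowski family and the cosine dissimilarity for plain tuples of numbers; the function names and the example vectors are illustrative assumptions.

```python
from math import sqrt

def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def tschebyschew(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    # Undefined for a zero vector (division by zero).
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean(x, y), manhattan(x, y), tschebyschew(x, y), cosine(x, y))
```

Euclidean and Manhattan distance are the Minkowski distance for $p = 2$ and $p = 1$, and the Tschebyschew distance is its limit for large $p$.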


Notion of (dis-)similarity: Binary attributes

The two values (e.g. 0 and 1) of a binary attribute can be interpreted as some property being absent (0) or present (1).

In this sense, a vector of binary attributes can be interpreted as a set of properties that the corresponding object has.

Example

The binary vector (0, 1, 1, 0, 1) corresponds to the set of properties $\{a_2, a_3, a_5\}$.

The binary vector (0, 0, 0, 0, 0) corresponds to the empty set.

The binary vector (1, 1, 1, 1, 1) corresponds to the set $\{a_1, a_2, a_3, a_4, a_5\}$.


Dissimilarity measures for two vectors of binary attributes. Each data object is represented by the corresponding set of properties that are present.

measure | binary attributes | sets of properties
simple match | $d_S = 1 - \frac{b+n}{b+n+x}$ |
Russel & Rao | $d_R = 1 - \frac{b}{b+n+x}$ | $1 - \frac{|X \cap Y|}{|\Omega|}$
Jaccard | $d_J = 1 - \frac{b}{b+x}$ | $1 - \frac{|X \cap Y|}{|X \cup Y|}$
Dice | $d_D = 1 - \frac{2b}{2b+x}$ | $1 - \frac{2|X \cap Y|}{|X| + |Y|}$

Number of predicates that...
$b$ = ...hold in both records
$n$ = ...hold in neither record
$x$ = ...hold in only one of the two records

Example:

x | y | set X | set Y | b | n | x | $d_S$ | $d_R$ | $d_J$ | $d_D$
101000 | 111000 | $\{a_1, a_3\}$ | $\{a_1, a_2, a_3\}$ | 2 | 3 | 1 | 0.17 | 0.67 | 0.33 | 0.20
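The example row can be recomputed from the two binary vectors. In this sketch the vectors are given as strings of `0` and `1`; the function name is an illustrative assumption, and the count called $x$ in the table is named `m` in the code to avoid a clash with the vector `x`.

```python
def binary_dissimilarities(x, y):
    """x, y: equal-length strings of '0'/'1' characters."""
    b = sum(p == "1" and q == "1" for p, q in zip(x, y))  # hold in both records
    n = sum(p == "0" and q == "0" for p, q in zip(x, y))  # hold in neither record
    m = sum(p != q for p, q in zip(x, y))                 # hold in only one record
    # Jaccard and Dice are undefined when b + m == 0 (two all-zero vectors).
    return {
        "simple_match": 1 - (b + n) / (b + n + m),
        "russel_rao": 1 - b / (b + n + m),
        "jaccard": 1 - b / (b + m),
        "dice": 1 - 2 * b / (2 * b + m),
    }

result = binary_dissimilarities("101000", "111000")
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Jaccard and Dice ignore the properties that are absent in both records, which matters when most attributes are 0 for most objects.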


Notion of (dis-)similarity: Nominal attributes


Nominal attributes may be transformed into a set of binary attributes, each of them indicating one particular feature of the attribute (1-of-n coding).

Example

Attribute Manufacturer with the values BMW, Chrysler, Dacia, Ford, Volkswagen.

manufacturer | ... | binary vector
Volkswagen | ... | 00001
Dacia | ... | 01000
Ford | ... | 00100

Then one of the dissimilarity measures for binary attributes can be applied.

Another way to measure similarity between two vectors of nominal attributes is to compute the proportion of attributes where both vectors have the same value, leading to the Russel & Rao dissimilarity measure.
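Both ideas can be sketched in a few lines. The bit order of the code vector follows the order of the value list, which is an assumption of this sketch. The second attribute (`colour`) and the two example records are hypothetical, and the function names are illustrative.

```python
def one_of_n(value, domain):
    """1-of-n coding: one bit per possible value, set only for the actual value."""
    return [1 if value == v else 0 for v in domain]

def mismatch_proportion(u, v):
    """1 minus the proportion of attributes on which both records agree."""
    same = sum(a == b for a, b in zip(u, v))
    return 1 - same / len(u)

manufacturers = ["BMW", "Chrysler", "Dacia", "Ford", "Volkswagen"]
print(one_of_n("Volkswagen", manufacturers))

# Hypothetical records with two nominal attributes: (manufacturer, colour).
car_a = ("Volkswagen", "red")
car_b = ("Volkswagen", "blue")
print(mismatch_proportion(car_a, car_b))
```

The first function produces vectors to which the binary measures can be applied; the second works on the nominal values directly and needs no recoding.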


Prototype Based Clustering


given: dataset of size n

return: set of typical examples of size k << n.


k-Means clustering

Choose a number k of clusters to be found (user input).

Initialize the cluster centres randomly (for instance, by randomly selecting k data points).

Data point assignment: Assign each data point to the cluster centre that is closest to it (i.e. closer than any other cluster centre).

Cluster centre update: Compute new cluster centres as the mean vectors of the assigned data points. (Intuitively: centre of gravity if each data point has unit weight.)

Repeat these two steps (data point assignment and cluster centre update) until the cluster centres do not change anymore.

It can be shown that this scheme must converge, i.e., the update of the cluster centres cannot go on forever.
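The procedure described above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names, the `seed` and `max_iter` parameters, the use of squared Euclidean distance and the rule that an empty cluster keeps its old centre are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centres):
    """Data point assignment: index of the closest centre for every point."""
    return [min(range(len(centres)), key=lambda i: squared_distance(p, centres[i]))
            for p in points]


def update(points, labels, centres):
    """Cluster centre update: mean vector of the points assigned to each centre."""
    new_centres = []
    for i, old in enumerate(centres):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:              # assumption: an empty cluster keeps its centre
            new_centres.append(old)
        else:
            new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centres


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initialise with k randomly selected data points
    labels = assign(points, centres)
    for _ in range(max_iter):
        new_centres = update(points, labels, centres)
        if new_centres == centres:   # centres no longer change: converged
            break
        centres = new_centres
        labels = assign(points, centres)
    return centres, labels


# Two well-separated groups of hypothetical data points.
points = [(0, 0), (1, 0), (100, 0), (101, 0)]
centres, labels = k_means(points, k=2)
```

For this small data set every choice of two initial data points leads to the same two clusters; which of them receives index 0 depends on the random initialisation.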

Aim: Minimize the objective function

$$f = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij} d_{ij}$$

under the constraints $u_{ij} \in \{0, 1\}$ and

$$\sum_{i=1}^{k} u_{ij} = 1 \quad \text{for all } j = 1, \ldots, n.$$

Alternating optimization

Assuming the cluster centres to be fixed, $u_{ij} = 1$ should be chosen for the cluster i to which data object $x_j$ has the smallest distance in order to minimize the objective function.

Assuming the assignments to the clusters to be fixed, each cluster centre should be chosen as the mean vector of the data objects assigned to the cluster in order to minimize the objective function.
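The second statement can be checked with a short derivation. It assumes that $d_{ij}$ is the squared Euclidean distance between data object $x_j$ and cluster centre $c_i$; the symbol $c_i$ is introduced here for illustration only. With the assignments $u_{ij}$ held fixed, setting the gradient of the objective function with respect to $c_i$ to zero gives the mean vector:

```latex
f = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij} \, \lVert x_j - c_i \rVert^2
\quad\Longrightarrow\quad
\frac{\partial f}{\partial c_i} = -2 \sum_{j=1}^{n} u_{ij} \, (x_j - c_i) = 0
\quad\Longrightarrow\quad
c_i = \frac{\sum_{j=1}^{n} u_{ij} \, x_j}{\sum_{j=1}^{n} u_{ij}}
```

Because $u_{ij} \in \{0, 1\}$, the denominator is simply the number of data objects assigned to cluster i. Neither step can increase the objective function, which is why the alternating scheme cannot go on forever.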

k-Means clustering: Example

[Figure sequence not reproduced: the k-means iterations on an example data set.]

k-Means clustering: Local minima

Clustering is successful in this example: The clusters found are those that would have been formed intuitively.

Convergence is achieved after only 5 steps. (This is typical: convergence is usually very fast.)

However: The clustering result is fairly sensitive to the initial positions of the cluster centres.

With a bad initialisation clustering may fail (the alternating update process gets stuck in a local minimum).
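A small constructed example makes the local minimum concrete. The four data points, the two initialisations and all function names below are hypothetical, and squared Euclidean distance is assumed: started between the two natural groups, the centres are already a fixed point of the alternating updates, although the objective value is far worse than for the intuitive clustering.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means_from(points, centres, max_iter=100):
    """Alternate assignment and mean update, starting from the given centres."""
    for _ in range(max_iter):
        labels = [min(range(len(centres)), key=lambda i: squared_distance(p, centres[i]))
                  for p in points]
        new_centres = []
        for i, old in enumerate(centres):
            members = [p for p, label in zip(points, labels) if label == i]
            new_centres.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            )
        if new_centres == centres:   # fixed point reached
            break
        centres = new_centres
    return centres, labels


def objective(points, centres, labels):
    """Sum of squared distances of all points to their assigned centre."""
    return sum(squared_distance(p, centres[label]) for p, label in zip(points, labels))


# Corners of a 4 x 1 rectangle: the intuitive clusters are the left and right pair.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

good = k_means_from(points, [(0, 0), (4, 1)])   # ends with a left and a right cluster
bad = k_means_from(points, [(2, 0), (2, 1)])    # stuck: bottom pair versus top pair

f_good = objective(points, *good)
f_bad = objective(points, *bad)
```

A common remedy, not covered on the slide, is to run the algorithm from several random initialisations and keep the result with the smallest objective value.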

Gaussian Mixture Models

Gaussian mixture models – EM clustering

Assumption: Data were generated by sampling a set of normal distributions. (The probability density is a mixture of normal distributions.)

Aim: Find the parameters for the normal distributions and how much each normal distribution contributes to the data.
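Written out for one-dimensional data, the assumed density takes the following form. The symbols are introduced here for illustration and do not appear on the slides: $\pi_i$ is the contribution (weight) of the i-th normal distribution, and $\mu_i$ and $\sigma_i^2$ are its mean and variance.

```latex
p(x) = \sum_{i=1}^{k} \pi_i \, \mathcal{N}(x \mid \mu_i, \sigma_i^2),
\qquad \pi_i \ge 0, \quad \sum_{i=1}^{k} \pi_i = 1
```

In this notation, a mixture in which two normal distributions each contribute 50% has $\pi_1 = \pi_2 = 0.5$, and a 10% / 90% mixture has $\pi_1 = 0.1$, $\pi_2 = 0.9$.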

Gaussian mixture models

[Figures not reproduced: three density plots over the range -3 to 4, with the following captions.]

Two normal distributions.

Mixture model (both normal distributions contribute 50%).

Mixture model (one normal distribution contributes 10%, the other 90%).

Gaussian mixture models – EM clustering

Algorithm: EM clustering (expectation maximisation). Alternating scheme in which the parameters of the normal distributions and the likelihoods of the data points to be generated by the corresponding normal distributions are estimated.
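For one-dimensional data the alternating scheme can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed number of iterations `n_iter`, the variance floor and the example data are all hypothetical choices.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """EM for a one-dimensional Gaussian mixture; returns (means, variances, weights)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: probability that each point was generated by each component.
        resp = []
        for x in data:
            joint = [weights[i] * normal_pdf(x, means[i], variances[i]) for i in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate the parameters from the weighted data points.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            spread = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(spread, 1e-6)   # assumption: floor against collapse
            weights[i] = n_i / len(data)       # contribution of component i
    return means, variances, weights


# Hypothetical data: two groups around -1 and 2.
data = [-1.2, -1.0, -0.8, 1.8, 2.0, 2.2]
means, variances, weights = em_gmm_1d(
    data, means=[-1.0, 2.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

Unlike k-means, each data point contributes to every component in proportion to its estimated probability of having been generated by it, so the assignment is soft rather than 0/1.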

Density Based Clustering

For numerical data, density-based clustering algorithms often yield the best results.

Principle: A connected region with high data density corresponds to one cluster.

DBScan is one of the most popular density-based clustering algorithms.

Density-based clustering: DBScan

Principal idea of DBScan:

1. Find a data point where the data density is high, i.e. in whose ε-neighbourhood there are at least ℓ other points. (ε and ℓ are parameters of the algorithm to be chosen by the user.)

2. All the points in the ε-neighbourhood are considered to belong to one cluster.

3. Expand this ε-neighbourhood (the cluster) as long as the high density criterion is satisfied.

4. Remove the cluster (all data points assigned to the cluster) from the data set and continue with 1. as long as data points with a high data density around them can be found.
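The four steps can be sketched in plain Python as follows. The names `region`, `dbscan`, `eps` and `min_others` are hypothetical, Euclidean distance is assumed, and the sketch marks points as assigned instead of physically removing them, so neighbourhood counts are always taken over the full data set. Points that never join a cluster receive the label -1.

```python
def region(points, idx, eps):
    """Indices of all other points within distance eps of points[idx]."""
    p = points[idx]
    return [j for j, q in enumerate(points)
            if j != idx and sum((a - b) ** 2 for a, b in zip(p, q)) <= eps ** 2]


def dbscan(points, eps, min_others):
    """Return one label per point: 0, 1, ... for clusters, -1 for unassigned points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                       # step 4: assigned points are skipped
        neighbours = region(points, i, eps)
        if len(neighbours) < min_others:
            labels[i] = -1                 # low density; a cluster may still absorb it
            continue
        cluster += 1                       # step 1: dense point found
        labels[i] = cluster
        queue = list(neighbours)           # step 2: its neighbourhood joins the cluster
        while queue:                       # step 3: expand while density is high
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region(points, j, eps)
            if len(j_neighbours) >= min_others:
                queue.extend(j_neighbours)
    return labels


# Two dense unit squares and one isolated point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_others=3)
```

The number of clusters is not specified in advance; it follows from the data and from the two parameters.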

[Figure not reproduced. Legend text: "grid cell", "neighbourhood cell", "with at least 3 hits".]
