objective of clustering - the university of edinburgh · fuzzy and soft k-means clustering...

115
Objective of clustering Discover structures and patterns in high-dimensional data. Group data with similar patterns together. This reduces the complexity and facilitates interpretation.

Upload: others

Post on 03-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Objective of clustering

• Discover structures and patterns in high-dimensional

data.

• Group data with similar patterns together.

• This reduces the complexity and facilitates

interpretation.

Page 2: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Expression level gene B

Exp

ress

ion

leve

l gen

e A

Page 3: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Expression level gene B

Exp

ress

ion

leve

l gen

e A

Malignant tumour

Benign tumour

Page 4: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Expression level gene B

Exp

ress

ion

leve

l gen

e A

Malignant tumour

Benign tumour

Page 5: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Benign tumour

Malignant tumour

Exp

ress

ion

leve

l gen

e A

Expression level gene B

Page 6: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Benign tumour

Malignant tumour

?

Exp

ress

ion

leve

l gen

e A

Expression level gene B

Page 7: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Genes involved in pathway B

Genes involved in pathway A

Exp

ress

ion

leve

l und

er s

tarv

atio

n

Expression level under heat shock

?

Page 8: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

How shall we cluster the data?

Page 9: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Good clustering

Page 10: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Bad clustering

Page 11: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Sum of squared vector-centroid distances

Within-group variance

Page 12: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Between-group variance Sum over squared distances between centroids

Page 13: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Tight clusters

Within-group variance small

Page 14: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Diffuse clusters

Within-group variance large

Page 15: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Between-group variance largeClusters far apart

Page 16: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Between-group variance smallClusters close together

Page 17: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

How to cluster the data

Page 18: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

How to cluster the data

• Minimize the within-group variance

→ Tight clusters

Page 19: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

How to cluster the data

• Minimize the within-group variance

→ Tight clusters

• Maximize the between-group variance

→ Clusters well separated

Page 20: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

How to cluster the data

• Minimize the within-group variance

→ Tight clusters

• Maximize the between-group variance

→ Clusters well separated

• Problem NP-hard

→ Heuristic algorithms and approximations are needed.

Page 21: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

K-means clustering

Page 22: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

K-means clustering

• Objective: Partition the data into a predefined number

of clusters, K.

• Method: Alternatingly update

– the cluster assignment of each data vector;

– the cluster centroids.

Page 23: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

• Decide on the number of clusters, K.

Page 24: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

• Decide on the number of clusters, K.

• Start with a set of cluster centroids: c1, . . . , cK.

Page 25: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

• Decide on the number of clusters, K.

• Start with a set of cluster centroids: c1, . . . , cK.

• Iteration (until cluster assignments remain unchanged):

– For all data vectors xi, i = 1, . . . , N , and all centroids ck,

k = 1, . . . ,K: Compute the distance dik between the data

vector xi and the centroid ck.

Page 26: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

• Decide on the number of clusters, K.

• Start with a set of cluster centroids: c1, . . . , cK.

• Iteration (until cluster assignments remain unchanged):

– For all data vectors xi, i = 1, . . . , N , and all centroids ck,

k = 1, . . . ,K: Compute the distance dik between the data

vector xi and the centroid ck.

– Assign each data vector xi to the closest centroid ck, that is,

the one with minimal dik. Record the cluster membership in

an indicator variable λik, with λik = 1 if xi → ck and λik = 0otherwise.

Page 27: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

• Decide on the number of clusters, K.

• Start with a set of cluster centroids: c1, . . . , cK.

• Iteration (until cluster assignments remain unchanged):

– For all data vectors xi, i = 1, . . . , N , and all centroids ck,

k = 1, . . . ,K: Compute the distance dik between the data

vector xi and the centroid ck.

– Assign each data vector xi to the closest centroid ck, that is,

the one with minimal dik. Record the cluster membership in

an indicator variable λik, with λik = 1 if xi → ck and λik = 0otherwise.

– Set each cluster centroid to the mean of its assigned cluster:

ck =∑i λikxi∑i λik

Page 28: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

A good example of K-means clustering

Page 29: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 30: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 31: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 32: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 33: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 34: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 35: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 36: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

A bad example of K-means clustering

Page 37: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 38: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 39: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 40: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Shortcoming of K-means clustering

• The algorithm can easily get stuck insuboptimal cluster formations.

• Use fuzzy or soft K-means.

Page 41: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Fuzzy and soft K-means clustering

Page 42: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Fuzzy and soft K-means clustering

• Objective: Soft or fuzzy partition of the data into a

predefined number of clusters, K.

– Each data vector may belong to more than one cluster,

according to its degree of membership.

– This is in contrast to K-means, where a data vector

either wholly belongs to a cluster or not.

Page 43: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Fuzzy and soft K-means clustering

• Objective: Soft or fuzzy partition of the data into a

predefined number of clusters, K.

– Each data vector may belong to more than one cluster,

according to its degree of membership.

– This is in contrast to K-means, where a data vector

either wholly belongs to a cluster or not.

• Method: Alternatingly update

– the membership grade for each data vector;

– the cluster centroids.

Page 44: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

• Decide on the number of clusters, K.

• Start with a set of cluster centroids: c1, . . . , cK.

Page 45: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

• Decide on the number of clusters, K.

• Start with a set of cluster centroids: c1, . . . , cK.

• Iteration (until membership grades remain unchanged):

– For all data vectors xi, i = 1, . . . , N , and all centroids ck,

k = 1, . . . ,K: Compute the distance dik between the data

vector xi and the centroid ck.

Page 46: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

• Decide on the number of clusters, K.

• Start with a set of cluster centroids: c1, . . . , cK.

• Iteration (until membership grades remain unchanged):

– For all data vectors xi, i = 1, . . . , N , and all centroids ck,

k = 1, . . . ,K: Compute the distance dik between the data

vector xi and the centroid ck.

– Compute the membership grades λik. Note: λik ≥ 0 indicates

the amount of association of data vector xi with centroid ck and

depends on the distance dik: if dik < dik′, then λik > λik′. The

detailed functional form (omitted) differs between soft and fuzzy

K-means.

Page 47: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

• Decide on the number of clusters, K.

• Start with a set of cluster centroids: c1, . . . , cK.

• Iteration (until membership grades remain unchanged):

– For all data vectors xi, i = 1, . . . , N , and all centroids ck,

k = 1, . . . ,K: Compute the distance dik between the data

vector xi and the centroid ck.

– Compute the membership grades λik. Note: λik ≥ 0 indicates

the amount of association of data vector xi with centroid ck and

depends on the distance dik: if dik < dik′, then λik > λik′. The

detailed functional form (omitted) differs between soft and fuzzy

K-means.

– Recompute the cluster centroids: ck =∑i λikxi∑i λik

Page 48: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Two examples of soft K-means clustering

The posterior probability for a given data point is indicated by a colour

scale ranging from pure red (corresponding to a posterior probability

of 1.0 for the red component and 0.0 for the blue component) through

to pure blue.

Page 49: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 50: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 51: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 52: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 53: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 54: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 55: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Initialization for which K-means failed

Page 56: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 57: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 58: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 59: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 60: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 61: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 62: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 63: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Agglomerative hierarchical clustering:UPGMA

(hierarchical average linkage clustering)

Page 64: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

. .

..

5

4

3

3 5421

1 2

Page 65: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

. .

..

5

4

3

3

d(1,2)/2

6

541

1 2

2

Page 66: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

. .

..

1 2 4 5

67

d(3,4)/2d(1,2)/2

3

3

4

5

21

Page 67: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Distance between clusters

Average of individual distances.

Page 68: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

. .

..

1 2 4 5

67

d(3,4)/2d(1,2)/2

3

3

4

5

21

Page 69: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

. .

..

1 2 4 5

67

d(1,2)/2

8

d(5,7)/2d(3,4)/2

3

3

4

5

21

Page 70: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

. .

..

1 2 3 4 5

67

d(5,7)/2

8

9

d(1,2)/2 d(3,4)/2

3

4

5

d(6,8)/2 - d(5,7)/2

21

Page 71: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Experiments

Genes.

Page 72: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Experiments

Genes.

Page 73: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Experiments

Genes.

Page 74: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Experiments

Genes.

Page 75: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Experiments

Genes.

Page 76: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Experiments

Genes

Page 77: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

From Spellman et al., http://cellcycle-www.stanford.edu/

Page 78: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

• Hierarchical clustering methods produce a tree or

dendrogram → Allows the biologist to visualize and

interpret the data.

• No need to specify how many clusters are appropriate

−→ partition of the data for each number of clusters

K.

• Partitions are obtained from cutting the tree at different

levels.

Page 79: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

. .

..

1 2 3 4 5

67

d(5,7)/2

8

9

d(1,2)/2 d(3,4)/2

3

4

5

d(6,8)/2 - d(5,7)/2

21

Page 80: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

. .

...

4 5

67

8

9

d(5,7)/2

d(1,2)/2 d(3,4)/2

3

4

5

d(6,8)/2 - d(5,7)/2

1 2

31 2

Page 81: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Principal clustering paradigms

• Non-hierarchical

Cluster N vectors into K groups in an iterative process.

• Hierarchical

Hierarchie of nested clusters; each cluster typically

consists of the union of two smaller sub-clusters.

Page 82: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Hierarchical methods can be further subdivided

Page 83: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Hierarchical methods can be further subdivided

• Bottom-up or agglomerative clustering:

Start with a single-object cluster and recursively merge

them into larger clusters.

Page 84: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Hierarchical methods can be further subdivided

• Bottom-up or agglomerative clustering:

Start with a single-object cluster and recursively merge

them into larger clusters.

• Top down or divisive clustering:

Start with a cluster containing all data and recursively

divide it into smaller clusters.

Page 85: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Overview of clustering methods

Hierarchical Non-hierarchical

Top-down or divisive K-means

Fuzzy/soft K-means

Bottom-up or agglomerative UPGMA

Page 86: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Shortcoming of bottom-up agglomerative clustering

• Focus on local structures → loses some of the information

present in global patterns.

• Once a data vector has been assigned to a node,

it cannot be reassigned to another node later

when global patterns emerge.

Page 87: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Shortcoming of bottom-up agglomerative clustering

• Focus on local structures → loses some of the information

present in global patterns.

• Once a data vector has been assigned to a node,

it cannot be reassigned to another node later

when global patterns emerge.

How can we devise a hierarchical top-down approach?

Hierarchical Non-hierarchical

Top-down or divisive ? K-means

Fuzzy/soft K-means

Bottom-up or agglomerative UPGMA

Page 88: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Divisive (top-down) hierarchical clustering:Binary tree-structured vector quantization

(BTSVQ)

Page 89: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Initially, all data belong to the same cluster

. .

..

Page 90: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Fetch one cluster from the stack.Split this cluster into two clustersusing the fuzzy/soft Kmeans algorithm

Initially, all data belong to the same cluster

. .

..

Page 91: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

. .

..

Initially, all data belong to the same cluster

updated) threshold?Is this ratio larger than a (dynamicallybetween variance/within variance.Compute the ratio

using the fuzzy/soft Kmeans algorithmSplit this cluster into two clustersFetch one cluster from the stack.

Page 92: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

. .

..

Initially, all data belong to the same cluster

Yes updated) threshold?Is this ratio larger than a (dynamicallybetween variance/within variance.Compute the ratio

the stackclusters onPlace both

using the fuzzy/soft Kmeans algorithmSplit this cluster into two clustersFetch one cluster from the stack.

Page 93: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Compute the ratiobetween variance/within variance.Is this ratio larger than a (dynamicallyupdated) threshold?Yes

No

Initially, all data belong to the same cluster

. .

..

Fetch one cluster from the stack.Split this cluster into two clustersusing the fuzzy/soft Kmeans algorithm

Merge the two clusters and remove theresulting cluster from the stack.

Place bothclusters onthe stack

Page 94: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Compute the ratiobetween variance/within variance.Is this ratio larger than a (dynamicallyupdated) threshold?Yes

No

Initially, all data belong to the same cluster

. .

..

Fetch one cluster from the stack.Split this cluster into two clustersusing the fuzzy/soft Kmeans algorithm

Merge the two clusters and remove theresulting cluster from the stack.

Any remaining clusters on the stack ?

Place bothclusters onthe stack

Page 95: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Compute the ratiobetween variance/within variance.Is this ratio larger than a (dynamicallyupdated) threshold?

Yes

Yes

No

Initially, all data belong to the same cluster

. .

..

Fetch one cluster from the stack.Split this cluster into two clustersusing the fuzzy/soft Kmeans algorithm

Merge the two clusters and remove theresulting cluster from the stack.

Any remaining clusters on the stack ?

Place bothclusters onthe stack

Page 96: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Compute the ratiobetween variance/within variance.Is this ratio larger than a (dynamicallyupdated) threshold?

Yes

Yes

No

Initially, all data belong to the same cluster

No

. .

..

Fetch one cluster from the stack.Split this cluster into two clustersusing the fuzzy/soft Kmeans algorithm

Merge the two clusters and remove theresulting cluster from the stack.

Any remaining clusters on the stack ?

Place bothclusters onthe stack

End

Page 97: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Compute the ratiobetween variance/within variance.Is this ratio larger than a (dynamicallyupdated) threshold?

Yes

Yes

No

Initially, all data belong to the same cluster

NoAdapted in simplified form fromSzeto et al.Journal of Visual Languagesand Computing 14, 341-362

(2003)

. .

..

Fetch one cluster from the stack.Split this cluster into two clustersusing the fuzzy/soft Kmeans algorithm

Merge the two clusters and remove theresulting cluster from the stack.

Any remaining clusters on the stack ?

Place bothclusters onthe stack

End

Page 98: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Overview of clustering methods

Hierarchical Non-hierarchical

Top-down or divisive BTSVQ K-means

Fuzzy/soft K-means

Bottom-up or agglomerative UPGMA

Page 99: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Pitfalls of clustering

Page 100: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Pitfalls of clustering

• Clustering algorithms always produce clusters even for

uniformaly distrubuted data.

Page 101: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Pitfalls of clustering

• Clustering algorithms always produce clusters even for

uniformaly distrubuted data.

• Difficult to test the null hypothesis of no clusters

(current research).

Page 102: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Pitfalls of clustering

• Clustering algorithms always produce clusters even for

uniformaly distrubuted data.

• Difficult to test the null hypothesis of no clusters

(current research).

• Difficult to estimate the true number of clusters (current

research).

Page 103: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Pitfalls of clustering

• Clustering algorithms always produce clusters even for

uniformaly distrubuted data.

• Difficult to test the null hypothesis of no clusters

(current research).

• Difficult to estimate the true number of clusters (current

research).

• Risk of artifacts.

• Use clustering only for hypothesis generation.

Page 104: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Pitfalls of clustering

• Clustering algorithms always produce clusters even for

uniformaly distrubuted data.

• Difficult to test the null hypothesis of no clusters

(current research).

• Difficult to estimate the true number of clusters (current

research).

• Risk of artifacts.

• Use clustering only for hypothesis generation.

• Independent experimental verification required.

Page 105: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Deciding on the number of clusters:Gap statistic

Tibshirani, Walther, Hastie (2001), J. Royal Statistical Society B

Idea:

• Compute EK for randomized data.

• Compare this with EK from real data.

Page 106: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Randomize data

seneG

Experiments

Page 107: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Randomize data

Shuffle. .

seneG

Experiments

Page 108: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Randomize data

. .

seneG

Experiments

Page 109: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Randomize data

seneG

Experiments

Page 110: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Randomize data

Shuffle

seneG

Experiments

Page 111: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Randomize data

. .

seneG

Experiments

Page 112: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

Randomize data

. .

seneG

Experiments

And so on . . .

Page 113: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

5 81 2 3 4 6 7Number of clusters K

random data

true data

.

. .

.

E K

Page 114: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

764321 85 5

.

..

.

Gap= | |(randomized data)(true dat) -

true data

random data

Number of clusters K Number of clusters K764321 8

KE KEKE

Page 115: Objective of clustering - The University of Edinburgh · Fuzzy and soft K-means clustering Objective: Soft or fuzzy partition of the data into a prede ned number of clusters, K. {

85764321 85 1

.

..

.

Adapted from Hastie, Tibshiranie, Friedman: The Elements of Statistical Learning, Springer 2001

over random dataMaximal improvement

Gap= | |(randomized data)(true dat) -

true data

random data

Number of clusters K Number of clusters K76432

KE KEKE