
Page 1

CZ5211 Topics in Computational Biology
Lecture 4: Clustering Analysis for Microarray Data II

Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS

Page 2

K-means clustering

This method differs from hierarchical clustering in several ways. In particular:

• There is no hierarchy; the data are simply partitioned. You are presented only with the final cluster membership for each case.

• There is no role for a dendrogram in k-means clustering.

• You must supply the number of clusters (k) into which the data are to be grouped.

Page 3

Example of K-means algorithm: Lloyd's algorithm

• Lloyd's algorithm has been shown to converge to a locally optimal solution

• But it can converge to a solution that is arbitrarily bad compared to the optimal solution

[Figure: example with K=3, showing the data points, the optimal centers, and the centers found by the heuristic]

Page 4

K-means clustering

• Given a set of n data points in d-dimensional space and an integer k

• We want to find the set of k points in d-dimensional space that minimizes the mean squared distance from each data point to its nearest center

• No exact polynomial-time algorithms are known for this problem
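To make this objective concrete, here is a minimal sketch (my own illustration, not part of the lecture) of the quantity being minimized: the mean squared Euclidean distance from each data point to its nearest center. The function name and toy data are assumptions.

```python
import numpy as np

def kmeans_objective(points, centers):
    """Mean squared Euclidean distance from each point to its nearest center.

    points:  (n, d) array of n data points in d dimensions
    centers: (k, d) array of k candidate centers
    """
    # Pairwise squared distances between every point and every center: shape (n, k)
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # For each point, keep only the distance to its closest center
    nearest_sq = sq_dists.min(axis=1)
    return nearest_sq.mean()

# Toy usage (illustrative data): 5 points in 2-D, 2 candidate centers
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [0.2, 0.1]])
ctrs = np.array([[0.0, 0.0], [5.0, 5.0]])
print(kmeans_objective(pts, ctrs))
```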

Page 5

K-means clustering

• Usually uses Euclidean distance

• Gives spherical clusters

• How many clusters, K?

• Solution is not unique, clustering can depend on your starting point

Page 6

K-means clustering

Step 1: Transform the n (genes) × m (experiments) expression matrix into an n (genes) × n (genes) distance matrix

Step 2: Cluster genes based on a k-means clustering algorithm

Expression matrix:

         Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

Distance matrix:

         Gene A   Gene B   Gene C
Gene A      0
Gene B      ?        0
Gene C      ?        ?        0

Page 7

K-means clustering

To transform the n × m matrix into an n × n matrix, use a similarity (distance) metric (Tavazoie et al., Nature Genetics, 1999 Jul; 22:281-5).

Euclidean distance:

d(X, Y) = \sqrt{ \sum_{i=1}^{M} (X_i - Y_i)^2 }

where X and Y are any two genes observed over a series of M conditions.
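A minimal sketch of Step 1 in code, assuming the expression data are held in an n × m NumPy array (genes × experiments); the function name and the toy matrix are my own.

```python
import numpy as np

def euclidean_distance_matrix(expr):
    """expr: (n_genes, m_experiments) expression matrix.
    Returns the (n_genes, n_genes) matrix of pairwise Euclidean distances."""
    diff = expr[:, None, :] - expr[None, :, :]   # (n, n, m) pairwise differences
    return np.sqrt((diff ** 2).sum(axis=2))      # d(X, Y) = sqrt(sum_i (X_i - Y_i)^2)

# Toy expression matrix (illustrative): 3 genes observed over 4 experiments
expr = np.array([[1.0, 2.0, 1.5, 0.5],
                 [1.1, 2.1, 1.4, 0.6],
                 [4.0, 0.5, 3.0, 2.5]])
print(euclidean_distance_matrix(expr).round(2))
```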

Page 8

K-means clustering

         Gene 1   Gene 2   Gene 3   Gene 4
Gene 1      0
Gene 2      1        0
Gene 3      1       √2        0
Gene 4     √2        1        1        0

[Figure: the four genes plotted as points 1-4 at the corners of a unit square in a two-dimensional space]

Page 9

K-means clustering algorithm

Step 1: Suppose the gene expression patterns are positioned in a two-dimensional space based on the distance matrix.

Step 2: The first cluster center (red) is chosen at random; subsequent centers are chosen by finding the data point farthest from the centers already chosen. In this example, k=3.
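A minimal sketch of this farthest-point initialization, assuming the pairwise distance matrix from Step 1 is already available as a NumPy array; the function and parameter names are my own.

```python
import numpy as np

def farthest_first_centers(dist, k, rng=None):
    """Pick k initial centers from a precomputed (n, n) distance matrix:
    the first at random, each subsequent one as the data point farthest
    from the centers already chosen. Returns indices of the chosen points."""
    rng = np.random.default_rng(rng)
    n = dist.shape[0]
    centers = [int(rng.integers(n))]                 # first center chosen randomly
    while len(centers) < k:
        # distance of every point to its closest already-chosen center
        d_to_nearest = dist[:, centers].min(axis=1)
        centers.append(int(d_to_nearest.argmax()))   # farthest point becomes next center
    return centers
```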

Page 10

K-means clustering algorithm

Step 3: Each point is assigned to the cluster associated with the closest representative center

Step 4: Minimize the within-cluster sum of squared distances from the cluster mean by moving the centroid (star points), that is, by computing a new cluster representative.

Page 11

K-means clustering algorithm

Step 5: Repeat steps 3 and 4 with the new representatives.

Run steps 3, 4 and 5 until no further changes occur – self-consistency is reached.

Page 12

Basic Algorithm for K-Means

1. Choose K initial cluster centers at random.
2. Partition objects into K clusters by assigning each object to the closest centroid.
3. Calculate the centroid of each of the K clusters.
4. Assign each object to cluster i by first calculating the distance from the object to all cluster centers, then choosing the closest.
5. If any object changes clusters, recalculate the centroids.
6. Repeat until objects are no longer moving.
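A minimal NumPy sketch of the six steps above. The choice to initialize from k distinct data points, the convergence check, and all names are my own rather than anything prescribed by the slides.

```python
import numpy as np

def kmeans(points, k, max_iter=100, rng=None):
    """points: (n, d) array of objects; returns (labels, centers)."""
    rng = np.random.default_rng(rng)
    # 1. Choose k initial cluster centers at random (here: k distinct data points).
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # 2./4. Assign each object to the cluster whose centroid is closest.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 6. Stop once no object moves to a different cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3./5. Recalculate the centroid of each of the k clusters.
        for i in range(k):
            if np.any(labels == i):              # guard against an empty cluster
                centers[i] = points[labels == i].mean(axis=0)
    return labels, centers

# Example: labels, centers = kmeans(np.random.rand(100, 4), k=3, rng=0)
```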

Page 13

Euclidean Distance and Centroid Point

Euclidean distance between two points x and y in n dimensions:

d_E(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }

Simple and fast! Remember this when we consider the complexity!

Centroid point of k points x_1, ..., x_k in n dimensions:

CP(x_1, ..., x_k) = \left( \frac{\sum_{i=1}^{k} x_{i1}}{k}, \frac{\sum_{i=1}^{k} x_{i2}}{k}, ..., \frac{\sum_{i=1}^{k} x_{in}}{k} \right)

The equation above is used to find the n-dimensional centroid point amid k n-dimensional points.

Page 14

K-means 2nd example with k=2

1. We pick k=2 centers at random

2. We cluster our data around these center points

Page 15

K-means 2nd example with k=2

3. We recalculate centers based on our current clusters

Page 16

K-means 2nd example with k=2

4. We re-cluster our data around our new center points

Page 17

K-means 2nd example with k=2

5. We repeat the last two steps until no more data points are moved into a different cluster

Page 18

K-means 3rd example: Initialization

[Figure: data points with three initial cluster centers marked ×]

Page 19

K-means 3rd example: Iteration 1

[Figure: cluster assignments and center positions after iteration 1]

Page 20

K-means 3rd example: Iteration 2

[Figure: cluster assignments and center positions after iteration 2]

Page 21

K-means 3rd example: Iteration 3

[Figure: cluster assignments and center positions after iteration 3]

Page 22

K-means 3rd example: Iteration 4

[Figure: cluster assignments and center positions after iteration 4]

Page 23

K-means clustering problems

• Random initialization means that you may get different clusters each time

• Data points are assigned to only one cluster (hard assignment)

• Implicit assumptions about the “shapes” of clusters

• You have to pick the number of clusters…

Page 24

K-means problem: always finds k clusters

[Figure: three cluster centers (×) imposed on data with no obvious cluster structure]

Page 25

K-means problem: distance may not always accurately reflect relationship

- Each data point is assigned to the correct cluster

- But data points that appear far from each other under the heuristic distance may in reality be very closely related to each other

Page 26

Tips on improving K-means clustering: to split/combine clusters

• Variations of the ISODATA algorithm

– Split clusters that are too large by increasing k by one
– Merge clusters that are too small, by merging clusters that are very close to one another

• What is too close and too far?

Page 27

Tips on improving K-means clustering: Use of K-medoids instead of centroids

• K-means uses centroids, the average of the samples in a cluster

• Medoid: a “representative object” within a cluster

• Less sensitive to outliers
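To illustrate the difference, a minimal sketch (my own, not a full k-medoids/PAM implementation) in which the medoid is the actual cluster member with the smallest total distance to the other members; the toy points are assumptions.

```python
import numpy as np

def medoid(cluster_points):
    """Return the member of the cluster with the smallest total distance
    to all other members (less sensitive to outliers than the mean)."""
    dists = np.linalg.norm(cluster_points[:, None, :] - cluster_points[None, :, :], axis=2)
    return cluster_points[dists.sum(axis=1).argmin()]

# An outlier drags the centroid away but barely moves the medoid
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [10.0, 10.0]])
print(pts.mean(axis=0))   # centroid is pulled to roughly [2.5, 2.5]
print(medoid(pts))        # medoid stays with the bulk of the points
```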

Page 28

Tips on improving K-means clustering: How to choose k?

• Use another clustering method

• Run algorithm on data with several different values of k, and look at the stability of the results

• Use advance knowledge about the characteristics of your test

Page 29

Tips on improving K-means clustering: Choosing K by using Silhouettes

• The silhouette of a gene i is:

s(i) = (b_i - a_i) / max(a_i, b_i)

• a_i: average distance of sample i to the other samples in the same cluster

• b_i: average distance of sample i to the samples in the nearest neighboring cluster

• The maximal average silhouette width can be used to select the number of clusters; genes with s(i) close to one are well classified
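A minimal sketch that computes s(i) as defined above from a precomputed distance matrix and a label vector; the function name is mine, and in practice a library routine such as scikit-learn's silhouette_score would typically be used instead.

```python
import numpy as np

def silhouette_widths(dist, labels):
    """dist: (n, n) pairwise distance matrix; labels: (n,) cluster assignment.
    Returns s(i) = (b_i - a_i) / max(a_i, b_i) for every sample (assumes k >= 2)."""
    labels = np.asarray(labels)
    n = len(labels)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False                                   # exclude the sample itself
        a = dist[i, same].mean() if same.any() else 0.0   # a_i: within-cluster average
        b = min(dist[i, labels == c].mean()               # b_i: nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s

# Choosing k: compute silhouette_widths(...) for each candidate k and
# pick the k with the largest mean silhouette width.
```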

Page 30

Tips on improving K-means clustering: Choosing K by using Silhouettes

[Figure: silhouette plots for k=2 and k=3]

Page 31

Tips on improving K-means clustering: Choosing K by using WADP

WADP: weighted average discrepancy pairs

• Add noise (perturbations) to the original data
• Calculate the number of sample pairs that clustered together in the original data but no longer cluster together after perturbation
• Repeat for every cutoff level in hierarchical clustering, or for each k in k-means
• Estimate the proportion of pairs that changes for each k
• Use different levels of noise (heuristic)
• Look for the largest k before WADP gets large
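A rough sketch of the perturbation idea (a simplified illustration, not the published WADP procedure): add noise, re-cluster, and count how many originally co-clustered pairs break apart. The clustering routine, noise level, and names are assumptions.

```python
import numpy as np

def co_clustered_pairs(labels):
    """Set of index pairs (i, j) that share a cluster under the given labeling."""
    labels = np.asarray(labels)
    return {(i, j) for i in range(len(labels)) for j in range(i + 1, len(labels))
            if labels[i] == labels[j]}

def discrepancy(points, k, cluster_fn, noise_sd=0.1, n_runs=20, rng=0):
    """Average fraction of originally co-clustered pairs that separate after
    adding Gaussian noise; cluster_fn(points, k) must return a label vector
    (e.g. the k-means sketch above, wrapped to return only its labels)."""
    rng = np.random.default_rng(rng)
    original = co_clustered_pairs(cluster_fn(points, k))
    if not original:
        return 0.0
    broken = 0
    for _ in range(n_runs):
        noisy = points + rng.normal(scale=noise_sd, size=points.shape)
        broken += len(original - co_clustered_pairs(cluster_fn(noisy, k)))
    return broken / (n_runs * len(original))

# Look for the largest k before this discrepancy starts to grow sharply.
```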

Page 32

Tips on improving K-means clustering: Choosing K by using Cluster Quality Measures

• By introducing a measure of cluster quality Q, different values of k can be evaluated until an optimal value of Q is reached

• But, since clustering is an unsupervised learning method, one can’t really expect to find a “correct” measure Q…

• So, once again, there are different choices of Q, and our decision will depend on what dissimilarity measure is used and what types of clusters we want

Page 33

Tips on improving K-means clustering: Choosing K by using Cluster Quality Measures

• Jagota suggested a measure that emphasizes cluster tightness or homogeneity:

Q = \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{x \in C_i} d(x, \bar{x}_i)

• |C_i| is the number of data points in cluster i, and \bar{x}_i is the centroid of cluster i

• Q will be small if (on average) the data points in each cluster are close
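A minimal sketch of this tightness measure, assuming d is the Euclidean distance from each point to its cluster centroid; the function name is my own.

```python
import numpy as np

def cluster_tightness_Q(points, labels):
    """Q = sum over clusters i of (1/|C_i|) * sum of d(x, centroid_i) over x in C_i.
    Small Q means tight (homogeneous) clusters."""
    labels = np.asarray(labels)
    q = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        centroid = members.mean(axis=0)
        q += np.linalg.norm(members - centroid, axis=1).sum() / len(members)
    return q

# Evaluate Q for several values of k and look for the point where it stops improving.
```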

Page 34

Tips on improving K-means clustering: Choosing K by using Cluster Quality Measures

[Figure: plot of the Q measure versus k]

This is a plot of the Q measure as given in Jagota for k-means clustering on the data shown earlier. How many clusters do you think there actually are?

Page 35

Tips on improving K-means clustering: Choosing K by using Cluster Quality Measures

• The Q measure given in Jagota takes into account homogeneity within clusters, but not separation between clusters

• Other measures try to combine these two characteristics (e.g., the Davies-Bouldin measure)

• An alternative approach is to look at cluster stability:
– Add random noise to the data many times and count how many pairs of data points no longer cluster together
– How much noise to add? It should reflect the estimated variance in the data

Page 36

What makes a clustering good?

• Clustering results can be different for different methods and distance metrics

• Except in the simplest of cases, result is sensitive to noise and outliers in the data

• As in the case of differential genes, we are looking for:
– Homogeneity: similarity within a cluster
– Separation: differences between clusters

Page 37

What makes a clustering good? Hypothesis Testing Approach

• Null hypothesis is that data has NO structure

• Generate a reference data population under the random (null) hypothesis, in which the data model a random structure, and compare it to the actual data

• Estimate a statistic that indicates data structure

Page 38

Cluster Quality

• Since any data can be clustered, how do we know our clusters are meaningful?
– The size (diameter) of the cluster vs. the inter-cluster distance
– Distance between the members of a cluster and the cluster's center
– Diameter of the smallest sphere

Page 39

Cluster Quality

[Figure: two clusters, each of diameter (size) 5; in one case the clusters are separated by a distance of 20, in the other by a distance of 5]

Quality of a cluster can be assessed by the ratio of its distance to the nearest cluster to its own diameter.
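A minimal sketch of such a ratio, assuming the diameter is the largest within-cluster distance and the distance to the nearest cluster is measured between centroids; both of these measurement choices are my own simplifications.

```python
import numpy as np

def quality_ratio(points, labels, cluster_id):
    """Distance from this cluster's centroid to the nearest other centroid,
    divided by the cluster's diameter (largest within-cluster distance).
    Larger values indicate a tighter, better-separated cluster."""
    labels = np.asarray(labels)
    members = points[labels == cluster_id]
    diameter = max(np.linalg.norm(a - b) for a in members for b in members)
    own_centroid = members.mean(axis=0)
    other_centroids = [points[labels == c].mean(axis=0)
                       for c in np.unique(labels) if c != cluster_id]
    nearest = min(np.linalg.norm(own_centroid - o) for o in other_centroids)
    return nearest / diameter if diameter > 0 else np.inf
```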

Page 40

Cluster Quality

Quality can be assessed simply by looking at the diameter of a cluster

A cluster can be formed even when there is no similarity between clustered patterns. This occurs because the algorithm forces k clusters to be created.

Page 41

Characteristics of k-means clustering

• The random selection of initial center points creates the following properties

– Non-Determinism

– May produce clusters without patterns

• One solution is to choose the centers randomly from existing patterns

Page 42

K-means clustering algorithm complexity

• Linear relationship with the number of data points, N

• CPU time required is proportional to cN
– c does not depend on N, but rather on the number of clusters, k

• Low computational complexity

• High speed