
Page 1

CZ5211 Topics in Computational Biology
Lecture 4: Clustering Analysis for Microarray Data II

Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS

Page 2

K-means clustering

This method differs from hierarchical clustering in several ways. In particular:

• There is no hierarchy; the data are simply partitioned. You are presented only with the final cluster membership for each case.

• There is no role for a dendrogram in k-means clustering.

• You must supply the number of clusters (k) into which the data are to be grouped.

Page 3

Example of K-means algorithm: Lloyd's algorithm

• Lloyd's algorithm has been shown to converge to a locally optimal solution

• But it can converge to a solution that is arbitrarily bad compared to the optimal solution

[Figure: example with K=3, showing the data points, the optimal centers, and the centers found by the heuristic]

Page 4

K-means clustering

• Given a set of n data points in d-dimensional space and an integer k

• We want to find the set of k points in d-dimensional space that minimizes the mean squared distance from each data point to its nearest center

• No exact polynomial-time algorithms are known for this problem
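To make this objective concrete, here is a minimal sketch (my own illustration, not part of the lecture) of the quantity being minimized: the mean squared Euclidean distance from each data point to its nearest center. The function name and toy data are assumptions.

```python
import numpy as np

def kmeans_objective(points, centers):
    """Mean squared Euclidean distance from each point to its nearest center.

    points:  (n, d) array of n data points in d dimensions
    centers: (k, d) array of k candidate centers
    """
    # Pairwise squared distances between every point and every center: shape (n, k)
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # For each point, keep only the distance to its closest center
    nearest_sq = sq_dists.min(axis=1)
    return nearest_sq.mean()

# Toy usage (illustrative data): 5 points in 2-D, 2 candidate centers
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [0.2, 0.1]])
ctrs = np.array([[0.0, 0.0], [5.0, 5.0]])
print(kmeans_objective(pts, ctrs))
```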

Page 5

K-means clustering

• Usually uses Euclidean distance

• Gives spherical clusters

• How many clusters, K?

• Solution is not unique, clustering can depend on your starting point

Page 6

K-means clustering

Step 1: Transform the n (genes) × m (experiments) expression matrix into an n (genes) × n (genes) distance matrix

Step 2: Cluster genes based on a k-means clustering algorithm

Expression matrix:

         Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

Distance matrix:

         Gene A   Gene B   Gene C
Gene A      0
Gene B      ?        0
Gene C      ?        ?        0

Page 7

K-means clustering

To transform the n × m matrix into an n × n matrix, use a similarity (distance) metric (Tavazoie et al., Nature Genetics, 1999 Jul; 22:281-5).

Euclidean distance:

d(X, Y) = \sqrt{ \sum_{i=1}^{M} (X_i - Y_i)^2 }

where X and Y are any two genes observed over a series of M conditions.
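A minimal sketch of Step 1 in code, assuming the expression data are held in an n × m NumPy array (genes × experiments); the function name and the toy matrix are my own.

```python
import numpy as np

def euclidean_distance_matrix(expr):
    """expr: (n_genes, m_experiments) expression matrix.
    Returns the (n_genes, n_genes) matrix of pairwise Euclidean distances."""
    diff = expr[:, None, :] - expr[None, :, :]   # (n, n, m) pairwise differences
    return np.sqrt((diff ** 2).sum(axis=2))      # d(X, Y) = sqrt(sum_i (X_i - Y_i)^2)

# Toy expression matrix (illustrative): 3 genes observed over 4 experiments
expr = np.array([[1.0, 2.0, 1.5, 0.5],
                 [1.1, 2.1, 1.4, 0.6],
                 [4.0, 0.5, 3.0, 2.5]])
print(euclidean_distance_matrix(expr).round(2))
```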

Page 8

K-means clustering

         Gene 1   Gene 2   Gene 3   Gene 4
Gene 1      0
Gene 2      1        0
Gene 3      1       √2        0
Gene 4     √2        1        1        0

[Figure: the four genes plotted as points 1-4 at the corners of a unit square in a two-dimensional space]

Page 9

K-means clustering algorithm

Step 1: Suppose the gene expression patterns are positioned in a two-dimensional space based on the distance matrix.

Step 2: The first cluster center (red) is chosen at random; subsequent centers are chosen by finding the data point farthest from the centers already chosen. In this example, k=3.
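A minimal sketch of this farthest-point initialization, assuming the pairwise distance matrix from Step 1 is already available as a NumPy array; the function and parameter names are my own.

```python
import numpy as np

def farthest_first_centers(dist, k, rng=None):
    """Pick k initial centers from a precomputed (n, n) distance matrix:
    the first at random, each subsequent one as the data point farthest
    from the centers already chosen. Returns indices of the chosen points."""
    rng = np.random.default_rng(rng)
    n = dist.shape[0]
    centers = [int(rng.integers(n))]                 # first center chosen randomly
    while len(centers) < k:
        # distance of every point to its closest already-chosen center
        d_to_nearest = dist[:, centers].min(axis=1)
        centers.append(int(d_to_nearest.argmax()))   # farthest point becomes next center
    return centers
```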

Page 10

K-means clustering algorithm

Step 3: Each point is assigned to the cluster associated with the closest representative center

Step 4: Minimize the within-cluster sum of squared distances from the cluster mean by moving the centroid (star points), that is, by computing a new cluster representative.

Page 11

K-means clustering algorithm

Step 5: Repeat steps 3 and 4 with the new representatives.

Run steps 3, 4 and 5 until no further changes occur – self-consistency is reached.

Page 12

Basic Algorithm for K-Means

1. Choose K initial cluster centers at random.
2. Partition objects into K clusters by assigning each object to the closest centroid.
3. Calculate the centroid of each of the K clusters.
4. Assign each object to cluster i by first calculating the distance from the object to all cluster centers, then choosing the closest.
5. If any object changes clusters, recalculate the centroids.
6. Repeat until objects are no longer moving.
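A minimal NumPy sketch of the six steps above. The choice to initialize from k distinct data points, the convergence check, and all names are my own rather than anything prescribed by the slides.

```python
import numpy as np

def kmeans(points, k, max_iter=100, rng=None):
    """points: (n, d) array of objects; returns (labels, centers)."""
    rng = np.random.default_rng(rng)
    # 1. Choose k initial cluster centers at random (here: k distinct data points).
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # 2./4. Assign each object to the cluster whose centroid is closest.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 6. Stop once no object moves to a different cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3./5. Recalculate the centroid of each of the k clusters.
        for i in range(k):
            if np.any(labels == i):              # guard against an empty cluster
                centers[i] = points[labels == i].mean(axis=0)
    return labels, centers

# Example: labels, centers = kmeans(np.random.rand(100, 4), k=3, rng=0)
```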

Page 13

Euclidean Distance and Centroid Point

Euclidean distance between two points x and y in n dimensions:

d_E(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }

Simple and fast! Remember this when we consider the complexity!

Centroid point of k points x_1, ..., x_k in n dimensions:

CP(x_1, ..., x_k) = \left( \frac{\sum_{i=1}^{k} x_{i1}}{k}, \frac{\sum_{i=1}^{k} x_{i2}}{k}, ..., \frac{\sum_{i=1}^{k} x_{in}}{k} \right)

The equation above is used to find the n-dimensional centroid point amid k n-dimensional points.

Page 14

K-means 2nd example with k=2

1. We pick k=2 centers at random

2. We cluster our data around these center points

Page 15

K-means 2nd example with k=2

3. We recalculate centers based on our current clusters

Page 16

K-means 2nd example with k=2

4. We re-cluster our data around our new center points

Page 17

K-means 2nd example with k=2

5. We repeat the last two steps until no more data points are moved into a different cluster

Page 18

K-means 3rd example: Initialization

[Figure: data points with three initial cluster centers marked ×]

Page 19

K-means 3rd example: Iteration 1

[Figure: cluster assignments and center positions after iteration 1]

Page 20

K-means 3rd example: Iteration 2

[Figure: cluster assignments and center positions after iteration 2]

Page 21

K-means 3rd example: Iteration 3

[Figure: cluster assignments and center positions after iteration 3]

Page 22

K-means 3rd example: Iteration 4

[Figure: cluster assignments and center positions after iteration 4]

Page 23

K-means clustering problems

• Random initialization means that you may get different clusters each time

• Data points are assigned to only one cluster (hard assignment)

• Implicit assumptions about the “shapes” of clusters

• You have to pick the number of clusters…

Page 24

K-means problem: always finds k clusters

[Figure: three cluster centers (×) imposed on data with no obvious cluster structure]

Page 25

K-means problem: distance may not always accurately reflect relationship

- Each data point is assigned to the correct cluster

- But data points that appear far from each other under the heuristic distance may in reality be very closely related to each other

Page 26

Tips on improving K-means clustering: to split/combine clusters

• Variations of the ISODATA algorithm

– Split clusters that are too large by increasing k by one
– Merge clusters that are too small, by merging clusters that are very close to one another

• What is too close and too far?

Page 27

Tips on improving K-means clustering: Use of K-medoids instead of centroids

• K-means uses centroids, the average of the samples in a cluster

• Medoid: a “representative object” within a cluster

• Less sensitive to outliers
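To illustrate the difference, a minimal sketch (my own, not a full k-medoids/PAM implementation) in which the medoid is the actual cluster member with the smallest total distance to the other members; the toy points are assumptions.

```python
import numpy as np

def medoid(cluster_points):
    """Return the member of the cluster with the smallest total distance
    to all other members (less sensitive to outliers than the mean)."""
    dists = np.linalg.norm(cluster_points[:, None, :] - cluster_points[None, :, :], axis=2)
    return cluster_points[dists.sum(axis=1).argmin()]

# An outlier drags the centroid away but barely moves the medoid
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [10.0, 10.0]])
print(pts.mean(axis=0))   # centroid is pulled to roughly [2.5, 2.5]
print(medoid(pts))        # medoid stays with the bulk of the points
```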

Page 28

Tips on improving K-means clustering: How to choose k?

• Use another clustering method

• Run algorithm on data with several different values of k, and look at the stability of the results

• Use advance knowledge about the characteristics of your test

Page 29

Tips on improving K-means clustering: Choosing K by using Silhouettes

• The silhouette of a gene i is:

s(i) = (b_i - a_i) / max(a_i, b_i)

• a_i: average distance of sample i to the other samples in the same cluster

• b_i: average distance of sample i to the samples in the nearest neighboring cluster

• The maximal average silhouette width can be used to select the number of clusters; genes with s(i) close to one are well classified
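A minimal sketch that computes s(i) as defined above from a precomputed distance matrix and a label vector; the function name is mine, and in practice a library routine such as scikit-learn's silhouette_score would typically be used instead.

```python
import numpy as np

def silhouette_widths(dist, labels):
    """dist: (n, n) pairwise distance matrix; labels: (n,) cluster assignment.
    Returns s(i) = (b_i - a_i) / max(a_i, b_i) for every sample (assumes k >= 2)."""
    labels = np.asarray(labels)
    n = len(labels)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False                                   # exclude the sample itself
        a = dist[i, same].mean() if same.any() else 0.0   # a_i: within-cluster average
        b = min(dist[i, labels == c].mean()               # b_i: nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s

# Choosing k: compute silhouette_widths(...) for each candidate k and
# pick the k with the largest mean silhouette width.
```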

Page 30

Tips on improving K-means clustering: Choosing K by using Silhouettes

[Figure: silhouette plots for k=2 and k=3]

Page 31

Tips on improving K-means clustering: Choosing K by using WADP

WADP: weighted average discrepancy pairs

• Add noise (perturbations) to the original data
• Calculate the number of sample pairs that clustered together in the original data but no longer cluster together after perturbation
• Repeat for every cutoff level in hierarchical clustering, or for each k in k-means
• Estimate the proportion of pairs that changes for each k
• Use different levels of noise (heuristic)
• Look for the largest k before WADP gets large
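A rough sketch of the perturbation idea (a simplified illustration, not the published WADP procedure): add noise, re-cluster, and count how many originally co-clustered pairs break apart. The clustering routine, noise level, and names are assumptions.

```python
import numpy as np

def co_clustered_pairs(labels):
    """Set of index pairs (i, j) that share a cluster under the given labeling."""
    labels = np.asarray(labels)
    return {(i, j) for i in range(len(labels)) for j in range(i + 1, len(labels))
            if labels[i] == labels[j]}

def discrepancy(points, k, cluster_fn, noise_sd=0.1, n_runs=20, rng=0):
    """Average fraction of originally co-clustered pairs that separate after
    adding Gaussian noise; cluster_fn(points, k) must return a label vector
    (e.g. the k-means sketch above, wrapped to return only its labels)."""
    rng = np.random.default_rng(rng)
    original = co_clustered_pairs(cluster_fn(points, k))
    if not original:
        return 0.0
    broken = 0
    for _ in range(n_runs):
        noisy = points + rng.normal(scale=noise_sd, size=points.shape)
        broken += len(original - co_clustered_pairs(cluster_fn(noisy, k)))
    return broken / (n_runs * len(original))

# Look for the largest k before this discrepancy starts to grow sharply.
```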

Page 32

Tips on improving K-means clustering: Choosing K by using Cluster Quality Measures

• By introducing a measure of cluster quality Q, different values of k can be evaluated until an optimal value of Q is reached

• But, since clustering is an unsupervised learning method, one can’t really expect to find a “correct” measure Q…

• So, once again, there are different choices of Q, and our decision will depend on what dissimilarity measure is used and what types of clusters we want

Page 33

Tips on improving K-means clustering: Choosing K by using Cluster Quality Measures

• Jagota suggested a measure that emphasizes cluster tightness or homogeneity:

Q = \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{x \in C_i} d(x, \bar{x}_i)

• |C_i| is the number of data points in cluster i, and \bar{x}_i is the centroid of cluster i

• Q will be small if (on average) the data points in each cluster are close
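A minimal sketch of this tightness measure, assuming d is the Euclidean distance from each point to its cluster centroid; the function name is my own.

```python
import numpy as np

def cluster_tightness_Q(points, labels):
    """Q = sum over clusters i of (1/|C_i|) * sum of d(x, centroid_i) over x in C_i.
    Small Q means tight (homogeneous) clusters."""
    labels = np.asarray(labels)
    q = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        centroid = members.mean(axis=0)
        q += np.linalg.norm(members - centroid, axis=1).sum() / len(members)
    return q

# Evaluate Q for several values of k and look for the point where it stops improving.
```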

Page 34

Tips on improving K-means clustering: Choosing K by using Cluster Quality Measures

[Figure: plot of the Q measure versus k]

This is a plot of the Q measure as given in Jagota for k-means clustering on the data shown earlier. How many clusters do you think there actually are?

Page 35

Tips on improving K-means clustering: Choosing K by using Cluster Quality Measures

• The Q measure given in Jagota takes into account homogeneity within clusters, but not separation between clusters

• Other measures try to combine these two characteristics (e.g., the Davies-Bouldin measure)

• An alternative approach is to look at cluster stability:
– Add random noise to the data many times and count how many pairs of data points no longer cluster together
– How much noise to add? It should reflect the estimated variance in the data

Page 36

What makes a clustering good?

• Clustering results can be different for different methods and distance metrics

• Except in the simplest of cases, result is sensitive to noise and outliers in the data

• As in the case of differential genes, we are looking for:
– Homogeneity: similarity within a cluster
– Separation: differences between clusters

Page 37

What makes a clustering good? Hypothesis Testing Approach

• Null hypothesis is that data has NO structure

• Generate a reference data population under the random (null) hypothesis, in which the data model a random structure, and compare it to the actual data

• Estimate a statistic that indicates data structure

Page 38

Cluster Quality

• Since any data can be clustered, how do we know our clusters are meaningful?
– The size (diameter) of the cluster vs. the inter-cluster distance
– Distance between the members of a cluster and the cluster's center
– Diameter of the smallest sphere

Page 39

Cluster Quality

[Figure: two clusters, each of diameter (size) 5; in one case the clusters are separated by a distance of 20, in the other by a distance of 5]

Quality of a cluster can be assessed by the ratio of its distance to the nearest cluster to its own diameter.
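A minimal sketch of such a ratio, assuming the diameter is the largest within-cluster distance and the distance to the nearest cluster is measured between centroids; both of these measurement choices are my own simplifications.

```python
import numpy as np

def quality_ratio(points, labels, cluster_id):
    """Distance from this cluster's centroid to the nearest other centroid,
    divided by the cluster's diameter (largest within-cluster distance).
    Larger values indicate a tighter, better-separated cluster."""
    labels = np.asarray(labels)
    members = points[labels == cluster_id]
    diameter = max(np.linalg.norm(a - b) for a in members for b in members)
    own_centroid = members.mean(axis=0)
    other_centroids = [points[labels == c].mean(axis=0)
                       for c in np.unique(labels) if c != cluster_id]
    nearest = min(np.linalg.norm(own_centroid - o) for o in other_centroids)
    return nearest / diameter if diameter > 0 else np.inf
```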

Page 40

Cluster Quality

Quality can be assessed simply by looking at the diameter of a cluster

A cluster can be formed even when there is no similarity between clustered patterns. This occurs because the algorithm forces k clusters to be created.

Page 41

Characteristics of k-means clustering

• The random selection of initial center points creates the following properties

– Non-Determinism

– May produce clusters without patterns

• One solution is to choose the centers randomly from existing patterns

Page 42

K-means clustering algorithm complexity

• Linear relationship with the number of data points, N

• CPU time required is proportional to cN
– c does not depend on N, but rather on the number of clusters, k

• Low computational complexity

• High speed