Clustering microarray data 09/26/07

Post on 22-Dec-2015

Page 1: Clustering microarray data 09/26/07. Sub-classes of lung cancer types have signature genes (Bhattacharjee 2001)

Clustering microarray data

09/26/07

Sub-classes of lung cancer types have signature genes

(Bhattacharjee 2001)

David J. Lockhart & Elizabeth A. Winzeler, NATURE | VOL 405 | 15 JUNE 2000, p827

Promoter analysis of commonly regulated genes

Discovery of new cancer subtype

These classes are unknown at the time of study.

Overview

• Clustering is an unsupervised learning method. It is used to build groups of genes with related expression patterns.

• The classes are not known in advance.

• The aim is to discover new patterns from microarray data.

• In contrast, supervised learning refers to the learning process where classes are known. The aim is to define classification rules to separate the classes. Supervised learning will be discussed in the next lecture.

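Dissimilarity function

To identify clusters, we first need to define what “close” means. There are many choices of distance:

• Euclidean distance: $d_{xy} = \sqrt{\sum_{i=1}^{p} |x_i - y_i|^2}$

• 1 – Pearson correlation: $d_{xy} = 1 - \langle x, y \rangle / \sqrt{\langle x, x \rangle \langle y, y \rangle}$, with x and y mean-centered

• Manhattan distance: $d_{xy} = \sum_{i=1}^{p} |x_i - y_i|$

• …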
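The three distances listed on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names are made up for the example, and the Pearson version assumes the usual definition in which both profiles are mean-centered before the inner products are taken.

```python
import math

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation, computed on mean-centered profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    num = sum(a * b for a, b in zip(xc, yc))
    den = math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))
    return 1.0 - num / den

# Two hypothetical expression profiles: y is exactly 2 * x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(manhattan(x, y), euclidean(x, y), pearson_distance(x, y))
```

The example profiles differ in magnitude but not in shape, so the Manhattan and Euclidean distances are large while the correlation-based distance is zero. This is one reason correlation-based dissimilarities are often preferred when only the pattern of expression matters.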

Where is the “truth”?

“In the context of unsupervised learning, there is no such direct measure of success. It is difficult to ascertain the validity of inference drawn from the output of most unsupervised learning algorithms. One must often resort to heuristic arguments not only for motivating the algorithm, but also for judgments as to the quality of results. This uncomfortable situation has led to heavy proliferation of proposed methods, since effectiveness is a matter of opinion and cannot be verified directly.”

Hastie et al. 2001; ESL

Clustering Methods

• Partitioning methods
  – Seek to optimally divide objects into a fixed number of clusters.

• Hierarchical methods
  – Produce a nested sequence of clusters.

(Speed, Chapter 4)

Methods

• k-means

• Hierarchical clustering

• Self-organizing maps (SOM)

k-means

• Divide objects into k clusters.

• Goal is to minimize the total intra-cluster variance

$W_K(C) = \sum_{k=1}^{K} \sum_{C(i)=k} \| x_i - \hat{\mu}_k \|^2$,

where $\hat{\mu}_k$ is the centroid (mean) of cluster k.

• Global minimum is difficult to obtain.

Algorithm for k-means clustering

• Step 1: Initialization: randomly select k centroids.

• Step 2: For each object, find its closest centroid, assign the object to the corresponding cluster.

• Step 3: For each cluster, update its centroid to the mean position of all objects in that cluster.

• Repeat Steps 2 and 3 until convergence.
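The three steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the helper names (`sq_dist`, `mean_point`, `kmeans`, `within_ss`) are invented for the example, the initial centroids are assumed to be k randomly chosen objects (one common way to "randomly select k centroids"), and convergence is taken to mean that the assignments stop changing.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k distinct objects as the initial centroids (assumption).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every object to its closest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return labels, centroids

def within_ss(points, labels, centroids):
    """Total within-cluster sum of squares, the quantity k-means minimizes."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Toy data: two well-separated groups of three points each.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(data, k=2)
print(labels, within_ss(data, labels, centroids))
```

Each pass through Steps 2 and 3 can only lower or keep the within-cluster sum of squares, which is why the procedure converges, although only to a local minimum.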

Shows the initial randomized centers and a number of points.

Centers have been associated with the points and have been moved to the respective centroids.

Now, the association is shown in more detail, once the centroids have been moved.

Again, the centers are moved to the centroids of the corresponding associated points.

Properties of k-means

• Achieves a local minimum of $W_K(C) = \sum_{k=1}^{K} \sum_{C(i)=k} \| x_i - \hat{\mu}_k \|^2$

• Very fast.

Practical issues with k-means

• k must be known in advance

• Results are dependent on initial assignment of centroids.

How to choose k?

Milligan & Cooper (1985) compared 30 published rules.

1. Calinski & Harabasz (1974): choose the k that maximizes

$CH(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)}$

2. Hartigan (1975): stop when H(k) < 10.

W(k) = total sum of squares within clusters
B(k) = sum of squares between cluster means
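As a rough illustration of the Calinski & Harabasz rule, the sketch below computes CH(k) = [B(k)/(k-1)] / [W(k)/(n-k)] for a given labeling. It assumes the usual convention that B(k) weights each cluster mean's squared distance from the grand mean by the cluster size; the function and variable names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def calinski_harabasz(points, labels):
    """CH index for a labeling with at least 2 and fewer than n clusters."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    grand_mean = mean_point(points)
    within = 0.0   # W(k): sum of squares within clusters
    between = 0.0  # B(k): size-weighted sum of squares between cluster means
    for members in clusters.values():
        center = mean_point(members)
        within += sum(sq_dist(p, center) for p in members)
        between += len(members) * sq_dist(center, grand_mean)
    return (between / (k - 1)) / (within / (n - k))

# Toy data: two well-separated groups of three points each.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
ch_two = calinski_harabasz(data, [0, 0, 0, 1, 1, 1])    # natural 2-cluster split
ch_three = calinski_harabasz(data, [0, 0, 0, 1, 1, 2])  # one group split further
print(ch_two, ch_three)
```

On this toy data the two-cluster labeling scores higher than the three-cluster one, so the rule would select k = 2.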

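How to choose k (continued)?

[Figure: log W_k plotted against k for the observed data and for random reference data; the Gap is the vertical distance between the two curves.]

(Tibshirani 2001) Estimate log W_k for random data (uniformly distributed in a rectangle).

Choose k so that Gap is largest.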
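A simplified sketch of the Gap idea is given below. It follows the rule as stated on the slide (compare log W_k for the observed data with its average over uniform reference data drawn from the bounding rectangle, then pick the k with the largest gap) and omits the standard-error adjustment used in the published procedure. The helper names, the number of reference data sets, and the toy data are all assumptions made for illustration.

```python
import math
import random

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans_w(points, k, rng, restarts=5, max_iter=50):
    """Smallest within-cluster sum of squares W_k over several random starts."""
    best = float("inf")
    for _ in range(restarts):
        centroids = [list(p) for p in rng.sample(points, k)]
        for _ in range(max_iter):
            labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
            updated = []
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                updated.append([sum(c) / len(members) for c in zip(*members)]
                               if members else centroids[j])
            if updated == centroids:  # centroids stopped moving
                break
            centroids = updated
        w = sum(min(sq_dist(p, c) for c in centroids) for p in points)
        best = min(best, w)
    return best

def gap_statistic(points, k_max, n_ref=10, seed=0):
    """Gap(k) = mean over reference sets of log W_k  -  log W_k (observed)."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps = []
    for k in range(1, k_max + 1):
        log_w_obs = math.log(kmeans_w(points, k, rng))
        log_w_ref = []
        for _ in range(n_ref):
            # Reference data: uniform in the bounding rectangle of the data.
            ref = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
                   for _ in points]
            log_w_ref.append(math.log(kmeans_w(ref, k, rng)))
        gaps.append(sum(log_w_ref) / n_ref - log_w_obs)
    return gaps

# Hypothetical data: two tight groups of ten points each.
gen = random.Random(1)
data = ([[gen.gauss(0, 0.5), gen.gauss(0, 0.5)] for _ in range(10)] +
        [[gen.gauss(10, 0.5), gen.gauss(10, 0.5)] for _ in range(10)])
gaps = gap_statistic(data, k_max=3)
best_k = 1 + gaps.index(max(gaps))
print(gaps, best_k)
```

With strongly separated groups, the gap at k = 2 is far larger than at k = 1. The comparison between k = 2 and larger k is much closer and depends on the random reference sets, which is part of the motivation for the standard-error rule in the original method.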

How to select initial centroids

• Repeat the procedure many times with randomly chosen initial centroids.

• Alternatively, initialize centroids “smartly”, e.g., by hierarchical clustering.

[Figure: within-cluster sum of squares for two solutions. X: 965.32; O: 305.09]

K-means requires good initial values. Hierarchical clustering could be used but sometimes performs poorly.

Hierarchical clustering

Hierarchical clustering builds a hierarchy of clusters, represented by a tree (called a dendrogram). Close clusters are joined together. The height of a branch represents the dissimilarity between the two clusters joined by it.

How to construct a dendrogram

• Bottom-up approach
  – Initialization: each cluster contains a single object.
  – Iteration: merge the “closest” clusters.
  – Stop: when all objects are included in a single cluster.

• Top-down approach
  – Starting from a single cluster containing all objects, iteratively partition into smaller clusters.

• Truncate the dendrogram at a similarity threshold level, e.g., correlation > 0.6, or require each cluster to contain at least a minimum number of objects.
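The bottom-up construction can be sketched as a naive agglomerative loop. The code below is only an illustration under stated assumptions: it uses Euclidean distance, offers single, complete, and average linkage as the definition of "closest", records each merge with its height, and truncates the tree at a distance threshold (rather than the correlation threshold mentioned above). The names `agglomerate` and `cut` are hypothetical, and the cubic-time search is meant for clarity, not speed.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def agglomerate(points, linkage="average"):
    """Return the merge history as (cluster_id_a, cluster_id_b, height) tuples."""
    clusters = {i: [i] for i in range(len(points))}  # start: one object each
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                dists = [euclidean(points[i], points[j])
                         for i in clusters[a] for j in clusters[b]]
                if linkage == "single":
                    d = min(dists)
                elif linkage == "complete":
                    d = max(dists)
                else:
                    d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge the two closest clusters under a new id.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges

def cut(merges, n_points, max_height):
    """Clusters obtained by keeping only merges with height <= max_height."""
    clusters = {i: [i] for i in range(n_points)}
    next_id = n_points
    for a, b, d in merges:
        if d > max_height:  # heights are non-decreasing for these linkages
            break
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())

# Five hypothetical objects on a line.
pts = [[0.0], [1.0], [5.0], [6.0], [20.0]]
single = agglomerate(pts, "single")
complete = agglomerate(pts, "complete")
print(single)
print(cut(single, len(pts), 2.0))
```

The merge heights are exactly the branch heights of the dendrogram, so cutting at a chosen height is equivalent to drawing a horizontal line across the tree.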

Hierarchical Clustering

[Figure: objects 1 to 6 and the corresponding dendrogram.]

Dendrogram can be reordered

[Figure: the same dendrogram drawn with different leaf orders, e.g., 1 6 3 4 5 2 and 1 6 4 3 5 2.]

Ordered dendrograms

There are $2^{n-1}$ linear orderings of n elements (n = number of genes or conditions), since each of the n – 1 internal nodes can be flipped.

Maximizing adjacent similarity is impractical. So order by:
• Average expression level,
• Time of max induction, or
• Chromosome positioning

(Eisen 1998)

Properties of Hierarchical Clustering

• Top-down approach is more favorable when only a few clusters are desired.

• Single linkage tends to produce long chains of clusters.

• Complete linkage tends to produce compact clusters.

Partitioning clustering vs hierarchical clustering

[Figure, three slides: the dendrogram of objects 1 to 6 cut to give k = 4, k = 3, and k = 2 clusters, each shown beside the corresponding partition of the objects.]

Self-organizing Map

• Imposes partial structure on the clusters (in contrast to the rigid structure of hierarchical clustering, the strong prior hypotheses used in Bayesian clustering, and the nonstructure of k-means clustering).

• Easy visualization and interpretation.

SOM Algorithm

• Initialize prototypes $m_j$ on a lattice of p × q nodes. Each prototype is a weight vector whose dimension is the same as that of the input data.

• Iteration: for each observation $x_i$, find the closest prototype $m_j$, and for all neighbors $m_k$ of $m_j$, move $m_k$ toward $x_i$ by

$m_k \leftarrow m_k + \alpha (x_i - m_k)$

• During iterations, reduce the learning rate $\alpha$ and the neighborhood size r gradually.

• May take many iterations before convergence.
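The update rule m_k ← m_k + α(x_i − m_k) can be sketched as follows. This is a hedged illustration rather than a definitive SOM implementation: neighbors are assumed to be the nodes whose lattice (grid) distance from the winning node is at most r, prototypes are initialized from randomly chosen observations, and both α and r are assumed to decay linearly. The function names and default values are hypothetical.

```python
import math
import random

def som_step(prototypes, coords, x, alpha, radius):
    """One online update for a single observation x (modifies prototypes)."""
    # Closest prototype m_j in data space.
    j = min(range(len(prototypes)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(x, prototypes[k])))
    # Move every lattice neighbor m_k of m_j toward x.
    for k, m in enumerate(prototypes):
        if math.dist(coords[k], coords[j]) <= radius:
            prototypes[k] = [mk + alpha * (xi - mk) for mk, xi in zip(m, x)]
    return j

def train_som(data, p, q, n_iter=200, alpha0=0.5, radius0=None, seed=0):
    rng = random.Random(seed)
    coords = [(r, c) for r in range(p) for c in range(q)]   # p x q lattice
    prototypes = [list(rng.choice(data)) for _ in coords]   # assumed init
    if radius0 is None:
        radius0 = max(p, q) / 2
    for t in range(n_iter):
        shrink = 1 - t / n_iter  # linear decay of learning rate and radius
        x = rng.choice(data)     # case-by-case presentation
        som_step(prototypes, coords, x, alpha0 * shrink, radius0 * shrink)
    return coords, prototypes

# Hypothetical data: two groups of points.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
coords, prototypes = train_som(data, p=2, q=2)
print(prototypes)
```

Because neighbors on the lattice are pulled together, nearby nodes end up with similar prototypes. That shared movement is the "partial structure" that distinguishes a SOM from k-means.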

(Hastie 2001)

(Hastie 2001)

(Hastie 2001)

SOM clustering of periodic genes

Applications to microarray data

• With only a few nodes, one tends not to see distinct patterns and there is large within-cluster scatter. As nodes are added, distinctive and tight clusters emerge.

• SOM is an “incremental learning” algorithm involving case-by-case presentation rather than batch presentation.

• As with all exploratory data analysis tools, the use of SOMs involves inspection of the data to extract insights.

Other Clustering Methods

• Gene Shaving

• MDS

• Affinity Propagation

• Spectral Clustering

• Two-way clustering

• …

“Algorithms for unsupervised classification or cluster analysis abound. Unfortunately however, algorithm development seems to be a preferred activity to algorithm evaluation among methodologists.

……

No consensus or clear guidelines exist to guide these decisions. Cluster analysis always produces clustering, but whether a pattern observed in the sample data characterizes a pattern present in the population remains an open question. Resampling-based methods can address this last point, but results indicate that most clusterings in microarray data sets are unlikely to reflect reproducible patterns or patterns in the overall population.”

-Allison et al. (2006)

Stability of a cluster

Motivation: Real clusters should be reproducible under perturbation: adding noise, omission of data, etc.

Procedure:
• Perturb observed data by adding noise.
• Apply the clustering procedure to the perturbed data.
• Repeat the above steps to generate a sample of clusterings.
• Global test
• Cluster-specific tests: R-index, D-index.

(McShane et al. 2002)

Global test

• Null hypothesis: Data come from a multivariate Gaussian distribution.

Procedure:

• Consider a subspace spanned by the top principal components.

• Estimate distribution of “nearest neighbor” distances

• Compare observed with simulated data.

R-index

• If cluster i contains $n_i$ objects, then it contains $m_i = n_i(n_i - 1)/2$ pairs of objects.

• Let $c_i$ be the number of those pairs that still fall in the same cluster when the perturbed data are re-clustered.

• $r_i = c_i / m_i$ measures the robustness of cluster i.

• R-index $= \sum_i c_i / \sum_i m_i$ measures the overall stability of a clustering algorithm.
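A small sketch of the R-index computation for one perturbed re-clustering is shown below. It assumes that both clusterings are given as label lists over the same objects; the function name and example labels are hypothetical. In practice the counts would be averaged over many perturbed data sets.

```python
from itertools import combinations

def r_index(original, perturbed):
    """Per-cluster robustness r_i = c_i / m_i and overall R = sum c_i / sum m_i."""
    clusters = {}
    for obj, lab in enumerate(original):
        clusters.setdefault(lab, []).append(obj)
    r = {}
    total_c = total_m = 0
    for lab, members in clusters.items():
        pairs = list(combinations(members, 2))  # m_i = n_i (n_i - 1) / 2 pairs
        m = len(pairs)
        # c_i: pairs that are still together after re-clustering.
        c = sum(1 for a, b in pairs if perturbed[a] == perturbed[b])
        r[lab] = c / m if m else float("nan")   # singletons have no pairs
        total_c += c
        total_m += m
    return r, total_c / total_m

# Hypothetical labels for seven objects before and after perturbation.
original = [0, 0, 0, 1, 1, 1, 1]
perturbed = [5, 5, 7, 7, 7, 7, 2]
r, R = r_index(original, perturbed)
print(r, R)
```

Cluster labels do not need to match between the two clusterings, because only co-membership of pairs is compared.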

D-index

• For each cluster, determine the closest cluster for the perturbed data.

• Calculate the average discrepancy between the clusters for the original and perturbed data: omission vs addition.

• The D-index is the sum of all cluster-specific discrepancies.

Applications

• 16 prostate cancer samples; 9 benign tumor samples

• 6500 genes

• Use hierarchical clustering to obtain 2, 3, and 4 clusters.

• Question: are these clusters reliable?

Issues with calculating R and D indices

• How large should the perturbation be?

• How to quantify the significance level?

• What about nested consistency?

Acknowledgment

• Slide sources from
  – Cheng Li