computational biology clustering parts taken from introduction to data mining by tan, steinbach,...

58
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Upload: brett-griffith

Post on 29-Jan-2016

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Computational Biology

Clustering

Parts taken from

Introduction to Data Mining

by

Tan, Steinbach, Kumar

Lecture Slides Week 9

Page 2: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Clustering

• A clustering is a set of clusters

• Important distinction between hierarchical and partitional sets of clusters

• Partitional Clustering– A division data objects into non-overlapping subsets (clusters) such

that each data object is in exactly one subset

• Hierarchical clustering– A set of nested clusters organized as a hierarchical tree

Page 3: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Partitional Clustering

Original Points A Partitional Clustering

Page 4: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Hierarchical Clustering

p4p1

p3

p2

p4 p1

p3

p2

p4p1 p2 p3

p4p1 p2 p3

Traditional Hierarchical Clustering

Non-traditional Hierarchical Clustering Non-traditional Dendrogram

Traditional Dendrogram

Page 5: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Notion of a Cluster can be Ambiguous

How many clusters?

Four Clusters Two Clusters

Six Clusters

Page 6: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Inter-cluster distances are maximized

Intra-cluster distances are

minimized

Page 7: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

Page 8: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

Page 9: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while other have only one.

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Page 10: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while other have only one.

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Page 11: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

Page 12: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

Page 13: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)

Page 14: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Hierarchical Clustering

• Produces a set of nested clusters organized as a hierarchical tree

• Can be visualized as a dendrogram– A tree like diagram that records the sequences of merges or

splits

1 3 2 5 4 60

0.05

0.1

0.15

0.2

1

2

3

4

5

6

1

23 4

5

Page 15: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Starting Situation

• Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

. Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

Page 16: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Intermediate Situation

• After some merging steps, we have some clusters

C1

C4

C2 C5

C3

C2C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

Page 17: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Intermediate Situation

• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

C1

C4

C2 C5

C3

C2C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

Page 18: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

After Merging

• The question is “How do we update the proximity matrix?”

C1

C4

C2 U C5

C3? ? ? ?

?

?

?

C2 U C5C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

Page 19: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

.

Similarity?

• MIN• MAX• Group Average• Distance Between Centroids• Other methods driven by an objective function

– Ward’s Method uses squared error

Proximity Matrix

Page 20: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

.Proximity Matrix

• MIN• MAX• Group Average• Distance Between Centroids• Other methods driven by an objective function

– Ward’s Method uses squared error

Page 21: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

.Proximity Matrix

• MIN• MAX• Group Average• Distance Between Centroids• Other methods driven by an objective function

– Ward’s Method uses squared error

Page 22: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

.Proximity Matrix

• MIN• MAX• Group Average• Distance Between Centroids• Other methods driven by an objective function

– Ward’s Method uses squared error

Page 23: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

.Proximity Matrix

• MIN• MAX• Group Average• Distance Between Centroids• Other methods driven by an objective function

– Ward’s Method uses squared error

Page 24: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Cluster Similarity: MIN or Single Link

• Similarity of two clusters is based on the two most similar (closest) points in the different clusters

– Determined by one pair of points, i.e., by one link in the proximity graph.

I1 I2 I3 I4 I5I1 1.00 0.90 0.10 0.65 0.20I2 0.90 1.00 0.70 0.60 0.50I3 0.10 0.70 1.00 0.40 0.30I4 0.65 0.60 0.40 1.00 0.80I5 0.20 0.50 0.30 0.80 1.00 1 2 3 4 5

Page 25: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Hierarchical Clustering: MIN

Nested Clusters Dendrogram

1

2

3

4

5

6

1

2

3

4

5

3 6 2 5 4 10

0.05

0.1

0.15

0.2

Page 26: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

• Two matrices – Proximity Matrix– “Incidence” Matrix

• One row and one column for each data point

• An entry is 1 if the associated pair of points belong to the same cluster

• An entry is 0 if the associated pair of points belongs to different clusters

• Compute the correlation between the two matrices– Since the matrices are symmetric, only the correlation between

n(n-1) / 2 entries needs to be calculated.

• High correlation indicates that points that belong to the same cluster are close to each other.

• Not a good measure for some density or contiguity based clusters.

Measuring Cluster Validity Via Correlation

Page 27: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Measuring Cluster Validity Via Correlation

• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y

Corr = -0.9235 Corr = -0.5810

Page 28: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

• Order the similarity matrix with respect to cluster labels and inspect visually.

Using Similarity Matrix for Cluster Validation

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x

y

Points

Po

ints

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100Similarity

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Page 29: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

End Theory I

• 5 min mindmapping• 10 min break

Page 30: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Practice I

Page 31: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Clustering Dataset

• We will use the same datasets as last week

• Have fun clustering with Orange– Try K-means clustering, hierachial clustering, MDS– Analyze differences in results

Page 32: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

K?

• What is the k in k-means again?

• How many clusters are in my dataset?

• Solution: iterate over a reasonable number of ks

• Do that and try to find out how many clusters there are in your data

Page 33: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

End Practice I

• 15 min break

Page 34: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Theory II

Page 35: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Microarrays

• Gene Expression:– We see difference between cels because of differential

gene expression,– Gene is expressed by transcribing DNA intosingle-stranded

mRNA,– mRNA is later translated into a protein,– Microarrays measure the level of mRNA expression

Page 36: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Microarrays

• Gene Expression:– mRNA expression represents dynamic aspects of cell,– mRNA is isolated and labeled using a fluorescent material,– mRNA is hybridized to the target; level of hybridization

corresponds to light emission which is measured with a laser

Page 37: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Microarrays

Page 38: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Microarrays

Page 39: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Microarrays

Page 40: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• Differentiating gene expression:

– R = G not differentiated

– R > G up-regulated

– R < G down regulated

Page 41: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• Problems:– Extract data from microarrays,– Analyze the meaning of the multiple arrays.

Page 42: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

Page 43: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• Problems:– Extract data from microarrays,– Analyze the meaning of the multiple arrays.

Page 44: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• Microarray data:

Page 45: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• Clustering:– Find classes in the data,– Identify new classes,– Identify gene correlations,– Methods:

• K-means clustering,

• Hierarchical clustering,

• Self Organizing Maps (SOM)

Page 46: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• Distance Measures:– Euclidean Distance:

– Manhattan Distance:

Page 47: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• K-means Clustering:– Break the data into K clusters,– Start with random partitioning,– Improve it by iterating.

Page 48: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• Agglomerative Hierarchical Clustering:

Page 49: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• Self-Organizing Feature Maps:– by Teuvo Kohonen, – a data visualization technique which helps to understand high

dimensional data by reducing the dimensions of data to a map.

Page 50: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• Self-Organizing Feature Maps:– humans simply cannot visualize high dimensional data as is,– SOM help us understand this high dimensional data.

Page 51: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• Self-Organizing Feature Maps:– Based on competitive learning,– SOM helps us by producing a map of usually 1 or 2 dimensions,– SOM plot the similarities of the data by grouping– similar data items together.

Page 52: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• Self-Organizing Feature Maps:

Page 53: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Processing Microarray Data

• Self-Organizing Feature Maps:

Input vector, synaptic weight vectorx = [x1, x2, …, xm]T

wj=[wj1, wj2, …, wjm]T, j = 1, 2,3, l

Best matching, winning neuroni(x) = arg min ||x-wj||, j =1,2,3,..,l

Weights wi are updated.

Page 54: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Figure 2. Output map containing the distributions of genes from the alpha30 database.

Chavez-Alvarez R, Chavoya A, Mendez-Vazquez A (2014) Discovery of Possible Gene Relationships through the Application of Self-Organizing Maps to DNA Microarray Databases. PLoS ONE 9(4): e93233. doi:10.1371/journal.pone.0093233http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093233

Page 55: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Figure 5. Color-coded output maps representing the final weight of neurons from the samples of the alpha30 database.

Chavez-Alvarez R, Chavoya A, Mendez-Vazquez A (2014) Discovery of Possible Gene Relationships through the Application of Self-Organizing Maps to DNA Microarray Databases. PLoS ONE 9(4): e93233. doi:10.1371/journal.pone.0093233http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093233

Page 56: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

End Theory II

• 5 min mindmapping• 10 min break

Page 57: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Practice II

Page 58: Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Microarray

• http://www.biolab.si/supp/bi-visprog/dicty/dictyExample.htm

• Use the data as described on the page