Lecture 10. Clustering Algorithms The Chinese University of Hong Kong CSCI3220 Algorithms for Bioinformatics


Page 1

Lecture 10. Clustering Algorithms

The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics

Page 2

CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2015

Lecture outline
1. Numeric datasets and clustering
2. Some clustering algorithms
– Hierarchical clustering
• Dendrograms
– K-means
– Subspace and bi-clustering algorithms

Last update: 21-Nov-2015

Page 3

NUMERIC DATASETS AND CLUSTERING

Part 1

Page 4

Numeric datasets in bioinformatics
• So far we have mainly studied problems related to biological sequences
• Sequences represent the static states of an organism
– Like a program and its data stored on a hard disk
• Numeric measurements represent the dynamic states
– Like a program and its data loaded into memory at run time

Page 5

Numeric measurements
• Activity level of a gene (expression level)
– mRNA level (number of copies in a cell)
– Protein level
• Occupancy of a transcription factor at a binding site
• Fraction of C's in CpG dinucleotides being methylated
• Frequency of a residue of a histone protein being acetylated
• Fraction of A's on an mRNA being edited to I's (inosines)
• Higher-level measurements: heart rate, blood pressure and cholesterol level of patients
• ...

Adenosine (A)

Inosine (I)

Image source: Wikipedia

Page 6

Gene expression
• Protein abundance is the best way to measure the activity of a protein-coding gene
– However, not much data is available because the experiments are difficult
• mRNA levels are not ideal indicators of gene activity
– mRNA level and protein level are not very correlated, due to mRNA degradation, translational efficiency, post-translational modifications, and so on
– However, it is very easy to measure mRNA levels
– High-throughput experiments can measure the mRNA levels of many genes at a time: microarrays and RNA (cDNA) sequencing

Page 7

Microarrays
• The basic idea is "hybridization"
– For each gene, we design short sequences (probes) that are unique to the gene
• Usually 25-75 nucleotides
– When RNA is converted back to DNA, if it is complementary to a probe, it will bind to the probe – "hybridization"
• Ideally this happens only for perfect matches, but sometimes hybridization also occurs with some mismatches
– Note: we need to know the DNA sequences of the genes

Page 8

Hybridization

Image source: Wikipedia

Page 9

The arrays

Image sources: http://www4.carleton.ca/jmc/catalyst/2006s/images/dk-PersMed3.jpg, http://bioweb.wku.edu/courses/biol566/Images/stemAffyChip.jpg

Page 10

Processing workflows

Image source: http://www.stat.berkeley.edu/users/terry/Classes/s246.2004/Week9/2004L17Stat246.pdf

Page 11

RNA sequencing
• Microarray data are quite noisy because of cross-hybridization, a narrow signal range, analog measurements, etc.
• RNA sequencing is a newer technology that gives better quality
– Convert RNAs back to cDNAs, sequence them, and identify which genes they correspond to
– Better signal-to-noise ratio than microarrays
– "Digital": expression level represented by read counts
– No need for prior knowledge about the sequences
– If a sequence is not unique to a gene, we cannot determine which gene it comes from
• Also a problem for microarrays

Page 12

RNA sequencing

Image credit: Wang et al., Nature Reviews Genetics 10(1):57-63, (2009)

Page 13

Processing RNA-seq data
• Many steps, and we will not go into the details
– Quality check
– Read trimming and filtering
– Read mapping (BWT, suffix array, etc.)
– Data normalization
– ...

Page 14

Gene expression data
• Final form of data from microarray or RNA-seq:
– A matrix of real numbers
– Each row corresponds to a gene
– Each column corresponds to a sample/experiment:
• A particular condition
• A cell type (e.g., cancer)
• Questions:
– Are there genes that show similar changes in their expression levels across experiments?
• The genes may have related functions
– Are there samples with a similar set of genes expressed?
• The samples may be of the same type

Page 15

Clustering of gene expression data
• Clustering: grouping of related objects into clusters
– An object could be a gene or a sample
– Usually clustering is done on both. When genes are the objects, each sample is an attribute. When samples are the objects, each gene is an attribute
• Goals:
– Similar objects are in the same cluster
– Dissimilar objects are in different clusters
• Could define a scoring function to evaluate how good a set of clusters is
• Most clustering problems are NP-hard
– We will study heuristic algorithms

Page 16

Heatmap and clustering results
• Color: expression level

Image credit: Borries and Wang, Computational Statistics & Data Analysis 53(12):3987-3998, (2009); Alizadeh et al., Nature 403(6769):503-511, (2000)

[Figure: clustered heatmap with genes as rows and samples as columns]

Page 17

SOME CLUSTERING ALGORITHMS
Part 2

Page 18

Hierarchical clustering
• One of the most commonly used clustering algorithms is agglomerative hierarchical clustering
– Agglomerative: merging
– There are also divisive hierarchical clustering algorithms
• The algorithm:
1. Treat each object as a cluster by itself
2. Compute the distance between every pair of clusters
3. Merge the two closest clusters
4. Re-compute the distances between the merged cluster and each other cluster
5. Repeat steps 3 and 4 until only one cluster is left
• Same as UPGMA, but without the phylogenetic context
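The five steps above can be sketched directly in code. This is a minimal, unoptimized O(n³) illustration (the function and variable names are our own, not from the lecture):

```python
from itertools import combinations

def euclidean(p, q):
    # Straight-line distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(c1, c2, points):
    # Distance between two clusters = closest pair of their points
    return min(euclidean(points[a], points[b]) for a in c1 for b in c2)

def agglomerative(points, cluster_dist):
    # Steps 1-5: start with singleton clusters, repeatedly merge the
    # closest pair until one cluster remains; record the merge order
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]], points))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

Each recorded merge corresponds to one internal node of the dendrogram; n objects always produce n − 1 merges.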

Page 19

Hierarchical clustering
• 2D illustration:
– Each point is a gene
– The coordinates of a point indicate the expression values of the gene in two samples

[Figure: points A-F plotted against Sample 1 and Sample 2, with the corresponding dendrogram (similar to a phylogenetic tree) over leaves A B C D E F]

Page 20

Representing a dendrogram
• Since a dendrogram is essentially a tree, we can represent it using any tree format
– For example, the Newick format: (((A,B),(C,(D,E))),F);
– We could also use the Newick format to specify how the leaves should be ordered in a visualization
• For example, the Newick string (F,((B,A),((D,E),C))); comes from the same merge order of the clusters but orders the leaves differently

[Figure: two dendrograms with the same topology, one with leaf order A B C D E F and one with leaf order F B A D E C]
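Since the leaf order in a visualization is simply the left-to-right order of names in the Newick string, it can be recovered with a tiny helper. This is an illustrative sketch for name-only Newick strings like the ones above (no branch lengths); `leaf_order` is our own name, not part of any standard library:

```python
import re

def leaf_order(newick):
    # Leaf names are the alphanumeric tokens of the Newick string,
    # read left to right; the punctuation carries only the topology
    return re.findall(r"[A-Za-z0-9_]+", newick)
```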

Page 21

More details
• Three questions:
– How to compute the distance between two points?
– How to compute the distance between two clusters?
– How to perform these computations efficiently?

Page 22

Distance
• Most common: Euclidean distance

$d(x_{i_1}, x_{i_2}) = \sqrt{\sum_{j=1}^{m} (x_{i_1 j} - x_{i_2 j})^2}$

– $x_{i_1 j}$ is the expression level of the $i_1$-th object (say, gene) at the $j$-th attribute (say, sample), and $m$ is the total number of attributes
– Need to normalize the attributes
• Also common: (1 − Pearson correlation) / 2. Pearson correlation is a similarity measure with value between −1 and 1:

$\rho(x_{i_1}, x_{i_2}) = \dfrac{\sum_{j=1}^{m} (x_{i_1 j} - \bar{x}_{i_1})(x_{i_2 j} - \bar{x}_{i_2})}{\sqrt{\sum_{j=1}^{m} (x_{i_1 j} - \bar{x}_{i_1})^2}\ \sqrt{\sum_{j=1}^{m} (x_{i_2 j} - \bar{x}_{i_2})^2}}$, where $\bar{x}_i = \dfrac{1}{m} \sum_{j=1}^{m} x_{ij}$
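Both measures can be written in a few lines. A sketch assuming plain Python lists of expression values (function names are ours):

```python
import math

def euclidean_distance(x, y):
    # d(x, y) = sqrt(sum over j of (x_j - y_j)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    # (1 - Pearson correlation) / 2 maps a similarity in [-1, 1]
    # to a distance in [0, 1]
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return (1 - cov / (sx * sy)) / 2
```

Perfectly correlated vectors get Pearson distance 0; perfectly anti-correlated vectors get 1, regardless of their Euclidean distance.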

Page 23

Euclidean distance vs. correlation
• Two points have a small Euclidean distance if their attribute values are close (but not necessarily correlated)
• Two points have a large Pearson correlation if their attribute values have consistent trends (but not necessarily close)

Pair           | Euclidean distance | Pearson correlation
Gene 1, Gene 2 | 3.32               | 0.43
Gene 1, Gene 3 | 5.66               | 0.83
Gene 2, Gene 3 | 5.57               | 0.53

[Figure: expression levels (0-8) of Gene 1, Gene 2 and Gene 3 across Sample 1 to Sample 5]

Page 24

Which one to use?
• Sometimes absolute expression values are more important
– Example: when there is a set of homogeneous samples (e.g., all of a certain cancer type), and the goal is to find genes that are all highly expressed or all lowly expressed
• Usually the increase-decrease trend is more important than absolute expression values
– Example: when detecting changes between two sets of samples or across a number of time points

Page 25

Similarity between two clusters
• Several schemes (using Euclidean distance as an example):
– Average-link: average over all pairs of points (used by UPGMA)

$d(C, C') = \dfrac{\sum_{x \in C, x' \in C'} d(x, x')}{|C|\,|C'|}$

– Single-link: closest among all pairs of points

$d(C, C') = \min_{x \in C, x' \in C'} d(x, x')$

– Complete-link: farthest among all pairs of points

$d(C, C') = \max_{x \in C, x' \in C'} d(x, x')$

– Centroid-link: distance between the centroids

$d(C, C') = \sqrt{\sum_{j=1}^{m} \left( \dfrac{\sum_{x_i \in C} x_{ij}}{|C|} - \dfrac{\sum_{x_{i'} \in C'} x_{i'j}}{|C'|} \right)^2}$
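The four schemes differ only in how they aggregate the pairwise distances. A compact sketch over tuples of coordinates, using Euclidean distance (all names are ours):

```python
def _d(p, q):
    # Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(C1, C2):
    # Closest pair across the two clusters
    return min(_d(x, y) for x in C1 for y in C2)

def complete_link(C1, C2):
    # Farthest pair across the two clusters
    return max(_d(x, y) for x in C1 for y in C2)

def average_link(C1, C2):
    # Mean over all |C1| * |C2| pairs
    return sum(_d(x, y) for x in C1 for y in C2) / (len(C1) * len(C2))

def centroid_link(C1, C2):
    # Distance between the centroids, not the average pairwise distance
    c1 = tuple(sum(v) / len(C1) for v in zip(*C1))
    c2 = tuple(sum(v) / len(C2) for v in zip(*C2))
    return _d(c1, c2)
```

For 1D clusters C1 = {0, 2} and C2 = {5, 9}, the four schemes give 3, 9, 6 and 6 respectively, which makes the min/max/average distinction concrete.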

Page 26

Similarity between two clusters
• Average-link: equal vote by all members of the clusters, preferring to merge clusters liked by many pairs
• Single-link: may merge two clusters even if just one pair likes it very much
• Complete-link: will not merge two clusters even if just one pair does not like it
• Centroid-link: similar to average-link, but easier to compute

Page 27

Similarity between two clusters
• Suppose clusters C1 and C2 have already been formed
– Average-link prefers to merge I and C2 next, as their points are close on average
– Single-link prefers to merge C1 and E next, as C and E are very close
– Complete-link prefers to merge I and C2 next, as I is not too far from F, G or H (as compared to A-E, C-H, E-H, etc.)
– Centroid-link prefers to merge C1 and C2 next, as their centroids are close (and not so affected by the long distance between C and H)

[Figure: points A-I in the plane, with clusters C1 and C2 already formed]

Page 28

Updating
• To determine which two clusters are most similar, we need to compute the distance between every pair of clusters
– At the beginning, this involves O(n²) computations for n objects, followed by a way to find the smallest value, either:
• Linear scan, which would take O(n²) time, OR
• Sorting, which would take O(n² log n²) = O(n² log n) time
– After a merge, we need to remove the distances involving the two merging clusters, and add the distances between the new cluster and all other clusters: O(n) between-cluster distance calculations (assuming each takes constant time for now – we will come back to this later), followed by either:
• Linear scan of the new list, which would take O(n²) time, OR
• Re-sorting, which would take O(n² log n) time, OR
• Binary search and removing/inserting distances, which would take O(n log n²) = O(n log n) time

Page 29

Updating
• Summary:
– At the beginning:
• Linear scan: O(n²) time, OR
• Sorting: O(n² log n) time
– After each merge:
• Linear scan: O(n²) time, OR
• Re-sorting: O(n² log n) time, OR
• Binary search and removing/inserting distances: O(n log n) time
– In total:
• Linear scan: O(n³) time
• Maintaining a sorted list: O(n² log n) time
– Can these be done faster?

Page 30

Heap
• A heap (also called a priority queue) maintains the minimum value of a list of numbers without sorting
• Ideas:
– Build a binary tree structure with each node storing one of the numbers, where the root of a sub-tree is always smaller than all other nodes in the sub-tree
– Store the tree in an array that allows efficient updates

Page 31

Heap
• Example: tree representation (each node is the distance between two clusters)

Level 0: 1
Level 1: 4, 9
Level 2: 5, 6, 10, 13
Level 3: 12, 8, 11

• Corresponding array representation (notice that the array is not entirely sorted):

Index: 0 1 2 3 4 5  6  7  8 9
Value: 1 4 9 5 6 10 13 12 8 11

– If the first entry has index 0, then the children of the node at entry i are at entries 2i+1 and 2i+2
– The smallest value is always at the first entry of the array

Page 32

Constructing a heap
• Starting with any input array:
– From the node at entry N/2 down to the node at entry 1 (1-indexed), iteratively swap each node with its smallest child while that child is smaller
• N is the total number of nodes, which is equal to n(n−1)/2, the number of pairs for n clusters
• Example input: 13, 5, 10, 4, 11, 1, 9, 12, 8, 6

Level 0: 13
Level 1: 5, 10
Level 2: 4, 11, 1, 9
Level 3: 12, 8, 6

Page 33

Constructing a heap

[Figure: step-by-step sift-down operations transforming the input tree (13; 5, 10; 4, 11, 1, 9; 12, 8, 6) into the heap (1; 4, 9; 5, 6, 10, 13; 12, 8, 11)]

Page 34

Constructing a heap
• Input: 13, 5, 10, 4, 11, 1, 9, 12, 8, 6
• Output: 1, 4, 9, 5, 6, 10, 13, 12, 8, 11
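The construction above can be reproduced with a standard sift-down heapify. A sketch in 0-indexed Python, so the children of entry i are at 2i+1 and 2i+2 as on Page 31 (function names are ours):

```python
def sift_down(heap, i, n):
    # Swap the node at entry i with its smallest child until it is
    # no larger than both children
    while True:
        smallest = i
        for c in (2 * i + 1, 2 * i + 2):
            if c < n and heap[c] < heap[smallest]:
                smallest = c
        if smallest == i:
            return
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest

def build_heap(values):
    # Heapify from the last internal node down to the root; as the
    # amortized analysis on the next page shows, this is O(N) in total
    heap = list(values)
    for i in range(len(heap) // 2 - 1, -1, -1):
        sift_down(heap, i, len(heap))
    return heap
```

Running `build_heap([13, 5, 10, 4, 11, 1, 9, 12, 8, 6])` reproduces the slide's output array [1, 4, 9, 5, 6, 10, 13, 12, 8, 11].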

Page 35

Constructing a heap
• Time needed:
– Apparently, for each of the O(N) nodes, up to O(log N) swaps are needed, so O(N log N) time in total
• Same as sorting
– However, a careful amortized analysis shows that only O(N) time is needed
• Why? Because only one node could need log N swaps, two nodes could need log N − 1 swaps, etc.
• For example, for 15 nodes:
– N log N = 15 log₂ 15 ≈ 15(3) = 45
– 1(3) + 2(2) + 4(1) + 8(0) = 11

Page 36

Deleting a value
• For many applications of heaps, deletion only removes the value at the root
• In our clustering application, for each cluster we maintain the entries corresponding to the distances related to it, so after the cluster is merged we remove all these distance values from the heap
• In both cases, deletion is done by moving the last value to the deleted entry, then re-"heapifying"

Page 37

Deleting a value
• Example: deleting 4

Before: 1, 4, 9, 5, 6, 10, 13, 12, 8, 11
Move the last value (11) into the deleted entry: 1, 11, 9, 5, 6, 10, 13, 12, 8
After re-heapifying: 1, 5, 9, 8, 6, 10, 13, 12, 11

• Each deletion takes O(log N) time
• After each merge of two clusters, we need to remove O(n) distances from the heap – O(n log N) = O(n log n) time in total

Page 38

Inserting a new value
• Add to the end
• Iteratively swap with the parent until not smaller than the parent
• Example: adding value 3

Before: 1, 5, 9, 8, 6, 10, 13, 12, 11
After appending 3: 1, 5, 9, 8, 6, 10, 13, 12, 11, 3
After sifting up: 1, 3, 9, 8, 5, 10, 13, 12, 11, 6

• Each insertion takes O(log N) time
• After each merge of two clusters, we need to insert O(n) distances into the heap – O(n log N) = O(n log n) time in total
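Deletion and insertion as described on the last two slides, in one self-contained sketch (0-indexed; names are ours). Note that a fully general `delete_at` would also sift up when the moved value is smaller than its new parent; the slides' examples only need the downward pass shown here:

```python
def sift_down(heap, i):
    # Re-heapify downwards from entry i
    n = len(heap)
    while True:
        smallest = i
        for c in (2 * i + 1, 2 * i + 2):
            if c < n and heap[c] < heap[smallest]:
                smallest = c
        if smallest == i:
            return
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest

def delete_at(heap, i):
    # Move the last value into the deleted entry, then re-heapify
    heap[i] = heap[-1]
    heap.pop()
    if i < len(heap):
        sift_down(heap, i)

def insert(heap, value):
    # Add to the end, then swap with the parent until not smaller
    heap.append(value)
    i = len(heap) - 1
    while i > 0 and heap[(i - 1) // 2] > heap[i]:
        heap[i], heap[(i - 1) // 2] = heap[(i - 1) // 2], heap[i]
        i = (i - 1) // 2
```

Deleting 4 from [1, 4, 9, 5, 6, 10, 13, 12, 8, 11] and then inserting 3 reproduces the two arrays on these slides.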

Page 39

Total time and space
• O(N) = O(n²) space
• Initial construction: O(N) = O(n²) time
• After each merge: O(n log n) time
– O(n) merges in total
• Therefore O(n² log n) time is needed in total
• Next we study another structure that needs O(n²) space but only O(n²) time in total

Page 40

Quad tree
• Proposed in Eppstein, Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms 619-628, (1998)
• Main idea: group the objects iteratively to form a tree, storing at the root of each sub-tree the minimum distance among all objects below it
• Example: distances between 9 objects

Distance 0 1 2 3 4 5 6 7 8

0 0 6 5 15 17 11 11 14 11

1 6 0 10 16 12 13 13 9 6

2 5 10 0 12 20 8 8 16 13

3 15 16 12 0 17 4 4 14 12

4 17 12 20 17 0 17 18 4 7

5 11 13 8 4 17 0 1 13 11

6 11 13 8 4 18 1 0 14 12

7 14 9 16 14 4 13 14 0 3

8 11 6 13 12 7 11 12 3 0

Page 41

Quad tree

[Figure: the 9×9 distance matrix is reduced level by level, each entry of a level storing the minimum of a 2×2 block of the level below, until the single root entry holds the overall minimum distance, 1]

Page 42

Updating the quad tree
• After a merge, the algorithm needs to:
– Delete the distance values in two rows and two columns
– Add back distance values in one row and one column
• If we do not want to compact the tree, simply fill them into one of the freed rows and columns
– Re-compute the minimum values at the upper levels
• Example: merging clusters 5 and 6
• Suppose the new distances between the merged cluster and the other clusters are:

Cluster:  0  1  2 3 4  7  8
Distance: 11 13 8 4 17 14 12

Page 43

Merging clusters 5 and 6

Distances with the new cluster:
Cluster:  0  1  2 3 4  7  8
Distance: 11 13 8 4 17 14 12

[Figure: the rows and columns of clusters 5 and 6 are replaced by a single row and column for the merged cluster (5,6), and the minima at the upper levels of the quad tree are re-computed; the new overall minimum is 3]

Page 44

Space and time analysis
• Space needed: O(n²)
• Initial construction: O(n² + n²/4 + n²/16 + ...) = O(n²) time
• After each merge, number of values to update: O(2n + 2n/2 + 2n/4 + ...) = O(n), each taking a constant amount of time
• Time needed for the whole clustering process: O(n²)
– More efficient than using a heap
– There are data structures that require less space but more time

Page 45

Computing within-cluster distances
• If Ci and Cj are merged, how do we compute d(Ci∪Cj, Ck) from d(Ci, Ck) and d(Cj, Ck)?
– Single-link: d(Ci∪Cj, Ck) = min{d(Ci, Ck), d(Cj, Ck)}
– Complete-link: d(Ci∪Cj, Ck) = max{d(Ci, Ck), d(Cj, Ck)}
– Average-link: d(Ci∪Cj, Ck) = [d(Ci, Ck)|Ci||Ck| + d(Cj, Ck)|Cj||Ck|] / [(|Ci| + |Cj|)|Ck|]
– Centroid-link: Cen(Ci∪Cj) = [Cen(Ci)|Ci| + Cen(Cj)|Cj|] / (|Ci| + |Cj|)
• All can be performed in constant time
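These constant-time update rules (special cases of the Lance-Williams update formulas) are easy to check in code. A sketch with our own helper names:

```python
def update_single(d_ik, d_jk):
    # d(Ci u Cj, Ck) for single-link
    return min(d_ik, d_jk)

def update_complete(d_ik, d_jk):
    # d(Ci u Cj, Ck) for complete-link
    return max(d_ik, d_jk)

def update_average(d_ik, d_jk, ni, nj):
    # Size-weighted average of the two old distances;
    # the |Ck| factors in the formula cancel out
    return (d_ik * ni + d_jk * nj) / (ni + nj)

def update_centroid(cen_i, cen_j, ni, nj):
    # Size-weighted mean of the two old centroids
    return tuple((a * ni + b * nj) / (ni + nj) for a, b in zip(cen_i, cen_j))
```

For example, with |Ci| = 1, |Cj| = 3, d(Ci, Ck) = 2 and d(Cj, Ck) = 4, the average-link update gives (2·1 + 4·3)/4 = 3.5 without touching any individual points.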

Page 46

K-means
• K-means is another classical clustering algorithm
– MacQueen, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability 281-297, (1967)
• Instead of hierarchically merging clusters, k-means iteratively partitions the objects into k clusters by repeating two steps until stabilized:
1. Determining the cluster representatives
• Randomly determined initially
• Centroids of the current members in subsequent iterations
2. Assigning each object to the cluster with the closest representative
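The two alternating steps can be sketched as follows. To keep the illustration deterministic, the initial representatives are passed in rather than chosen randomly (names are ours):

```python
def kmeans(points, centroids, max_iters=100):
    # Alternate step 2 (assignment) and step 1 (centroid re-computation)
    # until the representatives stop moving
    for _ in range(max_iters):
        clusters = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[k].append(p)
        new = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # stabilized
            break
        centroids = new
    return clusters, centroids
```

On two well-separated groups of points the assignments stabilize after a couple of iterations; in general, k-means converges to a local optimum that depends on the initial representatives.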

Page 47

Example (k = 2)

[Figure: points A-I; starting from random initial representatives, the algorithm alternates between assigning each point to the closest representative and re-determining the representatives as cluster centroids, until clusters C1 and C2 stabilize]

Page 48

Hierarchical clustering vs. k-means

Hierarchical clustering
– Advantages: provides the whole clustering tree (dendrogram), which can be cut to get any number of clusters; no need to pre-determine k
– Disadvantages: slow; high memory consumption; once assigned, an object always stays in a cluster

k-means
– Advantages: fast; low memory consumption; an object can move to another cluster
– Disadvantages: provides only the final clusters; need to pre-determine k

• There are hundreds of other clustering algorithms proposed:
– Model-based
– Density-based
– Less sensitive to outliers
– More efficient
– Allowing other data types
– Considering domain knowledge
– Finding clusters in subspaces (coming up next)
– ...

Page 49

Embedded clusters
• Euclidean distance and Pearson correlation consider all attributes equally
• It is possible that for each cluster, only some of the attributes are relevant

Image credit: Pomeroy et al., Nature 415(6870):436-442, (2002)

Page 50

Finding clusters in a subspace
• One way is not to distinguish between objects and attributes, but to find a subset of rows and a subset of columns (a bicluster) such that the values inside the bicluster exhibit some coherent pattern
• Here we study one bi-clustering algorithm
– Cheng and Church, 8th Annual International Conference on Intelligent Systems for Molecular Biology, 93-103, (2000)

Page 51

Cheng and Church biclustering
• Notation:
– I is a subset of the rows
– J is a subset of the columns
– (I, J) defines a bicluster
• Model: each value aij (at row i and column j) in a cluster is influenced by:
– The background of the whole cluster
– The effect of the i-th row
– The effect of the j-th column

Page 52

Cheng and Church biclustering
• Assumption:
– In the ideal case, $a_{ij} = a_{iJ} + a_{Ij} - a_{IJ}$, where
– $a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}$ is the mean of the values in the cluster at row $i$
– $a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}$ is the mean of the values in the cluster at column $j$
– $a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} a_{ij} = \frac{1}{|I|} \sum_{i \in I} a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{Ij}$ is the mean of all values in the cluster
• The goal of the algorithm is to find $I$ and $J$ such that the following mean squared residue score is minimized:

$H(I, J) = \dfrac{1}{|I||J|} \sum_{i \in I, j \in J} \left( a_{ij} - a_{iJ} - a_{Ij} + a_{IJ} \right)^2$

Page 53

Example
• Suppose the values in a cluster are generated according to the following row and column effects (each value = background + row effect + column effect):

            Background: 3 | Column 1: 5 | Column 2: 8 | Column 3: 1 | Column 4: 8
Row 1: 4  | 12 | 15 | 8  | 15
Row 2: 2  | 10 | 13 | 6  | 13
Row 3: 7  | 15 | 18 | 11 | 18
Row 4: 10 | 18 | 21 | 14 | 21

• Then the corresponding average values are:

            Global: 14.25 | Column 1: 13.75 | Column 2: 16.75 | Column 3: 9.75 | Column 4: 16.75
Row 1: 12.5
Row 2: 10.5
Row 3: 15.5
Row 4: 18.5

• a11 − a1J − aI1 + aIJ = 12 − 12.5 − 13.75 + 14.25 = 0
– You can verify this for the other i's and j's
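The mean squared residue H(I, J) from the previous page can be computed directly; on an additive example like the one above it is exactly 0. A sketch with our own function name:

```python
def mean_squared_residue(A, I, J):
    # H(I, J) = mean over the bicluster of (a_ij - a_iJ - a_Ij + a_IJ)^2
    a_IJ = sum(A[i][j] for i in I for j in J) / (len(I) * len(J))
    a_iJ = {i: sum(A[i][j] for j in J) / len(J) for i in I}
    a_Ij = {j: sum(A[i][j] for i in I) / len(I) for j in J}
    return sum((A[i][j] - a_iJ[i] - a_Ij[j] + a_IJ) ** 2
               for i in I for j in J) / (len(I) * len(J))
```

For the 4×4 matrix above (rows [12, 15, 8, 15], [10, 13, 6, 13], [15, 18, 11, 18], [18, 21, 14, 21]), every value follows the additive model exactly, so the residue is 0; perturbing any single entry makes it positive.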

Page 54: Lecture 10. Clustering Algorithms The Chinese University of Hong Kong CSCI3220 Algorithms for Bioinformatics

CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2015 54

Why this model?
• Assume the expression level of a gene in a particular sample is determined by three additive effects:
  – The cluster background
    • E.g., the activity of the whole functional pathway
  – The gene
    • E.g., some genes are intrinsically more active
  – The sample
    • E.g., in some samples, all the genes in the cluster are activated


Algorithm
• How can we find clusters (i.e., pairs (I, J)) that have small H values?
  – It has been proved that finding the largest cluster with H less than a fixed threshold is NP-hard
  – Heuristic method:
    1. Randomly determine I and J
    2. Try all possible additions/deletions of one row/column, and accept the one that results in the smallest H
       – Some variations involve additions or deletions only, or allow adding or deleting multiple rows/columns at a time
    3. Repeat step 2 until H no longer decreases or it is smaller than the threshold
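Steps 1-3 can be sketched as below. This is a minimal illustration assuming single row/column additions and deletions only; Cheng and Church's full algorithm also includes multiple-node deletion phases, which this sketch omits, and all function names are my own:

```python
import numpy as np

def msr(A, rows, cols):
    """Mean squared residue H(I, J) of the submatrix A[rows, cols]."""
    sub = A[np.ix_(rows, cols)]
    res = (sub - sub.mean(axis=1, keepdims=True)
               - sub.mean(axis=0, keepdims=True) + sub.mean())
    return (res ** 2).mean()

def greedy_bicluster(A, threshold, min_rows=2, min_cols=2, seed=0):
    """Step 1: pick a random (I, J).  Steps 2-3: repeatedly try every
    single row/column addition or deletion, accept the move giving the
    smallest H, and stop when H no longer decreases or it drops below
    the threshold."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    rows = set(rng.choice(n, size=max(min_rows, n // 2), replace=False).tolist())
    cols = set(rng.choice(m, size=max(min_cols, m // 2), replace=False).tolist())
    h = msr(A, sorted(rows), sorted(cols))
    while h > threshold:
        best = None  # (H after move, new row set, new column set)
        for i in range(n):          # toggle one row in/out of I
            r = rows ^ {i}
            if len(r) >= min_rows:
                cand = (msr(A, sorted(r), sorted(cols)), r, cols)
                best = cand if best is None or cand[0] < best[0] else best
        for j in range(m):          # toggle one column in/out of J
            c = cols ^ {j}
            if len(c) >= min_cols:
                cand = (msr(A, sorted(rows), sorted(c)), rows, c)
                best = cand if best is None or cand[0] < best[0] else best
        if best is None or best[0] >= h:
            break                   # no move decreases H any further
        h, rows, cols = best
    return sorted(rows), sorted(cols), h
```

Because a move is accepted only when it strictly decreases H, the loop always terminates.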


More details
• Obviously, if the cluster contains only one row and one column, the residue H must be 0
  – Could limit the minimum number of rows/columns
• A cluster containing genes that do not change their expression values across different samples may not be interesting
  – Could use the variance of expression values across samples as a secondary score
• How to find more than one cluster?
  – After finding a cluster, replace it with random values before calling the algorithm again
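The "replace it with random values" trick might be sketched like this. This is a hypothetical helper, not the authors' exact procedure; here the replacement values are drawn uniformly from the range of the whole matrix:

```python
import numpy as np

def mask_bicluster(A, rows, cols, rng=None):
    """Overwrite a discovered bicluster with random values so that a
    subsequent run of the search will not rediscover the same cells."""
    if rng is None:
        rng = np.random.default_rng()
    A = A.copy()
    A[np.ix_(rows, cols)] = rng.uniform(A.min(), A.max(),
                                        size=(len(rows), len(cols)))
    return A
```

All cells outside the chosen rows and columns are left untouched, so the remaining structure of the matrix is preserved for the next search.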


Some clusters found
• Each line is a gene. The horizontal axis represents different time points


CASE STUDY, SUMMARY AND FURTHER READINGS

Epilogue


Case study: Successful stories
• Clustering of gene expression data has led to the discovery of disease subtypes and of key genes in some biological processes
• Example 1: Automatic identification of the cancer subtypes acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), without prior knowledge of these classes
  [Figure: clustering results with 2 clusters and with 4 clusters]
  Image credit: Golub et al., Science 286(5439):531-537, (1999)


Case study: Successful stories
• Example 2: Identification of genes involved in the response to external stress
  – Each triangle: multiple time points after producing an environmental change, such as heat shock or amino acid starvation
  Image credit: Gasch et al., Molecular Biology of the Cell 11(12):4241-4257, (2000)


Case study: Successful stories
• Example 3: Segmentation of the human genome into distinct region classes
  Image credit: The ENCODE Project Consortium, Nature 489(7414):57-74, (2012)


Summary
• Clustering is the process of grouping similar things into clusters
  – It has many applications in bioinformatics; the most well-known one is gene expression analysis
• Classical clustering algorithms
  – Agglomerative hierarchical clustering
  – K-means
• Subspace/bi-clustering algorithms


Further readings
• The book by Leonard Kaufman and Peter J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley-Interscience, 1990
  – A classical reference book on cluster analysis
