DB Seminar Series: Validation and Presentation of Clustering Results
Presented by: Kevin Yip
26 March 2003

Page 1

DB Seminar Series: Validation and Presentation of Clustering Results

Presented by: Kevin Yip

26 March 2003

Page 2

Introduction

Focus of most research works on data clustering:

• Accuracy: tailoring for different data characteristics (cluster shapes, presence of noise attributes and outliers, small number of objects, etc.)
• Speed: performance techniques (statistical summaries, data structures, sampling, etc.)

Any other issues to consider?

Page 3

Introduction

Question 1: given a dataset with unknown data characteristics, if different clustering algorithms give different results, which one is more reliable?

“Reliable”:
• “Object similarity”: objects of the same cluster are similar; objects of different clusters are dissimilar.
• “Robustness”: stability of results when different algorithm parameters are used.

=> The validation problem (a confusing term!)

Page 4

Introduction

Question 2: given a set of clustering results, how should they be presented so that users can gain the most insight from them?

=> The presentation problem.

Page 5

The validation problem

Two types of validation:
• External validation (based on some “gold standards”).
• Internal validation (based on some statistics of the results).

Page 6

The validation problem

External validation: confusion matrix

            Class 1   Class 2   Class 3
Cluster A      10        0         0
Cluster B       0        0        10
Cluster C       0       10         0

            Class 1   Class 2   Class 3
Cluster A’      4        0         8
Cluster B’      6        0         0
Cluster C’      0       10         2

Evaluation functions:
• Precision (Cluster A: 10/10, Cluster A’: 8/12)
• Recall (Cluster A: 10/10, Cluster A’: 8/10)
• Many others
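As a concrete check of these numbers, here is a minimal sketch (not from the slides; names are illustrative) that computes per-cluster precision and recall from a clusters-by-classes confusion matrix, labelling each cluster by its majority class:

```python
import numpy as np

def precision_recall(confusion):
    """Per-cluster precision and recall from a clusters-x-classes
    confusion matrix, labelling each cluster by its majority class."""
    confusion = np.asarray(confusion, dtype=float)
    majority = confusion.argmax(axis=1)        # majority class per cluster
    hits = confusion.max(axis=1)               # records of that class in the cluster
    precision = hits / confusion.sum(axis=1)   # fraction of the cluster
    recall = hits / confusion.sum(axis=0)[majority]  # fraction of the class found
    return precision, recall

# Second matrix above (clusters A', B', C').
p, r = precision_recall([[4, 0, 8], [6, 0, 0], [0, 10, 2]])
print(p)  # approx. [0.667 1.    0.833]  -> A': 8/12
print(r)  # approx. [0.8   0.6   1.   ]  -> A': 8/10
```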

Page 7

The validation problem

Problems with external validation:
• Gold standards are usually not available in real situations (and when they are, we could do classification instead).
• There can be multiple ways to assign class labels.

Page 8

The validation problem

Internal validation:

Method 1: criterion function (e.g. average within-cluster distance to centroid, C-index, U-statistics) – “validating clusters/sets of clusters”.

The function values are easy to compute, and the clusters produced by different algorithms can be easily compared. However, the functions can be biased against certain kinds of data, algorithms, outlier-handling methods, etc. (e.g. spherical vs. non-spherical clusters).
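For concreteness, a minimal sketch (my own naming, not from the slides) of the first criterion mentioned above, the average within-cluster distance to centroid:

```python
import numpy as np

def avg_within_cluster_distance(X, labels):
    """Mean Euclidean distance of each record to its cluster centroid.
    Lower is better, but note the bias caveat above: this criterion
    favours compact, spherical clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        total += np.linalg.norm(members - centroid, axis=1).sum()
    return total / len(X)
```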

Page 9

The validation problem

[Figure from Karypis, Han and Kumar, 1999.]

Page 10

The validation problem

Method 1b: criterion function with null hypothesis

If each record forms its own cluster, the average within-cluster distance to centroid is 0, but that does not imply a good clustering.

For each cluster, compare its value of the criterion function to that of a cluster with the same characteristics (e.g. no. of records) from random data.

Usually requires heavy computation, and can even be computationally infeasible for large datasets.
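A sketch of one possible null-hypothesis comparison; the uniform-in-bounding-box null model and all names here are assumptions, since the slide does not fix a particular random-data model:

```python
import numpy as np

def fraction_random_as_good(criterion, cluster, bounds, trials=100, seed=0):
    """Compare a cluster's criterion value against random 'clusters' of
    the same size drawn uniformly from the data's bounding box.
    Returns the fraction of random clusters scoring at least as well
    (a low fraction suggests the real cluster is unusually good).
    Assumes lower criterion values are better."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds                  # per-attribute min and max of the data
    n, d = cluster.shape
    observed = criterion(cluster)
    random_scores = [criterion(rng.uniform(lo, hi, size=(n, d)))
                     for _ in range(trials)]
    return float(np.mean([s <= observed for s in random_scores]))
```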

Page 11

The validation problem

Method 2: strong clusters determination

Repeat the clustering process with different parameter values (e.g. no. of target clusters) and identify the records that are always found in the same cluster – “validating the clustering algorithm” (in terms of robustness).

• This method identifies groups of records that remain similar under different conditions.
• The absolute quality of the clusters is not determined.
• Sometimes it is hard to find records that are “always” clustered together, and there is no guarantee on how many records end up in the strong clusters.
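A sketch of one way to extract such strong groups, via the fraction of runs in which each pair of records lands in the same cluster (names and the threshold are illustrative; relaxing the threshold below 1.0 addresses the “always” problem noted above):

```python
import numpy as np

def co_membership(label_runs):
    """label_runs: one label array per clustering run (e.g. per
    parameter setting). Returns an n-by-n matrix of the fraction of
    runs in which each pair of records shares a cluster."""
    runs = [np.asarray(l) for l in label_runs]
    n = len(runs[0])
    co = np.zeros((n, n))
    for labels in runs:
        co += labels[:, None] == labels[None, :]
    return co / len(runs)

# Pairs with co_membership(...) >= 0.9, say, form the "strong" relation.
```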

Page 12

The validation problem

Method 3: agreement between different attributes (“unsupervised LOOCV” – “validating data and algorithm”)

Take out an attribute A and perform clustering. Calculate the average similarity of the values of A in each cluster. Repeat for all attributes and obtain an aggregate index.

     A1   A2   A3
R1    1    6    4
R2    3    9    5
R3    8    7    5
R4   10    6    6

Page 13

The validation problem

Assumption: objects of the same class have similar attribute value patterns in all attributes.

It evaluates the quality of clusters by a “semi-external” criterion – the left-out attribute.

An example index is the figure of merit (FOM) of Yeung, Haynor and Ruzzo (2001): for a left-out attribute $e$ and $k$ clusters,

$$FOM(e, k) = \sqrt{\frac{1}{n} \sum_{i=1}^{k} \sum_{x \in C_i} \big(R(x, e) - \mu_{C_i}(e)\big)^2}$$

where $R(x, e)$ is the value of attribute $e$ for record $x$, $\mu_{C_i}(e)$ is its mean within cluster $C_i$, and $n$ is the number of records; the aggregate index sums $FOM(e, k)$ over all left-out attributes.

In practical use, the index values are plotted against different parameter values (e.g. no. of target clusters) and the curves of different algorithms are compared.
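A sketch of the aggregate index under the FOM formulation above; `cluster_fn` is an assumed stand-in for whatever clustering routine is being validated:

```python
import numpy as np

def aggregate_fom(X, cluster_fn, k):
    """Leave out each attribute e, cluster on the rest, then measure
    the spread of e within the resulting clusters; sum over all e."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    total = 0.0
    for e in range(d):
        labels = cluster_fn(np.delete(X, e, axis=1), k)
        sq = 0.0
        for c in np.unique(labels):
            vals = X[labels == c, e]
            sq += ((vals - vals.mean()) ** 2).sum()
        total += np.sqrt(sq / n)   # FOM(e, k)
    return total
```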

Page 14

The validation problem

[Figure: FOM curves from Yeung, Haynor and Ruzzo, 2001 (rat CNS data: 9 time points, 112 genes).]

Page 15

The validation problem

Method 4: method 2 + method 3

As in method 3, clusters are produced with each attribute taken out in turn. But instead of using the left-out attribute to evaluate the clustering result, the results are compared to the result obtained with all attributes.

This is similar to method 2 in that the results are similar when the clusters are strong. But instead of finding strong clusters, the goal of this method is to evaluate how stable the clustering algorithms are.

An example similarity measure for the cluster sets: [formula not recovered; a generic stand-in is sketched below]
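Since the slide's own measure is not reproduced here, the sketch below uses pairwise co-membership agreement as a generic stand-in similarity between two cluster sets:

```python
import numpy as np

def comembership_agreement(a, b):
    """Fraction of record pairs on which two label vectors agree about
    whether the pair shares a cluster (a generic stand-in measure)."""
    a, b = np.asarray(a), np.asarray(b)
    iu = np.triu_indices(len(a), k=1)
    same_a = (a[:, None] == a[None, :])[iu]
    same_b = (b[:, None] == b[None, :])[iu]
    return float((same_a == same_b).mean())

def stability(X, cluster_fn, k):
    """Method 4: average agreement between the all-attribute result
    and each leave-one-attribute-out result."""
    full = cluster_fn(X, k)
    return float(np.mean([comembership_agreement(
                              full, cluster_fn(np.delete(X, e, axis=1), k))
                          for e in range(X.shape[1])]))
```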

Page 16

The validation problem

This method may suggest rejecting some algorithms whose results are not stable.

However, it may not be able to suggest which algorithm is good, as the quality of the clusters is not considered.

For instance, if an algorithm always puts most of the objects into a single large cluster, it will have a very high stability, yet the clusters are not meaningful.

Page 17

The validation problem

The above methods may not be able to validate projected clusters (clusters defined over subsets of the attributes):

Method 1: some criterion functions give values that increase or decrease monotonically with the number of selected attributes. Normalizing the function value by the number of attributes may not work either, as the example below shows.

      A1    A2    A3    A4
R1     3     2     3     2
R2     5     6     8     5
R3     5     4     1    10
Mean  4.3   4.0   4.0   5.7

                                          A1–A4   A1–A3   A1–A2   A1
Avg. sq. dist. to centroid                 23.1    12.2     3.6   0.9
Avg. sq. dist. / no. of selected attr.      5.8     4.1     1.8   0.9
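The table can be checked directly; the snippet below reproduces the values (assuming, as the numbers indicate, that the table reports the average squared distance to the centroid):

```python
import numpy as np

X = np.array([[3, 2, 3, 2],
              [5, 6, 8, 5],
              [5, 4, 1, 10]], dtype=float)

def avg_sq_dist_to_centroid(sub):
    """Average squared Euclidean distance to the centroid."""
    centered = sub - sub.mean(axis=0)
    return (centered ** 2).sum(axis=1).mean()

for m in (4, 3, 2, 1):                    # A1-A4, A1-A3, A1-A2, A1
    v = avg_sq_dist_to_centroid(X[:, :m])
    print(m, round(v, 1), round(v / m, 1))
# -> 23.1 5.8 | 12.2 4.1 | 3.6 1.8 | 0.9 0.9
# The value keeps shrinking as attributes are dropped, even after
# normalizing, so fewer selected attributes always "look better".
```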

Page 18

The validation problem

Method 1b: when generating reference results on random data, it is very hard to obtain clusters with the desired numbers of records and selected attributes.

Method 2: projected clustering algorithms are usually parameter-sensitive, so it is not easy to find strong clusters.

Methods 3 and 4: the basic assumption of these methods contradicts the basic assumption of projected clustering.

A new validation problem for projected clusters: validating the selected attributes.

Page 19

The validation problem

Summary:

• Different clustering algorithms work well on data with different characteristics.
• If the data characteristics of a dataset are unknown, and external validation criteria are not available, some internal validation methods may be helpful.
• Each kind of method provides a different kind of validation under different assumptions.
• Just as no single clustering algorithm is the best in all situations, no single validation method can provide accurate validation in all cases.

Page 20

The presentation problem

Question: how much to present?

Extreme 1:

• cluster 1: records 1, 2, 7, 10, 16…
• cluster 2: records 3, 4, 9, 14, 20…
…

• Easy to understand, suitable for initial validation by domain experts.

• Can miss a lot of important information.

Page 21

The presentation problem

Extreme 2:

• Rebuilding score lists with 500 clusters.
  Totally 12 possible merges.
  500 clusters remained.
  Best merge:
  Cluster with first record 1 and cluster with first record 229: Score=0.95
  No of selected attributes=19, average relevance of the selected attributes=1.0, mutual disagreement=0.0
  Cluster with first record 1: no of rec=1, no of selected attr=20, avg rel of the selected attr=1.0
  Cluster with first record 229: no of rec=1, no of selected attr=20, avg rel of the selected attr=1.0
  Summary of the merged cluster:
  Average relevance of the selected attributes: 1.0
  19 selected attributes: {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
  2 records: {1,229}
  499 clusters remained.
  …

• Too detailed; difficult to read or interpret.

• Not for presentation, but it is good to store such detailed logs for further investigation.

Page 22

The presentation problem

In between the extremes, some useful summaries:

Dimension reduction (PCA, ICA, MDS, FastMap, etc.) followed by 2-D plot:

[Figure: 2-D plots from Yeung and Ruzzo, 2001.]

• It is not always possible to find a good 2D space where the points of different clusters are well separated.

• But even when some clusters overlap, the clearly separated parts can already be very useful.
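A minimal sketch of the PCA option, projecting onto the first two principal components with an SVD (function names are my own):

```python
import numpy as np
import matplotlib.pyplot as plt

def pca_2d_plot(X, labels):
    """Centre the data, project onto the first two principal
    components, and colour the points by cluster."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ Vt[:2].T            # n x 2 projection
    for c in np.unique(labels):
        pts = proj[np.asarray(labels) == c]
        plt.scatter(pts[:, 0], pts[:, 1], label=f"cluster {c}")
    plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend()
    plt.show()
```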

Page 23

The presentation problem

Reachability plots

• Ankerst et al., 1999.
• Allow the identification of sub-clusters.
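A sketch of producing such a plot, assuming scikit-learn's OPTICS implementation as a stand-in for the original algorithm of Ankerst et al.:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

def reachability_plot(X, min_samples=5):
    """Valleys in the reachability plot correspond to clusters;
    nested valleys reveal sub-clusters."""
    opt = OPTICS(min_samples=min_samples).fit(X)
    reach = opt.reachability_[opt.ordering_]   # reachability in visit order
    plt.bar(range(len(reach)), reach, width=1.0)
    plt.xlabel("cluster ordering")
    plt.ylabel("reachability distance")
    plt.show()
```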

Page 24

The presentation problem

[Figure: reachability plot from Ng, Sander and Sleumer, 2001.]

Page 25

The presentation problem

Dendrograms (from hierarchical algorithms)

• [Figure: dendrogram from Alizadeh et al., 2000.]
• Corresponding confusion matrix:

          Activated  CLL  DLBCL  FL  Germinal  Nl. lymph    Resting  Resting/     Transformed
          blood B                    centre B  node/tonsil  blood B  activated T  cell lines
Cluster 0     0        0     1    0      0          0           0         0            0
Cluster 1     0        0     1    0      0          0           0         0            0
Cluster 2     0        0     1    0      0          0           0         0            0
Cluster 3     0        0     1    0      0          0           0         6            6
Cluster 4     0        0    28    0      0          2           0         0            0
Cluster 5     0       11     0    9      0          0           4         0            0
Cluster 6    10        0     0    0      0          0           0         0            0
Cluster 7     0        0     6    0      0          0           0         0            0
Cluster 8     0        0     8    0      2          0           0         0            0

Page 26

The presentation problem

Finding leaf ordering:
• There are 2^(n−1) possible orderings (one flip decision at each of the n−1 internal nodes).
• Greedy: each time a new cluster is formed, the locally optimal ordering is determined.
• Greedy: not globally optimal, but efficient (O(n) time).
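A sketch of the greedy step: when two subtrees are merged, try the four flip combinations of the child orderings and keep the one with the most similar junction pair (`S` is an assumed precomputed leaf-similarity matrix; orderings are lists of leaf indices):

```python
import numpy as np

def greedy_merge_order(left, right, S):
    """Return the concatenation of the two child orderings (each
    possibly reversed) that maximizes the similarity of the two
    leaves meeting at the junction."""
    best, best_sim = None, -np.inf
    for l in (left, left[::-1]):
        for r in (right, right[::-1]):
            sim = S[l[-1], r[0]]          # similarity across the junction
            if sim > best_sim:
                best, best_sim = l + r, sim
    return best
```

Applied bottom-up at every merge, this makes one constant-time decision per internal node, which is where the efficiency comes from.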


Page 28

The presentation problem

Finding leaf ordering:
• Globally optimal: maximizing the sum of the similarities of adjacent elements in the ordering:

$$D(T) = \sum_{i=1}^{n-1} S(z_i, z_{i+1})$$

where $z_1, \dots, z_n$ is the left-to-right leaf ordering of the tree $T$ and $S$ is the similarity function.

Orderings better than those produced by heuristics (as validated by domain knowledge) have been observed. An O(n^4)-time dynamic programming algorithm is available.

Page 29

The presentation problem

[Figure from Bar-Joseph, Gifford and Jaakkola, 2003.]

Page 30

The presentation problem

Pros and cons of dendrograms:
• Users can find clusters in any subtrees, not necessarily rooted at the last-formed nodes.

• Sub-clusters can be identified.

• The relationship between different clusters is shown.

• Hard to interpret cluster boundaries when there are many objects.

Page 31

Conclusions

• When clustering is used in data analysis, a single run of a single algorithm rarely yields excellent results.
• Usually, before some interesting findings are obtained, many results are produced by different algorithms with different data preprocessing, similarity functions, parameter values, etc.
• Internal validation methods provide hints for evaluating the results and choosing suitable algorithms.

Page 32

Conclusions

However, not all internal validation methods are appropriate in each situation. A wrong method can lead to a wrong “proof” of a good clustering result.

Similarly, good clustering results can be ruined by a bad presentation method. The way the results are presented should always take into account how they will be used in further investigation.

Page 33

References

Internal validation:

Marie-Odile Delorme and Alain Henaut, Merging of Distance Matrices and Classification by Dynamic Clustering, CABIOS, vol. 4, no. 4, 1988.

George Karypis, Eui-Hong (Sam) Han and Vipin Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling, IEEE Computer, vol. 32, no. 8, pp. 68–75, 1999.

Zhexue Huang, David W. Cheung and Michael K. Ng, An Empirical Study on the Visual Cluster Validation Method with Fastmap, DASFAA 2001.

Page 34

References

Internal validation:

Ka Yee Yeung, David R. Haynor and Walter L. Ruzzo, Validating Clustering for Gene Expression Data, Bioinformatics, vol. 17, no. 4, 2001.

Susmita Datta and Somnath Datta, Comparisons and Validation of Statistical Clustering Techniques for Microarray Gene Expression Data, Bioinformatics, vol. 19, no. 4, 2003.

Page 35

References

Result presentation:

Michael B. Eisen et al., Cluster Analysis and Display of Genome-wide Expression Patterns, Proc. Natl. Acad. Sci. USA, vol. 95, 1998.

Mihael Ankerst et al., OPTICS: Ordering Points To Identify the Clustering Structure, SIGMOD 1999.

Ash A. Alizadeh et al., Distinct Types of Diffuse Large B-cell Lymphoma Identified by Gene Expression Profiling, Nature, vol. 403, Feb 2000.

Page 36

References

Result presentation:

K. Y. Yeung and W. L. Ruzzo, An Empirical Study of Principal Component Analysis for Clustering Gene Expression Data, Bioinformatics, vol. 17, no. 9, 2001.

Ziv Bar-Joseph, David K. Gifford and Tommi S. Jaakkola, Fast Optimal Leaf Ordering for Hierarchical Clustering, Bioinformatics, vol. 17, suppl. 1, 2001.

Raymond T. Ng, Jörg Sander and Monica C. Sleumer, Hierarchical Cluster Analysis of SAGE Data for Cancer Profiling, BIOKDD 2001.

Page 37

Thank You!