DB Seminar Series: Validation and Presentation of Clustering Results
Presented by:
Kevin Yip
26 March 2003
Introduction
Focus of most research works on data clustering:
Accuracy: tailored to different data characteristics (cluster shapes, presence of noise attributes and outliers, small number of objects, etc.).
Speed performance: statistical summaries, data structures, sampling, etc.
Any other issues to consider?
Introduction
Question 1: given a dataset with unknown data characteristics, if different clustering algorithms give different results, which one is more reliable?
“Reliable”:
“Object similarity”: objects of the same cluster are similar; objects of different clusters are dissimilar.
“Robustness”: stability of results when different algorithm parameters are used.
=> The validation problem (a confusing term!)
Introduction
Question 2: given a set of clustering results, how should it be presented so that users can gain most insights from it?
=> The presentation problem.
The validation problem
Two types of validation:
External validation (based on some “gold standards”).
Internal validation (based on some statistics of the results).
The validation problem
External validation: confusion matrix
          Class 1  Class 2  Class 3
Cluster A      10        0        0
Cluster B       0        0       10
Cluster C       0       10        0

           Class 1  Class 2  Class 3
Cluster A'       4        0        8
Cluster B'       6        0        0
Cluster C'       0       10        2
Evaluation functions:
• Precision (Cluster A: 10/10, Cluster A’: 8/12)
• Recall (Cluster A: 10/10, Cluster A’: 8/10)
• Many others
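A minimal sketch of how these per-cluster figures fall out of a confusion matrix, using the two matrices above (each cluster is labelled by its majority class; numpy is assumed):

import numpy as np

# Rows = clusters, columns = true classes (the two matrices above).
M1 = np.array([[10, 0, 0], [0, 0, 10], [0, 10, 0]])  # Clusters A, B, C
M2 = np.array([[4, 0, 8], [6, 0, 0], [0, 10, 2]])    # Clusters A', B', C'

def precision_recall(conf):
    """Per-cluster precision/recall, taking each cluster's majority class as its label."""
    for i, row in enumerate(conf):
        c = row.argmax()               # dominant class of this cluster
        p = row[c] / row.sum()         # fraction of the cluster in that class
        r = row[c] / conf[:, c].sum()  # fraction of that class captured
        print(f"cluster {i}: precision={p:.2f}, recall={r:.2f}")

precision_recall(M2)  # cluster 0: precision=0.67 (8/12), recall=0.80 (8/10)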
The validation problem
Problems with external validation:
Gold standards are usually not available in real situations (or we could simply do classification instead).
There can be multiple ways to assign class labels.
The validation problem
Internal validation:
Method 1: criterion function (e.g. average within-cluster distance to centroid, C-index, U-statistics) – “validating clusters/sets of clusters”.
The function values are easy to compute, and the clusters produced by different algorithms can be easily compared.
However, the functions can be biased against certain kinds of data, algorithms, outlier handling methods, etc. (e.g. spherical vs. non-spherical clusters).
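For concreteness, a sketch of the first criterion mentioned above, the average within-cluster distance to centroid (squared Euclidean distance is assumed here, matching the numerical example on a later slide):

import numpy as np

def avg_within_cluster_distance(X, labels):
    """Mean squared distance of each object to its cluster centroid.
    Lower values suggest tighter clusters (but see the biases noted above)."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total / len(X)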
The validation problem
[Figure from Karypis, Han and Kumar, 1999.]
The validation problem
Method 1b: criterion function with null hypothesis
If each record forms its own cluster, the average within-cluster distance to centroid is 0, but that does not imply a good clustering.
For each cluster, compare its value of the criterion function to that of a cluster with the same characteristics (e.g. no. of records) from random data.
It usually requires heavy computation, or is even computationally infeasible on large datasets.
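A sketch of the reference computation, under one simple null model (the slide does not specify one): random “clusters” of the same size are drawn uniformly from the data’s bounding box and scored with the same criterion function.

import numpy as np

def null_reference_score(n_records, data_min, data_max, trials=100, seed=0):
    """Mean criterion value (avg. squared distance to centroid) of random
    clusters with n_records records, drawn uniformly from the bounding box.
    data_min/data_max: per-attribute minimum and maximum (1-D arrays)."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(trials):
        fake = rng.uniform(data_min, data_max, size=(n_records, len(data_min)))
        centroid = fake.mean(axis=0)
        scores.append(((fake - centroid) ** 2).sum() / n_records)
    return np.mean(scores)  # a real cluster should score well below this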
The validation problem
Method 2: strong cluster determination
Repeat the clustering process with different parameter values (e.g. no. of target clusters) and identify the records that are always found in the same cluster – “validating the clustering algorithm” (in terms of robustness).
This method identifies groups of records that are similar under different conditions.
The absolute quality of the clusters is not determined. Sometimes it is hard to find records that are “always” clustered together, and there is no guarantee on how many records end up in the strong clusters.
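A minimal sketch of the bookkeeping behind this method (k-means with varying k is only an illustrative choice; the slide does not fix an algorithm or parameter):

import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def co_clustering_counts(X, k_values):
    """Count, for every pair of records, how often they land in the same
    cluster as the number of target clusters k is varied."""
    n = len(X)
    counts = np.zeros((n, n), dtype=int)
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i, j] += 1
    # Pairs with counts == len(k_values) were "always" clustered together;
    # connected components of such pairs form the strong clusters.
    return counts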
The validation problem
Method 3: agreement between different attributes (“unsupervised LOOCV” – “validating data and algorithm”)
Take out an attribute A and perform clustering. Calculate the average similarity of the values of A in each cluster. Repeat for all attributes and obtain an aggregate index.
    A1  A2  A3
R1   1   6   4
R2   3   9   5
R3   8   7   5
R4  10   6   6
The validation problem
Assumption: objects of the same class have similar attribute value patterns in all attributes.
It evaluates the quality of clusters by a “semi-external” criterion – the left-out attribute.
An example index:
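(The formula itself did not survive the slide conversion. Judging from the citation on the next slide, it is plausibly the 2-norm figure of merit (FOM) of Yeung, Haynor and Ruzzo (2001); a hedged reconstruction, where R(x, e) is the value of the left-out attribute e for record x, \mu_{C_i}(e) is its mean within cluster C_i, and n is the number of records:)

\mathrm{FOM}(e, k) = \sqrt{\frac{1}{n} \sum_{i=1}^{k} \sum_{x \in C_i} \big( R(x, e) - \mu_{C_i}(e) \big)^2}

The aggregate index then sums FOM(e, k) over all left-out attributes e.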
In practical use, the index values are plotted against different parameter values (e.g. no. of target clusters) and the curves of different algorithms are compared.
The validation problem
[Figure from Yeung, Haynor and Ruzzo, 2001 (Rat CNS data: 9 time points, 112 genes).]
The validation problem
Method 4: method 2 + method 3
As in method 3, clusters are produced with each attribute taken out in turn. But instead of using the left-out attribute to evaluate the clustering result, the results are compared to the result obtained with all attributes.
This is similar to method 2 in that the results are similar if the clusters are strong. But instead of finding strong clusters, the goal of this method is to evaluate how stable the clustering algorithms are.
An example similarity measure for the cluster sets:
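(The slide’s measure was lost in conversion and is not reconstructed here; as an illustrative stand-in, the sketch below uses the adjusted Rand index from scikit-learn to compare each leave-one-attribute-out clustering against the all-attribute result:)

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def loo_attribute_stability(X, k):
    """Method 4 sketch: cluster with each attribute left out in turn and
    compare against the all-attribute clustering (k-means is illustrative)."""
    base = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores = []
    for a in range(X.shape[1]):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(np.delete(X, a, axis=1))
        scores.append(adjusted_rand_score(base, labels))
    return np.mean(scores)  # near 1.0 => stable on this dataset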
The validation problem
This method may suggest rejecting some algorithms if their results are not stable.
However, it may not be able to suggest which algorithm is good, as the quality of the clusters is not considered.
For instance, if an algorithm always puts most of the objects into a single large cluster, it will have a very high stability, yet the clusters are not meaningful.
The validation problem
The above methods may not be able to validate projected clusters:
Method 1: some criterion functions give a monotonically increasing/decreasing value with the number of selected attributes. Normalizing the function value by the number of attributes may not work well either:
      A1   A2   A3   A4
R1     3    2    3    2
R2     5    6    8    5
R3     5    4    1   10
Mean 4.3  4.0  4.0  5.7

                                             A1-A4  A1-A3  A1-A2   A1
Avg. dist. to centroid                        23.1   12.2    3.6  0.9
Avg. dist. to centroid / no. of sel. attr.     5.8    4.1    1.8  0.9
The validation problem
Method 1b: when generating reference results on random data, it is very hard to obtain clusters with the desired numbers of records and selected attributes.
Method 2: projected clustering algorithms are usually parameter-sensitive, so it is not easy to find strong clusters.
Methods 3 and 4: the basic assumption of these methods contradicts the basic assumption of projected clustering.
A new validation problem for projected clusters: validating the selected attributes.
The validation problem
Summary:
Different clustering algorithms work well on data with different characteristics.
If the data characteristics of a dataset are unknown, and external validation criteria are not available, some internal validation methods may be helpful.
Each kind of method provides a different kind of validation, with different assumptions.
Just as no single clustering algorithm is the best in all situations, no single validation method can provide accurate validation in all cases.
The presentation problem
Question: how much to present?
Extreme 1:
• cluster 1: records 1, 2, 7, 10, 16…
• cluster 2: records 3, 4, 9, 14, 20…
• …
• Easy to understand, suitable for initial validation by domain experts.
• Can miss out a lot of important information.
The presentation problem
Extreme 2:
• Rebuilding score lists with 500 clusters.
Totally 12 possible merges.
500 clusters remained.
Best merge:
Cluster with first record 1 and cluster with first record 229: Score=0.95
No of selected attributes=19, average relevance of the selected attributes=1.0, mutual disagreement=0.0
Cluster with first record 1: no of rec=1, no of selected attr=20, avg rel of the selected attr=1.0
Cluster with first record 229: no of rec=1, no of selected attr=20, avg rel of the selected attr=1.0
Summary of the merged cluster:
Average relevance of the selected attributes: 1.0
19 selected attributes: {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
2 records: {1,229}
499 clusters remained.
…
• Too detailed; difficult to read or interpret.
• Not suitable for presentation, but it is good to store such detailed logs for further investigation.
The presentation problem
In between the two extremes, some useful summaries:
Dimension reduction (PCA, ICA, MDS, FastMap, etc.) followed by 2-D plot:
• Yeung and Ruzzo, 2001.
• It is not always possible to find a good 2-D space in which the points of different clusters are well separated.
• But even if some clusters overlap, the clearly separated parts can already be very useful.
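A minimal sketch of this kind of plot with scikit-learn and matplotlib (the Iris data and its labels merely stand in for a clustered dataset):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, labels = load_iris(return_X_y=True)         # labels stand in for cluster IDs
points = PCA(n_components=2).fit_transform(X)  # project onto the first two PCs
plt.scatter(points[:, 0], points[:, 1], c=labels, s=15)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.title("Clusters in the first two principal components")
plt.show()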
The presentation problem
Reachability plots
• Ankerst et al., 1999.
• Allow the identification of sub-clusters.
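A sketch of such a plot using scikit-learn’s OPTICS implementation (synthetic blobs are assumed as input); valleys in the bar plot correspond to clusters and sub-clusters:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
optics = OPTICS(min_samples=10).fit(X)
reach = optics.reachability_[optics.ordering_]  # reachability in cluster order
reach[np.isinf(reach)] = np.nan                 # first point has no reachability
plt.bar(np.arange(len(reach)), reach, width=1.0)
plt.xlabel("Points in OPTICS ordering"); plt.ylabel("Reachability distance")
plt.show()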
The presentation problem
[Figure from Ng, Sander and Sleumer, 2001.]
The presentation problem
Dendrograms (from hierarchical algorithms)
• Alizadeh et al., 2000.
• Corresponding confusion matrix:
Classes: (1) Activated blood B, (2) CLL, (3) DLBCL, (4) FL, (5) Germinal centre B,
(6) Nl. lymph node/tonsil, (7) Resting blood B, (8) Resting/activated T,
(9) Transformed cell lines.

            (1)  (2)  (3)  (4)  (5)  (6)  (7)  (8)  (9)
Cluster 0     0    0    1    0    0    0    0    0    0
Cluster 1     0    0    1    0    0    0    0    0    0
Cluster 2     0    0    1    0    0    0    0    0    0
Cluster 3     0    0    1    0    0    0    0    6    6
Cluster 4     0    0   28    0    0    2    0    0    0
Cluster 5     0   11    0    9    0    0    4    0    0
Cluster 6    10    0    0    0    0    0    0    0    0
Cluster 7     0    0    6    0    0    0    0    0    0
Cluster 8     0    0    8    0    2    0    0    0    0
The presentation problem
Finding leaf ordering:
• There are 2^(n-1) possible orderings.
• Greedy: each time a new cluster is formed, the locally-optimal ordering is determined. Not globally optimal, but efficient (O(n) time).
The presentation problem
Finding leaf ordering:
• Globally-optimal: maximizing the sum of the similarity of adjacent elements in the ordering:
D(T) = \sum_{i=1}^{n-1} S(z_i, z_{i+1})

where z_1, …, z_n are the leaves in left-to-right order and S is the similarity function.
Orderings better than those produced by the heuristics (as validated by domain knowledge) have been observed. An O(n^4)-time dynamic programming algorithm is available.
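For reference, SciPy now ships an implementation of optimal leaf ordering (scipy.cluster.hierarchy.optimal_leaf_ordering, based on the Bar-Joseph et al. algorithm); a small sketch on random data:

import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list, optimal_leaf_ordering
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                # 30 objects, 5 attributes
Z = linkage(X, method="average")            # hierarchical clustering
Z_opt = optimal_leaf_ordering(Z, pdist(X))  # reorder to optimize adjacency
print(leaves_list(Z_opt))                   # leaf order for the dendrogram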
The presentation problem
[Figure from Bar-Joseph, Gifford and Jaakkola, 2003.]
The presentation problem
Pros and cons of dendrograms:
• Users can find clusters in any subtrees, not necessarily rooted at the last-formed nodes.
• Sub-clusters can be identified.
• The relationship between different clusters is shown.
• Hard to interpret cluster boundaries when there are many objects.
Conclusions
When clustering is used in data analysis, it is rarely possible to obtain excellent results with a single run of a single algorithm.
Usually, before any interesting findings emerge, a lot of results are produced by different algorithms with different data preprocessing, similarity functions, parameter values, etc.
Internal validation methods provide hints for evaluating the results and choosing suitable algorithms.
Conclusions
However, not all internal validation methods are appropriate in each situation. A wrong method can lead to a wrong “proof” of a good clustering result.
Similarly, good clustering results can be ruined by a bad presentation method. The way to present the results should always take into account how they are to be used in further investigation.
References
Internal validation:
Marie-Odile Delorme and Alain Henaut, Merging of Distance Matrices and Classification by Dynamic Clustering, CABIOS vol. 4, no. 4, 1988.
George Karypis, Eui-Hong (Sam) Han and Vipin Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling, IEEE Computer vol. 32, no. 8, pp. 68-75, 1999.
Zhexue Huang, David W. Cheung and Michael K. Ng, An Empirical Study on the Visual Cluster Validation Method with Fastmap, DASFAA 2001.
References
Internal validation:
Ka Yee Yeung, David R. Haynor and Walter L. Ruzzo, Validating Clustering for Gene Expression Data, Bioinformatics vol. 17, no. 4, 2001.
Susmita Datta and Somnath Datta, Comparisons and Validation of Statistical Clustering Techniques for Microarray Gene Expression Data, Bioinformatics vol. 19, no. 4, 2003.
References
Result presentation:
Michael B. Eisen et al., Cluster Analysis and Display of Genome-wide Expression Patterns, Proc. Natl. Acad. Sci. USA vol. 95, 1998.
Mihael Ankerst et al., OPTICS: Ordering Points To Identify the Clustering Structure, SIGMOD 1999.
Ash A. Alizadeh et al., Distinct Types of Diffuse Large B-cell Lymphoma Identified by Gene Expression Profiling, Nature vol. 403, Feb 2000.
References
Result presentation:
K. Y. Yeung and W. L. Ruzzo, An Empirical Study of Principal Component Analysis for Clustering Gene Expression Data, Bioinformatics vol. 17, no. 9, 2001.
Ziv Bar-Joseph, David K. Gifford and Tommi S. Jaakkola, Fast Optimal Leaf Ordering for Hierarchical Clustering, Bioinformatics, vol. 17, suppl. 1, 2001.
Raymond T. Ng, Jörg Sander and Monica C. Sleumer, Hierarchical Cluster Analysis of SAGE Data for Cancer Profiling, BIOKDD 2001.
Thank You!