Methods from Mathematical Data Mining (Supported by Optimization)



DESCRIPTION

AACIMP 2009 Summer School lecture by Gerhard Wilhelm Weber. "Modern Operational Research and Its Mathematical Methods" course.

TRANSCRIPT

Page 1: Methods from Mathematical Data Mining (Supported by Optimization)

Methods from Mathematical Data Mining (Supported by Optimization)

4th International Summer School: Achievements and Applications of Contemporary Informatics, Mathematics and Physics, National University of Technology of the Ukraine, Kiev, Ukraine, August 5-16, 2009


Gerhard-Wilhelm Weber* and Başak Akteke-Öztürk

Institute of Applied Mathematics, Middle East Technical University, Ankara, Turkey

* Faculty of Economics, Management and Law, University of Siegen, Germany

Center for Research on Optimization and Control, University of Aveiro, Portugal


Page 2: Methods from Mathematical Data Mining (Supported by Optimization)

Clustering Theory

Cluster Number and Cluster Stability Estimation

Z. Volkovich

4th International Summer School: Achievements and Applications of Contemporary Informatics, Mathematics and Physics, National University of Technology of the Ukraine, Kiev, Ukraine, August 5-16, 2009


Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel

Z. Barzily, Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel

G.-W. Weber, Departments of Scientific Computing, Financial Mathematics and Actuarial Sciences, Institute of Applied Mathematics, Middle East Technical University, 06531 Ankara, Turkey

D. Toledano-Kitai, Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel

Page 3: Methods from Mathematical Data Mining (Supported by Optimization)

Clustering

• An essential tool for “unsupervised” learning is cluster analysis, which categorizes data (objects, instances) into groups such that the similarity within a group is much higher than that between the groups.

• This resemblance is often described by a distance function.

Page 4: Methods from Mathematical Data Mining (Supported by Optimization)

Clustering

For a given set S ⊂ IR^d, a clustering algorithm CL constructs a clustered set

CL(S, int-part, k) = Π(S) = (π_1(S), …, π_k(S)),

such that CL(x) = CL(y) = i if x and y are similar, i.e., x, y ∈ π_i(S) for some i = 1, …, k; and CL(x) ≠ CL(y) if x and y are dissimilar.

Page 5: Methods from Mathematical Data Mining (Supported by Optimization)

Clustering

The disjoint subsets π_i(S), i = 1, …, k, are named clusters:

$$S = \bigcup_{i=1}^{k} \pi_i(S), \qquad \pi_i(S) \cap \pi_j(S) = \emptyset \ \text{ for } i \neq j.$$
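As a small illustration of this notation (a sketch under our own choices: k-means stands in for a generic CL, and the toy data are invented, neither being fixed by the slides), the partition (π_1(S), …, π_k(S)) can be produced as follows:

```python
# Sketch: a clustering algorithm CL(S, int-part, k) returning a partition
# (pi_1(S), ..., pi_k(S)).  k-means stands in for a generic CL here.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
S = np.vstack([rng.normal(0, 0.3, (50, 2)),     # two visible groups in IR^2
               rng.normal(3, 0.3, (50, 2))])

k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(S)

# The disjoint clusters pi_i(S): their union is S, pairwise intersections empty.
partition = [S[labels == i] for i in range(k)]
assert sum(len(p) for p in partition) == len(S)
```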

Page 6: Methods from Mathematical Data Mining (Supported by Optimization)

Clustering

[Figure: points x, y with CL(x) = CL(y) (same cluster) vs. CL(x) ≠ CL(y) (different clusters).]

Page 7: Methods from Mathematical Data Mining (Supported by Optimization)

Clustering

The iterative clustering process is usually carried out in two phases: a partitioning phase and a quality assessment phase.

In the partitioning phase, a label is assigned to each element in view of the assumption that, in addition to the observed features, for each data item, there is a hidden, unobserved feature representing cluster membership.


The quality assessment phase measures the grouping quality.

The outcome of the clustering process is a partition that achieves the highest quality score.

Besides the data itself, two essential input parameters are typically required: an initial partition and a suggested number of clusters. Here, these parameters are denoted as
• int-part;
• k.

Page 8: Methods from Mathematical Data Mining (Supported by Optimization)

The Problem

Partitions generated by iterative algorithms are commonly sensitive to the initial partitions fed in as an input parameter. Selection of “good” initial partitions is an essential clustering problem.

Another problem arising here is choosing the right number of clusters. It is well known that this key task of cluster analysis is ill posed. For instance, the “correct” number of clusters in a data set can depend on the scale in which the data are measured.

In this talk, we address the latter problem: determining the number of clusters.


Page 10: Methods from Mathematical Data Mining (Supported by Optimization)

The Problem

Many approaches to this problem exploit the within-cluster dispersion matrix (defined according to the pattern of a covariance matrix). The span of this matrix (column space) usually decreases as the number of groups rises, and may have a point at which it “falls”. In several known methods, such an “elbow” on the graph locates the “true” number of clusters.

Stability-based approaches to the cluster validation problem evaluate the partitions’ variability under repeated applications of a clustering algorithm. Low variability is understood as high consistency of the results obtained, and the number of clusters that maximizes cluster stability is accepted as an estimate for the “true” number of clusters.

Page 11: Methods from Mathematical Data Mining (Supported by Optimization)

The Concept

In the current talk, the problem of determining the true number of clusters is addressed by the cluster stability approach. We propose a method for the study of cluster stability which suggests a geometrical stability of a partition.

• We draw samples from the source data and estimate the clusters by means of each of the drawn samples.
• We compare pairs of the partitions obtained.
• A pair is considered to be consistent if the obtained divisions are close.

Page 12: Methods from Mathematical Data Mining (Supported by Optimization)

The Concept

• We quantify this closeness by the number of edges connecting points from different samples in a minimal spanning tree (MST) constructed for each one of the clusters.

• We use the Friedman and Rafsky two-sample test statistic, which measures these quantities. Under the null hypothesis of homogeneity of the source data, this statistic is approximately normally distributed.

So, the case of well-mingled samples within the clusters leads to a normal distribution of the considered statistic.

Page 13: Methods from Mathematical Data Mining (Supported by Optimization)

The Concept

Examples of MST produced by samples within a cluster:


Page 14: Methods from Mathematical Data Mining (Supported by Optimization)

The Concept

The left-side picture is an example of “a good cluster”, where the quantity of edges connecting points from different samples (marked by solid red lines) is relatively big.

The right-side picture shows a “poor situation”, where only one (and long) edge connects the (sub-)clusters.

Page 15: Methods from Mathematical Data Mining (Supported by Optimization)

The Two-Sample MST Test

Henze and Penrose (1999) considered the asymptotic behavior of R_mn, the number of edges of the MST which connect a point of S to a point of T.

Suppose that |S| = m → ∞ and |T| = n → ∞ such that m/(m+n) → p ∈ (0, 1).

Introducing q = 1 − p and r = 2pq, they obtained

$$\frac{R_{mn} - \frac{2mn}{m+n}}{\sqrt{m+n}} \longrightarrow N\!\left(0, \sigma_d^2\right),$$

where the convergence is in distribution and N(0, σ_d²) denotes the normal distribution with zero expectation and variance σ_d² := r(r + C_d(1 − 2r)), for some constant C_d depending only on the space’s dimension d.
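A minimal sketch of this statistic (our own illustration; the helper name cross_edge_count and the toy data are assumptions, not from the slides): build the MST of the pooled samples with SciPy, count the edges joining S to T, and center R_mn as in the limit above.

```python
# Sketch: the Friedman-Rafsky cross-edge count R_mn on the MST of the pooled
# samples S and T, centered as in the Henze-Penrose limit above (the variance
# term sigma_d^2 is omitted here; this is an illustration, not the full test).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def cross_edge_count(S, T):
    X = np.vstack([S, T])
    origin = np.array([0] * len(S) + [1] * len(T))   # 0 = from S, 1 = from T
    D = squareform(pdist(X))
    mst = minimum_spanning_tree(D).tocoo()           # |X| - 1 edges
    return int(np.sum(origin[mst.row] != origin[mst.col]))

rng = np.random.default_rng(1)
S, T = rng.normal(size=(100, 2)), rng.normal(size=(120, 2))  # well mingled
m, n = len(S), len(T)
R_mn = cross_edge_count(S, T)
centered = (R_mn - 2 * m * n / (m + n)) / np.sqrt(m + n)
print(R_mn, centered)   # centered value stays near 0 for mingled samples
```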

Page 16: Methods from Mathematical Data Mining (Supported by Optimization)

Concept

• Resting upon this fact, the standard score Ỹ_j of the mentioned edge quantity is calculated for each cluster j = 1, …, K:

$$\tilde{Y}_j := \frac{R_j - m/K}{\sqrt{2m/K}},$$

where m is the sample size and K denotes the number of clusters.

• The partition quality is represented by the worst cluster, corresponding to the minimal standard score value obtained.
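A sketch of this scoring step, building on the cross_edge_count helper above. The slides’ exact normalization in m and K is not fully legible in the transcript, so the Henze-Penrose centering with the actual within-cluster sub-sample sizes is used here as an assumption:

```python
# Sketch: score each cluster by its centered cross-edge count and let the
# minimum over clusters represent the partition quality, as described above.
# Assumption: per-cluster Henze-Penrose centering replaces the slides' exact
# normalization in m and K.
import numpy as np

def partition_score(S1, S2, labels1, labels2, K):
    scores = []
    for j in range(K):
        A, B = S1[labels1 == j], S2[labels2 == j]   # cluster j from each sample
        if len(A) == 0 or len(B) == 0:
            return -np.inf                          # degenerate cluster: worst case
        a, b = len(A), len(B)
        R_j = cross_edge_count(A, B)                # from the previous sketch
        scores.append((R_j - 2 * a * b / (a + b)) / np.sqrt(a + b))
    return min(scores)                              # the worst cluster
```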

Page 17: Methods from Mathematical Data Mining (Supported by Optimization)

Concept

• It is natural to expect that the true number of clusters can be characterized by the empirical distribution of the partition standard score having the shortest left tail.

• The proposed methodology is expressed as a sequential creation of the described distribution together with an estimation of its left asymmetry.

Page 18: Methods from Mathematical Data Mining (Supported by Optimization)

Concept

One of the important problems appearing here is the so-called cluster coordination problem. Actually, the same cluster can be tagged differently within repeated reruns of the algorithm. This fact results from the inherent symmetry of the partitions with respect to their cluster labels.

Page 19: Methods from Mathematical Data Mining (Supported by Optimization)

Concept

We solve this problem in the following way. Let S = S_1 ∪ S_2 and consider three categorizations:

$$\Pi_K := Cl(S, K), \qquad \Pi_{K,1} := Cl(S_1, K), \qquad \Pi_{K,2} := Cl(S_2, K).$$

Thus, we get two partitions for each of the samples S_i, i = 1, 2. The first one is induced by Π_K and the second one is Π_{K,i}, i = 1, 2.

Page 20: Methods from Mathematical Data Mining (Supported by Optimization)

Concept

For each one of the samples i = 1, 2, our purpose is to find the permutation ψ of the set {1, …, K} which minimizes the quantity of misclassified items:

$$\psi_i^* = \arg\min_{\psi} \sum_{x \in X_i} I\!\left(\alpha_K(x) \neq \psi\!\left(\alpha_{K,i}(x)\right)\right), \quad i = 1, 2,$$

where I(z) is the indicator function of the event z and α_K, α_{K,i} are the assignments defined by Π_K, Π_{K,i}, correspondingly.

Page 21: Methods from Mathematical Data Mining (Supported by Optimization)

Concept

The well-known Hungarian method for solving this problem has a computational complexity of O(K³).

After changing the cluster labels of the partitions Π_{K,i}, i = 1, 2, consistently with ψ_i^*, i = 1, 2, we can assume that these partitions are coordinated, i.e., the clusters are consistently designated.
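A minimal sketch of this coordination step via SciPy’s Hungarian-method solver (linear_sum_assignment); the confusion-matrix construction is our own framing of the ψ* minimization, not spelled out on the slides:

```python
# Sketch: coordinate cluster labels with the Hungarian method (O(K^3)),
# using scipy's linear_sum_assignment on the label confusion matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

def coordinate_labels(ref_labels, labels, K):
    """Relabel `labels` to agree with `ref_labels` as much as possible."""
    confusion = np.zeros((K, K), dtype=int)
    for r, c in zip(ref_labels, labels):
        confusion[r, c] += 1
    # Maximizing matched items = minimizing misclassified items I(alpha != psi(alpha)).
    row, col = linear_sum_assignment(-confusion)
    psi = np.empty(K, dtype=int)
    psi[col] = row            # psi maps an old label to its coordinated label
    return psi[labels]
```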

Page 22: Methods from Mathematical Data Mining (Supported by Optimization)

Algorithm

1. Choose the parameters: K*, J, m, Cl.
2. For K = 2 to K*
3.   For j = 1 to J
4.     S_{j,1} = sample(X, m);  S_{j,2} = sample(X \ S_{j,1}, m)
5.     Calculate
         Π_{K,j} = Cl(S^{(j)}, K),
         Π_{K,j,1} = Cl(S_{j,1}, K),
         Π_{K,j,2} = Cl(S_{j,2}, K).
6.     Solve the coordination problem.

Page 23: Methods from Mathematical Data Mining (Supported by Optimization)

Algorithm

7.     Calculate Ỹ_j(k), k = 1, …, K.
8.   end for j
9.   Calculate an asymmetry index (percentile) I_K for { Ỹ_j(K) | j = 1, …, J }.
10. end for K
11. The “true” number of clusters is selected as the K which yields the maximal value of the index.

Here, sample(S, m) is a procedure which selects a random sample of size m from the set S, without replacement.
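Putting steps 1-11 together, a compact sketch (reusing the hypothetical helpers cross_edge_count, partition_score and coordinate_labels from the earlier sketches; k-means stands in for Cl, and the 25th percentile serves as the index I_K — all our own choices):

```python
# Sketch of steps 1-11.  Assumes cross_edge_count, partition_score and
# coordinate_labels from the sketches above; k-means stands in for Cl.
import numpy as np
from sklearn.cluster import KMeans

def estimate_k(X, K_star=7, J=100, m=200, seed=2):
    """Return the K in 2..K_star with the maximal asymmetry index."""
    rng = np.random.default_rng(seed)
    index = {}
    for K in range(2, K_star + 1):                          # step 2
        scores = []
        for _ in range(J):                                  # step 3
            i1 = rng.choice(len(X), size=m, replace=False)  # step 4
            i2 = rng.choice(np.setdiff1d(np.arange(len(X)), i1),
                            size=m, replace=False)
            S1, S2 = X[i1], X[i2]
            pooled = KMeans(n_clusters=K, n_init=5).fit(np.vstack([S1, S2]))  # step 5
            l1 = coordinate_labels(pooled.labels_[:m],      # step 6
                                   KMeans(n_clusters=K, n_init=5).fit_predict(S1), K)
            l2 = coordinate_labels(pooled.labels_[m:],
                                   KMeans(n_clusters=K, n_init=5).fit_predict(S2), K)
            scores.append(partition_score(S1, S2, l1, l2, K))  # step 7
        index[K] = np.percentile(scores, 25)                # step 9: asymmetry index
    return max(index, key=index.get)                        # step 11
```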

Page 24: Methods from Mathematical Data Mining (Supported by Optimization)

Numerical Experiments

We have carried out various numerical experiments on synthetic and real data sets. We choose K* = 7 in all tests, and we provide 10 trials for each experiment.

The results are presented via the error-bar plots of the sample percentiles’ mean within the trials. The sizes of the error bars equal two standard deviations, computed within the trials.

The standard version of the Partitioning Around Medoids (PAM) algorithm has been used for clustering.

The empirical percentiles of 25%, 75% and 90% have been used as the asymmetry indexes.
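If one wants PAM itself as the clustering procedure Cl, a sketch assuming the third-party scikit-learn-extra package is available (its KMedoids estimator is an assumption about that library, not part of the slides):

```python
# Sketch: PAM (k-medoids) as the clustering procedure Cl of the algorithm,
# assuming the scikit-learn-extra package is installed.
from sklearn_extra.cluster import KMedoids

def pam(S, K):
    # Returns a label in {0, ..., K-1} for every row of S.
    return KMedoids(n_clusters=K, method="pam", random_state=0).fit_predict(S)
```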

Page 25: Methods from Mathematical Data Mining (Supported by Optimization)

Numerical Experiments – Synthetic Data

The synthesized data are mixtures of 2-dimensional Gaussian distributions with independent coordinates having the same standard deviation σ.

The mean values of the components are placed on the unit circle at an angular neighboring distance of 2π/k̂.

Each data set contains 4000 items.

Here, we took J = 100 (J: number of samples) and m = 200 (m: size of samples).
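A sketch generating such a data set (the function name and the random seed are our own):

```python
# Sketch: a mixture of k_hat 2-D Gaussians with means equally spaced on the
# unit circle (angular distance 2*pi/k_hat) and independent coordinates of
# common standard deviation sigma, as described above.
import numpy as np

def circle_mixture(k_hat=4, sigma=0.3, n=4000, seed=3):
    rng = np.random.default_rng(seed)
    angles = 2 * np.pi * np.arange(k_hat) / k_hat
    means = np.column_stack([np.cos(angles), np.sin(angles)])
    comp = rng.integers(k_hat, size=n)          # component of each item
    return means[comp] + rng.normal(0, sigma, (n, 2))

X = circle_mixture()                            # Example 1: k_hat = 4, sigma = 0.3
```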

Page 26: Methods from Mathematical Data Mining (Supported by Optimization)

Synthetic Data - Example 1

The first data set has the parameters k̂ = 4 and σ = 0.3.

As we see, all of the three indexes clearly indicate

four clusters.

Page 27: Methods from Mathematical Data Mining (Supported by Optimization)

Synthetic Data - Example 2

The second synthetic data set has the parameters k̂ = 5 and σ = 0.3.

The components are obviously overlapping in this case.

Page 28: Methods from Mathematical Data Mining (Supported by Optimization)

Synthetic Data - Example 2

As can be seen, the true number of clusters has been successfully found by all indexes.

Page 29: Methods from Mathematical Data Mining (Supported by Optimization)

Numerical Experiments – Real-World Data: First Data Sets

The first real data set was chosen from the text collection

http://ftp.cs.cornell.edu/pub/smart/.


This set consists of the following three sub-collections:

DC0: Medlars Collection (1033 medical abstracts),

DC1: CISI Collection (1460 information science abstracts),

DC2: Cranfield Collection (1400 aerodynamics abstracts).

Page 30: Methods from Mathematical Data Mining (Supported by Optimization)

Numerical Experiments – Real-World Data: First Data Sets

We picked the 600 “best” terms, following the common bag-of-words method.

It is known that this collection is well separated by means of its first two leading principal components.

Here, we also took J = 100 and m = 200.
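A sketch of this preprocessing step (here the 600 “best” terms are simply the most frequent ones, and the three-string corpus is a placeholder for the Medlars/CISI/Cranfield abstracts):

```python
# Sketch: a term-document matrix over the three sub-collections, keeping the
# 600 most frequent terms as a stand-in for the slides' "best" terms.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["medical abstract placeholder",        # DC0: Medlars
         "information science placeholder",     # DC1: CISI
         "aerodynamics abstract placeholder"]   # DC2: Cranfield

vectorizer = CountVectorizer(max_features=600, stop_words="english")
X_terms = vectorizer.fit_transform(texts).toarray()   # rows: documents
```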

Page 31: Methods from Mathematical Data Mining (Supported by Optimization)

Real-World Data - First Data Sets


All the indexes receive their maximal values at K=3, i.e., the number of clusters is properly determined.

Page 32: Methods from Mathematical Data Mining (Supported by Optimization)

Numerical Experiments – Real-World Data: Second Data Set

Another considered data set is the famous

Iris Flower Data Set, available, for example, at

http://archive.ics.uci.edu/ml/datasets/Iris.

This data set is composed of 150 four-dimensional feature vectors of three equally sized sets of iris flowers.

We choose J = 200, and the sample size equals m = 70.
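Assuming the hypothetical estimate_k helper from the algorithm sketch above, this experiment might be reproduced as:

```python
# Sketch: the Iris experiment (150 four-dimensional vectors, three species),
# with J = 200 resampling rounds and sample size m = 70.
from sklearn.datasets import load_iris

X_iris = load_iris().data
print(estimate_k(X_iris, K_star=7, J=200, m=70))   # the slides report K = 3
```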

Page 33: Methods from Mathematical Data Mining (Supported by Optimization)

Real-World Data – Iris Flower Data Set

Our method turns out a three-cluster structure.

Page 34: Methods from Mathematical Data Mining (Supported by Optimization)

Conclusions – The Rationale of Our Approach

• In this paper, we propose a novel approach, based on the minimal spanning tree two-sample test, for the cluster stability assessment.

• The method offers to quantify the partitions’ features through the test statistic computed within the clusters built by means of sample pairs.

• The worst cluster, determined by the lowest standardized statistic value, characterizes the partition quality.

Page 35: Methods from Mathematical Data Mining (Supported by Optimization)

Conclusions – The Rationale of Our Approach

• The departure from the theoretical model, which suggests well-mingled samples within the clusters, is described by the left tail of the score distribution.

• The shortest tail corresponds to the “true” number of clusters.

• All presented experiments detect the true number of clusters.

Page 36: Methods from Mathematical Data Mining (Supported by Optimization)

Conclusions

• In the case of the five-component Gaussian data set, the true number of clusters was found even though a certain overlapping of the clusters exists.

• The four-component Gaussian data set contains sufficiently separated components. Therefore, it is no revelation that the true number of clusters is attained here.

Page 37: Methods from Mathematical Data Mining (Supported by Optimization)

Conclusions

• The analysis of the abstracts data set was carried out with 600 terms, and the true number of clusters was also detected.

• The Iris Flower data set is sufficiently difficult to analyze due to the fact that two clusters are not linearly separable. However, the true number of clusters was found here as well.

Page 38: Methods from Mathematical Data Mining (Supported by Optimization)

References

Barzily, Z., Volkovich, Z.V., Akteke-Öztürk, B., and Weber, G.-W., Cluster stability using minimal spanning trees, ISI Proceedings of the 20th Mini-EURO Conference “Continuous Optimization and Knowledge-Based Technologies” (Neringa, Lithuania, May 20-23, 2008), 248-252.

Barzily, Z., Volkovich, Z.V., Akteke-Öztürk, B., and Weber, G.-W., On a minimal spanning tree approach in the cluster validation problem, to appear in the special issue of INFORMATICA on the occasion of the 20th Mini-EURO Conference “Continuous Optimization and Knowledge-Based Technologies” (Neringa, Lithuania, May 20-23, 2008), Dzemyda, G., Miettinen, K., and Sakalauskas, L., guest editors.

Volkovich, V., Barzily, Z., Weber, G.-W., and Toledano-Kitai, D., Cluster stability estimation based on a minimal spanning trees approach, Proceedings of the Second Global Conference on Power Control and Optimization (Bali, Indonesia, June 1-3, 2009), AIP Conference Proceedings 1159, Subseries: Mathematical and Statistical Physics, ISBN 978-0-7354-0696-4 (August 2009), 299-305; Hakim, A.H., Vasant, P., and Barsoum, N., guest eds.