Selection of K in K-Means Clustering
TRANSCRIPT
2013 KSE Seminar
2013/10/11
Jung hoon Kim
TOPIC
Selection of K in K-means clustering
Why I chose this paper
• The k-means algorithm always relies on an assumption (the number of clusters), but I want to be able to run it without human intuition or insight.
• This paper first reviews existing automatic methods for selecting the number of clusters for the k-means algorithm.
Paper Format
1) Introduction
2) Reviews the main known methods for selecting K
3) Analyses the factors influencing the selection of K
4) Describes the proposed evaluation measure
5) Presents the results of applying the proposed measure to select K for different data sets
6) Concludes the paper
Small introduction
K-means Algorithm
• The k-means algorithm is a clustering method, originally from signal processing, that is popular in machine learning and data mining.
• k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean; the centers are updated iteratively until they move less than a threshold.
K-means Algorithm
1) Pick a number (k) of points at random as the initial cluster centers
2) Assign every point to its nearest cluster center
3) Move each cluster center to the mean of its assigned points
4) Repeat steps 2-3 until convergence (see the sketch below)
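Below is a minimal sketch of these four steps in Python; it is not the paper's code, and the function name kmeans, the numpy dependency, the Euclidean distance, and the fixed tolerance are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1) pick k points at random as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) assign every point to its nearest cluster center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) move each cluster center to the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # 4) stop once the centers move less than the threshold
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```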
Clustering: Example 2, Step 1. Algorithm: k-means, Distance Metric: Euclidean Distance. [Figure: expression in condition 1 vs. expression in condition 2; cluster centers k1, k2, k3]
Clustering: Example 2, Step 2. Algorithm: k-means, Distance Metric: Euclidean Distance. [Figure: expression in condition 1 vs. expression in condition 2; cluster centers k1, k2, k3]
Clustering: Example 2, Step 3. Algorithm: k-means, Distance Metric: Euclidean Distance. [Figure: expression in condition 1 vs. expression in condition 2; cluster centers k1, k2, k3]
Clustering: Example 2, Step 4. Algorithm: k-means, Distance Metric: Euclidean Distance. [Figure: expression in condition 1 vs. expression in condition 2; cluster centers k1, k2, k3]
Clustering: Example 2, Step 5. Algorithm: k-means, Distance Metric: Euclidean Distance. [Figure: expression in condition 1 vs. expression in condition 2; cluster centers k1, k2, k3]
Comments on the K-Means Method
• Strength
• Relatively efficient: O(tkn), where n is the number of instances, k the number of clusters, and t the number of iterations. Normally, k, t << n.
• Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing or genetic algorithms.
• Weakness
• Need to specify k, the number of clusters, in advance
• Initialization problem
• Not suitable for discovering clusters with non-convex shapes
What’s the problem?
What’s the problem?
• Initialization problem
• This problem arises from the random choice of initial centers: many points may end up assigned to centers in the high-density part of the data while few points are assigned to centers in the low-density part.
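As a side note (not from the paper), a common way to reduce the initialization problem is to run k-means from several random starts and keep the run with the lowest distortion; the sketch below reuses the hypothetical kmeans() helper from earlier.

```python
def kmeans_best_of(X, k, n_init=10):
    """Run the kmeans() sketch above from several random starts and keep the
    run with the lowest distortion (sum of squared distances to the centers)."""
    best = None
    for seed in range(n_init):
        centers, labels = kmeans(X, k, seed=seed)
        sse = float(((X - centers[labels]) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, centers, labels)
    return best[1], best[2]
```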
What’s the problem?
• It is hard to find clusters with non-convex shapes
What’s the problem?
• Selection of K
Existing Method
• Values of K determined by human judgment (intuition or insight)
• Using probability theory
• Akaike’s information criterion
• applicable if the data set is generated from a set of Gaussian distributions
• Hardy’s method
• applicable if the data set is generated from a set of Poisson distributions
• Monte Carlo techniques (with an associated null hypothesis)
Paper proposed
• The paper proposes an evaluation function f(K) for evaluating the clustering result, which can be used to select the number of clusters.
• The value of f(K) is the ratio of the real distortion to the estimated distortion and is close to 1 when the data distribution is uniform.
• I_j is the distortion of cluster j, computed from d(x_t, w_j), the distance between object x_t and the centroid w_j of its cluster.
• S_K = I_1 + ... + I_K, where K is the specified number of clusters.
• The value of S_K decreases as the value of K increases.
Formula
• f(K) = 1, if K = 1
• f(K) = S_K / (α_K · S_{K-1}), if S_{K-1} != 0 and K > 1
• f(K) = 1, if S_{K-1} == 0 and K > 1
• α_K = 1 - 3 / (4·N_d), if K = 2 and N_d > 1 (N_d is the number of attributes of the data set)
• α_K = α_{K-1} + (1 - α_{K-1}) / 6, if K > 2 and N_d > 1
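A small sketch of how f(K) could be computed from the definitions above; the helper names distortion and f_values are illustrative, the use of squared Euclidean distance for the distortion is an assumption, and it reuses the hypothetical kmeans() sketch from earlier.

```python
import numpy as np

def distortion(X, centers, labels):
    """S_K: total distortion, here the sum of squared Euclidean distances from
    each object to the centroid of its cluster (squared distance is an assumption)."""
    return float(((X - centers[labels]) ** 2).sum())

def f_values(X, max_K):
    """Compute f(K) for K = 1..max_K, reusing the kmeans() sketch shown earlier."""
    N_d = X.shape[1]                       # number of attributes of the data set
    f, S, a = {1: 1.0}, {}, None
    for K in range(1, max_K + 1):
        centers, labels = kmeans(X, K)     # hypothetical helper from the earlier sketch
        S[K] = distortion(X, centers, labels)
        if K == 2:
            a = 1.0 - 3.0 / (4.0 * N_d)    # alpha_2
        elif K > 2:
            a = a + (1.0 - a) / 6.0        # alpha_K derived from alpha_{K-1}
        if K > 1:
            f[K] = S[K] / (a * S[K - 1]) if S[K - 1] != 0 else 1.0
    return f
```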
Research Method
• The method has been validated on 15 artificial and 12 benchmark data sets.
• The 12 benchmark data sets come from the UCI Repository of Machine Learning Databases.
• The 15 artificial data sets provide representative samples of many commonly occurring distributions.
Sample
[Figures: sample data sets, not reproduced]
Recommendation Example
if f(X) < 0.85, K = X; else K = 1
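In code, this recommendation rule could look like the sketch below, reusing the hypothetical f_values() helper from earlier; the 0.85 threshold comes from this slide, and returning multiple candidate values of K matches the paper's claim that several values of K may be suggested.

```python
def recommend_K(X, max_K=10, threshold=0.85):
    """Suggest every K whose f(K) falls below the threshold; if none does,
    recommend K = 1 (no strong clustering structure)."""
    f = f_values(X, max_K)                 # hypothetical helper from the earlier sketch
    candidates = [K for K, v in sorted(f.items()) if K > 1 and v < threshold]
    return candidates if candidates else [1]
```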
Conclusion
• The new method is closely related to the k-means clustering approach because it takes into account information reflecting the performance of the algorithm.
• The proposed method can suggest multiple values of K to users for cases where different clustering results could be obtained with various required levels of detail.
• This method is computationally expensive if used with large data sets.
Improvements
• The paper does not mention how to calculate the threshold (e.g., f(X) < 0.85); if we have many data sets, we could apply a learning algorithm to determine the threshold.
• The experimental data sets are rather biased: they are too ideal and do not reflect the complexity of real data at all. One option would be to evaluate on randomly chosen data.
• It is also an important issue to know the range, or the maximum value, of K.
Do you have any questions?
Thank you