Selection K in K-means Clustering


TRANSCRIPT

Page 1: Selection K in K-means Clustering

2013 KSE Seminar

2013/10/11

Jung hoon Kim

Page 2: Selection K in K-means Clustering

TOPIC

Page 3: Selection K in K-means Clustering

Selection of K in K-means clustering

Page 4: Selection K in K-means Clustering

Why I chose this paper

• The k-means algorithm always assumes the number of clusters is given, but I really want to run it without human intuition or insight.
• This paper first reviews the existing automatic methods for selecting the number of clusters for the k-means algorithm.

Page 5: Selection K in K-means Clustering

Paper Format

1) Introduction
2) Reviews the main known methods for selecting K
3) Analyses the factors influencing the selection of K
4) Describes the proposed evaluation measure
5) Presents the results of applying the proposed measure to select K for different data sets
6) Concludes the paper

Page 6: Selection K in K-means Clustering

Small introduction

Page 7: Selection K in K-means Clustering

K-means Algorithm

• The k-means algorithm is a clustering method originally from signal processing that is popular in machine learning and data mining.
• k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, iterating until the centers move less than a threshold.

Page 8: Selection K in K-means Clustering

K-means Algorithm

1) Pick a number (k) of points randomly as the initial cluster centers
2) Assign every point to its nearest cluster center
3) Move each cluster center to the mean of its assigned points
4) Repeat 2-3 until convergence (see the sketch below)
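A minimal NumPy sketch of these four steps (the function and parameter names are mine, not the paper's):

```python
import numpy as np

def kmeans(X, k, tol=1e-6, max_iter=100, seed=None):
    """Basic k-means on an (n, d) data array X with k clusters."""
    rng = np.random.default_rng(seed)
    # 1) Pick k points at random as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) Assign every point to its nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Move each center to the mean of its assigned points
        #    (an empty cluster keeps its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # 4) Stop once the centers move less than the threshold
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```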

Page 9: Selection K in K-means Clustering

[Figure: Clustering Example 2, Step 1. Algorithm: k-means, distance metric: Euclidean distance. Axes: expression in condition 1 vs. expression in condition 2 (both 0 to 5); cluster centers k1, k2, k3.]

Page 10: Selection K in K-means Clustering

[Figure: Clustering Example 2, Step 2. Algorithm: k-means, distance metric: Euclidean distance. Axes: expression in condition 1 vs. expression in condition 2 (both 0 to 5); cluster centers k1, k2, k3.]

Page 11: Selection K in K-means Clustering

[Figure: Clustering Example 2, Step 3. Algorithm: k-means, distance metric: Euclidean distance. Axes: expression in condition 1 vs. expression in condition 2 (both 0 to 5); cluster centers k1, k2, k3.]

Page 12: Selection K in K-means Clustering

[Figure: Clustering Example 2, Step 4. Algorithm: k-means, distance metric: Euclidean distance. Axes: expression in condition 1 vs. expression in condition 2 (both 0 to 5); cluster centers k1, k2, k3.]

Page 13: Selection K in K-means Clustering

[Figure: Clustering Example 2, Step 5. Algorithm: k-means, distance metric: Euclidean distance. Axes: expression in condition 1 vs. expression in condition 2 (both 0 to 5); cluster centers k1, k2, k3.]

Page 14: Selection K in K-means Clustering

Comments on the K-Means Method

• Strength
• Relatively efficient: O(tkn), where n is the number of instances, k the number of clusters, and t the number of iterations. Normally, k, t << n.
• Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing or genetic algorithms.

• Weakness
• Need to specify k, the number of clusters, in advance
• Initialization problem
• Not suitable for discovering clusters with non-convex shapes

Page 15: Selection K in K-means Clustering

What’s the problem?

Page 16: Selection K in K-means Clustering

What's the problem?

• Initialization problem
• A problem caused when many initial points land in the high-density part of the data and few points in the low-density part

Page 17: Selection K in K-means Clustering

What's the problem?

• Hard to find clusters with non-convex shapes

Page 18: Selection K in K-means Clustering

What’s the problem?

• Selection of K

Page 19: Selection K in K-means Clustering

Existing Method

• Values of K determined from a human's viewpoint

• Using probabilistic theory
• Akaike's information criterion: if the data sets are generated by a set of Gaussian distributions
• Hardy method: if the data sets are generated by a set of Poisson distributions
• Monte Carlo techniques (with an associated null hypothesis)

Page 20: Selection K in K-means Clustering

Paper proposed

• f(K): a measure for evaluating the clustering result that could be used to select the number of clusters
• The value of f(K) is the ratio of the real distortion to the estimated distortion and is close to 1 when the data distribution is uniform
• I_j is the distortion of cluster j, the sum of squared distances d(x_t, w_j) between each object x_t in the cluster and its centre w_j
• K is the specified number of clusters; S_K = \sum_{j=1}^{K} I_j is the total distortion
• The value of S_K decreases as the value of K increases

Page 21: Selection K in K-means Clustering

Formula

• f(K) = 1,                                  if K = 1
• f(K) = S_K / (\alpha_K S_{K-1}),           if S_{K-1} != 0, for all K > 1
• f(K) = 1,                                  if S_{K-1} == 0, for all K > 1

• \alpha_K = 1 - 3/(4 N_d),                  if K = 2 and N_d > 1
• \alpha_K = \alpha_{K-1} + (1 - \alpha_{K-1})/6,   if K > 2 and N_d > 1

where N_d is the number of dimensions (attributes) of the data set
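A minimal Python sketch of how f(K) could be computed from these definitions (my own function names; `kmeans` is the routine sketched on the algorithm slide, and the \alpha_K formula assumes N_d > 1 as stated above):

```python
import numpy as np

def distortion(X, centers, labels):
    # S_K: sum over all clusters of the distortions I_j, i.e. the squared
    # distances between each object and the centre of its cluster
    return sum(((X[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centers))

def f_values(X, max_k):
    """Evaluate f(K) for K = 1..max_k following the formula above."""
    n_d = X.shape[1]                    # N_d: number of dimensions (> 1)
    f = {1: 1.0}                        # f(1) = 1 by definition
    alpha, s_prev = None, None
    for k in range(1, max_k + 1):
        centers, labels = kmeans(X, k)  # k-means routine sketched earlier
        s_k = distortion(X, centers, labels)
        if k == 2:
            alpha = 1 - 3 / (4 * n_d)            # alpha_2
        elif k > 2:
            alpha = alpha + (1 - alpha) / 6      # alpha_K from alpha_{K-1}
        if k > 1:
            # f(K) = S_K / (alpha_K * S_{K-1}), or 1 when S_{K-1} = 0
            f[k] = s_k / (alpha * s_prev) if s_prev != 0 else 1.0
        s_prev = s_k
    return f
```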

Page 22: Selection K in K-means Clustering

Research Method

• The method has been validated on 15 artificial and 12 benchmark data sets.
• The 12 benchmark data sets come from the UCI Repository of Machine Learning Databases.
• The 15 artificial data sets provide an effective sample of the kinds of distributions that are usually encountered.

Page 23: Selection K in K-means Clustering

Sample

Page 24: Selection K in K-means Clustering

Sample

Page 25: Selection K in K-means Clustering

Sample

Page 26: Selection K in K-means Clustering

Sample

Page 27: Selection K in K-means Clustering

Recommendation Example

if f(X) < 0.85, K = X; else K = 1
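One reading of this rule as code, building on the `f_values` sketch above (the 0.85 threshold is the one quoted on this slide):

```python
def recommend_k(X, max_k=10, threshold=0.85):
    """Suggest K: the value with the lowest f(K) below the threshold."""
    f = f_values(X, max_k)
    candidates = [k for k in f if f[k] < threshold]
    # If no f(K) falls below the threshold, the data shows no clear
    # cluster structure and K = 1 is returned.
    return min(candidates, key=lambda k: f[k]) if candidates else 1
```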

Page 28: Selection K in K-means Clustering

Conclusion

• The new method is closely related to the K-means clustering approach because it takes into account information reflecting the performance of the algorithm
• The proposed method can suggest multiple values of K to users for cases where different clustering results could be obtained at various required levels of detail
• This method is computationally expensive if used with large data sets

Page 29: Selection K in K-means Clustering

Improvement

• This paper did not mention how we can calculate the threshold (e.g., f(X) < 0.85); if we have many data sets, we could apply a learning algorithm to determine the threshold
• The experimental data sets are mostly biased: the data is too ideal and does not reflect real-world complexity at all. Evaluating on randomly sampled data could be one remedy
• Knowing the range, or maximum value, of K is also an important issue

Page 30: Selection K in K-means Clustering

Do you have any questions?

Page 31: Selection K in K-means Clustering

Thank you