Selection of K in K-Means Clustering
TRANSCRIPT
2013 KSE Seminar
2013/10/11
Jung hoon Kim
TOPIC
Selection of K in K-means clustering
Why I chose this paper
• The k-means algorithm always relies on an assumption (the number of clusters), but I want to be able to run it without human intuition or insight.
• This paper first reviews existing automatic methods for selecting the number of clusters for the k-means algorithm.
Paper Format
1) Introduction
2) Reviews the main known methods for selecting K
3) Analyses the factors influencing the selection of K
4) Describes the proposed evaluation measure
5) Presents the results of applying the proposed measure to select K for different data sets
6) Concludes the paper
Small introduction
K-means Algorithm
• The k-means algorithm is a clustering method, originally from signal processing, that is popular in machine learning and data mining.
• k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean; the centers are updated iteratively until they move less than a threshold.
K-means Algorithm
1) Pick a number (k) of points at random as the initial cluster centers
2) Assign every point to its nearest cluster center
3) Move each cluster center to the mean of its assigned points
4) Repeat steps 2-3 until convergence (see the sketch below)
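Below is a minimal sketch of these four steps in Python; it is not the paper's code, and the function name kmeans, the numpy dependency, the Euclidean distance, and the fixed tolerance are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1) pick k points at random as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) assign every point to its nearest cluster center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) move each cluster center to the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # 4) stop once the centers move less than the threshold
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```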
Clustering: Example 2, Step 1. Algorithm: k-means, Distance Metric: Euclidean Distance. [Figure: expression in condition 1 vs. expression in condition 2; cluster centers k1, k2, k3]
Clustering: Example 2, Step 2. Algorithm: k-means, Distance Metric: Euclidean Distance. [Figure: expression in condition 1 vs. expression in condition 2; cluster centers k1, k2, k3]
Clustering: Example 2, Step 3. Algorithm: k-means, Distance Metric: Euclidean Distance. [Figure: expression in condition 1 vs. expression in condition 2; cluster centers k1, k2, k3]
Clustering: Example 2, Step 4. Algorithm: k-means, Distance Metric: Euclidean Distance. [Figure: expression in condition 1 vs. expression in condition 2; cluster centers k1, k2, k3]
Clustering: Example 2, Step 5. Algorithm: k-means, Distance Metric: Euclidean Distance. [Figure: expression in condition 1 vs. expression in condition 2; cluster centers k1, k2, k3]
Comments on the K-Means Method
• Strength
• Relatively efficient: O(tkn), where n is the number of instances, k the number of clusters, and t the number of iterations. Normally, k, t << n.
• Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing or genetic algorithms.
• Weakness
• Need to specify k, the number of clusters, in advance
• Initialization problem
• Not suitable for discovering clusters with non-convex shapes
What’s the problem?
What’s the problem?
• Initialization problem
• This problem arises from the random choice of initial centers: many points may end up assigned to centers in the high-density part of the data while few points are assigned to centers in the low-density part.
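As a side note (not from the paper), a common way to reduce the initialization problem is to run k-means from several random starts and keep the run with the lowest distortion; the sketch below reuses the hypothetical kmeans() helper from earlier.

```python
def kmeans_best_of(X, k, n_init=10):
    """Run the kmeans() sketch above from several random starts and keep the
    run with the lowest distortion (sum of squared distances to the centers)."""
    best = None
    for seed in range(n_init):
        centers, labels = kmeans(X, k, seed=seed)
        sse = float(((X - centers[labels]) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, centers, labels)
    return best[1], best[2]
```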
What’s the problem?
• It is hard to find clusters with non-convex shapes
What’s the problem?
• Selection of K
Existing Method
• Values of K determined by human judgment (intuition or insight)
• Using probability theory
• Akaike’s information criterion
• applicable if the data set is generated from a set of Gaussian distributions
• Hardy’s method
• applicable if the data set is generated from a set of Poisson distributions
• Monte Carlo techniques (with an associated null hypothesis)
Paper proposed
• The paper proposes an evaluation function f(K) for evaluating the clustering result, which can be used to select the number of clusters.
• The value of f(K) is the ratio of the real distortion to the estimated distortion and is close to 1 when the data distribution is uniform.
• I_j is the distortion of cluster j, computed from d(x_t, w_j), the distance between object x_t and the centroid w_j of its cluster.
• S_K = I_1 + ... + I_K, where K is the specified number of clusters.
• The value of S_K decreases as the value of K increases.
Formula
• f(K) = 1, if K = 1
• f(K) = S_K / (α_K · S_{K-1}), if S_{K-1} != 0 and K > 1
• f(K) = 1, if S_{K-1} == 0 and K > 1
• α_K = 1 - 3 / (4·N_d), if K = 2 and N_d > 1 (N_d is the number of attributes of the data set)
• α_K = α_{K-1} + (1 - α_{K-1}) / 6, if K > 2 and N_d > 1
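A small sketch of how f(K) could be computed from the definitions above; the helper names distortion and f_values are illustrative, the use of squared Euclidean distance for the distortion is an assumption, and it reuses the hypothetical kmeans() sketch from earlier.

```python
import numpy as np

def distortion(X, centers, labels):
    """S_K: total distortion, here the sum of squared Euclidean distances from
    each object to the centroid of its cluster (squared distance is an assumption)."""
    return float(((X - centers[labels]) ** 2).sum())

def f_values(X, max_K):
    """Compute f(K) for K = 1..max_K, reusing the kmeans() sketch shown earlier."""
    N_d = X.shape[1]                       # number of attributes of the data set
    f, S, a = {1: 1.0}, {}, None
    for K in range(1, max_K + 1):
        centers, labels = kmeans(X, K)     # hypothetical helper from the earlier sketch
        S[K] = distortion(X, centers, labels)
        if K == 2:
            a = 1.0 - 3.0 / (4.0 * N_d)    # alpha_2
        elif K > 2:
            a = a + (1.0 - a) / 6.0        # alpha_K derived from alpha_{K-1}
        if K > 1:
            f[K] = S[K] / (a * S[K - 1]) if S[K - 1] != 0 else 1.0
    return f
```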
Research Method
• The method has been validated on 15 artificial and 12 benchmark data sets.
• The 12 benchmark data sets come from the UCI Repository of Machine Learning Databases.
• The 15 artificial data sets provide representative samples of many commonly occurring distributions.
Sample
[Figures: sample data sets, not reproduced]
Recommendation Example
if f(X) < 0.85, K = X; else K = 1
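In code, this recommendation rule could look like the sketch below, reusing the hypothetical f_values() helper from earlier; the 0.85 threshold comes from this slide, and returning multiple candidate values of K matches the paper's claim that several values of K may be suggested.

```python
def recommend_K(X, max_K=10, threshold=0.85):
    """Suggest every K whose f(K) falls below the threshold; if none does,
    recommend K = 1 (no strong clustering structure)."""
    f = f_values(X, max_K)                 # hypothetical helper from the earlier sketch
    candidates = [K for K, v in sorted(f.items()) if K > 1 and v < threshold]
    return candidates if candidates else [1]
```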
Conclusion
• The new method is closely related to the k-means clustering approach because it takes into account information reflecting the performance of the algorithm.
• The proposed method can suggest multiple values of K to users for cases where different clustering results could be obtained with various required levels of detail.
• This method is computationally expensive if used with large data sets.
Improvements
• The paper does not mention how to calculate the threshold (e.g., f(X) < 0.85); if we have many data sets, we could apply a learning algorithm to determine the threshold.
• The experimental data sets are rather biased: they are too ideal and do not reflect the complexity of real data at all. One option would be to evaluate on randomly chosen data.
• It is also an important issue to know the range, or the maximum value, of K.
Do you have any questions?
Thank you