Presenter: Cheng-Han Tsai. Authors: Liang Bai, Jiye Liang, Chuangyin Dang. KBS, 2011
![Page 1: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/1.jpg)
Intelligent Database Systems Lab
國立雲林科技大學 (National Yunlin University of Science and Technology)
An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data
Presenter: Cheng-Han Tsai
Authors: Liang Bai, Jiye Liang, Chuangyin Dang
KBS, 2011
![Page 2: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/2.jpg)
N.Y.U.S.T.
I. M.
Outline
· Motivation
· Objectives
· Methodology
· Experiments
· Conclusions
· Comments
![Page 3: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/3.jpg)
Motivation
· The k-modes algorithm is sensitive to the choice of initial cluster centers and requires the number of clusters to be specified in advance.
· There is no guarantee that the number of clusters we select is the best one.
![Page 4: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/4.jpg)
Objectives
· To propose an initialization method that finds both the initial cluster centers and the number of clusters.
· The method should handle large categorical data sets efficiently, in linear time.
![Page 5: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/5.jpg)
Methodology
[Flowchart, boxes in order: Data set; Construct a potential exemplars set S; Set the estimated number of clusters; K-modes-type algorithm; The clustering result]
![Page 6: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/6.jpg)
Methodology
· The k-modes algorithm
· Hamming distance: the number of differences between two codes (using XOR). Example:
        10001001
    XOR 10110001
    ------------
        00111000 → Hamming distance = 3
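The XOR example and the k-modes step can be sketched in Python as follows. This is a minimal illustration of plain k-modes with a simple matching distance, not the authors' implementation; the function names (`hamming_bits`, `matching_distance`, `mode_of`, `k_modes`) and the toy data are assumptions made for this sketch, and the initial centers are simply passed in by the caller.

```python
from collections import Counter


def hamming_bits(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings, via XOR."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def matching_distance(x, y) -> int:
    """Number of attributes on which two categorical objects differ."""
    return sum(u != v for u, v in zip(x, y))


def mode_of(rows):
    """Attribute-wise most frequent value of a non-empty list of objects."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, centers, max_iter=100):
    """Plain k-modes: assign each object to its nearest mode, then update modes."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: matching_distance(x, centers[j]))
            for x in data
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mode_of(members)
    return labels, centers


# Hypothetical toy data: two categorical attributes per object.
data = [
    ("red", "small"), ("red", "small"), ("red", "large"),
    ("blue", "large"), ("blue", "large"),
]
labels, centers = k_modes(data, centers=[data[0], data[4]])
print(hamming_bits("10001001", "10110001"), labels, centers)
```

Because the result depends on the `centers` argument, running the sketch with different starting centers is a quick way to see the sensitivity that motivates the paper.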
![Page 7: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/7.jpg)
Methodology
· New cluster centers initialization method
· Finding the number of clusters
![Page 8: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/8.jpg)
Methodology
· New cluster centers initialization method
![Page 9: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/9.jpg)
Methodology
![Page 10: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/10.jpg)
Methodology
![Page 11: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/11.jpg)
Methodology
![Page 12: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/12.jpg)
Methodology
· Finding the number of clusters
─ We need to input a value k’, which is an estimated number of clusters
─ If k’ cannot be determined, we set k’ = |S|
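The fallback rule for k’ can be written as a small helper. This is only a sketch of the rule stated on the slide; the function name `resolve_k_prime`, its parameters, and the placeholder exemplar values are hypothetical.

```python
def resolve_k_prime(exemplars, k_estimate=None):
    """Return the estimated number of clusters k'; fall back to |S| if none is given."""
    if k_estimate is None:
        return len(exemplars)  # k' = |S|
    if k_estimate < 1:
        raise ValueError("the estimated number of clusters must be positive")
    return k_estimate


# Placeholder exemplar set S with three members.
S = ["s1", "s2", "s3"]
print(resolve_k_prime(S))      # no estimate available, so k' = |S|
print(resolve_k_prime(S, 2))   # user-supplied estimate
```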
![Page 13: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/13.jpg)
Methodology
![Page 14: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/14.jpg)
Methodology
![Page 15: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/15.jpg)
Methodology
· More than one knee point of the function P(k)
· More than one peak of the function C(k)
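When a function of k has several peaks, each peak is a candidate for the number of clusters. The sketch below covers only the peak case and treats C(k) as a list of already-computed values; the definition of C(k) is not given in this transcript, so the numbers used here are made up, and `local_peaks` is a hypothetical helper name.

```python
def local_peaks(values, k_start=1):
    """Return the k values at which the sequence is strictly higher than both neighbours."""
    return [
        k_start + i
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]


# Made-up values of C(k) for k = 2, 3, ..., 8 (not taken from the paper).
c_values = [0.1, 0.5, 0.2, 0.3, 0.9, 0.4, 0.1]
candidates = local_peaks(c_values, k_start=2)
print(candidates)
```

With these made-up values the function reports two candidate values of k, which matches the situation the slide describes: the user still has to choose among several candidates.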
![Page 16: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/16.jpg)
Experiments
· Performance analysis
─ Soybean data (4 diseases)
─ Lung cancer data (3 classes)
─ Zoo data (7 classes, made up of 3 big clusters and 4 small clusters)
─ Mushroom data (2 classes)
· Scalability analysis
![Page 17: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/17.jpg)
Experiments
· Performance analysis
![Page 18: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/18.jpg)
Experiments
![Page 19: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/19.jpg)
Experiments
· Scalability analysis
─ 67557 data points and 42 categorical attributes
![Page 20: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/20.jpg)
Conclusions
· The proposed method is effective and efficient at obtaining good initial cluster centers and the number of clusters
· Its time complexity has been analyzed and is linear
![Page 21: Presenter : Cheng-Han Tsai Authors : Liang Bai , Jiye Liang, Chuangyin Dang KBS, 2011](https://reader033.vdocuments.site/reader033/viewer/2022061607/568132b9550346895d9973b2/html5/thumbnails/21.jpg)
Comments
· Advantages
─ Improves on the earlier approach to setting the two parameters (the initial cluster centers and the number of clusters)
· Applications
─ Data clustering