Wine Clustering
Ling Lin
Contents
❏ Motivation
❏ Data
❏ Dimensionality Reduction: MDS, Isomap
❏ Clustering: Kmeans, Ncut, Ratio Cut, SCC
❏ Conclusion
❏ Reference
Motivation
• Clustering is a main task of exploratory data mining:
market segmentation and marketing strategies, document clustering, targeting appropriate treatment to patients with similar response patterns, image segmentation
• Apply clustering methods to a real dataset
Data
➢ Wine data
Source of the data set: "Machine Learning Repository", University of California, Irvine.
Sample size: 13 variables and 178 observations in 3 classes (different cultivars).
Variables:
1) Alcohol 2) Malic acid 3) Ash 4) Alcalinity of ash 5) Magnesium 6) Total phenols
7) Flavanoids 8) Nonflavanoid phenols 9) Proanthocyanins 10) Color intensity 11) Hue
12) OD280/OD315 of diluted wines 13) Proline
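The slides pull this data from the UCI repository directly; as a minimal sketch, the same data ships with scikit-learn (assuming scikit-learn is installed):

```python
# Load the UCI wine data via scikit-learn's bundled copy.
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target   # 178 observations, 13 variables
print(X.shape)                  # (178, 13)
print(sorted(set(y)))           # three cultivar classes: [0, 1, 2]
```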
MDS
Can I separate objects better? → Change the way the distances are computed:
• City-block (L1) Distance
• Chebychev Distance
• Cosine Distance
• Mahalanobis Distance
Distances
• Euclidean Distance: straight-line distance between two points.
• City-block Distance (L1 Distance): sum of the absolute differences of two points across all coordinate dimensions.
• Chebychev Distance (Chessboard Distance): the greatest absolute difference of two points in any single coordinate dimension.
• Cosine Distance: one minus the cosine of the angle between two vectors.
• Mahalanobis Distance: the dissimilarity of two vectors u and v, d(u, v) = √((u − v)ᵀ S⁻¹ (u − v)), where S is the covariance matrix.
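These definitions can be checked numerically; a minimal sketch using SciPy's distance functions (the sample vectors are made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import (cityblock, chebyshev, cosine,
                                    euclidean, mahalanobis)

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])

print(euclidean(u, v))   # straight-line distance: sqrt(1 + 4 + 9)
print(cityblock(u, v))   # L1: |1-2| + |2-4| + |3-6| = 6
print(chebyshev(u, v))   # greatest coordinate difference: 3
print(cosine(u, v))      # 1 - cos(theta); ~0 here since u and v are parallel

# Mahalanobis needs the inverse covariance matrix S^-1 of some data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
VI = np.linalg.inv(np.cov(X.T))
print(mahalanobis(u, v, VI))
```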
[Figure: right triangle with legs a and b, hypotenuse c, and angle θ between the vectors]
Euclidean Distance = c
City-block Distance = a + b
Chebychev Distance = max(a, b) = a
Cosine Distance = 1 − cos(θ)
[Figure: MDS embeddings of the wine data in 3D and in 2D]
Isomap
[Figure: Isomap embeddings of the wine data with Cosine and Mahalanobis distances]
Kmeans Clustering
Error rate = 0.03
[Figure: true labels vs. Kmeans clustering assignments]
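A minimal sketch of the k-means step, assuming scikit-learn and standardized features. The 0.03 error rate is the slide's own result; exact numbers depend on preprocessing and initialisation:

```python
# K-means on the wine data, scored against the true cultivar labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Map each cluster to its majority true class, then count disagreements.
mapped = np.empty_like(labels)
for k in range(3):
    mapped[labels == k] = np.bincount(y[labels == k]).argmax()
error_rate = float(np.mean(mapped != y))
print(round(error_rate, 2))   # on the standardized data, roughly 0.03
```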
Clustering Comparison
[Figure: clustering results for Normalized Cut, Ratio Cut, and SCC]
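As a hedged sketch of the normalized-cut comparison: scikit-learn's SpectralClustering implements the normalized-cut relaxation (ratio cut and SCC would need custom code, and the affinity parameters here are arbitrary assumptions, not the slides' settings):

```python
# Normalized-cut-style spectral clustering on the wine data.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

ncut = SpectralClustering(n_clusters=3, affinity="rbf", gamma=0.1,
                          assign_labels="kmeans", random_state=0)
labels = ncut.fit_predict(X)
print(labels.shape)   # (178,)
```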
Conclusion
• Dimensionality Reduction: different methods for calculating distances and reducing dimension give different displays of the wine data.

            Cosine Distance    Mahalanobis Distance
  3D MDS          ✓                     ✗
  2D MDS          ✓                     ✗

• Isomap gives the Mahalanobis distance a better display.
Conclusion
• Clustering: Kmeans = Rcut → SCC → Ncut
• Ncut and Rcut consider both inter- and intra-cluster connections; however, in this dataset the intra-cluster connections are weak.