Upload: kylee-moxham

Post on 31-Mar-2015


Page 1: Data Set used. K Means K Means Clusters 1.K Means begins with a user specified amount of clusters 2.Randomly places the K centroids on the data set 3.Finds

Data Set used

K Means

K Means Clusters

1. K-means begins with a user-specified number of clusters, K.

2. Randomly places the K centroids on the data set.

3. Finds all the points closest to each centroid and makes them clusters.

4. Changes the centroid of each cluster to the mean of the subset of points.

5. Repeats steps 3 and 4 until the change of the centroids is minimal.
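The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the presenters' Matlab code: the function name `kmeans`, the parameters `max_iter`, `tol` and `seed`, and the choice to start from K randomly chosen data points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster points (tuples of floats) into k groups.

    Returns (centroids, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    # Step 2: use k distinct data points as the initial centroids
    # (one common way to "randomly place" them; an assumption here).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
            else:
                mean = centroids[j]  # empty cluster: leave the centroid in place
            new_centroids.append(mean)
        # Step 5: stop once no centroid moved more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random start.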

K-Means Implementation Issues

• If K is too small, the algorithm did not converge (no stable clusters)

  - Further investigation of this is needed

• If K is too large, some clusters were null (empty)
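One common remedy for null (empty) clusters is to move each empty centroid onto a point that is poorly served by its current centroid. The sketch below assumes that approach; the helper name `reseed_empty_clusters` is hypothetical and is not taken from the slides.

```python
import math


def reseed_empty_clusters(points, centroids, labels):
    """Return a copy of centroids with every empty cluster moved onto a data point.

    Empty clusters take the points that lie farthest from their assigned
    centroid, farthest first.
    """
    used = set(labels)
    empty = [j for j in range(len(centroids)) if j not in used]
    # Rank point indices by distance to their own centroid, farthest first.
    order = sorted(
        range(len(points)),
        key=lambda i: math.dist(points[i], centroids[labels[i]]),
        reverse=True,
    )
    new_centroids = list(centroids)
    for j, i in zip(empty, order):
        new_centroids[j] = points[i]
    return new_centroids


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
centroids = [(0.5, 0.0), (100.0, 100.0)]  # the second centroid attracts no points
labels = [0, 0, 0]
print(reseed_empty_clusters(points, centroids, labels))
```

Here the unused centroid is moved to (10.0, 0.0), the point farthest from the centroid it was assigned to.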

K-Means Matlab code

Ease of Doing Business vs Paying Taxes

Interesting case

• The border points are clearly defined by distance, not density

• We ask for each point “What is the closest centroid?”
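The "closest centroid" question can be written as a one-line function. The example below is an illustrative sketch with made-up centroids; the name `closest_centroid` is an assumption. With two centroids the border is the perpendicular bisector between them, regardless of how densely the data is packed on either side.

```python
import math


def closest_centroid(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


centroids = [(0.0, 0.0), (10.0, 0.0)]
# The border between the two clusters is the line x = 5.
print(closest_centroid((4.9, 3.0), centroids))  # 0
print(closest_centroid((5.1, 3.0), centroids))  # 1
```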

Why we like it

• It is relatively straightforward in concept and implementation

• Good for globular data

• We can specify the number of clusters

Why we don’t like it

• Subject to initialization problems: different starting centroids can give different results.

• Not good for non-globular data (but can find clusters given a large enough K)

• Sensitive to outliers (cleaning the data set helps)

• Data must have the notion of a “center”
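The outlier sensitivity comes from using the mean as the cluster center. A small worked example with made-up one-dimensional values shows how a single extreme point drags the mean while the median barely moves:

```python
from statistics import mean, median

cluster = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = cluster + [100.0]

print(mean(cluster), median(cluster))                      # 3.0 3.0
print(round(mean(with_outlier), 2), median(with_outlier))  # 19.17 3.5
```

The mean jumps from 3.0 to about 19.17, a position where no data point lies, which is why cleaning the data set first helps.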

Variations

• Bisecting K-means

• K-median

• K-medoid

• Several others

DBSCAN Algorithm

Pick a point P and find the distance Dist from P to every other point P'.

If (Dist < K_Factor), P' is in the same cluster as P.

Else if (Dist = K_Factor), P' is a border point.

Else, assign P' to a new cluster.
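One possible reading of this rule is sketched below: clusters grow outward from a starting point, pulling in every point closer than the K factor to any point already in the cluster. The function name `threshold_clusters` and the handling of border points are assumptions. Note that this follows the simplified rule on the slide; standard DBSCAN additionally requires a minimum number of neighbors before a point can extend a cluster.

```python
import math
from collections import deque


def threshold_clusters(points, k_factor):
    """Label points so that points closer than k_factor share a cluster.

    Returns (labels, border), where border holds the indices of points that
    lie exactly k_factor away from some other point.
    """
    labels = [None] * len(points)
    border = set()
    next_label = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue  # already absorbed into an earlier cluster
        labels[start] = next_label
        queue = deque([start])
        while queue:
            p = queue.popleft()
            for q in range(len(points)):
                d = math.dist(points[p], points[q])
                if q != p and d == k_factor:
                    border.add(q)  # exactly on the threshold: recorded, not pulled in
                elif d < k_factor and labels[q] is None:
                    labels[q] = next_label  # same cluster as P
                    queue.append(q)
        next_label += 1  # points still unlabeled will start new clusters
    return labels, border


points = [(0.0, 0.0), (6.0, 0.0), (12.0, 0.0), (100.0, 0.0), (104.0, 0.0)]
labels, border = threshold_clusters(points, k_factor=10)
print(labels)  # [0, 0, 0, 1, 1]
```

The first and third points are 12 apart, more than the K factor, yet they share a cluster because the second point links them.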

SNAPSHOTS

For K_Factor = 20

For K_Factor = 10

For K_Factor = 120

Calculation of K-Factor

Issues faced

When adding a new point P' to the present cluster, the whole cluster of P' has to be merged with the present cluster.

No lower bound on number of clusters.

Choice of K Factor
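The merging issue, where adding P' means absorbing the whole cluster P' already belongs to, is the kind of problem a disjoint-set (union-find) structure handles cheaply. The sketch below is one way to do it, not the presenters' approach; the class name `DisjointSet` and the sample points are assumptions.

```python
import math


class DisjointSet:
    """Union-find over indices 0..n-1; each set is one cluster."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        """Return the representative of the cluster containing i."""
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a, b):
        """Merge the whole cluster of b into the cluster of a."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


points = [(0.0, 0.0), (6.0, 0.0), (100.0, 0.0), (104.0, 0.0), (12.0, 0.0)]
k_factor = 10
ds = DisjointSet(len(points))
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        if math.dist(points[i], points[j]) < k_factor:
            ds.union(i, j)

roots = [ds.find(i) for i in range(len(points))]
print(roots)  # [0, 0, 2, 2, 0]
```

Each merge changes a single parent pointer, so no cluster has to be relabeled point by point.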

Further Enhancements

• Calculation for K-Factor and clustering could be integrated together.

• Dynamic programming could be used, since many computations are repeated.

• Static vs Dynamic data
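As a hedged illustration of reusing repeated computations, the sketch below computes every pairwise distance once and looks it up afterwards. Strictly this is precomputation (a lookup table) rather than dynamic programming in the textbook sense, and it suits static data; the names `pairwise_distances` and `lookup` are assumptions.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Compute each pairwise distance once, keyed by (i, j) with i < j."""
    return {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


def lookup(table, i, j):
    """Distance between points i and j, read from the precomputed table."""
    if i == j:
        return 0.0
    return table[(i, j) if i < j else (j, i)]


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
table = pairwise_distances(points)
print(lookup(table, 1, 0))  # 5.0
print(lookup(table, 0, 2))  # 10.0
```

The table holds n(n-1)/2 entries, so memory grows quadratically with the number of points; for dynamic data, new points would need their distances added to the table.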