determining the k in k-means with mapreduce

26
Determining the k in k-means with MapReduce Thibault Debatty, Pietro Michiardi, Wim Mees & Olivier Thonnard Algorithms for MapReduce and Beyond 2014

Upload: thibault-debatty

Post on 11-May-2015

460 views

Category:

Science


2 download

DESCRIPTION

Presentation given at Algorithms for MapReduce and Beyond, 2014-03-28

TRANSCRIPT

Page 1: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce

Thibault Debatty, Pietro Michiardi,Wim Mees & Olivier Thonnard

Algorithms for MapReduce and Beyond 2014

Page 2: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 2

Clustering & k-means

● Clustering● K-means

[Stuart P. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28:129–137, 1982.]

– 1982 (a great year!)– But still largely used– Drawbacks (amongst others):

● Local minimum● K is a parameter!

Page 3: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 3

Clustering & k-means

● Determine k:– VERY difficult

[Anil K Jain. Data Clustering : 50 Years Beyond K-Means. Pattern Recognition Letters, 2009]

– Using cluster evaluation metrics:Dunn's index, Elbow, Silhouette, “jump method” (based on information theory), “Gap statistic”,...

O(k²)

Page 4: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 4

G-means

● G-means[Greg Hamerly and Charles Elkan. Learning the k in k-means. In Neural Information Processing Systems. MIT Press, 2003]

● K-means : points in each cluster are spherically distributed around the center

Source: scikit-learn

Page 5: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 5

G-means

● G-means[Greg Hamerly and Charles Elkan. Learning the k in k-means. In Neural Information Processing Systems. MIT Press, 2003]

● K-means : points in each cluster are spherically distributed around the center

normality test & recursion

Page 6: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 6

G-means

Dataset

Page 7: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 7

G-means

1. Pick 2 centers

Page 8: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 8

G-means

2. k-means

Page 9: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 9

G-means

3. Project

Page 10: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 10

G-means

3. Project

Page 11: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 11

G-means

Normal?No=> recursion

4. Normality test

Page 12: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 12

G-means

5. Recursion

Page 13: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 13

MapReduce G-means

● Challenges:

1. Reduce I/O operations

2. Reduce number of jobs

3. Maximize parallelism

4. Limit memory usage

Page 14: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 14

MapReduce G-means

● Challenges:

1. Reduce I/O operations

2. Reduce number of jobs

3. Maximize parallelism

4. Limit memory usage

Page 15: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 15

MapReduce G-means

2. Reduce number of jobs

PickInitialCenters

while Not ClusteringCompleted do

KMeans

KMeansAndFindNewCenters

TestClusters

end while

Page 16: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 16

MapReduce G-means

TestClusters

Map(key, point)Find clusterFind vectorProject point on vectorEmit(cluster, projection)

end procedure

Reduce(cluster, projections)Build a vectorADtest(vector)if normal then

Mark clusterend if

end procedure

3. Maximize parallelism

4. Limit memory usage

Page 17: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 17

MapReduce G-means

TestClusters

Map(key, point)Find clusterFind vectorProject point on vectorEmit(cluster, projection)

end procedure

Reduce(cluster, projections)Build a vectorADtest(vector)if normal then

Mark clusterend if

end procedure Bottle

neck

3. Maximize parallelism

4. Limit memory usage (risk of crash)

Page 18: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 18

MapReduce G-means

TestClusters

Map(key, point)Find clusterFind vectorProject point on vectorEmit(cluster, projection)

end procedure

Reduce(cluster, projections)Build a vectorADtest(vector)if normal then

Mark clusterend if

end procedure

TestFewClusters

Map(key, point)Find clusterFind vectorProject point on vectorAdd projection to list

end procedure

Close()For each list do

Build a vectorA2 = ADtest(vector)Emit(cluster, A2)

End for eachend procedure

In memory combiner

Page 19: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 19

MapReduce G-means

TestClusters

Map(key, point)Find clusterFind vectorProject point on vectorEmit(cluster, projection)

end procedure

Reduce(cluster, projections)Build a vectorADtest(vector)if normal then

Mark clusterend if

end procedure

TestFewClusters

Map(key, point)Find clusterFind vectorProject point on vectorAdd projection to list

end procedure

Close()For each list do

Build a vectorA2 = ADtest(vector)Emit(cluster, A2)

End for eachend procedure

#clusters > #reducers

&

Estimated required memory < Java heap

Page 20: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 20

MapReduce G-means

TestClusters

Map(key, point)Find clusterFind vectorProject point on vectorEmit(cluster, projection)

end procedure

Reduce(cluster, projections)Build a vectorADtest(vector)if normal then

Mark clusterend if

end procedure

TestFewClusters

Map(key, point)Find clusterFind vectorProject point on vectorAdd projection to list

end procedure

Close()For each list do

Build a vectorA2 = ADtest(vector)Emit(cluster, A2)

End for eachend procedure

#clusters > #reducers

&

Estimated required memory < Java heap

Experimentally:64 Bytes / point

Page 21: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 21

Comparison

MR multi-k-means MR G-means

Speed

Quality

all possible values of kin a single job

Page 22: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 22

Comparison

MR multi-k-means MR G-means

Speed O(nk²) computations O(nk) computations

But:● more iterations● more dataset reads● log

2(k)

Quality New centers added if and where needed

But:tends to overestimate k!

Page 23: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 23

Experimental results : Speed

● Hadoop● Synthetic dataset● 10M points in R10

● Euclidean distance● 8 machines

Page 24: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 24

Experimental results : Quality

● Hadoop● Synthetic dataset● 10M points in R10

● Euclidean distance● 8 machines

k 100 200 400

kfound

150 279 639

Within Cluster Sum of Square(less is better)

MR G-means 3.34 3.33 3.23

multi-k-means 3.71 3.6 3.39

(with same k)

x ~1.5

Page 25: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 25

Conclusions & future work...

● MapReduce algorithm to determine k● Running time proportional to k● Future:

– Overestimation of k– Test on real data– Test scalability– Reduce I/O (using Spark)– Consider skewed data– Consider impact of machine failure

Page 26: Determining the k in k-means with MapReduce

Determining the k in k-means with MapReduce 26

Thank you!