Transcript
Page 1: Clustering  Hongfei Yan School of EECS, Peking University 7/8/2009 Refer to Aaron Kimball’s slides

Clustering

http://net.pku.edu.cn/~course/cs402/2009

Hongfei Yan, School of EECS, Peking University

7/8/2009

Refer to Aaron Kimball’s slides

Page 2:

Google News

• They didn’t pick all 3,400,217 related articles by hand…

• Or Amazon.com
• Or Netflix…

Page 3:

Other less glamorous things...

• Hospital Records
• Scientific Imaging
  – Related genes, related stars, related sequences
• Market Research
  – Segmenting markets, product positioning
• Social Network Analysis
• Data mining
• Image segmentation…

Page 4:

The Distance Measure

• How the similarity of two elements in a set is determined, e.g.
  – Euclidean Distance
  – Manhattan Distance
  – Inner Product Space
  – Maximum Norm
  – Or any metric you define over the space…
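Three of the metrics above can be sketched in plain Python (the function names are mine, not from the slides):

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum_norm(a, b):
    # Chebyshev / maximum norm: the largest single-coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))
```

Any function that satisfies the metric axioms (non-negativity, symmetry, triangle inequality) can play this role.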

Page 5:

Types of Algorithms

• Hierarchical Clustering vs.
• Partitional Clustering

Page 6:

Hierarchical Clustering

• Builds or breaks up a hierarchy of clusters.

Page 7:

Partitional Clustering

• Partitions the set into all clusters simultaneously.

Page 8:

Page 9:

K-Means Clustering

• Simple Partitional Clustering
• Choose the number of clusters, k
• Choose k points to be cluster centers
• Then…

Page 10:

K-Means Clustering

iterate {
    compute the distance from all points to all k centers
    assign each point to the nearest center
    compute the average of all points assigned to each center
    replace the k centers with the new averages
}
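The loop above can be written out as a minimal, single-machine sketch (names and defaults are mine):

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center,
    then recompute each center as the mean of its points."""
    for _ in range(iterations):
        # Assign each point to the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Replace each center with the component-wise average of its points
        # (keeping the old center if a cluster ends up empty).
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster
            else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers
```

For example, starting from centers (0,0) and (10,10) on two obvious groups, the loop converges to the group means after one pass.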

Page 11:

But!

• The complexity is pretty high:
  – k × n × O(distance metric) × num(iterations)

• Moreover, it can be necessary to send tons of data to each Mapper Node. Depending on your bandwidth and memory available, this could be impossible.

Page 12:

Furthermore

• There are three big ways a data set can be large:
  – There are a large number of elements in the set.
  – Each element can have many features.
  – There can be many clusters to discover.

• Conclusion – Clustering can be huge, even when you distribute it.

Page 13:

Canopy Clustering

• Preliminary step to help parallelize computation.

• Clusters data into overlapping canopies using a super cheap distance metric.

• Efficient
• Accurate

Page 14:

Canopy Clustering

while there are unmarked points {
    pick a point which is not strongly marked; call it a canopy center
    mark all points within some threshold of it as in its canopy
    strongly mark all points within some stronger threshold
}
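A single-machine sketch of this loop, assuming two thresholds T1 > T2 (T1 for canopy membership, T2 for "strongly marked"; function and variable names are mine):

```python
def canopy(points, cheap_dist, t1, t2):
    """Canopy clustering sketch. Points within t1 of a center join its
    canopy; points within the tighter t2 are strongly marked and can no
    longer become centers. Canopies may overlap."""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]  # pick an unmarked point as a canopy center
        # mark: everything within t1 belongs to this canopy
        members = [p for p in points if cheap_dist(p, center) < t1]
        canopies.append((center, members))
        # strongly mark: drop everything within t2 from consideration
        remaining = [p for p in remaining if cheap_dist(p, center) >= t2]
    return canopies
```

On 1-D points [0, 1, 8, 9] with absolute difference as the cheap metric, T1=3, T2=2, this yields two canopies: one around 0 containing {0, 1} and one around 8 containing {8, 9}.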

Page 15:

After the canopy clustering…

• Resume hierarchical or partitional clustering as usual.

• Treat objects in separate canopies as being at infinite distance (they are never compared).

Page 16:

MapReduce Implementation:

• Problem – Efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure.

Page 17:

The Distance Metric

• The Canopy Metric ($)
• The K-Means Metric ($$$)

Page 18:

Steps!

• Get Data into a form you can use (MR)
• Picking Canopy Centers (MR)
• Assign Data Points to Canopies (MR)
• Pick K-Means Cluster Centers
• K-Means algorithm (MR)
  – Iterate!

Page 19:

Data Massage

• This isn’t interesting, but it has to be done.

Page 20:

Selecting Canopy Centers
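The slides for this step are figure walkthroughs that the transcript does not capture. One common way to parallelize it (an assumption on my part, not stated in the slides): each mapper greedily picks canopy centers from its own input split, and a single reducer runs the same greedy pass over all candidate centers to de-duplicate them. All names below are illustrative:

```python
def canopy_centers(points, cheap_dist, t2):
    """Greedy center selection: a point becomes a new center unless it
    is within the strong threshold T2 of an already-chosen center."""
    centers = []
    for p in points:
        if all(cheap_dist(p, c) >= t2 for c in centers):
            centers.append(p)
    return centers

def map_phase(split, cheap_dist, t2):
    # Each mapper picks local canopy centers from its input split.
    return canopy_centers(split, cheap_dist, t2)

def reduce_phase(all_candidates, cheap_dist, t2):
    # A single reducer de-duplicates the candidates from every mapper.
    return canopy_centers(all_candidates, cheap_dist, t2)
```

Candidates from different mappers that fall within T2 of each other collapse into one center in the reduce step.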

Pages 21-35: (figure-only slides; no text captured in the transcript)
Page 36:

Assigning Points to Canopies
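The figures for this step are not in the transcript. The essential logic is a map step that tags each point with every canopy it falls inside (canopies overlap, so a point may get several tags); this hypothetical sketch assumes the centers and the loose threshold T1 are broadcast to every mapper:

```python
def assign_to_canopies(points, centers, cheap_dist, t1):
    """Map step: for each point, record every canopy center within the
    loose threshold T1. A point may belong to several canopies."""
    assignments = {}
    for p in points:
        assignments[p] = [c for c in centers if cheap_dist(p, c) < t1]
    return assignments
```

A point equidistant from two centers simply ends up listed under both canopies.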

Pages 37-41: (figure-only slides; no text captured in the transcript)
Page 42:

K-Means Map
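The slides here are figures, but one k-means iteration maps naturally onto MapReduce: the map step emits (nearest-center, point) pairs, and the reduce step averages the points that landed on each center. A minimal sketch, ignoring the canopy optimization (function names are mine):

```python
import math

def kmeans_map(point, centers):
    # Map: emit (index of the nearest center, point).
    nearest = min(range(len(centers)),
                  key=lambda i: math.dist(point, centers[i]))
    return (nearest, point)

def kmeans_reduce(assigned_points):
    # Reduce: the replacement center is the component-wise mean
    # of all points assigned to this center.
    return tuple(sum(c) / len(assigned_points) for c in zip(*assigned_points))
```

One full iteration is one MapReduce job; the driver feeds the reduced centers back in as input to the next job.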

Pages 43-49: (figure-only slides; no text captured in the transcript)
Page 50:

Elbow Criterion

• Choose a number of clusters s.t. adding a cluster doesn’t add interesting information.

• Rule of thumb to determine what number of clusters should be chosen.

• Initial assignment of cluster seeds has bearing on final model performance.

• Often required to run clustering several times to get maximal performance.
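The quantity usually plotted against k to find the elbow is the within-cluster sum of squares (an assumption; the slides don't name a specific score). A tiny sketch:

```python
import math

def wcss(points, centers):
    """Within-cluster sum of squares: each point contributes the squared
    distance to its nearest center. Plot this for k = 1, 2, 3, ... and
    pick the k where the curve bends (the 'elbow')."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
```

As k grows, WCSS always decreases; past the elbow, each extra cluster buys only a small drop, which is the "doesn't add interesting information" signal.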

Page 51:

Clustering Conclusions

• Clustering is slick
• And it can be done super efficiently
• And in lots of different ways

Page 52:

Homework

• Lab 4 - Clustering the Netflix Movie Data
• Hw4 – Read IIR chapter 16
  – Flat clustering

