
Clustering

http://net.pku.edu.cn/~course/cs402/2009

Hongfei Yan
School of EECS, Peking University

7/8/2009

Refer to Aaron Kimball’s slides


Google News

• They didn’t pick all 3,400,217 related articles by hand…

• Or Amazon.com
• Or Netflix…


Other less glamorous things...

• Hospital Records
• Scientific Imaging
  – Related genes, related stars, related sequences
• Market Research
  – Segmenting markets, product positioning
• Social Network Analysis
• Data mining
• Image segmentation…


The Distance Measure
• How the similarity of two elements in a set is determined, e.g.
  – Euclidean Distance
  – Manhattan Distance
  – Inner Product Space
  – Maximum Norm
  – Or any metric you define over the space…
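
The measures listed above can be sketched in a few lines of Python. This is a minimal illustration, assuming each element is a fixed-length tuple of numbers; the function names are made up for this example. Note that the inner product is a similarity (larger means closer), not a distance.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of the absolute per-coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum_norm(a, b):
    # Largest absolute per-coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

def inner_product(a, b):
    # A similarity score rather than a distance: larger means more alike.
    return sum(x * y for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), manhattan(p, q), maximum_norm(p, q))  # 5.0 7.0 4.0
```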


Types of Algorithms

• Hierarchical Clustering vs.
• Partitional Clustering


Hierarchical Clustering

• Builds or breaks up a hierarchy of clusters.
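
A minimal sketch of the "builds" direction (bottom-up, or agglomerative, clustering): start with every point in its own cluster and repeatedly merge the two closest clusters. The single-link rule used here (cluster distance = distance between the closest pair of members) and the stopping rule (a target cluster count) are assumptions for illustration; the slides do not prescribe either.

```python
import math

def single_link(c1, c2):
    # Distance between two clusters = distance between their closest members.
    return min(math.dist(a, b) for a in c1 for b in c2)

def agglomerate(points, target):
    # Start with one cluster per point, then merge the closest pair
    # until only `target` clusters remain.
    if target < 1:
        raise ValueError("target must be at least 1")
    clusters = [[p] for p in points]
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)], 2))
```

Recording each merge instead of stopping at a target count would give the full hierarchy.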


Partitional Clustering

• Partitions the set into all clusters simultaneously.


K-Means Clustering

• Simple Partitional Clustering
• Choose the number of clusters, k
• Choose k points to be cluster centers
• Then…


K-Means Clustering

iterate {
    Compute distance from all points to all k-centers
    Assign each point to the nearest k-center
    Compute the average of all points assigned to each k-center
    Replace the k-centers with the new averages
}
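
The loop above could look like this in plain Python. This is a sketch under stated assumptions, not a reference implementation: points are tuples of numbers, the distance is Euclidean, the initial centers are a random sample of the points, and the loop stops early once the centers stop moving. The names `kmeans` and `mean` are invented for this example.

```python
import math
import random

def mean(points):
    # Coordinate-wise average of a non-empty list of points.
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # choose k points as initial centers
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Replace each center with the average of its points
        # (an empty cluster keeps its old center).
        new_centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:       # converged: nothing moved
            break
        centers = new_centers
    return centers, groups

centers, groups = kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], k=2)
print(sorted(centers))
```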


But!

• The complexity is pretty high:
  – k * n * O(distance metric) * num(iterations)
• Moreover, it can be necessary to send tons of data to each Mapper node. Depending on your available bandwidth and memory, this could be impossible.
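
To get a feel for the complexity term above, here is a back-of-the-envelope estimate. It assumes the distance metric costs time proportional to the number of features per element (as Euclidean distance over dense vectors does); the input sizes below are purely illustrative, not measurements.

```python
def kmeans_distance_ops(n_points, k, n_features, iterations):
    # k * n * O(distance metric) * num(iterations), with the distance
    # metric assumed to touch every feature once.
    return k * n_points * n_features * iterations

# Illustrative sizes only: a million elements, 100 clusters,
# 1,000 features per element, 10 iterations.
print(kmeans_distance_ops(1_000_000, 100, 1_000, 10))  # 1000000000000
```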


Furthermore

• There are three big ways a data set can be large:
  – There are a large number of elements in the set.
  – Each element can have many features.
  – There can be many clusters to discover.
• Conclusion – Clustering can be huge, even when you distribute it.


Canopy Clustering

• Preliminary step to help parallelize computation.

• Clusters data into overlapping canopies using a super cheap distance metric.

• Efficient
• Accurate


Canopy Clustering

While there are unmarked points {
    pick a point which is not strongly marked
    call it a canopy center
    mark all points within some threshold of it as in its canopy
    strongly mark all points within some stronger (tighter) threshold
}
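
A sketch of the canopy loop in Python. Assumptions: `t1` is the loose threshold (canopy membership), `t2` is the tighter one (strong marking), with `t2 <= t1`; Euclidean distance stands in for the cheap metric; and the loop runs until every point is strongly marked, which is one common way to phrase the stopping condition. All names are invented for this example.

```python
import math

def canopy_clusters(points, t1, t2, cheap_distance=math.dist):
    # t1: loose threshold (canopy membership); t2: tight threshold (strong mark).
    if not 0 <= t2 <= t1:
        raise ValueError("need 0 <= t2 <= t1")
    remaining = list(points)     # points that are not yet strongly marked
    canopies = []
    while remaining:
        center = remaining[0]    # pick a point which is not strongly marked
        # Every point within t1 joins this canopy; canopies may overlap.
        members = [p for p in points if cheap_distance(center, p) <= t1]
        canopies.append((center, members))
        # Points within t2 are strongly marked and can no longer be centers.
        remaining = [p for p in remaining if cheap_distance(center, p) > t2]
    return canopies

for center, members in canopy_clusters([(0, 0), (1, 0), (4, 0), (8, 0)], t1=5, t2=2):
    print(center, members)
```

In this small run the point (4, 0) ends up in all three canopies, which shows the overlap.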


After the canopy clustering…

• Resume hierarchical or partitional clustering as usual.

• Treat objects that do not share a canopy as being at infinite distance.
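
One way to express the infinite-distance rule is a wrapper around the expensive metric that is only evaluated when two objects have a canopy in common. This is a sketch: `canopies_of` (a mapping from each object to the set of canopy ids it belongs to) is a hypothetical structure, and Euclidean distance stands in for the expensive metric.

```python
import math

def canopy_distance(a, b, canopies_of, expensive_distance=math.dist):
    # Objects with no canopy in common are treated as infinitely far apart,
    # so the expensive metric is never computed for them.
    if canopies_of[a].isdisjoint(canopies_of[b]):
        return math.inf
    return expensive_distance(a, b)

# Hypothetical canopy membership for three points.
canopies_of = {(0, 0): {0}, (3, 4): {0, 1}, (50, 50): {2}}
print(canopy_distance((0, 0), (3, 4), canopies_of))    # 5.0
print(canopy_distance((0, 0), (50, 50), canopies_of))  # inf
```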


MapReduce Implementation:

• Problem – Efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure.


The Distance Metric

• The Canopy Metric ($)
• The K-Means Metric ($$$)
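
The transcript does not say which concrete metrics were used, so the pair below is only an assumed illustration of "cheap" versus "expensive" for movies with user ratings. Each movie is a hypothetical dict mapping user id to rating. The cheap measure counts shared raters (a set intersection, no arithmetic on ratings); the expensive one is a Euclidean distance over every user who rated either movie, with a missing rating counted as 0.

```python
import math

def shared_raters(movie_a, movie_b):
    # Cheap ($): how many users rated both movies. More overlap = more similar.
    return len(movie_a.keys() & movie_b.keys())

def rating_distance(movie_a, movie_b):
    # Expensive ($$$): Euclidean distance over all users who rated either
    # movie; a missing rating is treated as 0 (an assumption of this sketch).
    users = movie_a.keys() | movie_b.keys()
    return math.sqrt(sum((movie_a.get(u, 0) - movie_b.get(u, 0)) ** 2 for u in users))

a = {1: 5, 2: 3}   # user id -> rating
b = {2: 3, 3: 4}
print(shared_raters(a, b), rating_distance(a, b))
```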


Steps!

• Get data into a form you can use (MR)
• Pick canopy centers (MR)
• Assign data points to canopies (MR)
• Pick K-Means cluster centers
• K-Means algorithm (MR)
  – Iterate!
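
A single K-Means iteration maps naturally onto a map step and a reduce step. The sketch below simulates that in plain Python rather than using any Hadoop API: the map function emits (nearest center id, point), an in-memory dict plays the role of the shuffle, and the reduce function averages the points for each center. Function names are invented, and Euclidean distance is assumed.

```python
import math
from collections import defaultdict

def kmeans_map(point, centers):
    # Map: emit (id of the nearest center, point).
    nearest = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return nearest, point

def kmeans_reduce(center_id, points):
    # Reduce: the new center is the average of the points assigned to it.
    n = len(points)
    return center_id, tuple(sum(coords) / n for coords in zip(*points))

def kmeans_iteration(points, centers):
    # Stand-in for the shuffle phase: group mapped values by key.
    shuffled = defaultdict(list)
    for p in points:
        center_id, value = kmeans_map(p, centers)
        shuffled[center_id].append(value)
    new_centers = list(centers)  # centers with no points keep their position
    for center_id, assigned in shuffled.items():
        _, new_centers[center_id] = kmeans_reduce(center_id, assigned)
    return new_centers

print(kmeans_iteration([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)]))
```

A driver would call `kmeans_iteration` repeatedly, feeding each result back in as the next set of centers.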


Data Massage

• This isn’t interesting, but it has to be done.
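
As an illustration of this kind of preprocessing, the sketch below turns flat rating records into one rating vector per movie. The input format (`movie_id,user_id,rating` per line) is a hypothetical choice for this example, not the layout of any particular data set.

```python
from collections import defaultdict

def parse_rating(line):
    # Hypothetical record format: "movie_id,user_id,rating".
    movie_id, user_id, rating = line.strip().split(",")
    return movie_id, (user_id, float(rating))

def group_by_movie(lines):
    # Collect every (user, rating) pair under its movie id.
    movies = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue             # skip blank lines
        movie_id, (user_id, rating) = parse_rating(line)
        movies[movie_id][user_id] = rating
    return dict(movies)

print(group_by_movie(["m1,u1,5", "m1,u2,3", "m2,u1,4"]))
```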


Selecting Canopy Centers


Assigning Points to Canopies


K-Means Map


Elbow Criterion

• Choose a number of clusters s.t. adding a cluster doesn’t add interesting information.
• Rule of thumb to determine what number of clusters should be chosen.
• Initial assignment of cluster seeds has bearing on final model performance.
• Often required to run clustering several times to get maximal performance.
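
A sketch of how the elbow could be looked for in practice. It assumes "interesting information" is measured by the within-cluster sum of squared distances (a common choice, not one the slides specify), runs a compact K-Means several times per k from random seeds, and keeps the best cost for each k. All names are invented for this example.

```python
import math
import random

def kmeans_cost(points, k, rng, iterations=20):
    # Run a basic K-Means from random seeds and return the within-cluster
    # sum of squared distances of the final centers.
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: math.dist(p, centers[i]))].append(p)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

def elbow_curve(points, max_k, restarts=5, seed=0):
    # For each k, keep the best cost over several random restarts,
    # since the initial seeds influence the final result.
    rng = random.Random(seed)
    return {k: min(kmeans_cost(points, k, rng) for _ in range(restarts))
            for k in range(1, max_k + 1)}

points = [(0, 0), (0, 2), (10, 10), (10, 12)]
for k, cost in elbow_curve(points, 3).items():
    print(k, round(cost, 3))
```

On this toy data the cost falls sharply from k = 1 to k = 2 and barely moves afterwards, so k = 2 is the elbow.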


Clustering Conclusions

• Clustering is slick
• And it can be done super efficiently
• And in lots of different ways


Homework

• Lab 4 – Clustering the Netflix Movie Data
• Hw4 – Read IIR chapter 16
  – Flat clustering