fall 2018: introduction to data scienceusers.cis.fiu.edu/~giri/teach/5768/f18/lecs/unitx3... ·...

Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU


Page 1: Fall 2018: Introduction to Data Scienceusers.cis.fiu.edu/~giri/teach/5768/F18/lecs/UnitX3... · 2018-10-11 · Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU. Giri Narasimhan


Clustering

6/26/18



Clustering dogs using height & weight




Clustering

• Clustering is the process of forming clusters that put similar things together in the same cluster …
• … and put dissimilar things into different clusters
• Needs a similarity (or distance) function
❑ Convenient to map items to points in space


Distance Functions

• Jaccard Distance
• Hamming Distance
• Euclidean Distance
• Cosine Distance
• Edit Distance
• …

• What is a distance function?
❑ Non-negativity: D(x,y) >= 0
❑ Symmetry: D(x,y) = D(y,x)
❑ Triangle inequality: D(x,y) <= D(x,z) + D(z,y)
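
A minimal Python sketch of the five named distances may help make the list concrete. The function names are illustrative, not from the slides. Two conventions are assumed here and others exist: cosine distance is taken as the angle between the vectors, and edit distance is taken as Levenshtein distance (insertions, deletions, substitutions).

```python
import math

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def hamming_distance(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(u, v))

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def cosine_distance(p, q):
    """Angle in degrees between two non-zero vectors (assumed convention)."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    cos = max(-1.0, min(1.0, dot / norm))   # guard against rounding error
    return math.degrees(math.acos(cos))

def edit_distance(s, t):
    """Levenshtein distance computed row by row (assumed convention)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # delete from s
                           cur[j - 1] + 1,             # insert into s
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]
```

The three properties can be spot-checked on sample inputs, for example by confirming that `euclidean_distance(x, y) <= euclidean_distance(x, z) + euclidean_distance(z, y)` for a few triples.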


Clustering Strategies

• Hierarchical or Agglomerative
❑ Bottom-up
• Partitioning methods
❑ Top-down
• Density-based
• Cluster-based
• Iterative methods


Curse of Dimensionality

• N points in d-dimensional space
❑ If d = 1, then average distance = 1/3 (for points chosen uniformly at random in the unit interval)
❑ As d gets larger, what is the average distance? What is the distribution of distances?
▪ # of nearby points for any given point vanishes, so clustering does not work well
▪ # of points at max distance (~sqrt(d)) also vanishes; the real range of distances is actually very small
❑ Angle ABC for 3 random points approaches 90°
▪ Denominator of the cosine grows linearly with d
▪ Expected cosine = 0, since points are expected equally in all 4 quadrants
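
These claims can be explored with a small simulation. The sketch below makes two assumptions that the slide does not state: distances are measured between points drawn uniformly from the unit cube [0, 1]^d, and the angle is measured at the origin (B = origin) between vectors whose coordinates are uniform in [-1, 1], so that every coordinate is equally likely to be positive or negative. Function names and sample sizes are illustrative.

```python
import math
import random
import statistics

def pairwise_distance_stats(d, n_pairs=5000, seed=0):
    """Sample pairs of points uniformly from [0, 1]^d and return the
    (mean, standard deviation) of their Euclidean distances."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        p = [rng.random() for _ in range(d)]
        q = [rng.random() for _ in range(d)]
        dists.append(math.dist(p, q))
    return statistics.mean(dists), statistics.stdev(dists)

def mean_abs_cosine(d, n_pairs=2000, seed=1):
    """Average |cos| of the angle at the origin between two random vectors
    with coordinates uniform in [-1, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        a = [rng.uniform(-1, 1) for _ in range(d)]
        c = [rng.uniform(-1, 1) for _ in range(d)]
        dot = sum(x * y for x, y in zip(a, c))
        total += abs(dot) / (math.hypot(*a) * math.hypot(*c))
    return total / n_pairs

if __name__ == "__main__":
    for d in (1, 2, 10, 100):
        mean, sd = pairwise_distance_stats(d)
        print(d, round(mean, 3), round(sd / mean, 3), round(mean_abs_cosine(d), 3))
```

As d grows, the spread of distances relative to their mean shrinks and the average |cos| moves toward 0, which is the sense in which the angle approaches 90° under these assumptions.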



Hierarchical Clustering

• Starts with each item in a different cluster
• Bottom-up
• In each iteration
❑ Two clusters are identified and merged into one
• Items are combined as the algorithm progresses
• Questions:
❑ How are clusters represented?
❑ How to decide which ones to merge?
❑ What is the stopping condition?
• Typical algorithm: find the smallest distance between nodes of different clusters and merge those clusters
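
The typical algorithm can be sketched in a few lines of Python. This is a naive illustration, not an efficient implementation, and it fixes one answer to each of the three questions as an assumption: clusters are plain lists of points, the pair to merge is the one with the smallest point-to-point distance, and merging stops when a target number of clusters `k` remains. The names `single_link` and `agglomerate` are illustrative.

```python
import math
from itertools import combinations

def single_link(c1, c2):
    """Smallest distance between a point of c1 and a point of c2."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]        # each item starts in its own cluster
    while len(clusters) > k:
        # Pick the pair (i, j), i < j, with the smallest single-link distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters[j])     # merge cluster j into cluster i
        del clusters[j]                     # j > i, so index i is unaffected
    return clusters

if __name__ == "__main__":
    pts = [(2, 2), (3, 4), (5, 2), (4, 8), (4, 10), (6, 8),
           (7, 10), (11, 4), (12, 3), (10, 5), (9, 3), (12, 6)]
    for cluster in agglomerate(pts, 3):
        print(cluster)
```

Each pass recomputes all pairwise cluster distances, which is simple to read but slow for large inputs.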


Hierarchical Clustering

(Excerpt from p. 248, Chapter 7, "Clustering")

[Figure 7.5: Three more steps of the hierarchical clustering. Points shown: (2,2), (3,4), (5,2), (4,8), (4,10), (6,8), (7,10), (9,3), (10,5), (11,4), (12,3), (12,6). Cluster centroids shown: (2.5, 3), (4,9), (4.7, 8.7), (10.5, 3.8).]

1. Take the distance between two clusters to be the minimum of the distances between any two points, one chosen from each cluster. For example, in Fig. 7.3 we would next choose to cluster the point (10,5) with the cluster of two points, since (10,5) has distance √2, and no other pair of unclustered points is that close. Note that in Example 7.2, we did make this combination eventually, but not until we had combined another pair of points. In general, it is possible that this rule will result in an entirely different clustering from that obtained using the distance-of-centroids rule.

2. Take the distance between two clusters to be the average distance of all pairs of points, one from each cluster.
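
The two rules in the excerpt, together with the distance-of-centroids rule they are contrasted with, can be written as small Python functions. This is a sketch with illustrative names. It assumes that the "cluster of two points" mentioned for Fig. 7.3 is {(11,4), (12,3)}, which the transcript does not state explicitly.

```python
import math

def min_distance(c1, c2):
    """Rule 1: minimum distance over all pairs, one point from each cluster."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def average_distance(c1, c2):
    """Rule 2: average distance over all pairs, one point from each cluster."""
    return sum(math.dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def centroid_distance(c1, c2):
    """Distance-of-centroids rule, included for comparison."""
    return math.dist(centroid(c1), centroid(c2))

if __name__ == "__main__":
    pair = [(11, 4), (12, 3)]       # assumed two-point cluster
    single = [(10, 5)]
    print(min_distance(single, pair))        # sqrt(2), about 1.414
    print(average_distance(single, pair))    # about 2.121
    print(centroid_distance(single, pair))   # about 2.121
```

Under the minimum rule the point (10,5) looks much closer to the pair than under the other two rules, which is why the order of merges can differ between rules.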


Output of Clustering: Dendrogram


[Figure 7.6: Tree showing the complete grouping of the points of Fig. 7.2. Leaves, left to right: (2,2) (3,4) (5,2) (4,8) (4,10) (6,8) (7,10) (11,4) (12,3) (10,5) (9,3) (12,6)]
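
A dendrogram can be produced by running the merging process to completion and recording each merge. The sketch below is an illustration under assumptions: it uses the minimum-distance rule, so the tree it builds is not guaranteed to match the one in Fig. 7.6, and it represents the tree as nested two-element lists with points (tuples) at the leaves. The names `build_dendrogram` and `leaves` are illustrative.

```python
import math
from itertools import combinations

def build_dendrogram(points):
    """Merge clusters until one remains. Points must be tuples.
    Returns (tree, heights): tree is a nested list of merges, heights holds
    the distance at which each merge happened, in merge order."""
    # Each node is (subtree, points under it); a leaf's subtree is the point.
    nodes = [(p, [p]) for p in points]
    heights = []

    def link(ij):
        a, b = nodes[ij[0]][1], nodes[ij[1]][1]
        return min(math.dist(p, q) for p in a for q in b)

    while len(nodes) > 1:
        i, j = min(combinations(range(len(nodes)), 2), key=link)
        heights.append(link((i, j)))
        # An internal node is a two-element list [left, right].
        nodes[i] = ([nodes[i][0], nodes[j][0]], nodes[i][1] + nodes[j][1])
        del nodes[j]                # j > i, so index i is unaffected
    return nodes[0][0], heights

def leaves(tree):
    """Left-to-right order of the points at the leaves of the tree."""
    if isinstance(tree, tuple):     # a point
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)
```

Cutting the tree at a chosen height gives a flat clustering: every merge recorded above that height is undone.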


Measures for a cluster

• Radius: largest distance from the centroid to any point in the cluster
• Diameter: largest distance between some pair of points in the cluster
• Density: # of points per unit volume
• Volume: some power of radius or diameter

• Good cluster: when the diameter of each cluster is much smaller than the distance to its nearest cluster or to the nearest point outside the cluster
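
These measures are straightforward to compute for points in Euclidean space. The sketch below uses illustrative names and assumes that volume is the radius raised to a chosen power (2 by default), since the slide leaves the power open. The last function relates a cluster's diameter to the distance to the nearest point outside it.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(c) / n for c in zip(*cluster))

def radius(cluster):
    """Largest distance from the centroid to a point of the cluster."""
    c = centroid(cluster)
    return max(math.dist(c, p) for p in cluster)

def diameter(cluster):
    """Largest distance between any pair of points (0 for a single point)."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)),
               default=0.0)

def density(cluster, power=2):
    """Points per unit volume, with volume = radius ** power (assumption)."""
    r = radius(cluster)
    return float("inf") if r == 0 else len(cluster) / r ** power

def separation_ratio(cluster, others):
    """Diameter divided by the distance to the nearest outside point."""
    nearest = min(math.dist(p, q) for p in cluster for q in others)
    return diameter(cluster) / nearest
```

For example, `separation_ratio([(0, 0), (2, 0)], [(12, 0)])` divides a diameter of 2 by a nearest outside distance of 10.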


Stopping condition for clustering

• Cluster radius or diameter crosses a threshold
• Cluster density drops below a certain threshold
• Ratio of diameter to distance to nearest cluster drops below a certain threshold
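
As an illustration of the first condition, the sketch below stops merging as soon as the next merge would create a cluster whose diameter exceeds a threshold. The names `cluster_until` and `max_diameter` are hypothetical, and the choice of the minimum-distance merge rule is an assumption. The other two conditions could be checked at the same place in the loop.

```python
import math
from itertools import combinations

def diameter(cluster):
    """Largest distance between any pair of points (0 for a single point)."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)),
               default=0.0)

def cluster_until(points, max_diameter):
    """Merge closest clusters until the next merge would exceed max_diameter."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: min(math.dist(p, q)
                                      for p in clusters[ij[0]]
                                      for q in clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        if diameter(merged) > max_diameter:
            break                   # stopping condition reached
        clusters[i] = merged
        del clusters[j]             # j > i, so index i is unaffected
    return clusters
```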


K-Means Clustering


CAP5510 / CGS 5166 2/11/13

[Figures: clustering example shown over four slides, with "Start" and "End" states marked. Example from Andrew Moore’s tutorial on Clustering.]

K-Means Clustering [MacQueen ’67]

Start with randomly chosen cluster centers
Repeat
❑ Assign points to give the greatest increase in score
❑ Recompute cluster centers
❑ Reassign points
until (no changes)

Try the applet at: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
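
The loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. It assumes that the score is improved by assigning each point to its nearest center under Euclidean distance, and it adds two things the slide does not mention: an optional `init` argument for supplying starting centers and a `max_iter` safety limit. All names are illustrative.

```python
import math
import random

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Basic K-means. Returns (centers, labels).
    If init is None, k input points are chosen at random as starting centers."""
    rng = random.Random(seed)
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:    # no changes, so stop
            break
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:             # an empty cluster keeps its old center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = kmeans(pts, 2, init=[(0, 0), (10, 10)])
    print(centers, labels)
```

Running it several times with different seeds and no `init` is a simple way to see how the result can depend on the starting centers.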


How to find K for K-means?


Comparisons

• Hierarchical clustering
❑ Number of clusters not preset
❑ Complete hierarchy of clusters
❑ Not very robust, not very efficient
• K-Means
❑ Needs a definition of a mean. What about categorical data?
❑ Can be sensitive to initial cluster centers; stopping condition unclear
❑ More efficient and often finds a good clustering (a local optimum, which need not be the global one)