
Data Clustering

Héctor Corrada Bravo
University of Maryland, College Park, USA
DATA606: 2020-04-12

Motivating Example

Time series dataset of mortgage affordability as calculated and distributed by Zillow: https://www.zillow.com/research/data/.

The dataset consists of monthly mortgage affordability values for 76 counties with data from 1979 to 2017.

1 / 57

Motivating Example"To calculate mortgage affordability, we first calculate themortgage payment for the median-valued home in ametropolitan area by using the metro-level Zillow Home ValueIndex for a given quarter and the 30-year fixed mortgageinterest rate during that time period, provided by the FreddieMac Primary Mortgage Market Survey (based on a 20 percentdown payment)."

2 / 57

Motivating Example"Then, we consider what portion of the monthly medianhousehold income (U.S. Census) goes toward this monthlymortgage payment. Median household income is available witha lag. "

3 / 57

Motivating Example

Can we partition counties into groups of counties with similar value trends across time?

4 / 57

Cluster Analysis

The high-level goal of cluster analysis is to organize objects (observations) that are similar to each other into groups.

We want objects within a group to be more similar to each other than objects in different groups.

5 / 57

Cluster Analysis

The high-level goal of cluster analysis is to organize objects (observations) that are similar to each other into groups.

We want objects within a group to be more similar to each other than objects in different groups.

Central to this high-level goal is how to measure the degree of similarity between objects.

A clustering method then uses the similarity measure provided to it to group objects into clusters.

6 / 57

Cluster Analysis

Result of the k-means algorithm partitioning the data into 9 clusters. The darker series within each cluster shows the average time series within the cluster.

7 / 57

Dissimilarity-based Clustering

For certain algorithms, instead of similarity we work with dissimilarity, often represented as distances.

When we have observations defined over attributes, or predictors, we define dissimilarity based on these attributes.

8 / 57

Dissimilarity-based Clustering

Given measurements $x_{ij}$ for observations $i = 1, \ldots, N$ over predictors $j = 1, \ldots, p$, suppose we define a per-predictor dissimilarity $d_j(x_{ij}, x_{i'j})$. We can then define a dissimilarity between objects as

$$d(x_i, x_{i'}) = \sum_{j=1}^{p} d_j(x_{ij}, x_{i'j})$$

9 / 57

Dissimilarity-based Clustering

In the k-means algorithm, and many other algorithms, the most common usage is squared distance

$$d_j(x_{ij}, x_{i'j}) = (x_{ij} - x_{i'j})^2$$

We can use different dissimilarities, for example

$$d_j(x_{ij}, x_{i'j}) = |x_{ij} - x_{i'j}|$$

which may affect our choice of clustering algorithm later on.

10 / 57
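To make the two dissimilarities above concrete, here is a minimal Python sketch; the function name and example values are our own, not from the slides. It sums per-predictor terms using either the squared or the absolute difference.

```python
import numpy as np

def dissimilarity(x_i, x_iprime, per_feature="squared"):
    """Additive dissimilarity d(x_i, x_i') = sum_j d_j(x_ij, x_i'j)."""
    diff = np.asarray(x_i, float) - np.asarray(x_iprime, float)
    if per_feature == "squared":       # d_j = (x_ij - x_i'j)^2
        return np.sum(diff ** 2)
    if per_feature == "absolute":      # d_j = |x_ij - x_i'j|
        return np.sum(np.abs(diff))
    raise ValueError(per_feature)

# Example: two observations over p = 3 predictors
print(dissimilarity([1.0, 2.0, 0.5], [0.0, 2.5, 1.5]))              # 2.25
print(dissimilarity([1.0, 2.0, 0.5], [0.0, 2.5, 1.5], "absolute"))  # 2.5
```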

Dissimilarity-based Clustering

For categorical variables, we could set

$$d_j(x_{ij}, x_{i'j}) = \begin{cases} 0 & \text{if } x_{ij} = x_{i'j} \\ 1 & \text{otherwise} \end{cases}$$

11 / 57

Dissimilarity-based Clustering

If the values of the categorical variable have an intrinsic similarity, generalize using a symmetric matrix $L$ with elements $L_{rr'} = L_{r'r}$, $L_{rr} = 0$, and $L_{rr'} \geq 0$ otherwise.

This may of course lead to a dissimilarity that is not a proper distance.

12 / 57
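As an illustration of this generalized categorical dissimilarity, here is a small sketch with a made-up three-level variable and a hypothetical matrix $L$ (symmetric, zero on the diagonal, non-negative elsewhere); the specific levels and values are not from the slides.

```python
import numpy as np

# Hypothetical dissimilarity matrix L for a categorical variable with levels
# ("low", "medium", "high"): L[r, r'] is the dissimilarity between levels r and r'.
levels = ["low", "medium", "high"]
L = np.array([
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.5],
    [1.0, 0.5, 0.0],
])

def d_categorical(value_a, value_b):
    """Per-feature dissimilarity for one categorical predictor via lookup in L."""
    return L[levels.index(value_a), levels.index(value_b)]

print(d_categorical("low", "high"))     # 1.0
print(d_categorical("medium", "high"))  # 0.5
```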

Limitations of Distance-based Clustering

When working with distances, we are quickly confronted with the curse of dimensionality.

One flavor: in high dimensions, all points are nearly equidistant from each other.

13 / 57

The curse of dimensionality

Consider the case where we have many covariates. We want to use a distance-based clustering method.

14 / 57

The curse of dimensionality

Consider the case where we have many covariates. We want to use a distance-based clustering method.

Basically, we need to define distance and look for small multi-dimensional "balls" around the data points. With many covariates this becomes difficult.

15 / 57

The curse of dimensionality

Imagine we have equally spaced data and that each covariate is in $[0, 1]$. We want something that clusters points into local regions, containing some reasonably small amount of data points (say 10%). Let's imagine these are high-dimensional cubes.

16 / 57

The curse of dimensionality

Imagine we have equally spaced data and that each covariate is in $[0, 1]$. We want something that clusters points into local regions, containing some reasonably small amount of data points (say 10%). Let's imagine these are high-dimensional cubes.

If we have $p$ covariates and we are forming $p$-dimensional cubes, then each side of the cube must have size $l$ determined by

$$l \times l \times \cdots \times l = l^p = 0.10$$

17 / 57

The curse of dimensionality

If the number of covariates is $p = 10$, then $l = 0.1^{1/10} \approx 0.8$. So it really isn't local! If we reduce the percent of data we consider to 1%, $l \approx 0.63$. Still not very local.

18 / 57
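The side lengths quoted above follow directly from $l = \text{fraction}^{1/p}$; a quick check in Python (a minimal sketch, values chosen only for illustration):

```python
# Side length of a p-dimensional cube in [0, 1]^p capturing a given fraction of
# uniformly spread data: l^p = fraction, so l = fraction ** (1 / p).
for fraction in (0.10, 0.01):
    for p in (1, 2, 10, 100):
        l = fraction ** (1.0 / p)
        print(f"p = {p:3d}, fraction = {fraction:.2f}: side length l = {l:.2f}")
```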

The curse of dimensionality

If the number of covariates is $p = 10$, then $l = 0.1^{1/10} \approx 0.8$. So it really isn't local! If we reduce the percent of data we consider to 1%, $l \approx 0.63$. Still not very local.

If we keep reducing the size of the neighborhoods we will end up with a very small number of data points in each cluster and require a large number of clusters.

19 / 57

K-means Clustering

A commonly used algorithm to perform clustering is the K-means algorithm.

It is appropriate when using squared Euclidean distance as the measure of object dissimilarity:

$$d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \lVert x_i - x_{i'} \rVert^2$$

20 / 57

K-means Clustering

K-means partitions observations into $K$ clusters, with $K$ provided as a parameter.

Given some clustering, or partition, $C$, the assignment of observation $x_i$ to cluster $k \in \{1, \ldots, K\}$ is denoted as $C(i) = k$.

21 / 57

K-means Clustering

K-means partitions observations into $K$ clusters, with $K$ provided as a parameter.

Given some clustering, or partition, $C$, the assignment of observation $x_i$ to cluster $k \in \{1, \ldots, K\}$ is denoted as $C(i) = k$.

K-means minimizes this clustering criterion:

$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{i:\,C(i)=k} \; \sum_{i':\,C(i')=k} \lVert x_i - x_{i'} \rVert^2$$

22 / 57

K-means Clustering

This is equivalent to minimizing

$$W(C) = \frac{1}{2} \sum_{k=1}^{K} N_k \sum_{i:\,C(i)=k} \lVert x_i - \bar{x}_k \rVert^2$$

with:

- $\bar{x}_k = (\bar{x}_{k1}, \ldots, \bar{x}_{kp})$, where $\bar{x}_{kj}$ is the average of predictor $j$ over the observations assigned to cluster $k$,
- $N_k$ is the number of observations assigned to cluster $k$.

23 / 57

K-means Clustering

$$W(C) = \frac{1}{2} \sum_{k=1}^{K} N_k \sum_{i:\,C(i)=k} \lVert x_i - \bar{x}_k \rVert^2$$

Minimize the total distance given by each observation to the mean (centroid) of the cluster to which the observation is assigned.

24 / 57

K-means Clustering

An iterative algorithm is used to minimize this criterion:

1. Initialize by choosing $K$ observations as centroids $m_1, m_2, \ldots, m_K$
2. Assign each observation $i$ to the cluster with the nearest centroid, i.e., set $C(i) = \arg\min_{1 \leq k \leq K} \lVert x_i - m_k \rVert^2$
3. Update centroids $m_k = \bar{x}_k$
4. Iterate steps 2 and 3 until convergence

25 / 57
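Below is a minimal Python sketch of this iteration (Lloyd's algorithm). The function name, the random initialization with a seed, and the convergence test are our own choices for illustration, not part of the slides.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm): a direct sketch of steps 1-4 above."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Step 1: initialize centroids by choosing K observations at random
    centroids = X[rng.choice(n, size=K, replace=False)].astype(float)
    assign = None
    for _ in range(n_iter):
        # Step 2: assign each observation to the cluster with the nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignments unchanged: converged
        assign = new_assign
        # Step 3: update each centroid to the mean of its assigned observations
        for k in range(K):
            if np.any(assign == k):
                centroids[k] = X[assign == k].mean(axis=0)
    return assign, centroids

# Example usage on synthetic data
X = np.random.default_rng(1).normal(size=(200, 5))
labels, centers = kmeans(X, K=4)
```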

K-means Clustering

Here we illustrate the k-means algorithm over four iterations on our example data with $K = 4$.

26 / 57

K-means Clustering

The criterion $W(C)$ is reduced in each iteration, so the algorithm is assured to converge.

It is not a convex criterion, so the clustering we obtain may not be globally optimal.

In practice, the algorithm is run with multiple initializations (the initialization step) and the best clustering achieved is used.

27 / 57

K-means Clustering

Also, the selection of observations as centroids can be improved using the K-means++ algorithm:

1. Choose an observation as centroid $m_1$ uniformly at random
2. To choose centroid $m_k$, compute for each observation $i$ not chosen as a centroid the distance to the nearest centroid $d_i = \min_{1 \leq l < k} \lVert x_i - m_l \rVert^2$
3. Set centroid $m_k$ to an observation randomly chosen with probability $\frac{e^{d_i}}{\sum_{i'} e^{d_{i'}}}$
4. Iterate steps 2 and 3 until $K$ centroids are chosen

28 / 57
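A minimal sketch of this seeding procedure follows. Note that the widely used K-means++ of Arthur and Vassilvitskii samples the next centroid with probability proportional to the squared distance $d_i$ itself rather than the exponential weighting written above; the sketch uses the proportional-to-$d_i$ version, and the function name is ours.

```python
import numpy as np

def kmeanspp_init(X, K, seed=0):
    """K-means++-style seeding: each new centroid is sampled with probability
    proportional to its (squared) distance to the nearest centroid chosen so far."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    centroids = [X[rng.integers(n)]]          # step 1: first centroid uniformly at random
    for _ in range(1, K):
        cents = np.array(centroids)
        # step 2: squared distance of every point to its nearest chosen centroid
        d = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # step 3: sample the next centroid with probability proportional to d
        centroids.append(X[rng.choice(n, p=d / d.sum())])
    return np.array(centroids)
```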

Choosing the number of clusters

The number of clusters $K$ is a parameter that must be determined before running the K-means algorithm.

There is no clean, direct method for choosing the number of clusters to use in the K-means algorithm (e.g., no cross-validation method).

29 / 57

Choosing the number of clusters

Looking at the criterion $W(C)$ alone is not sufficient, as the criterion becomes smaller as the value of $K$ is increased.

30 / 57
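One way to produce the plot described above is to run K-means over a range of $K$ and record the within-cluster criterion for each; a short sketch, reusing the `kmeans()` function from the earlier example (the helper name is our own):

```python
import numpy as np

def within_cluster_criterion(X, assign, centroids):
    """Total squared distance of each observation to its assigned centroid
    (proportional to the W(C) criterion above)."""
    return sum(((X[assign == k] - c) ** 2).sum() for k, c in enumerate(centroids))

# Run K-means (the earlier sketch) for a range of K and watch the criterion shrink
X = np.random.default_rng(2).normal(size=(300, 4))
for K in range(1, 10):
    assign, centroids = kmeans(X, K)
    print(K, round(float(within_cluster_criterion(X, assign, centroids)), 1))
```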

Choosing the number of clusters

We can use properties of this plot for ad-hoc selection.

Suppose there is a true underlying number of clusters $K^*$ in the data. Improvement in the $W_K(C)$ statistic will be fast for values of $K \leq K^*$ and slower for values of $K > K^*$.

31 / 57

Choosing the number of clusters

Improvement in the $W_K(C)$ statistic will be fast for values of $K \leq K^*$.

In this case, there will be a cluster which contains observations belonging to two of the true underlying clusters, and therefore has poor within-cluster similarity.

As $K$ is increased, these observations may then be separated into separate clusters, providing a sharp improvement in the $W_K(C)$ statistic.

32 / 57

Choosing the number of clusters

Improvement in the $W_K(C)$ statistic will be slower for values of $K > K^*$.

In this case, observations belonging to a single true cluster are split into multiple clusters, all with generally high within-cluster similarity. Splitting these clusters further will not improve the $W_K(C)$ statistic very sharply.

33 / 57

Choosing the number of clusters

The $W_K(C)$ curve will therefore have an inflection point around $K^*$.

34 / 57

Choosing the number of clusters

The gap statistic is used to identify the inflection point in the $W_K(C)$ curve.

It compares the behavior of the $W_K(C)$ statistic based on the data with the behavior of the $W_K(C)$ statistic for data generated uniformly at random over the range of the data.

It chooses the $K$ that maximizes the gap between these two curves.

35 / 57
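A rough sketch of the gap statistic follows, assuming the `kmeans()` and `within_cluster_criterion()` helpers from the earlier examples. It compares the log of the criterion on the data against its average over reference datasets drawn uniformly over the range of each feature; the number of reference datasets and other details may differ from the implementation used for the figures here.

```python
import numpy as np

def gap_statistic(X, K_range, n_ref=10, seed=0):
    """Gap(K) = mean over reference data of log W_K(reference) - log W_K(data)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for K in K_range:
        assign, cent = kmeans(X, K)                    # reusing the earlier sketch
        log_wk = np.log(within_cluster_criterion(X, assign, cent))
        ref_log_wk = []
        for _ in range(n_ref):
            X_ref = rng.uniform(lo, hi, size=X.shape)  # uniform over the data's range
            a_ref, c_ref = kmeans(X_ref, K)
            ref_log_wk.append(np.log(within_cluster_criterion(X_ref, a_ref, c_ref)))
        gaps.append(np.mean(ref_log_wk) - log_wk)
    return np.array(gaps)

# The K maximizing the gap is the suggested number of clusters, e.g.:
# gaps = gap_statistic(X, range(1, 10)); best_K = int(np.argmax(gaps)) + 1
```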

Choosing the number of clusters

For this dataset, the gap statistic suggests there is no clear cluster structure and therefore $K = 1$ is the best choice. A choice of $K = 4$ is also appropriate.

36 / 57

K-medoids Clustering

A variant of the same algorithm for distances that are not Euclidean.

Input: a distance matrix $d(x_i, x_k)$ between observations.
Output: cluster assignments, and a representative data point for each cluster (medoid).

Advantage: can apply to situations where distances between observations are available but not feature vectors (e.g., network data).

37 / 57

K-medoids Clustering

1. Initialize by choosing $K$ observations as medoids $m_1, m_2, \ldots, m_K$
2. Assign each observation $i$ to the cluster with the nearest medoid, i.e., set $C(i) = \arg\min_{1 \leq k \leq K} d(x_i, m_k)$
3. Update medoids $m_k = \arg\min_{x_i:\,C(i)=k} \sum_{j:\,C(j)=k} d(x_i, x_j)$
4. Iterate steps 2 and 3 until convergence

38 / 57
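A minimal sketch of this procedure over a precomputed distance matrix `D`; as with the earlier sketches, the function name, initialization, and convergence test are our own choices.

```python
import numpy as np

def kmedoids(D, K, n_iter=100, seed=0):
    """K-medoids over an n x n distance matrix D, following steps 1-4 above."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=K, replace=False)   # step 1: K observations as medoids
    assign = None
    for _ in range(n_iter):
        new_assign = D[:, medoids].argmin(axis=1)    # step 2: nearest medoid
        if assign is not None and np.array_equal(new_assign, assign):
            break
        assign = new_assign
        for k in range(K):                           # step 3: pick the member minimizing
            members = np.where(assign == k)[0]       # total within-cluster distance
            if members.size:
                costs = D[np.ix_(members, members)].sum(axis=1)
                medoids[k] = members[costs.argmin()]
    return assign, medoids
```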

Large-scale clustering

Cost of K-means as presented:

Each iteration: compute the distance from each point to each centroid, $O(knp)$.

39 / 57

Large-scale clustering

Cost of K-means as presented:

Each iteration: compute the distance from each point to each centroid, $O(knp)$.

This implies we have to do multiple passes over the entire dataset. Not good for massive datasets.

40 / 57

Large-scale clustering

Cost of K-means as presented:

Each iteration: compute the distance from each point to each centroid, $O(knp)$.

This implies we have to do multiple passes over the entire dataset. Not good for massive datasets.

Can we do this in "almost" a single pass?

41 / 57

Large-scale clustering (BFR Algorithm)

1. Select $k$ points as before
2. Process the data file in chunks:
   - Set the chunk size so each chunk can be processed in main memory
   - Some memory is used as workspace, so not the entire memory is available

42 / 57

Large-scale clustering (BFR Algorithm)

For each chunk:

- All points sufficiently close to the centroid of one of the $k$ clusters are assigned to that cluster (Discard Set)

43 / 57

Large-scale clustering (BFR Algorithm)

For each chunk:

- All points sufficiently close to the centroid of one of the $k$ clusters are assigned to that cluster (Discard Set)
- Remaining points are clustered (e.g., using $k$-means with some value of $k$). Two cases:
  - Clusters with more than one point (Compressed Set)
  - Singleton clusters (Retained Set)

44 / 57

Large-scale clustering (BFR Algorithm)

For each chunk:

- All points sufficiently close to the centroid of one of the $k$ clusters are assigned to that cluster (Discard Set)
- Remaining points are clustered (e.g., using $k$-means with some value of $k$). Two cases:
  - Clusters with more than one point (Compressed Set)
  - Singleton clusters (Retained Set)
- Try to merge clusters in the Compressed Set

45 / 57

Large-scale clustering (BFR Algorithm)

What is sufficiently close?

"Weighted" distance to the centroid below some threshold:

$$\sum_{i=1}^{d} \frac{(p_i - c_i)^2}{\sigma_i^2}$$

- $p_i$: value of feature $i$ for the point, over $d$ features
- $c_i$: cluster mean for feature $i$
- $\sigma_i$: cluster standard deviation of feature $i$

46 / 57
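The "sufficiently close" test above can be written directly; a small sketch (the threshold itself is application-specific and not specified in the slides):

```python
import numpy as np

def weighted_distance(point, center, std):
    """Per-feature normalized squared distance of a point to a cluster centroid,
    as in the 'sufficiently close' test above."""
    point, center, std = map(np.asarray, (point, center, std))
    return np.sum(((point - center) / std) ** 2)

# Example: a point two standard deviations away on each of 3 features
print(weighted_distance([2.0, 4.0, 6.0], [0.0, 0.0, 0.0], [1.0, 2.0, 3.0]))  # 12.0
```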

Large-scale clustering (BFR Algorithm)

Assumption: points are distributed along axis-parallel ellipses.

47 / 57

Large-scale clustering (BFR Algorithm)

Under this assumption, we only need to store means and variances to calculate distances.

48 / 57

Large-scale clustering (BFR Algorithm)

Under this assumption, we only need to store means and variances to calculate distances.

We can do this by storing for each cluster $j$:

- $N_j$: number of points assigned to cluster $j$
- $s_{ij}$: sum of values of feature $i$ in cluster $j$
- $s^2_{ij}$: sum of squares of values of feature $i$ in cluster $j$

49 / 57

Large-scale clustering (BFR Algorithm)

Under this assumption, we only need to store means and variances to calculate distances.

We can do this by storing for each cluster $j$:

- $N_j$: number of points assigned to cluster $j$
- $s_{ij}$: sum of values of feature $i$ in cluster $j$
- $s^2_{ij}$: sum of squares of values of feature $i$ in cluster $j$

This is a constant amount of space to represent a cluster.

50 / 57

Large-scale clustering (BFR Algorithm)

Under this assumption, we only need to store means and variances to calculate distances.

We can do this by storing for each cluster $j$:

- $N_j$: number of points assigned to cluster $j$
- $s_{ij}$: sum of values of feature $i$ in cluster $j$
- $s^2_{ij}$: sum of squares of values of feature $i$ in cluster $j$

This is a constant amount of space to represent a cluster.

Exercise: show these are sufficient to calculate the weighted distance (one way to do so is sketched below).

51 / 57
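One way to work the exercise: keep $(N_j, s_{ij}, s^2_{ij})$ per cluster and recover the mean and variance, and hence the weighted distance, from them alone. A sketch, with class and method names of our own choosing:

```python
import numpy as np

class ClusterSummary:
    """BFR-style sufficient statistics for one cluster: count, per-feature sums,
    and per-feature sums of squares (constant space per cluster)."""
    def __init__(self, n_features):
        self.N = 0
        self.s = np.zeros(n_features)    # sum of feature values
        self.s2 = np.zeros(n_features)   # sum of squared feature values

    def add(self, point):
        point = np.asarray(point, float)
        self.N += 1
        self.s += point
        self.s2 += point ** 2

    def mean(self):
        return self.s / self.N

    def variance(self):
        # Var = E[x^2] - E[x]^2, computable from (N, s, s2) alone
        return self.s2 / self.N - (self.s / self.N) ** 2

    def weighted_distance(self, point):
        return np.sum((np.asarray(point) - self.mean()) ** 2 / self.variance())

# Example: summarize a few points, then test how close a new point is
summary = ClusterSummary(2)
for pt in ([1.0, 2.0], [2.0, 4.0], [3.0, 6.0]):
    summary.add(pt)
print(summary.mean(), summary.variance())     # [2. 4.] [0.667 2.667] (approx.)
print(summary.weighted_distance([2.0, 4.0]))  # 0.0 (exactly at the centroid)
```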

Large-scale clustering (BFR Algorithm)

This representation is used for (final) clusters in the Discard Set and (partial) clusters in the Compressed Set.

The only points kept explicitly in memory are those in the Retained Set.

52 / 57

Large-scale clustering (BFR Algorithm)

This representation is used for (final) clusters in the Discard Set and (partial) clusters in the Compressed Set.

The only points kept explicitly in memory are those in the Retained Set.

Points outside of the Retained Set are never kept in memory (they are written out along with their cluster assignment).

53 / 57

Large-scale clustering (BFR Algorithm)

Merging clusters in the Compressed Set.

Example: merge if the variance of the combined cluster is sufficiently close to the variance of the separate clusters.

54 / 57
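A sketch of this merge test, reusing the `ClusterSummary` class above; the specific rule and the `max_growth` threshold are hypothetical choices for illustration, not values from the slides.

```python
import numpy as np

def merge_summaries(a, b):
    """Combine two BFR cluster summaries by adding their sufficient statistics."""
    merged = ClusterSummary(len(a.s))      # reuses the ClusterSummary sketch above
    merged.N, merged.s, merged.s2 = a.N + b.N, a.s + b.s, a.s2 + b.s2
    return merged

def should_merge(a, b, max_growth=1.2):
    """Hypothetical rule: merge if the combined per-feature variance is not much
    larger than the larger of the two separate per-feature variances."""
    combined = merge_summaries(a, b)
    return bool(np.all(combined.variance() <=
                       max_growth * np.maximum(a.variance(), b.variance())))
```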

Large-scale clustering (BFR Algorithm)

After all data is processed:

- Assign points in the Retained Set to the cluster with the nearest centroid
- Merge partial clusters in the Compressed Set with final clusters in the Discard Set

55 / 57

Large-scale clustering (BFR Algorithm)

After all data is processed:

- Assign points in the Retained Set to the cluster with the nearest centroid
- Merge partial clusters in the Compressed Set with final clusters in the Discard Set

Or, flag all of these as outliers.

56 / 57

Summary

- Clustering algorithms are used to partition data into groups of similar observations
- K-means and K-medoids: iterative algorithms to minimize a partition objective function
- The optimization solution depends on initialization: K-means++ provides improved initialization
- BFR Algorithm: how to cluster massive datasets in "almost" one pass over the data

57 / 57

