Clustering and K-means - GitHub Pages

Upload: vubao

Post on 11-Mar-2019



Clustering and K-means


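Root Mean Square Error (RMS)

Data: $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \in \mathbb{R}^d$

Approximations: $\vec{z}_1, \vec{z}_2, \ldots, \vec{z}_N \in \mathbb{R}^d$

Root Mean Square error $= \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert \vec{x}_i - \vec{z}_i \rVert_2^2}$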
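A minimal sketch of the RMS error, assuming NumPy and that the data and the approximations are stored as the rows of two arrays of equal shape. The function name `rms_error` is illustrative, not something taken from the slides.

```python
import numpy as np

def rms_error(X, Z):
    """Root mean square error between data rows X[i] and approximations Z[i]."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    # Squared Euclidean distance ||x_i - z_i||^2 for every row i.
    sq_dist = np.sum((X - Z) ** 2, axis=1)
    # Mean over the N rows, then square root.
    return float(np.sqrt(sq_dist.mean()))

X = np.array([[0.0, 0.0], [3.0, 4.0]])
Z = np.array([[0.0, 0.0], [0.0, 0.0]])
print(rms_error(X, Z))  # sqrt((0 + 25) / 2), about 3.536
```

The same function applies to every model in these slides: only the way the approximations are produced changes.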

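PCA based prediction

Data: $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \in \mathbb{R}^d$

Mean vector: $\vec{\mu}$

Top k eigenvectors: $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k$

Approximation of $\vec{x}_j$: $\vec{o}_j = \vec{\mu} + \sum_{i=1}^{k} \left( \vec{v}_i \cdot (\vec{x}_j - \vec{\mu}) \right) \vec{v}_i$

[Figure: data points (x) and their approximations (o)]

RMS Error $= \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert \vec{x}_i - \vec{o}_i \rVert_2^2}$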
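The PCA approximation can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the slides' own code: the eigenvectors are taken from the covariance matrix of the centered data, the data is centered by subtracting the mean before projecting, and the names `pca_approximation` and `rms_error` are hypothetical.

```python
import numpy as np

def pca_approximation(X, k):
    """Approximate each row of X by the mean plus its projection on the top-k eigenvectors."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    Xc = X - mu                                  # centered data
    cov = Xc.T @ Xc / X.shape[0]                 # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :k]                  # d x k, columns = top-k eigenvectors
    return mu + (Xc @ V) @ V.T                   # o_j = mu + sum_i (v_i . (x_j - mu)) v_i

def rms_error(X, O):
    return float(np.sqrt(np.sum((X - O) ** 2, axis=1).mean()))

# Synthetic data lying close to a line in the plane (illustrative only).
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = t @ np.array([[2.0, 1.0]]) + 0.01 * rng.normal(size=(200, 2)) + np.array([5.0, -3.0])

print(rms_error(X, pca_approximation(X, 1)))  # small: one direction explains almost everything
print(rms_error(X, pca_approximation(X, 2)))  # essentially zero: k = d reproduces the data
```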


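Regression based Prediction

Data: $(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_N, y_N)$, with $\vec{x}_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$

Input: $\vec{x} \in \mathbb{R}^d$   Output: $y \in \mathbb{R}$

Approximation of $y$ given $\vec{x}$: $\hat{y} = a_0 + \sum_{i=1}^{d} a_i x_i$

[Figure: data points (x) and their approximations (o)]

RMS Error $= \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$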
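One way to obtain the coefficients $a_0, a_1, \ldots, a_d$ is ordinary least squares, which minimizes the RMS error on the training pairs. The slides do not say how the coefficients are fitted, so the use of `numpy.linalg.lstsq` here is an assumption, as are the helper names `fit_linear` and `predict`.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y ~ a_0 + sum_i a_i x_i. Returns [a_0, a_1, ..., a_d]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that a_0 is fitted as an ordinary coefficient.
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]

def rms_error(y, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

# Noise-free toy data generated from y = 1 + 2*x1 - 3*x2.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

coef = fit_linear(X, y)
print(coef)                             # close to [1, 2, -3]
print(rms_error(y, predict(coef, X)))   # close to 0 on noise-free data
```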


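K-means clustering

RMS Error $= \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert \vec{x}_i - \vec{o}_i \rVert_2^2}$

Data: $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \in \mathbb{R}^d$

Model: k representatives: $\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k \in \mathbb{R}^d$

Approximation of $\vec{x}_j$: $\vec{o}_j = \arg\min_{\vec{r}_i} \lVert \vec{x}_j - \vec{r}_i \rVert_2^2$ = the representative closest to $\vec{x}_j$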
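For a fixed set of representatives, the approximation step is a nearest-neighbor lookup. The sketch below assumes NumPy arrays with one point per row; `closest_representative` is an illustrative name.

```python
import numpy as np

def closest_representative(X, R):
    """Return (index of the closest representative, the representative itself) for each row of X."""
    X = np.asarray(X, dtype=float)
    R = np.asarray(R, dtype=float)
    # sq_dist[j, i] = ||x_j - r_i||^2, computed by broadcasting.
    sq_dist = ((X[:, None, :] - R[None, :, :]) ** 2).sum(axis=2)
    idx = sq_dist.argmin(axis=1)
    return idx, R[idx]

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [9.0, 11.0]])
R = np.array([[0.0, 0.0], [10.0, 10.0]])

idx, O = closest_representative(X, R)
print(idx)  # [0 0 1 1]
rms = float(np.sqrt(((X - O) ** 2).sum(axis=1).mean()))
print(rms)  # sqrt((0 + 1 + 0 + 2) / 4), about 0.866
```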


K-means Algorithm

Initialize k representatives $\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k \in \mathbb{R}^d$

Iterate until convergence:

a. Associate each $\vec{x}_i$ with its closest representative: $\vec{x}_i \to \vec{r}_j$

b. Replace each representative $\vec{r}_j$ with the mean of the points assigned to $\vec{r}_j$

Both step a and step b reduce the RMSE or leave it unchanged; neither can increase it.
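The two alternating steps can be sketched as below. Several details are assumptions that the slides leave open: convergence is declared when the representatives stop moving, a representative with no assigned points is left where it is, and the iteration count is capped by a hypothetical `max_iter` parameter.

```python
import numpy as np

def kmeans(X, R0, max_iter=100):
    """Alternate steps a and b. Returns (representatives, assignment, RMSE history)."""
    X = np.asarray(X, dtype=float)
    R = np.array(R0, dtype=float)
    history = []
    for _ in range(max_iter):
        # Step a: associate each point with its closest representative.
        sq_dist = ((X[:, None, :] - R[None, :, :]) ** 2).sum(axis=2)
        assign = sq_dist.argmin(axis=1)
        history.append(float(np.sqrt(sq_dist.min(axis=1).mean())))
        # Step b: replace each representative with the mean of its points.
        new_R = R.copy()
        for j in range(len(R)):
            members = X[assign == j]
            if len(members) > 0:      # assumption: an empty cluster keeps its representative
                new_R[j] = members.mean(axis=0)
        if np.allclose(new_R, R):     # assumption: stop when nothing moves
            break
        R = new_R
    return R, assign, history

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
R, assign, history = kmeans(X, R0=[[0.0, 0.0], [0.0, 1.0]])
print(R)        # [[0, 0.5], [10, 10.5]]
print(assign)   # [0 0 1 1]
print(history)  # RMSE after each assignment step; it never increases
```

Recording the RMSE after every assignment step makes the monotone decrease easy to check on real data.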


Simple Initialization

Simplest initialization: choose representatives from the data points independently at random.

– Problem: some representatives are close to each other and some parts of the data have no representatives.

– K-means is a local search method: it can get stuck in local minima.


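Kmeans++

•  A different method for initializing representatives.
•  Spreads out the initial representatives.
•  Add representatives one by one.
•  Before adding a representative, define a distribution over the unselected data points.

Data: $\vec{x}_1, \ldots, \vec{x}_N$

Current reps: $\vec{r}_1, \ldots, \vec{r}_j$

Distance of example to reps: $d(\vec{x}, \{\vec{r}_1, \ldots, \vec{r}_j\}) = \min_{1 \le i \le j} \lVert \vec{x} - \vec{r}_i \rVert$

Prob. of selecting example $\vec{x}$ as next representative: $P(\vec{x}) = \frac{1}{Z}\, d(\vec{x}, \{\vec{r}_1, \ldots, \vec{r}_j\})^2$, where $Z$ normalizes the probabilities so that they sum to 1. Points far from all current representatives are therefore the most likely to be selected.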
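A minimal sketch of this seeding procedure, assuming the standard k-means++ weighting in which each unselected point is chosen with probability proportional to its squared distance to the closest current representative. Picking the first representative uniformly at random is also an assumption, and the data is assumed to contain at least k distinct points. The function name `kmeanspp_init` is illustrative.

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """Pick k rows of X as initial representatives, favoring points far from those already chosen."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    chosen = [int(rng.integers(n))]          # assumption: first representative is uniform
    for _ in range(1, k):
        R = X[chosen]
        # Squared distance of every point to its closest current representative.
        sq_dist = ((X[:, None, :] - R[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = sq_dist / sq_dist.sum()      # dividing by Z; chosen points get probability 0
        chosen.append(int(rng.choice(n, p=probs)))
    return X[chosen], chosen

# Two tight groups far apart: the two seeds almost surely land in different groups.
X = np.array([[0.0, 0.0], [0.0, 0.1], [100.0, 100.0], [100.0, 100.1]])
reps, chosen = kmeanspp_init(X, k=2, rng=np.random.default_rng(0))
print(chosen)
```

Because already-selected points have distance zero, they receive probability zero and cannot be picked twice.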


Example for Kmeans++

This is an unlikely initialization for kmeans++


Parallelized Kmeans

•  Suppose the data points are partitioned randomly across several machines.

•  We want to perform the a, b steps with minimal communication between machines.

1.  Choose initial representatives and broadcast to all machines.

2.  Each machine partitions its own data points according to closest representative. Defines (key, value) pairs where key = index of closest representative, value = example.

3.  Compute the mean for each set by performing reduceByKey (most of the summing is done locally on each machine).

4.  Broadcast new reps to all machines.
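The steps above can be imitated in plain Python without a cluster. This is a single-process stand-in, not Spark code: each inner list plays the role of one machine's data, and the `reduceByKey` is mimicked by merging per-machine dictionaries of (vector sum, count) pairs. All function names are hypothetical.

```python
def closest_index(x, reps):
    """Index of the representative closest to x (squared Euclidean distance)."""
    return min(range(len(reps)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, reps[i])))

def local_sums(partition, reps):
    """Work done on one machine: key = closest rep index, value = (vector sum, count)."""
    sums = {}
    for x in partition:
        key = closest_index(x, reps)
        if key in sums:
            total, count = sums[key]
            sums[key] = ([t + a for t, a in zip(total, x)], count + 1)
        else:
            sums[key] = (list(x), 1)
    return sums

def parallel_kmeans_step(partitions, reps):
    """One a+b iteration: combine the per-machine sums by key, then divide to get means."""
    merged = {}
    for partition in partitions:                 # each partition stands in for one machine
        for key, (total, count) in local_sums(partition, reps).items():
            if key in merged:
                m_total, m_count = merged[key]
                merged[key] = ([a + b for a, b in zip(m_total, total)], m_count + count)
            else:
                merged[key] = (total, count)
    new_reps = []
    for i, r in enumerate(reps):
        if i in merged:
            total, count = merged[i]
            new_reps.append([t / count for t in total])
        else:
            new_reps.append(list(r))             # assumption: unused rep stays in place
    return new_reps                              # these would be broadcast for the next round

partitions = [[(0, 0), (10, 10)], [(0, 2), (10, 12)]]
print(parallel_kmeans_step(partitions, reps=[(1, 1), (9, 9)]))  # [[0.0, 1.0], [10.0, 11.0]]
```

Only k (sum, count) pairs per machine cross the network in each round, which is what keeps the communication small compared with shipping the data points themselves.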


Clustering stability


Clustering stability

Clustering using starting points 1

Clustering using starting points 2

Clustering using starting points 3


Measuring clustering stability

               x1  x2  x3  x4  x5  x6   …   …  xn
Clustering 1    1   1   3   1   3   2   2   2   3
Clustering 2    2   2   1   2   1   3   3   3   1
Clustering 3    2   2   3   2   3   1   1   1   3
Clustering 4    1   1   1   1   3   3   3   3   1

Entry in row “clustering j”, column “xi” contains the index of the closest representative to xi for clustering j.

The first three clusterings are completely consistent with each other.

The fourth clustering has a disagreement in x5.
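Cluster indices are arbitrary labels, so two rows of the table describe the same grouping when one can be turned into the other by renaming the indices. The sketch below checks exactly that, taking "completely consistent" to mean "equal up to a one-to-one renaming of labels", which is an interpretation of the slide. The function name `same_grouping` is illustrative.

```python
def same_grouping(a, b):
    """True if label lists a and b group the examples identically, up to renaming labels."""
    forward, backward = {}, {}
    for la, lb in zip(a, b):
        # Each label of a must map to exactly one label of b, and vice versa.
        if forward.setdefault(la, lb) != lb or backward.setdefault(lb, la) != la:
            return False
    return True

c1 = [1, 1, 3, 1, 3, 2, 2, 2, 3]
c2 = [2, 2, 1, 2, 1, 3, 3, 3, 1]
c3 = [2, 2, 3, 2, 3, 1, 1, 1, 3]
c4 = [1, 1, 1, 1, 3, 3, 3, 3, 1]

print(same_grouping(c1, c2))  # True
print(same_grouping(c1, c3))  # True
print(same_grouping(c1, c4))  # False
```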


How to quantify stability?

•  We say that a clustering is stable if the examples are always grouped in the same way.

•  When we have thousands of examples, we cannot expect all of them to always be grouped the same way.

•  We need a way to quantify the stability.

•  Basic idea: measure how much groupings differ between clusterings.


Entropy

A partition G of the data defines a distribution over the parts: $p_1 + p_2 + \cdots + p_k = 1$

The information in this partition is measured by the entropy:

$H(G) = H(p_1, p_2, \ldots, p_k) = \sum_{i=1}^{k} p_i \log_2 \frac{1}{p_i}$

$H(G)$ is a number between 0 (one part with prob. 1) and $\log_2 k$ ($p_1 = p_2 = \cdots = p_k = \frac{1}{k}$).
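The entropy of a partition can be computed directly from the cluster labels, taking $p_i$ to be the fraction of examples that fall in part $i$. A minimal sketch using only the standard library; `partition_entropy` is an illustrative name.

```python
import math
from collections import Counter

def partition_entropy(labels):
    """Entropy in bits of the partition given by one label per example."""
    n = len(labels)
    counts = Counter(labels)
    # p_i = c / n, so log2(1 / p_i) = log2(n / c).
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(partition_entropy([1, 1, 1, 1]))  # 0.0: a single part with probability 1
print(partition_entropy([1, 1, 2, 2]))  # 1.0: two equal parts
print(partition_entropy([1, 2, 3, 4]))  # 2.0 = log2(4): four equal parts
```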


Entropy of a combined partition

If clustering 1 and clustering 2 partition the data in exactly the same way, then $G_1 = G_2$ and $H(G_1, G_2) = H(G_1) = H(G_2)$.

If clustering 1 and clustering 2 are independent (they partition the data independently of each other), then $H(G_1, G_2) = H(G_1) + H(G_2)$.

Suppose we produce many clusterings, using many starting points.

Suppose we plot $H(G_1), H(G_1, G_2), \ldots, H(G_1, G_2, \ldots, G_i), \ldots$ as a function of $i$.

If the graph increases like $i \log_2 k$, then the clustering is completely unstable.

If the graph stops increasing after some $i$, then we have reached stability.
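The combined partition places two examples in the same part only if every clustering groups them together, so each example's part can be identified with the tuple of its labels. The sketch below computes the joint entropy that way and produces the curve $H(G_1), H(G_1, G_2), \ldots$; the function names and the toy clusterings are illustrative.

```python
import math
from collections import Counter

def joint_entropy(*clusterings):
    """Entropy in bits of the partition obtained by combining several clusterings."""
    # An example's part in the combined partition is the tuple of its labels.
    combined = list(zip(*clusterings))
    n = len(combined)
    return sum((c / n) * math.log2(n / c) for c in Counter(combined).values())

def entropy_curve(clusterings):
    """[H(G1), H(G1,G2), ..., H(G1,...,Gm)] for a list of m clusterings."""
    return [joint_entropy(*clusterings[:i]) for i in range(1, len(clusterings) + 1)]

g1 = [0, 0, 1, 1]
g2 = [1, 1, 0, 0]   # same grouping as g1 with the labels renamed
g3 = [0, 1, 0, 1]   # splits the data independently of g1

print(joint_entropy(g1, g2))        # 1.0 = H(g1): nothing new is added
print(joint_entropy(g1, g3))        # 2.0 = H(g1) + H(g3)
print(entropy_curve([g1, g2, g3]))  # [1.0, 1.0, 2.0]
```

A curve that flattens, as between the first two entries here, is the signature of stability; a curve that keeps climbing by about $\log_2 k$ per added clustering indicates that each run groups the data differently.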