Clustering and K-means
Root Mean Square Error (RMS)
Data: $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \in \mathbb{R}^d$
Approximations: $\vec{z}_1, \vec{z}_2, \ldots, \vec{z}_N \in \mathbb{R}^d$
Root Mean Square error $= \frac{1}{N}\sum_{i=1}^{N} \lVert \vec{x}_i - \vec{z}_i \rVert_2^2$
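A minimal numpy sketch of this error, assuming X and Z are (N, d) arrays holding the data points and their approximations row by row:

```python
import numpy as np

def rms_error(X, Z):
    """(1/N) * sum_i ||x_i - z_i||_2^2 for row-aligned data X and approximations Z."""
    return float(np.mean(np.sum((X - Z) ** 2, axis=1)))
```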
PCA based prediction
Data: $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \in \mathbb{R}^d$
Mean vector: $\vec{\mu}$. Top $k$ eigenvectors: $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k$
Approximation of $\vec{x}_j$: $\vec{o}_j = \vec{\mu} + \sum_{i=1}^{k} \left(\vec{v}_i \cdot (\vec{x}_j - \vec{\mu})\right) \vec{v}_i$
[Figure: data points (x) and their PCA approximations (o), which lie on the top principal subspace]
RMS Error $= \frac{1}{N}\sum_{i=1}^{N} \lVert \vec{x}_i - \vec{o}_i \rVert_2^2$
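A short numpy sketch of the PCA approximation above; it assumes mu is the mean vector and the top-k eigenvectors are the rows of an orthonormal matrix V:

```python
import numpy as np

def pca_approximate(X, mu, V):
    """o_j = mu + sum_i (v_i . (x_j - mu)) v_i for every row x_j of X.

    X: (N, d) data, mu: (d,) mean vector, V: (k, d) top eigenvectors as rows.
    """
    C = (X - mu) @ V.T       # (N, k) coefficients v_i . (x_j - mu)
    return mu + C @ V        # (N, d) approximations o_j

# RMS error of the approximation, as defined above:
# err = np.mean(np.sum((X - pca_approximate(X, mu, V)) ** 2, axis=1))
```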
Regression based prediction
Data: $(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_N, y_N) \in \mathbb{R}^d \times \mathbb{R}$
Input: $\vec{x} \in \mathbb{R}^d$. Output: $y \in \mathbb{R}$
Approximation of $y$ given $\vec{x}$: $\hat{y} = a_0 + \sum_{i=1}^{d} a_i x_i$
[Figure: data points (x) and their regression predictions (o) on the fitted line]
RMS Error $= \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2$
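A least-squares sketch of this predictor (the names fit_linear, a0, a are illustrative, not from the slides):

```python
import numpy as np

def fit_linear(X, y):
    """Fit y ~ a0 + sum_i a_i x_i by least squares. X: (N, d), y: (N,)."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])    # column of 1s for the intercept a0
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[0], coef[1:]                         # a0 and (a_1, ..., a_d)

def rms_error_y(y, y_hat):
    """(1/N) * sum_i (y_i - y_hat_i)^2, the slide's RMS error for regression."""
    return float(np.mean((y - y_hat) ** 2))
```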
K-means clustering
Data: $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \in \mathbb{R}^d$
Model: $k$ representatives $\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k \in \mathbb{R}^d$
Approximation of $\vec{x}_j$: $\vec{o}_j = \arg\min_{\vec{r}_i} \lVert \vec{x}_j - \vec{r}_i \rVert_2^2$ = the representative closest to $\vec{x}_j$
RMS Error $= \frac{1}{N}\sum_{i=1}^{N} \lVert \vec{x}_i - \vec{o}_i \rVert_2^2$
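A sketch of this assignment rule, assuming the k representatives are stacked as the rows of an array R:

```python
import numpy as np

def closest_representative(X, R):
    """Index of the representative closest to each data point.

    X: (N, d) data, R: (k, d) representatives; o_j is then R[assign[j]].
    """
    d2 = np.sum((X[:, None, :] - R[None, :, :]) ** 2, axis=2)   # (N, k) squared distances
    return np.argmin(d2, axis=1)
```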
K-means Algorithm
Initialize $k$ representatives $\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k \in \mathbb{R}^d$.
Iterate until convergence:
a. Associate each $\vec{x}_i$ with its closest representative: $\vec{x}_i \mapsto \vec{r}_j$.
b. Replace each representative $\vec{r}_j$ with the mean of the points assigned to it.
Both step a and step b reduce the RMS error.
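Putting both steps in a loop gives a compact numpy sketch of the algorithm (random initialization from the data, as on the next slide; names are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate the a-step (assign) and b-step (re-center) until the reps stop moving."""
    rng = np.random.default_rng(seed)
    R = X[rng.choice(len(X), size=k, replace=False)]    # k initial representatives
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # a-step: associate each x_i with its closest representative
        d2 = np.sum((X[:, None, :] - R[None, :, :]) ** 2, axis=2)
        assign = np.argmin(d2, axis=1)
        # b-step: replace each representative with the mean of its assigned points
        R_new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else R[j]
                          for j in range(k)])
        if np.allclose(R_new, R):                       # converged
            break
        R = R_new
    return R, assign
```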
Simple Initialization
Simplest initialization: choose representatives from the data points independently at random.
– Problem: some representatives are close to each other and some parts of the data have no representatives.
– K-means is a local search method and can get stuck in local minima.
K-means++
• A different method for initializing the representatives.
• Spreads out the initial representatives.
• Adds representatives one by one.
• Before adding a representative, defines a distribution over the unselected data points.
Data: $\vec{x}_1, \ldots, \vec{x}_N$. Current representatives: $\vec{r}_1, \ldots, \vec{r}_j$
Distance of an example to the representatives: $d(\vec{x}, \{\vec{r}_1, \ldots, \vec{r}_j\}) = \min_{1 \le i \le j} \lVert \vec{x} - \vec{r}_i \rVert$
Probability of selecting example $\vec{x}$ as the next representative: $P(\vec{x}) = \frac{1}{Z}\, d(\vec{x}, \{\vec{r}_1, \ldots, \vec{r}_j\})$
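A sketch of this initialization in numpy. It selects with probability proportional to $d(\vec{x}, \cdot)$ as written on the slide; the standard k-means++ of Arthur and Vassilvitskii uses $d(\vec{x}, \cdot)^2$ instead:

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """Add representatives one by one, favoring points far from those already chosen."""
    rng = np.random.default_rng(seed)
    reps = [X[rng.integers(len(X))]]                 # first representative: uniform
    while len(reps) < k:
        # d(x, {r_1,...,r_j}): distance to the closest representative so far
        d = np.min(np.linalg.norm(X[:, None, :] - np.asarray(reps)[None, :, :], axis=2),
                   axis=1)
        p = d / d.sum()                              # P(x) = d(x, reps) / Z
        reps.append(X[rng.choice(len(X), p=p)])
    return np.asarray(reps)
```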
Example for K-means++
This is an unlikely initialization for k-means++.
Parallelized K-means
• Suppose the data points are partitioned randomly across several machines.
• We want to perform the a and b steps with minimal communication between machines.
1. Choose initial representatives and broadcast them to all machines.
2. Each machine partitions its own data points according to the closest representative. Define (key, value) pairs where key = index of the closest representative and value = the example.
3. Compute the mean for each set by performing reduceByKey (most of the summing is done locally on each machine).
4. Broadcast the new representatives to all machines.
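A sketch of one a/b iteration with PySpark RDDs, under the assumption that sc is an existing SparkContext, points is an RDD of numpy vectors, and reps is the current list of representatives (all names illustrative):

```python
import numpy as np

def kmeans_step(sc, points, reps):
    """One distributed iteration: broadcast reps, assign locally, re-center via reduceByKey."""
    b_reps = sc.broadcast(reps)                       # steps 1/4: ship reps to all machines
    def closest(x):                                   # step 2: key = index of closest rep
        return int(np.argmin([np.sum((x - r) ** 2) for r in b_reps.value]))
    # carry (sum, count) as the value so reduceByKey can do most summing locally (step 3)
    totals = (points.map(lambda x: (closest(x), (x, 1)))
                    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                    .collectAsMap())
    return [totals[j][0] / totals[j][1] for j in sorted(totals)]   # new means
```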
Clustering stability
[Figures: the same data clustered using starting points 1, starting points 2, and starting points 3]
Measuring clustering stability

              x1  x2  x3  x4  x5  x6  x7  x8  x9
Clustering 1   1   1   3   1   3   2   2   2   3
Clustering 2   2   2   1   2   1   3   3   3   1
Clustering 3   2   2   3   2   3   1   1   1   3
Clustering 4   1   1   1   1   3   3   3   3   1

The entry in row "clustering j", column "x_i" contains the index of the representative closest to x_i for clustering j.
The first three clusterings are completely consistent with each other.
The fourth clustering has a disagreement in x5.
How to quantify stability?
• We say that a clustering is stable if the examples are always grouped in the same way.
• When we have thousands of examples, we cannot expect all of them to always be grouped the same way.
• We need a way to quantify the stability.
• Basic idea: measure how much the groupings differ between clusterings.
Entropy
A partition $G$ of the data defines a distribution over the parts: $p_1 + p_2 + \cdots + p_k = 1$. The information in this partition is measured by the entropy:
$H(G) = H(p_1, p_2, \ldots, p_k) = \sum_{i=1}^{k} p_i \log_2 \frac{1}{p_i}$
$H(G)$ is a number between $0$ (one part with prob. 1) and $\log_2 k$ ($p_1 = p_2 = \cdots = p_k = \frac{1}{k}$).
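A small sketch of this quantity, computing the part probabilities p_i from a vector of cluster labels:

```python
import numpy as np

def entropy(labels):
    """H(G) = sum_i p_i log2(1/p_i), with p_i the fraction of points in part i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

# entropy([1, 1, 1]) -> 0.0          (one part with prob. 1)
# entropy([1, 2, 3]) -> log2(3)      (k equal parts give log2 k)
```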
Entropy of a combined partition
If clustering 1 and clustering 2 partition the data in the exact same way, then $G_1 = G_2$ and $H(G_1, G_2) = H(G_1) = H(G_2)$.
If clustering 1 and clustering 2 are independent (partition the data independently of each other), then $H(G_1, G_2) = H(G_1) + H(G_2)$.
Suppose we produce many clusterings, using many starting points, and plot $H(G_1), H(G_1, G_2), \ldots, H(G_1, G_2, \ldots, G_i), \ldots$ as a function of $i$. If the graph increases like $i \log_2 k$, then the clustering is completely unstable. If the graph stops increasing after some $i$, then we have reached stability.
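A sketch of this stability curve, treating the combined partition $(G_1, \ldots, G_i)$ as the partition a point belongs to according to its tuple of labels:

```python
import numpy as np

def combined_entropy(label_rows):
    """Joint entropy H(G_1,...,G_i): each point's 'part' is its tuple of labels."""
    tuples = np.asarray(list(zip(*label_rows)))      # one row of labels per point
    _, counts = np.unique(tuples, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

# Stability curve for clusterings produced from different starting points:
# curve = [combined_entropy(labelings[:i]) for i in range(1, len(labelings) + 1)]
```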