Jeffrey D. Ullman, Stanford University
TRANSCRIPT
¡ Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are "close" to each other, while members of different clusters are "far."
[Figure: a two-dimensional scatter of points forming a few visually obvious clusters.]
¡ Clustering in two dimensions looks easy.
¡ Clustering small amounts of data looks easy.
¡ And in most cases, looks are not deceiving.
¡ Many applications involve not 2, but 10 or 10,000 dimensions.
¡ High-dimensional spaces look different: almost all pairs of points are at about the same distance.
¡ Assume random points between 0 and 1 in each dimension.
¡ In 2 dimensions: a variety of distances between 0 and 1.41.
¡ In any number of dimensions, the distance between two random points in any one dimension is distributed as a triangle.
[Figure: the triangular density of the per-dimension distance on [0, 1]. Any point is at distance zero from itself; only the points 0 and 1 are at distance 1.]
¡ The law of large numbers applies.
¡ The actual distance between two random points is the sqrt of the sum of squares of essentially the same set of differences.
§ I.e., "all points are the same distance apart."
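This distance concentration is easy to check empirically. A minimal sketch (the helper names are ours), assuming uniform random points in the unit cube: the relative spread of pairwise distances collapses as the dimension grows.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the experiment is repeatable

def random_points(n, d):
    """n random points, each coordinate uniform in [0, 1]."""
    return [tuple(rng.random() for _ in range(d)) for _ in range(n)]

def pairwise_distances(points):
    return [math.dist(p, q)
            for i, p in enumerate(points) for q in points[i + 1:]]

def relative_spread(xs):
    """Standard deviation divided by mean: how 'varied' the distances are."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return math.sqrt(var) / mean

low = relative_spread(pairwise_distances(random_points(100, 2)))
high = relative_spread(pairwise_distances(random_points(100, 1000)))
# In 2 dimensions the distances vary a lot; in 1000 dimensions they
# cluster tightly around sqrt(1000/6): "all points are the same
# distance apart."
```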
¡ Euclidean spaces have dimensions, and points have coordinates in each dimension.
¡ The distance between points is usually the square root of the sum of the squares of the distances in each dimension.
¡ Non-Euclidean spaces have a distance measure, but points do not really have a position in the space.
§ Big problem: cannot "average" points.
¡ Objects are sequences of {C,A,T,G}.
¡ Distance between sequences = edit distance = the minimum number of inserts and deletes needed to turn one into the other.
§ Notice: no way to "average" two strings.
¡ In practice, the distance for DNA sequences is more complicated: it allows other operations like mutations (change of one symbol into another) or reversal of substrings.
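The insert/delete-only edit distance described above is computable with the standard dynamic program. A sketch (function name ours):

```python
def edit_distance(s, t):
    """Edit distance with inserts and deletes only (no substitutions):
    minimum number of operations to turn s into t."""
    m, n = len(s), len(t)
    # dp[i][j] = cost to turn s[:i] into t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]        # symbols match: free
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],   # delete s[i-1]
                                   dp[i][j - 1])   # insert t[j-1]
    return dp[m][n]
```

For example, "CATG" becomes "ATGC" in two operations: delete the leading C, insert a C at the end.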
¡ Hierarchical (Agglomerative):
§ Initially, each point is in a cluster by itself.
§ Repeatedly combine the two "nearest" clusters into one.
¡ Point Assignment:
§ Maintain a set of clusters.
§ Place points into their "nearest" cluster.
§ Possibly split clusters or combine clusters as we go.
¡ Point assignment is good when clusters are nice, convex shapes.
¡ Hierarchical can win when shapes are weird.
Aside: if you realized you had concentric clusters, you could map points based on distance from the center, and turn the problem into a simple, one-dimensional case.
¡ Two important questions:
1. How do you determine the "nearness" of clusters?
2. How do you represent a cluster of more than one point?
¡ Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest?
¡ Euclidean case: each cluster has a centroid = average of its points.
§ Measure intercluster distances by distances of centroids.
[Figure: example with data points (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); successive merges produce centroids (1.5,1.5), (1,1), (4.5,0.5), and (4.7,1.3).]

[Figure: the corresponding dendrogram, with leaves (0,0), (1,2), (2,1), (4,1), (5,0), (5,3).]
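The merges in this example can be reproduced with a tiny agglomerative loop. A sketch of the Euclidean, centroid-based version (names ours), run on the six example points:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    """Average of the points in a cluster (Euclidean case)."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def agglomerate(points, k):
    """Hierarchical clustering: repeatedly merge the two clusters with
    the nearest centroids, until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters whose centroids are closest
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(centroid(clusters[ij[0]]),
                                       centroid(clusters[ij[1]])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
two = agglomerate(pts, 2)
# Merges (1,2) with (2,1), then (4,1) with (5,0), then (0,0) and (5,3)
# join their nearby clusters, leaving the two clusters of the figure.
```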
¡ The only "locations" we can talk about are the points themselves.
§ I.e., there is no "average" of two points.
¡ Approach 1: clustroid = point "closest" to the other points.
§ Treat the clustroid as if it were the centroid when computing intercluster distances.
¡ Possible meanings of "closest":
1. Smallest maximum distance to the other points.
2. Smallest average distance to the other points.
3. Smallest sum of squares of distances to the other points.
4. Etc., etc.
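Meaning 3 can be sketched as follows (names ours). Note that only a distance function is needed, so the same code works in a non-Euclidean space; here the toy distance counts differing positions of equal-length strings:

```python
def clustroid(cluster, d):
    """Clustroid: the member minimizing the sum of squares of distances
    to the other members (meaning 3 above). Only the distance function d
    is needed -- no averaging of points."""
    return min(cluster, key=lambda p: sum(d(p, q) ** 2 for q in cluster))

def hamming(s, t):
    """Toy distance for equal-length strings: number of differing positions."""
    return sum(a != b for a, b in zip(s, t))

c = clustroid(["CAT", "CAR", "BAT", "CAP"], hamming)
# "CAT" is one change away from each of the others, so it wins.
```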
[Figure: two clusters of numbered points; the intercluster distance is measured between the clustroid of one cluster and the clustroid of the other.]
¡ Approach 2: intercluster distance = minimum of the distances between any two points, one from each cluster.
¡ Approach 3: Pick a notion of "cohesion" of clusters, e.g., maximum distance from the centroid or clustroid.
§ Merge the clusters whose union is most cohesive.
¡ Approach 1: Use the diameter of the merged cluster = maximum distance between points in the cluster.
¡ Approach 2: Use the average distance between points in the cluster.
¡ Approach 3: Density-based approach: take the diameter or average distance, e.g., and divide by the number of points in the cluster.
§ Perhaps raise the number of points to a power first, e.g., square root.
¡ It really depends on the shape of the clusters.
§ Which you may not know in advance.
¡ Example: we'll compare two approaches:
1. Merge clusters with the smallest distance between centroids (or clustroids for non-Euclidean).
2. Merge clusters with the smallest distance between two points, one from each cluster.
¡ Centroid-based merging works well.
¡ But merging based on closest members might accidentally merge incorrectly.

[Figure: three clusters A, B, C; A and B have closer centroids than A and C, but the closest points are from A and C.]

¡ Linking based on closest members works well.
¡ But centroid-based linking might cause errors.
¡ An example of point assignment.
¡ Assumes a Euclidean space.
¡ Start by picking k, the number of clusters.
¡ Initialize clusters with a seed (= one point per cluster).
§ Example: pick one point at random, then k−1 other points, each as far away as possible from the previous points.
§ OK, as long as there are no outliers (points that are far from any reasonable cluster).
¡ Basic idea: pick a small sample of points, cluster them by any algorithm, and use the centroids as a seed.
¡ In k-means++, sample size = k times a factor that is logarithmic in the total number of points.
¡ Sequentially pick sample points randomly, but the probability of adding a point p to the sample is proportional to D(p)².
§ D(p) = distance between p and the nearest picked point.
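The D(p)² rule can be sketched as follows (names ours; `random.choices` does the proportional sampling):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def d_squared_sample(points, k, rng=None):
    """Sequentially pick k sample points: the first at random, then each
    new point p with probability proportional to D(p)^2, where D(p) is
    the distance from p to the nearest already-picked point."""
    if rng is None:
        rng = random.Random(0)
    picked = [rng.choice(points)]
    while len(picked) < k:
        weights = [min(dist2(p, s) for s in picked) for p in points]
        # already-picked points get weight 0, so they are never repeated
        picked.append(rng.choices(points, weights=weights, k=1)[0])
    return picked

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
seeds = d_squared_sample(pts, 2)
# With two tight, far-apart groups, the second seed almost certainly
# lands in the opposite group from the first.
```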
¡ k-means++, like other seed methods, is sequential.
§ You need to update D(p) for each unpicked p due to the new point.
¡ Naturally parallel: many compute nodes can each handle a small set of points.
§ Each picks a few new sample points using the same D(p).
¡ Really important and common trick: don't update after every selection; rather, make many selections in one round.
§ Suboptimal picks don't really matter.
1. For each point, place it in the cluster whose current centroid it is nearest.
2. After all points are assigned, fix the centroids of the k clusters.
3. Optional: reassign all points to their closest centroid.
§ Sometimes moves points between clusters.
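One round of this assign-then-fix loop can be sketched as (names ours):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def kmeans_round(points, centroids):
    """One round: assign each point to the cluster whose current centroid
    it is nearest, then fix (recompute) the centroids."""
    clusters = [[] for _ in centroids]
    for p in points:
        i = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        clusters[i].append(p)
    clusters = [c for c in clusters if c]   # drop empty clusters
    return [centroid(c) for c in clusters], clusters

pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
cents, clusters = kmeans_round(pts, [(0, 0), (10, 0)])
# Each seed collects its two nearby points; the fixed centroids move
# to (0, 1) and (10, 1).
```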
[Figure: eight numbered points; the clusters after the first round, with two reassigned points marked.]
¡ Try different k, looking at the change in the average distance to centroid as k increases.
¡ The average falls rapidly until the right k, then changes little.

[Figure: average distance to centroid plotted against k; the curve drops steeply, then flattens at the best value of k.]

Note: binary search for k is possible.
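A minimal sketch of this experiment (names ours; k-means is seeded with the first k points purely for determinism, which is not a recommended seeding method):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    return tuple(sum(p[i] for p in cluster) / len(cluster)
                 for i in range(len(cluster[0])))

def avg_dist_for_k(points, k, rounds=10):
    """Run a simple k-means and return the average distance from each
    point to its nearest centroid."""
    cents = list(points[:k])
    for _ in range(rounds):
        clusters = [[] for _ in cents]
        for p in points:
            i = min(range(len(cents)), key=lambda j: dist(p, cents[j]))
            clusters[i].append(p)
        cents = [centroid(c) for c in clusters if c]
    return sum(min(dist(p, c) for c in cents) for p in points) / len(points)

pts = [(0, 0), (10, 0), (0, 1), (10, 1), (0, 2), (10, 2)]  # two clusters
avgs = [avg_dist_for_k(pts, k) for k in (1, 2, 3)]
# The average falls sharply from k=1 to the right k=2,
# then changes little from k=2 to k=3.
```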
[Figure: the same scatter of points clustered three ways. Too few clusters: many long distances to centroid. Just right: distances rather short. Too many: little improvement in the average distance.]
¡ BFR (Bradley-Fayyad-Reina) is a variant of k-means designed to handle very large (disk-resident) data sets.
¡ It assumes that clusters are normally distributed around a centroid in a Euclidean space.
§ Standard deviations in different dimensions may vary.
¡ Points are read one main-memory-full at a time.
¡ Most points from previous memory loads are summarized by simple statistics.
§ Also kept in main memory, which limits how many points can be read in one "memory load."
¡ To begin, from the initial load we select the initial k centroids by some sensible approach.
1. The discard set (DS): points close enough to a centroid to be summarized.
2. The compression set (CS): groups of points that are close together but not close to any centroid. They are summarized, but not assigned to a cluster.
3. The retained set (RS): isolated points.
[Figure: a cluster, whose points are in the DS, with its centroid marked; compression sets, whose points are in the CS; and isolated points in the RS.]
¡ Each cluster in the discard set, and each compression set, is summarized by:
1. The number of points, N.
2. The vector SUM, whose ith component is the sum of the coordinates of the points in the ith dimension.
3. The vector SUMSQ: ith component = sum of squares of the coordinates in the ith dimension.

¡ 2d+1 values represent any number of points.
§ d = number of dimensions.
¡ Averages in each dimension (the centroid coordinates) can be calculated easily as SUMi/N.
§ SUMi = ith component of SUM.
¡ The variance in dimension i can be computed as (SUMSQi/N) − (SUMi/N)².
§ And the standard deviation is the square root of that.
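These formulas can be sketched directly (names ours):

```python
import math

def summarize(points):
    """BFR-style summary of a set of points: N, SUM, SUMSQ
    (2d+1 values in d dimensions)."""
    d = len(points[0])
    N = len(points)
    SUM = [sum(p[i] for p in points) for i in range(d)]
    SUMSQ = [sum(p[i] ** 2 for p in points) for i in range(d)]
    return N, SUM, SUMSQ

def centroid_and_std(N, SUM, SUMSQ):
    """Centroid coordinate i = SUMi/N;
    variance in dimension i = SUMSQi/N - (SUMi/N)^2."""
    cent = [s / N for s in SUM]
    var = [sq / N - (s / N) ** 2 for s, sq in zip(SUM, SUMSQ)]
    return cent, [math.sqrt(v) for v in var]

N, SUM, SUMSQ = summarize([(1, 0), (3, 0), (1, 4), (3, 4)])
cent, std = centroid_and_std(N, SUM, SUMSQ)
# Centroid (2, 2); std 1 in dimension 0, std 2 in dimension 1.
```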
1. Find those points that are "sufficiently close" to a cluster centroid; add those points to that cluster and to the DS.
2. Use any main-memory clustering algorithm to cluster the remaining points and the old RS.
§ Clusters go to the CS; outlying points to the RS.
3. Adjust the statistics of the clusters to account for the new points.
§ Consider merging compressed sets in the CS.
4. If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster.
¡ How do we decide if a point is "close enough" to a cluster that we will add the point to that cluster?
¡ How do we decide whether two compressed sets deserve to be combined into one?
¡ We need a way to decide whether to put a new point into a cluster.
¡ BFR suggests two ways:
1. The Mahalanobis distance is less than a threshold.
2. Low likelihood of the currently nearest centroid changing.
¡ Normalized Euclidean distance from the centroid.
¡ For point (x1,…,xk) and centroid (c1,…,ck):
1. Normalize in each dimension: yi = (xi − ci)/σi.
§ σi = standard deviation in the ith dimension for this cluster.
2. Take the sum of the squares of the yi's.
3. Take the square root.
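A sketch of this simplified, per-dimension (no covariance terms) Mahalanobis distance (names ours):

```python
import math

def mahalanobis(point, centroid, stds):
    """Normalized Euclidean distance from the centroid: normalize each
    dimension by that cluster's standard deviation, sum the squares,
    take the square root."""
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(point, centroid, stds)))

# A point exactly one standard deviation away in each of d = 4 dimensions
# has Mahalanobis distance sqrt(d) = 2:
md = mahalanobis((1, 2, 3, 4), (0, 0, 0, 0), (1, 2, 3, 4))
```

Under BFR's acceptance rule, this point would be accepted by a threshold of, say, 4 standard deviations (4·√d).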
¡ If clusters are normally distributed in d dimensions, then after transformation, one standard deviation = √d.
§ I.e., 70% of the points of the cluster will have a Mahalanobis distance < √d.
¡ Accept a point for a cluster if its M.D. is < some threshold, e.g., 4 standard deviations.
[Figure: a normal curve with σ and 2σ marked.]
¡ Similar to measuring cohesion. For example:
¡ Compute the variance of the combined subcluster, in each dimension.
§ N, SUM, and SUMSQ allow us to make that calculation quickly.
¡ Combine if the variance is below some threshold.
¡ Many alternatives: treat dimensions differently, consider density.
¡ Problem with BFR/k-means:
§ Assumes clusters are normally distributed in each dimension.
§ And the axes are fixed; ellipses at an angle are not OK.
¡ CURE:
§ Assumes a Euclidean distance.
§ Allows clusters to assume any shape.
[Figure: points labeled e and h plotted with age on the x-axis and salary on the y-axis, forming oddly shaped clusters.]
1. Pick a random sample of points that fits in main memory.
2. Cluster these points hierarchically: group the nearest points/clusters.
3. For each cluster, pick a sample of points, as dispersed as possible.
4. From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster.
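Step 4 can be sketched as (names ours; the sample of dispersed points is assumed already chosen):

```python
def representatives(cluster, sample, fraction=0.2):
    """CURE-style representatives: move each sampled (dispersed) point
    the given fraction of the way toward the cluster's centroid."""
    d = len(cluster[0])
    cent = [sum(p[i] for p in cluster) / len(cluster) for i in range(d)]
    return [tuple(x + fraction * (c - x) for x, c in zip(p, cent))
            for p in sample]

# Square cluster with centroid (5, 5); two opposite corners as the sample:
reps = representatives([(0, 0), (10, 0), (0, 10), (10, 10)],
                       [(0, 0), (10, 10)])
# Each corner moves 20% of the way toward (5, 5).
```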
[Figure: the same age/salary plot; pick (say) 4 remote points for each cluster.]

[Figure: the same plot; move the points (say) 20% toward the centroid.]
¡ A large, dispersed cluster will have large moves from its boundary.
¡ A small, dense cluster will have little move.
¡ Favors a small, dense cluster that is near a larger, dispersed cluster.
¡ Now, visit each point p in the data set.
¡ Place it in the "closest cluster."
§ Normal definition of "closest": the cluster with the closest (to p) among all the sample points of all the clusters.
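This assignment rule can be sketched as (names and the cluster-id mapping are ours):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(point, reps_by_cluster):
    """Place a point in the cluster owning the representative (sample)
    point closest to it. reps_by_cluster maps a cluster id to that
    cluster's list of representative points."""
    return min(reps_by_cluster,
               key=lambda cid: min(dist(point, r)
                                   for r in reps_by_cluster[cid]))

reps = {"A": [(1, 1), (2, 2)], "B": [(9, 9), (8, 8)]}
# A point near the origin goes to "A"; a point near (10, 10) goes to "B".
```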