Jeffrey D. Ullman, Stanford University
TRANSCRIPT
¡ Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are "close" to each other, while members of different clusters are "far."
[Figure: a two-dimensional scatter of points forming a few visually obvious clusters.]
¡ Clustering in two dimensions looks easy.
¡ Clustering small amounts of data looks easy.
¡ And in most cases, looks are not deceiving.
¡ Many applications involve not 2, but 10 or 10,000 dimensions.
¡ High-dimensional spaces look different: almost all pairs of points are at about the same distance.
¡ Assume random points between 0 and 1 in each dimension.
¡ In 2 dimensions: a variety of distances between 0 and 1.41.
¡ In any number of dimensions, the distance between two random points in any one dimension is distributed as a triangle.
[Figure: the triangular density of the per-dimension distance on [0, 1]. Any point is at distance zero from itself; only the points 0 and 1 are at distance 1.]
¡ The law of large numbers applies.
¡ The actual distance between two random points is the sqrt of the sum of squares of essentially the same set of differences.
§ I.e., "all points are the same distance apart."
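This distance concentration is easy to check empirically. A minimal sketch (the helper names are ours), assuming uniform random points in the unit cube: the relative spread of pairwise distances collapses as the dimension grows.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the experiment is repeatable

def random_points(n, d):
    """n random points, each coordinate uniform in [0, 1]."""
    return [tuple(rng.random() for _ in range(d)) for _ in range(n)]

def pairwise_distances(points):
    return [math.dist(p, q)
            for i, p in enumerate(points) for q in points[i + 1:]]

def relative_spread(xs):
    """Standard deviation divided by mean: how 'varied' the distances are."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return math.sqrt(var) / mean

low = relative_spread(pairwise_distances(random_points(100, 2)))
high = relative_spread(pairwise_distances(random_points(100, 1000)))
# In 2 dimensions the distances vary a lot; in 1000 dimensions they
# cluster tightly around sqrt(1000/6): "all points are the same
# distance apart."
```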
¡ Euclidean spaces have dimensions, and points have coordinates in each dimension.
¡ The distance between points is usually the square root of the sum of the squares of the distances in each dimension.
¡ Non-Euclidean spaces have a distance measure, but points do not really have a position in the space.
§ Big problem: cannot "average" points.
¡ Objects are sequences of {C,A,T,G}.
¡ Distance between sequences = edit distance = the minimum number of inserts and deletes needed to turn one into the other.
§ Notice: no way to "average" two strings.
¡ In practice, the distance for DNA sequences is more complicated: it allows other operations like mutations (change of one symbol into another) or reversal of substrings.
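The insert/delete-only edit distance described above is computable with the standard dynamic program. A sketch (function name ours):

```python
def edit_distance(s, t):
    """Edit distance with inserts and deletes only (no substitutions):
    minimum number of operations to turn s into t."""
    m, n = len(s), len(t)
    # dp[i][j] = cost to turn s[:i] into t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]        # symbols match: free
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],   # delete s[i-1]
                                   dp[i][j - 1])   # insert t[j-1]
    return dp[m][n]
```

For example, "CATG" becomes "ATGC" in two operations: delete the leading C, insert a C at the end.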
¡ Hierarchical (Agglomerative):
§ Initially, each point is in a cluster by itself.
§ Repeatedly combine the two "nearest" clusters into one.
¡ Point Assignment:
§ Maintain a set of clusters.
§ Place points into their "nearest" cluster.
§ Possibly split clusters or combine clusters as we go.
¡ Point assignment is good when clusters are nice, convex shapes.
¡ Hierarchical can win when shapes are weird.
Aside: if you realized you had concentric clusters, you could map points based on distance from the center, and turn the problem into a simple, one-dimensional case.
¡ Two important questions:
1. How do you determine the "nearness" of clusters?
2. How do you represent a cluster of more than one point?
¡ Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest?
¡ Euclidean case: each cluster has a centroid = average of its points.
§ Measure intercluster distances by distances of centroids.
[Figure: example with data points (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); successive merges produce centroids (1.5,1.5), (1,1), (4.5,0.5), and (4.7,1.3).]

[Figure: the corresponding dendrogram, with leaves (0,0), (1,2), (2,1), (4,1), (5,0), (5,3).]
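The merges in this example can be reproduced with a tiny agglomerative loop. A sketch of the Euclidean, centroid-based version (names ours), run on the six example points:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    """Average of the points in a cluster (Euclidean case)."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def agglomerate(points, k):
    """Hierarchical clustering: repeatedly merge the two clusters with
    the nearest centroids, until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters whose centroids are closest
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(centroid(clusters[ij[0]]),
                                       centroid(clusters[ij[1]])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
two = agglomerate(pts, 2)
# Merges (1,2) with (2,1), then (4,1) with (5,0), then (0,0) and (5,3)
# join their nearby clusters, leaving the two clusters of the figure.
```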
¡ The only "locations" we can talk about are the points themselves.
§ I.e., there is no "average" of two points.
¡ Approach 1: clustroid = point "closest" to the other points.
§ Treat the clustroid as if it were the centroid when computing intercluster distances.
¡ Possible meanings of "closest":
1. Smallest maximum distance to the other points.
2. Smallest average distance to the other points.
3. Smallest sum of squares of distances to the other points.
4. Etc., etc.
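Meaning 3 can be sketched as follows (names ours). Note that only a distance function is needed, so the same code works in a non-Euclidean space; here the toy distance counts differing positions of equal-length strings:

```python
def clustroid(cluster, d):
    """Clustroid: the member minimizing the sum of squares of distances
    to the other members (meaning 3 above). Only the distance function d
    is needed -- no averaging of points."""
    return min(cluster, key=lambda p: sum(d(p, q) ** 2 for q in cluster))

def hamming(s, t):
    """Toy distance for equal-length strings: number of differing positions."""
    return sum(a != b for a, b in zip(s, t))

c = clustroid(["CAT", "CAR", "BAT", "CAP"], hamming)
# "CAT" is one change away from each of the others, so it wins.
```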
[Figure: two clusters of numbered points; the intercluster distance is measured between the clustroid of one cluster and the clustroid of the other.]
¡ Approach 2: intercluster distance = minimum of the distances between any two points, one from each cluster.
¡ Approach 3: Pick a notion of "cohesion" of clusters, e.g., maximum distance from the centroid or clustroid.
§ Merge the clusters whose union is most cohesive.
¡ Approach 1: Use the diameter of the merged cluster = maximum distance between points in the cluster.
¡ Approach 2: Use the average distance between points in the cluster.
¡ Approach 3: Density-based approach: take the diameter or average distance, e.g., and divide by the number of points in the cluster.
§ Perhaps raise the number of points to a power first, e.g., square root.
¡ It really depends on the shape of the clusters.
§ Which you may not know in advance.
¡ Example: we'll compare two approaches:
1. Merge clusters with the smallest distance between centroids (or clustroids for non-Euclidean).
2. Merge clusters with the smallest distance between two points, one from each cluster.
¡ Centroid-based merging works well.
¡ But merging based on closest members might accidentally merge incorrectly.

[Figure: three clusters A, B, C; A and B have closer centroids than A and C, but the closest points are from A and C.]

¡ Linking based on closest members works well.
¡ But centroid-based linking might cause errors.
¡ An example of point assignment.
¡ Assumes a Euclidean space.
¡ Start by picking k, the number of clusters.
¡ Initialize clusters with a seed (= one point per cluster).
§ Example: pick one point at random, then k−1 other points, each as far away as possible from the previous points.
§ OK, as long as there are no outliers (points that are far from any reasonable cluster).
¡ Basic idea: pick a small sample of points, cluster them by any algorithm, and use the centroids as a seed.
¡ In k-means++, sample size = k times a factor that is logarithmic in the total number of points.
¡ Sequentially pick sample points randomly, but the probability of adding a point p to the sample is proportional to D(p)².
§ D(p) = distance between p and the nearest picked point.
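The D(p)² rule can be sketched as follows (names ours; `random.choices` does the proportional sampling):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def d_squared_sample(points, k, rng=None):
    """Sequentially pick k sample points: the first at random, then each
    new point p with probability proportional to D(p)^2, where D(p) is
    the distance from p to the nearest already-picked point."""
    if rng is None:
        rng = random.Random(0)
    picked = [rng.choice(points)]
    while len(picked) < k:
        weights = [min(dist2(p, s) for s in picked) for p in points]
        # already-picked points get weight 0, so they are never repeated
        picked.append(rng.choices(points, weights=weights, k=1)[0])
    return picked

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
seeds = d_squared_sample(pts, 2)
# With two tight, far-apart groups, the second seed almost certainly
# lands in the opposite group from the first.
```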
¡ k-means++, like other seed methods, is sequential.
§ You need to update D(p) for each unpicked p due to the new point.
¡ Naturally parallel: many compute nodes can each handle a small set of points.
§ Each picks a few new sample points using the same D(p).
¡ Really important and common trick: don't update after every selection; rather, make many selections in one round.
§ Suboptimal picks don't really matter.
1. For each point, place it in the cluster whose current centroid it is nearest.
2. After all points are assigned, fix the centroids of the k clusters.
3. Optional: reassign all points to their closest centroid.
§ Sometimes moves points between clusters.
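One round of this assign-then-fix loop can be sketched as (names ours):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def kmeans_round(points, centroids):
    """One round: assign each point to the cluster whose current centroid
    it is nearest, then fix (recompute) the centroids."""
    clusters = [[] for _ in centroids]
    for p in points:
        i = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        clusters[i].append(p)
    clusters = [c for c in clusters if c]   # drop empty clusters
    return [centroid(c) for c in clusters], clusters

pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
cents, clusters = kmeans_round(pts, [(0, 0), (10, 0)])
# Each seed collects its two nearby points; the fixed centroids move
# to (0, 1) and (10, 1).
```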
[Figure: eight numbered points; the clusters after the first round, with two reassigned points marked.]
¡ Try different k, looking at the change in the average distance to centroid as k increases.
¡ The average falls rapidly until the right k, then changes little.

[Figure: average distance to centroid plotted against k; the curve drops steeply, then flattens at the best value of k.]

Note: binary search for k is possible.
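A minimal sketch of this experiment (names ours; k-means is seeded with the first k points purely for determinism, which is not a recommended seeding method):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    return tuple(sum(p[i] for p in cluster) / len(cluster)
                 for i in range(len(cluster[0])))

def avg_dist_for_k(points, k, rounds=10):
    """Run a simple k-means and return the average distance from each
    point to its nearest centroid."""
    cents = list(points[:k])
    for _ in range(rounds):
        clusters = [[] for _ in cents]
        for p in points:
            i = min(range(len(cents)), key=lambda j: dist(p, cents[j]))
            clusters[i].append(p)
        cents = [centroid(c) for c in clusters if c]
    return sum(min(dist(p, c) for c in cents) for p in points) / len(points)

pts = [(0, 0), (10, 0), (0, 1), (10, 1), (0, 2), (10, 2)]  # two clusters
avgs = [avg_dist_for_k(pts, k) for k in (1, 2, 3)]
# The average falls sharply from k=1 to the right k=2,
# then changes little from k=2 to k=3.
```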
[Figure: the same scatter of points clustered three ways. Too few clusters: many long distances to centroid. Just right: distances rather short. Too many: little improvement in the average distance.]
¡ BFR (Bradley-Fayyad-Reina) is a variant of k-means designed to handle very large (disk-resident) data sets.
¡ It assumes that clusters are normally distributed around a centroid in a Euclidean space.
§ Standard deviations in different dimensions may vary.
¡ Points are read one main-memory-full at a time.
¡ Most points from previous memory loads are summarized by simple statistics.
§ Also kept in main memory, which limits how many points can be read in one "memory load."
¡ To begin, from the initial load we select the initial k centroids by some sensible approach.
1. The discard set (DS): points close enough to a centroid to be summarized.
2. The compression set (CS): groups of points that are close together but not close to any centroid. They are summarized, but not assigned to a cluster.
3. The retained set (RS): isolated points.
[Figure: a cluster, whose points are in the DS, with its centroid marked; compression sets, whose points are in the CS; and isolated points in the RS.]
¡ Each cluster in the discard set, and each compression set, is summarized by:
1. The number of points, N.
2. The vector SUM, whose ith component is the sum of the coordinates of the points in the ith dimension.
3. The vector SUMSQ: ith component = sum of squares of the coordinates in the ith dimension.

¡ 2d+1 values represent any number of points.
§ d = number of dimensions.
¡ Averages in each dimension (the centroid coordinates) can be calculated easily as SUMi/N.
§ SUMi = ith component of SUM.
¡ The variance in dimension i can be computed as (SUMSQi/N) − (SUMi/N)².
§ And the standard deviation is the square root of that.
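These formulas can be sketched directly (names ours):

```python
import math

def summarize(points):
    """BFR-style summary of a set of points: N, SUM, SUMSQ
    (2d+1 values in d dimensions)."""
    d = len(points[0])
    N = len(points)
    SUM = [sum(p[i] for p in points) for i in range(d)]
    SUMSQ = [sum(p[i] ** 2 for p in points) for i in range(d)]
    return N, SUM, SUMSQ

def centroid_and_std(N, SUM, SUMSQ):
    """Centroid coordinate i = SUMi/N;
    variance in dimension i = SUMSQi/N - (SUMi/N)^2."""
    cent = [s / N for s in SUM]
    var = [sq / N - (s / N) ** 2 for s, sq in zip(SUM, SUMSQ)]
    return cent, [math.sqrt(v) for v in var]

N, SUM, SUMSQ = summarize([(1, 0), (3, 0), (1, 4), (3, 4)])
cent, std = centroid_and_std(N, SUM, SUMSQ)
# Centroid (2, 2); std 1 in dimension 0, std 2 in dimension 1.
```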
1. Find those points that are "sufficiently close" to a cluster centroid; add those points to that cluster and to the DS.
2. Use any main-memory clustering algorithm to cluster the remaining points and the old RS.
§ Clusters go to the CS; outlying points to the RS.
3. Adjust the statistics of the clusters to account for the new points.
§ Consider merging compressed sets in the CS.
4. If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster.
¡ How do we decide if a point is "close enough" to a cluster that we will add the point to that cluster?
¡ How do we decide whether two compressed sets deserve to be combined into one?
¡ We need a way to decide whether to put a new point into a cluster.
¡ BFR suggests two ways:
1. The Mahalanobis distance is less than a threshold.
2. Low likelihood of the currently nearest centroid changing.
¡ Normalized Euclidean distance from the centroid.
¡ For point (x1,…,xk) and centroid (c1,…,ck):
1. Normalize in each dimension: yi = (xi − ci)/σi.
§ σi = standard deviation in the ith dimension for this cluster.
2. Take the sum of the squares of the yi's.
3. Take the square root.
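A sketch of this simplified, per-dimension (no covariance terms) Mahalanobis distance (names ours):

```python
import math

def mahalanobis(point, centroid, stds):
    """Normalized Euclidean distance from the centroid: normalize each
    dimension by that cluster's standard deviation, sum the squares,
    take the square root."""
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(point, centroid, stds)))

# A point exactly one standard deviation away in each of d = 4 dimensions
# has Mahalanobis distance sqrt(d) = 2:
md = mahalanobis((1, 2, 3, 4), (0, 0, 0, 0), (1, 2, 3, 4))
```

Under BFR's acceptance rule, this point would be accepted by a threshold of, say, 4 standard deviations (4·√d).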
¡ If clusters are normally distributed in d dimensions, then after transformation, one standard deviation = √d.
§ I.e., 70% of the points of the cluster will have a Mahalanobis distance < √d.
¡ Accept a point for a cluster if its M.D. is < some threshold, e.g., 4 standard deviations.
[Figure: a normal curve with σ and 2σ marked.]
¡ Similar to measuring cohesion. For example:
¡ Compute the variance of the combined subcluster, in each dimension.
§ N, SUM, and SUMSQ allow us to make that calculation quickly.
¡ Combine if the variance is below some threshold.
¡ Many alternatives: treat dimensions differently, consider density.
¡ Problem with BFR/k-means:
§ Assumes clusters are normally distributed in each dimension.
§ And the axes are fixed; ellipses at an angle are not OK.
¡ CURE:
§ Assumes a Euclidean distance.
§ Allows clusters to assume any shape.
[Figure: points labeled e and h plotted with age on the x-axis and salary on the y-axis, forming oddly shaped clusters.]
1. Pick a random sample of points that fits in main memory.
2. Cluster these points hierarchically: group the nearest points/clusters.
3. For each cluster, pick a sample of points, as dispersed as possible.
4. From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster.
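Step 4 can be sketched as (names ours; the sample of dispersed points is assumed already chosen):

```python
def representatives(cluster, sample, fraction=0.2):
    """CURE-style representatives: move each sampled (dispersed) point
    the given fraction of the way toward the cluster's centroid."""
    d = len(cluster[0])
    cent = [sum(p[i] for p in cluster) / len(cluster) for i in range(d)]
    return [tuple(x + fraction * (c - x) for x, c in zip(p, cent))
            for p in sample]

# Square cluster with centroid (5, 5); two opposite corners as the sample:
reps = representatives([(0, 0), (10, 0), (0, 10), (10, 10)],
                       [(0, 0), (10, 10)])
# Each corner moves 20% of the way toward (5, 5).
```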
[Figure: the same age/salary plot; pick (say) 4 remote points for each cluster.]

[Figure: the same plot; move the points (say) 20% toward the centroid.]
¡ A large, dispersed cluster will have large moves from its boundary.
¡ A small, dense cluster will have little move.
¡ Favors a small, dense cluster that is near a larger, dispersed cluster.
¡ Now, visit each point p in the data set.
¡ Place it in the "closest cluster."
§ Normal definition of "closest": the cluster with the closest (to p) among all the sample points of all the clusters.
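This assignment rule can be sketched as (names and the cluster-id mapping are ours):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(point, reps_by_cluster):
    """Place a point in the cluster owning the representative (sample)
    point closest to it. reps_by_cluster maps a cluster id to that
    cluster's list of representative points."""
    return min(reps_by_cluster,
               key=lambda cid: min(dist(point, r)
                                   for r in reps_by_cluster[cid]))

reps = {"A": [(1, 1), (2, 2)], "B": [(9, 9), (8, 8)]}
# A point near the origin goes to "A"; a point near (10, 10) goes to "B".
```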