cs 5614: (big) data management systemspeople.cs.vt.edu/badityap/classes/cs5614-spr17/lectures/...bfr...

CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #16: Clustering


Post on 08-Feb-2020



High Dimensional Data

§  Given a cloud of data points we want to understand its structure

VT CS 5614, Prakash 2017

The Problem of Clustering
§  Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that
–  Members of a cluster are close/similar to each other
–  Members of different clusters are dissimilar

§  Usually:
–  Points are in a high-dimensional space
–  Similarity is defined using a distance measure
•  Euclidean, Cosine, Jaccard, edit distance, …

Example: Clusters & Outliers

[Figure: scatter plot of data points (x); labels: Outlier, Cluster]

Clustering is a hard problem!

Why is it hard?

§  Clustering in two dimensions looks easy
§  Clustering small amounts of data looks easy
§  And in most cases, looks are not deceiving

§  Many applications involve not 2, but 10 or 10,000 dimensions

§  High-dimensional spaces look different: Almost all pairs of points are at about the same distance
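
The last point can be checked numerically. The sketch below is an illustration only: it assumes points drawn uniformly at random from the unit cube (an assumption, not something the lecture specifies), and `distance_spread` is a hypothetical helper name. It compares the smallest and largest pairwise Euclidean distance as the dimension grows.

```python
import math
import random


def distance_spread(dim, n_points=50, seed=0):
    """Return (min, max) pairwise Euclidean distance among random points in [0, 1]^dim."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return min(dists), max(dists)


for dim in (2, 10, 1000):
    lo, hi = distance_spread(dim)
    print(f"dim={dim:5d}  min/max distance ratio = {lo / hi:.2f}")
```

A ratio close to 1 means the nearest and the farthest pair are almost equally far apart, which is why “nearest” carries less information in high dimensions.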

Clustering Problem: Galaxies
§  A catalog of 2 billion “sky objects” represents objects by their radiation in 7 dimensions (frequency bands)
§  Problem: Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
§  Sloan Digital Sky Survey

Clustering Problem: Music CDs
§  Intuitively: Music divides into categories, and customers prefer a few categories
–  But what are categories really?

§  Represent a CD by a set of customers who bought it

§  Similar CDs have similar sets of customers, and vice-versa

Clustering Problem: Music CDs
Space of all CDs:
§  Think of a space with one dim. for each customer
–  Values in a dimension may be 0 or 1 only
–  A CD is a point in this space (x1, x2, …, xk), where xi = 1 iff the i-th customer bought the CD

§  For Amazon, the dimension is tens of millions

§  Task: Find clusters of similar CDs

Clustering Problem: Documents

Finding topics:
§  Represent a document by a vector (x1, x2, …, xk), where xi = 1 iff the i-th word (in some order) appears in the document
–  It actually doesn’t matter if k is infinite; i.e., we don’t limit the set of words

§  Documents with similar sets of words may be about the same topic

Cosine, Jaccard, and Euclidean
§  As with CDs we have a choice when we think of documents as sets of words or shingles:
–  Sets as vectors: Measure similarity by the cosine distance
–  Sets as sets: Measure similarity by the Jaccard distance
–  Sets as points: Measure similarity by Euclidean distance
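
The three views can be compared on the same pair of sets. This is a minimal sketch, not a reference implementation: the function names and the customer names are made up for illustration, and cosine distance is taken here to be the angle between the 0/1 indicator vectors (1 minus the cosine similarity is another common convention).

```python
import math


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two sets."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def cosine_distance(a, b):
    """Angle in radians between the 0/1 indicator vectors of two non-empty sets."""
    cos = len(a & b) / math.sqrt(len(a) * len(b))
    return math.acos(min(1.0, cos))  # clamp to guard against rounding above 1


def euclidean_distance(a, b):
    """Distance between 0/1 indicator vectors: sqrt of the symmetric-difference size."""
    return math.sqrt(len(a ^ b))


# Two CDs, each represented by the (hypothetical) set of customers who bought it.
cd1 = {"alice", "bob", "carol"}
cd2 = {"bob", "carol", "dave"}
print(jaccard_distance(cd1, cd2), cosine_distance(cd1, cd2), euclidean_distance(cd1, cd2))
```

Working on sets directly avoids materializing vectors with tens of millions of mostly zero coordinates.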

Overview: Methods of Clustering

§  Hierarchical:
–  Agglomerative (bottom up):
•  Initially, each point is a cluster
•  Repeatedly combine the two “nearest” clusters into one
–  Divisive (top down):
•  Start with one cluster and recursively split it

§  Point assignment:
–  Maintain a set of clusters
–  Points belong to “nearest” cluster

Hierarchical Clustering

§  Key operation: Repeatedly combine two nearest clusters

§  Three important questions:
–  1) How do you represent a cluster of more than one point?
–  2) How do you determine the “nearness” of clusters?
–  3) When to stop combining clusters?

Hierarchical Clustering
§  Key operation: Repeatedly combine two nearest clusters

§  (1) How to represent a cluster of many points?
–  Key problem: As you merge clusters, how do you represent the “location” of each cluster, to tell which pair of clusters is closest?

§  Euclidean case: each cluster has a centroid = average of its (data) points

§  (2) How to determine “nearness” of clusters?
–  Measure cluster distances by distances of centroids
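
The Euclidean case can be sketched in a few lines. This is a naive illustration, not an efficient implementation: `agglomerate` and `centroid` are hypothetical names, and stopping at a fixed number of clusters is an assumption made here because the stopping rule is left open above.

```python
import math


def centroid(cluster):
    """Average of the points in a cluster, coordinate by coordinate."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def agglomerate(points, target_clusters):
    """Repeatedly merge the two clusters whose centroids are nearest."""
    clusters = [[p] for p in points]  # initially, each point is a cluster
    while len(clusters) > max(target_clusters, 1):
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters


points = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
print(agglomerate(points, 2))
```

Every merge rescans all pairs of clusters, which is the cost that makes the naive approach expensive on large inputs.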

Example: Hierarchical clustering

And in the Non-Euclidean Case?
What about the Non-Euclidean case?
§  The only “locations” we can talk about are the points themselves
–  i.e., there is no “average” of two points

§  Approach 1:
–  (1) How to represent a cluster of many points? clustroid = (data) point “closest” to other points
–  (2) How do you determine the “nearness” of clusters? Treat clustroid as if it were centroid, when computing inter-cluster distances

“Closest” Point?
§  (1) How to represent a cluster of many points? clustroid = point “closest” to other points

§  Possible meanings of “closest”:
–  Smallest maximum distance to other points
–  Smallest average distance to other points
–  Smallest sum of squares of distances to other points
•  For distance metric d, the clustroid c of cluster C is: $c = \arg\min_{c \in C} \sum_{x \in C} d(x, c)^2$

Centroid is the avg. of all (data) points in the cluster. This means centroid is an “artificial” point. Clustroid is an existing (data) point that is “closest” to all other points in the cluster.

[Figure: cluster on 3 data points, marking the centroid (X), the clustroid, and a data point]
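
The sum-of-squares definition translates directly into code. A minimal sketch, assuming Jaccard distance as the non-Euclidean measure (any distance function could be passed in); `clustroid` and the sample sets are illustrative names and data.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two non-empty sets."""
    return 1.0 - len(a & b) / len(a | b)


def clustroid(cluster, distance):
    """Return the member of the cluster with the smallest sum of squared distances to all members."""
    return min(cluster, key=lambda c: sum(distance(x, c) ** 2 for x in cluster))


cluster = [frozenset("abc"), frozenset("abd"), frozenset("abcd")]
print(sorted(clustroid(cluster, jaccard_distance)))
```

Unlike a centroid, the result is always one of the input points, so it is defined even when averaging points makes no sense.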

Defining “Nearness” of Clusters

§  (2) How do you determine the “nearness” of clusters?
–  Approach 2: Intercluster distance = minimum of the distances between any two points, one from each cluster
–  Approach 3: Pick a notion of “cohesion” of clusters, e.g., maximum distance from the clustroid
•  Merge clusters whose union is most cohesive

Cohesion

§  Approach 3.1: Use the diameter of the merged cluster = maximum distance between points in the cluster

§  Approach 3.2: Use the average distance between points in the cluster

§  Approach 3.3: Use a density-based approach
–  Take the diameter or avg. distance, e.g., and divide by the number of points in the cluster

Implementation

§  Naïve implementation of hierarchical clustering:
–  At each step, compute pairwise distances between all pairs of clusters, then merge
–  O(N^3)

§  Careful implementation using priority queue can reduce time to O(N^2 log N)
–  Still too expensive for really big datasets that do not fit in memory

K-MEANS CLUSTERING

k-means Algorithm(s)

§  Assumes Euclidean space/distance

§  Start by picking k, the number of clusters

§  Initialize clusters by picking one point per cluster
–  Example: Pick one point at random, then k-1 other points, each as far away as possible from the previous points

Populating Clusters
§  1) For each point, place it in the cluster whose current centroid it is nearest

§  2) After all points are assigned, update the locations of centroids of the k clusters

§  3) Reassign all points to their closest centroid
–  Sometimes moves points between clusters

§  Repeat 2 and 3 until convergence
–  Convergence: Points don’t move between clusters and centroids stabilize
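
Initialization and the assign/update loop fit in one short sketch. The names `kmeans`, `farthest_first`, `seed` and `max_rounds` are illustrative, and “as far away as possible from the previous points” is read here as maximizing the distance to the nearest already-chosen point, which is one reasonable interpretation.

```python
import math
import random


def farthest_first(points, k, rng):
    """Pick one random point, then k-1 more, each as far as possible from those already picked."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def mean(cluster):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, k, seed=0, max_rounds=100):
    rng = random.Random(seed)
    centroids = farthest_first(points, k, rng)
    assignment = None
    for _ in range(max_rounds):
        # Steps 1 and 3: assign every point to its nearest current centroid.
        new_assignment = [
            min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
        ]
        if new_assignment == assignment:  # no point moved, so we have converged
            break
        assignment = new_assignment
        # Step 2: move each centroid to the average of its assigned points.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = mean(members)
    return centroids, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, 2)
print(centroids, assignment)
```

The loop stops as soon as a full pass leaves every assignment unchanged, which matches the convergence condition stated above.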

Example: Assigning Clusters

[Figure: data points (x) and centroids; clusters after round 1]

Example: Assigning Clusters

[Figure: data points (x) and centroids; clusters after round 2]

Example: Assigning Clusters

[Figure: data points (x) and centroids; clusters at the end]

Getting the k right

How to select k?
§  Try different k, looking at the change in the average distance to centroid as k increases

§  Average falls rapidly until right k, then changes little

[Figure: plot of average distance to centroid against k, with the best value of k marked]
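
This effect is easy to reproduce on toy data. The sketch below uses made-up data with three tight groups and a deliberately simple, deterministic k-means (the helper name `kmeans_avg_distance` and the fixed number of rounds are assumptions for illustration); it prints the average distance to the nearest centroid for k = 1 to 5.

```python
import math


def kmeans_avg_distance(points, k, rounds=20):
    """Run a basic k-means and return the average distance from each point to its centroid."""
    # Deterministic farthest-first initialization, starting from the first point.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(rounds):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous centroid.
        centroids = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)


# Three tight groups of four points each (hypothetical data).
offsets = [(0.5, 0.5), (-0.5, 0.5), (0.5, -0.5), (-0.5, -0.5)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]

averages = {k: kmeans_avg_distance(points, k) for k in range(1, 6)}
for k, avg in averages.items():
    print(f"k={k}  average distance to centroid = {avg:.3f}")
```

On this data the average should drop sharply up to k = 3 and only slightly afterwards, which is the elbow described above.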

Example: Picking k

[Figure: scatter plot of data points (x)]

Too few; many long distances to centroid.

Example: Picking k

[Figure: scatter plot of data points (x)]

Just right; distances rather short.

Example: Picking k

[Figure: scatter plot of data points (x)]

Too many; little improvement in average distance.

THE BFR ALGORITHM

Extension of k-means to large data

BFR Algorithm
§  BFR [Bradley-Fayyad-Reina] is a variant of k-means designed to handle very large (disk-resident) data sets

§  Assumes that clusters are normally distributed around a centroid in a Euclidean space
–  Standard deviations in different dimensions may vary
•  Clusters are axis-aligned ellipses

§  Efficient way to summarize clusters (want memory required O(clusters) and not O(data))

BFR Algorithm
§  Points are read from disk one main-memory-full at a time

§  Most points from previous memory loads are summarized by simple statistics

§  To begin, from the initial load we select the initial k centroids by some sensible approach:
–  Take k random points
–  Take a small random sample and cluster optimally
–  Take a sample; pick a random point, and then k-1 more points, each as far from the previously selected points as possible

Three Classes of Points
3 sets of points which we keep track of:
§  Discard set (DS):
–  Points close enough to a centroid to be summarized

§  Compression set (CS):
–  Groups of points that are close together but not close to any existing centroid
–  These points are summarized, but not assigned to a cluster

§  Retained set (RS):
–  Isolated points waiting to be assigned to a compression set

BFR: “Galaxies” Picture

[Figure labels: A cluster, its points are in the DS; The centroid; Compressed sets, their points are in the CS; Points in the RS]

Discard set (DS): Close enough to a centroid to be summarized
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points

Summarizing Sets of Points
For each cluster, the discard set (DS) is summarized by:
§  The number of points, N
§  The vector SUM, whose i-th component is the sum of the coordinates of the points in the i-th dimension
§  The vector SUMSQ: i-th component = sum of squares of coordinates in i-th dimension

[Figure labels: A cluster, all its points are in the DS; The centroid]

Summarizing Points: Comments

§  2d + 1 values represent any size cluster
–  d = number of dimensions

§  Average in each dimension (the centroid) can be calculated as SUM_i / N
–  SUM_i = i-th component of SUM

§  Variance of a cluster’s discard set in dimension i is: (SUMSQ_i / N) – (SUM_i / N)^2
–  And standard deviation is the square root of that

§  Next step: Actual clustering

Note: Dropping the “axis-aligned” clusters assumption would require storing full covariance matrix to summarize the cluster. So, instead of SUMSQ being a d-dim vector, it would be a d x d matrix, which is too big!
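
The N / SUM / SUMSQ summary is small enough to write out in full. A minimal sketch, with `ClusterSummary` and its method names chosen here for illustration:

```python
import math


class ClusterSummary:
    """Summary of a set of points by N, SUM and SUMSQ (2d + 1 numbers)."""

    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, point):
        """Fold one point into the summary; the point itself can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        """Per-dimension average: SUM_i / N."""
        return [s / self.n for s in self.sum]

    def variance(self):
        """Per-dimension variance: SUMSQ_i / N - (SUM_i / N)^2."""
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]

    def std_dev(self):
        # max(..., 0.0) guards against tiny negative values caused by rounding.
        return [math.sqrt(max(v, 0.0)) for v in self.variance()]


summary = ClusterSummary(dims=2)
for point in [(1, 2), (3, 4), (5, 6)]:
    summary.add(point)
print(summary.n, summary.centroid(), summary.variance())
```

Adding a point only touches 2d + 1 numbers, so memory use does not depend on how many points the cluster has absorbed.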

The “Memory-Load” of Points

Processing the “Memory-Load” of points (1):
§  1) Find those points that are “sufficiently close” to a cluster centroid and add those points to that cluster and the DS
–  These points are so close to the centroid that they can be summarized and then discarded

§  2) Use any main-memory clustering algorithm to cluster the remaining points and the old RS
–  Clusters go to the CS; outlying points to the RS

The “Memory-Load” of Points
Processing the “Memory-Load” of points (2):
§  3) DS set: Adjust statistics of the clusters to account for the new points
–  Add Ns, SUMs, SUMSQs

§  4) Consider merging compressed sets in the CS

§  5) If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster

BFR: “Galaxies” Picture

[Figure labels: A cluster, its points are in the DS; The centroid; Compressed sets, their points are in the CS; Points in the RS]

A Few Details…

§  Q1) How do we decide if a point is “close enough” to a cluster that we will add the point to that cluster?

§  Q2) How do we decide whether two compressed sets (CS) deserve to be combined into one?

How Close is Close Enough?

§  Q1) We need a way to decide whether to put a new point into a cluster (and discard)

§  BFR suggests two ways:
–  The Mahalanobis distance is less than a threshold
–  High likelihood of the point belonging to currently nearest centroid

Mahalanobis Distance
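
For the axis-aligned clusters that BFR assumes, the Mahalanobis distance reduces to a Euclidean distance in which each dimension is first divided by that cluster's standard deviation. With notation chosen here for illustration (x the point, c the centroid, σ_i the cluster's standard deviation in dimension i), one common way to write it is:

```latex
d(x, c) = \sqrt{\sum_{i=1}^{d} \left( \frac{x_i - c_i}{\sigma_i} \right)^{2}}
```

Both c_i and σ_i can be obtained from the N, SUM and SUMSQ summary of the cluster.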

Mahalanobis Distance

§  If clusters are normally distributed in d dimensions, then after transformation, one standard deviation = √d
–  i.e., 68% of the points of the cluster will have a Mahalanobis distance < √d

§  Accept a point for a cluster if its M.D. is < some threshold, e.g. 2 standard deviations
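
The acceptance test can be sketched as follows, assuming an axis-aligned cluster described by its centroid and per-dimension standard deviations. The function names are illustrative, and “2 standard deviations” is read here as a threshold of 2·√d, which is an interpretation of the rule above rather than a fixed part of BFR.

```python
import math


def mahalanobis(point, centroid, std_dev):
    """Normalized Euclidean distance for an axis-aligned cluster.

    Assumes every entry of std_dev is non-zero.
    """
    return math.sqrt(sum(((x - c) / s) ** 2 for x, c, s in zip(point, centroid, std_dev)))


def accept(point, centroid, std_dev, n_std=2.0):
    """Accept the point if its Mahalanobis distance is below n_std * sqrt(d)."""
    d = len(point)
    return mahalanobis(point, centroid, std_dev) < n_std * math.sqrt(d)


centroid, std_dev = (0.0, 0.0), (1.0, 10.0)
print(accept((1.0, 10.0), centroid, std_dev))  # one std. dev. away in each dimension
print(accept((3.0, 0.0), centroid, std_dev))   # three std. devs. away in the narrow dimension
```

Because each coordinate is scaled by its own standard deviation, a large offset along a wide dimension counts for less than the same offset along a narrow one.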

Picture: Equal M.D. Regions
§  Euclidean vs. Mahalanobis distance

[Figure: contours of equidistant points from the origin, in three panels: uniformly distributed points, Euclidean distance; normally distributed points, Euclidean distance; normally distributed points, Mahalanobis distance]

Should 2 CS clusters be combined?

Q2) Should 2 CS subclusters be combined?
§  Compute the variance of the combined subcluster
–  N, SUM, and SUMSQ allow us to make that calculation quickly

§  Combine if the combined variance is below some threshold

§  Many alternatives: Treat dimensions differently, consider density
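
Because N, SUM and SUMSQ are all additive, the merged variance never requires the original points. A minimal sketch, in which summaries are plain (N, SUM, SUMSQ) tuples and requiring every dimension to stay below a single threshold is an assumption made for illustration:

```python
def merge_summaries(a, b):
    """Combine two (N, SUM, SUMSQ) summaries by adding them component-wise."""
    n_a, sum_a, sumsq_a = a
    n_b, sum_b, sumsq_b = b
    return (
        n_a + n_b,
        [x + y for x, y in zip(sum_a, sum_b)],
        [x + y for x, y in zip(sumsq_a, sumsq_b)],
    )


def variances(summary):
    """Per-dimension variance: SUMSQ_i / N - (SUM_i / N)^2."""
    n, s, sq = summary
    return [q / n - (t / n) ** 2 for t, q in zip(s, sq)]


def should_combine(a, b, max_variance):
    """Combine only if every dimension of the merged subcluster stays below the threshold."""
    return all(v < max_variance for v in variances(merge_summaries(a, b)))


near_1 = (2, [2.0, 0.0], [4.0, 0.0])        # summary of points (0, 0) and (2, 0)
near_2 = (2, [2.0, 0.0], [2.0, 2.0])        # summary of points (1, 1) and (1, -1)
far = (2, [202.0, 0.0], [20404.0, 0.0])     # summary of points (100, 0) and (102, 0)
print(should_combine(near_1, near_2, max_variance=1.0))
print(should_combine(near_1, far, max_variance=1.0))
```

The same addition of Ns, SUMs and SUMSQs is what updates a cluster's statistics when new points join its discard set.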

THE CURE ALGORITHM

Extension of k-means to clusters of arbitrary shapes

The CURE Algorithm
§  Problem with BFR/k-means:
–  Assumes clusters are normally distributed in each dimension
–  And axes are fixed: ellipses at an angle are not OK

§  CURE (Clustering Using REpresentatives):
–  Assumes a Euclidean distance
–  Allows clusters to assume any shape
–  Uses a collection of representative points to represent clusters

Example: University Salaries

[Figure: scatter plot of points labeled e and h; axes: salary, age]

Starting CURE
2 Pass algorithm. Pass 1:
§  0) Pick a random sample of points that fit in main memory

§  1) Initial clusters:
–  Cluster these points hierarchically: group nearest points/clusters

§  2) Pick representative points:
–  For each cluster, pick a sample of points, as dispersed as possible
–  From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster
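
Step 2 can be sketched for a single cluster. The function name and parameters are illustrative, and “as dispersed as possible” is approximated here by a greedy farthest-point selection, which is one simple way to spread the picks out rather than the only one.

```python
import math


def pick_representatives(cluster, n_reps=4, shrink=0.2):
    """Pick dispersed points from a cluster, then move each a fraction `shrink` toward the centroid."""
    centroid = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    # Start with the point farthest from the centroid, then repeatedly add the
    # point whose nearest already-picked point is farthest away.
    picked = [max(cluster, key=lambda p: math.dist(p, centroid))]
    while len(picked) < min(n_reps, len(cluster)):
        picked.append(max(cluster, key=lambda p: min(math.dist(p, q) for q in picked)))
    # Shrink each picked point toward the centroid.
    return [
        tuple(x + shrink * (c - x) for x, c in zip(p, centroid))
        for p in picked
    ]


cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
print(pick_representatives(cluster))
```

Moving the representatives inward keeps them spread across the cluster's shape while making them less sensitive to points on its outer edge.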

Example: Initial Clusters

[Figure: scatter plot of points labeled e and h; axes: salary, age]

Example: Pick Dispersed Points

[Figure: scatter plot of points labeled e and h; axes: salary, age]

Pick (say) 4 remote points for each cluster.

Example: Pick Dispersed Points

[Figure: scatter plot of points labeled e and h; axes: salary, age]

Move points (say) 20% toward the centroid.

Finishing CURE

Pass 2:
§  Now, rescan the whole dataset and visit each point p in the data set

§  Place it in the “closest cluster”
–  Normal definition of “closest”: Find the closest representative to p and assign it to representative’s cluster
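
Pass 2 amounts to a nearest-representative lookup per point. A minimal sketch, assuming the representatives from Pass 1 are held in a dictionary keyed by cluster id; the function name, the ids "e" and "h", and the coordinates are made up for illustration.

```python
import math


def assign_to_cluster(p, representatives):
    """Return the id of the cluster that owns the representative closest to point p.

    `representatives` maps a cluster id to that cluster's list of representative points.
    """
    return min(
        representatives,
        key=lambda cid: min(math.dist(p, r) for r in representatives[cid]),
    )


# Hypothetical representatives for two clusters.
reps = {
    "e": [(30, 50), (35, 60), (40, 55)],
    "h": [(50, 90), (55, 100), (60, 95)],
}
print(assign_to_cluster((33, 58), reps))
```

Only the representatives need to stay in memory during the rescan, so the full data set can be streamed from disk one point at a time.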

Summary
§  Clustering: Given a set of points, with a notion of distance between points, group the points into some number of clusters

§  Algorithms:
–  Agglomerative hierarchical clustering:
•  Centroid and clustroid
–  k-means:
•  Initialization, picking k
–  BFR
–  CURE