
Page 1:

Foundations of Data Mining

Instructors: Amos Fiat, Edith Cohen, Haim Kaplan

http://www.cohenwang.com/edith/dataminingclass2017

Lecture 1

Page 2:

Course logistics

§ Tuesdays 16:00-19:00, Sherman 002
§ Slides for (most or all) lectures will be posted on the course webpage: http://www.cohenwang.com/edith/dataminingclass2017
§ Office hours: Email the instructors to set a time
§ Grade: 70% final exam, 30% on 5 problem sets

Page 3:

Data collected: network activity, people activity, measurements, search/assistant queries, location, online interactions and transactions, text, media, …
Generated (processed) data: parameters in large scale models, partly curated raw data, …

§ Scale: petabytes -> exabytes -> …
§ Diverse formats: relational, logs, text, media, measurements
§ Location: distributed, streamed, …

Page 4:

Data to Information

Social issues:
§ Privacy, Fairness

Mining and learning from data:
§ Aggregates, statistics, properties
§ Models that allow us to generalize/predict

Scalable (efficient, fast) computation:
§ Data available as streamed or distributed (limit data movement for efficiency/privacy)
§ Platforms that use computation resources (Map-Reduce, TensorFlow, …) across scales:
  • GPUs, multi-core CPUs, data centers, wide area, federated (on-device) computing
§ Algorithm design:
  § “linear” processing on large data
  § trade off accuracy and computation cost

Page 5:

Topics for this course

Selection criteria of topics:
§ Broad demonstrated applicability
§ Promote deeper understanding of concepts
§ Simplicity, elegance, principled
§ Instructor bias

Topics:
§ Data modeled as: key-value pairs, metric (vectors, sets), graphs
§ Properties, features, statistics of interest
§ Summary structures for efficient storage/movement/computation
§ Algorithms for distributed/parallel/streamed computation
§ Data representations that support generalization (recover missing relations, identify spurious ones)
§ Data privacy

Page 6:

Today

§ Key-value pairs data
§ Intro to summary structures (sketches)
§ Computation over streamed/distributed data

§ Frequent keys: the Misra-Gries structure
§ Set membership: Bloom filters
§ Counting: Morris counters

Page 7:

Key-Value pairs

Data 𝑫: a data element 𝑒 ∈ 𝑫 has a key and a value (e.key, e.value)

[Figure: example data elements with values 8, 2, 2, 15, 7, 3, 10, 4]

Example data:
§ Search queries
§ IP network packets / flow records
§ Online interactions
§ Parameter updates (training an ML model)

Page 8:

Key-Value pairs

Data 𝑫: a data element 𝑒 ∈ 𝑫 has a key and a value (e.key, e.value)

[Figure: example data elements with values 8, 2, 2, 15, 7, 3, 10, 4]

Example tasks/queries:
§ Sum/Max value
§ Membership: is a given key in 𝑫?
§ How many distinct keys?
§ Very frequent keys (heavy hitters)

Page 9:

Data access

Distributed data / parallel computation:
[Figure: distributed data sources D1, …, D5]
GPUs, CPUs, VMs, servers, wide area, devices
§ Distributed data sources
§ Distribute for faster/scalable computation
Challenges: limit data movement, support updates to 𝑫

Data streams:
Data is read in one (or a few) sequential passes
§ Cannot be revisited (IP traffic)
§ I/O efficiency (sequential access is cheaper than random access)
Challenges: the “state” must be much smaller than the data size, support updates to 𝑫

Page 10:

Summary Structures (Sketches)

Examples: random samples, projections, histograms, …

𝑫: data set; 𝑓(𝑫): some statistics/properties of interest
Sketch(𝑫): a summary of 𝑫 that acts as a “surrogate” and allows us to estimate 𝑓(𝑫)
f̂(): estimator we apply to Sketch(𝑫) to estimate 𝑓(𝑫)

Data 𝑫 → Sketch(𝑫);  query 𝑓(𝑫)? → answer f̂(Sketch(𝑫))

§ Multi-objective 𝑓(𝑞, 𝑫), f̂(𝑞, Sketch(𝑫)): the sketch supports multiple query types

Why sketch? Data can be too large to:
§ Store in full for long (or even short) term
§ Transmit

§ Slow/costly processing of exact queries
§ Data updates do not necessitate full recomputation

Page 11:

Composable sketches

It suffices to specify how to merge two sketches:
§ Sketch(𝐴 ∪ 𝐵) from Sketch(𝐴) and Sketch(𝐵)

Data 𝐴 → Sketch(𝐴)
Data 𝐵 → Sketch(𝐵)
Data 𝐴 ∪ 𝐵 → Sketch(𝐴 ∪ 𝐵)

[Figure: merge tree combining Sketch 1, …, Sketch 5 into Sketch(1 ∪ 2 ∪ 3 ∪ 4 ∪ 5)]

Distributed data / parallel computation: only the sketch structure moves between locations

Page 12:

Streaming sketches

Weaker requirement than fully composable:
§ Sketch(𝐴 ∪ {𝑒}) from Sketch(𝐴) and element 𝑒

Data 𝐴 → Sketch(𝐴)
Data 𝐴 ∪ {𝑒} → Sketch(𝐴 ∪ {𝑒})

[Figure: streamed data; each arriving element updates the sketch]

The only “state” maintained is the sketch structure

Page 13:

Sketch API

§ Initialization: Sketch(∅)
§ Estimator specification: f̂(Sketch(𝑫))
§ Merge two sketches: Sketch(𝐴 ∪ 𝐵) from Sketch(𝐴) and Sketch(𝐵)
§ Process an element 𝑒 = (𝑒.key, 𝑒.val): Sketch(𝐴 ∪ {𝑒}) from Sketch(𝐴) and 𝑒
§ Delete 𝑒 (optional)

Data 𝑫 → Sketch(𝑫);  query 𝑓(𝑫)? → answer f̂(Sketch(𝑫))

§ Seek to optimize sketch size vs. estimate quality
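To make the API concrete, here is a minimal Python skeleton of this interface (the class and method names are illustrative, not from the course material); the sketches on the following slides can be read as implementations of it.

```python
# Hypothetical interface illustrating the Sketch API above.
class Sketch:
    def process(self, key, val):   # Sketch(A ∪ {e}) from Sketch(A) and element e
        raise NotImplementedError
    def merge(self, other):        # Sketch(A ∪ B) from Sketch(A) and Sketch(B)
        raise NotImplementedError
    def delete(self, key, val):    # optional
        raise NotImplementedError
    def estimate(self, *query):    # estimator f̂(Sketch(D))
        raise NotImplementedError
```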

Page 14:

Easy sketches: min, max, sum, …

The sketch is just a single register 𝑠: exact, composable.
Element values: 32, 112, 14, 9, 37, 83, 115, 2, …

Sum
§ Initialize: 𝑠 ← 0
§ Process element 𝑒: 𝑠 ← 𝑠 + 𝑒.val
§ Merge 𝑠, 𝑠′: 𝑠 ← 𝑠 + 𝑠′
§ Delete element 𝑒: 𝑠 ← 𝑠 − 𝑒.val
§ Query: return 𝑠

Max
§ Initialize: 𝑠 ← 0
§ Process element 𝑒: 𝑠 ← max(𝑠, 𝑒.val)
§ Merge 𝑠, 𝑠′: 𝑠 ← max(𝑠, 𝑠′)
§ Query: return 𝑠
(No delete support)
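A minimal Python rendering of the two single-register sketches above; following the slide, both registers start at 0 (so Max assumes nonnegative values).

```python
class SumSketch:
    def __init__(self): self.s = 0
    def process(self, val): self.s += val       # add the element value
    def delete(self, val): self.s -= val        # deletions supported
    def merge(self, other): self.s += other.s   # composable
    def estimate(self): return self.s           # exact

class MaxSketch:
    def __init__(self): self.s = 0              # as on the slide (nonnegative values)
    def process(self, val): self.s = max(self.s, val)
    def merge(self, other): self.s = max(self.s, other.s)
    def estimate(self): return self.s           # no delete support
```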

Page 15:

Frequent Keys

§ Data is streamed or distributed
§ Very large number of distinct keys, huge number of elements
§ Find the keys that occur very often

Example applications:
§ Networking: find “elephant” IP flows
§ Search engines: find the most frequent queries
§ Text analysis: frequent terms

Zipf law: the frequency of the 𝑖-th heaviest key is $\propto i^{-s}$. Say, the top 10% of keys occur in 90% of the elements.

[Figure: Zipf frequency plot (wikipedia, https://brenocon.com/blog/2009/05); in the running example, a frequent key occurs in 3/11 elements]

Page 16:

Frequent Keys: Exact Solution

Exact solution:
§ Create a counter for each distinct key on its first occurrence
§ When processing an element with key 𝑥, increment the counter of 𝑥

Properties: fully composable, exact, even supports deletions, recovers all frequencies.

Problem: the structure size is 𝑛 = number of distinct keys. What can we do with size 𝑘 ≪ 𝑛?

Solution: a sketch that got re-discovered many times [MG 1982, DLM 2002, KSP 2003, MAA 2006]

Page 17:

Frequent Keys: Streaming sketch [Misra Gries 1982]

Sketch size parameter 𝑘: use (at most) 𝑘 counters indexed by keys. Initially, no counters.
(Example: 𝑛 = 6 distinct keys, 𝑚 = 11 elements, structure size 𝑘 = 3)

Processing an element with key 𝒙:
§ If we already have a counter for 𝒙, increment it.
§ Else, if there is no counter for 𝒙 but there are fewer than 𝑘 counters, create a counter for 𝒙 initialized to 𝟏.
§ Else, decrease all counters by 𝟏. Remove 𝟎 counters.

Query: number of occurrences of 𝒙?
§ If we have a counter for 𝒙, return its value.
§ Else, return 𝟎. Clearly an under-estimate. What can we say precisely?
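A short Python sketch of the Misra-Gries structure described above; the dict `counters` holds at most 𝑘 key-indexed counters.

```python
class MisraGries:
    def __init__(self, k):
        self.k = k
        self.counters = {}                       # at most k counters indexed by key

    def process(self, key):
        if key in self.counters:                 # counter exists: increment it
            self.counters[key] += 1
        elif len(self.counters) < self.k:        # room for a new counter
            self.counters[key] = 1
        else:                                    # decrease all counters by 1, drop zeros
            for x in list(self.counters):
                self.counters[x] -= 1
                if self.counters[x] == 0:
                    del self.counters[x]

    def estimate(self, key):
        return self.counters.get(key, 0)         # an under-estimate of the true count
```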

Page 18:

MG sketch: Analysis

𝑚: number of elements in the stream; 𝑚′: sum of the counters in the structure; 𝑘: structure size.

Lemma: The estimate is smaller than the true count by at most (𝑚 − 𝑚′)/(𝑘 + 1).

We charge each “missed count” to a “decrease” step:
§ If the key is in the structure, any decrease in its count is due to a “decrease” step.
§ An element that is processed and not counted results in a decrease step.

We bound the number of “decrease” steps: each decrease step removes 𝑘 “counts” from the structure, and together with the input element it results in 𝑘 + 1 “uncounted” elements.
⇒ Number of decrease steps ≤ (𝑚 − 𝑚′)/(𝑘 + 1).

Page 19:

MG sketch: Analysis (contd.)

The estimate is smaller than the true count by at most (𝑚 − 𝑚′)/(𝑘 + 1)
⇒ We get good estimates for 𝑥 with frequency ≫ (𝑚 − 𝑚′)/(𝑘 + 1).

§ The error bound is inversely proportional to 𝑘: the typical tradeoff between sketch size and quality of estimate.
§ The error bound can be computed with the sketch: track 𝑚 (the element count), know 𝑚′ (can be computed from the structure) and 𝑘.
§ MG works because typical frequency distributions have few very popular keys (“Zipf law”).

Page 20:

Making MG fully composable: merging two MG sketches [MAA 2006, ACHPWY 2012]

Basic merge:
§ If a key 𝑥 is in both structures, keep one counter with the sum of the two counts.
§ If a key 𝑥 is in only one structure, keep that counter.

Reduce: if there are more than 𝒌 counters:
§ Take the (𝑘 + 1)-th largest counter.
§ Subtract its value from all other counters.
§ Delete non-positive counters.
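A sketch in Python of the basic-merge-plus-reduce step above, operating on the `counters` dicts of the Misra-Gries structure from the earlier snippet.

```python
def merge_mg(counters_a, counters_b, k):
    """Merge two Misra-Gries counter dicts with parameter k (basic merge, then reduce)."""
    merged = dict(counters_a)
    for key, c in counters_b.items():                    # basic merge: sum counters of shared keys
        merged[key] = merged.get(key, 0) + c
    if len(merged) > k:                                  # reduce: more than k counters remain
        kth1 = sorted(merged.values(), reverse=True)[k]  # the (k+1)-th largest counter
        merged = {x: c - kth1 for x, c in merged.items() if c - kth1 > 0}
    return merged
```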

Page 21:

Merging two Misra-Gries sketches

Basic merge: [worked example figure]

Page 22:

Merging two Misra-Gries summaries

Reduce, since there are more than 𝒌 = 𝟑 counters:
§ Take the (𝑘 + 1)-th = 4th largest counter.
§ Subtract its value (2) from all other counters.
§ Delete non-positive counters.

Page 23:

Merging MG summaries: Correctness

Claim: The final merged sketch has at most 𝑘 counters.
Proof: We subtract the (𝑘 + 1)-th largest counter from everything, so at most the 𝑘 largest can remain positive.

Claim: For each key, the merged sketch count is smaller than the true count by at most (𝑚 − 𝑚′)/(𝑘 + 1).

Page 24:

Merging MG summaries: Correctness

Claim: For each key, the merged sketch count is smaller than the true count by at most (𝑚 − 𝑚′)/(𝑘 + 1).

Part 1: total elements 𝑚₁; count in structure 𝑚₁′; count missed ≤ (𝑚₁ − 𝑚₁′)/(𝑘 + 1).
Part 2: total elements 𝑚₂; count in structure 𝑚₂′; count missed ≤ (𝑚₂ − 𝑚₂′)/(𝑘 + 1).

Proof: “Counts” for a key 𝑥 can be missed in part 1, in part 2, or in the reduce component of the merge. We add up the bounds on the misses.

The “reduce” missed count per key is at most 𝑹 = the (𝒌 + 𝟏)-th largest count before the reduce.

Page 25:

Merging MG summaries: Correctness

Part 1: total elements 𝑚₁; count in structure 𝑚₁′; count missed ≤ (𝑚₁ − 𝑚₁′)/(𝑘 + 1).
Part 2: total elements 𝑚₂; count in structure 𝑚₂′; count missed ≤ (𝑚₂ − 𝑚₂′)/(𝑘 + 1).

The “reduce” missed count per key is at most 𝑹 = the (𝒌 + 𝟏)-th largest count before the reduce.

⇒ The “count missed” of one key in the merged sketch is at most
(𝑚₁ − 𝑚₁′)/(𝑘 + 1) + (𝑚₂ − 𝑚₂′)/(𝑘 + 1) + 𝑅

Page 26:

Merging MG summaries: Correctness

Counted elements in the structure:
§ After the basic merge and before the reduce: 𝑚₁′ + 𝑚₂′
§ After the reduce: 𝑚′

Claim: 𝑚₁′ + 𝑚₂′ − 𝑚′ ≥ 𝑅(𝑘 + 1)
Proof: 𝑅 is erased in the reduce step from each of the 𝑘 + 1 largest counters; maybe more is erased from smaller counters.

The “count missed” of one key is at most
(𝑚₁ − 𝑚₁′)/(𝑘 + 1) + (𝑚₂ − 𝑚₂′)/(𝑘 + 1) + 𝑅 ≤ (𝑚₁ + 𝑚₂ − 𝑚′)/(𝑘 + 1) = (𝑚 − 𝑚′)/(𝑘 + 1)

Page 27:

Probabilistic structures

§ Misra-Gries is a deterministic structure: the outcome is determined uniquely by the input.
§ Probabilistic structures/algorithms can be much more powerful:
  • provide privacy / robustness to outliers
  • provide efficiency / smaller size

Page 28:

Set membership

Data 𝑫:
§ Data is streamed or distributed
§ Very large number of distinct keys, huge number of elements, large representation of keys

We want a structure that supports membership queries: is a given key in 𝑫?

Example applications:
§ Spell checker: insert a corpus of words; check if a word is in the corpus.
§ Web crawler: insert all URLs that were visited; check if the current URL was explored.
§ Distributed caches: maintain a “summary” of the keys of cached resources; send requests to a cache that has the resource.
§ Blacklisted IP addresses: intercept traffic from blacklisted sources.

Exact solution: a dictionary (hash map) structure. Problem: it stores a representation of all keys.

Page 29:

Set membership: Bloom Filters [Bloom 1970]

§ Very popular in many applications
§ Probabilistic data structure
§ Reduces the representation size to a few bits (≈8) per key
§ False positives possible, no false negatives
§ Tradeoff between size and false positive rate
§ Composable
§ The analysis relies on having independent random hash functions (works well in practice; there are theoretical issues)

Page 30:

Independent Random Hash Functions (simplified and idealized)

Domain 𝑫 of keys; probability distribution 𝑭 over 𝑹.
Distribution 𝑯 of hash functions ℎ: 𝑫 → 𝑹 with the following properties (over ℎ ∼ 𝑯):
§ For each 𝑥 ∈ 𝑫, ℎ(𝑥) ∼ 𝑭
§ The values ℎ(𝑥) are independent for different keys 𝑥 ∈ 𝑫

We use random hash functions as a way to make random draws with “memory”: attach a “permanent” random value to a key.
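In practice, an idealized random hash function is usually simulated with a salted (seeded) deterministic hash; a minimal Python sketch, where drawing the salt plays the role of drawing ℎ ∼ 𝑯 (the use of SHA-256 here is an illustrative choice, not from the slides):

```python
import hashlib, random

def make_hash(m, salt=None):
    """Return h: key -> value in {1, ..., m}, (approximately) uniform and fixed per key."""
    salt = salt if salt is not None else str(random.random())   # "drawing" h from H
    def h(key):
        digest = hashlib.sha256((salt + str(key)).encode()).hexdigest()
        return int(digest, 16) % m + 1
    return h
```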

Page 31:

Set membership warmup: Hash solution

Parameter: 𝑚
Structure: Boolean array 𝑆 of size 𝑚
Random hash function ℎ where ℎ(𝑥) ∼ 𝑈[1, …, 𝑚]

Initialize: declare a boolean array 𝑆 of size 𝑚; for 𝑖 = 1, …, 𝑚: 𝑆[𝑖] ← 𝐹
Process an element with key 𝑥: 𝑆[ℎ(𝑥)] ← 𝑇
Membership query for 𝑥: return 𝑆[ℎ(𝑥)]

[Figure: array of 𝑚 = 10 cells, two cells set to T; a query that lands on an F cell ⇒ not in the set]

Merge: two structures of the same size and with the same hash function: take the bitwise OR.

Page 32:

Hash solution: Probability of a false positive

𝑚: structure size; 𝑛: number of distinct keys inserted; 𝑏 = 𝑚/𝑛: number of bits we use in the structure per distinct key in the data.

Probability 𝜀 of a false positive for 𝑥 = probability of ℎ(𝑥) hitting an occupied cell:
$\varepsilon = \Pr_{h\sim H}[\,S[h(x)] = T\,] \approx \frac{n}{m} = \frac{1}{b}$

Too high for many applications!! (An IP address is only 32 bits…)
Example: 𝜀 = 0.02 ⟹ 𝑏 = 50.
Can we get a better tradeoff between 𝜀 and 𝑏?

[Figure: array of 𝑚 = 10 cells with two cells set to T]

Page 33:

Set membership: Bloom Filters [Bloom 1970]

Two parameters: 𝑚 and 𝑘
Structure: Boolean array 𝑆 of size 𝑚
Independent hash functions ℎ₁, ℎ₂, …, ℎ_𝑘 where ℎᵢ(𝑥) ∼ 𝑈[1, …, 𝑚]

Initialize: declare a boolean array 𝑆 of size 𝑚; for 𝑖 = 1, …, 𝑚: 𝑆[𝑖] ← 𝐹
Process an element with key 𝑥: for 𝑖 = 1, …, 𝑘: 𝑆[ℎᵢ(𝑥)] ← 𝑇
Membership query for 𝑥: return 𝑆[ℎ₁(𝑥)] and 𝑆[ℎ₂(𝑥)] and ⋯ and 𝑆[ℎ_𝑘(𝑥)]

[Figure: 𝑚 = 10, 𝑘 = 3; a query where T ∧ T ∧ F = F ⇒ not in the set]

Merge: two structures of the same size and with the same set of hash functions: take the bitwise OR.
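A compact, illustrative Python version of the Bloom filter above (salted SHA-256 stands in for the 𝑘 idealized independent hash functions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m cells, k salted hash functions."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.S = [False] * m                    # boolean array of size m

    def _h(self, i, key):                       # i-th hash: cell index in 0..m-1 (idealized uniform)
        d = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
        return int(d, 16) % self.m

    def process(self, key):
        for i in range(self.k):
            self.S[self._h(i, key)] = True

    def query(self, key):                       # no false negatives; false positives possible
        return all(self.S[self._h(i, key)] for i in range(self.k))

    def merge(self, other):                     # same m, k, and hash functions: bitwise OR
        self.S = [a or b for a, b in zip(self.S, other.S)]
```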

Page 34:

Bloom Filters analysis: Probability of a false positive

𝑚: structure size; 𝑘: number of hash functions; 𝑛: number of distinct keys inserted.

Probability of ℎᵢ(𝑥) NOT hitting a particular cell 𝑗:
$\Pr_{h\sim H}[h_i(x) \neq j] = 1 - \frac{1}{m}$

Probability that cell 𝑗 is F, i.e., that none of the 𝑛𝑘 “dart throws” hits cell 𝑗:
$\Pr_{h\sim H}[\,S[j] = F\,] = \left(1 - \frac{1}{m}\right)^{nk}$

A false positive occurs for 𝑥 when all 𝑘 cells ℎᵢ(𝑥), 𝑖 = 1, …, 𝑘, are T:
$\varepsilon = \prod_{i=1,\dots,k}\left(1 - \Pr_{h\sim H}[\,S[h_i(x)] = F\,]\right) = \left(1 - \left(1 - \frac{1}{m}\right)^{nk}\right)^{k}$

*Assume 𝑘 ≪ 𝑚, so the ℎᵢ(𝑥) for different 𝑖 = 1, …, 𝑘 are very likely to be distinct.

Page 35:

Bloom Filters: Probability of a false positive (contd.)

𝑚: structure size; 𝑘: number of hash functions; 𝑛: number of distinct keys inserted.

False positive probability:
$\varepsilon \le \left(1 - \left(1 - \frac{1}{m}\right)^{nk}\right)^{k} \approx \left(1 - e^{-nk/m}\right)^{k}$

using
$\left(1 - \frac{1}{m}\right)^{nk} = \left(\left(1 - \frac{1}{m}\right)^{m}\right)^{nk/m} \approx \left(\frac{1}{e}\right)^{nk/m} = e^{-nk/m}$, since $\lim_{i\to\infty}\left(1 - \frac{1}{i}\right)^{i} = \frac{1}{e}$.

!! The FP probability depends on 𝑏 = 𝑚/𝑛, the number of bits we use per distinct key: $\varepsilon \le \left(1 - e^{-k/b}\right)^{k}$.
We can see that the FP probability decreases with 𝑚.

Given 𝑏, which 𝑘 minimizes the FP probability 𝜀?

Page 36:

Bloom Filters: Probability of a false positive (contd.)

𝑘: number of hash functions; 𝑚: structure size; 𝑛: number of distinct keys inserted; 𝑏 = 𝑚/𝑛: bits per key.

False positive probability 𝜀 (upper bound): $\varepsilon \le \left(1 - e^{-k/b}\right)^{k}$

Given 𝑏, which 𝑘 minimizes the FP probability?
$k \approx (\ln 2)\, b \approx 0.7\, b$, which gives $\varepsilon \approx \left(\tfrac{1}{2}\right)^{b \ln 2}$

Compute 𝑏 for a desired FP error 𝜀: $b \approx 1.44 \log_2 \frac{1}{\varepsilon}$

!! The FP error decreases exponentially in 𝑏 (recall that 𝜀 = 1/𝑏 for 𝑘 = 1).

Example: 𝑏 = 8, 𝑘 = 6, 𝜀 ≈ 0.02
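The parameter choices on this slide as a small Python calculation, using the stated approximations 𝑏 ≈ 1.44 log₂(1/𝜀) and 𝑘 ≈ 0.7𝑏 (a back-of-the-envelope sketch, not a tuned implementation):

```python
import math

def bloom_params(n, eps):
    """Bits per key b, total bits m, and hash count k for a target false-positive rate eps."""
    b = 1.44 * math.log2(1 / eps)        # b ≈ 1.44 log2(1/eps)
    k = round(math.log(2) * b)           # k ≈ 0.7 b
    return b, math.ceil(b * n), k

print(bloom_params(10**6, 0.02))         # roughly b ≈ 8 bits/key and k ≈ 6, as in the example
```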

Page 37:

Quick review: Random Variables

Random variable 𝑋 with Probability Density Function (PDF) 𝑓(𝑥):
• Properties: $f(x) \ge 0$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$

§ Cumulative Distribution Function (CDF): $F(t) = \int_{-\infty}^{t} f(x)\,dx$, the probability that 𝑋 ≤ 𝑡
• Properties: 𝐹 ∈ [0, 1], monotone non-decreasing

Page 38:

Quick review: Expectation

§ Expectation: the “average” value of 𝑿: $\mu_X \equiv E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$

§ Linearity of expectation: $E[aX + b] = aE[X] + b$, and for random variables 𝑋₁, 𝑋₂, 𝑋₃, …, 𝑋_𝑘:
$E\!\left[\sum_{i=1}^{k} X_i\right] = \sum_{i=1}^{k} E[X_i]$

Page 39:

Quick review: Variance

§ Variance: $\mathrm{Var}[X] \equiv \sigma_X^2 = E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx$
§ Useful relations: $\sigma_X^2 = E[X^2] - \mu_X^2$ and $\mathrm{Var}[aX + b] = a^2\,\mathrm{Var}[X]$
§ The standard deviation is $\sigma_X = \sqrt{\mathrm{Var}[X]}$
§ Coefficient of Variation: $\frac{\sigma}{\mu}$ (normalized s.d.)

Page 40:

Quick review: Covariance

A measure of the joint variability of two random variables 𝑿, 𝒀:
$\mathrm{Cov}[X, Y] = \sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y$

§ 𝑿, 𝒀 independent ⟹ $\sigma_{XY} = 0$

§ Variance of the sum of 𝑿₁, 𝑿₂, …, 𝑿_𝒌:
$\mathrm{Var}\!\left[\sum_{i=1}^{k} X_i\right] = \sum_{i=1}^{k}\sum_{j=1}^{k} \mathrm{Cov}[X_i, X_j] = \sum_{i=1}^{k} \mathrm{Var}[X_i] + \sum_{i \neq j} \mathrm{Cov}[X_i, X_j]$
which equals $\sum_{i=1}^{k} \mathrm{Var}[X_i]$ when the variables are (pairwise) independent.

Page 41:

Quick Review: Estimators

An estimator is a function f̂(𝑆) applied to a probabilistic sketch 𝑆 of data 𝑫 to estimate a property/statistic 𝑓(𝑫) of the data 𝑫.

§ Error (a random variable): $\mathrm{err}[\hat f] = \hat f(S) - f(D)$; relative error: $\frac{\mathrm{err}[\hat f]}{f(D)}$
§ Bias: $\mathrm{Bias}[\hat f] = E[\mathrm{err}[\hat f]] = E[\hat f] - f(D)$
§ When Bias = 0 the estimator is unbiased
§ Mean Square Error (MSE): $E\!\left[\mathrm{err}[\hat f]^2\right] = \mathrm{Var}[\hat f] + \mathrm{Bias}[\hat f]^2$
§ Root Mean Square Error (RMSE): $\sqrt{\mathrm{MSE}}$
§ Normalized Root Mean Square Error (NRMSE): $\frac{\mathrm{RMSE}}{f(D)}$

Page 42:

Simple Counting (revisited)

Initialize: 𝒔 ← 0
Process an element: 𝒔 ← 𝒔 + 𝟏
Merge 𝑠, 𝑠′: 𝑠 ← 𝑠 + 𝑠′

Stream: 1, 1, 1, 1, 1, 1, 1, 1, …

Exact count: the size (in bits) is ⌈log₂ 𝑛⌉, where 𝑛 is the current count.
Can we count with fewer bits? We have to settle for an approximate count…

Applications: we have very many quantities to count and fast memory is scarce (say, inside a backbone router), or bandwidth is scarce (distributed training of a large ML model).

Page 43:

Morris Counter [Morris 1978]

§ Initialize: 𝒔 = 𝟎
§ Increment: increment 𝒔 with probability $2^{-s}$
§ Query: return $2^{s} - 1$

Probabilistic stream counter: maintain “log 𝑛” instead of 𝑛, using log log 𝑛 bits.
Example: 𝑛 = 10⁹; exact: log₂ 10⁹ ≈ 30 bits; log₂ log₂ 10⁹ ≈ 5 bits.

Example run on the stream 1, 1, 1, 1, 1, 1, 1, 1 (counter starts at 𝑥 = 0, estimate 0):
Count 𝒏:                          1    2    3    4    5    6    7    8
Counter 𝒙:                        1    1    2    2    2    2    3    3
Increment probability $2^{-x}$:   1   1/2  1/2  1/4  1/4  1/4  1/4  1/8
Estimate $\hat n = 2^{x} - 1$:    1    1    3    3    3    3    7    7
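A direct Python sketch of the Morris counter: increment 𝑠 with probability $2^{-s}$, estimate $2^{s} - 1$.

```python
import random

class MorrisCounter:
    def __init__(self):
        self.s = 0

    def increment(self):
        if random.random() < 2.0 ** (-self.s):   # increment s with probability 2^(-s)
            self.s += 1

    def estimate(self):
        return 2 ** self.s - 1                   # unbiased estimate of the count n
```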

Page 44:

Morris Counter: Unbiasedness

§ Initialize: 𝒔 = 𝟎
§ Increment: 𝒔 ← 𝒔 + 𝟏 with probability $2^{-s}$
§ Query: return $2^{s} - 1$

§ When 𝑛 = 0: 𝑠 = 0, and the estimate is $\hat n = 2^{0} - 1 = 0$
§ When 𝑛 = 1: 𝑠 = 1, and the estimate is $\hat n = 2^{1} - 1 = 1$
§ When 𝑛 = 2:
  with probability 1/2, 𝑠 = 1 and $\hat n = 1$
  with probability 1/2, 𝑠 = 2 and $\hat n = 2^{2} - 1 = 3$
  Expectation: $E[\hat n] = \tfrac12 \cdot 1 + \tfrac12 \cdot 3 = 2$
§ 𝑛 = 3, 4, 5, …: by induction…

Page 45:

Morris Counter: Unbiasedness (contd.)

§ Initialize: 𝒔 = 𝟎
§ Increment: 𝒔 ← 𝒔 + 𝟏 with probability $2^{-s}$
§ Query: return $2^{s} - 1$

It suffices to show that the expected increase of the estimate is always 1:
§ Suppose the counter value is 𝑠.
§ We increase it with probability $2^{-s}$.
§ The expected increase in the estimate is
$2^{-s}\left((2^{s+1} - 1) - (2^{s} - 1)\right) + (1 - 2^{-s})\cdot 0 = 2^{-s}\, 2^{s} = 1$

Page 46:

Morris Counter: Variance

How good is our estimate? Let $X_n$ be the random variable giving the counter value after an input of 𝑛 increments.
§ Our estimate is the random variable $\hat n = 2^{X_n} - 1$.

$\mathrm{Var}[\hat n] = \mathrm{Var}[\hat n + 1] = E[(\hat n + 1)^2] - E[\hat n + 1]^2 = E[2^{2 X_n}] - (n+1)^2$

§ We can show by induction that $E[2^{2 X_n}] = \tfrac32 n^2 + \tfrac32 n + 1$.
§ This means $\mathrm{Var}[\hat n] \approx \tfrac12 n^2$ and $\mathrm{CV} = \tfrac{\sigma}{\mu} \approx \tfrac{1}{\sqrt 2}$ (= NRMSE, since the estimator is unbiased).

How can we reduce the error?

Page 47:

Reducing variance by averaging

Take 𝒌 (pairwise) independent unbiased estimates 𝒁ᵢ with expectation 𝝁 and variance 𝝈².
The average estimator is $\bar n = \frac{\sum_{i=1}^{k} Z_i}{k}$.

§ Expectation: $E[\bar n] = \frac{1}{k}\sum_{i=1}^{k} E[Z_i] = \frac{1}{k}\, k\mu = \mu$
§ Variance: $\frac{1}{k^2}\sum_{i=1}^{k} \mathrm{Var}[Z_i] = \frac{1}{k^2}\, k\sigma^2 = \frac{\sigma^2}{k}$ (a factor-𝑘 decrease)
§ CV: $\frac{\sigma}{\mu\sqrt{k}}$ (a factor-$\sqrt{k}$ decrease)

Page 48:

Morris Counter: Reducing variance (generic method)

A single Morris counter has $\mathrm{Var}[\hat n] = \sigma^2 \approx \tfrac12 n^2$ and $\mathrm{CV} = \tfrac{\sigma}{\mu} \approx \tfrac{1}{\sqrt 2}$.

§ Use 𝒌 independent counters 𝒚₁, 𝒚₂, …, 𝒚_𝒌
§ Compute the estimates $Z_i = 2^{y_i} - 1$
§ Average the estimates: $\bar n = \frac{\sum_{i=1}^{k} Z_i}{k}$
§ NRMSE = CV = $\frac{\sigma}{\mu\sqrt{k}} \approx \frac{1}{\sqrt{2k}} = \varepsilon$

Sketch size (bits): $k \log\log n = \tfrac12 \varepsilon^{-2} \log\log n$

Can we get a better tradeoff of sketch size and NRMSE 𝜀?
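The generic averaging method as a short, self-contained Python sketch: 𝑘 independent Morris counters, with the final estimate being the average of their individual estimates.

```python
import random

class AveragedMorris:
    """k independent Morris counters; the estimate is the average of the k estimates."""
    def __init__(self, k):
        self.counters = [0] * k                  # each entry is a Morris counter s

    def increment(self):
        for i, s in enumerate(self.counters):    # every counter sees every element
            if random.random() < 2.0 ** (-s):    # increment s with probability 2^(-s)
                self.counters[i] += 1

    def estimate(self):                          # average of k unbiased estimates 2^s - 1
        return sum(2 ** s - 1 for s in self.counters) / len(self.counters)
```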

Page 49:

Morris Counter: Reducing variance (dedicated method): base change [Morris 1978 + Flajolet 1985]

A single Morris counter has $\mathrm{Var}[\hat n] = \sigma^2 \approx \tfrac12 n^2$ and $\mathrm{CV} \approx \tfrac{1}{\sqrt 2}$.

Single counter with base change. IDEA: change base 𝟐 (count $\log_2 n$) to 𝟏 + 𝒃 (count $\log_{1+b} n$):
§ Estimate: return $(1+b)^{s} - 1$
§ Increment:
  § Increase the counter 𝒔 by the maximum amount so that the estimate increases by $1 - \Delta \le 1$.
  § Increment 𝒔 with probability $\Delta\, b^{-1} (1+b)^{-s}$.

For 𝒃 closer to 0 we increase the accuracy but also increase the counter size.

We analyze a more general method.

Page 50:

Weighted Morris Counter [C’15]

Handles weighted values (e.g., the stream 5, 14, 1, 7, 18, 9, 121, 17, …), is composable, and its size/quality is tuned by the base parameter 𝑏.

§ Initialize: 𝑠 ← 0
§ Estimate: return $(1+b)^{s} - 1$
§ Add a value 𝑉, or merge with a Morris sketch with counter $s_2 \le s$ (take $V = (1+b)^{s_2} - 1$):
  § Increase 𝑠 by the maximum amount so that the estimate increases by $Z \le V$
  § $\Delta \leftarrow V - Z$; increment 𝑠 with probability $\frac{\Delta}{b\,(1+b)^{s}}$

Sketch size: $\log_2 \log_{1+b} n \approx \log_2 \frac{\log_2 n}{\log_2 (1+b)} \le \log_2\log_2 n + 2\log_2 \frac{1}{b}$

We can show that $\mathrm{Var}[\hat n] \le b\, n(n+1)$ ⟹ $\mathrm{CV} \le \sqrt{b}\,\sqrt{1 + \tfrac1n}$ ⟹ choose $b = \varepsilon^2$.

!! Much better than the averaging structure's $\tfrac12 \varepsilon^{-2}\log\log n$ bits.
Example: 𝑛 = 10⁹, 𝜀 = 0.1. Exact: $\log_2 10^9 \approx 30$ bits; averaged Morris: ≈ 250 bits; weighted Morris: ≈ 12 bits.
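A Python sketch of the weighted Morris update following the two-step rule above (deterministic increase by 𝑍 ≤ 𝑉, then a probabilistic increment for Δ = 𝑉 − 𝑍); this is an illustration assembled from the slide's description, not reference code for [C'15].

```python
import random

class WeightedMorris:
    def __init__(self, b):
        self.b, self.s = b, 0                    # base parameter b and counter s

    def estimate(self):
        return (1 + self.b) ** self.s - 1

    def add(self, V):
        """Add a value V >= 0 (V = 1 recovers the base-change Morris counter)."""
        # Deterministic part: raise s by the max amount so the estimate grows by Z <= V.
        s0 = self.s
        while (1 + self.b) ** (self.s + 1) - (1 + self.b) ** s0 <= V:
            self.s += 1
        Z = (1 + self.b) ** self.s - (1 + self.b) ** s0
        # Probabilistic part: account for the remainder Delta = V - Z.
        delta = V - Z
        if random.random() < delta / (self.b * (1 + self.b) ** self.s):
            self.s += 1

    def merge(self, other):                      # merge with a sketch using the same base b
        self.add(other.estimate())
```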

Page 51:

Weighted Morris Counter: Unbiasedness

§ Initialize: 𝑠 ← 0
§ Estimate: return $(1+b)^{s} - 1$
§ Add a value 𝑉, or merge with a Morris sketch with counter $s_2 \le s$ (take $V = (1+b)^{s_2} - 1$):
  § Increase 𝑠 by the maximum amount so that the estimate increases by $Z \le V$
  § $\Delta \leftarrow V - Z$; increment 𝑠 with probability $\frac{\Delta}{b\,(1+b)^{s}}$

We show that the expected increase in the estimate when adding 𝑉 is equal to 𝑉. The increase has two components, deterministic and probabilistic:
§ Deterministic: we set $s \leftarrow s + \max\{\, i \ge 0 \mid (1+b)^{s+i} - (1+b)^{s} \le V \,\}$. This step increases the estimate by $Z = (1+b)^{s+i} - (1+b)^{s}$.
§ We then probabilistically increment 𝑠 to account for $\Delta = V - Z$: the estimate increases by $(1+b)^{s+1} - (1+b)^{s} = b(1+b)^{s}$ with probability $p = \frac{\Delta}{b(1+b)^{s}}$, and by 0 otherwise. Therefore the expected increase is $p\, b(1+b)^{s} = \Delta$.

Page 52:

Weighted Morris Counter: Variance bound

Estimate: $(1+b)^{s} - 1$.  Add $\Delta$: increment 𝑠 with probability $\frac{\Delta}{b(1+b)^{s}}$.

Consider all data values $V_i$ and the corresponding random variables $A_i$, where $A_i$ is the increase in the estimate. Note that by definition $n = \sum_i V_i$.

Lemma 1: Consider a value 𝑉 and its variable 𝐴. Then $\mathrm{Var}[A] \le V b (n+1)$.

Lemma 2: For any $i \ne j$, $\mathrm{Cov}[A_i, A_j] = 0$.

Combining, we have that $\mathrm{Var}[\hat n] = \sum_i \mathrm{Var}[A_i] \le \sum_i V_i\, b(n+1) \le b\, n(n+1)$.

It remains to prove the Lemmas…

Page 53:

Weighted Morris Counter: Variance bound, Lemma 1

Estimate: $(1+b)^{s} - 1$.  Add $\Delta$: increment 𝑠 with probability $\frac{\Delta}{b(1+b)^{s}}$.

Lemma 1: Consider a value 𝑉 and its variable 𝐴. Then $\mathrm{Var}[A] \le V b (n+1)$.

Proof: The variance, conditioned on the state of the counter 𝑠, only depends on the “probabilistic” part, which accounts for $\Delta \le V$:
$\mathrm{Var}[A \mid s] = \left(\tfrac{1}{p} - 1\right)\Delta^2 \le \frac{b(1+b)^{s}}{\Delta}\,\Delta^2 = \Delta\, b\, (1+b)^{s}$

The value of 𝑠 at the time the element is processed is at most the final value $s' \ge s$ of the counter, so $\mathrm{Var}[A \mid s] \le \Delta\, b\, (1+b)^{s'}$.

The unconditioned variance is bounded by the expectation over the distribution of $s'$. Note that $E[(1+b)^{s'}] = n + 1$. Therefore
$\mathrm{Var}[A] = E_{s}\!\left[\mathrm{Var}[A \mid s]\right] \le \Delta\, b\, E_{s'}\!\left[(1+b)^{s'}\right] = \Delta\, b\, (n+1) \le V b (n+1)$

(Recall: the data values are $V_i$ with corresponding variables $A_i$, and $n = \sum_i V_i$.)

Page 54:

Weighted Morris Counter: Variance bound, Lemma 2

Estimate: $(1+b)^{s} - 1$.  Add $\Delta$: increment 𝑠 with probability $\frac{\Delta}{b(1+b)^{s}}$.

Lemma 2: For any $i \ne j$, $\mathrm{Cov}[A_i, A_j] = 0$.

Proof: Suppose $V_1$ is processed first. We have $E[A_1] = V_1$. We now consider $A_2$ conditioned on $A_1$. Recall that the expectation of $A_2$, conditioned on any value of the counter when $V_2$ is processed, is $E[A_2 \mid s] = V_2 = E[A_2]$. Therefore, for any 𝑎, $E[A_2 \mid A_1 = a] = V_2$, and
$E[A_1 A_2] = \sum_a a \Pr[A_1 = a]\, E[A_2 \mid A_1 = a] = V_2\, E[A_1] = E[A_2]\, E[A_1] = V_1 V_2$

(Recall: the data values are $V_i$ with corresponding variables $A_i$, and $n = \sum_i V_i$.)

Page 55:

Preview: Counting Distinct Keys

The same key can occur in multiple data elements; we want to count the number of distinct keys.
§ The number of distinct keys is 𝒏 (= 6 in the example)
§ The number of data elements in the example is 11

Page 56:

Counting Distinct Keys: Example Applications

§ Networking:
  § Packet or request streams: count the number of distinct source IP addresses
  § Packet streams: count the number of distinct IP flows (source + destination IP, port, protocol)
§ Search engines: find how many distinct search queries were issued to a search engine each day

Page 57:

Bibliography

Misra-Gries summaries:
§ J. Misra and David Gries, Finding Repeated Elements. Science of Computer Programming 2, 1982. http://www.cs.utexas.edu/users/misra/scannedPdf.dir/FindRepeatedElements.pdf
§ Merging: Agarwal, Cormode, Huang, Phillips, Wei, and Yi, Mergeable Summaries, PODS 2012

Bloom filters:
§ Bloom, Burton H. (1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors", Communications of the ACM, 13 (7)
§ https://en.wikipedia.org/wiki/Bloom_filter

Approximate counting (Morris Algorithm):
§ Robert Morris. Counting Large Numbers of Events in Small Registers. Commun. ACM, 21(10):840-842, 1978. http://www.inf.ed.ac.uk/teaching/courses/exc/reading/morris.pdf
§ Philippe Flajolet. Approximate counting: A detailed analysis. BIT 25, 1985. http://algo.inria.fr/flajolet/Publications/Flajolet85c.pdf
§ Merging Morris counters: these slides