Foundations of Data Mining
Instructors: Amos Fiat, Edith Cohen, Haim Kaplan
http://www.cohenwang.com/edith/dataminingclass2017
Lecture 1
Edith Cohen

Course logistics
§ Tuesdays 16:00-19:00, Sherman 002
§ Slides for (most or all) lectures will be posted on the course web page: http://www.cohenwang.com/edith/dataminingclass2017
§ Office hours: Email instructors to set a time
§ Grade: 70% final exam, 30% on 5 problem sets

Data collected: network activity, people activity, measurements, search/assistant queries, location, online interactions and transactions, text, media
Generated (processed) data: parameters in large scale model, partly curated raw data
§ Scale: petabytes -> exabytes -> …
§ Diverse formats: relational, logs, text, media, measurements
§ Location: distributed, streamed

Data to Information
Social issues:
§ Privacy, Fairness
Mining and learning from data
§ Aggregates, statistics, properties
§ Models that allow us to generalize/predict
Scalable (efficient, fast) computation:
§ Data available as streamed or distributed (limit data movement for efficiency/privacy)
§ Platforms that use computation resources (MapReduce, TensorFlow, …) across scales:
• GPUs, multi-core CPUs, data center, wide area, federated (on device) computing
§ Algorithm design:
• "linear" processing on large data
• trade-off between accuracy and computation cost

Topics for this course
Selection criteria of topics:
§ Broad demonstrated applicability
§ Promote deeper understanding of concepts
§ Simplicity, elegance, principled
§ Instructor bias
Topics:
§ Data modeled as: key value pairs, metric (vectors, sets), graphs
§ Properties, features, statistics of interest
§ Summary structures for efficient storage/movement/computation
§ Algorithms for distributed/parallel/streamed computation
§ Data representations that support generalization (recover missing relations, identify spurious ones)
§ Data privacy

Today
§ Key-value pairs data
§ Intro to summary structures (sketches)
§ Computation over streamed/distributed data
§ Frequent keys: The Misra Gries structure
§ Set membership: Bloom Filters
§ Counting: Morris counters

Key-Value pairs
Data $D$: each data element $e \in D$ has a key and a value, $(e.\mathrm{key}, e.\mathrm{value})$
[Figure: example data elements with keys and values]
Example data
§ Search queries
§ IP network packets / flow records
§ Online interactions
§ Parameter updates (training ML model)

Key-Value pairs
Data $D$: each data element $e \in D$ has a key and a value, $(e.\mathrm{key}, e.\mathrm{value})$
[Figure: example data elements with keys and values]
Example tasks/queries
§ Sum/Max value
§ Membership: Is a given key in $D$?
§ How many distinct keys?
§ Very frequent keys (heavy hitters)

Data access
Data streams
Data read in one (or few) sequential passes
§ Cannot be revisited (IP traffic)
§ I/O efficiency (sequential access is cheaper than random access)
Challenges: "State" must be much smaller than data size, support updates to $D$
Distributed data / parallel computation
[Figure: data partitioned into parts D1, D2, D3, D4, D5]
GPUs, CPUs, VMs, servers, wide area, devices
§ Distributed data sources
§ Distribute for faster/scalable computation
Challenges: Limit data movement, support updates to $D$

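Summary Structures (Sketches)
$D$: data set; $f(D)$: some statistics/properties
$\mathrm{Sketch}(D)$: A summary of $D$ that acts as a "surrogate" and allows us to estimate $f(D)$
$\hat{f}$: estimator we apply to $\mathrm{Sketch}(D)$ to estimate $f(D)$
[Figure: Data $D$ is summarized as $\mathrm{Sketch}(D)$; the query $f(D)$ is answered by $\hat{f}(\mathrm{Sketch}(D))$]
§ Multi-objective $f(q, D)$, $\hat{f}(q, \mathrm{Sketch}(D))$: sketch supports multiple query types
Examples: random samples, projections, histograms, …
Why sketch?
Data can be too large to:
§ Store in full for long or even short term
§ Transmit
Working on the full data also has costs that a sketch avoids:
§ Slow/costly processing of exact queries
§ With a sketch, data updates do not necessitate full recomputation
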
Composable sketches
§ $\mathrm{Sketch}(A \cup B)$ from $\mathrm{Sketch}(A)$ and $\mathrm{Sketch}(B)$
Suffices to specify merging two sketches
[Figure: Data $A$ gives Sketch(A), Data $B$ gives Sketch(B), Data $A \cup B$ gives Sketch(A ∪ B)]
Distributed data / parallelize computation
[Figure: Sketches 1 to 5 are merged step by step: S(1 ∪ 2), S(3 ∪ 4), S(1 ∪ 2 ∪ 5), and finally 1 ∪ 2 ∪ 3 ∪ 4 ∪ 5]
Only sketch structure moves between locations

Streaming sketches
§ $\mathrm{Sketch}(A \cup \{e\})$ from $\mathrm{Sketch}(A)$ and element $e$
Weaker requirement than fully composable
[Figure: Data $A$ gives Sketch(A); Data $A \cup \{e\}$ gives Sketch(A ∪ {e})]
Streamed data
[Figure: the sketch is updated after each streamed element $e$]
Only "state" maintained is the sketch structure

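Sketch API
§ Initialization: $\mathrm{Sketch}(\emptyset)$
§ Estimator specification: $\hat{f}(\mathrm{Sketch}(D))$
§ Merge two sketches: $\mathrm{Sketch}(A \cup B)$ from $\mathrm{Sketch}(A)$ and $\mathrm{Sketch}(B)$
§ Process an element $e = (e.\mathrm{key}, e.\mathrm{val})$: $\mathrm{Sketch}(A \cup \{e\})$ from $\mathrm{Sketch}(A)$ and $e$
§ Delete $e$ (optional)
§ Seek to optimize sketch size vs. estimate quality
[Figure: Data $D$ is summarized as $\mathrm{Sketch}(D)$; query Q: $f(D)$? is answered by $\hat{f}(\mathrm{Sketch}(D))$]
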
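Easy sketches: min, max, sum, …
Element values: 32, 112, 14, 9, 37, 83, 115, 2, …
Exact, composable. Sketch is just a single register $s$.
Sum
§ Initialize: $s \leftarrow 0$
§ Process element $e$: $s \leftarrow s + e.\mathrm{val}$
§ Merge $s, s'$: $s \leftarrow s + s'$
§ Delete element $e$: $s \leftarrow s - e.\mathrm{val}$
§ Query: return $s$
Max
§ Initialize: $s \leftarrow 0$ (assuming nonnegative values; use $-\infty$ in general)
§ Process element $e$: $s \leftarrow \max(s, e.\mathrm{val})$
§ Merge $s, s'$: $s \leftarrow \max(s, s')$
§ Query: return $s$
§ No delete support
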
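The sum and max sketches follow the sketch API directly. Below is a minimal Python sketch of both; the class and method names (`SumSketch`, `MaxSketch`, `process`, `merge`, `query`) are illustrative assumptions, not names from the slides. The max register is initialized to `-inf` so that negative values are handled; 0 would suffice for nonnegative values.

```python
import math


class SumSketch:
    """Single-register sketch for the sum of element values (exact)."""

    def __init__(self):
        self.s = 0

    def process(self, value):
        self.s += value

    def merge(self, other):
        self.s += other.s

    def delete(self, value):
        self.s -= value

    def query(self):
        return self.s


class MaxSketch:
    """Single-register sketch for the maximum value (no delete support)."""

    def __init__(self):
        self.s = -math.inf  # assumption: -inf so negative values also work

    def process(self, value):
        self.s = max(self.s, value)

    def merge(self, other):
        self.s = max(self.s, other.s)

    def query(self):
        return self.s


# Split the example values into two parts, sketch each part, then merge.
values = [32, 112, 14, 9, 37, 83, 115, 2]
sum_a, sum_b = SumSketch(), SumSketch()
max_a, max_b = MaxSketch(), MaxSketch()
for v in values[:4]:
    sum_a.process(v)
    max_a.process(v)
for v in values[4:]:
    sum_b.process(v)
    max_b.process(v)
sum_a.merge(sum_b)
max_a.merge(max_b)
print(sum_a.query(), max_a.query())
```

Merging the two partial sketches gives the same answer as processing all values in one stream, which is what "composable" means here.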
Frequent Keys
§ Data is streamed or distributed
§ Very large # distinct keys, huge # elements
§ Find the keys that occur very often
[Figure: example data; the frequent keys occur in 3/11 elements]
Example Applications:
§ Networking: Find "elephant" IP flows
§ Search engines: Find the most frequent queries
§ Text analysis: Frequent terms
Zipf law: Frequency of the $i$-th heaviest key $\propto i^{-s}$. Say top 10% keys in 90% of elements.
[Figures: Zipf-law frequency plots; sources: Wikipedia, https://brenocon.com/blog/2009/05]

Frequent Keys: Exact Solution
Exact solution:
§ Create a counter for each distinct key on its first occurrence
§ When processing an element with key $x$, increment the counter of $x$
Properties: Fully composable, exact, even supports deletions, recovers all frequencies
Problem: Structure size is $n$ = number of distinct keys. What can we do with size $k \ll n$?
Solution: Sketch that got re-discovered many times [MG1982, DLM2002, KSP2003, MAA2006]

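For reference, the exact solution can be sketched in a few lines of Python using `collections.Counter`. The class name `ExactCounts` and its method names are assumptions for illustration. The structure is exact and composable, and supports deletions, but it keeps one counter per distinct key.

```python
from collections import Counter


class ExactCounts:
    """Exact per-key counts: one counter per distinct key."""

    def __init__(self):
        self.counts = Counter()

    def process(self, key):
        self.counts[key] += 1

    def delete(self, key):
        self.counts[key] -= 1
        if self.counts[key] <= 0:
            del self.counts[key]  # drop empty counters

    def merge(self, other):
        self.counts.update(other.counts)  # Counter.update adds counts

    def query(self, key):
        return self.counts[key]  # Counter returns 0 for absent keys


# Two hypothetical streams of keys, sketched separately and then merged.
a, b = ExactCounts(), ExactCounts()
for key in "abacab":
    a.process(key)
for key in "adbda":
    b.process(key)
a.merge(b)
print(a.query("a"), len(a.counts))
```

The size `len(a.counts)` equals the number of distinct keys, which is the cost that the sketches on the following slides avoid.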
Frequent Keys: Streaming sketch [Misra Gries 1982]
Sketch size parameter $k$: Use (at most) $k$ counters indexed by keys. Initially, no counters.
Processing an element with key $x$
§ If we already have a counter for $x$, increment it
§ Else, if there is no counter for $x$ but there are fewer than $k$ counters, create a counter for $x$ initialized to 1
§ Else, decrease all counters by 1. Remove 0 counters.
Query: # occurrences of $x$?
§ If we have a counter for $x$, return its value
§ Else, return 0
Clearly an under-estimate. What can we say precisely?
[Figure: example with $n = 6$ distinct keys, structure size $k = 3$, and $m = 11$ elements]

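The processing and query rules of the Misra Gries sketch can be written out as follows. This is a minimal sketch; the class name, the extra element counter `m`, and the example stream are assumptions for illustration (the stream is hypothetical, chosen to have 11 elements and 6 distinct keys). The method `error_bound` returns $(m - m')/(k+1)$, where $m'$ is the sum of the counters, which is the bound on the under-estimate derived in the analysis of this sketch.

```python
class MisraGries:
    """Misra Gries sketch with at most k counters indexed by keys."""

    def __init__(self, k):
        self.k = k
        self.counters = {}
        self.m = 0  # number of elements processed (needed for the bound)

    def process(self, key):
        self.m += 1
        if key in self.counters:
            self.counters[key] += 1
        elif len(self.counters) < self.k:
            self.counters[key] = 1
        else:
            # "Decrease" step: decrement all counters, remove zero counters.
            for x in list(self.counters):
                self.counters[x] -= 1
                if self.counters[x] == 0:
                    del self.counters[x]

    def query(self, key):
        return self.counters.get(key, 0)

    def error_bound(self):
        m_prime = sum(self.counters.values())
        return (self.m - m_prime) / (self.k + 1)


# Hypothetical stream: 11 elements, 6 distinct keys, sketch size k = 3.
stream = list("abacdabefab")
sketch = MisraGries(k=3)
for key in stream:
    sketch.process(key)
print(sketch.counters, sketch.error_bound())
```

On this stream two decrease steps occur, so every estimate is below the true count by at most 2.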
MG sketch: Analysis
Lemma: Estimate is smaller than true count by at most $\frac{m - m'}{k+1}$
$m'$: Sum of counters in structure; $m$: # elements in stream; $k$: structure size
We charge each "missed count" to a "decrease" step:
§ If key in structure, any decrease in count is due to "decrease" step.
§ Element processed and not counted results in decrease step.
We bound the number of "decrease" steps:
Each decrease step removes $k$ "counts" from structure; together with the input element, it results in $k + 1$ "uncounted" elements.
$\Rightarrow$ Number of decrease steps $\le \frac{m - m'}{k+1}$

MG sketch: Analysis (contd.)
Estimate is smaller than true count by at most $\frac{m - m'}{k+1}$
$\Rightarrow$ We get good estimates for $x$ with frequency $\gg \frac{m - m'}{k+1}$
§ Error bound is inversely proportional to $k$. Typical tradeoff of sketch size and quality of estimate.
§ Error bound can be computed with sketch: Track $m$ (element count), know $m'$ (can be computed from structure) and $k$.
§ MG works because typical frequency distributions have few very popular keys ("Zipf law")

Making MG fully Composable: Merging two MG sketches [MAA2006, ACHPWY2012]
Basic merge:
§ If a key $x$ is in both structures, keep one counter with sum of the two counts
§ If a key $x$ is in one structure, keep the counter
Reduce: If there are more than $k$ counters
§ Take the $(k+1)$th largest counter
§ Subtract its value from all other counters
§ Delete non-positive counters

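The basic merge followed by the reduce step can be sketched as a function on plain dictionaries of counters. The function name `mg_merge` and the two input sketches are hypothetical, chosen so that the $(k+1)$th largest merged counter has value 2 for $k = 3$.

```python
def mg_merge(c1, c2, k):
    """Merge two Misra Gries counter dictionaries into one with at most k counters."""
    # Basic merge: sum the counts of keys that appear in both structures.
    merged = dict(c1)
    for key, count in c2.items():
        merged[key] = merged.get(key, 0) + count
    # Reduce: only needed if there are more than k counters.
    if len(merged) > k:
        r = sorted(merged.values(), reverse=True)[k]  # (k+1)th largest value
        merged = {key: c - r for key, c in merged.items() if c - r > 0}
    return merged


# Hypothetical input sketches with k = 3.
s1 = {"a": 5, "b": 1, "c": 8}
s2 = {"a": 5, "d": 2, "e": 7}
print(mg_merge(s1, s2, k=3))
```

After the basic merge the counters are a:10, c:8, e:7, d:2, b:1; subtracting the 4th largest value (2) and deleting non-positive counters leaves three counters.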
Merging two Misra Gries Sketches
Basic Merge:
[Figure: basic merge of two example sketches]

Merging two Misra Gries Summaries
[Figure: merged counters with the 4th largest counter marked]
Reduce since there are more than $k = 3$ counters:
§ Take the $(k+1)$th = 4th largest counter
§ Subtract its value (2) from all other counters
§ Delete non-positive counters

Merging MG Summaries: Correctness
Claim: Final merged sketch has at most $k$ counters
Proof: We subtract the $(k+1)$th largest from everything, so at most the $k$ largest can remain positive.
Claim: For each key, merged sketch count is smaller than true count by at most $\frac{m - m'}{k+1}$

Merging MG Summaries: Correctness
Claim: For each key, merged sketch count is smaller than true count by at most $\frac{m - m'}{k+1}$
Proof: "Counts" for key $x$ can be missed in part 1, part 2, or in the reduce component of the merge. We add up the bounds on the misses.
Part 1: Total elements: $m_1$; Count in structure: $m_1'$; Count missed: $\le \frac{m_1 - m_1'}{k+1}$
Part 2: Total elements: $m_2$; Count in structure: $m_2'$; Count missed: $\le \frac{m_2 - m_2'}{k+1}$
"Reduce" missed count per key is at most $R$ = the $(k+1)$th largest count before reduce

Merging MG Summaries: Correctness
Part 1: Total elements: $m_1$; Count in structure: $m_1'$; Count missed: $\le \frac{m_1 - m_1'}{k+1}$
Part 2: Total elements: $m_2$; Count in structure: $m_2'$; Count missed: $\le \frac{m_2 - m_2'}{k+1}$
"Reduce" missed count per key is at most $R$ = the $(k+1)$th largest count before reduce
$\Rightarrow$ "Count missed" of one key in merged sketch is at most $\frac{m_1 - m_1'}{k+1} + \frac{m_2 - m_2'}{k+1} + R$

Merging MG Summaries: Correctness
Counted elements in structure:
§ After basic merge and before reduce: $m_1' + m_2'$
§ After reduce: $m'$
Claim: $m_1' + m_2' - m' \ge R(k+1)$
Proof: $R$ are erased in the reduce step in each of the $k+1$ largest counters. Maybe more in smaller counters.
"Count missed" of one key is at most
$\frac{m_1 - m_1'}{k+1} + \frac{m_2 - m_2'}{k+1} + R \le \frac{1}{k+1}\left(m_1 + m_2 - m'\right) = \frac{m - m'}{k+1}$

Probabilistic structures
§ Misra Gries is a deterministic structure
§ The outcome is determined uniquely by the input
§ Probabilistic structures/algorithms can be much more powerful
• Provide privacy/robustness to outliers
• Provide efficiency/size

Set membership
Data $D$
§ Data is streamed or distributed
§ Very large # distinct keys, huge # elements, large representation of keys
Structure that supports membership queries: Is a given key in $D$?
Example applications:
§ Spell checker: Insert a corpus of words. Check if word is in corpus.
§ Web crawler: Insert all URLs that were visited. Check if current URL was explored.
§ Distributed caches: Maintain a "summary" of keys of cached resources. Send requests to a cache that has the resource.
§ Blacklisted IP addresses: Intercept traffic from blacklisted sources
Exact solution: Dictionary (hash map) structure.
Problem: stores representation of all keys

Set membership: Bloom Filters [Bloom 1970]
§ Very popular in many applications
§ Probabilistic data structure
§ Reduces representation size to few bits (8) per key
§ False positives possible, no false negatives
§ Tradeoff between size and false positive rate
§ Composable
§ Analysis relies on having independent random hash functions (work well in practice, theoretical issues)

Independent Random Hash Functions
Simplified and Idealized
Domain $D$ of keys; probability distribution $F$ over $R$
Distribution $H$ of hash functions $h: D \to R$ with the following properties, over $h \sim H$:
§ For each $x \in D$, $h(x) \sim F$
§ $h(x)$ are independent for different keys $x \in D$
We use random hash functions as a way to have random draws with "memory": Attach a "permanent" random value to a key

Set membership warmup: Hash solution
Parameter: $m$
Structure: Boolean array of size $m$
Random hash function $h$ where $h(x) \sim U[1, \ldots, m]$
Initialize: Declare boolean array $S$ of size $m$; for $i = 1, \ldots, m$: $S[i] \leftarrow F$
Process element with key $x$: $S[h(x)] \leftarrow T$
Membership query for $x$: Return $S[h(x)]$
Merge: Two structures of same size and same hash function. Take bitwise OR.
[Figure: array with $m = 10$ cells, two of them set to T; a query that hits an F cell $\Rightarrow$ not in set]

Hash solution: Probability of a false positive
$m$: structure size; $n$: number of distinct keys inserted
$b = \frac{m}{n}$: number of bits we use in structure per distinct key in data
Probability $\varepsilon$ of false positive for $x$ is the probability of $h(x)$ hitting an occupied cell:
$\varepsilon = \Pr_{h \sim H}[S[h(x)] = T] \approx \frac{n}{m} = \frac{1}{b}$
Example: $\varepsilon = 0.02 \Rightarrow b = 50$
Too high for many applications!! (IP address is 32 bits…)
Can we get a better tradeoff between $\varepsilon$ and $b$?
[Figure: array with $m = 10$ cells, two of them set to T]

Set membership: Bloom Filters [Bloom 1970]
Two parameters: $m$ and $k$
Structure: Boolean array of size $m$
Independent hash functions $h_1, h_2, \ldots, h_k$ where $h_i(x) \sim U[1, \ldots, m]$
Initialize: Declare boolean array $S$ of size $m$; for $i = 1, \ldots, m$: $S[i] \leftarrow F$
Process element with key $x$: For $i = 1, \ldots, k$: $S[h_i(x)] \leftarrow T$
Membership query for $x$: Return $S[h_1(x)]$ and $S[h_2(x)]$ and $\cdots$ and $S[h_k(x)]$
Merge: Two structures of same size and same set of hash functions. Take bitwise OR.
[Figure: array with $m = 10$ cells and $k = 3$; five cells set to T; a query with T ∧ T ∧ F = F $\Rightarrow$ not in set]

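A minimal Python sketch of the Bloom filter operations is given below. The idealized independent random hash functions are replaced here by SHA-256 with the hash index prepended to the key, which is an assumption made for illustration and not part of the slides; cells are indexed from 0 rather than 1. Class and method names are also illustrative.

```python
import hashlib


class BloomFilter:
    """Bloom filter over a boolean array of size m with k hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _cells(self, key):
        # Stand-in for h_1(x), ..., h_k(x): salted SHA-256 reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def process(self, key):
        for j in self._cells(key):
            self.bits[j] = True

    def query(self, key):
        return all(self.bits[j] for j in self._cells(key))

    def merge(self, other):
        # Requires same size and same hash functions; bitwise OR of the arrays.
        assert (self.m, self.k) == (other.m, other.k)
        self.bits = [a or b for a, b in zip(self.bits, other.bits)]


# Insert n = 1000 keys with b = m/n = 8 bits per key and k = 6 hash functions.
bloom = BloomFilter(m=8000, k=6)
members = [f"in-{i}" for i in range(1000)]
for key in members:
    bloom.process(key)
false_positives = sum(bloom.query(f"out-{i}") for i in range(2000))
print(false_positives / 2000)
```

Inserted keys are always reported as present (no false negatives); the printed fraction is the empirical false positive rate on keys that were never inserted.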
Bloom Filters Analysis: Probability of a false positive
$m$: structure size; $k$: number of hash functions; $n$: number of distinct keys inserted
[Figure: array with $m = 10$ cells and $k = 3$; a query with T ∧ T ∧ F = F $\Rightarrow$ not in set]
Probability of $h_i(x)$ NOT hitting a particular cell $j$:
$\Pr_{h \sim H}[h_i(x) \ne j] = 1 - \frac{1}{m}$
Probability that cell $j$ is F is that none of the $nk$ "dart throws" hits cell $j$:
$\Pr_{h \sim H}[S[j] = F] = \left(1 - \frac{1}{m}\right)^{kn}$
A false positive occurs for $x$ when all $k$ cells $h_i(x)$ for $i = 1, \ldots, k$ are T:
$\varepsilon = \prod_{i=1}^{k} \left(1 - \Pr_{h \sim H}[S[h_i(x)] = F]\right) = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}$
* Assume $k \ll m$ so $h_i(x)$ for different $i = 1, \ldots, k$ are very likely to be distinct

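Bloom Filters: Probability of a false positive (contd)
$m$: structure size; $k$: number of hash functions; $n$: number of distinct keys inserted
False positive probability:
$\varepsilon \le \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k} = \left(1 - e^{-k/b}\right)^{k}$
where we used
$\left(1 - \frac{1}{m}\right)^{kn} = \left(\left(1 - \frac{1}{m}\right)^{m}\right)^{kn/m} \approx \left(\frac{1}{e}\right)^{kn/m} = e^{-kn/m}$, since $\lim_{i \to \infty} \left(1 - \frac{1}{i}\right)^{i} = \frac{1}{e}$
We can see that FP probability decreases with $m$
!! FP probability depends on $b = \frac{m}{n}$, number of bits we use per distinct key
Given $b$, which $k$ minimizes the FP probability $\varepsilon$?
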
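Bloom Filters: Probability of a false positive (contd)
$k$: number of hash functions; $m$: structure size; $n$: number of distinct keys inserted; $b = \frac{m}{n}$: bits per key
False positive probability $\varepsilon$ (upper bound):
$\varepsilon \le \left(1 - e^{-k/b}\right)^{k}$
Given $b$, which $k$ minimizes the FP probability?
$k \approx (\ln 2)\, b \approx 0.7\, b$
$\varepsilon \approx \left(\frac{1}{2}\right)^{b \ln 2}$
!! FP error decreases exponentially in $b$ (recall $\varepsilon = \frac{1}{b}$ for $k = 1$)
Compute $b$ for desired FP error $\varepsilon$:
$b \approx 1.44 \log_2 \frac{1}{\varepsilon}$
Example: $b = 8$; $k = 6$; $\varepsilon \approx 0.02$
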
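The formulas on this slide are easy to evaluate numerically. The helper names below (`fp_probability`, `best_k`, `bits_per_key`) are assumptions for illustration; they compute the approximation $(1 - e^{-k/b})^k$, the choice $k \approx (\ln 2)\, b$ rounded to an integer, and $b \approx \log_2(1/\varepsilon) / \ln 2 \approx 1.44 \log_2(1/\varepsilon)$.

```python
import math


def fp_probability(b, k):
    """Approximate false positive probability with b bits per key and k hashes."""
    return (1 - math.exp(-k / b)) ** k


def best_k(b):
    """Number of hash functions k ~ (ln 2) * b, rounded to an integer >= 1."""
    return max(1, round(math.log(2) * b))


def bits_per_key(eps):
    """Bits per key b ~ 1.44 * log2(1 / eps) for a target FP probability eps."""
    return math.log2(1 / eps) / math.log(2)


b = 8
k = best_k(b)
print(k, fp_probability(b, k), bits_per_key(0.02))
```

For $b = 8$ this reproduces the example on the slide: $k = 6$ and a false positive probability of about 0.02.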
Quick review: Random Variables
Random variable $X$
§ Probability Density Function (PDF) $f(x)$:
• Properties: $f(x) \ge 0$, $\int_{-\infty}^{\infty} f(x)\, dx = 1$
§ Cumulative Distribution Function (CDF)
$F(t) = \int_{-\infty}^{t} f(x)\, dx$: probability that $X \le t$
• Properties: $F \in [0, 1]$, monotone non-decreasing

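Quick review: Expectation
§ Expectation: "average" value of $X$:
$\mu_X \equiv \mathrm{E}[X] = \int_{-\infty}^{\infty} x f(x)\, dx$
§ Linearity of Expectation: $\mathrm{E}[aX + b] = a\,\mathrm{E}[X] + b$
For random variables $X_1, X_2, X_3, \ldots, X_k$:
$\mathrm{E}\left[\sum_{i=1}^{k} X_i\right] = \sum_{i=1}^{k} \mathrm{E}[X_i]$
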
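Quick review: Variance
§ Variance
$\mathrm{Var}[X] \equiv \sigma_X^2 = \mathrm{E}[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx$
§ Useful relations:
$\sigma_X^2 = \mathrm{E}[X^2] - \mu_X^2$
$\mathrm{Var}[aX + b] = a^2\, \mathrm{Var}[X]$
§ The standard deviation is $\sigma_X = \sqrt{\mathrm{Var}[X]}$
§ Coefficient of Variation $\frac{\sigma}{\mu}$ (normalized s.d.)
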
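Quick review: Covariance
Measure of joint variability of two random variables $X, Y$
$\mathrm{Cov}[X, Y] = \sigma_{XY} = \mathrm{E}[(X - \mu_X)(Y - \mu_Y)] = \mathrm{E}[XY] - \mu_X \mu_Y$
§ $X, Y$ are independent $\Rightarrow \sigma_{XY} = 0$
§ Variance of the sum of $X_1, X_2, \ldots, X_k$:
$\mathrm{Var}\left[\sum_{i=1}^{k} X_i\right] = \sum_{i=1}^{k} \sum_{j=1}^{k} \mathrm{Cov}[X_i, X_j] = \sum_{i=1}^{k} \mathrm{Var}[X_i] + \sum_{i \ne j} \mathrm{Cov}[X_i, X_j]$
When (pairwise) independent, the covariance terms vanish and $\mathrm{Var}\left[\sum_{i=1}^{k} X_i\right] = \sum_{i=1}^{k} \mathrm{Var}[X_i]$
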
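Quick Review: Estimators
A function $\hat{f}(S)$ applied to a probabilistic sketch $S$ of data $D$ to estimate a property/statistics $f(D)$ of the data $D$
§ Error (random variable): $\mathrm{err}(\hat{f}) = \hat{f}(S) - f(D)$; Relative Error: $\frac{\mathrm{err}(\hat{f})}{f(D)}$
§ Bias: $\mathrm{Bias}[\hat{f}] = \mathrm{E}[\mathrm{err}(\hat{f})] = \mathrm{E}[\hat{f}] - f(D)$
§ When Bias = 0 estimator is unbiased
§ Mean Square Error (MSE): $\mathrm{E}[\mathrm{err}(\hat{f})^2] = \mathrm{Var}[\hat{f}] + \mathrm{Bias}[\hat{f}]^2$
§ Root Mean Square Error (RMSE): $\sqrt{\mathrm{MSE}}$
§ Normalized Root Mean Square Error (NRMSE): $\frac{\sqrt{\mathrm{MSE}}}{f(D)}$
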
Simple Counting (revisited)
Stream: 1, 1, 1, 1, 1, 1, 1, 1, …
§ Initialize: $s \leftarrow 0$
§ Process element: $s \leftarrow s + 1$
§ Merge $s, s'$: $s \leftarrow s + s'$
Exact count: Size (bits) is $\lceil \log_2 n \rceil$ where $n$ is the current count.
Can we count with fewer bits? Have to settle for an approximate count…
Applications: We have very many quantities to count, and fast memory is scarce (say, inside a backbone router) or bandwidth is scarce (distributed training of a large ML model)

Morris Counter [Morris 1978]
Probabilistic stream counter: Maintain "$\log n$" instead of $n$, use $\log \log n$ bits
§ Initialize: $s = 0$
§ Increment: Increment $s$ with probability $2^{-s}$
§ Query: Return $2^{s} - 1$
Example run (one possible outcome; the first column is the initial state):
Stream:                        1    1    1    1    1    1    1    1
Count $n$:                0    1    2    3    4    5    6    7    8
Counter $s$:              0    1    1    2    2    2    2    3    3
$p = 2^{-s}$:             1   1/2  1/2  1/4  1/4  1/4  1/4  1/8  1/8
Estimate $\hat{n}$:       0    1    1    3    3    3    3    7    7
$n = 10^9$: Exact: $\log_2 10^9 \approx 30$ bits; Morris: $\log_2 \log_2 10^9 \approx 5$ bits

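A minimal Python sketch of the Morris counter follows. The class name, the optional `rng` argument, and the helper `mean_estimate` are assumptions for illustration. The helper averages the estimate over many independent runs, which gives an empirical check that the estimate $2^s - 1$ is centered on the true count.

```python
import random


class MorrisCounter:
    """Morris counter: stores s, roughly log2 of the count."""

    def __init__(self, rng=None):
        self.s = 0
        self.rng = rng if rng is not None else random.Random()

    def increment(self):
        # Increment s with probability 2^-s.
        if self.rng.random() < 2.0 ** -self.s:
            self.s += 1

    def query(self):
        return 2 ** self.s - 1


def mean_estimate(n, trials, seed=0):
    """Average Morris estimate of a count n over independent runs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counter = MorrisCounter(rng)
        for _ in range(n):
            counter.increment()
        total += counter.query()
    return total / trials


print(mean_estimate(n=100, trials=4000))
```

A single run can be far from the true count (the estimate is always of the form $2^s - 1$); only the average over runs concentrates near $n$.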
Morris Counter: Unbiasedness
§ Initialize: $s = 0$
§ Increment: $s \leftarrow s + 1$ with probability $2^{-s}$
§ Query: Return $2^{s} - 1$
§ When $n = 0$, $s = 0$, estimate is $\hat{n} = 2^0 - 1 = 0$
§ When $n = 1$, $s = 1$, estimate is $\hat{n} = 2^1 - 1 = 1$
§ When $n = 2$,
with $p = \frac{1}{2}$, $s = 1$, $\hat{n} = 1$
with $p = \frac{1}{2}$, $s = 2$, $\hat{n} = 2^2 - 1 = 3$
Expectation: $\mathrm{E}[\hat{n}] = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot 3 = 2$
§ $n = 3, 4, 5, \ldots$ by induction…

Morris Counter: Unbiasedness (contd)
§ Initialize: $s = 0$
§ Increment: $s \leftarrow s + 1$ with probability $2^{-s}$
§ Query: Return $2^{s} - 1$
It suffices to show that the expected increase of the estimate is always 1
§ Suppose the counter value is $s$
§ We increase with probability $2^{-s}$
§ The expected increase in the estimate is
$2^{-s}\left((2^{s+1} - 1) - (2^{s} - 1)\right) + (1 - 2^{-s}) \cdot 0 = 2^{-s}\, 2^{s} = 1$

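Morris Counter: Variance
How good is our estimate?
$X_n$: random variable of counter with input $n$
§ Our estimate is the random variable $\hat{n} = 2^{X_n} - 1$
$\mathrm{Var}[\hat{n}] = \mathrm{Var}[\hat{n} + 1] = \mathrm{E}[(\hat{n} + 1)^2] - \left(\mathrm{E}[\hat{n} + 1]\right)^2 = \mathrm{E}[2^{2X_n}] - (n + 1)^2$
§ We can show by induction $\mathrm{E}[2^{2X_n}] = \frac{3}{2} n^2 + \frac{3}{2} n + 1$
§ This means $\mathrm{Var}[\hat{n}] \approx \frac{1}{2} n^2$ and $\mathrm{CV} = \frac{\sigma}{\mu} \approx \frac{1}{\sqrt{2}}$ (= NRMSE since unbiased)
How to reduce the error?
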
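Reducing variance by averaging
$k$ (pairwise) independent unbiased estimates $Z_i$ with expectation $\mu$ and variance $\sigma^2$.
The average estimator: $\hat{n}' = \frac{\sum_{i=1}^{k} Z_i}{k}$
§ Expectation: $\mathrm{E}[\hat{n}'] = \frac{1}{k} \sum_{i=1}^{k} \mathrm{E}[Z_i] = \frac{1}{k} k \mu = \mu$
§ Variance: $\left(\frac{1}{k}\right)^2 \sum_{i=1}^{k} \mathrm{Var}[Z_i] = \left(\frac{1}{k}\right)^2 k \sigma^2 = \frac{\sigma^2}{k}$ ($\times k$ decrease)
§ CV: $\frac{\sigma}{\mu}$ becomes $\frac{\sigma}{\mu \sqrt{k}}$ ($\times \sqrt{k}$ decrease)
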
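Morris Counter: Reducing variance (generic method)
Morris counter: $\mathrm{Var}[\hat{n}] = \sigma^2 \approx \frac{1}{2} n^2$ and $\mathrm{CV} = \frac{\sigma}{\mu} \approx \frac{1}{\sqrt{2}}$
§ Use $k$ independent counters $y_1, y_2, \ldots, y_k$
§ Compute estimates $Z_i = 2^{y_i} - 1$
§ Average the estimates: $\hat{n}' = \frac{\sum_{i=1}^{k} Z_i}{k}$
§ NRMSE = CV = $\frac{\sigma}{\mu} \approx \frac{1}{\sqrt{2k}} = \varepsilon$
Sketch size (bits): $k \log \log n = \frac{1}{2} \varepsilon^{-2} \log \log n$
Can we get a better tradeoff of sketch size and NRMSE $\varepsilon$?
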
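The effect of averaging can be checked by simulation. The sketch below is an illustration under assumed names (`morris_estimate`, `averaged_morris_estimate`, `empirical_cv`); it measures the empirical coefficient of variation of the averaged estimator, which should be close to $1/\sqrt{2k}$.

```python
import random


def morris_estimate(n, rng):
    """Run one base-2 Morris counter on n increments and return its estimate."""
    s = 0
    for _ in range(n):
        if rng.random() < 2.0 ** -s:
            s += 1
    return 2 ** s - 1


def averaged_morris_estimate(n, k, rng):
    """Average the estimates of k independent Morris counters."""
    return sum(morris_estimate(n, rng) for _ in range(k)) / k


def empirical_cv(n, k, trials, seed=0):
    """Empirical coefficient of variation (std / mean) of the averaged estimate."""
    rng = random.Random(seed)
    estimates = [averaged_morris_estimate(n, k, rng) for _ in range(trials)]
    mean = sum(estimates) / trials
    var = sum((e - mean) ** 2 for e in estimates) / trials
    return var ** 0.5 / mean


cv_single = empirical_cv(n=100, k=1, trials=500, seed=1)
cv_averaged = empirical_cv(n=100, k=16, trials=500, seed=2)
print(cv_single, cv_averaged)
```

With $k = 16$ counters the CV drops by roughly a factor of $\sqrt{16} = 4$ relative to a single counter, at 16 times the storage.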
Morris Counter: Reducing variance (dedicated method): base change [Morris 1978 + Flajolet 1985]
Morris counter: $\mathrm{Var}[\hat{n}] = \sigma^2 \approx \frac{1}{2} n^2$ and $\mathrm{CV} = \frac{\sigma}{\mu} \approx \frac{1}{\sqrt{2}}$
Single counter with base change
IDEA: Change base 2 (count $\log_2 n$) to $1 + b$ (count $\log_{1+b} n$)
§ Estimate: Return $(1 + b)^{s} - 1$
§ Increment:
• Increase counter $s$ by maximum amount so estimate increase $= 1 - \Delta \le 1$
• Increment $s$ with probability $\Delta\, b^{-1} (1 + b)^{-s}$
For $b$ closer to 0, we increase accuracy but also increase counter size.
We analyze a more general method

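Weighted Morris Counter [C'15]
Stream of values: 5, 14, 1, 7, 18, 9, 121, 17, …
Weighted values, composable, size/quality tuned by base parameter $b$
§ Initialize: $s \leftarrow 0$
§ Estimate: return $(1 + b)^{s} - 1$
§ Add $V$ or merge with a Morris sketch $s_2 \le s$ ($V = (1 + b)^{s_2} - 1$):
• Increase $s$ by max amount so that estimate increases by $Z \le V$
• $\Delta \leftarrow V - Z$; Increment $s$ with probability $\frac{\Delta}{b (1 + b)^{s}}$
Sketch size: $\log_2 \log_{1+b} n \approx \log_2 \frac{\log_2 n}{b \log_2 e} \le \log_2 \log_2 n + 2 \log_2 \frac{1}{\varepsilon}$
We can show $\mathrm{Var}[\hat{n}] \le b\, n (n + 1) \Rightarrow \mathrm{CV} \le \sqrt{b \left(1 + \frac{1}{n}\right)} \Rightarrow$ Choose $b = \varepsilon^2$
!! Much better than the averaging structure $\frac{1}{2} \varepsilon^{-2} \log \log n$
$n = 10^9$, $\varepsilon = 0.1$: Exact: $\log_2 10^9 \approx 30$ bits; AveMorris: $\approx 250$ bits; W-Morris: $\approx 12$ bits
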
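The add and merge rules of the weighted Morris counter can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, values are assumed nonnegative, both counters in a merge are assumed to use the same $b$, and the two `while` loops are a guard against floating point rounding in the logarithm rather than part of the method.

```python
import math
import random


class WeightedMorrisCounter:
    """Weighted Morris counter with base 1 + b; estimate is (1+b)^s - 1."""

    def __init__(self, b, rng=None):
        self.b = b
        self.s = 0
        self.rng = rng if rng is not None else random.Random()

    def query(self):
        return (1 + self.b) ** self.s - 1

    def add(self, value):
        base = 1 + self.b
        cur = base ** self.s
        # Deterministic part: largest i >= 0 with base^(s+i) - base^s <= value.
        i = int(math.log(1 + value / cur) / math.log(base))
        while base ** (self.s + i + 1) - cur <= value:  # rounding guard
            i += 1
        while i > 0 and base ** (self.s + i) - cur > value:  # rounding guard
            i -= 1
        z = base ** (self.s + i) - cur
        self.s += i
        # Probabilistic part: account for delta = value - z.
        delta = value - z
        if self.rng.random() < delta / (self.b * base ** self.s):
            self.s += 1

    def merge(self, other):
        # Keep the larger counter and add the estimate of the smaller one.
        small, large = sorted((self.s, other.s))
        self.s = large
        self.add((1 + self.b) ** small - 1)


def estimate_stream(values, b, rng):
    """Feed all values to a fresh counter and return its estimate."""
    counter = WeightedMorrisCounter(b, rng)
    for v in values:
        counter.add(v)
    return counter.query()


values = [5, 14, 1, 7, 18, 9, 121, 17]  # example values; their sum is 192
rng = random.Random(1)
estimates = [estimate_stream(values, b=0.01, rng=rng) for _ in range(2000)]
print(sum(estimates) / len(estimates))
```

With $b = \varepsilon^2 = 0.01$ the estimates should scatter around the true sum 192 with a relative error of roughly 10% or less.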
Weighted Morris Counter: Unbiasedness
§ Estimate: return $(1 + b)^{s} - 1$
§ Initialize: $s \leftarrow 0$
§ Add $V$ or merge with a Morris sketch $s_2 \le s$ ($V = (1 + b)^{s_2} - 1$):
• Increase $s$ by max amount so that estimate increases by $Z \le V$
• $\Delta \leftarrow V - Z$; Increment $s$ with probability $\frac{\Delta}{b (1 + b)^{s}}$
We show that the expected increase in the estimate when adding $V$ is equal to $V$. The increase has two components, deterministic and probabilistic:
§ Deterministic: We set $s \leftarrow s + \max\{i \ge 0 \mid (1 + b)^{s+i} - (1 + b)^{s} \le V\}$. This step increases the estimate by $Z = (1 + b)^{s+i} - (1 + b)^{s}$.
§ We then probabilistically increment $s$ to account for $\Delta = V - Z$: The estimate increase is $(1 + b)^{s+1} - (1 + b)^{s} = b (1 + b)^{s}$ with probability $p = \frac{\Delta}{b (1 + b)^{s}}$ and is 0 otherwise. Therefore, the expectation is $p\, b (1 + b)^{s} = \Delta$.

Weighted Morris Counter: Variance bound
Estimate: $(1 + b)^{s} - 1$
Add $\Delta$: Increment $s$ with probability $\frac{\Delta}{b (1 + b)^{s}}$
Consider all data values $V_i$ and the corresponding random variables $A_i$ that are the increases in the estimate. Note that by definition $n = \sum_i V_i$.
Lemma 1: Consider value $V$ and variable $A$. Then $\mathrm{Var}[A] \le V b (n + 1)$
Lemma 2: For any $i \ne j$, $\mathrm{Cov}[A_i, A_j] = 0$.
Combining, we have that $\mathrm{Var}[\hat{n}] = \sum_i \mathrm{Var}[A_i] \le \sum_i V_i\, b (n + 1) \le b\, n (n + 1)$
It remains to prove the Lemmas…

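Weighted Morris Counter: Variance bound, Lemma 1
Estimate: $(1 + b)^{s} - 1$
Add $\Delta$: Increment $s$ with probability $\frac{\Delta}{b (1 + b)^{s}}$
Consider all data values $V_i$ and the corresponding random variables $A_i$ that are the increases in the estimate. Note that by definition $n = \sum_i V_i$.
Lemma 1: Consider value $V$ and variable $A$. Then $\mathrm{Var}[A] \le V b (n + 1)$
Proof: The variance, conditioned on the state of the counter $s$, only depends on the "probabilistic" part which is $\Delta \le V$.
$\mathrm{Var}[A \mid s] = \left(\frac{1}{p} - 1\right) \Delta^2 \le \frac{b (1 + b)^{s}}{\Delta} \Delta^2 = \Delta\, b (1 + b)^{s}$
The value of $s$ at the time the element is processed is at most the final value $s' \ge s$ of the counter. So $\mathrm{Var}[A \mid s] \le \Delta\, b (1 + b)^{s'}$
The unconditioned variance is bounded by the expectation over the distribution of $s'$ (the term $\mathrm{Var}[\mathrm{E}[A \mid s]]$ vanishes because $\mathrm{E}[A \mid s] = V$ for every $s$). Note that $\mathrm{E}[(1 + b)^{s'}] = n + 1$. Therefore
$\mathrm{Var}[A] = \mathrm{E}_s[\mathrm{Var}[A \mid s]] \le \Delta\, b\, \mathrm{E}_{s'}[(1 + b)^{s'}] = \Delta\, b (n + 1) \le V b (n + 1)$
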
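Weighted Morris Counter: Variance bound, Lemma 2
Estimate: $(1 + b)^{s} - 1$
Add $\Delta$: Increment $s$ with probability $\frac{\Delta}{b (1 + b)^{s}}$
Consider all data values $V_i$ and the corresponding random variables $A_i$ that are the increases in the estimate. Note that by definition $n = \sum_i V_i$.
Lemma 2: For any $i \ne j$, $\mathrm{Cov}[A_i, A_j] = 0$.
Proof: Suppose $V_1$ is processed first. We have $\mathrm{E}[A_1] = V_1$. We now consider $A_2$ conditioned on $A_1$. Recall that the expectation of $A_2$ conditioned on any value of the counter when $V_2$ is processed is $\mathrm{E}[A_2 \mid s] = V_2 = \mathrm{E}[A_2]$. Therefore, for any $a$, $\mathrm{E}[A_2 \mid A_1 = a] = V_2$.
$\mathrm{E}[A_1 A_2] = \sum_a a \Pr[A_1 = a]\, \mathrm{E}[A_2 \mid A_1 = a] = V_2\, \mathrm{E}[A_1] = \mathrm{E}[A_2]\, \mathrm{E}[A_1] = V_1 V_2$
Hence $\mathrm{Cov}[A_1, A_2] = \mathrm{E}[A_1 A_2] - \mathrm{E}[A_1]\, \mathrm{E}[A_2] = 0$.
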
Preview: Counting Distinct Keys
Same keys can occur in multiple data elements, we want to count the number of distinct keys.
§ Number of distinct keys is $n$ (= 6 in example)
§ Number of data elements in this example is 11

Counting Distinct Keys: Example Applications
§ Networking:
• Packet or request streams: Count the number of distinct source IP addresses
• Packet streams: Count the number of distinct IP flows (source + destination IP, port, protocol)
§ Search Engines: Find how many distinct search queries were issued to a search engine each day

Bibliography
Misra Gries Summaries
§ J. Misra and David Gries. Finding Repeated Elements. Science of Computer Programming 2, 1982. http://www.cs.utexas.edu/users/misra/scannedPdf.dir/FindRepeatedElements.pdf
§ Merging: Agarwal, Cormode, Huang, Phillips, Wei, and Yi. Mergeable Summaries. PODS 2012
Bloom filters:
§ Burton H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 13(7), 1970
§ https://en.wikipedia.org/wiki/Bloom_filter
Approximate counting (Morris Algorithm)
§ Robert Morris. Counting Large Numbers of Events in Small Registers. Commun. ACM, 21(10):840-842, 1978. http://www.inf.ed.ac.uk/teaching/courses/exc/reading/morris.pdf
§ Philippe Flajolet. Approximate counting: A detailed analysis. BIT 25, 1985. http://algo.inria.fr/flajolet/Publications/Flajolet85c.pdf
§ Merging Morris counters: these slides