TRANSCRIPT
Graham Cormode, [email protected]
Nick Duffield, Texas A&M, [email protected]
Sampling for Big Data
Big Data
◊ "Big" data arises in many forms:
– Physical measurements: from science (physics, astronomy)
– Medical data: genetic sequences, detailed time series
– Activity data: GPS location, social network activity
– Business data: customer behavior tracking at fine detail
◊ Common themes:
– Data is large, and growing
– There are important patterns and trends in the data
– We don't fully know where to look or how to find them
Why Reduce?
◊ Although "big" data is about more than just the volume… most big data is big!
◊ It is not always possible to store the data in full
– Many applications (telecoms, ISPs, search engines) can't keep everything
◊ It is inconvenient to work with data in full
– Just because we can, doesn't mean we should
◊ It is faster to work with a compact summary
– Better to explore data on a laptop than a cluster
Why Sample?
◊ Sampling has an intuitive semantics
– We obtain a smaller data set with the same structure
◊ Estimating on a sample is often straightforward
– Run the analysis on the sample that you would on the full data
– Some rescaling/reweighting may be necessary
◊ Sampling is general and agnostic to the analysis to be done
– Other summary methods only work for certain computations
– Though sampling can be tuned to optimize some criteria
◊ Sampling is (usually) easy to understand
– So prevalent that we have an intuition about sampling
Alternatives to Sampling
◊ Sampling is not the only game in town
– Many other data reduction techniques by many names
◊ Dimensionality reduction methods
– PCA, SVD, eigenvalue/eigenvector decompositions
– Costly and slow to perform on big data
◊ "Sketching" techniques for streams of data
– Hash-based summaries via random projections
– Complex to understand and limited in function
◊ Other transform/dictionary-based summarization methods
– Wavelets, Fourier Transform, DCT, Histograms
– Not incrementally updatable, high overhead
◊ All worthy of study – in other tutorials
Health Warning: contains probabilities
◊ Will avoid detailed probability calculations, aim to give high-level descriptions and intuition
◊ But some probability basics are assumed
– Concepts of probability, expectation, variance of random variables
– Allude to concentration of measure (Exponential/Chernoff bounds)
◊ Feel free to ask questions about technical details along the way
Outline
◊ Motivating application: sampling in large ISP networks
◊ Basics of sampling: concepts and estimation
◊ Stream sampling: uniform and weighted case
– Variations: Concise sampling, sample and hold, sketch guided
BREAK
◊ Advanced stream sampling: sampling as cost optimization
– VarOpt, priority, structure aware, and stable sampling
◊ Hashing and coordination
– Bottom-k, consistent sampling and sketch-based sampling
◊ Graph sampling
– Node, edge and subgraph sampling
◊ Conclusion and future directions
Sampling as a Mediator of Constraints
Sampling mediates between:
– Data Characteristics (Heavy Tails, Correlations)
– Query Requirements (Ad Hoc, Accuracy, Aggregates, Speed)
– Resource Constraints (Bandwidth, Storage, CPU)
Motivating Application: ISP Data
◊ Will motivate many results with application to ISPs
◊ Many reasons to use such examples:
– Expertise: tutors from telecoms world
– Demand: many sampling methods developed in response to ISP needs
– Practice: sampling widely used in ISP monitoring, built into routers
– Prescience: ISPs were first to hit many "big data" problems
– Variety: many different places where sampling is needed
◊ First, a crash-course on ISP networks…
Structure of Large ISP Networks
[Figure: network diagram with components:]
– Peering with other ISPs
– Access Networks: Wireless, DSL, IPTV
– City-level Router Centers
– Backbone Links
– Downstream ISP and business customers
– Service and Data centers
– Network Management & Administration
Measuring the ISP Network: Data Sources
[Figure: network diagram spanning Peering, Access, Router Centers, Backbone, Business, Datacenters, Management, annotated with data sources:]
– Link Traffic Rates: aggregated per router interface
– Traffic Matrices: flow records from routers
– Loss & Latency: active probing
– Loss & Latency: round trip to edge
– Protocol Monitoring: routers, wireless
– Status Reports: device failures and transitions
– Customer Care Logs: reactive indicators of network performance
Why Summarize (ISP) Big Data?
◊ When transmission bandwidth for measurements is limited
– Not such a big issue in ISPs with in-band collection
◊ Typically raw accumulation is not feasible (even for nation states)
– High rate streaming data
– Maintain historical summaries for baselining, time series analysis
◊ To facilitate fast queries
– When infeasible to run exploratory queries over full data
◊ As part of hierarchical query infrastructure:
– Maintain full data over limited duration window
– Drill down into full data through one or more layers of summarization
Sampling has proved to be a flexible method to accomplish this
Data Scale: Summarization and Sampling
Traffic Measurement in the ISP Network
[Figure: network diagram – Access, Router Centers, Backbone, Business, Datacenters, Management]
– Traffic Matrices: flow records from routers
Massive Dataset: Flow Records
◊ IP Flow: set of packets with common key observed close in time
◊ Flow Key: IP src/dst address, TCP/UDP ports, ToS, … [64 to 104+ bits]
◊ Flow Records:
– Protocol-level summaries of flows, compiled and exported by routers
– Flow key, packet and byte counts, first/last packet time, some router state
– Realizations: Cisco NetFlow, IETF Standards
◊ Scale: 100s of TeraBytes of flow records are generated daily in a large ISP
◊ Used to manage network over range of timescales:
– Capacity planning (months), …, detecting network attacks (seconds)
◊ Analysis tasks
– Easy: time series of predetermined aggregates (e.g. address prefixes)
– Hard: fast queries over exploratory selectors, history, communications subgraphs
[Figure: packet stream partitioned in time into flow1 … flow4]
Flows, Flow Records and Sampling
◊ Two types of sampling used in practice for internet traffic:
1. Sampling packet stream in router prior to forming flow records
□ Limits the rate of lookups of packet key in flow cache
□ Realized as Packet Sampled NetFlow (more later…)
2. Downstream sampling of flow records in collection infrastructure
□ Limits transmission bandwidth, storage requirements
□ Realized in ISP measurement collection infrastructure (more later…)
◊ Two cases illustrative of general property
– Different underlying distributions require different sample designs
– Statistical optimality sometimes limited by implementation constraints
□ Availability of router storage, processing cycles
Abstraction: Keyed Data Streams
◊ Data Model: objects are keyed weights
– Objects (x, k): weight x; key k
□ Example 1: objects = packets, x = bytes, k = key (source/destination)
□ Example 2: objects = flows, x = packets or bytes, k = key
□ Example 3: objects = account updates, x = credit/debit, k = account ID
◊ Stream of keyed weights, {(xi, ki) : i = 1, 2, …, n}
◊ Generic query: subset sums
– X(S) = Σ_{i∈S} xi for S ⊂ {1, 2, …, n}, i.e. total weight of index subset S
– Typically S = S(K) = {i : ki ∈ K}: objects with keys in K
□ Examples 1, 2: X(S(K)) = total bytes to given IP dest address / UDP port
□ Example 3: X(S(K)) = total balance change over set of accounts
◊ Aim: compute a fixed-size summary of the stream that can be used to estimate arbitrary subset sums with known error bounds
Inclusion Sampling and Estimation
◊ Horvitz-Thompson Estimation:
– Object of size xi sampled with probability pi
– Unbiased estimate x′i = xi/pi (if sampled), 0 if not sampled: E[x′i] = xi
◊ Linearity:
– Estimate of subset sum = sum of matching estimates
– Subset sum X(S) = Σ_{i∈S} xi is estimated by X′(S) = Σ_{i∈S} x′i
◊ Accuracy:
– Exponential bounds: Pr[|X′(S) − X(S)| > δX(S)] ≤ exp[−g(δ)X(S)]
– Confidence intervals: X(S) ∈ [X−(ε), X+(ε)] with probability 1 − ε
◊ Future proof:
– Don't need to know queries at time of sampling
□ "Where/when did that suspicious UDP port first become so active?"
□ "Which is the most active IP address within that anomalous subnet?"
– Retrospective estimate: subset sum over relevant key set
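The Horvitz-Thompson estimator and its linearity can be sketched in a few lines of Python (an illustrative sketch, not code from the tutorial; `p_of`, a callable supplying each item's inclusion probability, is a hypothetical interface):

```python
import random

def ht_sample(stream, p_of):
    """Sample (weight, key) items; keep the HT estimate x/p with each key."""
    sample = []
    for x, k in stream:
        p = p_of(x)                      # inclusion probability p_i
        if random.random() < p:
            sample.append((x / p, k))    # unbiased estimate x_i / p_i
    return sample

def estimate_subset_sum(sample, key_pred):
    """Linearity: subset-sum estimate = sum of matching HT estimates."""
    return sum(x_hat for x_hat, k in sample if key_pred(k))
```

With `p_of = lambda x: 1.0` every item is kept and the estimate is exact; smaller probabilities trade accuracy for sample size, with the estimate remaining unbiased.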
Independent Stream Sampling
◊ Bernoulli Sampling
– IID sampling of objects with some probability p
– Sampled weight x has HT estimate x/p
◊ Poisson Sampling
– Weight xi sampled with probability pi; HT estimate xi/pi
◊ When to use Poisson vs. Bernoulli sampling?
– Elephants and mice: Poisson allows probability to depend on weight…
◊ What is the best choice of probabilities for a given stream {xi}?
Bernoulli Sampling
◊ The easiest possible case of sampling: all weights are 1
– N objects, and want to sample k from them uniformly
– Each possible subset of k should be equally likely
◊ Uniformly sample an index from N (without replacement) k times
– Some subtleties: truly random numbers from [1…N] on a computer?
– Assume that random number generators are good enough
◊ Common trick in DB: assign a random number to each item and sort
– Costly if N is very big, but so is random access
◊ Interesting problem: take a single linear scan of data to draw sample
– Streaming model of computation: see each element once
– Application: IP flow sampling, too many (for us) to store
– (For a while) common tech interview question
Reservoir Sampling
"Reservoir sampling" described by [Knuth 69, 81]; enhancements [Vitter 85]
◊ Fixed-size k uniform sample from arbitrary size N stream in one pass
– No need to know stream size in advance
– Include first k items w.p. 1
– Include item n > k with probability pn = k/n
□ Pick j uniformly from {1, 2, …, n}
□ If j ≤ k, swap item n into location j in reservoir, discard replaced item
◊ Neat proof shows the uniformity of the sampling method:
– Let Sn = sample set after n arrivals
– New item: selection probability Pr[n ∈ Sn] = pn := k/n
– Previously sampled item m (< n): by induction,
m ∈ Sn−1 w.p. pn−1 ⇒ m ∈ Sn w.p. pn−1 · (1 − pn/k) = pn
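The algorithm above is a few lines in any language; a minimal Python sketch (not the tutorial's own code):

```python
import random

def reservoir_sample(stream, k):
    """One-pass uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)       # first k items included w.p. 1
        else:
            j = random.randrange(n)      # uniform over {0, ..., n-1}
            if j < k:                    # happens with probability k/n
                reservoir[j] = item      # swap in, discard replaced item
    return reservoir
```

If the stream has fewer than k items, the whole stream is returned; otherwise every item ends up in the final sample with probability exactly k/N.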
Reservoir Sampling: Skip Counting
◊ Simple approach: check each item in turn
– O(1) per item
– Fine if computation time < interarrival time
– Otherwise build up computation backlog O(N)
◊ Better: "skip counting"
– Find random index m(n) of next selection > n
– Distribution: Pr[m(n) ≤ m] = 1 − (1 − pn+1)·(1 − pn+2)·…·(1 − pm)
◊ Expected number of selections from stream is k + Σ_{k<m≤N} pm = k + Σ_{k<m≤N} k/m = O(k(1 + ln(N/k)))
◊ Vitter '85 provided an algorithm with this average running time
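The skip distribution above can be sampled by inversion: draw one uniform value and walk the running product of non-selection probabilities until the CDF crosses it. This is an illustrative sketch only (one random draw per selection), not Vitter's optimized Algorithm Z:

```python
import random

def next_selection(n, k):
    """Draw the index m(n) > n of the next reservoir selection from
    Pr[m(n) <= m] = 1 - prod_{j=n+1..m} (1 - k/j), by CDF inversion.
    Assumes n >= k, so each p_j = k/j is a valid probability."""
    u = random.random()
    survive = 1.0                        # running product of (1 - k/j)
    m = n
    while True:
        m += 1
        survive *= 1.0 - k / m
        if 1.0 - survive >= u:           # CDF has crossed the uniform draw
            return m
```

The loop always terminates because the product tends to 0 (Σ k/j diverges), so the CDF tends to 1.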
Reservoir Sampling via Order Sampling
◊ Order sampling a.k.a. bottom-k sample, min-hashing
◊ Uniform sampling of stream into reservoir of size k
◊ Each arrival n: generate one-time random value rn ∈ U[0,1]
– rn also known as hash, rank, tag…
◊ Store the k items with the smallest random tags
Example tags: 0.391 0.908 0.291 0.555 0.619 0.273
– Each item has the same chance of least tag, so uniform
– Fast to implement via priority queue
– Can run on multiple input streams separately, then merge
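A priority-queue implementation of this idea might look as follows (an illustrative sketch; tags are returned alongside items so that samples from separate streams can be merged by keeping the k smallest tags overall):

```python
import heapq
import random

def order_sample(stream, k):
    """Keep the k items with the smallest random tags.
    A max-heap on tags (stored negated) gives O(log k) per update."""
    heap = []                            # entries are (-tag, item)
    for item in stream:
        tag = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (-tag, item))
        elif tag < -heap[0][0]:          # beats the largest kept tag
            heapq.heapreplace(heap, (-tag, item))
    return [(-neg, item) for neg, item in heap]
```

To merge bottom-k samples from several streams, concatenate the (tag, item) lists and again retain the k smallest tags.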
Handling Weights
◊ So far: uniform sampling from a stream using a reservoir
◊ Extend to non-uniform sampling from weighted streams
– Easy case: k = 1
– Sampling probability p(n) = xn/Wn where Wn = Σ_{i=1}^{n} xi
◊ k > 1 is harder
– Can have elements with large weight: would be sampled with prob 1?
◊ Number of different weighted order-sampling schemes proposed to realize desired distributional objectives
– Rank rn = f(un, xn) for some function f and un ∈ U[0,1]
– k-mins sketches [Cohen 1997], Bottom-k sketches [Cohen Kaplan 2007]
– [Rosen 1972], Weighted random sampling [Efraimidis Spirakis 2006]
– Order PPS Sampling [Ohlsson 1990, Rosen 1997]
– Priority Sampling [Duffield Lund Thorup 2004], [Alon+DLT 2005]
Weighted random sampling
◊ Weighted random sampling [Efraimidis Spirakis 06] generalizes min-wise
– For each item draw rn uniformly at random in range [0,1]
– Compute the 'tag' of an item as rn^(1/xn)
– Keep the items with the k largest tags (equivalently: tag −ln(rn)/xn, keep the k smallest)
– Can prove the correctness of the exponential sampling distribution
◊ Can also make efficient via skip counting ideas
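The Efraimidis-Spirakis scheme drops into the same priority-queue skeleton as uniform order sampling; a hedged sketch (not the paper's code):

```python
import heapq
import random

def weighted_order_sample(stream, k):
    """Efraimidis-Spirakis weighted sampling without replacement:
    per item draw u uniform in [0,1), tag u**(1/x), keep k largest tags."""
    heap = []                            # min-heap of (tag, item)
    for item, x in stream:
        tag = random.random() ** (1.0 / x)
        if len(heap) < k:
            heapq.heappush(heap, (tag, item))
        elif tag > heap[0][0]:           # beats the smallest kept tag
            heapq.heapreplace(heap, (tag, item))
    return [item for _, item in heap]
```

Heavier items get exponents closer to 0, hence tags closer to 1, so they are more likely to survive, with inclusion probabilities matching successive weighted sampling without replacement.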
Priority Sampling
◊ Each item xi given priority zi = xi/ri with ri uniform random in (0,1]
◊ Maintain reservoir of k+1 items (xi, zi) of highest priority
◊ Estimation
– Let z* = (k+1)st highest priority
– Top-k priority items: weight estimate x′i = max{xi, z*}
– All other items: weight estimate zero
◊ Statistics and bounds
– x′i unbiased; zero covariance: Cov[x′i, x′j] = 0 for i ≠ j
– Relative variance for any subset sum ≤ 1/(k−1) [Szegedy, 2006]
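A direct rendering of the scheme above (an illustrative sketch, assuming the stream holds more than k+1 items so that z* is well defined):

```python
import heapq
import random

def priority_sample(stream, k):
    """Priority sampling: priority z_i = x_i / r_i, r_i uniform in (0,1].
    Keep the k+1 highest priorities; the top-k items get weight estimate
    max(x_i, z*), where z* is the (k+1)-st highest priority."""
    heap = []                            # min-heap of (priority, weight)
    for x in stream:
        r = 1.0 - random.random()        # uniform in (0, 1]: avoids r = 0
        z = x / r
        if len(heap) < k + 1:
            heapq.heappush(heap, (z, x))
        elif z > heap[0][0]:
            heapq.heapreplace(heap, (z, x))
    z_star = heap[0][0]                  # (k+1)-st highest priority
    return [max(x, z_star) for z, x in heap[1:]]   # top-k estimates
```

All unsampled items implicitly receive estimate zero; summing the returned estimates over any key subset gives the unbiased subset-sum estimate.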
Priority Sampling in Databases
◊ One-Time Sample Preparation
– Compute priorities of all items, sort in decreasing priority order
□ No discard
◊ Sample and Estimate
– Estimate any subset sum X(S) = Σ_{i∈S} xi by X′(S) = Σ_{i∈S′} x′i for some S′ ⊂ S
– Method: select items in decreasing priority order
◊ Two variants: bounded variance or complexity
1. S′ = first k items from S: relative variance bounded ≤ 1/(k−1)
□ x′i = max{xi, z*} where z* = (k+1)st highest priority in S
2. S′ = items from S in first k: execution time O(k)
□ x′i = max{xi, z*} where z* = (k+1)st highest priority
[Alon et al., 2005]
Making Stream Samples Smarter
◊ Observation: we see the whole stream, even if we can't store it
– Can keep more information about sampled items if repeated
– Simple information: if item sampled, count all repeats
◊ Counting Samples [Gibbons & Matias 98]
– Sample new items with fixed probability p, count repeats as ci
– Unbiased estimate of total count: 1/p + (ci − 1)
◊ Sample and Hold [Estan & Varghese 02]: generalize to weighted keys
– New key with weight b sampled with probability 1 − (1−p)^b
◊ Lower variance compared with independent sampling
– But sample size will grow as pn
◊ Adaptive sample and hold: reduce p when needed
– "Sticky sampling": geometric decreases in p [Manku, Motwani 02]
– Much subsequent work tuning decrease in p to maintain sample size
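The counting-samples idea fits in a few lines (an illustrative sketch of the unweighted case):

```python
import random

def counting_sample(stream, p):
    """Counting samples: a key enters the sample with probability p on
    any occurrence; once held, every repeat is counted exactly."""
    counts = {}
    for key in stream:
        if key in counts:
            counts[key] += 1             # already held: count exactly
        elif random.random() < p:
            counts[key] = 1              # sampled on this occurrence
    # Unbiased estimate of a sampled key's total count: 1/p + (c - 1)
    return {k: 1.0 / p + (c - 1) for k, c in counts.items()}
```

The 1/p term accounts, in expectation, for the occurrences seen before the key was first sampled; with p = 1 the estimates reduce to exact counts.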
Sketch Guided Sampling
◊ Go further: avoid sampling the heavy keys as much
– Uniform sampling will pick from the heavy keys again and again
◊ Idea: use an oracle to tell when a key is heavy [Kumar Xu 06]
– Adjust sampling probability accordingly
◊ Can use a "sketch" data structure to play the role of oracle
– Like a hash table with collisions, tracks approximate frequencies
– E.g. (Counting) Bloom Filters, Count-Min Sketch
◊ Track probability with which key is sampled, use HT estimators
– Set probability of sampling a key with (estimated) weight w as 1/(1 + εw) for parameter ε: decreases as w increases
– Decreasing ε improves accuracy, increases sample size
Challenges for Smart Stream Sampling
◊ Current router constraints
– Flow tables maintained in fast, expensive SRAM
□ To support per-packet key lookup at line rate
◊ Implementation requirements
– Sample and Hold: still needs per-packet lookup
– Sampled NetFlow: (uniform) sampling reduces lookup rate
□ Easier to implement despite inferior statistical properties
◊ Long development times to realize new sampling algorithms
◊ Similar concerns affect sampling in other applications
– Processing large amounts of data needs awareness of hardware
– Uniform sampling means no coordination needed in distributed setting
Future for Smarter Stream Sampling
◊ Software Defined Networking
– Current: proprietary software running on special vendor equipment
– Future: open software and protocols on commodity hardware
◊ Potentially offers flexibility in traffic measurement
– Allocate system resources to measurement tasks as needed
– Dynamic reconfiguration, fine-grained tuning of sampling
– Stateful packet inspection and sampling for network security
◊ Technical challenges:
– High rate packet processing in software
– Transparent support from commodity hardware
– OpenSketch: [Yu, Jose, Miao, 2013]
◊ Same issues in other applications: use of commodity programmable HW
Stream Sampling: Sampling as Cost Optimization
Matching Data to Sampling Analysis
◊ Generic problem 1: counting objects: weight xi = 1
– Bernoulli (uniform) sampling with probability p works fine
– Estimated subset count X′(S) = #{samples in S}/p
– Relative Variance(X′(S)) = (1/p − 1)/X(S)
□ given p, get any desired accuracy for large enough S
◊ Generic problem 2: xi in Pareto distribution, a.k.a. 80-20 law
– Small proportion of objects possess a large proportion of total weight
□ How best to sample objects to accurately estimate weight?
– Uniform sampling?
□ likely to omit heavy objects ⇒ big hit on accuracy
□ making selection set S large doesn't help
– Select m largest objects?
□ biased & smaller objects systematically ignored
Heavy Tails in the Internet and Beyond
◊ File sizes in storage
◊ Bytes and packets per network flow
◊ Degree distributions in web graph, social networks
Non-Uniform Sampling
◊ Extensive literature: see book by [Tille, "Sampling Algorithms", 2006]
◊ Predates "Big Data"
– Focus on statistical properties, not so much computational
◊ IPPS: Inclusion Probability Proportional to Size
– Variance Optimal for HT Estimation
– Sampling probabilities for multivariate version: [Chao 1982, Tille 1996]
– Efficient stream sampling algorithm: [Cohen et al. 2009]
Costs of Non-Uniform Sampling
◊ Independent sampling from n objects with weights {x1, …, xn}
◊ Goal: find the "best" sampling probabilities {p1, …, pn}
◊ Horvitz-Thompson: unbiased estimation of each xi by
x′i = xi/pi if i selected, 0 otherwise
◊ Two costs to balance:
1. Estimation Variance: Var(x′i) = xi²(1/pi − 1)
2. Expected Sample Size: Σi pi
◊ Minimize Linear Combination Cost: Σi (xi²(1/pi − 1) + z² pi)
– z expresses relative importance of small sample vs. small variance
Minimal Cost Sampling: IPPS
IPPS: Inclusion Probability Proportional to Size
◊ Minimize Cost Σi (xi²(1/pi − 1) + z² pi) subject to 1 ≥ pi ≥ 0
◊ Solution: pi = pz(xi) = min{1, xi/z}
– small objects (xi < z) selected with probability proportional to size
– large objects (xi ≥ z) selected with probability 1
– Call z the "sampling threshold"
– Unbiased estimator xi/pi = max{xi, z}
◊ Perhaps reminiscent of importance sampling, but not the same:
– make no assumptions concerning distribution of the x
[Figure: pz(x) rises linearly from 0, reaching 1 at x = z]
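With the threshold fixed, IPPS (threshold) sampling is a one-liner per item; a minimal sketch:

```python
import random

def ipps_sample(weights, z):
    """Threshold (IPPS) sampling: include weight x with probability
    p_z(x) = min(1, x/z); the HT estimate of an included item is
    x / p_z(x) = max(x, z)."""
    sample = []
    for x in weights:
        if x >= z or random.random() < x / z:
            sample.append(max(x, z))     # unbiased weight estimate
    return sample
```

Items at or above the threshold are always kept with their exact weight; small items, when kept, are boosted to z so the estimate stays unbiased.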
Error Estimates and Bounds
◊ Variance based:
– HT sampling variance for single object of weight xi
□ Var(x′i) = xi²(1/pi − 1) = xi²(1/min{1, xi/z} − 1) ≤ z·xi
– Subset sum X(S) = Σ_{i∈S} xi is estimated by X′(S) = Σ_{i∈S} x′i
□ Var(X′(S)) ≤ z·X(S)
◊ Exponential bounds
– E.g. Pr[X′(S) = 0] ≤ exp(−X(S)/z)
◊ Bounds are simple and powerful
– depend only on subset sum X(S), not individual constituents
Sampled IP Traffic Measurements
◊ Packet Sampled NetFlow
– Sample packet stream in router to limit rate of key lookup: uniform 1/N
– Aggregate sampled packets into flow records by key
◊ Model: packet stream of (key, bytesize) pairs {(bi, ki)}
◊ Packet sampled flow record (b, k) where b = Σ{bi : i sampled ∧ ki = k}
– HT estimate b·N of total bytes in flow
◊ Downstream sampling of flow records in measurement infrastructure
– IPPS sampling, probability min{1, b·N/z}
◊ Chained variance bound for any subset sum X of flows
– Var(X′) ≤ (z + N·bmax)·X where bmax = maximum packet byte size
– Regardless of how packets are distributed amongst flows
[Duffield, Lund, Thorup, IEEE ToIT, 2004]
Estimation Accuracy in Practice
◊ Estimate any subset sum comprising at least some fraction f of weight
◊ Suppose: sample size m
◊ Analysis: typical estimation error ε (relative standard deviation) obeys ε ≈ 1/√(f·m)
◊ m = 2^16 samples = storage needed for aggregates over 16-bit address prefixes
□ But sampling gives more flexibility to estimate traffic within aggregates
◊ Example: a fraction f = 0.1% is estimated with typical relative error around 12%
[Figure: log-log plot of relative standard deviation ε vs. fraction f, for m = 2^16 samples]
Heavy Hitters: Exact vs. Aggregate vs. Sampled
◊ Sampling does not tell you where the interesting features are
– But does speed up the ability to find them with existing tools
◊ Example: Heavy Hitter Detection
– Setting: flow records reporting 10GB/s traffic stream
– Aim: find Heavy Hitters = IP prefixes comprising ≥ 0.1% of traffic
– Response time needed: 5 minutes
◊ Compare:
– Exact: 10GB/s × 5 minutes yields upwards of 300M flow records
– Aggregate: 64k aggregates over 16-bit prefixes: no deeper drill-down possible
– Sampled: 64k flow records: any aggregate ≥ 0.1% accurate to 10%
Cost Optimization for Sampling
Several different approaches optimize for different objectives:
1. Fixed Sample Size IPPS Sample
– Variance Optimal sampling: minimal variance unbiased estimation
2. Structure Aware Sampling
– Improve estimation accuracy for subnet queries using topological cost
3. Fair Sampling
– Adaptively balance sampling budget over subpopulations of flows
– Uniform estimation accuracy regardless of subpopulation size
4. Stable Sampling
– Increase stability of sample set by imposing cost on changes
IPPS Stream Reservoir Sampling
◊ Each arriving item:
– Provisionally include item in reservoir
– If m+1 items, discard 1 item randomly
□ Calculate threshold z to sample m items on average: z solves Σi pz(xi) = m
□ Discard item i with probability qi = 1 − pz(xi)
□ Adjust m surviving xi with Horvitz-Thompson: x′i = xi/pi = max{xi, z}
◊ Efficient implementation:
– Computational cost O(log m) per item, amortized cost O(log log m)
[Cohen, Duffield, Lund, Kaplan, Thorup; SODA 2009, SIAM J. Comput. 2011]
[Figure: example with m = 9 – on arrival of x10, recalculate the threshold z solving Σ_{i=1}^{10} min{1, xi/z} = 9, discard one item with probabilities qi = 1 − min{1, xi/z}, and adjust surviving weights to x′i = max{xi, z}]
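The threshold recalculation step can be sketched directly: sort weights, and find how many of the largest items must be kept with probability 1 before the remainder can share the rest of the budget. This is an illustrative batch solver only (the full VarOpt algorithm of Cohen et al. maintains z incrementally in O(log m) per item):

```python
def ipps_threshold(weights, m):
    """Solve sum_i min(1, x_i / z) = m for z (assumes len(weights) > m)."""
    xs = sorted(weights, reverse=True)
    tail = sum(xs)                       # weight below the top t items
    for t in range(m):
        z = tail / (m - t)               # candidate: top t items get p = 1
        if (t == 0 or xs[t - 1] >= z) and xs[t] < z:
            return z                     # consistent split found
        tail -= xs[t]
    return xs[m - 1]                     # degenerate fallback
```

For example, weights [8, 4, 2, 1] with m = 3 give z = 3: the two largest items are kept surely and min{1, 2/3} + min{1, 1/3} fills the remaining unit of expected sample size.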
Structure (Un)Aware Sampling
◊ Sampling is oblivious to structure in keys (IP address hierarchy)
– Estimation disperses the weight of discarded items to surviving samples
◊ Queries structure aware: subset sums over related keys (IP subnets)
– Accuracy on LHS is decreased by discarding weight on RHS
[Figure: binary tree over key prefixes – ∅; 0, 1; 00, 01, 10, 11; 000 … 111]
Localizing Weight Redistribution
◊ Initial weight set {xi : i ∈ S} for some S ⊂ Ω
– E.g. Ω = possible IP addresses, S = observed IP addresses
◊ Attribute "range cost" C({xi : i ∈ R}) for each weight subset R ⊆ S
– Possible factors for range cost:
□ Sampling variance
□ Topology, e.g. height of lowest common ancestor
– Heuristic: R* = nearest neighbor {xi, xj} of minimal xi·xj
◊ Sample k items from S:
– Progressively remove one item from the subset with minimal range cost:
– While (|S| > k)
□ Find R* ⊆ S of minimal range cost
□ Remove a weight from R* w/ VarOpt
[Cohen, Cormode, Duffield; PVLDB 2011]
– No change outside subtree below closest ancestor
– Order of magnitude reduction in average subnet error vs. VarOpt
[Figure: binary tree over key prefixes, weight redistribution localized within a subtree]
Fair Sampling Across Subpopulations
◊ Analysis queries often focus on specific subpopulations
– E.g. networking: different customers, user applications, network paths
◊ Wide variation in subpopulation size
– 5 orders of magnitude variation in traffic on interfaces of access router
◊ If uniform sampling across subpopulations:
– Poor estimation accuracy on subset sums within small subpopulations
[Figure: color = subpopulation; interesting items occur proportional to subpopulation size. Under uniform sampling across subpopulations it is difficult to track the proportion of interesting items within small subpopulations]
Fair Sampling Across Subpopulations
◊ Minimize relative variance by sharing budget m over subpopulations
– Total n objects in subpopulations n1, …, nd with Σi ni = n
– Allocate budget mi to each subpopulation ni with Σi mi = m
◊ Minimize average population relative variance R = const. · Σi 1/mi
◊ Theorem:
– R minimized when {mi} are Max-Min Fair share of m under demands {ni}
◊ Streaming
– Problem: don't know subpopulation sizes {ni} in advance
◊ Solution: progressive fair sharing as reservoir sample
– Provisionally include each arrival
– Discard 1 item as VarOpt sample from any maximal subpopulation
◊ Theorem [Duffield; Sigmetrics 2012]:
– Max-Min Fair at all times; equality in distribution with VarOpt samples {mi from ni}
Stable Sampling
◊ Setting: sampling a population over successive periods
◊ Sample independently at each time period?
– Cost associated with sample churn
– Time series analysis of set of relatively stable keys
◊ Find sampling probabilities through cost minimization
– Minimize Cost = Estimation Variance + z · E[#Churn]
◊ Size m sample with maximal expected churn D
– weights {xi}, previous sampling probabilities {pi}
– find new sampling probabilities {qi} to minimize cost of taking m samples
– Minimize Σi xi²/qi subject to 1 ≥ qi ≥ 0, Σi qi = m and Σi |pi − qi| ≤ D
[Cohen, Cormode, Duffield, Lund 13]
Summary of Part 1
◊ Sampling as a powerful, general summarization technique
◊ Unbiased estimation via Horvitz-Thompson estimators
◊ Sampling from streams of data
– Uniform sampling: reservoir sampling
– Weighted generalizations: sample and hold, counting samples
◊ Advances in stream sampling
– The cost principle for sample design, and IPPS methods
– Threshold, priority and VarOpt sampling
– Extending the cost principle:
□ structure aware, fair sampling, stable sampling, sketch guided
Outline
◊ Motivating application: sampling in large ISP networks
◊ Basics of sampling: concepts and estimation
◊ Stream sampling: uniform and weighted case
– Variations: Concise sampling, sample and hold, sketch guided
BREAK
◊ Advanced stream sampling: sampling as cost optimization
– VarOpt, priority, structure aware, and stable sampling
◊ Hashing and coordination
– Bottom-k, consistent sampling and sketch-based sampling
◊ Graph sampling
– Node, edge and subgraph sampling
◊ Conclusion and future directions
Data Scale: Hashing and Coordination
Sampling from the set of items
◊ Sometimes need to sample from the distinct set of objects
– Not influenced by the weight or number of occurrences
– E.g. sample from the distinct set of flows, regardless of weight
◊ Need sampling method that is invariant to duplicates
◊ Basic idea: build a function to determine what to sample
– A "random" function f(k) → R
– Use f(k) to make a sampling decision: consistent decision for same key
Permanent Random Numbers
◊ Often convenient to think of f as giving "permanent random numbers"
– Permanent: assigned once and for all
– Random: treat as if fully randomly chosen
◊ The permanent random number is used in multiple sampling steps
– Same "random" number each time, so consistent (correlated) decisions
◊ Example: use PRNs to draw a sample of s from N via order sampling
– If s << N, small chance of seeing same element in different samples
– Via PRNs, stronger chance of seeing same element
□ Can track properties over time, gives a form of stability
◊ Easiest way to generate PRNs: apply a hash function to the element id
– Ensures PRN can be generated with minimal coordination
– Explicitly storing a random number for all observed keys does not scale
Hash Functions
Many possible choices of hashing functions:
◊ Cryptographic hash functions: SHA-1, MD5, etc.
– Results appear "random" for most tests (using seed/salt)
– Can be slow for high speed / high volume applications
– Full power of cryptographic security not needed for most statistical purposes
□ Although possible some trade-offs in robustness to subversion if not used
◊ Heuristic hash functions: srand(), mod
– Usually pretty fast
– May not be random enough: structure in keys may cause collisions
◊ Mathematical hash functions: universal hashing, k-wise hashing
– Have precise mathematical properties on probabilities
– Can be implemented to be very fast
Mathematical Hashing
◊ k-wise independence: Pr[h(x1) = y1 ∧ h(x2) = y2 ∧ … ∧ h(xt) = yt] = 1/R^t
– Simple function: ct·x^t + ct−1·x^(t−1) + … + c1·x + c0 mod P
– For fixed prime P, randomly chosen c0 … ct
– Can be made very fast (choose P to be a Mersenne prime to simplify mods)
◊ (Twisted) tabulation hashing [Thorup Patrascu 13]
– Interpret each key as a sequence of short characters, e.g. 8×8 bits
– Use a "truly random" look-up table for each character (so 8×256 entries)
– Take the exclusive-OR of the relevant table values
– Fast, and fairly compact
– Strong enough for many applications of hashing (hash tables etc.)
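The polynomial construction above can be sketched as follows (an illustrative version; mapping the polynomial's output into [0, R) by a final mod introduces a small bias unless R divides P):

```python
import random

MERSENNE_P = (1 << 61) - 1               # Mersenne prime 2^61 - 1

def make_kwise_hash(t, R, seed=None):
    """t-wise independent hash into [0, R): a random degree-(t-1)
    polynomial over GF(P), reduced mod R."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(MERSENNE_P) for _ in range(t)]
    def h(x):
        acc = 0
        for c in coeffs:                 # Horner's rule, all arithmetic mod P
            acc = (acc * x + c) % MERSENNE_P
        return acc % R
    return h
```

The same seed reproduces the same function, which is exactly the "permanent random number" behavior needed for coordinated sampling.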
Bottom-k sampling
◊ Sample from the set of distinct keys
– Hash each key using appropriate hash function
– Keep information on the keys with the s smallest hash values
– Think of as order sampling with PRNs…
◊ Useful for estimating properties of the support set of keys
– Evaluate any predicate on the sampled set of keys
◊ Same concept, several different names:
– Bottom-k sampling, Min-wise hashing, K-minimum values
Example hashes (repeated keys get the same value): 0.391 0.908 0.291 0.391 0.391 0.273
Subset Size Estimation from Bottom-k
◊ Want to estimate the fraction t = |A|/|D|
– D is the observed set of data
– A is an arbitrary subset given later
– E.g. fraction of customers who are sports fans from midwest aged 18-35
◊ Simple algorithm:
– Run bottom-k to get sample set S, estimate t′ = |A ∩ S|/s
– Error decreases as 1/√s
– Analysis due to [Thorup 13]: simple hash functions suffice for big enough s
Similarity Estimation
◊ How similar are two sets, A and B?
◊ Jaccard coefficient: |A ∩ B|/|A ∪ B|
– 1 if A, B identical, 0 if they are disjoint
– Widely used, e.g. to measure document similarity
◊ Simple approach: sample an item uniformly from A and B
– Probability of seeing same item from both: |A ∩ B|/(|A| × |B|)
– Chance of seeing same item too low to be informative
◊ Coordinated sampling: use same hash function to sample from A, B
– Probability that same item sampled: |A ∩ B|/|A ∪ B|
– Repeat: the average number of agreements gives Jaccard coefficient
– Concentration: (additive) error scales as 1/√s
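Coordinated min-hash sampling can be sketched as below (illustrative only: the salted built-in `hash` stands in for the approximately min-wise hash functions the next slide calls for, and is adequate only as a demonstration):

```python
import random

def make_hashes(s, seed=0):
    """s salted hash functions (stand-ins for approximately min-wise hashes)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(s)]
    return [lambda x, m=m: hash((m, x)) for m in salts]

def minhash_signature(items, hashes):
    """Coordinated sampling: each hash samples the item minimizing it."""
    return [min(items, key=h) for h in hashes]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of coordinates where both sets sampled the same item."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Because both sets are sampled with the same hash functions, each coordinate agrees with probability exactly |A ∩ B|/|A ∪ B| (under fully random hashing), and averaging over s coordinates concentrates the estimate.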
Technical Issue: Min-wise hashing
◊ For analysis to work, the hash function must be fully random
– All possible permutations of the input are equally likely
– Unrealistic in practice: description of such a function is huge
◊ "Simple" hash functions don't work well
– Universal hash functions are too skewed
◊ Need hash functions that are "approximately min-wise"
– Probability of sampling a subset is almost uniform
– Tabulation hashing is a simple way to achieve this
Bottom-k hashing for F0 Estimation
◊ F0 is the number of distinct items in the stream
– a fundamental quantity with many applications
– E.g. number of distinct flows seen on a backbone link
◊ Let m be the domain of stream elements: each data item is in [1…m]
◊ Pick a random (pairwise) hash function h: [m] → [R]
◊ Apply bottom-k sampling under hash function h
– Let vs = s'th smallest (distinct) value of h(i) seen
◊ If n = F0 < s, give exact answer, else estimate F′0 = sR/vs
– vs/R ≈ fraction of hash domain occupied by s smallest
[Figure: hash values spread over [0, R]; vs marks the s-th smallest]
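The estimator can be sketched end to end (an illustrative version using an affine pairwise hash; it ignores the measure-zero case vs = 0):

```python
import random

def f0_estimate(stream, s, R=1 << 32, seed=0):
    """Bottom-s distinct-count estimate: F0' = s*R / v_s, where v_s is
    the s-th smallest pairwise-independent hash value seen."""
    P = (1 << 61) - 1                    # prime larger than R
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)
    values = sorted({((a * x + b) % P) % R for x in stream})
    if len(values) < s:
        return len(values)               # fewer than s distinct: exact
    return s * R / values[s - 1]         # v_s = s-th smallest hash value
```

Only the s smallest hash values need to be retained, so a streaming version keeps a size-s heap rather than the full set.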
Analysis of F0 algorithm
◊ Can show that it is unlikely to have an overestimate
– Too many items hashed below a fixed value
– Can treat each event of an item hashing too low as independent
◊ Similar outline to show unlikely to have an underestimate
◊ (Relative) error scales as 1/√s
◊ Space cost:
– Store s hash values, so O(s log m) bits
– Can improve to O(s + log m) with additional hashing tricks
– See also "Streamed Approximate Counting of Distinct Elements", KDD'14
Consistent Weighted Sampling
◊ Want to extend bottom-k results when data has weights
◊ Specifically, two data sets A and B where each element has a weight
– Weights are aggregated: we see whole weight of an element together
◊ Weighted Jaccard: want probability that same key is chosen by both to be Σi min(A(i), B(i)) / Σi max(A(i), B(i))
◊ Sampling method should obey uniformity and consistency
– Uniformity: element i picked from A with probability proportional to A(i)
– Consistency: if i is picked from A, and B(i) > A(i), then i also picked for B
◊ Simple solution: assuming integer weights, treat weight A(i) as A(i) unique (different) copies of element i, apply bottom-k
– Limitations: slow, unscalable when weights can be large
– Need to rescale fractional weights to integral multiples
Consistent Weighted Sampling
◊ Efficient sampling distributions exist achieving uniformity and consistency
◊ Basic idea: consider a weight w as w/D different elements
– Compute the probability that any of these achieves the minimum value
– Study the limiting distribution as D → 0
◊ Consistent Weighted Sampling [Manasse, McSherry, Talwar 07], [Ioffe 10]
– Use hash of item to determine which points sampled via careful transform
– Many details needed to contain bit-precision, allow fast computation
◊ Other combinations of key weights are possible [Cohen Kaplan Sen 09]
– Min of weights, max of weights, sum of (absolute) differences
Trajectory Sampling
◊ Aims [Duffield Grossglauser 01]:
– Probe packets at each router they traverse
– Collate reports to infer link loss and latency
– Need to sample; independent sampling no use
◊ Hash-based sampling:
– All routers/packets: compute hash h of invariant packet fields
– Sample if h ∈ some H and report to collector; tune sample rate with |H|
– Use high entropy packet fields as hash input, e.g. IP addresses, ID field
– Hash function choice trade-off between speed, uniformity & security
◊ Standardized in Internet Engineering Task Force (IETF)
– Service providers need consistency across different vendors
– Several hash functions standardized, extensible
– Same issues arise in other big data ecosystems (apps and APIs)
Hash Sampling in Network Management
◊ Many different network subsystems used to provide service
– Monitored through event logs, passive measurement of traffic & protocols
– Need cross-system sample that captures full interaction between network and a representative set of users
◊ Ideal: hash-based selection based on common identifier
◊ Administrative challenges! Organizational diversity
◊ Timeliness challenge:
– Selection identifier may not be present at a measurement location
– Example: common identifier = anonymized customer id
□ Passive traffic measurement based on IP address
□ Mapping of IP address to customer ID not available remotely
□ Attribution of traffic IP address to a user difficult to compute at line speed
Advanced Sampling from Sketches
◊ Difficult case: inputs with positive and negative weights
◊ Want to sample based on the overall frequency distribution
– Sample from support set of n possible items
– Sample proportional to (absolute) total weights
– Sample proportional to some function of weights
◊ How to do this sampling effectively?
– Challenge: may be many elements with positive and negative weights
– Aggregate weights may end up zero: how to find the non-zero weights?
◊ Recent approach: L0 sampling
– L0 sampling enables novel "graph sketching" techniques
– Sketches for connectivity, sparsifiers [Ahn, Guha, McGregor 12]
Sampling for Big Data
L0 Sampling
◊ L0 sampling: sample i with prob ≈ fi0/F0
– i.e., sample (near) uniformly from items with non-zero frequency
◊ General approach [Frahling, Indyk, Sohler 05; C., Muthu, Rozenbaum 05]:
– Sub-sample all items (present or not) with probability p
– Generate a sub-sampled vector of frequencies fp
– Feed fp to a k-sparse recovery data structure
□ Allows reconstruction of fp if F0 < k
– If fp is k-sparse, sample from the reconstructed vector
– Repeat in parallel for exponentially shrinking values of p
Sampling Process
◊ Exponential set of probabilities, p = 1, ½, ¼, 1/8, 1/16, …, 1/U
– Let N = F0 = |{i : fi ≠ 0}|
– Want there to be a level where k-sparse recovery will succeed
– At level p, expected number of items selected S is Np
– Pick level p so that k/3 < Np ≤ 2k/3
◊ Chernoff bound: with probability exponential in k, 1 ≤ S ≤ k
– Pick k = O(log 1/δ) to get 1−δ probability
[Figure: parallel levels from p = 1 down to p = 1/U, each feeding a k-sparse recovery structure]
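A toy end-to-end sketch of this process, under two loudly labelled simplifications: each level's sub-sampled items are stored exactly in a dictionary (a stand-in for a real k-sparse recovery structure), and a string-seeded RNG plays the role of the hash that fixes each item's level consistently across updates:

```python
import random
from collections import defaultdict

def l0_sample(updates, k=20, levels=16, seed=0):
    """Toy L0 sampler over a stream of (item, weight) updates; weights
    may be negative.  Each item survives to level j with prob 2**-j."""
    def top_level(item):
        # Consistent per-item level, derived from a seeded RNG.
        r = random.Random(f"{seed}:{item}")
        j = 0
        while j < levels - 1 and r.random() < 0.5:
            j += 1
        return j

    sketches = [defaultdict(int) for _ in range(levels)]
    for item, w in updates:
        for j in range(top_level(item) + 1):
            sketches[j][item] += w

    # Use the first (densest) level whose non-zero support is k-sparse,
    # then pick uniformly from the recovered vector.
    rng = random.Random(seed)
    for level in sketches:
        support = [i for i, f in level.items() if f != 0]
        if 0 < len(support) <= k:
            return rng.choice(support)
    return None  # recovery failed at every level
```

Note that items whose positive and negative updates cancel out drop out of every level's support, which is exactly the behaviour the sketch-based approach needs.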
Hash-based sampling summary
◊ Use hash functions for sampling where some consistency is needed
– Consistency over repeated keys
– Consistency over distributed observations
◊ Hash functions have a duality of random and fixed
– Treat as random for statistical analysis
– Treat as fixed for giving consistency properties
◊ Can become quite complex and subtle
– Complex sampling distributions for consistent weighted sampling
– Tricky combination of algorithms for L0 sampling
◊ Plenty of scope for new hashing-based sampling methods
Data Scale: Massive Graph Sampling
Massive Graph Sampling
◊ "Graph Service Providers"
– Search providers: web graphs (billions of pages indexed)
– Online social networks
□ Facebook: ~10⁹ users (nodes), ~10¹² links
– ISPs: communications graphs
□ From flow records: node = src or dst IP, edge if traffic flows between them
◊ Graph service provider perspective
– Already have all the data, but how to use it?
– Want a general purpose sample that can:
□ Quickly provide answers to exploratory queries
□ Compactly archive snapshots for retrospective queries & baselining
◊ Graph consumer perspective
– Want to obtain a realistic subgraph directly or via crawling/API
Retrospective analysis of ISP graphs
◊ Node = IP address
◊ Directed edge = flow from source node to destination node
[Figure: botnet attack graph, with compromise, control and flooding stages]
• Hard to detect against background
• Known attacks can be detected:
– Signature matching based on partial graphs, flow features, timing
• Unknown attacks are harder to spot:
– exploratory & retrospective analysis
– preserve accuracy if sampling?
Goals for Graph Sampling
Crudely divide into three classes of goal:
1. Study local (node or edge) properties
– Average age of users (nodes), average length of conversation (edges)
2. Estimate global properties or parameters of the network
– Average degree, shortest path distribution
3. Sample a "representative" subgraph
– Test new algorithms and learn more quickly than on the full graph
◊ Challenges: what properties should the sample preserve?
– The notion of "representative" is very subjective
– Can list properties that should be preserved (e.g. degree dbn, path length dbn), but there are always more…
Models for Graph Sampling
Many possible models, but reduce to two for simplicity
(see tutorial by Hasan, Ahmed, Neville, Kompella in KDD 13)
◊ Static model: full access to the graph to draw the sample
– The (massive) graph is accessible in full to make the small sample
◊ Streaming model: edges arrive in some arbitrary order
– Must make sampling decisions on the fly
◊ Other graph models capture different access scenarios
– Crawling model: e.g. exploring the (deep) web, API gives node neighbours
– Adjacency list streaming: see all neighbours of a node together
Node and Edge Properties
◊ Gross over-generalization: node and edge properties can be solved using previous techniques
– Sample nodes/edges (in a stream)
– Handle duplicates (same edge many times) via hash-based sampling
– Track properties of sampled elements
□ E.g. count the degree of sampled nodes
◊ Some challenges. E.g. how to sample a node proportional to its degree?
– If degrees are known (precomputed), then use these as weights
– Else, sample edges uniformly, then sample each end with probability ½
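The edge-based trick in the last bullet can be sketched as below (a toy in-memory version; in a stream the uniform edge would come from a reservoir sample). Since node v is an endpoint of deg(v) edges, a uniform edge plus a fair coin over its endpoints yields P(v) = deg(v)/(2|E|):

```python
import random

def sample_node_by_degree(edges, rng):
    """Degree-proportional node sampling without precomputed degrees:
    pick a uniform edge, then a uniform endpoint of that edge."""
    u, v = rng.choice(edges)
    return u if rng.random() < 0.5 else v

# Path a-b-c: node 'b' has degree 2, so it appears about half the time,
# while 'a' and 'c' (degree 1) each appear about a quarter of the time.
rng = random.Random(42)
edges = [("a", "b"), ("b", "c")]
counts = {n: 0 for n in "abc"}
for _ in range(10000):
    counts[sample_node_by_degree(edges, rng)] += 1
```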
Induced subgraph sampling
◊ Node-induced subgraph
– Pass 1: sample a set of nodes (e.g. uniformly)
– Pass 2: collect all edges incident on sampled nodes
– Can collapse into a single streaming pass
– Can't know in advance how many edges will be sampled
◊ Edge-induced subgraph
– Sample a set of edges (e.g. uniformly in one pass)
– Resulting graph tends to be sparse, disconnected
◊ Edge-induced variant [Ahmed Neville Kompella 13]:
– Take a second pass to fill in edges on sampled nodes
– Hack: combine passes to fill in edges on the current sample
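The "collapse into a single streaming pass" idea works once node selection is made consistent, e.g. by hashing, so pass 1's explicit node list is never needed. A sketch (assuming the node-induced rule of keeping an edge only when both endpoints are sampled):

```python
import hashlib

def in_node_sample(node, p):
    """Consistent Bernoulli(p) membership test for a node."""
    h = hashlib.sha256(str(node).encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < p

def node_induced_subgraph(edge_stream, p):
    """Single pass: an edge joins the sample iff both endpoints are in
    the hash-defined node sample.  As the slide notes, the number of
    edges collected is not known in advance."""
    nodes, edges = set(), []
    for u, v in edge_stream:
        if in_node_sample(u, p) and in_node_sample(v, p):
            nodes.update((u, v))
            edges.append((u, v))
    return nodes, edges
```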
HT Estimators for Graphs
◊ Can construct HT estimators from uniform vertex samples [Frank 78]
– Evaluate the desired function on the sampled graph (e.g. average degree)
◊ For functions of edges (e.g. number of edges satisfying a property):
– Scale up accordingly, by N(N−1)/(k(k−1)) for sample size k on graph size N
– Variance of estimates can also be bounded in terms of N and k
◊ Similar for functions of three edges (triangles) and higher:
– Scale up by C(N,3)/C(k,3) ≈ 1/p³ to get an unbiased estimator
– High variance, so other sampling schemes have been developed
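The edge-function scale-up can be written directly; a small worked sketch for estimating |E|, using the fact that a given edge survives into the induced sample with probability k(k−1)/(N(N−1)):

```python
def ht_edge_count(edges, sampled_nodes, n_total):
    """Horvitz-Thompson estimator of the total edge count from a
    uniform sample of k of the N nodes: count edges with both
    endpoints sampled, then scale up by N(N-1)/(k(k-1))."""
    k = len(sampled_nodes)
    s = set(sampled_nodes)
    observed = sum(1 for u, v in edges if u in s and v in s)
    return observed * n_total * (n_total - 1) / (k * (k - 1))
```

On the complete graph K6 (15 edges), any 4-node sample sees 6 edges, and the estimator returns 6 · 30/12 = 15 exactly; on sparser graphs the estimate is unbiased but noisy, matching the variance remark above.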
Graph Sampling Heuristics
"Heuristics", since few formal statistical properties are known
◊ Breadth-first sampling: sample a node, then its neighbours…
– Biased towards high-degree nodes (more chances to reach them)
◊ Snowball sampling: generalize BF by picking many initial nodes
– Respondent-driven sampling: weight the snowball sample to give statistically sound estimates [Salganik Heckathorn 04]
◊ Forest-fire sampling: generalize BF by picking only a fraction of neighbours to explore [Leskovec Kleinberg Faloutsos 05]
– With probability p, move to a new node and "kill" the current node
◊ No "one true graph sampling method"
– Experiments show different preferences, depending on graph and metric [Leskovec, Faloutsos 06; Hasan, Ahmed, Neville, Kompella 13]
– None of these methods are "streaming friendly": they require a static graph
□ Hack: apply them to the stream of edges as-is
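A minimal sketch of forest-fire sampling in the static model, assuming the graph is an adjacency dict; in this toy variant each burning node spreads to each unvisited neighbour independently with probability p, re-igniting at a random node if the fire dies out before the target size is reached:

```python
import random

def forest_fire_sample(adj, target, p=0.7, seed=0):
    """Return a set of `target` nodes gathered by forest-fire burning."""
    rng = random.Random(seed)
    nodes = list(adj)
    target = min(target, len(nodes))
    visited = set()
    while len(visited) < target:
        start = rng.choice(nodes)     # (re)ignite at a random node
        visited.add(start)
        frontier = [start]
        while frontier and len(visited) < target:
            u = frontier.pop()
            for v in adj[u]:
                # Burn each unvisited neighbour with probability p.
                if v not in visited and rng.random() < p:
                    visited.add(v)
                    frontier.append(v)
    return visited
```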
Random Walk Sampling
◊ Random walks have proven very effective for many graph computations
– PageRank for node importance, and many variations
◊ A random walk is a natural model for sampling a node
– Perform a "long enough" random walk to pick a node
– How long is "long enough" (for mixing of the RW)?
– Can get "stuck" in a subgraph if the graph is not well-connected
– Costly to perform multiple random walks
– Highly non-streaming friendly, but suits graph crawling
◊ Multidimensional Random Walks [Ribeiro, Towsley 10]
– Pick k random nodes to initialize the sample
– Pick a random edge from the union of edges incident on the sample
– Can be viewed as a walk on a high-dimensional extension of the graph
– Outperforms running k independent random walks
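The multidimensional walk can be sketched as below; picking a walker with probability proportional to its degree and then a uniform neighbour is equivalent to drawing a uniform edge from the union of edges incident on the k current positions. This toy version assumes an undirected adjacency dict with no isolated nodes:

```python
import random

def frontier_walk(adj, k=3, steps=200, seed=0):
    """k walkers move as a single walk on a k-dimensional extension of
    the graph: each step advances the walker chosen proportionally to
    its degree along a uniform incident edge.  Returns visited nodes."""
    rng = random.Random(seed)
    walkers = rng.sample(list(adj), k)
    visited = []
    for _ in range(steps):
        degrees = [len(adj[w]) for w in walkers]
        # Degree-proportional walker choice = uniform incident edge.
        i = rng.choices(range(k), weights=degrees)[0]
        walkers[i] = rng.choice(adj[walkers[i]])
        visited.append(walkers[i])
    return visited
```

Keeping k positions means the sample is less likely to get trapped in one poorly connected region than a single walker would be.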
Subgraph estimation: counting triangles
◊ Hot topic: sample-based triangle counting
– Triangles: simplest non-trivial representation of node clustering
□ Regard as a prototype for more complex subgraphs of interest
– Measure of "clustering coefficient" in a graph, parameter in graph models…
◊ Uniform sampling performs poorly:
– Chance that randomly sampled edges happen to form a subgraph is ≈ 0
◊ Bias the sampling so that the desired subgraph is preferentially sampled
Subgraph Sampling in Streams
Want to sample one of the T triangles in a graph
◊ [Buriol et al 06]: sample an edge uniformly, then pick a node
– Scan for the edges that complete the triangle
– Probability of sampling a triangle is T/(|E|(|V|−2))
◊ [Pavan et al 13]: sample an edge, then sample an incident edge
– Scan for the edge that completes the triangle
– (After bias correction) probability of sampling a triangle is T/(|E|Δ)
□ Δ = max degree, considerably smaller than |V| in most graphs
◊ [Jha et al. KDD 2013]: sample edges, then sample pairs of incident edges
– Scan for edges that complete "wedges" (edge pairs incident on a vertex)
◊ Advert: Graph Sample and Hold [Ahmed, Duffield, Neville, Kompella, KDD 2014]
– General framework for subgraph counting, e.g. triangle counting
– Similar accuracy to previous state of the art, but using smaller storage
Graph Sampling Summary
◊ Sampling a representative graph from a massive graph is difficult!
◊ Current state of the art:
– Sample nodes/edges uniformly from a stream
– Heuristic sampling from static/streaming graphs
◊ Sampling enables subgraph sampling/counting
– Much effort devoted to triangles (smallest non-trivial subgraph)
◊ "Real" graphs are richer
– Different node and edge types, attributes on both
– Just scratching the surface of sampling realistic graphs
Current Directions in Sampling
Outline
◊ Motivating application: sampling in large ISP networks
◊ Basics of sampling: concepts and estimation
◊ Stream sampling: uniform and weighted case
– Variations: concise sampling, sample and hold, sketch guided
BREAK
◊ Advanced stream sampling: sampling as cost optimization
– VarOpt, priority, structure aware, and stable sampling
◊ Hashing and coordination
– Bottom-k, consistent sampling and sketch-based sampling
◊ Graph sampling
– Node, edge and subgraph sampling
◊ Conclusion and future directions
Role and Challenges for Sampling
◊ Matching
– Sampling mediates between data characteristics and analysis needs
– Example: sample from power-law distribution of bytes per flow…
□ but also make accurate estimates from samples
□ simple uniform sampling misses the large flows
◊ Balance
– Weighted sampling across key-functions: e.g. customers, network paths, geolocations
□ cover small customers, not just large
□ cover all network elements, not just highly utilized
◊ Consistency
– Sample all views of the same event, flow, customer, network element
□ across different datasets, at different times
□ independent sampling ⇒ small intersection of views
Sampling and Big Data Systems
◊ Sampling is still a useful tool in cluster computing
– Reduce the latency of experimental analysis and algorithm design
◊ Sampling as an operator is easy to implement in MapReduce
– For uniform or weighted sampling of tuples
◊ Graph computations are a core motivator of big data
– PageRank as a canonical big computation
– Graph-specific systems emerging (Pregel, LFgraph, GraphLab, Giraph…)
– But… sampling primitives not yet prevalent in evolving graph systems
◊ When to do the sampling?
– Option 1: sample as an initial step in the computation
□ Fold sampling into the initial "Map" step
– Option 2: sample to create a stored sample graph before computation
□ Allows more complex sampling, e.g. random walk sampling
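Folding the sample into the "Map" step can be as simple as a Bernoulli filter in each mapper; a sketch, using a hash of the record key so that independently running mappers (and reruns after task failures) make identical decisions:

```python
import hashlib

def sampling_mapper(records, rate):
    """Map-side sampling operator: emit a (key, value) record iff its
    key hashes into [0, rate).  Deterministic, so mapper retries and
    parallel shards stay mutually consistent."""
    for key, value in records:
        h = hashlib.sha256(str(key).encode()).digest()
        if int.from_bytes(h[:8], "big") / 2**64 < rate:
            yield key, value
```

Weighted variants follow the same pattern, comparing the hash value against a per-record threshold derived from its weight.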
Sampling + KDD
◊ The interplay between sampling and data mining is not well understood
– Need an understanding of how ML/DM algorithms are affected by sampling
– E.g. how big a sample is needed to build an accurate classifier?
– E.g. what sampling strategy optimizes cluster quality?
◊ Expect results to be method specific
– i.e. "IPPS + k-means" rather than "sample + cluster"
Sampling and Privacy
◊ Current focus on privacy-preserving data mining
– Deliver the promise of big data without sacrificing privacy?
– Opportunity for sampling to be part of the solution
◊ Naïve sampling provides "privacy in expectation"
– Your data remains private if you aren't included in the sample…
◊ Intuition: uncertainty introduced by sampling contributes to privacy
– This intuition can be formalized with different privacy models
◊ Sampling can be analyzed in the context of differential privacy
– Sampling alone does not provide differential privacy
– But applying a DP method to sampled data does guarantee privacy
– A tradeoff between sampling rate and privacy parameters
□ Sometimes, a lower sampling rate improves overall accuracy
Advert: Now Hiring…
◊ Nick Duffield, Texas A&M
– PhDs in big data, graph sampling
◊ Graham Cormode, University of Warwick, UK
– PhDs in big data summarization (graphs and matrices, funded by MSR)
– Postdocs in privacy and data modeling (funded by EC, AT&T)
Graham Cormode, [email protected]
Nick Duffield, Texas A&M, [email protected]