TRANSCRIPT
Graham Cormode, [email protected]
Nick Duffield, Texas A&M, [email protected]
Sampling for Big Data
Big Data
◊ "Big" data arises in many forms:
– Physical measurements: from science (physics, astronomy)
– Medical data: genetic sequences, detailed time series
– Activity data: GPS location, social network activity
– Business data: customer behavior tracking at fine detail
◊ Common themes:
– Data is large, and growing
– There are important patterns and trends in the data
– We don't fully know where to look or how to find them
Why Reduce?
◊ Although "big" data is about more than just the volume… most big data is big!
◊ It is not always possible to store the data in full
– Many applications (telecoms, ISPs, search engines) can't keep everything
◊ It is inconvenient to work with data in full
– Just because we can, doesn't mean we should
◊ It is faster to work with a compact summary
– Better to explore data on a laptop than a cluster
Why Sample?
◊ Sampling has an intuitive semantics
– We obtain a smaller data set with the same structure
◊ Estimating on a sample is often straightforward
– Run the analysis on the sample that you would on the full data
– Some rescaling/reweighting may be necessary
◊ Sampling is general and agnostic to the analysis to be done
– Other summary methods only work for certain computations
– Though sampling can be tuned to optimize some criteria
◊ Sampling is (usually) easy to understand
– So prevalent that we have an intuition about sampling
Alternatives to Sampling
◊ Sampling is not the only game in town
– Many other data reduction techniques by many names
◊ Dimensionality reduction methods
– PCA, SVD, eigenvalue/eigenvector decompositions
– Costly and slow to perform on big data
◊ "Sketching" techniques for streams of data
– Hash-based summaries via random projections
– Complex to understand and limited in function
◊ Other transform/dictionary-based summarization methods
– Wavelets, Fourier Transform, DCT, Histograms
– Not incrementally updatable, high overhead
◊ All worthy of study – in other tutorials
Health Warning: contains probabilities
◊ Will avoid detailed probability calculations, aim to give high-level descriptions and intuition
◊ But some probability basics are assumed
– Concepts of probability, expectation, variance of random variables
– Allude to concentration of measure (Exponential/Chernoff bounds)
◊ Feel free to ask questions about technical details along the way
Outline
◊ Motivating application: sampling in large ISP networks
◊ Basics of sampling: concepts and estimation
◊ Stream sampling: uniform and weighted case
– Variations: Concise sampling, sample and hold, sketch guided
BREAK
◊ Advanced stream sampling: sampling as cost optimization
– VarOpt, priority, structure aware, and stable sampling
◊ Hashing and coordination
– Bottom-k, consistent sampling and sketch-based sampling
◊ Graph sampling
– Node, edge and subgraph sampling
◊ Conclusion and future directions
Sampling as a Mediator of Constraints
Sampling mediates between:
– Data Characteristics (Heavy Tails, Correlations)
– Query Requirements (Ad Hoc, Accuracy, Aggregates, Speed)
– Resource Constraints (Bandwidth, Storage, CPU)
Motivating Application: ISP Data
◊ Will motivate many results with application to ISPs
◊ Many reasons to use such examples:
– Expertise: tutors from telecoms world
– Demand: many sampling methods developed in response to ISP needs
– Practice: sampling widely used in ISP monitoring, built into routers
– Prescience: ISPs were first to hit many "big data" problems
– Variety: many different places where sampling is needed
◊ First, a crash-course on ISP networks…
Structure of Large ISP Networks
[Figure: network diagram with components:]
– Peering with other ISPs
– Access Networks: Wireless, DSL, IPTV
– City-level Router Centers
– Backbone Links
– Downstream ISP and business customers
– Service and Data centers
– Network Management & Administration
Measuring the ISP Network: Data Sources
[Figure: network diagram spanning Peering, Access, Router Centers, Backbone, Business, Datacenters, Management, annotated with data sources:]
– Link Traffic Rates: aggregated per router interface
– Traffic Matrices: flow records from routers
– Loss & Latency: active probing
– Loss & Latency: round trip to edge
– Protocol Monitoring: routers, wireless
– Status Reports: device failures and transitions
– Customer Care Logs: reactive indicators of network performance
Why Summarize (ISP) Big Data?
◊ When transmission bandwidth for measurements is limited
– Not such a big issue in ISPs with in-band collection
◊ Typically raw accumulation is not feasible (even for nation states)
– High rate streaming data
– Maintain historical summaries for baselining, time series analysis
◊ To facilitate fast queries
– When infeasible to run exploratory queries over full data
◊ As part of hierarchical query infrastructure:
– Maintain full data over limited duration window
– Drill down into full data through one or more layers of summarization
Sampling has proved to be a flexible method to accomplish this
Data Scale: Summarization and Sampling
Traffic Measurement in the ISP Network
[Figure: network diagram – Access, Router Centers, Backbone, Business, Datacenters, Management]
– Traffic Matrices: flow records from routers
Massive Dataset: Flow Records
◊ IP Flow: set of packets with common key observed close in time
◊ Flow Key: IP src/dst address, TCP/UDP ports, ToS, … [64 to 104+ bits]
◊ Flow Records:
– Protocol-level summaries of flows, compiled and exported by routers
– Flow key, packet and byte counts, first/last packet time, some router state
– Realizations: Cisco NetFlow, IETF Standards
◊ Scale: 100s of TeraBytes of flow records are generated daily in a large ISP
◊ Used to manage network over range of timescales:
– Capacity planning (months), …, detecting network attacks (seconds)
◊ Analysis tasks
– Easy: time series of predetermined aggregates (e.g. address prefixes)
– Hard: fast queries over exploratory selectors, history, communications subgraphs
[Figure: packet stream partitioned in time into flow1 … flow4]
Flows, Flow Records and Sampling
◊ Two types of sampling used in practice for internet traffic:
1. Sampling packet stream in router prior to forming flow records
□ Limits the rate of lookups of packet key in flow cache
□ Realized as Packet Sampled NetFlow (more later…)
2. Downstream sampling of flow records in collection infrastructure
□ Limits transmission bandwidth, storage requirements
□ Realized in ISP measurement collection infrastructure (more later…)
◊ Two cases illustrative of general property
– Different underlying distributions require different sample designs
– Statistical optimality sometimes limited by implementation constraints
□ Availability of router storage, processing cycles
Abstraction: Keyed Data Streams
◊ Data Model: objects are keyed weights
– Objects (x, k): weight x; key k
□ Example 1: objects = packets, x = bytes, k = key (source/destination)
□ Example 2: objects = flows, x = packets or bytes, k = key
□ Example 3: objects = account updates, x = credit/debit, k = account ID
◊ Stream of keyed weights, {(xi, ki) : i = 1, 2, …, n}
◊ Generic query: subset sums
– X(S) = Σ_{i∈S} xi for S ⊂ {1, 2, …, n}, i.e. total weight of index subset S
– Typically S = S(K) = {i : ki ∈ K}: objects with keys in K
□ Examples 1, 2: X(S(K)) = total bytes to given IP dest address / UDP port
□ Example 3: X(S(K)) = total balance change over set of accounts
◊ Aim: compute a fixed-size summary of the stream that can be used to estimate arbitrary subset sums with known error bounds
Inclusion Sampling and Estimation
◊ Horvitz-Thompson Estimation:
– Object of size xi sampled with probability pi
– Unbiased estimate x′i = xi/pi (if sampled), 0 if not sampled: E[x′i] = xi
◊ Linearity:
– Estimate of subset sum = sum of matching estimates
– Subset sum X(S) = Σ_{i∈S} xi is estimated by X′(S) = Σ_{i∈S} x′i
◊ Accuracy:
– Exponential bounds: Pr[|X′(S) − X(S)| > δX(S)] ≤ exp[−g(δ)X(S)]
– Confidence intervals: X(S) ∈ [X−(ε), X+(ε)] with probability 1 − ε
◊ Future proof:
– Don't need to know queries at time of sampling
□ "Where/when did that suspicious UDP port first become so active?"
□ "Which is the most active IP address within that anomalous subnet?"
– Retrospective estimate: subset sum over relevant key set
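The Horvitz-Thompson estimator and its linearity can be sketched in a few lines of Python (an illustrative sketch, not code from the tutorial; `p_of`, a callable supplying each item's inclusion probability, is a hypothetical interface):

```python
import random

def ht_sample(stream, p_of):
    """Sample (weight, key) items; keep the HT estimate x/p with each key."""
    sample = []
    for x, k in stream:
        p = p_of(x)                      # inclusion probability p_i
        if random.random() < p:
            sample.append((x / p, k))    # unbiased estimate x_i / p_i
    return sample

def estimate_subset_sum(sample, key_pred):
    """Linearity: subset-sum estimate = sum of matching HT estimates."""
    return sum(x_hat for x_hat, k in sample if key_pred(k))
```

With `p_of = lambda x: 1.0` every item is kept and the estimate is exact; smaller probabilities trade accuracy for sample size, with the estimate remaining unbiased.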
Independent Stream Sampling
◊ Bernoulli Sampling
– IID sampling of objects with some probability p
– Sampled weight x has HT estimate x/p
◊ Poisson Sampling
– Weight xi sampled with probability pi; HT estimate xi/pi
◊ When to use Poisson vs. Bernoulli sampling?
– Elephants and mice: Poisson allows probability to depend on weight…
◊ What is the best choice of probabilities for a given stream {xi}?
Bernoulli Sampling
◊ The easiest possible case of sampling: all weights are 1
– N objects, and want to sample k from them uniformly
– Each possible subset of k should be equally likely
◊ Uniformly sample an index from N (without replacement) k times
– Some subtleties: truly random numbers from [1…N] on a computer?
– Assume that random number generators are good enough
◊ Common trick in DB: assign a random number to each item and sort
– Costly if N is very big, but so is random access
◊ Interesting problem: take a single linear scan of data to draw sample
– Streaming model of computation: see each element once
– Application: IP flow sampling, too many (for us) to store
– (For a while) common tech interview question
Reservoir Sampling
"Reservoir sampling" described by [Knuth 69, 81]; enhancements [Vitter 85]
◊ Fixed-size k uniform sample from arbitrary size N stream in one pass
– No need to know stream size in advance
– Include first k items w.p. 1
– Include item n > k with probability pn = k/n
□ Pick j uniformly from {1, 2, …, n}
□ If j ≤ k, swap item n into location j in reservoir, discard replaced item
◊ Neat proof shows the uniformity of the sampling method:
– Let Sn = sample set after n arrivals
– New item: selection probability Pr[n ∈ Sn] = pn := k/n
– Previously sampled item m (< n): by induction,
m ∈ Sn−1 w.p. pn−1 ⇒ m ∈ Sn w.p. pn−1 · (1 − pn/k) = pn
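The algorithm above is a few lines in any language; a minimal Python sketch (not the tutorial's own code):

```python
import random

def reservoir_sample(stream, k):
    """One-pass uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)       # first k items included w.p. 1
        else:
            j = random.randrange(n)      # uniform over {0, ..., n-1}
            if j < k:                    # happens with probability k/n
                reservoir[j] = item      # swap in, discard replaced item
    return reservoir
```

If the stream has fewer than k items, the whole stream is returned; otherwise every item ends up in the final sample with probability exactly k/N.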
Reservoir Sampling: Skip Counting
◊ Simple approach: check each item in turn
– O(1) per item
– Fine if computation time < interarrival time
– Otherwise build up computation backlog O(N)
◊ Better: "skip counting"
– Find random index m(n) of next selection > n
– Distribution: Pr[m(n) ≤ m] = 1 − (1 − pn+1)·(1 − pn+2)·…·(1 − pm)
◊ Expected number of selections from stream is k + Σ_{k<m≤N} pm = k + Σ_{k<m≤N} k/m = O(k(1 + ln(N/k)))
◊ Vitter '85 provided an algorithm with this average running time
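The skip distribution above can be sampled by inversion: draw one uniform value and walk the running product of non-selection probabilities until the CDF crosses it. This is an illustrative sketch only (one random draw per selection), not Vitter's optimized Algorithm Z:

```python
import random

def next_selection(n, k):
    """Draw the index m(n) > n of the next reservoir selection from
    Pr[m(n) <= m] = 1 - prod_{j=n+1..m} (1 - k/j), by CDF inversion.
    Assumes n >= k, so each p_j = k/j is a valid probability."""
    u = random.random()
    survive = 1.0                        # running product of (1 - k/j)
    m = n
    while True:
        m += 1
        survive *= 1.0 - k / m
        if 1.0 - survive >= u:           # CDF has crossed the uniform draw
            return m
```

The loop always terminates because the product tends to 0 (Σ k/j diverges), so the CDF tends to 1.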
Reservoir Sampling via Order Sampling
◊ Order sampling a.k.a. bottom-k sample, min-hashing
◊ Uniform sampling of stream into reservoir of size k
◊ Each arrival n: generate one-time random value rn ∈ U[0,1]
– rn also known as hash, rank, tag…
◊ Store the k items with the smallest random tags
Example tags: 0.391 0.908 0.291 0.555 0.619 0.273
– Each item has the same chance of least tag, so uniform
– Fast to implement via priority queue
– Can run on multiple input streams separately, then merge
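A priority-queue implementation of this idea might look as follows (an illustrative sketch; tags are returned alongside items so that samples from separate streams can be merged by keeping the k smallest tags overall):

```python
import heapq
import random

def order_sample(stream, k):
    """Keep the k items with the smallest random tags.
    A max-heap on tags (stored negated) gives O(log k) per update."""
    heap = []                            # entries are (-tag, item)
    for item in stream:
        tag = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (-tag, item))
        elif tag < -heap[0][0]:          # beats the largest kept tag
            heapq.heapreplace(heap, (-tag, item))
    return [(-neg, item) for neg, item in heap]
```

To merge bottom-k samples from several streams, concatenate the (tag, item) lists and again retain the k smallest tags.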
Handling Weights
◊ So far: uniform sampling from a stream using a reservoir
◊ Extend to non-uniform sampling from weighted streams
– Easy case: k = 1
– Sampling probability p(n) = xn/Wn where Wn = Σ_{i=1}^{n} xi
◊ k > 1 is harder
– Can have elements with large weight: would be sampled with prob 1?
◊ Number of different weighted order-sampling schemes proposed to realize desired distributional objectives
– Rank rn = f(un, xn) for some function f and un ∈ U[0,1]
– k-mins sketches [Cohen 1997], Bottom-k sketches [Cohen Kaplan 2007]
– [Rosen 1972], Weighted random sampling [Efraimidis Spirakis 2006]
– Order PPS Sampling [Ohlsson 1990, Rosen 1997]
– Priority Sampling [Duffield Lund Thorup 2004], [Alon+DLT 2005]
Weighted random sampling
◊ Weighted random sampling [Efraimidis Spirakis 06] generalizes min-wise
– For each item draw rn uniformly at random in range [0,1]
– Compute the 'tag' of an item as rn^(1/xn)
– Keep the items with the k largest tags (equivalently: tag −ln(rn)/xn, keep the k smallest)
– Can prove the correctness of the exponential sampling distribution
◊ Can also make efficient via skip counting ideas
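The Efraimidis-Spirakis scheme drops into the same priority-queue skeleton as uniform order sampling; a hedged sketch (not the paper's code):

```python
import heapq
import random

def weighted_order_sample(stream, k):
    """Efraimidis-Spirakis weighted sampling without replacement:
    per item draw u uniform in [0,1), tag u**(1/x), keep k largest tags."""
    heap = []                            # min-heap of (tag, item)
    for item, x in stream:
        tag = random.random() ** (1.0 / x)
        if len(heap) < k:
            heapq.heappush(heap, (tag, item))
        elif tag > heap[0][0]:           # beats the smallest kept tag
            heapq.heapreplace(heap, (tag, item))
    return [item for _, item in heap]
```

Heavier items get exponents closer to 0, hence tags closer to 1, so they are more likely to survive, with inclusion probabilities matching successive weighted sampling without replacement.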
Priority Sampling
◊ Each item xi given priority zi = xi/ri with ri uniform random in (0,1]
◊ Maintain reservoir of k+1 items (xi, zi) of highest priority
◊ Estimation
– Let z* = (k+1)st highest priority
– Top-k priority items: weight estimate x′i = max{xi, z*}
– All other items: weight estimate zero
◊ Statistics and bounds
– x′i unbiased; zero covariance: Cov[x′i, x′j] = 0 for i ≠ j
– Relative variance for any subset sum ≤ 1/(k−1) [Szegedy, 2006]
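A direct rendering of the scheme above (an illustrative sketch, assuming the stream holds more than k+1 items so that z* is well defined):

```python
import heapq
import random

def priority_sample(stream, k):
    """Priority sampling: priority z_i = x_i / r_i, r_i uniform in (0,1].
    Keep the k+1 highest priorities; the top-k items get weight estimate
    max(x_i, z*), where z* is the (k+1)-st highest priority."""
    heap = []                            # min-heap of (priority, weight)
    for x in stream:
        r = 1.0 - random.random()        # uniform in (0, 1]: avoids r = 0
        z = x / r
        if len(heap) < k + 1:
            heapq.heappush(heap, (z, x))
        elif z > heap[0][0]:
            heapq.heapreplace(heap, (z, x))
    z_star = heap[0][0]                  # (k+1)-st highest priority
    return [max(x, z_star) for z, x in heap[1:]]   # top-k estimates
```

All unsampled items implicitly receive estimate zero; summing the returned estimates over any key subset gives the unbiased subset-sum estimate.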
Priority Sampling in Databases
◊ One-Time Sample Preparation
– Compute priorities of all items, sort in decreasing priority order
□ No discard
◊ Sample and Estimate
– Estimate any subset sum X(S) = Σ_{i∈S} xi by X′(S) = Σ_{i∈S′} x′i for some S′ ⊂ S
– Method: select items in decreasing priority order
◊ Two variants: bounded variance or complexity
1. S′ = first k items from S: relative variance bounded ≤ 1/(k−1)
□ x′i = max{xi, z*} where z* = (k+1)st highest priority in S
2. S′ = items from S in first k: execution time O(k)
□ x′i = max{xi, z*} where z* = (k+1)st highest priority
[Alon et al., 2005]
Making Stream Samples Smarter
◊ Observation: we see the whole stream, even if we can't store it
– Can keep more information about sampled items if repeated
– Simple information: if item sampled, count all repeats
◊ Counting Samples [Gibbons & Matias 98]
– Sample new items with fixed probability p, count repeats as ci
– Unbiased estimate of total count: 1/p + (ci − 1)
◊ Sample and Hold [Estan & Varghese 02]: generalize to weighted keys
– New key with weight b sampled with probability 1 − (1−p)^b
◊ Lower variance compared with independent sampling
– But sample size will grow as pn
◊ Adaptive sample and hold: reduce p when needed
– "Sticky sampling": geometric decreases in p [Manku, Motwani 02]
– Much subsequent work tuning decrease in p to maintain sample size
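The counting-samples idea fits in a few lines (an illustrative sketch of the unweighted case):

```python
import random

def counting_sample(stream, p):
    """Counting samples: a key enters the sample with probability p on
    any occurrence; once held, every repeat is counted exactly."""
    counts = {}
    for key in stream:
        if key in counts:
            counts[key] += 1             # already held: count exactly
        elif random.random() < p:
            counts[key] = 1              # sampled on this occurrence
    # Unbiased estimate of a sampled key's total count: 1/p + (c - 1)
    return {k: 1.0 / p + (c - 1) for k, c in counts.items()}
```

The 1/p term accounts, in expectation, for the occurrences seen before the key was first sampled; with p = 1 the estimates reduce to exact counts.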
Sketch Guided Sampling
◊ Go further: avoid sampling the heavy keys as much
– Uniform sampling will pick from the heavy keys again and again
◊ Idea: use an oracle to tell when a key is heavy [Kumar Xu 06]
– Adjust sampling probability accordingly
◊ Can use a "sketch" data structure to play the role of oracle
– Like a hash table with collisions, tracks approximate frequencies
– E.g. (Counting) Bloom Filters, Count-Min Sketch
◊ Track probability with which key is sampled, use HT estimators
– Set probability of sampling a key with (estimated) weight w as 1/(1 + εw) for parameter ε: decreases as w increases
– Decreasing ε improves accuracy, increases sample size
Challenges for Smart Stream Sampling
◊ Current router constraints
– Flow tables maintained in fast, expensive SRAM
□ To support per-packet key lookup at line rate
◊ Implementation requirements
– Sample and Hold: still needs per-packet lookup
– Sampled NetFlow: (uniform) sampling reduces lookup rate
□ Easier to implement despite inferior statistical properties
◊ Long development times to realize new sampling algorithms
◊ Similar concerns affect sampling in other applications
– Processing large amounts of data needs awareness of hardware
– Uniform sampling means no coordination needed in distributed setting
Future for Smarter Stream Sampling
◊ Software Defined Networking
– Current: proprietary software running on special vendor equipment
– Future: open software and protocols on commodity hardware
◊ Potentially offers flexibility in traffic measurement
– Allocate system resources to measurement tasks as needed
– Dynamic reconfiguration, fine-grained tuning of sampling
– Stateful packet inspection and sampling for network security
◊ Technical challenges:
– High rate packet processing in software
– Transparent support from commodity hardware
– OpenSketch: [Yu, Jose, Miao, 2013]
◊ Same issues in other applications: use of commodity programmable HW
Stream Sampling: Sampling as Cost Optimization
Matching Data to Sampling Analysis
◊ Generic problem 1: counting objects: weight xi = 1
– Bernoulli (uniform) sampling with probability p works fine
– Estimated subset count X′(S) = #{samples in S}/p
– Relative Variance(X′(S)) = (1/p − 1)/X(S)
□ given p, get any desired accuracy for large enough S
◊ Generic problem 2: xi in Pareto distribution, a.k.a. 80-20 law
– Small proportion of objects possess a large proportion of total weight
□ How best to sample objects to accurately estimate weight?
– Uniform sampling?
□ likely to omit heavy objects ⇒ big hit on accuracy
□ making selection set S large doesn't help
– Select m largest objects?
□ biased & smaller objects systematically ignored
Heavy Tails in the Internet and Beyond
◊ File sizes in storage
◊ Bytes and packets per network flow
◊ Degree distributions in web graph, social networks
Non-Uniform Sampling
◊ Extensive literature: see book by [Tille, "Sampling Algorithms", 2006]
◊ Predates "Big Data"
– Focus on statistical properties, not so much computational
◊ IPPS: Inclusion Probability Proportional to Size
– Variance Optimal for HT Estimation
– Sampling probabilities for multivariate version: [Chao 1982, Tille 1996]
– Efficient stream sampling algorithm: [Cohen et al. 2009]
Costs of Non-Uniform Sampling
◊ Independent sampling from n objects with weights {x1, …, xn}
◊ Goal: find the "best" sampling probabilities {p1, …, pn}
◊ Horvitz-Thompson: unbiased estimation of each xi by
x′i = xi/pi if i selected, 0 otherwise
◊ Two costs to balance:
1. Estimation Variance: Var(x′i) = xi²(1/pi − 1)
2. Expected Sample Size: Σi pi
◊ Minimize Linear Combination Cost: Σi (xi²(1/pi − 1) + z² pi)
– z expresses relative importance of small sample vs. small variance
Minimal Cost Sampling: IPPS
IPPS: Inclusion Probability Proportional to Size
◊ Minimize Cost Σi (xi²(1/pi − 1) + z² pi) subject to 1 ≥ pi ≥ 0
◊ Solution: pi = pz(xi) = min{1, xi/z}
– small objects (xi < z) selected with probability proportional to size
– large objects (xi ≥ z) selected with probability 1
– Call z the "sampling threshold"
– Unbiased estimator xi/pi = max{xi, z}
◊ Perhaps reminiscent of importance sampling, but not the same:
– make no assumptions concerning distribution of the x
[Figure: pz(x) rises linearly from 0, reaching 1 at x = z]
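With the threshold fixed, IPPS (threshold) sampling is a one-liner per item; a minimal sketch:

```python
import random

def ipps_sample(weights, z):
    """Threshold (IPPS) sampling: include weight x with probability
    p_z(x) = min(1, x/z); the HT estimate of an included item is
    x / p_z(x) = max(x, z)."""
    sample = []
    for x in weights:
        if x >= z or random.random() < x / z:
            sample.append(max(x, z))     # unbiased weight estimate
    return sample
```

Items at or above the threshold are always kept with their exact weight; small items, when kept, are boosted to z so the estimate stays unbiased.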
Error Estimates and Bounds
◊ Variance based:
– HT sampling variance for single object of weight xi
□ Var(x′i) = xi²(1/pi − 1) = xi²(1/min{1, xi/z} − 1) ≤ z·xi
– Subset sum X(S) = Σ_{i∈S} xi is estimated by X′(S) = Σ_{i∈S} x′i
□ Var(X′(S)) ≤ z·X(S)
◊ Exponential bounds
– E.g. Pr[X′(S) = 0] ≤ exp(−X(S)/z)
◊ Bounds are simple and powerful
– depend only on subset sum X(S), not individual constituents
Sampled IP Traffic Measurements
◊ Packet Sampled NetFlow
– Sample packet stream in router to limit rate of key lookup: uniform 1/N
– Aggregate sampled packets into flow records by key
◊ Model: packet stream of (key, bytesize) pairs {(bi, ki)}
◊ Packet sampled flow record (b, k) where b = Σ{bi : i sampled ∧ ki = k}
– HT estimate b·N of total bytes in flow
◊ Downstream sampling of flow records in measurement infrastructure
– IPPS sampling, probability min{1, b·N/z}
◊ Chained variance bound for any subset sum X of flows
– Var(X′) ≤ (z + N·bmax)·X where bmax = maximum packet byte size
– Regardless of how packets are distributed amongst flows
[Duffield, Lund, Thorup, IEEE ToIT, 2004]
Estimation Accuracy in Practice
◊ Estimate any subset sum comprising at least some fraction f of weight
◊ Suppose: sample size m
◊ Analysis: typical estimation error ε (relative standard deviation) obeys ε ≈ 1/√(f·m)
◊ m = 2^16 samples = storage needed for aggregates over 16-bit address prefixes
□ But sampling gives more flexibility to estimate traffic within aggregates
◊ Example: a fraction f = 0.1% is estimated with typical relative error around 12%
[Figure: log-log plot of relative standard deviation ε vs. fraction f, for m = 2^16 samples]
Heavy Hitters: Exact vs. Aggregate vs. Sampled
◊ Sampling does not tell you where the interesting features are
– But does speed up the ability to find them with existing tools
◊ Example: Heavy Hitter Detection
– Setting: flow records reporting 10GB/s traffic stream
– Aim: find Heavy Hitters = IP prefixes comprising ≥ 0.1% of traffic
– Response time needed: 5 minutes
◊ Compare:
– Exact: 10GB/s × 5 minutes yields upwards of 300M flow records
– Aggregate: 64k aggregates over 16-bit prefixes: no deeper drill-down possible
– Sampled: 64k flow records: any aggregate ≥ 0.1% accurate to 10%
Cost Optimization for Sampling
Several different approaches optimize for different objectives:
1. Fixed Sample Size IPPS Sample
– Variance Optimal sampling: minimal variance unbiased estimation
2. Structure Aware Sampling
– Improve estimation accuracy for subnet queries using topological cost
3. Fair Sampling
– Adaptively balance sampling budget over subpopulations of flows
– Uniform estimation accuracy regardless of subpopulation size
4. Stable Sampling
– Increase stability of sample set by imposing cost on changes
IPPS Stream Reservoir Sampling
◊ Each arriving item:
– Provisionally include item in reservoir
– If m+1 items, discard 1 item randomly
□ Calculate threshold z to sample m items on average: z solves Σi pz(xi) = m
□ Discard item i with probability qi = 1 − pz(xi)
□ Adjust m surviving xi with Horvitz-Thompson: x′i = xi/pi = max{xi, z}
◊ Efficient implementation:
– Computational cost O(log m) per item, amortized cost O(log log m)
[Cohen, Duffield, Lund, Kaplan, Thorup; SODA 2009, SIAM J. Comput. 2011]
[Figure: example with m = 9 – on arrival of x10, recalculate the threshold z solving Σ_{i=1}^{10} min{1, xi/z} = 9, discard one item with probabilities qi = 1 − min{1, xi/z}, and adjust surviving weights to x′i = max{xi, z}]
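The threshold recalculation step can be sketched directly: sort weights, and find how many of the largest items must be kept with probability 1 before the remainder can share the rest of the budget. This is an illustrative batch solver only (the full VarOpt algorithm of Cohen et al. maintains z incrementally in O(log m) per item):

```python
def ipps_threshold(weights, m):
    """Solve sum_i min(1, x_i / z) = m for z (assumes len(weights) > m)."""
    xs = sorted(weights, reverse=True)
    tail = sum(xs)                       # weight below the top t items
    for t in range(m):
        z = tail / (m - t)               # candidate: top t items get p = 1
        if (t == 0 or xs[t - 1] >= z) and xs[t] < z:
            return z                     # consistent split found
        tail -= xs[t]
    return xs[m - 1]                     # degenerate fallback
```

For example, weights [8, 4, 2, 1] with m = 3 give z = 3: the two largest items are kept surely and min{1, 2/3} + min{1, 1/3} fills the remaining unit of expected sample size.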
Structure (Un)Aware Sampling
◊ Sampling is oblivious to structure in keys (IP address hierarchy)
– Estimation disperses the weight of discarded items to surviving samples
◊ Queries structure aware: subset sums over related keys (IP subnets)
– Accuracy on LHS is decreased by discarding weight on RHS
[Figure: binary tree over key prefixes – ∅; 0, 1; 00, 01, 10, 11; 000 … 111]
Localizing Weight Redistribution
◊ Initial weight set {xi : i ∈ S} for some S ⊂ Ω
– E.g. Ω = possible IP addresses, S = observed IP addresses
◊ Attribute "range cost" C({xi : i ∈ R}) for each weight subset R ⊆ S
– Possible factors for range cost:
□ Sampling variance
□ Topology, e.g. height of lowest common ancestor
– Heuristic: R* = nearest neighbor {xi, xj} of minimal xi·xj
◊ Sample k items from S:
– Progressively remove one item from the subset with minimal range cost:
– While (|S| > k)
□ Find R* ⊆ S of minimal range cost
□ Remove a weight from R* w/ VarOpt
[Cohen, Cormode, Duffield; PVLDB 2011]
– No change outside subtree below closest ancestor
– Order of magnitude reduction in average subnet error vs. VarOpt
[Figure: binary tree over key prefixes, weight redistribution localized within a subtree]
Fair Sampling Across Subpopulations
◊ Analysis queries often focus on specific subpopulations
– E.g. networking: different customers, user applications, network paths
◊ Wide variation in subpopulation size
– 5 orders of magnitude variation in traffic on interfaces of access router
◊ If uniform sampling across subpopulations:
– Poor estimation accuracy on subset sums within small subpopulations
[Figure: color = subpopulation; interesting items occur proportional to subpopulation size. Under uniform sampling across subpopulations it is difficult to track the proportion of interesting items within small subpopulations]
Fair Sampling Across Subpopulations
◊ Minimize relative variance by sharing budget m over subpopulations
– Total n objects in subpopulations n1, …, nd with Σi ni = n
– Allocate budget mi to each subpopulation ni with Σi mi = m
◊ Minimize average population relative variance R = const. · Σi 1/mi
◊ Theorem:
– R minimized when {mi} are Max-Min Fair share of m under demands {ni}
◊ Streaming
– Problem: don't know subpopulation sizes {ni} in advance
◊ Solution: progressive fair sharing as reservoir sample
– Provisionally include each arrival
– Discard 1 item as VarOpt sample from any maximal subpopulation
◊ Theorem [Duffield; Sigmetrics 2012]:
– Max-Min Fair at all times; equality in distribution with VarOpt samples {mi from ni}
Stable Sampling
◊ Setting: sampling a population over successive periods
◊ Sample independently at each time period?
– Cost associated with sample churn
– Time series analysis of set of relatively stable keys
◊ Find sampling probabilities through cost minimization
– Minimize Cost = Estimation Variance + z · E[#Churn]
◊ Size m sample with maximal expected churn D
– weights {xi}, previous sampling probabilities {pi}
– find new sampling probabilities {qi} to minimize cost of taking m samples
– Minimize Σi xi²/qi subject to 1 ≥ qi ≥ 0, Σi qi = m and Σi |pi − qi| ≤ D
[Cohen, Cormode, Duffield, Lund 13]
Summary of Part 1
◊ Sampling as a powerful, general summarization technique
◊ Unbiased estimation via Horvitz-Thompson estimators
◊ Sampling from streams of data
– Uniform sampling: reservoir sampling
– Weighted generalizations: sample and hold, counting samples
◊ Advances in stream sampling
– The cost principle for sample design, and IPPS methods
– Threshold, priority and VarOpt sampling
– Extending the cost principle:
□ structure aware, fair sampling, stable sampling, sketch guided
Outline
◊ Motivating application: sampling in large ISP networks
◊ Basics of sampling: concepts and estimation
◊ Stream sampling: uniform and weighted case
– Variations: Concise sampling, sample and hold, sketch guided
BREAK
◊ Advanced stream sampling: sampling as cost optimization
– VarOpt, priority, structure aware, and stable sampling
◊ Hashing and coordination
– Bottom-k, consistent sampling and sketch-based sampling
◊ Graph sampling
– Node, edge and subgraph sampling
◊ Conclusion and future directions
Data Scale: Hashing and Coordination
Sampling from the set of items
◊ Sometimes need to sample from the distinct set of objects
– Not influenced by the weight or number of occurrences
– E.g. sample from the distinct set of flows, regardless of weight
◊ Need sampling method that is invariant to duplicates
◊ Basic idea: build a function to determine what to sample
– A "random" function f(k) → R
– Use f(k) to make a sampling decision: consistent decision for same key
Permanent Random Numbers
◊ Often convenient to think of f as giving "permanent random numbers"
– Permanent: assigned once and for all
– Random: treat as if fully randomly chosen
◊ The permanent random number is used in multiple sampling steps
– Same "random" number each time, so consistent (correlated) decisions
◊ Example: use PRNs to draw a sample of s from N via order sampling
– If s << N, small chance of seeing same element in different samples
– Via PRNs, stronger chance of seeing same element
□ Can track properties over time, gives a form of stability
◊ Easiest way to generate PRNs: apply a hash function to the element id
– Ensures PRN can be generated with minimal coordination
– Explicitly storing a random number for all observed keys does not scale
Hash Functions
Many possible choices of hashing functions:
◊ Cryptographic hash functions: SHA-1, MD5, etc.
– Results appear "random" for most tests (using seed/salt)
– Can be slow for high speed / high volume applications
– Full power of cryptographic security not needed for most statistical purposes
□ Although possible some trade-offs in robustness to subversion if not used
◊ Heuristic hash functions: srand(), mod
– Usually pretty fast
– May not be random enough: structure in keys may cause collisions
◊ Mathematical hash functions: universal hashing, k-wise hashing
– Have precise mathematical properties on probabilities
– Can be implemented to be very fast
Mathematical Hashing
◊ k-wise independence: Pr[h(x1) = y1 ∧ h(x2) = y2 ∧ … ∧ h(xt) = yt] = 1/R^t
– Simple function: ct·x^t + ct−1·x^(t−1) + … + c1·x + c0 mod P
– For fixed prime P, randomly chosen c0 … ct
– Can be made very fast (choose P to be a Mersenne prime to simplify mods)
◊ (Twisted) tabulation hashing [Thorup Patrascu 13]
– Interpret each key as a sequence of short characters, e.g. 8×8 bits
– Use a "truly random" look-up table for each character (so 8×256 entries)
– Take the exclusive-OR of the relevant table values
– Fast, and fairly compact
– Strong enough for many applications of hashing (hash tables etc.)
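The polynomial construction above can be sketched as follows (an illustrative version; mapping the polynomial's output into [0, R) by a final mod introduces a small bias unless R divides P):

```python
import random

MERSENNE_P = (1 << 61) - 1               # Mersenne prime 2^61 - 1

def make_kwise_hash(t, R, seed=None):
    """t-wise independent hash into [0, R): a random degree-(t-1)
    polynomial over GF(P), reduced mod R."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(MERSENNE_P) for _ in range(t)]
    def h(x):
        acc = 0
        for c in coeffs:                 # Horner's rule, all arithmetic mod P
            acc = (acc * x + c) % MERSENNE_P
        return acc % R
    return h
```

The same seed reproduces the same function, which is exactly the "permanent random number" behavior needed for coordinated sampling.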
Bottom-k sampling
◊ Sample from the set of distinct keys
– Hash each key using appropriate hash function
– Keep information on the keys with the s smallest hash values
– Think of as order sampling with PRNs…
◊ Useful for estimating properties of the support set of keys
– Evaluate any predicate on the sampled set of keys
◊ Same concept, several different names:
– Bottom-k sampling, Min-wise hashing, K-minimum values
Example hashes (repeated keys get the same value): 0.391 0.908 0.291 0.391 0.391 0.273
Subset Size Estimation from Bottom-k
◊ Want to estimate the fraction t = |A|/|D|
– D is the observed set of data
– A is an arbitrary subset given later
– E.g. fraction of customers who are sports fans from midwest aged 18-35
◊ Simple algorithm:
– Run bottom-k to get sample set S, estimate t′ = |A ∩ S|/s
– Error decreases as 1/√s
– Analysis due to [Thorup 13]: simple hash functions suffice for big enough s
Similarity Estimation
◊ How similar are two sets, A and B?
◊ Jaccard coefficient: |A ∩ B|/|A ∪ B|
– 1 if A, B identical, 0 if they are disjoint
– Widely used, e.g. to measure document similarity
◊ Simple approach: sample an item uniformly from A and B
– Probability of seeing same item from both: |A ∩ B|/(|A| × |B|)
– Chance of seeing same item too low to be informative
◊ Coordinated sampling: use same hash function to sample from A, B
– Probability that same item sampled: |A ∩ B|/|A ∪ B|
– Repeat: the average number of agreements gives Jaccard coefficient
– Concentration: (additive) error scales as 1/√s
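Coordinated min-hash sampling can be sketched as below (illustrative only: the salted built-in `hash` stands in for the approximately min-wise hash functions the next slide calls for, and is adequate only as a demonstration):

```python
import random

def make_hashes(s, seed=0):
    """s salted hash functions (stand-ins for approximately min-wise hashes)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(s)]
    return [lambda x, m=m: hash((m, x)) for m in salts]

def minhash_signature(items, hashes):
    """Coordinated sampling: each hash samples the item minimizing it."""
    return [min(items, key=h) for h in hashes]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of coordinates where both sets sampled the same item."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Because both sets are sampled with the same hash functions, each coordinate agrees with probability exactly |A ∩ B|/|A ∪ B| (under fully random hashing), and averaging over s coordinates concentrates the estimate.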
Technical Issue: Min-wise hashing
◊ For analysis to work, the hash function must be fully random
– All possible permutations of the input are equally likely
– Unrealistic in practice: description of such a function is huge
◊ "Simple" hash functions don't work well
– Universal hash functions are too skewed
◊ Need hash functions that are "approximately min-wise"
– Probability of sampling a subset is almost uniform
– Tabulation hashing is a simple way to achieve this
Bottom-k hashing for F0 Estimation
◊ F0 is the number of distinct items in the stream
– a fundamental quantity with many applications
– E.g. number of distinct flows seen on a backbone link
◊ Let m be the domain of stream elements: each data item is in [1…m]
◊ Pick a random (pairwise) hash function h: [m] → [R]
◊ Apply bottom-k sampling under hash function h
– Let vs = s'th smallest (distinct) value of h(i) seen
◊ If n = F0 < s, give exact answer, else estimate F′0 = sR/vs
– vs/R ≈ fraction of hash domain occupied by s smallest
[Figure: hash values spread over [0, R]; vs marks the s-th smallest]
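The estimator can be sketched end to end (an illustrative version using an affine pairwise hash; it ignores the measure-zero case vs = 0):

```python
import random

def f0_estimate(stream, s, R=1 << 32, seed=0):
    """Bottom-s distinct-count estimate: F0' = s*R / v_s, where v_s is
    the s-th smallest pairwise-independent hash value seen."""
    P = (1 << 61) - 1                    # prime larger than R
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)
    values = sorted({((a * x + b) % P) % R for x in stream})
    if len(values) < s:
        return len(values)               # fewer than s distinct: exact
    return s * R / values[s - 1]         # v_s = s-th smallest hash value
```

Only the s smallest hash values need to be retained, so a streaming version keeps a size-s heap rather than the full set.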
Analysis of F0 algorithm
◊ Can show that it is unlikely to have an overestimate
– Too many items hashed below a fixed value
– Can treat each event of an item hashing too low as independent
◊ Similar outline to show unlikely to have an underestimate
◊ (Relative) error scales as 1/√s
◊ Space cost:
– Store s hash values, so O(s log m) bits
– Can improve to O(s + log m) with additional hashing tricks
– See also "Streamed Approximate Counting of Distinct Elements", KDD'14
Consistent Weighted Sampling
◊ Want to extend bottom-k results when data has weights
◊ Specifically, two data sets A and B where each element has a weight
– Weights are aggregated: we see whole weight of an element together
◊ Weighted Jaccard: want probability that same key is chosen by both to be Σi min(A(i), B(i)) / Σi max(A(i), B(i))
◊ Sampling method should obey uniformity and consistency
– Uniformity: element i picked from A with probability proportional to A(i)
– Consistency: if i is picked from A, and B(i) > A(i), then i also picked for B
◊ Simple solution: assuming integer weights, treat weight A(i) as A(i) unique (different) copies of element i, apply bottom-k
– Limitations: slow, unscalable when weights can be large
– Need to rescale fractional weights to integral multiples
Consistent Weighted Sampling
◊ Efficient sampling distributions exist achieving uniformity and consistency
◊ Basic idea: consider a weight w as w/D different elements
– Compute the probability that any of these achieves the minimum value
– Study the limiting distribution as D → 0
◊ Consistent Weighted Sampling [Manasse, McSherry, Talwar 07], [Ioffe 10]
– Use hash of item to determine which points sampled via careful transform
– Many details needed to contain bit-precision, allow fast computation
◊ Other combinations of key weights are possible [Cohen Kaplan Sen 09]
– Min of weights, max of weights, sum of (absolute) differences
Trajectory Sampling
◊ Aims [Duffield Grossglauser 01]:
– Probe packets at each router they traverse
– Collate reports to infer link loss and latency
– Need to sample; independent sampling no use
◊ Hash-based sampling:
– All routers/packets: compute hash h of invariant packet fields
– Sample if h ∈ some H and report to collector; tune sample rate with |H|
– Use high entropy packet fields as hash input, e.g. IP addresses, ID field
– Hash function choice trade-off between speed, uniformity & security
◊ Standardized in Internet Engineering Task Force (IETF)
– Service providers need consistency across different vendors
– Several hash functions standardized, extensible
– Same issues arise in other big data ecosystems (apps and APIs)
Hash Sampling in Network Management
◊ Many different network subsystems used to provide service
– Monitored through event logs, passive measurement of traffic & protocols
– Need cross-system sample that captures full interaction between network and a representative set of users
◊ Ideal: hash-based selection based on common identifier
◊ Administrative challenges! Organizational diversity
◊ Timeliness challenge:
– Selection identifier may not be present at a measurement location
– Example: common identifier = anonymized customer id
□ Passive traffic measurement based on IP address
□ Mapping of IP address to customer ID not available remotely
□ Attribution of traffic IP address to a user difficult to compute at line speed
Advanced Sampling from Sketches
◊ Difficult case: inputs with positive and negative weights
◊ Want to sample based on the overall frequency distribution
– Sample from support set of n possible items
– Sample proportional to (absolute) total weights
– Sample proportional to some function of weights
◊ How to do this sampling effectively?
– Challenge: may be many elements with positive and negative weights
– Aggregate weights may end up zero: how to find the non-zero weights?
◊ Recent approach: L0 sampling
– L0 sampling enables novel "graph sketching" techniques
– Sketches for connectivity, sparsifiers [Ahn, Guha, McGregor 12]
Sampling for Big Data
L0 Sampling
◊ L0 sampling: sample i with prob ≈ fi0/F0
– i.e., sample (near) uniformly from items with non-zero frequency
◊ General approach [Frahling, Indyk, Sohler 05; C., Muthu, Rozenbaum 05]:
– Sub-sample all items (present or not) with probability p
– Generate a sub-sampled vector of frequencies fp
– Feed fp to a k-sparse recovery data structure
□ Allows reconstruction of fp if F0 < k
– If fp is k-sparse, sample from the reconstructed vector
– Repeat in parallel for exponentially shrinking values of p
Sampling Process
◊ Exponential set of probabilities, p = 1, ½, ¼, 1/8, 1/16, …, 1/U
– Let N = F0 = |{i : fi ≠ 0}|
– Want there to be a level where k-sparse recovery will succeed
– At level p, expected number of items selected S is Np
– Pick level p so that k/3 < Np ≤ 2k/3
◊ Chernoff bound: with probability exponential in k, 1 ≤ S ≤ k
– Pick k = O(log 1/δ) to get 1−δ probability
[Figure: parallel levels from p = 1 down to p = 1/U, each feeding a k-sparse recovery structure]
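A toy end-to-end sketch of this process, under two loudly labelled simplifications: each level's sub-sampled items are stored exactly in a dictionary (a stand-in for a real k-sparse recovery structure), and a string-seeded RNG plays the role of the hash that fixes each item's level consistently across updates:

```python
import random
from collections import defaultdict

def l0_sample(updates, k=20, levels=16, seed=0):
    """Toy L0 sampler over a stream of (item, weight) updates; weights
    may be negative.  Each item survives to level j with prob 2**-j."""
    def top_level(item):
        # Consistent per-item level, derived from a seeded RNG.
        r = random.Random(f"{seed}:{item}")
        j = 0
        while j < levels - 1 and r.random() < 0.5:
            j += 1
        return j

    sketches = [defaultdict(int) for _ in range(levels)]
    for item, w in updates:
        for j in range(top_level(item) + 1):
            sketches[j][item] += w

    # Use the first (densest) level whose non-zero support is k-sparse,
    # then pick uniformly from the recovered vector.
    rng = random.Random(seed)
    for level in sketches:
        support = [i for i, f in level.items() if f != 0]
        if 0 < len(support) <= k:
            return rng.choice(support)
    return None  # recovery failed at every level
```

Note that items whose positive and negative updates cancel out drop out of every level's support, which is exactly the behaviour the sketch-based approach needs.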
Hash-based sampling summary
◊ Use hash functions for sampling where some consistency is needed
– Consistency over repeated keys
– Consistency over distributed observations
◊ Hash functions have a duality of random and fixed
– Treat as random for statistical analysis
– Treat as fixed for giving consistency properties
◊ Can become quite complex and subtle
– Complex sampling distributions for consistent weighted sampling
– Tricky combination of algorithms for L0 sampling
◊ Plenty of scope for new hashing-based sampling methods
Data Scale: Massive Graph Sampling
Massive Graph Sampling
◊ "Graph Service Providers"
– Search providers: web graphs (billions of pages indexed)
– Online social networks
□ Facebook: ~10⁹ users (nodes), ~10¹² links
– ISPs: communications graphs
□ From flow records: node = src or dst IP, edge if traffic flows between them
◊ Graph service provider perspective
– Already have all the data, but how to use it?
– Want a general purpose sample that can:
□ Quickly provide answers to exploratory queries
□ Compactly archive snapshots for retrospective queries & baselining
◊ Graph consumer perspective
– Want to obtain a realistic subgraph directly or via crawling/API
Retrospective analysis of ISP graphs
◊ Node = IP address
◊ Directed edge = flow from source node to destination node
[Figure: botnet attack graph, with compromise, control and flooding stages]
• Hard to detect against background
• Known attacks can be detected:
– Signature matching based on partial graphs, flow features, timing
• Unknown attacks are harder to spot:
– exploratory & retrospective analysis
– preserve accuracy if sampling?
Goals for Graph Sampling
Crudely divide into three classes of goal:
1. Study local (node or edge) properties
– Average age of users (nodes), average length of conversation (edges)
2. Estimate global properties or parameters of the network
– Average degree, shortest path distribution
3. Sample a "representative" subgraph
– Test new algorithms and learn more quickly than on the full graph
◊ Challenges: what properties should the sample preserve?
– The notion of "representative" is very subjective
– Can list properties that should be preserved (e.g. degree dbn, path length dbn), but there are always more…
Models for Graph Sampling
Many possible models, but reduce to two for simplicity
(see tutorial by Hasan, Ahmed, Neville, Kompella in KDD 13)
◊ Static model: full access to the graph to draw the sample
– The (massive) graph is accessible in full to make the small sample
◊ Streaming model: edges arrive in some arbitrary order
– Must make sampling decisions on the fly
◊ Other graph models capture different access scenarios
– Crawling model: e.g. exploring the (deep) web, API gives node neighbours
– Adjacency list streaming: see all neighbours of a node together
Node and Edge Properties
◊ Gross over-generalization: node and edge properties can be solved using previous techniques
– Sample nodes/edges (in a stream)
– Handle duplicates (same edge many times) via hash-based sampling
– Track properties of sampled elements
□ E.g. count the degree of sampled nodes
◊ Some challenges. E.g. how to sample a node proportional to its degree?
– If degrees are known (precomputed), then use these as weights
– Else, sample edges uniformly, then sample each end with probability ½
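The edge-based trick in the last bullet can be sketched as below (a toy in-memory version; in a stream the uniform edge would come from a reservoir sample). Since node v is an endpoint of deg(v) edges, a uniform edge plus a fair coin over its endpoints yields P(v) = deg(v)/(2|E|):

```python
import random

def sample_node_by_degree(edges, rng):
    """Degree-proportional node sampling without precomputed degrees:
    pick a uniform edge, then a uniform endpoint of that edge."""
    u, v = rng.choice(edges)
    return u if rng.random() < 0.5 else v

# Path a-b-c: node 'b' has degree 2, so it appears about half the time,
# while 'a' and 'c' (degree 1) each appear about a quarter of the time.
rng = random.Random(42)
edges = [("a", "b"), ("b", "c")]
counts = {n: 0 for n in "abc"}
for _ in range(10000):
    counts[sample_node_by_degree(edges, rng)] += 1
```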
Induced subgraph sampling
◊ Node-induced subgraph
– Pass 1: sample a set of nodes (e.g. uniformly)
– Pass 2: collect all edges incident on sampled nodes
– Can collapse into a single streaming pass
– Can't know in advance how many edges will be sampled
◊ Edge-induced subgraph
– Sample a set of edges (e.g. uniformly in one pass)
– Resulting graph tends to be sparse, disconnected
◊ Edge-induced variant [Ahmed Neville Kompella 13]:
– Take a second pass to fill in edges on sampled nodes
– Hack: combine passes to fill in edges on the current sample
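The "collapse into a single streaming pass" idea works once node selection is made consistent, e.g. by hashing, so pass 1's explicit node list is never needed. A sketch (assuming the node-induced rule of keeping an edge only when both endpoints are sampled):

```python
import hashlib

def in_node_sample(node, p):
    """Consistent Bernoulli(p) membership test for a node."""
    h = hashlib.sha256(str(node).encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < p

def node_induced_subgraph(edge_stream, p):
    """Single pass: an edge joins the sample iff both endpoints are in
    the hash-defined node sample.  As the slide notes, the number of
    edges collected is not known in advance."""
    nodes, edges = set(), []
    for u, v in edge_stream:
        if in_node_sample(u, p) and in_node_sample(v, p):
            nodes.update((u, v))
            edges.append((u, v))
    return nodes, edges
```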
HT Estimators for Graphs
◊ Can construct HT estimators from uniform vertex samples [Frank 78]
– Evaluate the desired function on the sampled graph (e.g. average degree)
◊ For functions of edges (e.g. number of edges satisfying a property):
– Scale up accordingly, by N(N−1)/(k(k−1)) for sample size k on graph size N
– Variance of estimates can also be bounded in terms of N and k
◊ Similar for functions of three edges (triangles) and higher:
– Scale up by C(N,3)/C(k,3) ≈ 1/p³ to get an unbiased estimator
– High variance, so other sampling schemes have been developed
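The edge-function scale-up can be written directly; a small worked sketch for estimating |E|, using the fact that a given edge survives into the induced sample with probability k(k−1)/(N(N−1)):

```python
def ht_edge_count(edges, sampled_nodes, n_total):
    """Horvitz-Thompson estimator of the total edge count from a
    uniform sample of k of the N nodes: count edges with both
    endpoints sampled, then scale up by N(N-1)/(k(k-1))."""
    k = len(sampled_nodes)
    s = set(sampled_nodes)
    observed = sum(1 for u, v in edges if u in s and v in s)
    return observed * n_total * (n_total - 1) / (k * (k - 1))
```

On the complete graph K6 (15 edges), any 4-node sample sees 6 edges, and the estimator returns 6 · 30/12 = 15 exactly; on sparser graphs the estimate is unbiased but noisy, matching the variance remark above.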
Graph Sampling Heuristics
"Heuristics", since few formal statistical properties are known
◊ Breadth-first sampling: sample a node, then its neighbours…
– Biased towards high-degree nodes (more chances to reach them)
◊ Snowball sampling: generalize BF by picking many initial nodes
– Respondent-driven sampling: weight the snowball sample to give statistically sound estimates [Salganik Heckathorn 04]
◊ Forest-fire sampling: generalize BF by picking only a fraction of neighbours to explore [Leskovec Kleinberg Faloutsos 05]
– With probability p, move to a new node and "kill" the current node
◊ No "one true graph sampling method"
– Experiments show different preferences, depending on graph and metric [Leskovec, Faloutsos 06; Hasan, Ahmed, Neville, Kompella 13]
– None of these methods are "streaming friendly": they require a static graph
□ Hack: apply them to the stream of edges as-is
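A minimal sketch of forest-fire sampling in the static model, assuming the graph is an adjacency dict; in this toy variant each burning node spreads to each unvisited neighbour independently with probability p, re-igniting at a random node if the fire dies out before the target size is reached:

```python
import random

def forest_fire_sample(adj, target, p=0.7, seed=0):
    """Return a set of `target` nodes gathered by forest-fire burning."""
    rng = random.Random(seed)
    nodes = list(adj)
    target = min(target, len(nodes))
    visited = set()
    while len(visited) < target:
        start = rng.choice(nodes)     # (re)ignite at a random node
        visited.add(start)
        frontier = [start]
        while frontier and len(visited) < target:
            u = frontier.pop()
            for v in adj[u]:
                # Burn each unvisited neighbour with probability p.
                if v not in visited and rng.random() < p:
                    visited.add(v)
                    frontier.append(v)
    return visited
```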
Random Walk Sampling
◊ Random walks have proven very effective for many graph computations
– PageRank for node importance, and many variations
◊ A random walk is a natural model for sampling a node
– Perform a "long enough" random walk to pick a node
– How long is "long enough" (for mixing of the RW)?
– Can get "stuck" in a subgraph if the graph is not well-connected
– Costly to perform multiple random walks
– Highly non-streaming friendly, but suits graph crawling
◊ Multidimensional Random Walks [Ribeiro, Towsley 10]
– Pick k random nodes to initialize the sample
– Pick a random edge from the union of edges incident on the sample
– Can be viewed as a walk on a high-dimensional extension of the graph
– Outperforms running k independent random walks
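The multidimensional walk can be sketched as below; picking a walker with probability proportional to its degree and then a uniform neighbour is equivalent to drawing a uniform edge from the union of edges incident on the k current positions. This toy version assumes an undirected adjacency dict with no isolated nodes:

```python
import random

def frontier_walk(adj, k=3, steps=200, seed=0):
    """k walkers move as a single walk on a k-dimensional extension of
    the graph: each step advances the walker chosen proportionally to
    its degree along a uniform incident edge.  Returns visited nodes."""
    rng = random.Random(seed)
    walkers = rng.sample(list(adj), k)
    visited = []
    for _ in range(steps):
        degrees = [len(adj[w]) for w in walkers]
        # Degree-proportional walker choice = uniform incident edge.
        i = rng.choices(range(k), weights=degrees)[0]
        walkers[i] = rng.choice(adj[walkers[i]])
        visited.append(walkers[i])
    return visited
```

Keeping k positions means the sample is less likely to get trapped in one poorly connected region than a single walker would be.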
Subgraph estimation: counting triangles
◊ Hot topic: sample-based triangle counting
– Triangles: simplest non-trivial representation of node clustering
□ Regard as a prototype for more complex subgraphs of interest
– Measure of "clustering coefficient" in a graph, parameter in graph models…
◊ Uniform sampling performs poorly:
– Chance that randomly sampled edges happen to form a subgraph is ≈ 0
◊ Bias the sampling so that the desired subgraph is preferentially sampled
Subgraph Sampling in Streams
Want to sample one of the T triangles in a graph
◊ [Buriol et al 06]: sample an edge uniformly, then pick a node
– Scan for the edges that complete the triangle
– Probability of sampling a triangle is T/(|E|(|V|−2))
◊ [Pavan et al 13]: sample an edge, then sample an incident edge
– Scan for the edge that completes the triangle
– (After bias correction) probability of sampling a triangle is T/(|E|Δ)
□ Δ = max degree, considerably smaller than |V| in most graphs
◊ [Jha et al. KDD 2013]: sample edges, then sample pairs of incident edges
– Scan for edges that complete "wedges" (edge pairs incident on a vertex)
◊ Advert: Graph Sample and Hold [Ahmed, Duffield, Neville, Kompella, KDD 2014]
– General framework for subgraph counting, e.g. triangle counting
– Similar accuracy to previous state of the art, but using smaller storage
Graph Sampling Summary
◊ Sampling a representative graph from a massive graph is difficult!
◊ Current state of the art:
– Sample nodes/edges uniformly from a stream
– Heuristic sampling from static/streaming graphs
◊ Sampling enables subgraph sampling/counting
– Much effort devoted to triangles (smallest non-trivial subgraph)
◊ "Real" graphs are richer
– Different node and edge types, attributes on both
– Just scratching the surface of sampling realistic graphs
Current Directions in Sampling
Outline
◊ Motivating application: sampling in large ISP networks
◊ Basics of sampling: concepts and estimation
◊ Stream sampling: uniform and weighted case
– Variations: concise sampling, sample and hold, sketch guided
BREAK
◊ Advanced stream sampling: sampling as cost optimization
– VarOpt, priority, structure aware, and stable sampling
◊ Hashing and coordination
– Bottom-k, consistent sampling and sketch-based sampling
◊ Graph sampling
– Node, edge and subgraph sampling
◊ Conclusion and future directions
Role and Challenges for Sampling
◊ Matching
– Sampling mediates between data characteristics and analysis needs
– Example: sample from power-law distribution of bytes per flow…
□ but also make accurate estimates from samples
□ simple uniform sampling misses the large flows
◊ Balance
– Weighted sampling across key-functions: e.g. customers, network paths, geolocations
□ cover small customers, not just large
□ cover all network elements, not just highly utilized
◊ Consistency
– Sample all views of the same event, flow, customer, network element
□ across different datasets, at different times
□ independent sampling ⇒ small intersection of views
Sampling and Big Data Systems
◊ Sampling is still a useful tool in cluster computing
– Reduce the latency of experimental analysis and algorithm design
◊ Sampling as an operator is easy to implement in MapReduce
– For uniform or weighted sampling of tuples
◊ Graph computations are a core motivator of big data
– PageRank as a canonical big computation
– Graph-specific systems emerging (Pregel, LFgraph, GraphLab, Giraph…)
– But… sampling primitives not yet prevalent in evolving graph systems
◊ When to do the sampling?
– Option 1: sample as an initial step in the computation
□ Fold sampling into the initial "Map" step
– Option 2: sample to create a stored sample graph before computation
□ Allows more complex sampling, e.g. random walk sampling
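Folding the sample into the "Map" step can be as simple as a Bernoulli filter in each mapper; a sketch, using a hash of the record key so that independently running mappers (and reruns after task failures) make identical decisions:

```python
import hashlib

def sampling_mapper(records, rate):
    """Map-side sampling operator: emit a (key, value) record iff its
    key hashes into [0, rate).  Deterministic, so mapper retries and
    parallel shards stay mutually consistent."""
    for key, value in records:
        h = hashlib.sha256(str(key).encode()).digest()
        if int.from_bytes(h[:8], "big") / 2**64 < rate:
            yield key, value
```

Weighted variants follow the same pattern, comparing the hash value against a per-record threshold derived from its weight.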
Sampling + KDD
◊ The interplay between sampling and data mining is not well understood
– Need an understanding of how ML/DM algorithms are affected by sampling
– E.g. how big a sample is needed to build an accurate classifier?
– E.g. what sampling strategy optimizes cluster quality?
◊ Expect results to be method specific
– i.e. "IPPS + k-means" rather than "sample + cluster"
Sampling and Privacy
◊ Current focus on privacy-preserving data mining
– Deliver the promise of big data without sacrificing privacy?
– Opportunity for sampling to be part of the solution
◊ Naïve sampling provides "privacy in expectation"
– Your data remains private if you aren't included in the sample…
◊ Intuition: uncertainty introduced by sampling contributes to privacy
– This intuition can be formalized with different privacy models
◊ Sampling can be analyzed in the context of differential privacy
– Sampling alone does not provide differential privacy
– But applying a DP method to sampled data does guarantee privacy
– A tradeoff between sampling rate and privacy parameters
□ Sometimes, a lower sampling rate improves overall accuracy
Advert: Now Hiring…
◊ Nick Duffield, Texas A&M
– PhDs in big data, graph sampling
◊ Graham Cormode, University of Warwick, UK
– PhDs in big data summarization (graphs and matrices, funded by MSR)
– Postdocs in privacy and data modeling (funded by EC, AT&T)
Graham Cormode, [email protected]
Nick Duffield, Texas A&M, [email protected]