data stream load shedding by sampling (cs2650)

Data Stream load shedding by Sampling (CS2650) Taecheol Oh

Post on 24-Jan-2016


Page 1: Data Stream load shedding by Sampling (CS2650)

Data Stream load shedding by Sampling (CS2650)

Taecheol Oh

Page 2: Data Stream load shedding by Sampling (CS2650)

Introduction

Many data stream sources are prone to dramatic spikes in volume

An overloaded system will be unable to process all of its input data

So discarding some fraction of the unprocessed data becomes necessary for the system to continue to provide up-to-date query responses

Page 3: Data Stream load shedding by Sampling (CS2650)

Sampling

Degrade gracefully by providing approximate answers during load spikes

With basic statistics on the distribution of values, the system can give guarantees on the accuracy of queries for a given sampling rate

Page 4: Data Stream load shedding by Sampling (CS2650)

Semantics of SAMPLE

SAMPLE(R, f): produce a uniform random sample of R that contains an f fraction of the tuples in R (n denotes the number of tuples in R)

Sampling with Replacement (WR): sample fn tuples, uniformly and independently, from R; a specific tuple could be sampled multiple times

Sampling without Replacement (WoR): sample fn distinct tuples from R

Independent Coin Flips (CF): for each tuple in R, choose it for the sample with probability f, independent of other tuples; the sample size then follows the binomial distribution B(n, f)
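The three semantics can be contrasted in a few lines of Python. This is a minimal sketch, not code from the presentation: the function names are hypothetical, R is assumed to be an in-memory list, and fn is rounded to the nearest integer.

```python
import random

def sample_wr(R, f, rng):
    """With replacement: draw round(f*n) tuples uniformly and independently."""
    k = round(f * len(R))
    return [rng.choice(R) for _ in range(k)]  # the same tuple may repeat

def sample_wor(R, f, rng):
    """Without replacement: draw round(f*n) distinct tuples."""
    k = round(f * len(R))
    return rng.sample(R, k)

def sample_cf(R, f, rng):
    """Coin flips: keep each tuple independently with probability f.
    The sample size is random, distributed as B(n, f)."""
    return [t for t in R if rng.random() < f]

if __name__ == "__main__":
    rng = random.Random(0)
    R = list(range(100))
    print(len(sample_wr(R, 0.1, rng)),
          len(sample_wor(R, 0.1, rng)),
          len(sample_cf(R, 0.1, rng)))
```

WR and WoR always return exactly fn tuples, while CF only does so in expectation. CF is the only one of the three that needs no knowledge of n, which is why it is convenient on a stream.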

Page 5: Data Stream load shedding by Sampling (CS2650)

Density Preserving Sampling

Suppose that we have N values x1, x2, …, xN, partitioned into groups that have sizes n1, n2, …, ng

Sampling is density preserving if the expected sum of the weights of the sampled points for each group is proportional to the group’s size
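One simple way to obtain this property, shown here as an illustrative assumption rather than the scheme used in the presentation, is to keep each point of group j with some probability p_j and give every kept point the weight 1/p_j. The expected weight sum of group j is then n_j · p_j · (1/p_j) = n_j, whatever the p_j are. All names below are hypothetical.

```python
import random

def expected_weight_sums(group_sizes, probs):
    """Expected sum of weights per group when each of the n_j points
    is kept with probability p_j and weighted 1/p_j."""
    return [n * p * (1.0 / p) for n, p in zip(group_sizes, probs)]

def weighted_sample(groups, probs, rng):
    """Return (value, weight) pairs; groups may be sampled at different rates."""
    out = []
    for group, p in zip(groups, probs):
        for x in group:
            if rng.random() < p:
                out.append((x, 1.0 / p))
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    groups = [list(range(1000)), list(range(10))]
    # Sample the dense group lightly and the sparse group heavily.
    sample = weighted_sample(groups, [0.05, 0.5], rng)
    print(len(sample), expected_weight_sums([1000, 10], [0.05, 0.5]))
```

The sample itself can be biased toward sparse groups, yet the weights still let a query estimate each group's true size.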

Page 6: Data Stream load shedding by Sampling (CS2650)

Experiments

STREAM (Stanford stREam datA Manager)

A general-purpose data stream management system

A traditional DBMS is for running one-time queries over finite stored data sets

In many applications, data takes the form of continuous data streams rather than finite data sets

The STREAM project considers data management and query processing in the presence of multiple continuous, rapid, time-varying data streams

Page 7: Data Stream load shedding by Sampling (CS2650)

Abstract Semantics

The abstract semantics is based on two data types, Streams and Relations

Stream: an unbounded bag of pairs <s,t>, where s is a tuple and t is a time stamp, the logical arrival time

Relation: a bag of tuples at time t, i.e. an instantaneous relation

[Figure: Streams and Relations connected by three classes of operators: Stream-to-Relation, Relation-to-Relation, and Relation-to-Stream]
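The two data types and the three operator classes can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not STREAM code: a stream is a Python list of (s, t) pairs, a relation is a `Counter` used as a bag, and the particular operators chosen (a time window, a selection, and an insert-stream that emits tuples present at t but not at t − 1) are only examples of each class.

```python
from collections import Counter

# A stream modelled as a list of <s, t> pairs (tuple, time stamp).
stream = [("a", 1), ("b", 2), ("a", 3), ("c", 5)]

def window(stream, t, width):
    """Stream-to-relation: bag of tuples with time stamp in (t - width, t]."""
    return Counter(s for s, ts in stream if t - width < ts <= t)

def select(relation, pred):
    """Relation-to-relation: keep the tuples that satisfy pred."""
    return Counter({s: c for s, c in relation.items() if pred(s)})

def istream(relation_at, times):
    """Relation-to-stream: emit <s, t> for tuples in R(t) but not in R(t-1),
    using bag difference."""
    out, prev = [], Counter()
    for t in times:
        cur = relation_at(t)
        out.extend((s, t) for s in (cur - prev).elements())
        prev = cur
    return out

if __name__ == "__main__":
    print(window(stream, 3, 2))
    print(istream(lambda t: window(stream, t, 2), range(1, 6)))
```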

Page 8: Data Stream load shedding by Sampling (CS2650)

Query Execution

When a continuous query is registered with the system, the system generates a query execution plan

Plans are composed of three main components:

Operators

Queues (input and inter-operator)

State (windows, history)

A global scheduler drives plan execution
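How operators, queues, state and a scheduler fit together can be sketched as follows. This is a hypothetical miniature, not STREAM's implementation: the `Operator` class, the round-robin policy, and the `quantum` parameter are all assumptions made for illustration.

```python
from collections import deque

class Operator:
    """Hypothetical operator: reads one input queue, keeps private state,
    and writes results to an output queue."""
    def __init__(self, fn, inq, outq):
        self.fn, self.inq, self.outq = fn, inq, outq
        self.state = {}  # e.g. window contents or history

    def run(self, max_tuples):
        """Process up to max_tuples input tuples; return how many were processed."""
        done = 0
        while self.inq and done < max_tuples:
            result = self.fn(self.inq.popleft(), self.state)
            if result is not None:
                self.outq.append(result)
            done += 1
        return done

def schedule(operators, quantum=2):
    """Global round-robin scheduler: give each operator a turn until
    no operator makes progress."""
    while True:
        progress = [op.run(quantum) for op in operators]
        if not any(progress):
            break

def keep_even(x, state):
    return x if x % 2 == 0 else None

def running_sum(x, state):
    state["sum"] = state.get("sum", 0) + x
    return state["sum"]

if __name__ == "__main__":
    q_in, q_mid, q_out = deque(range(6)), deque(), deque()
    schedule([Operator(keep_even, q_in, q_mid), Operator(running_sum, q_mid, q_out)])
    print(list(q_out))
```

During a load spike the input queue grows faster than the scheduler can drain it, which is the point where a load-shedding sampler would be placed.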

Page 9: Data Stream load shedding by Sampling (CS2650)

Simple Query Plan

[Figure: a simple query plan. Queries Q1 and Q2 are computed from Stream1, Stream2 and Stream3 by two join (⋈) operators with State1–State4, all driven by the Scheduler]

Page 10: Data Stream load shedding by Sampling (CS2650)

Overview of Approach

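Unweighted sampling vs. weighted sampling

Unweighted sampling: each element is sampled uniformly at random

Algorithm (M is the number of samples still to be drawn, n the number of tuples)

i ← 0
While tuples are streaming by and M > 0 do
    get tuple ti
    generate random variable X from B(M, 1/(n − i))
    include X copies of ti in the sample
    M ← M − X
    i ← i + 1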
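The unweighted algorithm above can be turned into a short Python sketch. Some assumptions are made here that the slide does not state: the number of tuples n is known in advance, the binomial draw uses the remaining sample count M as its first parameter, and X copies of the current tuple are added to the sample. The helper `binomial` is a plain sum of coin flips so that the block needs only the standard library.

```python
import random

def binomial(m, p, rng):
    """Draw from B(m, p) as the number of successes in m coin flips."""
    return sum(rng.random() < p for _ in range(m))

def stream_sample_unweighted(tuples, M, rng):
    """One pass over the tuples, drawing M samples with replacement.
    Assumes n = len(tuples) is known before the pass starts."""
    n = len(tuples)
    sample = []
    i = 0
    for t in tuples:
        if M <= 0:
            break
        # Each remaining sample lands on tuple i with probability 1/(n - i).
        X = binomial(M, 1.0 / (n - i), rng)
        sample.extend([t] * X)
        M -= X
        i += 1
    return sample

if __name__ == "__main__":
    rng = random.Random(0)
    print(stream_sample_unweighted(list("abcdefghij"), 5, rng))
```

At the last tuple the probability is 1/(n − (n − 1)) = 1, so every sample still outstanding is assigned there and the result always has exactly M entries.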

Page 11: Data Stream load shedding by Sampling (CS2650)

Overview of Approach

Weighted sampling: each element is sampled with a probability proportional to its weight

Algorithm (W is the sum of all weights, D the total weight of the tuples seen so far)

i ← 0, W ← sum of weights, D ← 0
While tuples are streaming by and M > 0 do
    get tuple ti with weight wi
    generate random variable X from B(M, wi/(W − D))
    include X copies of ti in the sample
    M ← M − X
    D ← D + wi
    i ← i + 1
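The weighted variant follows the same pattern. In the sketch below the total weight W is assumed to be known before the pass, the binomial draw again uses the remaining sample count M, and integer weights are assumed so that W − D stays exact. None of these choices is confirmed by the slide, and the function names are hypothetical.

```python
import random

def binomial(m, p, rng):
    """Draw from B(m, p) as the number of successes in m coin flips."""
    return sum(rng.random() < p for _ in range(m))

def stream_sample_weighted(tuples, weights, M, rng):
    """One pass, M samples with replacement, each landing on a tuple
    with probability proportional to its weight."""
    W = sum(weights)  # assumed known up front
    if W <= 0:
        raise ValueError("total weight must be positive")
    D = 0             # weight of the tuples already passed
    sample = []
    for t, w in zip(tuples, weights):
        if M <= 0:
            break
        # Share of the not-yet-passed weight held by this tuple.
        p = min(1.0, w / (W - D))
        X = binomial(M, p, rng)
        sample.extend([t] * X)
        M -= X
        D += w
    return sample

if __name__ == "__main__":
    rng = random.Random(0)
    print(stream_sample_weighted(["x", "y", "z"], [1, 8, 1], 6, rng))
```

A tuple with weight 0 is never chosen, and the last tuple with positive weight absorbs whatever samples remain.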

Page 12: Data Stream load shedding by Sampling (CS2650)

Overview of Approach

Weighted sampling considering the density

[Figure: an operator fed by its input queue]

Page 13: Data Stream load shedding by Sampling (CS2650)

Overview of Approach

Weighted sampling considering the density

[Figure: a queue holding the tuples W, Z, Z, X, Y, X in front of an operator. Labelled components: mapping function, bit map / counter (incremented by +1), density measure, and weighted sampling controller]

Page 14: Data Stream load shedding by Sampling (CS2650)

Thanks