
Tracking most frequent items dynamically. Article by G. Cormode and S. Muthukrishnan. Presented by Simon Kamenkovich.

Post on 18-Dec-2015



Motivation

Most DB management systems maintain "hot items" statistics.

Hot items are used as simple outliers in data mining, for anomaly detection in network applications, and in financial market data.

The proposed algorithm handles both deletions and insertions, with better guarantees than other existing methods.

Introduction

Given: n integers in the range [1 … m], arriving as insert and delete transactions.

n_i(t) – net occurrence of item i at time t (insertions minus deletions),

f_i(t) = n_i(t) / Σ_{j=1..m} n_j(t) – current frequency of item i.

Item i is hot if f_i(t) > 1/(k+1).

So there may be up to k hot items and there may be none.

Preliminaries

If we're allowed O(m) space, then a simple heap will process each insert and delete in O(log m) and find all the hot items in O(k log k).

Lemma 1: Any algorithm which guarantees to find all and only items which have frequency greater than 1/(k+1) must store Ω(m) bits.

Proof: Consider encoding an arbitrary subset S ⊆ {1 … m}: insert one copy of each member of S, and let n = |S|. Then for any item i, insert n/k copies of i and query for hot items:

if i ∈ S: f_i = (n/k + 1)/(n + n/k) > (n/k)/(n(k+1)/k) = 1/(k+1), so i is reported as hot;

if i ∉ S: f_i = (n/k)/(n + n/k) = (n/k) · k/(n(k+1)) = 1/(k+1), not greater, so i is not reported.

Deleting the n/k copies restores the state, so by repeating this for every i we can extract the set S. Since an arbitrary subset of {1 … m} can be extracted, Ω(m) bits must be stored.

Small Tail Property

Say f_1 ≥ f_2 ≥ … ≥ f_m are the frequencies of the items in non-increasing order. The set of frequencies has a small tail if Σ_{i=k+1..m} f_i < 1/(k+1).

If there are k hot items then the small tail property holds (the k hot items together have frequency more than k/(k+1), leaving less than 1/(k+1) for the tail), but the opposite is not always true.

Prior works

Algorithm                   Type           Time per item        Space
Misra-Gries                 Deterministic  O(log k) amortized   O(k)
Frequent                    Randomized     O(1) expected        O(k)
Lossy Counting              Deterministic  O(log(n/k))          Ω(k log(n/k))
Charikar et al.             Randomized     Ω((k/ε²) log n)      Ω((k/ε²) log n)
Gilbert et al. (quantiles)  Randomized     Ω(k² log² n)         Ω(k² log² n)

Base method

Definition: A dyadic range sum oracle returns the sum of the counts of items in the range (i·2^j + 1) … ((i+1)·2^j), for 0 ≤ j ≤ log m, 0 ≤ i < m/2^j.

Theorem 1: Calling DivideAndConquer(1, m, n/(k+1)) will output all and only hot items. A total of O(k log(m/k)) calls will be made to the oracle.

DivideAndConquer(l, r, thresh)
  if oracle(l, r) > thresh
    if (l = r) then output(l)
    else
      DivideAndConquer(l, (l+r−1)/2, thresh)
      DivideAndConquer((l+r−1)/2 + 1, r, thresh)
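As a concrete illustration, here is a minimal Python sketch of the divide-and-conquer search. The approximate dyadic oracle is replaced by an exact stand-in that sums true counts; the function name and the toy stream are illustrative, not from the paper:

```python
from collections import Counter

def divide_and_conquer(counts, l, r, thresh, out):
    """Recurse into dyadic halves whose count exceeds thresh."""
    if sum(counts[i] for i in range(l, r + 1)) > thresh:  # exact stand-in oracle
        if l == r:
            out.append(l)
        else:
            mid = (l + r - 1) // 2
            divide_and_conquer(counts, l, mid, thresh, out)
            divide_and_conquer(counts, mid + 1, r, thresh, out)
    return out

# m = 8, n = 10 transactions, k = 2: hot means frequency > 1/3.
stream = [3, 3, 1, 3, 5, 3, 3, 7, 3, 2]
counts = Counter(stream)
hot = divide_and_conquer(counts, 1, 8, len(stream) / 3, [])
print(hot)  # item 3 occurs 6 of 10 times, so only it is hot
```

With an exact oracle the recursion visits only ranges containing a hot item, which is what bounds the number of oracle calls in Theorem 1.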

Oracle design

N. Alon et al. – requires O(km) space.
Gilbert et al. (Random Subset Sums) – requires O(k² log m log(k/δ)) space.
Charikar et al. – requires O(k log m log(k/δ)) space.
Cormode–Muthukrishnan – requires O(k log m log(k/δ)) space.

Group Testing

Idea: design a number of tests, each of which groups together some of the m items, in order to find up to k items which test positive.

A majority item can be found, in the insertions-only case, in O(1) time and space by the algorithm of Boyer and Moore.

General procedure: for each transaction on item i, determine each subset S(i) it is included in, and increment or decrement the counters associated with those subsets.

Deterministic algorithm

Each test includes half of the range [1..m], corresponding to the binary representations of the values.

int c[0 … log m]

UpdateCounters(i, transtype, c[0 … log m])
  diff = +1 if (transtype = ins), else −1
  c[0] = c[0] + diff
  for j = 1 to log m do
    if (transtype = ins)
      c[j] = c[j] + bit(j, i)
    else
      c[j] = c[j] − bit(j, i)

Deterministic algorithm (cont.)

Theorem 2: The above algorithm finds a majority item, if there is one, with time O(log m) per operation.

FindMajority(c[0 … log m])
  position = 0, t = 1
  for j = 1 to log m do
    if (c[j] > c[0]/2) then position = position + t
    t = 2 * t
  return position
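A minimal Python sketch of the two routines above, assuming a toy universe [1 … 15] (so four bit counters plus the total c[0]):

```python
LOGM = 4  # toy universe [1 .. 15]; the real algorithm uses log m counters

def update_counters(c, i, ins):
    """c[0] tracks the net total; c[j] tracks items whose j-th bit is 1."""
    d = 1 if ins else -1
    c[0] += d
    for j in range(1, LOGM + 1):
        if (i >> (j - 1)) & 1:
            c[j] += d

def find_majority(c):
    """Rebuild the majority candidate bit by bit (Theorem 2)."""
    position, t = 0, 1
    for j in range(1, LOGM + 1):
        if c[j] > c[0] / 2:
            position += t
        t *= 2
    return position

c = [0] * (LOGM + 1)
for x in [5, 9, 5, 5, 2, 5]:
    update_counters(c, x, ins=True)
update_counters(c, 9, ins=False)  # one deletion
print(find_majority(c))  # item 5 holds 4 of the 5 net transactions
```

If a bit-slice counter exceeds half of the net total, the majority item (when one exists) must have that bit set, which is why the candidate can be rebuilt bit by bit.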

Randomized Algorithm

Definition: Let F ⊆ [1 … m] denote the set of hot items. A set S ⊆ [1 … m] is a good set if |S ∩ F| = 1.

Theorem 3: Picking O(k log k) subsets, each formed by drawing m/k items uniformly at random from the m items [1 … m], means that with constant probability we have included k subsets S_1, …, S_k such that each S_i is a good set and F = (S_1 ∩ F) ∪ … ∪ (S_k ∩ F).

Proof: The probability of picking exactly one member of F when drawing m/k items is

p = (m/k)·(k/m)·(1 − k/m)^(m/k − 1) = (1 − k/m)^(m/k − 1).

In the non-trivial case k ≤ m/2 this is at least a constant, p ≥ 1/(2e).

As in the Coupon Collector Problem, picking O(k ln k) such subsets then guarantees (with constant probability) that we have a good set for each hot item.

Coupon Collector Problem

X – number of trials required to collect at least one coupon of each type.
Epoch i begins after the i-th success and ends with the (i+1)-th success.
X_i – number of trials in the i-th epoch; X_i is geometrically distributed with p_i = p(k−i)/k, where p is the probability that a coupon is good.

X = Σ_{i=0..k−1} X_i, so

E[X] = Σ_{i=0..k−1} E[X_i] = Σ_{i=0..k−1} k/(p(k−i)) = (k/p) Σ_{i=1..k} 1/i = (k/p)·H_k ≈ (k/p) ln k = O(k log k).
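A quick simulation, under the simplifying assumption that p stays fixed across trials, agrees with this expectation (trials_to_collect is an illustrative name, not from the paper):

```python
import random

def trials_to_collect(k, p, rng):
    """Trials until all k coupon types are seen; each trial yields a
    uniformly random coupon type only with probability p."""
    seen, t = set(), 0
    while len(seen) < k:
        t += 1
        if rng.random() < p:
            seen.add(rng.randrange(k))
    return t

rng = random.Random(7)
k, p, runs = 10, 0.5, 2000
avg = sum(trials_to_collect(k, p, rng) for _ in range(runs)) / runs
hk = sum(1 / i for i in range(1, k + 1))
print(avg, k * hk / p)  # empirical mean vs. (k/p)·H_k ≈ 58.6
```

The empirical mean lands close to (k/p)·H_k, matching the E[X] computed above.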

Using hash functions

We don't store the sets explicitly – that would take O(m log k) space. Instead we choose the sets in a pseudo-random fashion using universal hash functions: fix a prime P > 2k, and take a and b uniformly at random from [0 … P−1]:

h_{a,b}(x) = ((ax + b) mod P) mod 2k

S_{a,b,i} = { x | h_{a,b}(x) = i }

Fact: over the choices of a and b, for any x ≠ y, Pr(h_{a,b}(x) = h_{a,b}(y)) ≤ 1/(2k).

Choose T = log(k/δ) pairs (a, b); each pair defines 2k sets.

Data Structure

log(k/δ) groups

2k subsets

log m counters

ProcessItem

Initialize c[0 … 2Tk][0 … log m]; draw a[1 … T], b[1 … T]; c = 0

ProcessItem(i, transtype, T, k)
  if (transtype = ins) c = c + 1
  else c = c − 1
  for x = 1 to T do
    index = 2k(x−1) + ((i·a[x] + b[x]) mod P) mod 2k   // they had 2(x−1)
    UpdateCounters(i, transtype, c[index])

Space used by the algorithm is O(k log(k/δ) log m).

Lemma

Lemma 2: The probability of each hot item being in at least one good set is at least 1 − δ.

Proof: In each of the T repetitions we put a hot item i in one of 2k buckets. The expected total frequency of the other items falling into the same bucket is at most (1/(2k)) · k/(k+1) = 1/(2(k+1)).

If this total is less than 1/(k+1), then i is a majority item within its bucket and we can find it; if it is greater, we may not be able to. By Markov's inequality, the probability of failure in one repetition is < 1/2.

So the probability of failing in all T = log(k/δ) repetitions is at most 2^(−T) = δ/k, and by a union bound the probability of any of the k hot items failing is at most δ.

GroupTest

GroupTest(T, k, b)
  for i = 1 to 2Tk do   // they had T here
    if c[i][0] > cb     // cb = b·c, where c is the current net transaction count
      position = 0; t = 1
      for j = 1 to log m do
        if (c[i][j] > cb and c[i][0] − c[i][j] > cb) or (c[i][j] < cb and c[i][0] − c[i][j] < cb)
          then skip to next i
        if (c[i][j] > cb) position = position + t
        t = 2 * t
      output(position)
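Putting the pieces together, here is a hedged Python sketch of the whole randomized structure (ProcessItem plus GroupTest). The prime P = 2³¹ − 1 and the toy parameters are illustrative choices, not from the paper, and m is assumed to be a power of two:

```python
import random
from math import log2

class HotItems:
    """Sketch of the randomized group-testing structure; m is assumed
    to be a power of two, and P = 2**31 - 1 is a prime larger than m."""

    def __init__(self, m, k, T, P=2**31 - 1, seed=0):
        rng = random.Random(seed)
        self.m, self.k, self.T, self.P = m, k, T, P
        self.logm = int(log2(m))
        self.a = [rng.randrange(1, P) for _ in range(T)]
        self.b = [rng.randrange(P) for _ in range(T)]
        # 2Tk subsets, each with counters c[index][0 .. log m]
        self.c = [[0] * (self.logm + 1) for _ in range(2 * T * k)]
        self.n = 0  # net number of transactions

    def _index(self, x, i):
        # subset of group x that item i hashes into
        return 2 * self.k * x + ((i * self.a[x] + self.b[x]) % self.P) % (2 * self.k)

    def process(self, i, ins=True):          # ProcessItem + UpdateCounters
        d = 1 if ins else -1
        self.n += d
        for x in range(self.T):
            row = self.c[self._index(x, i)]
            row[0] += d
            for j in range(1, self.logm + 1):
                if (i >> (j - 1)) & 1:
                    row[j] += d

    def group_test(self, b):                 # GroupTest with threshold b
        cb, hot = self.n * b, set()
        for row in self.c:
            if row[0] <= cb:
                continue
            pos, t, ok = 0, 1, True
            for j in range(1, self.logm + 1):
                hi, lo = row[j] > cb, row[0] - row[j] > cb
                if hi == lo:   # both bit-halves heavy, or neither: not a good set
                    ok = False
                    break
                if hi:
                    pos += t
                t *= 2
            if ok:
                hot.add(pos)
        return hot

# Toy run: item 5 stays hot, item 11 is deleted back below the threshold.
h = HotItems(m=16, k=2, T=6, seed=1)
for _ in range(60):
    h.process(5)
for _ in range(30):
    h.process(11)
for i in range(1, 16):
    h.process(i)
for _ in range(25):
    h.process(11, ins=False)
print(h.group_test(1 / 3))  # only item 5 exceeds frequency 1/3
```

Each bucket whose total count clears the threshold is decoded exactly like FindMajority; buckets where both bit-halves (or neither) are heavy are rejected as not good.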

Lemma 3: Given a subset S_{a,b,i} and its associated set of counters, it is possible to detect whether S_{a,b,i} is a good set, using the small tail property.

Algorithm properties

Theorem 4: With probability at least 1 − δ, calling GroupTest(log(k/δ), k, 1/(k+1)) finds all the hot items, using O(k log(k/δ) log m) space. The time for an update is O(log(k/δ) log m), and the time to list all hot items is O(k log(k/δ) log m).

Corollary 1: If we have the small tail property, then we will output no items which are not hot.

Proof: We look for a hot item only where one exists, and only it will be output.

Algorithm properties (cont)

Lemma 4: The output of the algorithm is the same for any reordering of the input data.

Corollary 2: The set of counters created with T = log(k/δ) can be used to find hot items with parameter k′, for any k′ < k, with probability of success 1 − δ, by calling GroupTest(log(k/δ), k, 1/(k′+1)).

Proof: As in Lemma 2, since (1/(2k)) · k′/(k′+1) ≤ 1/(2(k′+1)).

Experiments

Definitions: The recall is the proportion of the truly hot items that are found by the method, out of the total number of hot items.

The precision is the proportion of the items reported by the algorithm that are truly hot, out of all reported items.
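For example, these two measures can be computed as follows (recall_precision is an illustrative helper, not from the paper):

```python
def recall_precision(reported, actual_hot):
    """Recall: found hot / all hot.  Precision: truly hot / all reported."""
    reported, actual_hot = set(reported), set(actual_hot)
    tp = len(reported & actual_hot)
    recall = tp / len(actual_hot) if actual_hot else 1.0
    precision = tp / len(reported) if reported else 1.0
    return recall, precision

print(recall_precision([3, 7, 9], [3, 7, 2, 5]))  # recall 0.5, precision 2/3
```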

The GroupTesting algorithm was compared to the Lossy Counting and Frequent algorithms.

The authors implemented the latter two so that when an item is deleted, the corresponding counter is decremented if one exists.

Synthetic data (Recall)

Zipf parameter for the hot items: 0 – distributed uniformly, 3 – highly skewed.

Synthetic data (Precision)

GroupTesting required more memory and took longer to process each item.

Real Data (Recall)

Real data was obtained from one of AT&T's networks for part of a day.

Real Data (Precision)

Real data has no guarantee of having the small tail property…

Varying frequency at query time

Other algorithms are designed around a fixed frequency threshold supplied in advance.

The data structure was built for queries at the 0.5% level, but was then tested with queries ranging from 10% to 0.02%.

Conclusions and extensions

A new method which can cope with dynamic datasets is proposed.

It would be interesting to use the algorithm to compare the differences in frequencies between different datasets.

Can we find a combinatorial design that achieves the same properties with a deterministic construction for maintaining hot items?

FIN