False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams


Page 1: False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying

False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams

Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying Zhou

VLDB 2004

Page 2

Introduction

Mining data streams:
– Data items arrive continuously
– One scan of the data
– Limited memory
– Bounded error

Page 3

Introduction

This paper develops an algorithm for effectively mining frequent itemsets with a bound on memory consumption.

It uses a false-negative approach.

Page 4

False Positive

Most existing algorithms for mining frequent itemsets are false-positive oriented:
– Memory consumption is controlled by an error parameter ε
– Items whose support is below the minimum support s but above s - ε are allowed to be reported as frequent

Example: "Approximate frequency counts over data streams" (VLDB 02)

Page 5

False Positive

Memory bound: $O\left(\frac{1}{\varepsilon} \log(\varepsilon N)\right)$

Dilemma of the false-positive approach:
– The smaller ε is, the fewer false-positive items are included
– Memory consumption increases reciprocally in terms of ε
– In Apriori, the frequent k-itemsets generate the candidate (k+1)-itemsets, so false positives at one level add candidates at the next
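
The trade-off above can be illustrated with a minimal sketch of lossy counting, the scheme from the cited VLDB 02 paper, as it is commonly described. The function name `lossy_counting`, the bucket bookkeeping, and the toy stream are illustrative assumptions, not material from the slides.

```python
import math

def lossy_counting(stream, s, eps):
    """False-positive oriented frequency counting with error parameter eps."""
    width = math.ceil(1.0 / eps)          # bucket width, grows as eps shrinks
    entries = {}                          # item -> [count, max_undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)     # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # End of a bucket: drop entries that cannot be frequent so far.
            entries = {x: fd for x, fd in entries.items()
                       if fd[0] + fd[1] > bucket}
    # Report every item whose count reaches (s - eps) * n.
    return {x: fd[0] for x, fd in entries.items() if fd[0] >= (s - eps) * n}

# Hypothetical stream of 1000 items: "a" is 15%, "b" is 9.5%, the rest are unique.
stream = []
for i in range(1000):
    r = i % 20
    if r < 3:
        stream.append("a")
    elif r == 3 or (r == 4 and i < 900):
        stream.append("b")
    else:
        stream.append(f"x{i}")

result = lossy_counting(stream, s=0.1, eps=0.01)
```

With s = 0.1 and ε = 0.01, item "b" has support 9.5%, which is below s but above s - ε, so it is reported alongside "a". That is a false positive in the sense of the slides.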

Page 6

False Positive & False Negative

[Figure: support axis marked at s - ε, s, and s + ε. False positive: all itemsets with support at least s are output, and some with support between s - ε and s are output. False negative: all itemsets with support at least s + ε are output, and some with support between s and s + ε are output.]

Page 7

False Negative

Error control and pruning:
– ε: used to prune data and to control the error bound; it is changeable
  ε decreases and approaches zero as the number of observations increases
  ε is inversely related to n
– s: minimum support
– n: number of observations

Page 8

False Negative

Memory control:
– δ: reliability; δ, instead of ε, controls memory consumption
– Memory consumption is related to ln(1/δ)

This approach does not allow 1-itemsets with support below s to be reported as frequent.

Page 9

Comparison: False Positive & False Negative

Recall and Precision
A: true frequent itemsets
B: obtained frequent itemsets

– $\text{Recall} = \frac{|A \cap B|}{|A|}$
– $\text{Precision} = \frac{|A \cap B|}{|B|}$
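
As a small illustration of these two measures, the following sketch computes them for sets of itemsets. The helper name `recall_precision` and the toy itemsets are hypothetical and chosen only to show the typical pattern of each approach.

```python
def recall_precision(true_sets, mined_sets):
    """Return (recall, precision) for two collections of itemsets."""
    a = {frozenset(x) for x in true_sets}    # A: true frequent itemsets
    b = {frozenset(x) for x in mined_sets}   # B: obtained frequent itemsets
    hits = len(a & b)
    recall = hits / len(a) if a else 1.0
    precision = hits / len(b) if b else 1.0
    return recall, precision

# Toy data (hypothetical).
true_frequent = [{"a"}, {"b"}, {"a", "b"}, {"c"}]
false_negative_result = [{"a"}, {"b"}, {"a", "b"}]               # misses {"c"}
false_positive_result = [{"a"}, {"b"}, {"a", "b"}, {"c"}, {"d"}]  # adds {"d"}

fn_scores = recall_precision(true_frequent, false_negative_result)
fp_scores = recall_precision(true_frequent, false_positive_result)
```

Here `fn_scores` is (0.75, 1.0) and `fp_scores` is (1.0, 0.8): a false-negative result loses recall but keeps precision at 1, while a false-positive result does the opposite.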

Page 10

Comparison: False Positive

ε = s/10, δ = 0.1

S (%)   True Size   Mined Size   Recall   Precision
0.08    21,361      126,307      1.00     0.17
0.10    12,252      68,275       1.00     0.18
0.20    2,359       23,154       1.00     0.16

Page 11

Comparison: False Negative

s + ε: minimum support

S (%)   True Size   Mined Size   Recall   Precision
0.08    21,361      18,351       0.86     1.00
0.10    12,252      10,411       0.85     1.00
0.20    2,359       1,739        0.74     1.00

Page 12

Chernoff Bound

The Chernoff bound gives a probabilistic guarantee on the estimation of statistics about the underlying data.

$\Pr\{T \ge e \cdot E[T]\} \le e^{-E[T]}$

For example: pick a lottery number from 0000, 0001, …, 9999.
– 1,000,000 people each buy a $1 ticket
– E[#winners] = 100
– $\Pr\{T \ge 273\} \le e^{-100}$
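
The lottery example can be checked numerically. This is a minimal sketch, assuming each ticket wins independently with probability 1/10,000, so that the number of winners T is Binomial(1,000,000, 0.0001). The helper name `binomial_upper_tail` and the truncation point 2000 are illustrative choices; terms beyond the truncation point are negligibly small.

```python
import math

def binomial_upper_tail(n, p, k_min, k_max):
    """Approximate Pr[k_min <= T <= k_max] for T ~ Binomial(n, p).

    Each pmf term is computed in log space to avoid overflow in the
    binomial coefficient.
    """
    log_p = math.log(p)
    log_q = math.log1p(-p)
    total = 0.0
    for k in range(k_min, k_max + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1) + k * log_p + (n - k) * log_q)
        total += math.exp(log_pmf)
    return total

n_tickets = 1_000_000
p_win = 1e-4
expected = n_tickets * p_win                    # E[#winners] = 100
threshold = math.ceil(math.e * expected)        # smallest integer >= e * E[T]
chernoff_bound = math.exp(-expected)            # e^{-100}
tail_at_273 = binomial_upper_tail(n_tickets, p_win, 273, 2000)
```

The computed tail probability comes out below `chernoff_bound`, consistent with the inequality on the slide, which is an upper bound rather than an exact value.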

Page 13

Chernoff Bound

Bernoulli trials (coin flips):
– $\Pr[o_i = 1] = p$, $\Pr[o_i = 0] = 1 - p$
– r: number of heads in n coin flips
– np: expectation of r

for any γ > 0

Page 14

Chernoff Bound

Let $\bar{r}$ denote r/n, and let the minimum support s play the role of p.

Replace sγ with ε.

Set the right-hand side of the inequality to δ:
– $\Pr\{|\text{RunningSupport} - \text{TrueSupport}| \ge \varepsilon_n\} \le \delta$
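
The substitution steps can be written out as a short derivation. The starting inequality is an assumption: it is the two-sided Chernoff form that is consistent with the symbols used on these slides, and the constants in the exponent differ between textbook statements of the bound.

```latex
% Assumed starting point: two-sided Chernoff bound for r = o_1 + ... + o_n
\Pr\{|r - np| \ge np\gamma\} \le 2e^{-np\gamma^{2}/2}

% Divide by n, with \bar{r} = r/n
\Pr\{|\bar{r} - p| \ge p\gamma\} \le 2e^{-np\gamma^{2}/2}

% Substitute p = s and \varepsilon = s\gamma (so \gamma = \varepsilon / s)
\Pr\{|\bar{r} - s| \ge \varepsilon\} \le 2e^{-n\varepsilon^{2}/(2s)}

% Set the right-hand side equal to \delta and solve for \varepsilon
2e^{-n\varepsilon_n^{2}/(2s)} = \delta
\quad\Longrightarrow\quad
\varepsilon_n = \sqrt{\frac{2s\ln(2/\delta)}{n}}
```

Under this assumption, $\varepsilon_n$ depends only on s, δ, and n, and it shrinks as more observations arrive.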

Page 15

Frequent or Infrequent

A pattern X is potentially infrequent if $\mathrm{count}(X)/n < s - \varepsilon_n$ in terms of n.
A pattern X is potentially frequent if it is not potentially infrequent in terms of n.
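
The definition can be expressed as a small predicate. This is a sketch that assumes the running error bound takes the form $\varepsilon_n = \sqrt{2 s \ln(2/\delta)/n}$; the function names are hypothetical.

```python
import math

def epsilon_n(s, delta, n):
    """Running error bound after n observations (assumed form)."""
    return math.sqrt(2.0 * s * math.log(2.0 / delta) / n)

def is_potentially_infrequent(count, n, s, delta):
    """True if count / n falls below s - epsilon_n."""
    return count / n < s - epsilon_n(s, delta, n)

# With s = 0.1, delta = 0.1, n = 1000 the cut-off s - epsilon_n is about 0.0755.
low = is_potentially_infrequent(70, 1000, 0.1, 0.1)    # support 0.07
high = is_potentially_infrequent(80, 1000, 0.1, 0.1)   # support 0.08
```

In this example the pattern with support 0.07 is potentially infrequent and the one with support 0.08 is potentially frequent, even though both are below s. The cut-off moves toward s as n grows.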

Page 16

FDPM-1(s, δ)

Page 17

FDPM-1(s, δ)

[Figure: items (A, B, C, D, ...) arrive from the source and are recorded in an Item/Count table. When memory is full, a new $\varepsilon_n$ is computed and infrequent items are deleted.]

Page 18

FDPM-1(s, δ)

The algorithm ensures:
– All items whose true frequency exceeds sN are output with probability at least 1 - δ
– No item whose true frequency is less than sN is output
– The probability that the estimated support equals the true support is no less than 1 - δ
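
The steps named on the slides (count items, and when memory is full compute a new $\varepsilon_n$ and delete infrequent items) can be sketched as follows. This is not the authors' pseudocode: the function name `fdpm1_sketch`, the default capacity $(2 + 2\ln(2/\delta))/s$, the form of $\varepsilon_n$, and the output rule `count >= s * n` are assumptions made for illustration.

```python
import math

def fdpm1_sketch(stream, s, delta, capacity=None):
    """Count single items with pruning driven by a shrinking error bound."""
    log_term = math.log(2.0 / delta)
    if capacity is None:
        # Assumed memory budget: (2 + 2 ln(2/delta)) / s entries.
        capacity = math.ceil((2.0 + 2.0 * log_term) / s)
    counts = {}
    n = 0
    for item in stream:
        n += 1
        if item in counts:
            counts[item] += 1
            continue
        if len(counts) >= capacity:
            # Memory is full: compute the new epsilon_n and
            # delete the potentially infrequent items.
            eps = math.sqrt(2.0 * s * log_term / n)
            cutoff = (s - eps) * n
            counts = {x: c for x, c in counts.items() if c >= cutoff}
        counts[item] = 1
    # Counts never overestimate, so nothing below s * n is reported.
    return {x: c for x, c in counts.items() if c >= s * n}

# Hypothetical stream: "a" is 50%, "b" is 20%, the rest are unique items.
stream = []
for i in range(1000):
    if i % 2 == 0:
        stream.append("a")
    elif i % 10 in (1, 3):
        stream.append("b")
    else:
        stream.append(f"x{i}")

result = fdpm1_sketch(stream, s=0.1, delta=0.1)
```

Because a stored count can only be lower than the true count (an item that is deleted and later re-inserted restarts from 1), this sketch can miss a frequent item but cannot report an infrequent one. If pruning removes nothing, the table in this sketch can briefly exceed `capacity`.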

Page 19

Memory Bound

$\mathrm{Sup}(X) \ge (s - \varepsilon_n)\, n$
$|P| \le \frac{1}{s - \varepsilon_n}$, when $s - \varepsilon_n > 0$

$|P| = n_0 = \frac{1}{s - \varepsilon_{n_0}} \approx \frac{2 + 2\ln(2/\delta)}{s}$

$\varepsilon_n = \sqrt{\frac{2 s \ln(2/\delta)}{n}}$
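
To get a feel for the size of this bound, the following sketch evaluates it for a few values of δ. It assumes the forms $\varepsilon_n = \sqrt{2 s \ln(2/\delta)/n}$ and $n_0 \approx (2 + 2\ln(2/\delta))/s$; the function names are hypothetical, and `solve_n0` is the closed-form root of $n_0 = 1/(s - \varepsilon_{n_0})$ under that assumption.

```python
import math

def epsilon_n(s, delta, n):
    """Running error bound after n observations (assumed form)."""
    return math.sqrt(2.0 * s * math.log(2.0 / delta) / n)

def fdpm_capacity(s, delta):
    """Approximate memory bound (2 + 2 ln(2/delta)) / s."""
    return (2.0 + 2.0 * math.log(2.0 / delta)) / s

def solve_n0(s, delta):
    """Larger root of n0 = 1 / (s - epsilon_n(s, delta, n0))."""
    b = 2.0 + 2.0 * math.log(2.0 / delta)
    # With x = s * n0 the equation becomes x^2 - b*x + 1 = 0.
    return (b + math.sqrt(b * b - 4.0)) / (2.0 * s)

s = 0.001
rows = [(delta, solve_n0(s, delta), fdpm_capacity(s, delta))
        for delta in (0.1, 0.01, 0.0001)]
```

Each row holds δ, the exact root, and the approximate bound. Tightening δ by three orders of magnitude only adds a logarithmic factor, which matches the earlier statement that memory consumption is related to ln(1/δ).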

Page 20

FDPM-2(s, δ)

Page 21

Mining Frequent Itemsets from a Data Stream

[Figure: transactions such as {A,B}, …, {E,F,G} arrive from the source. Itemsets ({A}, {B}, {AB}, {E}, {F}, {EF}, …) and their counts are kept in tables labeled P and F. When memory is full, a new $\varepsilon_n$ is computed and infrequent itemsets are deleted.]

Page 22

Conclusion

– False negative
– Limited memory
– Error bound with some probability