
Page 1: Fast and Memory Efficient Mining of Frequent Closed Itemsets Claudio Lucchese Salvatore Orlando Raffaele Perego DB group seminar Presenter: Leonidas

Fast and Memory Efficient Mining of Frequent Closed Itemsets

Claudio Lucchese, Salvatore Orlando, Raffaele Perego

DB group seminar

Presenter: Leonidas

Abstract
• Frequent Itemsets Mining
• Closed Itemsets
• Mining Frequent Closed Itemsets
• Handling duplicates
• Brief introduction of the algorithm
• Experimental results

Frequent Itemsets Mining
• Given a set of items $\mathcal{I}$ and a set of transactions D
• Discover all the itemsets from $\mathcal{I}$ with support > min_supp
• Support of a k-itemset I, supp(I): the number of transactions in D that include I
• I is a set of items from $\mathcal{I}$
• A transaction t in D is a set of items from $\mathcal{I}$
• Well-known algorithm: Apriori
• Discovers frequent itemsets
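The definition of support can be sketched in a few lines of Python. This is a brute-force illustration, not Apriori: it enumerates every candidate itemset and keeps those with support > min_supp, the convention used on this slide. The function names `supp` and `frequent_itemsets` are made up for this sketch, and the four toy transactions are the ones used in the example slides later in the deck.

```python
from itertools import combinations

# Toy data set: the four transactions from the example slides.
D = [{"B", "D"}, {"A", "B", "C", "D"}, {"A", "C", "D"}, {"C"}]

def supp(itemset, transactions):
    """Number of transactions that include every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)

def frequent_itemsets(transactions, min_supp):
    """Brute force: all non-empty itemsets with support > min_supp."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = supp(combo, transactions)
            if s > min_supp:
                result[frozenset(combo)] = s
    return result
```

Brute force is exponential in the number of items, which is exactly why level-wise algorithms such as Apriori prune candidates whose subsets are already known to be infrequent.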

Weaknesses & Solutions
• The number of frequent itemsets grows quickly as min_supp decreases
• Complexity of the mining task increases rapidly
• Huge size of output
• Complex to analyze
• Closed itemsets are one of the solutions
• Unique maximal elements of the equivalence classes defined over the lattice of all the frequent itemsets

Weaknesses & Solutions
• Equivalence class
• Distinct group of frequent itemsets
• Supported by the same set of transactions
• Represent the same knowledge
• Vertical bitwise representation of the data set
• Association rules extracted are more meaningful [ZAKI04]
• Redundancies are removed
• Suitable for dense data sets
• Frequent closed itemsets are much fewer than frequent itemsets

Closed Itemsets
• $I \subseteq \mathcal{I}$ is a subset of the items appearing in D
• $T \subseteq D$ is a subset of the transactions in D
• Define two functions:
  $f(T) = \{ i \in \mathcal{I} \mid \forall t \in T,\ i \in t \}$
  $g(I) = \{ t \in D \mid \forall i \in I,\ i \in t \}$
• Itemset I is closed iff
  $c(I) = f(g(I)) = (f \circ g)(I) = I$
• The function $c = f \circ g$ is called the Galois operator / closure operator

TID  Items
1    B D
2    A B C D
3    A C D
4    C
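A minimal Python sketch of the two functions and of the closure operator, run on the four-transaction table above. The names `ITEMS`, `D`, and `is_closed` are illustrative choices; `f`, `g`, and `c` mirror the symbols on the slide.

```python
# Item universe and transactions (TID -> items) from the table above.
ITEMS = frozenset("ABCD")
D = {1: frozenset("BD"), 2: frozenset("ABCD"), 3: frozenset("ACD"), 4: frozenset("C")}

def g(itemset):
    """TIDs of the transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return frozenset(tid for tid, t in D.items() if wanted <= t)

def f(tids):
    """Items that occur in every transaction whose TID is in `tids`."""
    return frozenset(i for i in ITEMS if all(i in D[tid] for tid in tids))

def c(itemset):
    """Closure operator c = f o g."""
    return f(g(itemset))

def is_closed(itemset):
    return c(itemset) == frozenset(itemset)
```

For example, `c("A")` evaluates to `{A, C, D}`: item A occurs only in transactions 2 and 3, and both of them also contain C and D.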

Equivalence classes
• Two itemsets belong to the same equivalence class iff
• They have the same closure
• They are supported by the same set of transactions
• An itemset I is closed iff
• No superset of I has the same support

[Figure: lattice of the itemsets of the example data set, each labeled with its support. Legend: frequent closed itemset, frequent itemset, support, equivalence class.]
∅ 4
A 2, B 2, C 3, D 3
AB 1, AC 2, AD 2, BC 1, BD 2, CD 2
ABC 1, ABD 1, ACD 2, BCD 1
ABCD 1
Closed itemsets (highlighted): ∅ 4, C 3, D 3, BD 2, ACD 2, ABCD 1
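The grouping into equivalence classes can be reproduced with a short Python sketch that keys every itemset of the example lattice by its closure. The helper names `closure` and `equivalence_classes` are made up for this illustration; the closure is computed as the intersection of the supporting transactions.

```python
from itertools import combinations
from collections import defaultdict

ITEMS = "ABCD"
D = [frozenset("BD"), frozenset("ABCD"), frozenset("ACD"), frozenset("C")]

def closure(itemset):
    """Intersection of all transactions that contain `itemset`."""
    supporting = [t for t in D if itemset <= t]
    if not supporting:  # unsupported itemset: closure is the whole item universe
        return frozenset(ITEMS)
    return frozenset.intersection(*supporting)

def equivalence_classes():
    """Map each closed itemset to the itemsets that share it as closure."""
    classes = defaultdict(list)
    for k in range(len(ITEMS) + 1):
        for combo in combinations(ITEMS, k):
            x = frozenset(combo)
            classes[closure(x)].append(x)
    return dict(classes)
```

On the example data this yields six classes, one per closed itemset; the class of {A, C, D} contains A, AC, AD, CD and ACD, which all have support 2.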

Mining Frequent Closed Itemsets
• Search space browsing
• Traverse the lattice of frequent itemsets from one equivalence class to another
• Closure computation
• Compute the closure of frequent itemsets
• Determine the closed itemsets
• Closure generator:
• A single representative of an equivalence class
• All the closed itemsets can be mined by computing the closure of the generator of each class

Browsing the Search Space
• Choose the key patterns (minimal elements) as generators
• Traverse the lattice formed by key patterns with an Apriori-like algorithm [TAOU00]
• Unfortunately, the same closed itemset can be reached from more than one key pattern

[Figure: the itemset lattice of the example data set, as on the "Equivalence classes" slide.]

Browsing the Search Space
• Closure climbing
• New generators are built as supersets of the closed itemsets discovered so far
• Jump from one equivalence class to another
• Cannot ensure that the equivalence class has not been visited yet

[Figure: the itemset lattice of the example data set, as on the "Equivalence classes" slide.]

Problem of duplicates
• Duplicate checking is needed to avoid generating the same closed itemset more than once
• To avoid useless, expensive closure operations, use the following lemma:
  Given two itemsets X and Y, if $X \subset Y$ and $supp(X) = supp(Y)$ (i.e., $|g(X)| = |g(Y)|$), then $c(X) = c(Y)$
• However, it is still expensive in time and space
• All the mined closed itemsets need to be kept in main memory
• Several algorithms are forced to adopt a strict lexicographic visiting order of the search space to ensure correct duplicate avoidance
• CHARM [ZAKI02], CLOSET [PEI00], CLOSET+ [PEI03]

Computing Closures
• Besides the Galois operator, make use of the lemma:
  Given an itemset X and an item $i \in \mathcal{I}$: $g(X) \subseteq g(i) \Leftrightarrow i \in c(X)$
• Perform the inclusion check for all items in $\mathcal{I}$
• The check benefits from the vertical representation of the data set as a list of tidlists
• The calculation can be either offline or online
• Offline: compute closures for the entire set of generators
• Uses key patterns; generators are shorter
• Online: compute the closure of each generator as it is discovered
• Uses closure climbing; generators are longer
• Fewer checks for longer generators, so more efficient

Item  A  B  C  D
T1    0  1  0  1
T2    1  1  1  1
T3    1  0  1  1
T4    0  0  1  0
(each column is the tidlist of one item)
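The lemma turns closure computation into a series of bitwise inclusion tests on tidlists. The sketch below assumes each tidlist is stored as a Python integer whose bit k stands for transaction T(k+1); the names `TIDLIST`, `ALL_TIDS`, `g`, and `closure` are illustrative, and the bit patterns are read off the table above.

```python
# Tidlists as bitmaps: bit 0 = T1, bit 1 = T2, bit 2 = T3, bit 3 = T4.
TIDLIST = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ALL_TIDS = 0b1111

def g(itemset):
    """Bitwise AND of the tidlists of all items in `itemset`."""
    bits = ALL_TIDS
    for item in itemset:
        bits &= TIDLIST[item]
    return bits

def closure(itemset):
    """c(X) = { i : g(X) is included in g(i) }, tested with one AND per item."""
    gx = g(itemset)
    return {i for i, gi in TIDLIST.items() if (gx & gi) == gx}
```

The test `(gx & gi) == gx` is the inclusion check g(X) ⊆ g(i): ANDing with g(i) leaves g(X) unchanged exactly when every transaction of g(X) is also in g(i).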

Handling duplicates
• To identify the unique generator for each equivalence class
• Define the order-preserving property of generators
• Check whether a given generator is order-preserving or not
• Compute the closure of order-preserving generators only
• Prune the other generators

Handling duplicates
• Order-preserving property of generators:
  A generator of the form $X = Y \cup i$, where Y is a closed itemset and $i \notin Y$, is said to be order-preserving iff either $c(X) = X$ or $i \prec (c(X) \setminus X)$
• That is, if items need to be added to an order-preserving generator to compute its closure, they must all follow the item i
• Order-preserving generators are introduced to avoid generating the same closed itemset more than once

Example
• {A} = ∅ ∪ {A} is an order-preserving generator: $A \prec (c(\{A\}) \setminus \{A\}) = \{C, D\}$
• {C,D} = {C} ∪ {D} is not order-preserving: $D \not\prec (c(\{C,D\}) \setminus \{C,D\}) = \{A\}$

[Figure: the itemset lattice and the vertical bitmap of the example data set.]

Handling duplicates
• We need to check whether a generator is order-preserving or not
• Define a set called pre-set(gen) for a generator $gen = Y \cup i$:
  $\text{pre-set}(gen) = \{ j \mid j \in \mathcal{I},\ j \notin gen,\ \text{and } j \prec i \}$
• We can now check whether a generator is order-preserving by checking:
  $\exists j \in \text{pre-set}(gen)$ such that $g(gen) \subseteq g(j)$
• If yes, then gen is not order-preserving
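The pre-set test can be sketched directly on the bitmap tidlists of the example data set. The function name `is_order_preserving`, the `ORDER` string, and the integer bitmap encoding (bit k = transaction T(k+1)) are assumptions made for this illustration.

```python
# Tidlists as bitmaps: bit 0 = T1, bit 1 = T2, bit 2 = T3, bit 3 = T4.
TIDLIST = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ALL_TIDS = 0b1111
ORDER = "ABCD"  # item order used on the slides

def g(itemset):
    bits = ALL_TIDS
    for item in itemset:
        bits &= TIDLIST[item]
    return bits

def is_order_preserving(closed_set, i):
    """Check the generator gen = closed_set ∪ {i} against pre-set(gen)."""
    gen = set(closed_set) | {i}
    g_gen = g(gen)
    # pre-set(gen): items that precede i in the order and are not in gen.
    pre_set = [j for j in ORDER if ORDER.index(j) < ORDER.index(i) and j not in gen]
    # gen is not order-preserving if g(gen) is included in some g(j).
    return not any((g_gen & TIDLIST[j]) == g_gen for j in pre_set)
```

With these definitions, `is_order_preserving("", "A")` holds, while `is_order_preserving("C", "D")` fails because g({C,D}) is included in g(A), matching the example slide.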

Handling duplicates
• The goal is to compute the closure of order-preserving generators only
• For any closed itemset Y, there exists a sequence of order-preserving generators
• Use closure climbing to climb a sequence of closed itemsets and reach Y
• For each closed itemset Y, the sequence of order-preserving generators is unique

Handling duplicates
• Example: $Y = \{A, B, C, D\}$
  $Y_0 = c(\emptyset)$, $gen_0 = \emptyset \cup \{A\}$
  $Y_1 = c(gen_0) = \{A, C, D\}$, $gen_1 = \{A, C, D\} \cup \{B\}$
  $Y = c(gen_1) = \{A, B, C, D\}$
  Note: $Y_0 \subset Y_1 \subset Y$

[Figure: path through the lattice (∅ 4, A 2, AC 2, ACD 2, ABCD 1), labeled "Generator = ∅ ∪ {A}" and "Generator = {A,C,D} ∪ {B}".]

The DCI_CLOSED Algorithm
• Two different types of data sets
• Dense & Sparse
• Dense data sets
• Transactions are long
• Contain strongly correlated items
• In sparse data sets, the number of closed itemsets may be nearly equal to the number of frequent itemsets
• Mining closed itemsets becomes more expensive
• The algorithm is separated into two parts
• DCI_CLOSEDs() for sparse and DCI_CLOSEDd() for dense data sets

The DCI_CLOSED Algorithm
• Discriminate between sparse and dense data sets:
• Scan the data set to find the frequent single items $F_1 \subseteq \mathcal{I}$
• Build the bitwise vertical data set VD
• Items are sorted by increasing frequency
• Decide whether the data set is sparse or dense
• Dense if the percentage of 1s is large
• Dense if a large set of items is strongly correlated
• Compute the percentage of the most frequent items that co-occur in the same transactions

A 1010111
B 0101101
E 0101100
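The first steps above (finding the frequent single items, building VD, ordering items by frequency) can be sketched as follows. The `density` function is only a stand-in: it reports the fraction of 1s in VD, which is one of the indicators listed on the slide, not the exact heuristic used by DCI_CLOSED. All names here are hypothetical, and the strict threshold follows the "support > min_supp" convention from the earlier slide.

```python
from collections import Counter

def build_vertical(transactions, min_supp):
    """Return (F1 sorted by increasing frequency, VD as item -> bitmap)."""
    counts = Counter(item for t in transactions for item in set(t))
    f1 = sorted((i for i, n in counts.items() if n > min_supp),
                key=lambda i: (counts[i], i))
    vd = {i: 0 for i in f1}
    for tid, t in enumerate(transactions):
        for item in t:
            if item in vd:
                vd[item] |= 1 << tid  # set the bit of transaction `tid`
    return f1, vd

def density(f1, vd, n_transactions):
    """Fraction of 1 bits in VD (an illustrative density measure)."""
    if not f1 or n_transactions == 0:
        return 0.0
    ones = sum(bin(bits).count("1") for bits in vd.values())
    return ones / (len(f1) * n_transactions)
```

A caller would compare the returned value with some threshold of its own choosing to pick between the sparse and the dense procedure; no particular threshold is given in the slides.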

The DCI_CLOSED Algorithm
• 3 input parameters:
• Initially CLOSED_SET = c(∅), PRE_SET = ∅, POST_SET = F1 \ c(∅)
• Get an item i from POST_SET (the minimum in the order)
• Add i to CLOSED_SET to build new_gen (closure climbing)
• Check the validity of the generator new_gen against PRE_SET
• Compute the closure of new_gen using Lemma 2 to obtain the new CLOSED_SET
• A new closed set is generated from new_gen

The DCI_CLOSED Algorithm
• Use PRE_SET to check the validity of new_gen
• Guarantees that duplicate generators are correctly pruned
• POST_SET is used to guarantee that generators are produced according to Theorem 1
• POST_SET contains the items j that follow i in lexicographic order and are not yet included in CLOSED_SET
  $POST\_SET_{new} = \{ j \in POST\_SET \mid i \prec j \text{ and } j \notin X \}$
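Putting the pieces together, the recursion described on the last two slides might look like the following Python sketch. It is a simplified reading of the dense procedure, not the authors' implementation: the names `dci_closed`, `mine`, and `included` are invented here, tidlists are plain integers (bit k = transaction T(k+1)), the threshold is strict as on the "Frequent Itemsets Mining" slide, and the initial CLOSED_SET is taken to be empty, which assumes c(∅) = ∅ (true for the toy data).

```python
# Tidlists of the example data set: bit 0 = T1, ..., bit 3 = T4.
TIDLIST = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ALL_TIDS = 0b1111

def included(gx, gj):
    """True if tidlist gx is a subset of tidlist gj."""
    return (gx & gj) == gx

def dci_closed(closed_set, g_closed, pre_set, post_set, min_supp, out):
    pre_set = list(pre_set)  # local copy: additions must not leak to the caller
    for pos, i in enumerate(post_set):
        g_gen = g_closed & TIDLIST[i]          # g(new_gen), new_gen = CLOSED_SET ∪ {i}
        supp = bin(g_gen).count("1")
        if supp <= min_supp:
            continue                            # infrequent generator
        if any(included(g_gen, TIDLIST[j]) for j in pre_set):
            continue                            # not order-preserving: duplicate
        new_closed, new_post = closed_set | {i}, []
        for j in post_set[pos + 1:]:            # items that follow i
            if included(g_gen, TIDLIST[j]):
                new_closed = new_closed | {j}   # j belongs to the closure (Lemma 2)
            else:
                new_post.append(j)
        out[frozenset(new_closed)] = supp
        dci_closed(new_closed, g_gen, pre_set, new_post, min_supp, out)
        pre_set.append(i)

def mine(min_supp):
    """Closed itemsets (other than c(∅)) with support > min_supp."""
    out = {}
    dci_closed(frozenset(), ALL_TIDS, [], sorted(TIDLIST), min_supp, out)
    return out
```

Calling `mine(0)` on the toy data returns the closed itemsets ACD, ABCD, BD, C and D with their supports, in line with the running example on the following slides. No previously mined closed itemset is looked up during the search, which is the memory saving the deck emphasizes.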

Running example of DCI_CLOSEDd()
• CLOSED_SET = c(∅) = ∅, PRE_SET = ∅, POST_SET = {A,B,C,D}
• Compute the closure of the generator gen = ∅ ∪ {A} = {A}
• Check against PRE_SET: gen is order-preserving
• For each j ∈ POST_SET, check whether g(A) ⊆ g(j)
• If yes, include j in CLOSED_SET

[Figure: lattice nodes visited so far: ∅ 4, A 2, AC 2, ACD 2. Generator = ∅ ∪ {A}. The vertical bitmap of the example data set is shown alongside.]

Running example of DCI_CLOSEDd()
• CLOSED_SET = {A,C,D}, PRE_SET = ∅, POST_SET = {B}
• New generator gen = {A,C,D} ∪ {B} = {A,B,C,D}
• Check against PRE_SET: gen is order-preserving
• gen is closed since POST_SET is empty
• Note: in the step from {A,C,D} to {A,B,C,D}, the items need not be in order

[Figure: lattice nodes visited so far: ∅ 4, A 2, AC 2, ACD 2, ABCD 1. Generator = {A,C,D} ∪ {B}.]

Running example of DCI_CLOSEDd()
• gen = ∅ ∪ {B}, PRE_SET = {A}, POST_SET = {C,D}
• gen is order-preserving, as shown by checking g(B) against g(A)
• Checking g(B) against g(C) and g(D) gives c(B) = {B,D}
• {B,D} is closed, as shown by checking against POST_SET

[Figure: lattice nodes visited so far: ∅ 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2. Generator = ∅ ∪ {B}.]

Running example of DCI_CLOSEDd()
• CLOSED_SET = {B,D}, PRE_SET = {A}, POST_SET = {C}
• gen is now {B,D} ∪ {C} = {B,C,D}
• Check g({B,C,D}) against g(A): g({B,C,D}) ⊆ g(A)
• gen is not order-preserving and can be pruned together with all its possible extensions

[Figure: lattice nodes visited so far: ∅ 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, BCD 1. Generator = {B,D} ∪ {C}.]

Running example of DCI_CLOSEDd()
• gen = ∅ ∪ {C}, PRE_SET = {A,B}, POST_SET = {D}
• gen is order-preserving, as shown by checking g(C) against g(A) and g(B)
• gen cannot be extended, as shown by checking against POST_SET, so it is closed

[Figure: lattice nodes visited so far: ∅ 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, BCD 1, C 3. Generator = ∅ ∪ {C}.]

Running example of DCI_CLOSEDd()
• CLOSED_SET = {C}, PRE_SET = {A,B}, POST_SET = {D}
• gen is now {C} ∪ {D} = {C,D}
• Check g({C,D}) against g(A): g({C,D}) ⊆ g(A)
• gen is not order-preserving and can be pruned without considering its possible extensions

[Figure: lattice nodes visited so far: ∅ 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, BCD 1, C 3, CD 2. Generator = {C} ∪ {D}.]

Running example of DCI_CLOSEDd()
• gen = ∅ ∪ {D}, PRE_SET = {A,B,C}, POST_SET = ∅
• gen is order-preserving, as shown by checking g(D) against g(A), g(B), and g(C)
• gen cannot be extended since POST_SET is empty, so it is closed

[Figure: lattice nodes visited so far: ∅ 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, BCD 1, C 3, CD 2, D 3. Generator = ∅ ∪ {D}.]

Optimizations
• The vertical data set (frequent single items) is represented by a bitmap matrix VD of size M×N
• VD(i,j) = 1 when frequent item i occurs in transaction j
• Row i of the matrix represents g(i), the tidlist of item i
• Optimize the bitwise AND operations used for
• tidlist intersections
• Inclusion checks
• 3 optimization techniques

Optimizations
• Data Set Projection (projection)
• For every closed itemset Z discovered from a closed itemset X
• g(Z) is a subset of g(X)
• Delete from VD all the columns corresponding to transactions that do not occur in g(X)
• This process is limited to the generators of the first level of recursion, since it is expensive
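A small sketch of what projection means on bitmap tidlists: the columns (transactions) outside g(X) are dropped and the surviving bits are packed together, so later ANDs run on shorter vectors. The function name `project` and the integer encoding (bit k = transaction T(k+1)) are assumptions for this illustration, and the bit-by-bit repacking is chosen for clarity rather than speed.

```python
def project(vd, g_x):
    """Keep only the columns of `vd` whose transaction bit is set in `g_x`."""
    kept = [k for k in range(g_x.bit_length()) if (g_x >> k) & 1]
    projected = {}
    for item, bits in vd.items():
        packed = 0
        for new_pos, old_pos in enumerate(kept):
            # Move bit `old_pos` of the original tidlist to position `new_pos`.
            packed |= ((bits >> old_pos) & 1) << new_pos
        projected[item] = packed
    return projected

# Example: project the toy data set on g({A}) = {T2, T3}.
VD = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
VD_A = project(VD, VD["A"])
```

After projecting on g({A}), every tidlist is two bits long; A, C and D become all ones, which reflects that C and D belong to the closure of A.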

Optimizations
• Data Sets with Highly Correlated Items (section eq)
• Columns of VD are reordered to profit from data correlation
• Maximize the submatrix VE of VD whose rows are all identical
• VE is likely to be large and to include the most frequent items
• Many frequent itemsets can be mined within VE

Before reordering:
   T1 T2 T3 T4
A  0  1  0  1
B  1  1  1  1
C  1  1  0  1
D  0  1  0  1

After reordering:
   T2 T4 T1 T3
A  1  1  0  0
B  1  1  1  1
C  1  1  1  0
D  1  1  0  0

Optimizations
• Reusing Results of Previous Bitwise Intersections (included)
• To check whether an itemset X is closed, compare g(X) with g(j) for the items j of its PRE_SET
• For X to be closed, $g(X) \not\subseteq g(j)$ must hold for all such j
• Even so, a large part of g(X) may be included in g(j)
• If $g_h(X) \subseteq g_h(j)$ for a portion h of the tidlists, then $g_h(X \cup Y) \subseteq g_h(j)$ as well
• We can therefore limit the check against the various g(j) to the part complementary to $g_h(j)$

[Figure: tidlists g(j), g(X) and g(X ∪ Y); the portion h is already known to be included, so only the remaining portion is checked.]
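The idea can be illustrated with tidlists stored as lists of machine words. The sketch below remembers, for a pair (X, j), the word positions h where the inclusion already holds, and skips those positions when an extension X ∪ Y is checked against the same j. The names `included_words` and `is_included` and the list-of-words layout are assumptions for this example, not the data structures of the actual implementation.

```python
def included_words(gx, gj):
    """Positions h where word gx[h] is bitwise included in word gj[h]."""
    return {h for h, (wx, wj) in enumerate(zip(gx, gj)) if (wx & wj) == wx}

def is_included(g_ext, gj, known):
    """Check g(X ∪ Y) ⊆ g(j), skipping the words in `known`.

    `known` must come from included_words(g(X), g(j)); since g(X ∪ Y) is a
    subset of g(X), inclusion in those words is guaranteed to still hold.
    """
    return all((wx & wj) == wx
               for h, (wx, wj) in enumerate(zip(g_ext, gj))
               if h not in known)
```

The saving comes from the skipped words: the more of g(X) is already covered by g(j), the fewer AND operations are needed for each extension of X.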

Optimizations
• Actual number of bitwise AND operations vs. support threshold
• Optimizations "section eq" & "included" are the most effective

[Figure: number of bitwise AND operations vs. support threshold.]

Performance Analysis
• Competitors: FP-CLOSE [GRAH03], CLOSET+ [PEI03]
• Environment: Windows XP, Pentium IV 2.8GHz, 512MB
• Sparse & Dense data sets

Dataset      Items  Avg. Trans. Size  Transactions
T40I10D100K  1000   40                100000
Retail       16471  13                88162
Chess        76     37                3196
Pumsb        7117   74                49046

Performance Analysis
• Data sets: T40I10D100K, Retail
• DCI_CLOSED is faster than its competitors by one order of magnitude

Performance Analysis
• Data set: , CHESS, PUMSB

Performance Analysis
• Time efficiency of duplicate checking
• Speedup of up to six when support thresholds are small

[Figure: results on the chess data set.]

References
• [GRAH03] G. Grahne and J. Zhu, “Efficiently Using Prefix-Trees in Mining Frequent Itemsets,” Proc. ICDM Workshop Frequent Itemset Mining Implementations, Dec. 2003.
• [PEI00] J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,” Proc. ACM SIGMOD Int’l Workshop Data Mining and Knowledge Discovery, May 2000.
• [PEI03] J. Pei, J. Han, and J. Wang, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, Aug. 2003.
• [TAOU00] R. Taouil, N. Pasquier, Y. Bastide, L. Lakhal, and G. Stumme, “Mining Frequent Patterns with Counting Inference,” SIGKDD Explorations, vol. 2, no. 2, Dec. 2000.
• [ZAKI02] M.J. Zaki and C.-J. Hsiao, “Charm: An Efficient Algorithm for Closed Itemsets Mining,” Proc. Second SIAM Int’l Conf. Data Mining, Apr. 2002.
• [ZAKI04] M.J. Zaki, “Mining Non-Redundant Association Rules,” Data Mining and Knowledge Discovery, vol. 9, no. 3, pp. 223-248, 2004.