
Fast and Memory Efficient Mining of Frequent Closed Itemsets

Claudio Lucchese, Salvatore Orlando, Raffaele Perego

DB group seminar
Presenter: Leonidas

Abstract
• Frequent Itemsets Mining
• Closed Itemsets
• Mining Frequent Closed Itemsets
• Handling duplicates
• Brief introduction of the algorithm
• Experimental results

Frequent Itemsets Mining
• Given a set of items $\mathcal{I}$ and a set of transactions D
  • A transaction t in D is a set of items from $\mathcal{I}$
  • An itemset I is a set of items from $\mathcal{I}$
• Support of a k-itemset I, supp(I): the number of transactions in D that include I
• Goal: discover all the itemsets over $\mathcal{I}$ with support > min_sup
• Well-known algorithm: Apriori
  • Discovers frequent itemsets
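The definitions above can be illustrated with a minimal brute-force sketch. It is not Apriori and not the authors' code; the helper names `supp` and `frequent_itemsets` are hypothetical, and the data set is the four-transaction example used later in these slides.

```python
from itertools import combinations

# Four-transaction example data set (one set of items per transaction).
D = [{"B", "D"}, {"A", "B", "C", "D"}, {"A", "C", "D"}, {"C"}]

def supp(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_sup):
    """Enumerate every non-empty itemset with support > min_sup (exhaustive, for illustration only)."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = supp(set(combo), transactions)
            if s > min_sup:
                result[frozenset(combo)] = s
    return result

freq = frequent_itemsets(D, min_sup=1)
```

The exhaustive loop visits all 2^|I| - 1 candidates, which is exactly the blow-up that motivates the closed-itemset representation discussed next.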

Weaknesses & Solutions
• The number of frequent itemsets grows quickly as min_sup decreases
  • The complexity of the mining task increases rapidly
• Huge size of the output
  • Complex to analyse
• Closed itemsets are one of the solutions
  • Unique maximal elements of the equivalence classes defined over the lattice of all the frequent itemsets

Weaknesses & Solutions
• Equivalence class
  • Distinct group of frequent itemsets
  • Supported by the same set of transactions
  • Represent the same knowledge
• Vertical bitwise representation of the data set
• Association rules extracted are more meaningful [ZAKI04]
  • Redundancies are removed
• Suitable for dense data sets
  • Frequent closed itemsets are much fewer than frequent itemsets

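Closed Itemsets
• $I \subseteq \mathcal{I}$ is a set of items appearing in D
• $T \subseteq \mathcal{D}$ is a set of transactions in D
• Define two functions:
  $f(T) = \{\, i \in \mathcal{I} \mid \forall t \in T,\ i \in t \,\}$
  $g(I) = \{\, t \in \mathcal{D} \mid \forall i \in I,\ i \in t \,\}$
• Itemset I is closed iff
  $c(I) = f(g(I)) = (f \circ g)(I) = I$
• The function $c = f \circ g$ is called the Galois operator / closure operator

Example data set D:

TID  Items
1    B D
2    A B C D
3    A C D
4    C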
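A minimal sketch of the two functions and of the closure operator on the example data set. The function names mirror the symbols f, g and c of the slide; the dictionary layout is an assumption made for illustration.

```python
# Example data set: TID -> set of items.
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ITEMS = set().union(*D.values())

def g(itemset):
    """TIDs of the transactions that contain every item of `itemset`."""
    return {tid for tid, t in D.items() if set(itemset) <= t}

def f(tids):
    """Items that occur in every transaction whose TID is in `tids`."""
    common = set(ITEMS)
    for tid in tids:
        common &= D[tid]
    return common

def c(itemset):
    """Closure operator c = f o g."""
    return f(g(itemset))

def is_closed(itemset):
    """An itemset is closed iff it equals its own closure."""
    return c(itemset) == set(itemset)
```

For instance, `c({"A"})` evaluates to `{"A", "C", "D"}`, so {A} is not closed, while `is_closed({"B", "D"})` holds.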

Equivalence classes
• Two itemsets belong to the same equivalence class iff
  • They have the same closure
  • They are supported by the same set of transactions
• An itemset I is closed iff
  • No superset of I has the same support

[Figure: lattice of the itemsets of D with their supports. Legend: frequent itemset, frequent closed itemset, support, equivalence class]
Level 0: Ø 4
Level 1: A 2, B 2, C 3, D 3
Level 2: AB 1, AC 2, AD 2, BC 1, BD 2, CD 2
Level 3: ABC 1, ABD 1, ACD 2, BCD 1
Level 4: ABCD 1
Closed itemsets: Ø 4, C 3, D 3, BD 2, ACD 2, ABCD 1
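The equivalence classes of the example can be recovered by grouping itemsets by their closure. This is an exhaustive sketch for illustration only; the helper names `tidset` and `closure` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Example data set: TID -> set of items.
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ITEMS = sorted(set().union(*D.values()))

def tidset(itemset):
    """TIDs of the transactions that contain every item of `itemset`."""
    return {tid for tid, t in D.items() if set(itemset) <= t}

def closure(itemset):
    """Items shared by all the transactions that support `itemset`."""
    tids = tidset(itemset)
    return frozenset(i for i in ITEMS if all(i in D[t] for t in tids))

# Group every non-empty itemset by its closure: one group per equivalence class.
classes = defaultdict(list)
for k in range(1, len(ITEMS) + 1):
    for combo in combinations(ITEMS, k):
        classes[closure(combo)].append("".join(combo))
```

Each key of `classes` is the closed itemset (the unique maximal element) of its class; the class of ACD, for example, contains A, AC, AD, CD and ACD.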

Mining Frequent Closed Itemsets
• Search space browsing
  • Traverse the lattice of frequent itemsets from one equivalence class to another
• Closure computation
  • Compute the closure of frequent itemsets
  • Determine the closed itemsets
• Closure generator:
  • A single representative of an equivalence class
  • All the closed itemsets can be mined by computing the closure of the generator of each class

Browsing the Search Space
• Choose the key patterns (minimal elements) as generators
• Traverse the lattice formed by the key patterns with an Apriori-like algorithm [TAOU00]
• Unfortunately, the same closed itemset can be reached from more than one key pattern

[Figure: itemset lattice showing the single items A 2, B 2, C 3, D 3 and the closed itemsets Ø 4, C 3, D 3, BD 2, ACD 2, ABCD 1]

Browsing the Search Space
• Closure climbing
  • New generators are built as supersets of the closed itemsets discovered so far
  • Jump from one equivalence class to another
  • Cannot ensure that the equivalence class reached has not been visited yet

[Figure: itemset lattice showing the single items A 2, B 2, C 3, D 3 and the closed itemsets Ø 4, C 3, D 3, BD 2, ACD 2, ABCD 1]

Problem of duplicates
• Duplicate checking is needed to avoid generating the same closed itemset more than once
• To avoid useless, expensive closure operations, use the following lemma:
  Given two itemsets X and Y, if $X \subset Y$ and $supp(X) = supp(Y)$ (i.e., $|g(X)| = |g(Y)|$), then $c(X) = c(Y)$
• However, it is still expensive in time and space
  • All the mined closed itemsets need to be kept in main memory
• Several algorithms are forced to adopt a strict lexicographic visiting order of the search space to ensure correct duplicate avoidance
  • CHARM [ZAKI02], CLOSET [PEI00], CLOSET+ [PEI03]
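A minimal sketch of how the lemma is typically applied as a duplicate test, assuming the closed itemsets found so far are kept in a dictionary (the name `closed_so_far` and the function `is_duplicate` are hypothetical). It shows why this approach needs every mined closed itemset in memory.

```python
def is_duplicate(gen, gen_support, closed_so_far):
    """Lemma: if gen is contained in a known closed itemset Y with the same
    support, then c(gen) = c(Y) = Y, so the closure of gen is already known."""
    return any(gen <= y and gen_support == s for y, s in closed_so_far.items())

# Closed itemsets already mined from the example data set, with their supports.
closed_so_far = {frozenset("ACD"): 2, frozenset("BD"): 2}

dup_cd = is_duplicate(frozenset("CD"), 2, closed_so_far)  # CD is inside ACD, same support
dup_c = is_duplicate(frozenset("C"), 3, closed_so_far)    # no closed superset with support 3
```

The scan over `closed_so_far` is the time and space cost mentioned above; the order-preserving test presented later avoids it.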

Computing Closures
• Besides the Galois operator, make use of the lemma:
  Given an itemset X and an item $i \in \mathcal{I}$: $g(X) \subseteq g(i) \Leftrightarrow i \in c(X)$
• Perform the inclusion check for all items in $\mathcal{I}$
  • The check benefits from the vertical representation of the data set as a list of tidlists
• The calculation can be either offline or online
• Offline: compute closures for the entire set of generators
  • Uses key patterns; generators are shorter
• Online: compute the closure of a generator once it is discovered
  • Uses closure climbing; generators are longer
  • Fewer checks for longer generators, hence more efficient

Vertical representation (each column is the tidlist of an item):

Item  A  B  C  D
T1    0  1  0  1
T2    1  1  1  1
T3    1  0  1  1
T4    0  0  1  0
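The inclusion check of the lemma maps directly onto bitwise operations on tidlists. The sketch below assumes each tidlist is packed into a Python integer with bit k (counting from 0 at the right) standing for transaction T(k+1); this packing and the helper names are assumptions for illustration.

```python
# Bitwise vertical data set of the example: one tidlist (bitmap) per item.
VD = {
    "A": 0b0110,  # T2, T3
    "B": 0b0011,  # T1, T2
    "C": 0b1110,  # T2, T3, T4
    "D": 0b0111,  # T1, T2, T3
}
ALL_TIDS = 0b1111  # tidlist of the empty itemset: every transaction

def tidlist(itemset):
    """g(X): intersect (bitwise AND) the tidlists of the items of X."""
    tids = ALL_TIDS
    for item in itemset:
        tids &= VD[item]
    return tids

def closure(itemset):
    """c(X) via the lemma: item i belongs to c(X) iff g(X) is included in g(i)."""
    gx = tidlist(itemset)
    return {i for i, gi in VD.items() if gx & gi == gx}
```

The test `gx & gi == gx` is the inclusion check g(X) ⊆ g(i); for example `closure({"B"})` yields `{"B", "D"}`.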

Handling duplicates
• To identify the unique generator for each equivalence class
  • Define the order-preserving property of a generator
  • Check whether a given generator is order-preserving or not
  • Compute the closure of order-preserving generators only
  • Prune the other generators

Handling duplicates
• Order-preserving property of generators:
  A generator of the form $X = Y \cup i$, where Y is a closed itemset and $i \notin Y$, is said to be order-preserving iff either $c(X) = X$ or $i \prec (c(X) \setminus X)$
• That is, if items need to be added to an order-preserving generator to compute its closure, they must all follow the item i
• Order-preserving generators are introduced to avoid generating the same closed itemset more than once

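Example
• {A} = Ø ∪ {A} is an order-preserving generator
  • $A \prec c(\{A\}) \setminus \{A\} = \{C, D\}$
• {C,D} = {C} ∪ {D} is not an order-preserving generator
  • $D \nprec c(\{C,D\}) \setminus \{C,D\} = \{A\}$

[Figure: itemset lattice showing the single items A 2, B 2, C 3, D 3 and the closed itemsets Ø 4, C 3, D 3, BD 2, ACD 2, ABCD 1]

Item  A  B  C  D
T1    0  1  0  1
T2    1  1  1  1
T3    1  0  1  1
T4    0  0  1  0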

Handling duplicates
• We need to check whether a generator is order-preserving or not
• Define a set called pre-set(gen) for a generator $gen = Y \cup i$:
  $\text{pre-set}(gen) = \{\, j \mid j \in \mathcal{I},\ j \notin gen,\ \text{and } j \prec i \,\}$
• We can now check whether a generator is order-preserving by checking:
  $\exists\, j \in \text{pre-set}(gen)$ such that $g(gen) \subseteq g(j)$
• If yes, then gen is not order-preserving
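The pre-set test can be sketched on the example data set as follows. The bit packing (bit k stands for transaction T(k+1)), the lexicographic item order and the helper names are assumptions for illustration.

```python
ORDER = ["A", "B", "C", "D"]  # assumed item order: A before B before C before D
VD = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}  # tidlists as bitmaps
ALL_TIDS = 0b1111

def tidlist(itemset):
    """g(X): bitwise AND of the tidlists of the items of X."""
    tids = ALL_TIDS
    for item in itemset:
        tids &= VD[item]
    return tids

def pre_set(closed, i):
    """Items that precede i in the order and do not belong to gen = closed ∪ {i}."""
    gen = set(closed) | {i}
    return [j for j in ORDER if j not in gen and ORDER.index(j) < ORDER.index(i)]

def is_order_preserving(closed, i):
    """gen = closed ∪ {i} is NOT order-preserving iff g(gen) ⊆ g(j) for some j in pre-set(gen)."""
    gen_tids = tidlist(set(closed) | {i})
    return not any(gen_tids & VD[j] == gen_tids for j in pre_set(closed, i))
```

With these definitions `is_order_preserving({"C"}, "D")` is false, because g({C,D}) is included in g(A), which matches the {C} ∪ {D} example of the previous slide. No previously mined closed itemset has to be consulted.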

Handling duplicates
• The goal is to compute the closure of order-preserving generators only
• For any closed itemset Y, there exists a sequence of order-preserving generators
• Closure climbing is used to climb a sequence of closed itemsets and reach Y
• For each closed itemset Y, the sequence of order-preserving generators is unique

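Handling duplicates
• Example: $Y = \{A, B, C, D\}$
  • $Y_0 = c(\emptyset)$, $gen_0 = \{A\}$
  • $Y_1 = c(gen_0) = \{A, C, D\}$, $gen_1 = \{A, C, D\} \cup \{B\}$
  • $Y = c(gen_1) = \{A, B, C, D\}$
  • Note: $Y_0 \subset Y_1 \subset Y$

[Figure: lattice path through A 2, AC 2, ACD 2 and ABCD 1; generators {A} and {A,C,D} ∪ {B}]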

The DCI_CLOSED Algorithm
• Two different types of data sets
  • Dense & sparse
• Dense data sets
  • Transactions are long
  • Contain strongly correlated items
• In sparse data sets, the number of closed itemsets may be nearly equal to the number of frequent itemsets
  • Mining closed itemsets becomes more expensive
• The algorithm is separated into two parts
  • DCI_CLOSEDs() & DCI_CLOSEDd()

The DCI_CLOSED Algorithm
• Discriminate between sparse and dense data sets:
  • Scan the data set to find the frequent single items $F_1 \subseteq \mathcal{I}$
  • Build the bitwise vertical data set VD
  • Items are sorted by increasing frequency
• Decide whether a data set is sparse or dense
  • If the percentage of 1s is large
  • If a large set of items is strongly correlated
  • Compute the percentage of the most frequent items that co-occur in the same transaction

[Figure: rows of the bitwise vertical data set]
A  1010111
B  0101101
E  0101100
…

The DCI_CLOSED Algorithm
• 3 input parameters:
  • CLOSED_SET = c(Ø), PRE_SET = Ø, POST_SET = F1 \ c(Ø)
• Get an item i from POST_SET (the minimum in the order)
• Add i to CLOSED_SET to build new_gen (closure climbing)
• Check the validity of generator new_gen against PRE_SET
• Compute the closure of new_gen using Lemma 2 for CLOSED_SET
  • A new closed set is generated from new_gen

The DCI_CLOSED Algorithm
• Use PRE_SET to check the validity of new_gen
  • Guarantees that duplicate generators are correctly pruned
• POST_SET is used to guarantee that generators are produced according to Theorem 1
  • POST_SET contains the items j that follow i in lexicographic order and are not yet included in CLOSED_SET
  $\mathrm{POST\_SET}_{new} = \{\, j \in \mathrm{POST\_SET} \mid i \prec j \text{ and } j \notin X \,\}$
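The steps of the last two slides can be put together in a compact recursive sketch. It follows the slides (pick the minimum item of POST_SET, build new_gen, test it against PRE_SET, compute the closure with the inclusion check) but it is not the authors' implementation: the function names, the integer bitmap packing, the point at which i is added to PRE_SET, and the use of a strict `> min_sup` threshold are assumptions. The closure of the empty set is used as the starting point and is not itself reported.

```python
def dci_closed(VD, n_transactions, min_sup):
    """Return {closed itemset: support} for closed itemsets with support > min_sup.
    VD maps each item to its tidlist packed in an int (bit k = transaction k+1)."""
    all_tids = (1 << n_transactions) - 1
    order = sorted(VD)  # assumed item order: lexicographic

    def count(tids):
        return bin(tids).count("1")

    results = {}

    def mine(closed_set, closed_tids, pre_set, post_set):
        pre_set, post_set = list(pre_set), list(post_set)
        while post_set:
            i = post_set.pop(0)                    # minimum item of POST_SET
            gen_tids = closed_tids & VD[i]         # g(new_gen), new_gen = CLOSED_SET ∪ {i}
            if count(gen_tids) <= min_sup:
                continue                           # infrequent generator
            if any(gen_tids & VD[j] == gen_tids for j in pre_set):
                continue                           # not order-preserving: prune
            new_closed, new_post = closed_set | {i}, []
            for j in post_set:
                if gen_tids & VD[j] == gen_tids:   # g(new_gen) ⊆ g(j): j is in the closure
                    new_closed = new_closed | {j}
                else:
                    new_post.append(j)
            results[frozenset(new_closed)] = count(gen_tids)
            mine(new_closed, gen_tids, pre_set, new_post)
            pre_set.append(i)

    root = frozenset(i for i in order if VD[i] == all_tids)  # c(Ø)
    f1 = [i for i in order if i not in root and count(VD[i]) > min_sup]
    mine(root, all_tids, [], f1)
    return results

# Example data set of the slides (T1 is the rightmost bit).
VD = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
closed = dci_closed(VD, n_transactions=4, min_sup=0)
```

On the example data set the sketch visits the generators in the same order as the running example: {A}, {A,C,D} ∪ {B}, {B}, {B,D} ∪ {C} (pruned), {C}, {C} ∪ {D} (pruned), {D}.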

Running example of DCI_CLOSEDd()
• CLOSED_SET = c(Ø) = Ø, PRE_SET = Ø, POST_SET = {A,B,C,D}
• Compute the closure of generator gen = Ø ∪ {A} = {A}
  • Check against PRE_SET: gen is order-preserving
  • Check whether g(A) ⊆ g(j), ∀j ∈ POST_SET
  • If yes, include j in CLOSED_SET

[Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2; generator = {A}; bitwise vertical data set of D]


• CLOSED_SET = {A,C,D}, PRE_SET = Ø, POST_SET = {B}
• New generator gen = {A,C,D} ∪ {B} = {A,B,C,D}
  • Check against PRE_SET: gen is order-preserving
• gen is closed since POST_SET is empty
• Note: {A,C,D} ⊂ {A,B,C,D}; the items need not be added in order

[Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1; generator = {A,C,D} ∪ {B}; bitwise vertical data set of D]


• gen = Ø ∪ {B}, PRE_SET = {A}, POST_SET = {C,D}
• gen is order-preserving, as shown by the check against g(A)
• Checking g(B) against g(C) and g(D) gives c(B) = {B,D}
• {B,D} is closed, as shown by the check against POST_SET

[Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2; generator = {B}; bitwise vertical data set of D]


• CLOSED_SET = {B,D}, PRE_SET = {A}, POST_SET = {C}
• gen is now {B,D} ∪ {C} = {B,C,D}
• Check g({B,C,D}) against g(A): g({B,C,D}) ⊂ g(A)
  • gen is not order-preserving and can be pruned together with all its possible extensions

[Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, BCD 1; generator = {B,D} ∪ {C}; bitwise vertical data set of D]


• gen = Ø ∪ {C}, PRE_SET = {A,B}, POST_SET = {D}
• gen is order-preserving, as shown by the checks against g(A) and g(B)
• The check against POST_SET shows that gen cannot be extended, so it is closed

[Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, BCD 1, C 3; generator = {C}; bitwise vertical data set of D]


• CLOSED_SET = {C}, PRE_SET = {A,B}, POST_SET = {D}
• gen is now {C} ∪ {D} = {C,D}
• Check g({C,D}) against g(A): g({C,D}) ⊆ g(A)
  • gen is not order-preserving and can be pruned without considering its possible extensions

[Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, BCD 1, C 3, CD 2; generator = {C} ∪ {D}; bitwise vertical data set of D]


• gen = Ø ∪ {D}, PRE_SET = {A,B,C}, POST_SET = Ø
• gen is order-preserving, as shown by the checks against g(A), g(B) and g(C)
• gen cannot be extended because POST_SET is empty, so it is closed

[Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, BCD 1, C 3, CD 2, D 3; bitwise vertical data set of D]

Optimizations
• The vertical data set (frequent single items) is represented by a bitmap matrix VD of size M×N
  • VD(i,j) = 1 when frequent item i occurs in transaction j
  • Row i of the matrix represents g(i), the tidlist of item i
• Optimize the bitwise AND operations for
  • tidlist intersections
  • Inclusion checks
• 3 optimization techniques

Optimizations
• Data Set Projection (projection)
  • For every closed itemset Z discovered from a closed itemset X, g(Z) is a subset of g(X)
  • Delete from VD all the columns corresponding to transactions not occurring in g(X)
  • This process is limited to the generators of the 1st level of recursion, since it is expensive
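A minimal sketch of the projection idea on integer bitmaps: only the columns (transactions) whose bit is set in g(X) are kept and re-packed contiguously. The helper name `project` and the packing (bit k stands for transaction T(k+1)) are assumptions for illustration, not the authors' data layout.

```python
def project(VD, gx):
    """Keep only the columns of VD whose bit is set in gx, packed densely."""
    kept = [k for k in range(gx.bit_length()) if (gx >> k) & 1]
    projected = {}
    for item, row in VD.items():
        packed = 0
        for new_pos, old_pos in enumerate(kept):
            packed |= ((row >> old_pos) & 1) << new_pos
        projected[item] = packed
    return projected

# Example data set of the slides; g(A) = {T2, T3}.
VD = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
VD_A = project(VD, VD["A"])
```

After projecting on g(A), every tidlist is two bits wide, so later intersections and inclusion checks in that branch touch less data. The re-packing loop itself is the cost that explains why the slides restrict projection to the first level of recursion.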

Optimizations
• Data Sets with Highly Correlated Items (section eq)
  • The columns of VD are reordered to profit from data correlation
  • Maximize the submatrix VE of VD in which all rows and columns are identical
  • VE is likely to be large and to include the most frequent items
  • Many frequent itemsets can be mined within VE

Before reordering:

     T1  T2  T3  T4
A    0   1   0   1
B    1   1   1   1
C    1   1   0   1
D    0   1   0   1

After reordering:

     T2  T4  T1  T3
A    1   1   0   0
B    1   1   1   1
C    1   1   1   0
D    1   1   0   0

Optimizations
• Reusing Results of Previous Bitwise Intersections (included)
  • To check whether an itemset X is closed, compare X with its PRE_SET
  • The check tests whether g(X) ⊆ g(j) for each item j
  • A large part of g(X) may be included in g(j)
  • If $g_h(X) \subseteq g_h(j)$, then $g_h(X \cup Y) \subseteq g_h(j)$ as well
  • The check of the various g(j) can be limited to the part complementary to $g_h(j)$

[Figure: tidlists g(j), g(X) and g(X ∪ Y); the portion h is already known to be included, and only the remaining portion is checked]

Optimizations
• Actual number of bitwise AND operations vs. support threshold
• The optimizations "section eq" and "included" are the most effective

Performance Analysis
• Competitors: FP-CLOSE [GRAH03], CLOSET+ [PEI03]
• Environment: Windows XP, Pentium IV 2.8GHz, 512MB
• Sparse & dense data sets

Dataset       Items   Avg. trans. size   Transactions
T40I10D100K   1000    40                 100000
Retail        16471   13                 88162
Chess         76      37                 3196
Pumsb         7117    74                 49046

Performance Analysis
• Data sets: T40I10D100K, Retail
• DCI_CLOSED is faster by one order of magnitude

Performance Analysis
• Data sets: Chess, Pumsb

Performance Analysis
• Time efficiency of duplicate checking
• Speedup of up to a factor of six when support thresholds are small

[Figure: plots for the chess data set]

References
• [GRAH03] G. Grahne and J. Zhu, “Efficiently Using Prefix-Trees in Mining Frequent Itemsets,” Proc. ICDM Workshop Frequent Itemset Mining Implementations, Dec. 2003.
• [PEI00] J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,” Proc. ACM SIGMOD Int’l Workshop Data Mining and Knowledge Discovery, May 2000.
• [PEI03] J. Pei, J. Han, and J. Wang, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, Aug. 2003.
• [TAOU00] R. Taouil, N. Pasquier, Y. Bastide, L. Lakhal, and G. Stumme, “Mining Frequent Patterns with Counting Inference,” SIGKDD Explorations, vol. 2, no. 2, Dec. 2000.
• [ZAKI02] M.J. Zaki and C.-J. Hsiao, “Charm: An Efficient Algorithm for Closed Itemsets Mining,” Proc. Second SIAM Int’l Conf. Data Mining, Apr. 2002.
• [ZAKI04] M.J. Zaki, “Mining Non-Redundant Association Rules,” Data Mining and Knowledge Discovery, vol. 9, no. 3, pp. 223-248, 2004.
