Fast and Memory Efficient Mining of Frequent Closed Itemsets
Claudio Lucchese, Salvatore Orlando, Raffaele Perego
DB group seminar
Presenter: Leonidas
Abstract
• Frequent Itemsets Mining
• Closed Itemsets
• Mining Frequent Closed Itemsets
• Handling duplicates
• Brief introduction of the algorithm
• Experimental results
Frequent Itemsets Mining
• Given a set of items I and a set of transactions D
  • Each transaction t in D is a set of items from I
  • An itemset is a set of items from I
• Support of a k-itemset I, supp(I): the number of transactions in D that include I
• Discover all the itemsets from I with support > min_supp
• Well-known algorithm: Apriori
  • Discovers frequent itemsets
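The definitions above can be illustrated with a minimal sketch. It borrows the four-transaction data set that the slides use as their running example, enumerates itemsets by brute force (this is not Apriori), and assumes an inclusive threshold (supp ≥ min_supp). The names `DATASET`, `supp`, and `frequent_itemsets` are illustrative only.

```python
from itertools import combinations

# Running example from the slides: TID -> set of items.
DATASET = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}


def supp(itemset, dataset=DATASET):
    """Number of transactions that include every item of `itemset`."""
    return sum(1 for t in dataset.values() if set(itemset) <= t)


def frequent_itemsets(min_supp, dataset=DATASET):
    """Brute-force enumeration of all non-empty itemsets with supp >= min_supp.

    Assumption: the threshold is inclusive. This is exponential in the number
    of items and only meant to make the definition concrete.
    """
    items = sorted(set().union(*dataset.values()))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = supp(combo, dataset)
            if s >= min_supp:
                result["".join(combo)] = s
    return result
```

With `min_supp = 2` the sketch keeps nine itemsets; lowering the threshold to 1 keeps all fifteen non-empty itemsets, which shows how quickly the output grows.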
Weaknesses & Solutions
• The number of frequent itemsets grows quickly as min_supp decreases
  • Complexity of the mining task increases rapidly
  • Huge size of output
  • Complex for analysis
• Closed itemsets are one of the solutions
  • Unique maximal elements of the equivalence classes defined over the lattice of all the frequent itemsets
Weaknesses & Solutions
• Equivalence class
  • Distinct group of frequent itemsets
  • Supported by the same set of transactions
  • Represent the same knowledge
• Vertical bitwise representation of the data set
• Association rules extracted are more meaningful [ZAKI04]
  • Redundancies are removed
• Suitable for dense data sets
  • Frequent closed itemsets are much fewer than frequent itemsets
Closed Itemsets
• I is a subset of the items appearing in D
• T is a subset of the transactions in D
• Define two functions:
  $f(T) = \{ i \in \mathcal{I} \mid \forall t \in T,\ i \in t \}$
  $g(I) = \{ t \in D \mid \forall i \in I,\ i \in t \}$
• Itemset I is closed iff
  $c(I) = f(g(I)) = f \circ g(I) = I$
• The function $c = f \circ g$ is called the Galois operator / closure operator

TID  Items
1    B D
2    A B C D
3    A C D
4    C
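A small sketch of the two functions and the closure operator, evaluated on the four-transaction table above. The function names follow the slide's f, g, and c; `DATASET`, `ITEMS`, and `is_closed` are names introduced here for illustration.

```python
# Example data set from the slide: TID -> set of items.
DATASET = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ITEMS = frozenset("ABCD")


def g(itemset):
    """Transactions (by TID) that contain every item of `itemset`."""
    return {tid for tid, t in DATASET.items() if set(itemset) <= t}


def f(tids):
    """Items that occur in every transaction of `tids` (all items if `tids` is empty)."""
    items = set(ITEMS)
    for tid in tids:
        items &= DATASET[tid]
    return items


def c(itemset):
    """Closure operator c = f o g."""
    return f(g(itemset))


def is_closed(itemset):
    """An itemset is closed iff it equals its own closure."""
    return c(itemset) == set(itemset)
```

For instance, `c({"A"})` evaluates to `{"A", "C", "D"}` because A occurs only in transactions 2 and 3, and both also contain C and D; hence {A} is not closed while {A, C, D} is.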
Equivalence classes
• Two itemsets belong to the same equivalence class iff
  • They have the same closure
  • They are supported by the same set of transactions
• An itemset I is closed iff
  • No superset of I has the same support

[Figure: lattice of itemsets for the example data set, each node labeled with its support; legend: frequent closed itemset, frequent itemset, support, equivalence class]
Ø 4
A 2   B 2   C 3   D 3
AB 1  AC 2  AD 2  BC 1  BD 2  CD 2
ABC 1  ABD 1  ACD 2  BCD 1
ABCD 1
Closed itemsets highlighted in the figure: Ø (4), C (3), D (3), BD (2), ACD (2), ABCD (1)
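The grouping into equivalence classes can be reproduced with a brute-force sketch: itemsets are grouped by the set of transactions that support them, and the longest member of each group is the closed itemset. This is only an illustration on the example data set, not how DCI_CLOSED works; all names are hypothetical, and the empty itemset is left out.

```python
from collections import defaultdict
from itertools import combinations

# Example data set from the slides: TID -> set of items.
DATASET = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}


def tidset(itemset):
    """Set of TIDs of the transactions that contain `itemset`."""
    return frozenset(tid for tid, t in DATASET.items() if set(itemset) <= t)


def equivalence_classes(min_supp=1):
    """Group all frequent non-empty itemsets by their supporting transactions."""
    items = sorted(set().union(*DATASET.values()))
    classes = defaultdict(list)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            tids = tidset(combo)
            if len(tids) >= min_supp:
                classes[tids].append("".join(combo))
    return dict(classes)


def closed_itemsets(min_supp=1):
    """The closed itemset of each class is its unique maximal (longest) member."""
    return {
        max(members, key=len): len(tids)
        for tids, members in equivalence_classes(min_supp).items()
    }
```

On the example, the class supported by transactions {2, 3} contains A, AC, AD, CD, and ACD, and its closed itemset is ACD.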
Mining Frequent Closed Itemsets
• Search space browsing
  • Traverse the lattice of frequent itemsets from one equivalence class to another
• Closure computation
  • Compute the closure of frequent itemsets
  • Determine the closed itemsets
• Closure generator:
  • A single representative of an equivalence class
  • All the closed itemsets can be mined by computing the closure of the generator of each class
Browsing the Search Space
• Choose the key patterns (minimal elements) as generators
• Traverse the lattice formed by key patterns with an Apriori-like algorithm [TAOU00]
• Unfortunately, the same closed itemset can be reached from more than one key pattern
[Figure: itemset lattice of the example data set with supports; closed itemsets Ø (4), C (3), D (3), BD (2), ACD (2), ABCD (1)]
Browsing the Search Space
• Closure climbing
  • New generators are built as supersets of the closed itemsets discovered so far
  • Jump from one equivalence class to another
  • Cannot ensure that the equivalence class reached has not been visited already
[Figure: itemset lattice of the example data set with supports; closed itemsets Ø (4), C (3), D (3), BD (2), ACD (2), ABCD (1)]
Problem of duplicates
• Duplicate checking is needed to avoid generating the same closed itemset more than once
• To avoid useless, expensive closure operations, use the following lemma:
  Given two itemsets X and Y, if $X \subset Y$ and $supp(X) = supp(Y)$ (i.e., $|g(X)| = |g(Y)|$), then $c(X) = c(Y)$
• However, it is still expensive in time and space
  • All the mined closed itemsets need to be kept in main memory
• Several algorithms are forced to adopt a strict lexicographic visiting order of the search space to ensure correct duplicate avoidance
  • CHARM [ZAKI02], CLOSET [PEI00], CLOSET+ [PEI03]
Computing Closures
• Besides the Galois operator, make use of the lemma:
  Given an itemset X and an item $i \in \mathcal{I}$: $g(X) \subseteq g(i) \Leftrightarrow i \in c(X)$
• Perform an inclusion check for all items in I
• The check benefits from the vertical representation of the data set as a list of tidlists
• Calculation can be either offline or online
• Offline: compute closures for the entire set of generators
  • Uses key patterns; generators are shorter
• Online: compute the closure of each generator as it is discovered
  • Uses closure climbing; generators are longer
  • Fewer checks for longer generators, more efficient

Bitmap of the example data set (each column is the tidlist of an item):
Item  A B C D
T1    0 1 0 1
T2    1 1 1 1
T3    1 0 1 1
T4    0 0 1 0
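The inclusion check of the lemma maps naturally onto bitwise operations. The sketch below stores each tidlist from the bitmap as a Python integer (bit k set when transaction T(k+1) contains the item, an encoding chosen here for illustration) and computes a closure by testing g(X) ⊆ g(i) for every item i.

```python
# Tidlists as bitmasks; bit 0 = T1, bit 1 = T2, bit 2 = T3, bit 3 = T4.
TIDLISTS = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ALL_TIDS = 0b1111


def g(itemset):
    """Tidlist of an itemset: bitwise AND of the tidlists of its items."""
    tids = ALL_TIDS
    for item in itemset:
        tids &= TIDLISTS[item]
    return tids


def included(a, b):
    """True when tidlist `a` is a subset of tidlist `b`."""
    return (a & b) == a


def closure(itemset):
    """c(X) = all items i with g(X) included in g(i)."""
    tids = g(itemset)
    return {i for i, tidlist in TIDLISTS.items() if included(tids, tidlist)}
```

Each inclusion test costs one AND and one comparison per machine word of the tidlist, which is why the vertical bitwise layout suits this step.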
Handling duplicates
• Identify the unique generator for each equivalence class
• Define the order-preserving property of generators
• Check whether a given generator is order-preserving or not
• Compute the closure of order-preserving generators only
• Prune the other generators
Handling duplicates
• Order-preserving property of generators:
  A generator of the form $X = Y \cup i$, where Y is a closed itemset and $i \notin Y$, is said to be order-preserving iff either $c(X) = X$ or $i \prec (c(X) \setminus X)$
• That is, any items that must be added to an order-preserving generator to obtain its closure have to follow item i in the order
• Order-preserving generators are introduced to avoid generating the same closed itemset more than once
Example
• {A} = Ø ∪ {A} is an order-preserving generator
  • $c(\{A\}) \setminus \{A\} = \{C, D\}$ and $A \prec \{C, D\}$
• {C, D} = {C} ∪ {D} is not order-preserving
  • $c(\{C, D\}) \setminus \{C, D\} = \{A\}$ and $D \nprec \{A\}$
[Figure: itemset lattice and bitmap of the example data set]
Handling duplicates
• We need to check whether a generator is order-preserving or not
• Define a set called pre-set(gen) for a generator $gen = Y \cup i$:
  $\text{pre-set}(gen) = \{ j \mid j \in \mathcal{I},\ j \notin gen,\ \text{and } j \prec i \}$
• We can now check whether a generator is order-preserving by checking whether
  $\exists\, j \in \text{pre-set}(gen)$ such that $g(gen) \subseteq g(j)$
• If yes, then gen is not order-preserving
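A minimal sketch of this check on the example data set, assuming lexicographic item order A ≺ B ≺ C ≺ D and bitmask tidlists (bit k set when transaction T(k+1) contains the item). `ORDER`, `pre_set`, and `is_order_preserving` are names introduced here, not taken from the slides.

```python
# Tidlists as bitmasks; bit 0 = T1, ..., bit 3 = T4.
TIDLISTS = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ALL_TIDS = 0b1111
ORDER = "ABCD"  # assumed item order


def g(itemset):
    """Tidlist of an itemset: AND of the tidlists of its items."""
    tids = ALL_TIDS
    for item in itemset:
        tids &= TIDLISTS[item]
    return tids


def pre_set(closed_set, i):
    """Items that precede i in the order and are not in gen = closed_set U {i}."""
    gen = set(closed_set) | {i}
    return [j for j in ORDER if j not in gen and ORDER.index(j) < ORDER.index(i)]


def is_order_preserving(closed_set, i):
    """gen is NOT order-preserving if g(gen) is included in g(j) for some j in pre-set(gen)."""
    gen_tids = g(set(closed_set) | {i})
    return not any(
        (gen_tids & TIDLISTS[j]) == gen_tids for j in pre_set(closed_set, i)
    )
```

For example, `is_order_preserving({"C"}, "D")` is `False`: g({C, D}) is included in g(A), and A precedes D, so the closure of {C, D} would pull in an item that comes before D.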
Handling duplicates
• The goal is to compute the closure of order-preserving generators only
• For any closed itemset Y, there exists a sequence of order-preserving generators
• Closure climbing follows a sequence of closed itemsets until it reaches Y
• For each closed itemset Y, the sequence of order-preserving generators is unique
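Handling duplicates
• Example: $Y = \{A, B, C, D\}$
  • $Y_0 = c(\emptyset) = \emptyset$, $gen_0 = \emptyset \cup \{A\}$
  • $Y_1 = c(gen_0) = \{A, C, D\}$, $gen_1 = \{A, C, D\} \cup \{B\}$
  • $Y = c(gen_1) = \{A, B, C, D\}$
  • Note: $Y_0 \subset Y_1 \subset Y$
[Figure: lattice nodes A (2), AC (2), ACD (2), ABCD (1), with generators Ø ∪ {A} and {A, C, D} ∪ {B}]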
The DCI_CLOSED Algorithm
• Two different types of data sets
  • Dense & sparse
• Dense data sets
  • Transactions are long
  • Contain strongly correlated items
• In sparse data sets, the number of closed itemsets may be nearly equal to the number of frequent itemsets
  • Mining closed itemsets becomes more expensive
• The algorithm is separated into two parts
  • DCI_CLOSEDs() & DCI_CLOSEDd()
The DCI_CLOSED Algorithm
• Discriminate between sparse and dense data sets:
  • Scan the data set to find the frequent single items F1 ⊆ I
  • Build the bitwise vertical data set VD
  • Items are sorted by increasing frequency
• Decide whether a data set is sparse or dense
  • If the percentage of 1s is large
  • If a large set of items is strongly correlated
  • Compute the percentage of the most frequent items that co-occur in the same transactions
[Figure: example rows of VD]
A 1010111
B 0101101
E 0101100
…
The DCI_CLOSED Algorithm
• 3 input parameters, initially:
  • CLOSED_SET = c(Ø), PRE_SET = Ø, POST_SET = F1 \ c(Ø)
• Get an item i from POST_SET (the minimum in the order)
• Add i to CLOSED_SET to build new_gen (closure climbing)
• Check the validity of generator new_gen against PRE_SET
• Compute the closure of new_gen using Lemma 2 for CLOSED_SET
  • A new closed set is generated from new_gen
The DCI_CLOSED Algorithm
• Use PRE_SET to check the validity of new_gen
  • Guarantees that duplicate generators are correctly pruned
• POST_SET is used to guarantee that generators are produced according to Theorem 1
• POST_SET contains the items j that follow i in lexicographic order and are not yet included in CLOSED_SET
  $POST\_SET_{new} = \{ j \in POST\_SET \mid i \prec j \text{ and } j \notin X \}$
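The steps on the last two slides can be put together in a short recursive sketch. It is reconstructed from the slides' description only and is not the authors' implementation: it assumes bitmask tidlists for the four-transaction example, lexicographic item order, c(Ø) = Ø, and an inclusive support threshold, and it omits all optimizations. The names `dci_closed`, `mine`, and `included` are introduced here.

```python
# Tidlists of the example data set as bitmasks; bit 0 = T1, ..., bit 3 = T4.
TIDLISTS = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ALL_TIDS = 0b1111


def included(a, b):
    """True when tidlist `a` is a subset of tidlist `b`."""
    return (a & b) == a


def dci_closed(closed_set, tids, pre_set, post_set, min_supp, out):
    """Closure climbing with PRE_SET / POST_SET; appends (itemset, support) to `out`."""
    pre_set, post_set = list(pre_set), list(post_set)  # local copies per call
    while post_set:
        i = post_set.pop(0)                    # minimum item of POST_SET
        new_tids = tids & TIDLISTS[i]          # g(new_gen), new_gen = CLOSED_SET U {i}
        if bin(new_tids).count("1") < min_supp:
            continue                           # infrequent generator
        if any(included(new_tids, TIDLISTS[j]) for j in pre_set):
            continue                           # not order-preserving: prune
        closed_new, post_new = set(closed_set) | {i}, []
        for j in post_set:                     # closure via the inclusion lemma
            if included(new_tids, TIDLISTS[j]):
                closed_new.add(j)
            else:
                post_new.append(j)
        out.append(("".join(sorted(closed_new)), bin(new_tids).count("1")))
        dci_closed(closed_new, new_tids, pre_set, post_new, min_supp, out)
        pre_set.append(i)
    return out


def mine(min_supp=1):
    """Start from CLOSED_SET = c(Ø) = Ø (true for this data set) and POST_SET = F1."""
    f1 = [i for i in "ABCD" if bin(TIDLISTS[i]).count("1") >= min_supp]
    return dci_closed(set(), ALL_TIDS, [], f1, min_supp, [])
```

Calling `mine(1)` yields ACD (2), ABCD (1), BD (2), C (3), and D (3), in that order; the generators {B, D} ∪ {C} and {C} ∪ {D} are pruned by the PRE_SET check. No list of already-mined closed itemsets is kept, which is the memory saving the slides emphasize. The sketch does not report the empty itemset.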
Running example of DCI_CLOSEDd()
• CLOSED_SET = c(Ø) = Ø, PRE_SET = Ø, POST_SET = {A, B, C, D}
• Compute the closure of generator gen = Ø ∪ {A} = {A}
  • Check against PRE_SET: gen is order-preserving
  • Check whether g(A) ⊆ g(j), ∀j ∈ POST_SET
  • If yes, include j in CLOSED_SET
[Figure: lattice nodes Ø (4), A (2), AC (2), ACD (2); generator = Ø ∪ {A}; bitmap of the example data set]
Running example of DCI_CLOSEDd()
• CLOSED_SET = {A, C, D}, PRE_SET = Ø, POST_SET = {B}
• New generator gen = {A, C, D} ∪ {B} = {A, B, C, D}
  • Check against PRE_SET: gen is order-preserving
• gen is closed since POST_SET is now empty
• Note: {A, C, D} → {A, B, C, D}; the items need not be added in order
[Figure: lattice nodes Ø (4), A (2), AC (2), ACD (2), ABCD (1); generator = {A, C, D} ∪ {B}; bitmap of the example data set]
Running example of DCI_CLOSEDd()
• gen = Ø ∪ {B}, PRE_SET = {A}, POST_SET = {C, D}
• gen is order-preserving, as shown by checking g(B) against g(A)
• Checking g(B) against g(C) and g(D) gives c(B) = {B, D}
• {B, D} is closed, as shown by checking against POST_SET
[Figure: lattice nodes Ø (4), A (2), AC (2), ACD (2), ABCD (1), B (2), BD (2); generator = Ø ∪ {B}; bitmap of the example data set]
Running example of DCI_CLOSEDd()
• CLOSED_SET = {B, D}, PRE_SET = {A}, POST_SET = {C}
• gen is now {B, D} ∪ {C} = {B, C, D}
• Check g({B, C, D}) against g(A): g({B, C, D}) ⊂ g(A)
• gen is not order-preserving and can be pruned together with all its possible extensions
[Figure: lattice nodes Ø (4), A (2), AC (2), ACD (2), ABCD (1), B (2), BD (2), BCD (1); generator = {B, D} ∪ {C}; bitmap of the example data set]
Running example of DCI_CLOSEDd()
• gen = Ø ∪ {C}, PRE_SET = {A, B}, POST_SET = {D}
• gen is order-preserving, as shown by checking against g(A) and g(B)
• gen cannot be extended with any item of POST_SET, so it is closed
[Figure: lattice nodes Ø (4), A (2), AC (2), ACD (2), ABCD (1), B (2), BD (2), BCD (1), C (3); generator = Ø ∪ {C}; bitmap of the example data set]
Running example of DCI_CLOSEDd()
• CLOSED_SET = {C}, PRE_SET = {A, B}, POST_SET = {D}
• gen is now {C} ∪ {D} = {C, D}
• Check g({C, D}) against g(A): g({C, D}) ⊆ g(A)
• gen is not order-preserving and can be pruned without considering its possible extensions
[Figure: lattice nodes Ø (4), A (2), AC (2), ACD (2), ABCD (1), B (2), BD (2), BCD (1), C (3), CD (2); generator = {C} ∪ {D}; bitmap of the example data set]
Running example of DCI_CLOSEDd()
• gen = Ø ∪ {D}, PRE_SET = {A, B, C}, POST_SET = Ø
• gen is order-preserving, as shown by checking against g(A), g(B), and g(C)
• gen cannot be extended because POST_SET is empty, so it is closed
[Figure: lattice nodes Ø (4), A (2), AC (2), ACD (2), ABCD (1), B (2), BD (2), BCD (1), C (3), CD (2), D (3); generator = Ø ∪ {D}; bitmap of the example data set]
Optimizations
• The vertical data set (frequent single items) is represented by a bitmap matrix VD of size M×N
  • VD(i, j) = 1 iff frequent item i occurs in transaction j
  • Row i of the matrix represents g(i), the tidlist of item i
• Optimize the bitwise AND operations used for
  • tidlist intersections
  • Inclusion checks
• 3 optimization techniques
Optimizations
• Data Set Projection (projection)
  • Consider the closed itemsets Z discovered starting from a closed set X
  • g(Z) is a subset of g(X)
  • Delete from VD all the columns corresponding to transactions not occurring in g(X)
  • This process is limited to generators of the first level of recursion, since it is expensive
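As a rough illustration of the projection idea, the sketch below drops from each tidlist the bit positions of transactions outside g(X) and packs the remaining bits together. It uses integer bitmasks for the example data set (bit k set when transaction T(k+1) contains the item); `project` and `project_dataset` are hypothetical helpers, and the actual layout used by DCI_CLOSED is not described in the slides.

```python
# Tidlists of the example data set as bitmasks; bit 0 = T1, ..., bit 3 = T4.
TIDLISTS = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}


def project(tidlist, keep):
    """Keep only the bit positions set in `keep`, packed into the low-order bits."""
    out = 0
    pos = 0   # next free position in the projected tidlist
    bit = 0   # current position in the original tidlist
    while (keep >> bit) != 0:
        if (keep >> bit) & 1:
            out |= ((tidlist >> bit) & 1) << pos
            pos += 1
        bit += 1
    return out


def project_dataset(tidlists, keep):
    """Project every row of the vertical data set onto the transactions in `keep`."""
    return {item: project(tl, keep) for item, tl in tidlists.items()}
```

Projecting onto g(A) = {T2, T3} leaves two-bit tidlists; intersections computed on the projected rows give the same supports for every itemset that contains A, while touching fewer bits.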
Optimizations
• Data Sets with Highly Correlated Items (section eq)
  • Columns of VD are reordered to profit from data correlation
  • Maximize the submatrix VE of VD in which all rows are identical
  • VE is likely to be large and to include the most frequent items
  • Many frequent itemsets can be mined within VE

Before reordering:
    T1 T2 T3 T4
A   0  1  0  1
B   1  1  1  1
C   1  1  0  1
D   0  1  0  1

After reordering:
    T2 T4 T1 T3
A   1  1  0  0
B   1  1  1  1
C   1  1  1  0
D   1  1  0  0
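Optimizations
• Reusing Results of Previous Bitwise Intersections (included)
  • To check whether an itemset X is closed, compare g(X) with g(j) for the items j in its PRE_SET
  • The check tests whether g(X) ⊆ g(j) for each j
  • Even when the inclusion fails, a large part of g(X) may be included in g(j)
  • Let h be a portion of the tidlist on which gh(X) ⊆ gh(j); then gh(X ∪ Y) ⊆ gh(j) as well
  • For X ∪ Y, the check against each g(j) can be limited to the part complementary to h
[Figure: tidlists g(j), g(X), and g(X ∪ Y), showing the portion h and the part still to be checked]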
Optimizations
• Actual number of bitwise AND operations vs. support threshold
• Optimizations “section eq” & “included” are the most effective
Performance Analysis
• Competitors: FP-CLOSE [GRAH03], CLOSET+ [PEI03]
• Environment: Windows XP, Pentium IV 2.8GHz, 512MB
• Sparse & dense data sets

Dataset       Items   Avg. Trans. Size   Transactions
T40I10D100K   1000    40                 100000
Retail        16471   13                 88162
Chess         76      37                 3196
Pumsb         7117    74                 49046
Performance Analysis
• Data sets: T40I10D100K, Retail
• DCI_CLOSED is faster by one order of magnitude
Performance Analysis
• Data sets: Chess, Pumsb
Performance Analysis
• Time efficiency of duplicate checking
• Speedup of up to a factor of six when support thresholds are small
[Figure: two plots for the chess data set]
References
• [GRAH03] G. Grahne and J. Zhu, “Efficiently Using Prefix-Trees in Mining Frequent Itemsets,” Proc. ICDM Workshop Frequent Itemset Mining Implementations, Dec. 2003.
• [PEI00] J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,” Proc. ACM SIGMOD Int’l Workshop Data Mining and Knowledge Discovery, May 2000.
• [PEI03] J. Pei, J. Han, and J. Wang, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, Aug. 2003.
• [TAOU00] R. Taouil, N. Pasquier, Y. Bastide, L. Lakhal, and G. Stumme, “Mining Frequent Patterns with Counting Inference,” SIGKDD Explorations, vol. 2, no. 2, Dec. 2000.
• [ZAKI02] M.J. Zaki and C.-J. Hsiao, “Charm: An Efficient Algorithm for Closed Itemsets Mining,” Proc. Second SIAM Int’l Conf. Data Mining, Apr. 2002.
• [ZAKI04] M.J. Zaki, “Mining Non-Redundant Association Rules,” Data Mining and Knowledge Discovery, vol. 9, no. 3, pp. 223-248, 2004.