Performance and Scalability: Apriori Implementation

Post on 22-Dec-2015


Apriori

R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, 487-499, 1994

Reducing Number of Comparisons

Candidate counting:

Scan the database of transactions to determine the support of each candidate itemset

To reduce the number of comparisons, store the candidates in a hash structure

Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

[Figure: Transactions (N) matched against a Hash Structure (k) of Buckets]
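
As a baseline for the hash-based scheme, here is a minimal sketch of candidate counting by brute force: every transaction from the table is compared against every candidate. The three candidate itemsets and the function name `count_support` are assumptions chosen for illustration; they do not come from the slides.

```python
# Transactions from the TID/Items table; the candidates are illustrative assumptions.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
CANDIDATES = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Diaper", "Beer"}),
    frozenset({"Milk", "Coke"}),
]


def count_support(transactions, candidates):
    """Return (support counts, number of subset comparisons performed)."""
    support = {c: 0 for c in candidates}
    comparisons = 0
    for t in transactions:
        for c in candidates:
            comparisons += 1          # one transaction-vs-candidate test
            if c <= t:                # candidate is contained in the transaction
                support[c] += 1
    return support, comparisons


support, comparisons = count_support(TRANSACTIONS, CANDIDATES)
for itemset, count in support.items():
    print(sorted(itemset), count)
print("comparisons:", comparisons)
```

With N transactions and M candidates this loop always performs N x M comparisons, which is the cost the hash structure is meant to reduce.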

Generate Hash Tree

[Figure: candidate hash tree built from the 15 itemsets listed below. Hash function: items 1, 4, 7 hash to the first branch; 2, 5, 8 to the second; 3, 6, 9 to the third.]

Suppose you have 15 candidate itemsets of length 3:

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:

• Hash function

• Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)
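
The construction can be sketched as follows. This is a minimal illustration, not the implementation from the cited paper: the class name `HashTreeNode`, the branch numbering 0/1/2 and the value `MAX_LEAF_SIZE = 3` are assumptions (the slide names a max leaf size but gives no value). A leaf that overflows is split by hashing on the item at the node's depth.

```python
CANDIDATES = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
MAX_LEAF_SIZE = 3  # assumption: not stated on the slide


def bucket(item):
    """Hash function from the slide: 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2."""
    return (item - 1) % 3


class HashTreeNode:
    def __init__(self, depth=0):
        self.depth = depth      # index of the item this node hashes on
        self.children = {}      # branch number -> HashTreeNode (interior nodes)
        self.itemsets = []      # stored candidates (leaf nodes)
        self.is_leaf = True

    def insert(self, itemset):
        if not self.is_leaf:
            self._push_down(itemset)
            return
        self.itemsets.append(itemset)
        # Split an overflowing leaf, unless no item is left to hash on.
        if len(self.itemsets) > MAX_LEAF_SIZE and self.depth < len(itemset):
            self.is_leaf = False
            pending, self.itemsets = self.itemsets, []
            for c in pending:
                self._push_down(c)

    def _push_down(self, itemset):
        b = bucket(itemset[self.depth])
        if b not in self.children:
            self.children[b] = HashTreeNode(self.depth + 1)
        self.children[b].insert(itemset)

    def leaves(self, path=()):
        """Yield (branch path from the root, sorted itemsets) for every leaf."""
        if self.is_leaf:
            yield path, sorted(self.itemsets)
        else:
            for b in sorted(self.children):
                yield from self.children[b].leaves(path + (b,))


def build_hash_tree(candidates):
    root = HashTreeNode()
    for c in candidates:
        root.insert(tuple(sorted(c)))
    return root


tree = build_hash_tree(CANDIDATES)
for path, itemsets in tree.leaves():
    print(path, itemsets)
```

Because a node splits as soon as it has received more than `MAX_LEAF_SIZE` itemsets, the final shape of the tree does not depend on the order in which the candidates are inserted.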

Association Rule Discovery: Hash tree

[Figure: the candidate hash tree with its hash function, shown in three steps. At the root, hash on 1, 4 or 7 to follow the first branch; hash on 2, 5 or 8 to follow the second branch; hash on 3, 6 or 9 to follow the third branch.]

Subset Operation

Given a transaction t, what are the possible subsets of size 3?

Transaction, t: 1 2 3 5 6

[Figure: enumeration of the subsets of t, fixing one item per level (Level 1, Level 2, Level 3)]

Subsets of 3 items: {1 2 3}, {1 2 5}, {1 2 6}, {1 3 5}, {1 3 6}, {1 5 6}, {2 3 5}, {2 3 6}, {2 5 6}, {3 5 6}
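
The level-by-level enumeration can be sketched as a short recursion that fixes the first item, then the second, then the third. The function name `k_subsets` is an assumption for illustration; the standard library's `itertools.combinations` is used only as a cross-check.

```python
from itertools import combinations


def k_subsets(items, k, prefix=()):
    """Yield the k-subsets of a sorted item sequence, fixing one item per level."""
    need = k - len(prefix)
    if need == 0:
        yield prefix
        return
    # Stop early enough that `need - 1` items remain after the chosen one.
    for i in range(len(items) - need + 1):
        yield from k_subsets(items[i + 1:], k, prefix + (items[i],))


t = (1, 2, 3, 5, 6)
subsets = list(k_subsets(t, 3))
print(subsets)
print(subsets == list(combinations(t, 3)))
```

At Level 1 only items 1, 2 and 3 can start a 3-subset of t, since at least two larger items must follow.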

Subset Operation Using Hash Tree

[Figure: the transaction 1 2 3 5 6 is hashed through the candidate hash tree (hash function: 1,4,7 / 2,5,8 / 3,6,9). At the root it splits into 1 + 2 3 5 6, 2 + 3 5 6 and 3 + 5 6; one level down, the 1-branch splits into 1 2 + 3 5 6, 1 3 + 5 6 and 1 5 + 6.]

Match transaction against 11 out of 15 candidates
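
A minimal sketch of this matching step, under stated assumptions: the tree is rebuilt with a max leaf size of 3 (not given on the slides), and at each interior node the traversal hashes every remaining item of the transaction. The names `build`, `bucket` and `match` are hypothetical. Each leaf is checked at most once per transaction.

```python
CANDIDATES = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
MAX_LEAF_SIZE = 3  # assumption: not stated on the slides


def bucket(item):
    """Hash function from the slides: 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2."""
    return (item - 1) % 3


def build(candidates, depth=0):
    """Return a leaf (list of itemsets) or an interior node (dict: branch -> subtree)."""
    if len(candidates) <= MAX_LEAF_SIZE or depth >= len(candidates[0]):
        return list(candidates)
    groups = {}
    for c in candidates:
        groups.setdefault(bucket(c[depth]), []).append(c)
    return {b: build(g, depth + 1) for b, g in groups.items()}


def match(tree, transaction):
    """Return (candidates compared, candidates contained in the transaction)."""
    t = tuple(sorted(transaction))
    items = set(t)
    seen_leaves = set()           # ids of leaves already checked for this transaction
    compared, contained = [], []

    def visit(node, start):
        if isinstance(node, list):                # leaf: compare every stored itemset
            if id(node) not in seen_leaves:
                seen_leaves.add(id(node))
                for c in node:
                    compared.append(c)
                    if items.issuperset(c):
                        contained.append(c)
            return
        for i in range(start, len(t)):            # interior: hash each remaining item
            child = node.get(bucket(t[i]))
            if child is not None:
                visit(child, i + 1)

    visit(tree, 0)
    return compared, contained


tree = build(CANDIDATES)
compared, contained = match(tree, (1, 2, 3, 5, 6))
print(len(compared), "of", len(CANDIDATES), "candidates compared")
print("contained in the transaction:", sorted(contained))
```

Under these assumptions the sketch compares 11 of the 15 candidates for the transaction 1 2 3 5 6. A traversal that also skips items too close to the end of the transaction to complete a 3-itemset can visit fewer leaves.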

Prefix Tree Representation

Christian Borgelt. Efficient Implementations of Apriori and Eclat. FIMI’03

Prefix Tree

Prefix Tree Structure for Counting

Other key optimization

Recording the items

Why is this relevant?

Transaction Tree

Organize transactions into trees

Count through two trees
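
Counting with a candidate prefix tree can be sketched as below. This is a plain trie for illustration, not Borgelt's implementation, and it does not build the transaction tree mentioned above. The names `TrieNode`, `build_prefix_tree`, `count_transaction` and `supports`, as well as the candidates and transactions, are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}         # item -> TrieNode
        self.count = 0             # support counter, used when is_candidate is True
        self.is_candidate = False


def build_prefix_tree(candidates):
    """Store each candidate as a root-to-node path of its sorted items."""
    root = TrieNode()
    for c in candidates:
        node = root
        for item in sorted(c):
            node = node.children.setdefault(item, TrieNode())
        node.is_candidate = True
    return root


def count_transaction(node, items, start=0):
    """Increment every candidate on a path formed by items[start:] (sorted, unique)."""
    if node.is_candidate:
        node.count += 1
    for i in range(start, len(items)):
        child = node.children.get(items[i])
        if child is not None:
            count_transaction(child, items, i + 1)


def supports(node, prefix=()):
    """Yield (itemset, support) for every stored candidate."""
    if node.is_candidate:
        yield prefix, node.count
    for item, child in sorted(node.children.items()):
        yield from supports(child, prefix + (item,))


# Hypothetical data, chosen only to exercise the sketch.
CANDIDATES = [(1, 2, 5), (1, 3, 6), (3, 5, 6), (3, 5, 7), (3, 6, 7), (5, 6, 7)]
TRANSACTIONS = [(1, 2, 3, 5, 6), (1, 3, 6, 7), (3, 5, 6, 7)]

root = build_prefix_tree(CANDIDATES)
for t in TRANSACTIONS:
    count_transaction(root, sorted(set(t)))
result = dict(supports(root))
print(result)
```

Candidates that share a prefix share the corresponding path, so a transaction lacking that prefix rules out all of them with a single failed lookup.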

Important websites:

FIMI workshop

Not only Apriori and FIM: FP-tree, ECLAT, Closed, Maximal

http://fimi.cs.helsinki.fi/

Christian Borgelt’s website

http://www.borgelt.net/software.html

Ferenc Bodon’s website

http://www.cs.bme.hu/~bodon/en/apriori/

References:

Christian Borgelt, Efficient Implementations of Apriori and Eclat, FIMI’03

Ferenc Bodon, A fast APRIORI implementation, FIMI’03

Ferenc Bodon, A Survey on Frequent Itemset Mining, Technical Report, Budapest University of Technology and Economics, 2006

Scalability

How to handle very large datasets?

The dataset cannot be stored in main memory

Performance on out-of-core datasets / performance on in-core datasets

Partition: Scan Database Only Twice

Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

Scan 1: partition database and find local frequent patterns

Scan 2: consolidate global frequent patterns

A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB’95
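
The two scans can be sketched as follows, assuming each partition fits in memory. The brute-force local miner stands in for whatever in-memory algorithm is used per partition; the names `local_frequent`, `partition_mine` and `min_count`, and the threshold of 60%, are assumptions for illustration. The data is the five-transaction market-basket table from earlier in the deck.

```python
from itertools import combinations
from math import ceil

TRANSACTIONS = [
    ("Bread", "Milk"),
    ("Bread", "Diaper", "Beer", "Eggs"),
    ("Milk", "Diaper", "Beer", "Coke"),
    ("Bread", "Milk", "Diaper", "Beer"),
    ("Bread", "Milk", "Diaper", "Coke"),
]


def min_count(n, min_frac):
    """Smallest count that reaches min_frac of n (epsilon guards float round-off)."""
    return ceil(min_frac * n - 1e-9)


def local_frequent(partition, min_frac):
    """Brute-force miner for one in-memory partition (toy stand-in)."""
    counts = {}
    for t in partition:
        for k in range(1, len(t) + 1):
            for s in combinations(sorted(t), k):
                counts[s] = counts.get(s, 0) + 1
    threshold = min_count(len(partition), min_frac)
    return {s for s, c in counts.items() if c >= threshold}


def partition_mine(db, min_frac, n_parts):
    size = ceil(len(db) / n_parts)
    parts = [db[i:i + size] for i in range(0, len(db), size)]

    # Scan 1: the union of locally frequent itemsets is the global candidate set.
    candidates = set()
    for p in parts:
        candidates |= local_frequent(p, min_frac)

    # Scan 2: count every candidate over the whole database.
    counts = dict.fromkeys(candidates, 0)
    for t in db:
        present = set(t)
        for c in candidates:
            if present.issuperset(c):
                counts[c] += 1
    threshold = min_count(len(db), min_frac)
    return {c: n for c, n in counts.items() if n >= threshold}


result = partition_mine(TRANSACTIONS, min_frac=0.6, n_parts=2)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print(itemset, result[itemset])
```

Scan 1 may produce candidates that are frequent in only one partition; Scan 2 removes them, so the result matches mining the whole database directly.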

DHP: Reduce the Number of Candidates

A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent

Candidates: a, b, c, d, e

Hash entries: {ab, ad, ae} {bd, be, de} …

Frequent 1-itemsets: a, b, d, e

ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold

J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95
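
A minimal sketch of the pruning step for 2-itemsets: while counting single items, every pair in each transaction is also hashed into a small table of bucket counters. The hash function, the bucket count of 7, the function name `dhp_candidate_pairs` and the minimum count of 3 are all assumptions for illustration; the data is the market-basket table from earlier in the deck rather than the a to e example.

```python
from itertools import combinations

TRANSACTIONS = [
    ("Bread", "Milk"),
    ("Bread", "Diaper", "Beer", "Eggs"),
    ("Milk", "Diaper", "Beer", "Coke"),
    ("Bread", "Milk", "Diaper", "Beer"),
    ("Bread", "Milk", "Diaper", "Coke"),
]


def dhp_candidate_pairs(db, min_count, n_buckets):
    """Return (pairs of frequent items, pairs that also survive the bucket test)."""
    order = {}                        # item -> small integer id, assigned on first sight
    item_count = {}
    bucket_count = [0] * n_buckets

    def bucket(pair):
        x, y = pair                   # illustrative hash function, an assumption
        return (order[x] * 10 + order[y]) % n_buckets

    # Single pass: count items and hash every pair of each transaction.
    for t in db:
        for x in t:
            order.setdefault(x, len(order))
            item_count[x] = item_count.get(x, 0) + 1
        for pair in combinations(sorted(t), 2):
            bucket_count[bucket(pair)] += 1

    frequent_items = sorted(x for x, c in item_count.items() if c >= min_count)
    apriori_pairs = list(combinations(frequent_items, 2))
    # A pair whose bucket total is below the threshold cannot be frequent.
    dhp_pairs = [p for p in apriori_pairs if bucket_count[bucket(p)] >= min_count]
    return apriori_pairs, dhp_pairs


apriori_pairs, dhp_pairs = dhp_candidate_pairs(TRANSACTIONS, min_count=3, n_buckets=7)
print("pairs of frequent items:", len(apriori_pairs))
print("pairs kept after bucket pruning:", len(dhp_pairs))
```

A bucket total is an upper bound on the support of every pair hashed into it, so the test never discards a frequent pair; collisions only mean that some infrequent pairs survive.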

Sampling for Frequent Patterns

Select a sample of the original database, mine frequent patterns within the sample using Apriori

Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked

Example: check abcd instead of ab, ac, …, etc.

Scan the database again to find missed frequent patterns

H. Toivonen. Sampling large databases for association rules. In VLDB’96
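
A minimal sketch of the first scan, under stated assumptions: it reads the "borders" as the negative border used in Toivonen's paper (the minimal itemsets that are not frequent in the sample), mines the sample by brute force instead of Apriori, and assumes the item universe is known. All function names, the sample size and the lowered threshold are hypothetical. The second scan is not implemented; the sketch only reports whether one is needed.

```python
import random
from itertools import combinations
from math import ceil

TRANSACTIONS = [
    ("Bread", "Milk"),
    ("Bread", "Diaper", "Beer", "Eggs"),
    ("Milk", "Diaper", "Beer", "Coke"),
    ("Bread", "Milk", "Diaper", "Beer"),
    ("Bread", "Milk", "Diaper", "Coke"),
]


def frequent_itemsets(db, min_count):
    """Brute-force miner (stand-in for Apriori on the small sample)."""
    counts = {}
    for t in db:
        for k in range(1, len(t) + 1):
            for s in combinations(sorted(t), k):
                counts[s] = counts.get(s, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}


def negative_border(freq, items):
    """Minimal itemsets outside `freq` whose proper subsets are all in `freq`."""
    border = {(i,) for i in items if (i,) not in freq}
    for s in freq:
        for i in items:
            if i not in s:
                c = tuple(sorted(s + (i,)))
                if c not in freq and all(
                        sub in freq for sub in combinations(c, len(c) - 1)):
                    border.add(c)
    return border


def sample_and_verify(db, min_frac, sample_size, lowered_frac, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(db, sample_size)
    items = sorted({i for t in db for i in t})   # assumes the item universe is known
    in_sample = frequent_itemsets(sample, ceil(lowered_frac * sample_size - 1e-9))
    to_check = in_sample | negative_border(in_sample, items)

    counts = dict.fromkeys(to_check, 0)          # the single full scan
    for t in db:
        present = set(t)
        for s in to_check:
            if present.issuperset(s):
                counts[s] += 1

    threshold = ceil(min_frac * len(db) - 1e-9)
    verified = {s for s in in_sample if counts[s] >= threshold}
    border_hits = {s for s in to_check - in_sample if counts[s] >= threshold}
    return verified, border_hits


verified, border_hits = sample_and_verify(TRANSACTIONS, 0.6, 3, 0.5)
print("verified:", sorted(verified))
print("second scan needed:", bool(border_hits))
```

If no border itemset turns out to be frequent in the full database, the sample missed nothing and the verified set is complete; otherwise a second scan is needed to find the missed patterns.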

DIC: Reduce Number of Scans

Itemset lattice:

ABCD

ABC ABD ACD BCD

AB AC BC AD BD CD

A B C D

{}

Once both A and D are determined frequent, the counting of AD begins

Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

[Figure: timeline over the transactions. Apriori: 1-itemsets, 2-itemsets, … DIC: 1-itemsets, 2-items, 3-items.]

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97
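
The idea can be sketched as follows; this is a simplified illustration, not the data structure from the cited paper. The database is read cyclically, new itemsets may start counting at every block boundary, and each itemset stops once it has seen every transaction exactly once. The function name `dic`, the block size of 2 and the minimum count of 3 are assumptions; the data is the market-basket table from earlier in the deck.

```python
TRANSACTIONS = [
    ("Bread", "Milk"),
    ("Bread", "Diaper", "Beer", "Eggs"),
    ("Milk", "Diaper", "Beer", "Coke"),
    ("Bread", "Milk", "Diaper", "Beer"),
    ("Bread", "Milk", "Diaper", "Coke"),
]


def dic(transactions, min_count, block_size):
    """Return (frequent itemsets with exact counts, transactions read in total)."""
    db = [frozenset(t) for t in transactions]
    n = len(db)
    items = sorted(set().union(*db))
    count = {frozenset([i]): 0 for i in items}   # every itemset ever started
    seen = dict.fromkeys(count, 0)               # transactions seen per itemset
    active = set(count)                          # itemsets still being counted
    pos = processed = 0

    while active:
        t = db[pos]
        for s in list(active):
            if s <= t:
                count[s] += 1
            seen[s] += 1
            if seen[s] == n:                     # s has now seen the whole database
                active.discard(s)
        pos = (pos + 1) % n
        processed += 1

        # At block boundaries (or when nothing is left), start counting any
        # superset whose immediate subsets have all reached min_count so far.
        if processed % block_size == 0 or not active:
            reached = [s for s in count if count[s] >= min_count]
            for s in reached:
                for i in items:
                    c = s | {i}
                    if i in s or c in count:
                        continue
                    if all(count.get(c - {j}, 0) >= min_count for j in c):
                        count[c] = 0
                        seen[c] = 0
                        active.add(c)

    frequent = {s: c for s, c in count.items() if c >= min_count}
    return frequent, processed


frequent, processed = dic(TRANSACTIONS, min_count=3, block_size=2)
for s in sorted(frequent, key=lambda x: (len(x), sorted(x))):
    print(sorted(s), frequent[s])
print("database passes (fractional):", processed / len(TRANSACTIONS))
```

Counts only grow, so an itemset that has reached the threshold partway through its pass is already known to be frequent, which is what allows its supersets to start early. Each itemset is still counted over the full database, so the final counts are exact.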