1
Frequent Itemsets, Association Rules,
and Market Basket Analysis
CS240B -- UCLA. Notes by Carlo Zaniolo
Most slides borrowed from Jiawei Han, UIUC
May 2007
2
Association Rules & Correlations
Basic concepts
Efficient and scalable frequent itemset mining methods: Apriori and improvements, FP-growth
Rule derivation, visualization and validation
Multi-level associations
Temporal associations and frequent sequences
Other association mining methods
Summary
3
Market Basket Analysis: the context
Analyze customer buying habits by finding associations and correlations between the different items that customers place in their “shopping basket”

Customer 1: milk, eggs, sugar, bread
Customer 2: milk, eggs, cereal, bread
Customer 3: eggs, sugar
4
Market Basket Analysis: the context
Given: a database of customer transactions, where each transaction is a set of items
Find groups of items which are frequently purchased together
5
Goal of MBA
Extract information on purchasing behavior
Actionable information: can suggest new store layouts, new product assortments, which products to put on promotion
MBA is applicable whenever a customer purchases multiple things in proximity: credit cards, services of telecommunication companies, banking services, medical treatments
6
MBA: applicable to many other contexts
Telecommunication:
Each customer is a transaction containing the set of the customer's phone calls
Atmospheric phenomena:
Each time interval (e.g., a day) is a transaction containing the set of observed events (rain, wind, etc.)
Etc.
7
Association Rules
Express how products/services relate to each other and tend to group together
“If a customer purchases three-way calling, then the customer will also purchase call-waiting”
Simple to understand
Actionable information: bundle three-way calling and call-waiting in a single package
8
Frequent Itemsets
Transaction formats:
  Relational format: <Tid, item>     Compact format: <Tid, itemset>
    <1, item1>                         <1, {item1, item2}>
    <1, item2>                         <2, {item3}>
    <2, item3>
Item: a single element. Itemset: a set of items.
Support of an itemset I: number of transactions containing I
Minimum support σ: threshold for support
Frequent itemset: an itemset with support ≥ σ
Frequent itemsets represent sets of items which are positively correlated
9
Frequent Itemsets Example
Transaction ID   Items bought
1                dairy, fruit
2                dairy, fruit, vegetable
3                dairy
4                fruit, cereals

Support({dairy}) = 3 (75%)
Support({fruit}) = 3 (75%)
Support({dairy, fruit}) = 2 (50%)
If σ = 60%, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.
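To make the definitions concrete, here is a minimal Python sketch (illustrative names, not part of the original notes) that counts the support of an itemset over a transaction database:

# Minimal sketch: itemset support over a list of transactions (Python sets).
def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

transactions = [
    {"dairy", "fruit"},
    {"dairy", "fruit", "vegetable"},
    {"dairy"},
    {"fruit", "cereals"},
]
print(support({"dairy"}, transactions))           # 3 (75%)
print(support({"fruit"}, transactions))           # 3 (75%)
print(support({"dairy", "fruit"}, transactions))  # 2 (50%)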
10
Itemset support & rule confidence

Let A and B be disjoint itemsets and let:
  s = support(A ∪ B)
  c = support(A ∪ B) / support(A)
Then the rule A ⇒ B holds with support s and confidence c; write A ⇒ B [s, c]
Objective of the mining task: find all rules with minimum support σ and minimum confidence γ
Thus A ⇒ B [s, c] holds if s ≥ σ and c ≥ γ
11
Association Rules: Meaning
A ⇒ B [s, c]
Support: denotes the frequency of the rule within transactions. A high value means that the rule involves a large part of the database.
  support(A ⇒ B [s, c]) = p(A ∪ B), the fraction of transactions containing both A and B
Confidence: denotes the percentage of transactions containing A which also contain B. It is an estimate of the conditional probability p(B|A).
  confidence(A ⇒ B [s, c]) = p(B|A) = support(A ∪ B)/support(A)
12
Association Rules - Example
Transaction ID   Items bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Min. support 50%, min. confidence 50%

Frequent itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A ⇒ C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent
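The rule metrics can be checked directly from the definitions; a small sketch, reusing the support helper from the earlier example:

# Support and confidence of the rule antecedent => consequent.
def rule_metrics(antecedent, consequent, transactions):
    n = len(transactions)
    s_union = support(antecedent | consequent, transactions)
    return s_union / n, s_union / support(antecedent, transactions)

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
s, c = rule_metrics({"A"}, {"C"}, db)
print(s, c)  # 0.5 (support 50%), 0.666... (confidence 66.6%)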
13
Closed Patterns and Max-Patterns
A long pattern contains very many subpatterns: a combinatorial explosion. Closed patterns and max-patterns address this.
An itemset is closed if none of its supersets has the same support. Closed patterns are a lossless compression of frequent patterns, reducing the number of patterns and rules.
An itemset is maximal frequent if none of its supersets is frequent. But the support of its subsets is not known; additional DB scans are needed. A brute-force check of both properties is sketched below.
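Given the full table of frequent itemsets and their supports, closedness and maximality can be checked by brute force; a sketch (the freq dictionary is an assumed input, e.g. as produced by an Apriori run):

# freq: dict mapping frozenset -> support, for all frequent itemsets.
def closed_and_maximal(freq):
    closed, maximal = set(), set()
    for itemset, sup in freq.items():
        supersets = [j for j in freq if itemset < j]
        if all(freq[j] < sup for j in supersets):   # closed: no superset
            closed.add(itemset)                     # with equal support
        if not supersets:                           # maximal: no frequent
            maximal.add(itemset)                    # superset at all
    return closed, maximal

Since a superset's support is never larger than the itemset's own, freq[j] < sup is equivalent to freq[j] != sup here; and every maximal itemset is also closed, so the returned maximal set is always contained in the closed set.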
14
Frequent Itemsets
Minimum support = 2
# Frequent = 14
TID   Items
1     ABC
2     ABCD
3     BCE
4     ACDE
5     DE

[Figure: itemset lattice from null up to ABCDE; each itemset is annotated with the TIDs of the transactions containing it, e.g. A: 124, B: 123, C: 1234, D: 245, E: 345.]
15
Maximal Frequent Itemset: if none of its supersets is frequent
Minimum support = 2
# Frequent = 14
# Maximal = 4

(Same transaction database and itemset lattice as on the previous slide; the four maximal frequent itemsets are ABC, ACD, CE, and DE.)
16
Closed Frequent Itemset: none of its supersets has the same support
Minimum support = 2
# Frequent = 14
# Closed = 9
# Maximal = 4

(Same transaction database and itemset lattice as on slide 14; the closed frequent itemsets are C, D, E, AC, BC, CE, DE, ABC, ACD. Of these, ABC, ACD, CE, and DE are both closed and maximal; the rest are closed but not maximal.)
17
Maximal vs Closed Itemsets
Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets

As we move from an itemset A to its superset, support can:
1. Remain the same: A is not closed
2. Drop but still remain above the threshold: A is closed but not maximal
3. Drop below the threshold: A is maximal (and closed)
18
Scalable Methods for Mining Frequent Patterns
The downward closure (anti-monotone) property of frequent patterns: every subset of a frequent itemset must be frequent.
If {beer, diaper, nuts} is frequent, so is {beer, diaper}: every transaction having {beer, diaper, nuts} also contains {beer, diaper}.
Scalable mining methods: three major approaches
Apriori (Agrawal & Srikant @VLDB'94)
Frequent pattern growth (FP-growth: Han, Pei & Yin @SIGMOD'00)
Vertical data format approach (CHARM: Zaki & Hsiao @SDM'02)
19
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94)
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated. A compact sketch follows.
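Here is a compact, unoptimized Python sketch of this level-wise loop (brute-force counting; real implementations use hash trees or tries, as discussed later):

from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: support} for all frequent itemsets."""
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}   # candidate 1-itemsets
    freq = {}
    while level:
        # One DB scan: count the candidates of the current length k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= minsup}
        freq.update(current)
        # Join frequent k-itemsets into (k+1)-candidates, pruning any
        # candidate that has an infrequent k-subset (Apriori principle).
        level = set()
        for a in current:
            for b in current:
                cand = a | b
                if (len(cand) == len(a) + 1 and
                        all(frozenset(s) in current
                            for s in combinations(cand, len(a)))):
                    level.add(cand)
    return freq

On the four-transaction database of the worked example below, apriori(db, 2) derives exactly the itemsets shown there, ending with {B, C, E} at support 2.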
20
Association Rules & Correlations
Basic concepts
Efficient and scalable frequent itemset mining methods: Apriori and improvements
21
The Apriori Algorithm—An Example
Database TDB (Supmin = 2):

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1:              L1 (frequent):
Itemset   sup               Itemset   sup
{A}       2                 {A}       2
{B}       3                 {B}       3
{C}       3                 {C}       3
{D}       1                 {E}       3
{E}       3

C2 (from L1):    2nd scan → C2 counted:    L2 (frequent):
Itemset          Itemset   sup             Itemset   sup
{A, B}           {A, B}    1               {A, C}    2
{A, C}           {A, C}    2               {B, C}    2
{A, E}           {A, E}    1               {B, E}    3
{B, C}           {B, C}    2               {C, E}    2
{B, E}           {B, E}    3
{C, E}           {C, E}    2

C3:              3rd scan → L3:
Itemset          Itemset      sup
{B, C, E}        {B, C, E}    2
22
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?

Example of candidate generation:
L3 = {abc, abd, acd, ace, bcd}
Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
Pruning: acde is removed because ade is not in L3
C4 = {abcd}
23
How to Generate Candidates?
Suppose the items in Lk-1 are listed in lexicographic order.

Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
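The same join-and-prune step transcribed into Python over sorted tuples (a sketch; items are assumed listed in lexicographic order, as above):

from itertools import combinations

def gen_candidates(L_prev):
    """Self-join and prune: frequent (k-1)-itemsets -> candidate k-itemsets."""
    Ck = set()
    for p in L_prev:
        for q in L_prev:
            # Join condition: equal on the first k-2 items, p's last < q's last.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.add(p + (q[-1],))
    # Prune any candidate with a (k-1)-subset that is not frequent.
    return {c for c in Ck
            if all(s in L_prev for s in combinations(c, len(c) - 1))}

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(gen_candidates(L3))   # {('a','b','c','d')} -- acde pruned (ade not in L3)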
24
How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
The total number of candidates can be very large
One transaction may contain many candidates
Data structures used: candidate itemsets can be stored in a hash-tree or in a prefix-tree (trie)
[Slides 25-30 © Tan, Steinbach, Kumar, Introduction to Data Mining, 2004]
25
Effect of Support Distribution
Many real data sets have a skewed support distribution
[Figure: support distribution of a retail data set.]
26
Effect of Support Distribution
How to set the appropriate minsup threshold?
If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
If minsup is set too low, mining is computationally expensive and the number of itemsets is very large
Using a single minimum support threshold may not be effective
27
Rule Generation
How to efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property: c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)
But the confidence of rules generated from the same itemset does have an anti-monotone property
e.g., for L = {A,B,C,D}: c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD)
Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
28
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f ⇒ L−f satisfies the minimum confidence requirement
If |L| = k, then there are 2^k candidate association rules (including L ⇒ ∅ and ∅ ⇒ L)
Example: if L = {A,B,C,D} is the frequent itemset, the candidate rules are:
ABC ⇒ D, ABD ⇒ C, ACD ⇒ B, BCD ⇒ A, A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC,
AB ⇒ CD, AC ⇒ BD, AD ⇒ BC, BC ⇒ AD, BD ⇒ AC, CD ⇒ AB
But anti-monotonicity will make things converge fast; a sketch follows.
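A brute-force sketch of rule generation from one frequent itemset (it checks every split; freq is assumed to map frozensets to support counts, e.g. as returned by the earlier apriori sketch):

from itertools import combinations

def gen_rules(L, freq, minconf):
    """All rules f => L-f with confidence >= minconf, for frequent L."""
    rules = []
    for r in range(1, len(L)):                # antecedent size
        for f in map(frozenset, combinations(L, r)):
            conf = freq[L] / freq[f]          # support(L) / support(f)
            if conf >= minconf:
                rules.append((f, L - f, conf))
    return rules

An Apriori-style rule generator instead grows consequents level-wise, merging rules as on the next slides and dropping a consequent as soon as its rule falls below minconf, which the anti-monotonicity above makes safe.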
29
Lattice of rules: confidence(f ⇒ L−f) = support(L)/support(f), for L = {A,B,C,D}

[Figure: the lattice of rules from ABCD ⇒ {} down to A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC. Once a rule such as BCD ⇒ A is found to have low confidence, all rules below it that move further items to the RHS (e.g. CD ⇒ AB, D ⇒ ABC) are pruned.]
30
Rule Generation for Apriori Algorithm
1. A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
2. join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC
3. Prune rule D ⇒ ABC if its subset AD ⇒ BC does not have high confidence
4. Finally check the validity of rule D ⇒ ABC (this is not an expensive operation, so we might skip 3)
31
Rules: some useful, some trivial, others inexplicable
Useful: “On Thursdays, grocery store consumers often purchase diapers and beer together.”
Trivial: “Customers who purchase maintenance agreements are very likely to purchase large appliances.”
Inexplicable: “When a new hardware store opens, one of the most sold items is toilet rings.”
Conclusion: inferred rules must be validated by a domain expert before they can be used in the marketplace: post-mining of association rules.
32
Mining for Association Rules
The main steps in the process:
1. Select a minimum support/confidence level
2. Find the frequent itemsets
3. Find the association rules
4. Validate (post-mine) the rules so found
33
Mining for Association Rules: Checkpoint
Apriori opened up a big commercial market for DM: association rules came from the DB field, classifiers from AI, clustering precedes both … and DM
Many open problem areas, including:
1. Performance: faster algorithms needed for frequent itemsets
2. Improving the statistical/semantic significance of rules
3. Data stream mining for association rules: even faster algorithms needed, incremental computation, adaptability, etc. The post-mining process also becomes more challenging.
34
Performance: Efficient Implementation of Apriori in SQL

Hard to get good performance out of pure SQL (SQL-92) based approaches alone
Make use of object-relational extensions like UDFs, BLOBs, table functions, etc.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98
A much better solution: use UDAs (native or imported).
Haixun Wang and Carlo Zaniolo. ATLaS: A Native Extension of SQL for Data Mining. SIAM International Conference on Data Mining 2003, San Francisco, CA, May 1-3, 2003
35
Performance for Apriori
Challenges:
Multiple scans of the transaction database [not feasible for data streams]
Huge number of candidates
Tedious workload of support counting for candidates
Many improvements suggested, with these general ideas:
Reduce the number of transaction-database scans
Shrink the number of candidates
Facilitate the counting of candidates
36
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Scan 1: partition the database and find local frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
Does this scale up to larger partitions? A sketch of the scheme follows.
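A sketch of the two-scan scheme, reusing the earlier apriori sketch (the local threshold is scaled down to the partition size; requires Python 3.8+ for the := operator):

def partition_mine(transactions, minsup, k):
    """Mine k partitions locally, then verify candidates in one global scan."""
    n = len(transactions)
    size = (n + k - 1) // k
    candidates = set()
    # Scan 1: a globally frequent itemset is locally frequent in some partition.
    for i in range(0, n, size):
        part = transactions[i:i + size]
        candidates |= set(apriori(part, max(1, minsup * len(part) // n)))
    # Scan 2: exact global counts for the surviving candidates only.
    return {c: s for c in candidates
            if (s := sum(1 for t in transactions if c <= t)) >= minsup}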
37
Sampling for Frequent Patterns
Select a sample S of the original database and mine frequent patterns within the sample using Apriori
To avoid losses, mine for a support less than that required
Scan the rest of the database to find exact counts
H. Toivonen. Sampling large databases for association rules. In VLDB'96
A sketch follows.
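A sketch of the sampling scheme (the parameter choices here are illustrative; the full algorithm in the paper also checks the negative border to detect misses):

import random

def sample_mine(transactions, minsup, frac=0.1, slack=0.8):
    """Mine a sample at a lowered threshold, then get exact counts."""
    sample = random.sample(transactions, max(1, int(len(transactions) * frac)))
    # Scale the threshold to the sample size and lower it to avoid losses.
    local = max(1, int(minsup * len(sample) / len(transactions) * slack))
    candidates = apriori(sample, local)
    # One full scan for exact support counts.
    return {c: s for c in candidates
            if (s := sum(1 for t in transactions if c <= t)) >= minsup}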
38
DIC: Reduce Number of Scans
[Figure: itemset lattice over {A, B, C, D}, from {} up to ABCD.]
Once both A and D are determined frequent, the counting of AD begins
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: Apriori finishes counting all 1-itemsets over the transactions before it starts on 2-itemsets; DIC starts counting 2-itemsets, then 3-itemsets, part-way through the scan, as soon as their subsets are known to be frequent.]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97
39
Improving Performance (cont.)
Apriori: multiple database scans are costly; mining long patterns needs many passes of scanning and generates lots of candidates
To find the frequent itemset i1 i2 … i100:
# of scans: 100
# of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !
Bottleneck: candidate generation and test
Can we avoid candidate generation?
40
Mining Frequent Patterns Without Candidate Generation
FP-Growth Algorithm
1. Build the FP-tree: items are listed by decreasing frequency
2. For each suffix (recursively): build its conditionalized subtree and compute its frequent items
An order of magnitude faster than Apriori
41
Frequent Patterns (FP) Algorithm
These slides are based on those by: Yousry Taha, Taghrid Al-Shallali, Ghada AL Modaifer, Nesreen AL Boiez
The algorithm consists of two steps:
Step 1: build the FP-tree (Frequent Patterns Tree)
Step 2: use the FP-Growth algorithm to find frequent itemsets from the FP-tree
42
Frequent Pattern Tree Algorithm: Example

• The first scan of the database is the same as in Apriori: it derives the set of 1-itemsets and their support counts.
• The set of frequent items is sorted in order of descending support count.
• An FP-tree is constructed.
• The FP-tree is conditionalized and mined for frequent itemsets.

T-ID   List of items
101    Milk, bread, cookies, juice
792    Milk, juice
1130   Milk, eggs
1735   Bread, cookies, coffee
43
FP-tree for the transactions below (eggs and coffee, which appear only once, are not in the tree):

Item header table:
Item      Support   Node-link
milk      3         →
bread     2         →
cookies   2         →
juice     2         →

FP-tree (item:count):
NULL
  Milk:3
    Bread:1
      Cookies:1
        Juice:1
    Juice:1
  Bread:1
    Cookies:1

T-ID   List of items
101    Milk, bread, cookies, juice
792    Milk, juice
1130   Milk, eggs
1735   Bread, cookies, coffee
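A minimal FP-tree construction sketch matching this example (class and helper names are illustrative, not from the original notes):

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(transactions, minsup):
    """Two scans: count items, then insert each transaction's frequent
    items in descending-frequency order. Returns (root, header_table)."""
    counts = Counter(i for t in transactions for i in t)
    keep = {i: c for i, c in counts.items() if c >= minsup}
    root, header = Node(None, None), defaultdict(list)   # item -> node links
    for t in transactions:
        node = root
        for i in sorted((i for i in t if i in keep),
                        key=lambda i: (-keep[i], i)):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header

With the four transactions above and minsup=2, this drops eggs and coffee and produces exactly the two branches shown: milk:3 (with bread:1 -> cookies:1 -> juice:1 and juice:1 below it) and bread:1 -> cookies:1.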
44
FP-Growth Algorithm For Finding Frequent Itemsets
Steps:
1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base, which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern.
3. Then construct its conditional FP-tree and perform mining on that tree.
4. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from the conditional FP-tree.
5. The union of all frequent patterns (generated by step 4) gives the required frequent itemsets.
A sketch of step 2 follows.
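Continuing the sketch from the previous slide (and assuming lowercase item names in db), the conditional pattern base of step 2 can be read off the header table's node-links by walking each node's parent chain:

def conditional_pattern_base(item, header):
    """Prefix paths ending at `item`, each weighted by that node's count."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p.item is not None:          # stop at the root
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base

root, header = build_fp_tree(db, 2)        # db: the four transactions above
print(conditional_pattern_base("juice", header))
# [(['milk', 'bread', 'cookies'], 1), (['milk'], 1)] -> conditional tree {milk:2}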
45
FP-Growth: for each suffix find (1) its supporting paths, (2) its conditional FP-tree, and (3) the frequent patterns with such an ending (suffix)
Suffix    Tree paths supporting suffix            Conditional   Frequent pattern
          (conditional pattern base)              FP-tree       generated
juice     {(milk, bread, cookies: 1), (milk: 1)}  {milk: 2}     {juice, milk: 2}
cookies   {(milk, bread: 1), (bread: 1)}          {bread: 2}    {cookies, bread: 2}
bread     {(milk: 1)}                             -             -
milk      -                                       -             -
… then expand the suffix and repeat these operations
46
Starting from the least frequent suffix: juice
[Figure: the two Juice:1 nodes of the FP-tree are located via the node-links; their prefix paths, (milk, bread, cookies) with count 1 and (milk) with count 1, form the conditional pattern base of juice, with total count 2.]
47
Conditionalized tree for suffix “juice”:
NULL
  Milk:2
Thus (juice, milk: 2) is a frequent pattern.
48
Now patterns with suffix “cookies”:

Item header table:
Item      Sup count   Status
milk      3           ..
bread     2           next
cookies   2           NOW
juice     done        done

[Figure: the two Cookies:1 nodes are located via the node-links; their prefix paths, (milk, bread: 1) and (bread: 1), give the conditionalized tree NULL - Bread:2.]
Thus (cookies, bread: 2) is frequent.
49
Why is Frequent Pattern Growth Fast?

• Performance studies show FP-growth is an order of magnitude faster than Apriori
• Reasoning:
− No candidate generation, no candidate test
− Uses a compact data structure
− Eliminates repeated database scans
− The basic operations are counting and FP-tree building
50
Other Types of Association Rules
• Association Rules among Hierarchies.
• Multidimensional Association
• Negative Association
51
FP-growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec.) vs. support threshold (%), from 0 to 3%, for data set T25I20D10K; series: D1 FP-growth runtime and D1 Apriori runtime. Apriori's run time grows much faster than FP-growth's as the support threshold decreases.]
52
FP-growth vs. Apriori: Scalability With Number of Transactions
[Figure: run time (sec.) vs. number of transactions (K), from 0 to 100K, for data set T25I20D100K at 1.5% minimum support; series: FP-growth and Apriori, with FP-growth consistently faster.]
53
FP-Growth: pros and cons
FP-tree is complete: it preserves complete information for frequent pattern mining and never breaks a long pattern of any transaction
FP-tree is compact: it drops irrelevant info (infrequent items are gone); items are in frequency-descending order (the more frequently occurring, the more likely to be shared); it is never larger than the original database (not counting node-links and the count fields)
The FP-tree is generated in one scan of the database (data stream mining?)
However, deriving the frequent patterns from the FP-tree is still computationally expensive: improved algorithms are needed for data streams.