Mining Generalized Association Rules. R. Srikant & R. Agrawal (IBM). Presentation by: Colin Cherry
Post on 17-Dec-2015
Objectives
• What are generalized association rules?
• Why do we care?
• How can we get them efficiently?
• How can we reduce rule redundancy?
• Is the efficient method any good?
Motivation
• Association rules find rules of the form X ⇒ Y, where X and Y are sets of items
• What if there is structure over your items?
• Structure can be used to generalize
Hierarchy Example
[Figure: item hierarchy, leaves to root: Pepsi, Coke → Cola → Soft Drink → Beverage; other branches elided (…); item pictures not recoverable]
Hierarchy Example
[Figure: a second hierarchy with the categories On Sale and Not On Sale; item pictures not recoverable]
• Goal of this paper: given hierarchies over items, capture interesting rules at all levels of multiple hierarchies
Simple Fix
• Just add parents to each transaction.
• {Coke, 7-up, ranch Doritos, bananas} would become:
  {Coke, 7-up, ranch Doritos, bananas, Doritos, cola, clear pop, soft drink, chips, junk food, fruit, produce}
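A minimal Python sketch of the simple fix. The child-to-parent map below is an assumption reconstructed from the example transaction (the slide does not give the edges), and it assumes every item has at most one parent.

```python
# Hypothetical child -> parent map, guessed from the slide's example.
PARENT = {
    "Coke": "cola",
    "7-up": "clear pop",
    "cola": "soft drink",
    "clear pop": "soft drink",
    "ranch Doritos": "Doritos",
    "Doritos": "chips",
    "chips": "junk food",
    "bananas": "fruit",
    "fruit": "produce",
}


def ancestors(item, parent=PARENT):
    """Return every ancestor of `item` by walking up the parent map."""
    result = []
    while item in parent:
        item = parent[item]
        result.append(item)
    return result


def extend_transaction(transaction, parent=PARENT):
    """Return the transaction plus all ancestors of all of its items."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item, parent))
    return extended


extended = extend_transaction({"Coke", "7-up", "ranch Doritos", "bananas"})
print(sorted(extended))
```

With this map, the four-item transaction grows to the twelve items listed on the slide, which is the growth in transaction size the next slides worry about.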
Fix Cont’d
• Run Apriori on expanded database
• Redefine association rules:
  - Make sure that, for X ⇒ Y: X ∩ Y = {}, and Y contains no ancestors of any item in X
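The two conditions on a rule can be checked mechanically. This is a sketch only: `is_valid_rule` and the small parent map are illustrative names, not something from the paper.

```python
def ancestors(item, parent):
    """Return the set of all ancestors of `item` in a child -> parent map."""
    result = set()
    while item in parent:
        item = parent[item]
        result.add(item)
    return result


def is_valid_rule(x, y, parent):
    """True if X and Y are disjoint and Y holds no ancestor of an item in X."""
    x, y = set(x), set(y)
    if x & y:
        return False
    x_ancestors = set()
    for item in x:
        x_ancestors |= ancestors(item, parent)
    return not (y & x_ancestors)


# Hypothetical hierarchy fragment for illustration.
parent = {"Coke": "cola", "cola": "soft drink", "Doritos": "chips"}

ok = is_valid_rule({"Coke"}, {"chips"}, parent)
trivial = is_valid_rule({"Coke"}, {"cola"}, parent)
overlap = is_valid_rule({"Coke", "chips"}, {"chips"}, parent)
```

The ancestor condition excludes rules such as Coke ⇒ cola, which hold in every expanded transaction and therefore carry no information.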
Problems with the fix
• Counting may slow down
  - Total number of items and average transaction size will grow
• Could get a lot of redundant rules
  - Milk ⇒ Cereal (70%)
  - Skim Milk ⇒ Cereal (70%)
Do we care?
An Efficient Algorithm
• “Cumulate”
  - Filtering ancestors added to transactions
  - Hierarchy-aware itemset pruning
• For more complicated, speculative algorithms, see paper
Filtering Ancestors
• Not counting soft drink? Don’t add it.
  - Only add ancestors that are in at least one of the candidate itemsets
• Delete any items we are not counting
  - Not counting Doritos? Replace with chips
• Each iteration: pre-compute the ancestors for each item
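The filtering step might look like the following sketch. The function names, the parent map, and the candidate itemsets are all assumptions made for illustration; the paper's actual data structures may differ.

```python
def build_ancestor_table(parent, candidate_items):
    """Per iteration: map each item to its ancestors that are being counted."""
    table = {}
    for item in parent:
        kept = []
        node = item
        while node in parent:
            node = parent[node]
            if node in candidate_items:
                kept.append(node)
        table[item] = kept
    return table


def filter_transaction(transaction, table, candidate_items):
    """Keep counted items, add counted ancestors, drop everything else."""
    result = set()
    for item in transaction:
        if item in candidate_items:
            result.add(item)
        result.update(table.get(item, ()))
    return result


# Hypothetical hierarchy and candidates for one counting pass.
parent = {
    "Coke": "cola",
    "cola": "soft drink",
    "ranch Doritos": "Doritos",
    "Doritos": "chips",
    "chips": "junk food",
}
candidates = [frozenset({"Coke", "chips"}), frozenset({"cola", "chips"})]
candidate_items = set().union(*candidates)

table = build_ancestor_table(parent, candidate_items)
filtered = filter_transaction(
    {"Coke", "ranch Doritos", "bananas"}, table, candidate_items
)
print(sorted(filtered))
```

Here soft drink is never added because no candidate contains it, and ranch Doritos is effectively replaced by chips, the only one of its ancestors that is being counted.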
Itemset Pruning
• No sense counting both {coke, cola, chips} and {coke, chips}: they’ll always have the same support
• Take out {coke, cola} when counting itemsets of size 2 and you’ll never have to deal with any larger itemset that contains it
$\hat{y} = \mathrm{ancestor}(y)$
$\forall X : (y \in X \wedge \hat{y} \in X) \rightarrow \mathrm{sup}(X) = \mathrm{sup}(X - \{\hat{y}\})$
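A sketch of the pruning step applied to the size-2 candidates. The name `prune_item_ancestor_pairs` and the ancestor table are hypothetical; the idea is only to drop any candidate that contains both an item and one of its ancestors.

```python
from itertools import combinations


def prune_item_ancestor_pairs(candidates, ancestors_of):
    """Drop candidates that contain an item together with one of its ancestors."""
    kept = []
    for cand in candidates:
        has_pair = any(
            b in ancestors_of.get(a, ()) or a in ancestors_of.get(b, ())
            for a, b in combinations(cand, 2)
        )
        if not has_pair:
            kept.append(cand)
    return kept


# Hypothetical ancestor sets for illustration.
ancestors_of = {"coke": {"cola", "soft drink"}, "cola": {"soft drink"}}
size2 = [
    frozenset({"coke", "cola"}),
    frozenset({"coke", "chips"}),
    frozenset({"cola", "chips"}),
]

pruned = prune_item_ancestor_pairs(size2, ancestors_of)
```

Because Apriori-style candidate generation builds larger candidates only from surviving smaller ones, removing {coke, cola} at size 2 keeps {coke, cola, chips} from ever being generated.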
Reducing Redundancy
Milk ⇒ Cereal (8% sup, 70% conf)
Skim Milk ⇒ Cereal (2% sup, 70% conf)
• If Skim Milk accounts for 1/4 of Milk sales, then the 2nd rule is redundant
• Expected support and confidence (with respect to the hierarchy) will define what counts as interesting
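The arithmetic behind the Milk example, written out. This assumes that the expected support of the specialized rule is the ancestor rule's support scaled by the share of the ancestor item's sales that the child item accounts for:

```latex
\mathrm{ExpSup}(\text{Skim Milk} \Rightarrow \text{Cereal})
  = \frac{\mathrm{sup}(\text{Skim Milk})}{\mathrm{sup}(\text{Milk})}
    \times \mathrm{sup}(\text{Milk} \Rightarrow \text{Cereal})
  = \frac{1}{4} \times 8\% = 2\%
```

The observed support is also 2%, and the confidence matches the ancestor rule's 70%, so the second rule tells us nothing beyond what the first already implies.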
Close Ancestors
• An itemset Z’ is an ancestor of Z if:
  - Z’ = Z with some items replaced by ancestors
  - Z’ has the same number of items as Z
• Z’ is a close ancestor of Z if:
  - No ancestor of Z has Z’ as an ancestor
• Take {coke, bananas} as Z
  - Z’ = {cola, bananas} is a close ancestor
  - Z’ = {soft drink, bananas} is not close
  - Z’ = {cola, fruit} is not close
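The two definitions can be turned into a brute-force sketch. The parent map is a hypothetical fragment of the example hierarchy, and the enumeration is written for clarity rather than speed.

```python
from itertools import product


def ancestors(item, parent):
    """Return the ancestors of `item`, nearest first."""
    result = []
    while item in parent:
        item = parent[item]
        result.append(item)
    return result


def ancestor_itemsets(z, parent):
    """All itemsets obtained from z by replacing some items with ancestors."""
    items = sorted(z)
    options = [[item] + ancestors(item, parent) for item in items]
    result = set()
    for combo in product(*options):
        candidate = frozenset(combo)
        # Same number of items as z, and at least one item replaced.
        if len(candidate) == len(items) and candidate != frozenset(items):
            result.add(candidate)
    return result


def close_ancestors(z, parent):
    """Ancestors of z that are not an ancestor of another ancestor of z."""
    all_anc = ancestor_itemsets(z, parent)
    return {
        a for a in all_anc
        if not any(a in ancestor_itemsets(mid, parent) for mid in all_anc)
    }


# Hypothetical hierarchy fragment.
parent = {"coke": "cola", "cola": "soft drink", "bananas": "fruit", "fruit": "produce"}
close = close_ancestors({"coke", "bananas"}, parent)
```

Under this map, {cola, bananas} comes out as close while {soft drink, bananas} and {cola, fruit} do not, matching the slide. The definition also makes {coke, fruit} a close ancestor, which the slide does not list.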
Interestingness
• A rule X ⇒ Y is interesting if, for all interesting close ancestors X’ ⇒ Y’:
  - Sup({X,Y}) > R * ExpSup({X,Y} | {X’,Y’}), or
  - Conf(X ⇒ Y) > R * ExpConf(X ⇒ Y | X’ ⇒ Y’)
• R is defined by the user
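The test itself is a one-liner once the expected values are known. In this sketch the function name, the argument layout, and the value R = 1.1 are assumptions chosen for illustration; the expected values are taken as already computed.

```python
def is_interesting(sup, conf, expectations, r):
    """Check a rule against every interesting close ancestor.

    `expectations` holds one (expected_support, expected_confidence) pair
    per interesting close ancestor; an empty list means the rule has none.
    """
    return all(
        sup > r * exp_sup or conf > r * exp_conf
        for exp_sup, exp_conf in expectations
    )


R = 1.1  # hypothetical user-chosen threshold

# Milk => Cereal has no ancestor, so nothing can make it redundant.
milk = is_interesting(0.08, 0.70, [], R)

# Skim Milk => Cereal matches its expected support and confidence exactly.
skim = is_interesting(0.02, 0.70, [(0.02, 0.70)], R)
```

A rule with no interesting close ancestors passes automatically, since `all` over an empty list is true.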
Putting it all together
• #1 is interesting - has no ancestor
• #2 is interesting - twice expected support
• #3 is not interesting
  - Has exactly expected support according to closest ancestor (#2)
Item        Sup
Clothes     20
Outerwear   8
Jackets     4

Rule                      Sup (Exp)
#1 Clothes ⇒ Footwear     10 (-)
#2 Outerwear ⇒ Footwear   8 (4)
#3 Jackets ⇒ Footwear     4 (4)
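The expected supports in the table can be reproduced with a short sketch. It assumes the expected support is the close ancestor's rule support scaled by the ratio of item supports for each replaced item; `expected_support` and the dictionary are illustrative names.

```python
import math

# Item supports from the table above.
ITEM_SUP = {"Clothes": 20, "Outerwear": 8, "Jackets": 4}


def expected_support(ancestor_rule_sup, replaced, item_sup=ITEM_SUP):
    """Scale the ancestor rule's support by sup(item) / sup(ancestor item).

    `replaced` lists (item, ancestor_item) pairs, one for each item that
    differs between the rule and its close ancestor.
    """
    expected = ancestor_rule_sup
    for item, ancestor_item in replaced:
        expected *= item_sup[item] / item_sup[ancestor_item]
    return expected


# Rule #2 against rule #1 (support 10), then rule #3 against rule #2 (support 8).
exp_rule2 = expected_support(10, [("Outerwear", "Clothes")])
exp_rule3 = expected_support(8, [("Jackets", "Outerwear")])

ratio_rule2 = 8 / exp_rule2  # observed / expected
ratio_rule3 = 4 / exp_rule3
```

Rule #2 has twice its expected support, while rule #3 has exactly its expected support, so for any R between 1 and 2 only rule #3 is pruned on support.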
Experiments
• Lots of experiments on artificial data in paper.
• We’ll look at the results of using Cumulate on real data
• Compare to the quick fix - just adding in ancestors to transactions
Interestingness Results
• Hierarchical Interestingness pruning:
R = 25% resulted in pruning roughly 40% of the rules
R = 50% resulted in pruning roughly 50% of the rules
• Pruning had a significant impact!
Objectives Revisited
• What are generalized association rules? Rules aware of hierarchies over items
• Why do we care? Support can be low for individual items
• How can we get them efficiently? Cumulate algorithm - hierarchy-aware counting
• How can we reduce rule redundancy? Check surprise with respect to ancestors
• Is the efficient method any good? Yep!
Hierarchy Example
[Figure: hierarchy with the node labels Cans, Bottles, Beverage, Fridge, …, Impulse; item pictures not recoverable]
Pros
• Rules over items low in the tree may not have minimum support
• Can raise min support
  - Shoot for fewer, more general rules
  - BUT: you can catch rules at any level of the hierarchy
Data Sets
• Supermarket: 500,000 items, 1.5 million transactions
  - Hierarchy has 4 levels, 118 roots
• Department Store: 200,000 items, 500,000 transactions
  - Hierarchy has 7 levels, 89 roots