Mining Association Rules

Post on 18-Dec-2015


Page 1: Mining Association Rules. Association rules Association rules… –… can predict any attribute and combinations of attributes … are not intended to be used

Mining Association Rules

Association rules

• Association rules…

– … can predict any attribute and combinations of attributes

– … are not intended to be used together as a set of rules

• Problem: immense number of possible associations

– Output needs to be restricted to show only the most predictive associations: only those with high support and high confidence.

• Support: number of instances predicted correctly

• Confidence: number of correct predictions, as a proportion of all instances that the rule applies to

• Example: 4 cool days with normal humidity

If temperature = cool then humidity = normal

Support = 4, confidence = 100%

• Normally: minimum support and confidence are prespecified

– (e.g. 58 rules with minimum support 2 and minimum confidence 95% for weather data)
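
As a rough illustration of the two measures, the sketch below counts support and confidence for the cool/normal rule. It assumes the example refers to the standard 14-instance nominal weather dataset; the names `WEATHER`, `covers` and `support_and_confidence` are made up for this sketch.

```python
# Assumption: the standard 14-instance nominal weather dataset.
ATTRS = ("outlook", "temperature", "humidity", "windy", "play")
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
INSTANCES = [dict(zip(ATTRS, row)) for row in WEATHER]


def covers(instance, tests):
    """True if the instance passes every attribute = value test."""
    return all(instance[attr] == value for attr, value in tests.items())


def support_and_confidence(instances, antecedent, consequent):
    """Return (support, confidence) of 'if antecedent then consequent'."""
    applies = [x for x in instances if covers(x, antecedent)]
    correct = [x for x in applies if covers(x, consequent)]
    confidence = len(correct) / len(applies) if applies else 0.0
    return len(correct), confidence


support, confidence = support_and_confidence(
    INSTANCES, {"temperature": "cool"}, {"humidity": "normal"}
)
print(f"support = {support}, confidence = {confidence:.0%}")
```

Support is counted over instances that satisfy both sides of the rule, while confidence divides that count by the number of instances that satisfy the antecedent alone.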

Item sets

• Support: the number of instances covered by all tests in the rule (LHS and RHS!)

• Item: one test, i.e. an attribute-value pair

– Terminology derived from market basket analysis, where the items are articles in a shopping cart.

• Item set: all items occurring in a rule

• Goal: find only rules that exceed a pre-defined support

We can do this by finding all item sets with the given minimum support and then generating rules from them!

Weather data

Item sets for weather data

Generating rules from an item set

The numbers on the right show the number of instances for which all three conditions are true (that is, the coverage) divided by the number of instances for which the conditions in the antecedent are true.

Interpreted as a fraction, they represent the accuracy. The numbers are easily obtained from the previous table.
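
The sketch below shows one way such fractions could be computed. The table from the slide is not reproduced in this transcript, so the item set used here (humidity = normal, windy = false, play = yes) and the standard 14-instance weather dataset are assumptions, as are the names `count` and `rules_from_item_set`.

```python
from itertools import combinations

# Assumption: the standard 14-instance nominal weather dataset.
ATTRS = ("outlook", "temperature", "humidity", "windy", "play")
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
INSTANCES = [dict(zip(ATTRS, row)) for row in WEATHER]


def count(instances, items):
    """Number of instances that satisfy every (attribute, value) item."""
    return sum(all(x[a] == v for a, v in items) for x in instances)


def rules_from_item_set(instances, item_set):
    """List (antecedent, consequent, coverage, antecedent_count) for every
    split of the item set that leaves a non-empty consequent."""
    items = sorted(item_set)
    coverage = count(instances, items)
    rules = []
    for size in range(len(items)):  # antecedent sizes 0 .. n-1
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append(
                (antecedent, consequent, coverage, count(instances, antecedent))
            )
    return rules


# Hypothetical choice of item set for illustration.
ITEM_SET = {("humidity", "normal"), ("windy", "false"), ("play", "yes")}
RULES = rules_from_item_set(INSTANCES, ITEM_SET)
for antecedent, consequent, coverage, n in RULES:
    lhs = " and ".join(f"{a} = {v}" for a, v in antecedent) or "(nothing)"
    rhs = " and ".join(f"{a} = {v}" for a, v in consequent)
    print(f"If {lhs} then {rhs}: {coverage}/{n}")
```

The numerator is the same for every rule because all rules come from one item set; only the antecedent count in the denominator changes.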

Rules for the weather data

• Rules with support > 1 and confidence = 100%:

• In total: 3 rules with support four, 5 with support three, and 50 with support two.

Generating item sets efficiently

• How can we efficiently find all frequent item sets?

• Finding one-item sets is easy: one pass through the data.

• Idea: use the one-item sets to generate two-item sets, two-item sets to generate three-item sets, …

– If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!

• A, B are attribute-value tests.

• In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent

Compute candidate k-item sets by merging (k-1)-item sets

Passes on Data

• For each k, finding the k-item sets involves a pass through the dataset to count how many instances contain each candidate set.

• After the pass, the surviving item sets are stored in a hash table.

– Recall that a hash table allows very fast lookup.

• The key of an entry in the hash table will be the concatenation of items in the item set.

• The value of an entry will be the coverage.
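
A minimal sketch of one such counting pass, using a Python `dict` as the hash table. The function name `count_pass`, the toy transactions, and the use of sorted tuples as keys (standing in for the "concatenation of items") are assumptions made for this example.

```python
def count_pass(transactions, candidates, min_support):
    """One pass over the data: count how many transactions contain each
    candidate item set and keep only those reaching min_support.

    Keys of the returned dict are the item sets (sorted tuples),
    values are their coverage.
    """
    counts = {candidate: 0 for candidate in candidates}
    for transaction in transactions:
        for candidate in candidates:
            if set(candidate) <= transaction:  # all items present
                counts[candidate] += 1
    return {c: n for c, n in counts.items() if n >= min_support}


# Hypothetical toy data: each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]
candidates = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
frequent = count_pass(transactions, candidates, min_support=2)
print(frequent)
```

Later steps can then look up the coverage of any surviving item set in constant expected time, for example `frequent[("A", "B")]`.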

An example

• Given: five three-item sets

(A B C), (A B D), (A C D), (A C E), (B C D)

• Candidate four-item sets:

(A B C D) OK because all the 3-item subsets are frequent item sets

(A C D E) Not OK because (C D E) is not a frequent item set.

• The hash table helps with these checks.

• Final check by counting instances in dataset!
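
The merge-and-check step in this example can be sketched as follows. This is one common way to do it (merge two sorted (k-1)-item sets that share all but their last item, then discard any candidate with an infrequent subset); the function name `generate_candidates` is made up for the sketch.

```python
from itertools import combinations


def generate_candidates(frequent):
    """Build candidate k-item sets from frequent (k-1)-item sets.

    Each item set is a sorted tuple. Two sets are merged when they agree
    on everything except their last item; a candidate is kept only if
    all of its (k-1)-item subsets are themselves frequent.
    """
    if not frequent:
        return []
    frequent = sorted(frequent)
    known = set(frequent)  # hash-based lookup for the subset checks
    k = len(frequent[0]) + 1
    candidates = []
    for i, first in enumerate(frequent):
        for second in frequent[i + 1:]:
            if first[:-1] != second[:-1]:
                break  # list is sorted, so no later set shares the prefix
            candidate = first + (second[-1],)
            if all(sub in known for sub in combinations(candidate, k - 1)):
                candidates.append(candidate)
    return candidates


three_item_sets = [
    ("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
    ("A", "C", "E"), ("B", "C", "D"),
]
print(generate_candidates(three_item_sets))
```

Surviving candidates are still only candidates: their actual coverage has to be confirmed by counting instances in the dataset.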

Generating rules efficiently

• Suppose we found the frequent item sets.

• Now, we are looking for all high-confidence rules

– Support of antecedent obtained from hash table

– But: brute-force method considers 2^N - 1 possible rules for an N-item set

• Better way: building (c+1)-consequent rules from c-consequent ones

• Observation: a (c+1)-consequent rule can only hold if all corresponding c-consequent rules also hold

• Resulting algorithm similar to procedure for large item sets
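
A minimal sketch of this level-wise idea for a single frequent item set. It assumes a support table keyed by `frozenset`, and it only generates rules with a non-empty antecedent. The counts are those of humidity = normal (H), windy = false (W) and play = yes (P) in the standard weather data, used here as an assumed example; `rules_for_item_set` is a made-up name.

```python
from itertools import combinations


def rules_for_item_set(item_set, support, min_conf):
    """Grow consequents one item at a time, keeping only rules whose
    confidence reaches min_conf. `support` maps frozensets to counts and
    must contain the item set and all of its non-empty subsets."""
    item_set = frozenset(item_set)
    rules = []
    consequents = [frozenset([item]) for item in item_set]
    while consequents and len(consequents[0]) < len(item_set):
        surviving = []
        for consequent in consequents:
            antecedent = item_set - consequent
            confidence = support[item_set] / support[antecedent]
            if confidence >= min_conf:
                rules.append((antecedent, consequent, confidence))
                surviving.append(consequent)
        # Merge surviving c-consequents into (c+1)-consequents and keep
        # only those whose every c-item subset also survived.
        size = len(consequents[0]) + 1
        held = set(surviving)
        merged = {a | b for a, b in combinations(surviving, 2)
                  if len(a | b) == size}
        consequents = [
            m for m in merged
            if all(frozenset(s) in held for s in combinations(m, size - 1))
        ]
    return rules


# Assumed counts, as they would be read from the item-set hash table.
SUPPORT = {
    frozenset({"H"}): 7, frozenset({"W"}): 8, frozenset({"P"}): 9,
    frozenset({"H", "W"}): 4, frozenset({"H", "P"}): 6,
    frozenset({"W", "P"}): 6, frozenset({"H", "W", "P"}): 4,
}
for antecedent, consequent, confidence in rules_for_item_set(
        {"H", "W", "P"}, SUPPORT, min_conf=1.0):
    print(sorted(antecedent), "=>", sorted(consequent), confidence)
```

With a high confidence threshold most one-item consequents fail immediately, so few or no larger consequents are ever generated, which is where the saving over brute force comes from.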

Discussion of association rules

• Above method makes one pass through the data for each different size of item set

• Other possibility: generate (k+2)-item sets just after (k+1)-item sets have been generated, without first validating the (k+1)-item sets.

– Result: more (k+2)-item sets than necessary will be considered, but fewer passes through the data

– Makes sense if data too large for main memory

Tabular format

• Tabular format isn’t always appropriate.

• Association rules are often used in situations where attributes are unary: either present or not.

– E.g. in supermarket basket analysis,

  • an instance is a particular shopping cart

  • attributes are all the items for sale

  • attribute values are present if that item appears in the cart and missing otherwise.

• Most carts contain far fewer items than there are in the supermarket, and it’s more efficient to represent each instance as a list of attributes whose value is present.
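
A small sketch of the two representations, assuming a hypothetical five-item catalogue; `CATALOGUE` and `to_sparse` are illustrative names only.

```python
# Hypothetical catalogue: one unary attribute per item for sale.
CATALOGUE = ["bread", "milk", "eggs", "beer", "nappies"]


def to_sparse(row):
    """Turn a tabular row (one present/missing flag per catalogue item)
    into the list of attributes whose value is present."""
    return [item for item, present in zip(CATALOGUE, row) if present]


tabular_cart = [True, False, False, True, False]  # one flag per item
sparse_cart = to_sparse(tabular_cart)
print(sparse_cart)
```

The tabular row always has one entry per item in the catalogue, whereas the sparse list grows only with the number of items actually in the cart.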