chapter 10 association rules

71
© 2007 Cios / Pedrycz / Swiniarski / Kurgan Chapter 10 ASSOCIATION RULES Cios / Pedrycz / Swiniarski / Kurgan Cios / Pedrycz / Swiniarski / Kurgan

Upload: parker

Post on 11-Jan-2016

48 views

Category:

Documents


3 download

DESCRIPTION

Chapter 10 ASSOCIATION RULES. Cios / Pedrycz / Swiniarski / Kurgan. Outline. Introduction Association Rules and Transactional Data Basic Concepts Mining Single Dimensional, Single-Level Boolean Association Rules Naïve Algorithm Apriori Algorithm - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

Chapter 10ASSOCIATION RULES

Cios / Pedrycz / Swiniarski / KurganCios / Pedrycz / Swiniarski / Kurgan

Page 2: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 2

Outline

• Introduction• Association Rules and Transactional Data

– Basic Concepts

• Mining Single Dimensional, Single-Level Boolean Association Rules

– Naïve Algorithm– Apriori Algorithm– Generating Association Rules from Frequent Itemsets– Improving Efficiency of the Apriori Algorithm– Finding Interesting Association Rules

Page 3: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 3

Introduction

Association rules mining is another, after clustering, key unsupervised data mining method that finds interesting associations (relationships, dependencies) in large sets of data items.

Page 4: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 4

Introduction

AR are used to describe associations or correlations among a set of items in transaction databases, relational databases, and data warehouses

– applications:• basket data analysis, cross-marketing, catalog design,

clustering, data preprocessing, genomics, etc.- rule format: LHS RHS [support, confidence]

Examples:

buys(x, diapers) buys(x, beers) [0.5%, 60%]

major(x, CS) AND takes(x, Advanced Data Analysis) level(x, PhD) [1%, 75%]

Page 5: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

5

RulesBody ==> Consequent [ Support , Confidence ]

• Body: represents the examined data. • Consequent: represents a discovered property

for the examined data. • Support: represents the percentage of the

records satisfying the body or the consequent. • Confidence: represents the percentage of the

records satisfying both the body and the consequent to those satisfying only the body.

Page 6: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 6

Introductionin this shopping basket

customer bought tomatoes, carrots, bananas, bread, eggs,

sup, milk, etc.how the demographical

information affects what the customer buys?

is bread usually bought with milk?

does a specific milk brand make any difference?

where we place the tomatoes in the store to maximize their

sales?

is the bread bought when both milk and eggs are bought

together?

what can be learned using association rules?

Page 7: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 7

IntroductionAR can be derived when data describes events that occur

at the same time / in close proximity.

They can be:– useful, when containing high quality, actionable information

• diapers beer

– trivial, when they are valid and supported by data, but useless since they describe well known facts

• milk eggs

– inexplicable, when they concern valid and new facts, but they cannot be utilized

• grocery store milk is sold as often as bread

Page 8: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 8

IntroductionSome of the common applications:

– planning store layouts• we can place products that have a strong purchasing relationship close

together OR• we place such products far apart to increase traffic to purchase other

items

– planning bundling products and offering coupons• knowing that buys(x, diapers) buys(x, beers), discounts are not offered

on both beer and diapers at the same time– we discount one to increase sales and make money on the other

– designing direct marketing campaign• mailing a camcorder promotion to people who bought VCR is best when it

comes approximately two to three months after the VCR purchase

– other applications will be discussed later…

Page 9: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 9

Introduction

How do we derive ARs:

– techniques are based on probability and statistics

– the process consists of 4 steps1. prepare input data

2. choose items of interest… (itemset)

3. compute probabilities, joint probabilities, etc…

4. find (the most probable) association rules…

Page 10: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 10

Transactional Data

Data should be provided in transactional form– each record consists of transaction ID and information

about all items that constitute the transaction

Transaction ID Subset of all available items

TID Transaction (basket)

1000 Beer, Diapers, Eggs

… ….

Page 11: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 11

Transactional Data

First look at the transactional data– example

– rules (just by looking at the data)• Beer Eggs

• Apples Celery

TID Transaction (basket)

1000 Apples, Celery, Diapers

2000 Beer, Celery, Eggs

3000 Apples, Beer, Celery, Eggs

4000 Beer, Eggs

Page 12: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 12

Transactional Data

• Nominal data can also be transformed into transactions

– transformed into transactional format

ExampleAttributes

Target Attribute

Bar Fri/Sat Hungry Rain Estimate Wait?

e1 No No Yes No 0-10 Yes

… … … … … … …

TID Transaction (basket)

1 Bar=No, Fri/Sat=No, Hungry=Yes, Rain=No, Estimate=0-10, Wait?=Yes

… ….

Page 13: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 13

Association Rules

What do we want to do?– given database of transactions, where each transaction is a

list of items (purchased by a customer in a visit)

– find all rules that correlate presence of one set of items with that of another set of items• how many items do we consider in each set?

• how do we define correlation?– how strong the correlation should be?

• how do we extract useful rules?

Let us answer these questions

Page 14: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 14

Basic Definitions

itemset

transaction

association rule

dataset (set of transactions)

IT

},...,,{ 21 miiiI

BAIBIA

BA

,,

D

Page 15: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 15

Basic Definitions

How to measure interestingness of the rules?– each rule has two assigned measures

• support which indicates the frequencies of the occurring patterns

– defined as ratio of # of transactions containing A and B to the total # of transactions

• confidence which denotes the strength of implication in the rule

– defined as ratio of # of transactions containing A and B to the #of transactions containing A

||||

||}|{||)(

D

TBADTBAs

||}|{||

||}|{||)(

TADT

TBADTBAc

Page 16: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 16

Basic Definitions

How to measure interestingness of the rules?– example: diapers beer

• support

• confidence

||||

||}|{||)(

D

TBADTBAs

||}|{||

||}|{||)(

TADT

TBADTBAc

customerbuys diapers

customerbuys both

customerbuys beer

Page 17: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 17

Basic Definitions

How to measure interestingness of the rules?• let’s use a more complex rule and explain the concepts in

terms of probabilities– find all the rules A and B C with minimum confidence and

support» support is probability that a transaction contains {A, B, C}» confidence is conditional probability that a transaction

having {A, B}, also contains C

A and B C (support is 25%, confidence is 100%)

if we decide to have minimum support of 50% and minimum confidence of 50%, we generate two rules from the data:

A C (support 50%, confidence 66.6%)

C A (support 50%, confidence 100%)

TID Transaction

1 A, B, C

2 A, C

3 A, D

4 B, E, F

Page 18: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 18

Basic Definitions

What is I?

What is T for TID=2000?

What is support(Beer Eggs)?

What is confidence(Beer Eggs)?

TID Transaction (basket)

1000 Apples, Celery, Diapers

2000 Beer, Celery, Eggs

3000 Apples, Beer, Celery, Eggs

4000 Beer, Eggs

Page 19: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 19

Basic Definitions

What is I?

Apples, Beer, Celery, Diapers, Eggs m (# of items) = 5

What is T for TID=2000?

Beer, Celery, Eggs

What is support(Beer Eggs)?

3/4 = 75%

What is confidence(Beer Eggs)?

3/3 = 100%

TID Transaction (basket)

1000 Apples, Celery, Diapers

2000 Beer, Celery, Eggs

3000 Apples, Beer, Celery, Eggs

4000 Beer, Eggs

Page 20: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 20

Basic Definitions

Frequent itemsets– itemset is any set of items– k-itemset is an itemset containing k items– frequent itemset is an itemset that satisfies a minimum

support level

– the problem arises when we try to analyze dataset that contains m items

• how many itemsets are there?– many if m is large

Page 21: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 21

Strong Association Rules

Given an itemset it is easy to generate association rulesExample itemset = {Beer, Diapers}:

Beer, Diapers• Beer Diapers• Diapers Beer• Diapers, Beer

– we are interested only in strong rules• those which satisfy minimum support and minimum confidence

– these two are user-defined parameters

Page 22: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 22

Strong Association Rules

Given itemsets, it is easy to generate association rules– example itemsets:

{Beer, Diapers} with support 40%

{Beer, Diapers, Eggs} with support 20%

– rule

IF customer buys Beer and Diapers THEN the probability that (s)he buys Eggs is 50% can be inferred

• This provides simple descriptive pattern

Page 23: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 23

Association Rules

How to generate ARs?– two basic steps

1. Find all frequent itemsets – those that satisfy minimum support

2. Find all strong association rules– generate association rules from frequent itemsets– select and keep rules that satisfy minimum confidence

Page 24: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 24

Naïve Algorithm

How do we generate frequent itemsets (step 1)?– naïve algorithm

n = |D|for each subset s of I { counter = 0 for each transaction T in D { if s is a subset of T counter = counter + 1 } if minimum support ≤ counter / n add s to frequent itemsets }

Page 25: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 25

Frequent Itemsets

Does the naïve algorithm work well?• we have 2m subsets of I

• we have to scan n transactions for each subset

– thus we perform O(2mn) tests

– O(2mn) complexity shows that• the algorithm growth exponentially with the number of

items m

• Thus we have to use some other approach!

Page 26: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 26

Frequent Itemsets

FIs support the apriori property– if A is not a frequent itemset, then any superset of A is

not a frequent itemset• we will use this property to speed up the computations

Proofn is # of transactions

suppose A is a subset of i transactions

if A’ A, then A’ is a subset of i’ i transactions

thus

if i/n < minimum support, so is i’/n

Page 27: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 27

Frequent Itemsets

Using the apriori property

– candidate k-itemsets are build from frequent (k-1)-itemsets

• find all frequent 1-itemsets

• extend (k-1)-itemsets to candidate k-itemsets

• prune candidate itemsets that do not meet the minimum support requirement

Page 28: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 28

Apriori Algorithm

Improved algorithm to generate frequent itemsets

L1 = {frequent 1-itemsets}for (k=2; L(k-1) is not empty; k++) { Ck is generated as k-itemset candidate from L(k-1)

for each transaction T in D { Ct=subset(Ck,T) // k-itemsets that are subsets of T for each k-itemset c in Ct

c.count++; } Lk = {c in Ck such that c.count ≥ minimum support} }the frequent itemsets are the union of the Lk

Page 29: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 29

Frequent Itemsets

Apriori approach reduces number of considered itemsets (# of scans of D)– how do we generate k-itemset candidates?

1. for each item i that is not in a given frequent (k-1)-itemset, but in some other frequent (k-1) itemset in Lk-1

– add i to the (k-1)-itemset to create a k-itemset candidate– remove duplicates

» examplefrequent 1-itemsets {A}, {B}, {C}candidate 2-itemsets {A, B}, {A, C}, {B, A}, {B, C}, {C, A}, {C, B}eliminate duplicates {A, B}, {A, C}, {B, C}

2. joining together frequent (k-1)-itemsets– if frequent (k-1)-itemsets have (k-2)-items in common, create a k-itemset

candidate by adding two different items to (k-2) common items» example

{A, B, C} joined with {A, B, D} gives {A, B, C, D}

L1 = {frequent 1-itemsets}for (k=2; L(k-1) is not empty; k++) { Ck is generated as k-itemset candidates from L(k-1)

for each transaction T in D { Ct=subset(Ck,T) // candidates that are subsets of T for each candidate c in Ct

c.count++; } Lk = {c in Ck such that c.count ≥ minimum suport} }the frequent itemsets are the union of the Lk

Page 30: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 30

Association Rules

1. The frequent set is computed iteratively– 1st iteration

• large 1-candidate itemsets are found by scanning

– kth iteration• large k-candidate itemsets are generated by applying apriori-based

generation to large (k-1)-itemsets• apriori rule generates only those k-itemsets whose every

(k-1)-itemset subset is frequent (above the threshold)

2. Generating rules– for each frequent itemset X output all rules

Y (X – Y) if s(X) / s(Y) > minimum confidence• Y is a subset of X

1. Find all frequent itemsets – those that satisfy minimum support

2. Find all strong association rules– generate association rules from frequent itemsets– select and keep rules that satisfy minimum confidence

Page 31: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 31

Example

We will generate association rules from the transactional data given below:– minimum support = 50%

• the # of transactions above the minimum support is 4 x 50 % = 2– minimum confidence = 60%

TID Transaction

1000 Apples, Celery, Diapers

2000 Beer, Celery, Eggs

3000 Apples, Beer, Celery, Eggs

4000 Beer, Eggs

TID Transaction

1000 A, C, D

2000 B, C, E

3000 A, B, C, E

4000 B, E

Page 32: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 32

Exampleitemset

support count

A 2B 3C 3D 1E 3

TID Transaction

1000 A, C, D

2000 B, C, E

3000 A, B, C, E

4000 B, E

generate candidate 1-itemsets

generate frequent 2-itemsets

itemset

support count

A 2B 3C 3E 3delete candidates

below minimum support

generate candidate 2-itemsets

itemset

support count

A, B 1A, C 2A, E 1B, C 2B, E 3C, E 2

generate frequent 1-itemsets

itemset

support count

A, C 2B, C 2B, E 3C, E 2

generate frequent 3-itemsets

generate candidate 3-itemsets

itemset support count

B, C, E 2

we do not use 3-itemsets with (A, B)

and (A, E)they are below

minimum support

itemset support count

B, C, E 2

C1

L1

C2

C3

L2

L3

Page 33: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 33

Example

Finally we derive association rules• from generated frequent 3-itemset {B, C, E} with s = 50%

– we need to satisfy minimum confidence of 60%

B and C E with support = 50% and confidence = 100%B and E C with support = 50% and confidence = 66.7%C and E B with support = 50% and confidence = 100%B C and E with support = 50% and confidence = 66.7%C B and E with support = 50% and confidence = 66.7%E B and C with support = 50% and confidence = 66.7%

TID Transaction

1000 A, C, D

2000 B, C, E

3000 A, B, C, E

4000 B, E

Page 34: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 34

Improving Efficiency of Apriori

The Apriori algorithm can be modified to improve its efficiency (computational complexity) by:– hashing– removal of transactions that do not contain frequent

itemsets– sampling of the data – partitioning of the data– mining frequent itemsets without generation of

candidate itemsets

Page 35: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 35

Improving efficiency: Hashing

• Hashing– is used to reduce the size of the candidate k-itemsets,

i.e., itemsets that are generated from frequent itemsets from iteration k-1, Ck, for k>1

• for instance, when scanning D to generate L1 from the candidate 1-temsets in C1, we can at the same time generate all 2-itemsets for each transaction, hash (map) them into different buckets of the hash table structure and increase the corresponding bucket counts

• a 2-itemset, which corresponding bucket count is below the support threshold, cannot be frequent and thus we can remove it from the candidate set C2.

In this way we reduce the number of candidate 2-itemsets that must be examined to obtain L2

Page 36: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 36

Improving efficiency: Hashing

• To add itemset, start at root and go down until a leaf is reached

• At interior node at depth d, decide which branch to follow by applying hash function to the dth item of the itemset

• When # of items in a leaf node exceeds some threshold convert leaf node to an internal node

Page 37: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 37

Improving efficiency: Hashing

• To find candidates contained in given transaction, t– Hash on every item in t at the root node

• to ensure that itemsets that start with an item not in t are ignored

– At interior node reached by hashing the item i in the transaction hash on each item that comes after i in t

Page 38: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 38

Improving efficiency: Hashing

Let C3 = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}

Ct = subset(C3, {1 0 1 1 0})

First build hash-tree with candidate itemsets

Then determine which of those itemsets are actually in the given transaction

Page 39: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 39

Hash-Tree

Hash Table

{2 3 4}

ROOT

{1 3 4}{1 3 5}

{1 2 3}{1 2 4}

1

2 3

Ct = {1 3 4}

d = 2 Hash on the next

item in t: 3, ignore itemsets with 2 since not in t.

2

t = {1 0 1 1 0}

At root (d=1), hash on items in t: 1, 3, 4. 3, 4 return nothing since no

itemsets start with 3 or 4. 2 Is ignored since not in t.

Check which are in t. Those that are get added to Ct as

output from subset function.

Manually verify that of the 5 itemsets in C3 only {1 3 4} was actually present in the transaction

Page 40: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

Direct Hashing and Pruning(DHP)

40

Page 41: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

DHP

41

Page 42: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 42

Improving Efficiency of Apriori

• Removal of transactions that do not contain frequent itemsets– we remove transactions that do not contain frequent

itemsets– In general, if a transaction does not contain any

frequent k-itemsets, it cannot contain any frequent (k+1) itemsets, and thus can be removed from computation of any frequent t-itemsets, where t > k

Page 43: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 43

Improving Efficiency of Apriori

• Sampling of the data– we generate association rules based on a sampled

subset of transactions in D• a randomly selected subset S of D is used to search for the

frequent itemsets

• generation of frequent itemsets from S is more efficient (faster) but some of the rules that would have been generated from D may be missing, and some rules generated from S may not be present in D

Page 44: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 44

Improving Efficiency of Apriori• Partitioning of the data

– partitioning generates frequent itemsets by finding frequent itemsets in subsets (partition) of D

Page 45: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 45

Improving Efficiency of Apriori

• Mining frequent itemsets without generation of candidate itemsets– one of the main limiting aspects of the Apriori is that it

can generate very large number of candidate itemsets • for instance, for 10,000 1-itemsets, the Apriori algorithm generates

approximately 10,000,000 candidate 2-itemsets

– other limiting aspect is that the Apriori may need to repeatedly scan the data set D

– to address these issues, a divide-and-conquer method, which decomposes the overall problem into a set of smaller tasks, is used

• the method, referred to as frequent-pattern growth (FP-growth) compresses the set of frequent (individual) items from D into a frequent pattern tree (FP-tree)

Page 46: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

46

General Algorithm to be improved1. In the first pass, the support of each individual

item is counted, and the large ones are determined

2. In each subsequent pass, the frequent itemsets determined in the previous pass is used to generate new itemsets called candidate itemsets.

3. The support of each candidate itemset is counted, and the frequent ones are determined.

4. This process continues until no new large itemsets are found.

Page 47: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

47

AIS Algorithm

• Candidate itemsets are generated and counted on-the-fly as the database is scanned.

1. For each transaction, it is determined which of the frequent itemsets of the previous pass are contained in this transaction.

2. New candidate itemsets are generated by extending these frequent itemsets with other items in this transaction.

• The disadvantage is that this results in unnecessarily generating and counting too many candidate itemsets that turn out to be small.

Page 48: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

48

Example

TID Items

100 1 3 4

200 2 3 5

300 1 2 3 5

400 2 5

Database

Itemset

Support

{1} 2

{2} 3

{3} 3

{5} 3

L1Itemse

tSuppo

rt

{1 3}* 2

{1 4} 1

{3 4} 1

{2 3}* 2

{2 5}* 3

{3 5}* 2

{1 2} 1

{1 5} 1

C2

Itemset

Support

{1 3 4} 1

{2 3 5}*

2

{1 3 5} 1

C3

Page 49: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

49

SETM Algorithm

• Candidate itemsets are generated on-the-fly as the database is scanned, but counted at the end of the pass.

1. New candidate itemsets are generated the same way as in AIS algorithm, but the TID of the generating transaction is saved with the candidate itemset in a sequential structure.

2. At the end of the pass, the support count of candidate itemsets is determined by aggregating this sequential structure

• It has the same disadvantage of the AIS algorithm.• Another disadvantage is that for each candidate itemset,

there are as many entries as its support value.

Page 50: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

50

Example

TID Items

100 1 3 4

200 2 3 5

300 1 2 3 5

400 2 5

Database

Itemset

Support

{1} 2

{2} 3

{3} 3

{5} 3

L1Itemse

tTID

{1 3} 100

{1 4} 100

{3 4} 100

{2 3} 200

{2 5} 200

{3 5} 200

{1 2} 300

{1 3} 300

{1 5} 300

{2 3} 300

{2 5} 300

{3 5} 300

{2 5} 400

C2

Itemset

TID

{1 3 4} 100

{2 3 5} 200

{1 3 5} 300

{2 3 5} 300

C3

Page 51: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

51

Apriori Algorithm

• Candidate itemsets are generated using only the frequent itemsets of the previous pass without considering the transactions in the database.

1.The frequent itemset of the previous pass is joined with itself to generate all itemsets whose size is higher by 1.

2.Each generated itemset, that has a subset which is not frequent, is deleted. The remaining itemsets are the candidate ones.

Page 52: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

52

Example

TID Items

100 1 3 4

200 2 3 5

300 1 2 3 5

400 2 5

Database

Itemset

Support

{1} 2

{2} 3

{3} 3

{5} 3

L1Itemse

tSuppo

rt

{1 2} 1

{1 3}* 2

{1 5} 1

{2 3}* 2

{2 5}* 3

{3 5}* 2

C2

Itemset

Support

{2 3 5}*

2

C3 {1 2 3}

{1 3 5}

{2 3 5}

Page 53: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

53

AprioriTid Algorithm

• The database is not used at all for counting the support of candidate itemsets after the first pass.

1. The candidate itemsets are generated the same way as in Apriori algorithm.

2. Another set C’ is generated of which each member has the TID of each transaction and the large itemsets present in this transaction. This set is used to count the support of each candidate itemset.

• The advantage is that the number of entries in C’ may be smaller than the number of transactions in the database, especially in the later passes.

Page 54: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

54

Example

TID Items

100 1 3 4

200 2 3 5

300 1 2 3 5

400 2 5

Database

Itemset

Support

{1} 2

{2} 3

{3} 3

{5} 3

L1Itemse

tSuppo

rt

{1 2} 1

{1 3}* 2

{1 5} 1

{2 3}* 2

{2 5}* 3

{3 5}* 2

C2

Itemset

Support

{2 3 5}*

2

C3

100 {1 3}

200 {2 3}, {2 5}, {3 5}

300 {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

400 {2 5}

C’2200 {2 3

5}

300 {2 3 5}

C’3

Page 55: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

55

Performance Analysis

Page 56: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

56

AprioriHybrid Algorithm

Performance Analysis shows that:

1. Apriori does better than AprioriTid in the earlier passes.

2. AprioriTid does better than Apriori in the later passes.

• Hence, a hybrid algorithm can be designed that uses Apriori in the initial passes and switches to AprioriTid when it expects that the set C’ will fit in memory.

Page 57: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

RARM and FOLDARM

• Both the techniques use a data structure SOTrieIT (Support ordered Trie Itemset)– Trie: position in the tree defines the key it is associated with– Support-ordered: most frequent on left-most branch

• This trie-like tree structure stores the support counts of all 1-itemsets and 2-itemsets in the database.– This structure is then sorted according to the support counts of

each node in descending order

• Unlike RARM, FOLDARM allows transactions and unique transactional items to be added and removed without the need to destroy the current SOTrieIT and rebuild it.

57

Page 58: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan

FP-Growth Algorithm

58

Page 59: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 59

Finding Interesting Association Rules

Depending on the minimum support and confidence values the user may generate a large number of rules to analyze and assess

How to filter out the rules that are potentially the most interesting? – whenever a rule is interesting (or not) can be evaluated either

objectively or subjectively

• the ultimate but subjective user’s evaluation cannot be quantified or anticipated; they are different for different users

• that is why objective interestingness measures, based on the statistical information present in D are used

Page 60: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 60

Finding Interesting Association Rules

The subjective evaluation of association rules often boils down to checking if a given rule is unexpected (i.e., surprises the user) and actionable (i.e., the user can use it for something useful) – useful, when they provide high quality actionable

informatione.g. diapers beers

– trivial, when they are valid and supported by data but useless since they confirm well known facts

e.g. milk bread

– inexplicable, when they contain valid new facts but cannot be utilized

e.g. grocery_store milk_is_sold_as_often_as_bread

Page 61: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 61

Finding Interesting Association Rules

In most cases, confidence and support values associated with each rule are used as the objective measure to select the most interesting rules

– rules that have these values higher with respect to other rules are preferred

– although this simple approach works in many cases, we will show that sometimes rules with high confidence and support may be uninteresting or even misleading

Page 62: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 62

Objective interestingness measures – example

• let us assume that transactional data contains milk and bread as the frequent items

• 2,000 transactions were recorded – in 1,200 customers bought milk– in 1,650 customers bought bread– in 900 customers bought both milk and bread

milk not milk total

bread 900 750 1650

not bread 300 50 350

total 1200 800 2000

Finding Interesting Association Rules

Page 63: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 63

Objective interestingness measures – Example

• given the minimum support threshold of 40% and minimum confidence threshold of 70% rule

“milk bread [45%, 75%]” would be generated

• on the other hand, due to low support and confidence values the rule “milk not bread [15%, 25%]” would not be generated

• the latter rule is by far more “accurate” while the first may be misleading

Finding Interesting Association Rules

Page 64: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 64

Objective interestingness measures – example

• “milk bread [45%, 75%]” rule– probability of buying bread is 82.5%, while confidence of

milk bread is lower and equals 75%– bread and milk are negatively associated, i.e., buying one

decreases buying the other– obviously using this rule would not be a wise decision

Finding Interesting Association Rules

milk not milk total

bread 900 750 1650

not bread 300 50 350

total 1200 800 2000

Page 65: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 65

Finding Interesting Association Rules

Objective interestingness measures – alternative approach to evaluate interestingness of

association rules is to use measures based on correlation– for A B rule, the itemset A is independent of the

occurrence of the itemset B if P(A B) = P(A)P(B). Otherwise, itemsets A and B are dependent and correlated as events.

– correlation measure (also referred to as lift and interest), is defined between itemsets A and B as

)()(

)(

BPAP

BAP

Page 66: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 66

Finding Interesting Association Rules

Objective interestingness measures – correlation measure

• if the correlation value is less than 1, then the occurrence of A is negatively correlated (inhibits) the occurrence of B

• if the value is greater than 1, then A and B are positively correlated, which means that occurrence of one promotes occurrence of the other

• if correlation equals 1, then A and B are independent, i.e., there is no correlation between these itemsets

– correlation value for the rule milk bread is0.45 / (0.6*0.825) = 0.45 / 0.495 = 0.91

Page 67: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 67

Other measures for association rule

Objective interestingness measures – All-confidence– Collective strength– Coverage: It measures how often a rule X -> Y is applicable in a

database similar to support– Conviction: the frequency at which the rule makes an incorrect

prediction– Succinctness: characterized by clear, precise expression in few

words– Leverage: Leverage measures the difference of X and Y appearing

together in the data set and what would be expected if X and Y where statistically dependent. 

– Lift: Leverage measures the difference of X and Y appearing together in the data set and what would be expected if X and Y where statistically dependent. 

Page 68: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 68

Applications

• Web Personalization– personalization and recommendation systems for

browsing web pages• e.g. Amazon’s recommended books

Mobasher, et al., Effective Personalization Based on Association Rule Discovery from Web Usage Data, ACM Workshop on Web Information and Data Management, 2001

• Genomic Data– mining to find associations in genomic data

Satou, K., et al., Finding Association Rules on Heterogeneous Genome Data, Proceedings of the Pacific Symposium on Biocomputing, pp.397-408, pp.1997

Page 69: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 69

Genomic Data

Summary of results– 182,388 association rules were generated with

• minimum support = 5%• minimum confidence = 65%

– post processing of the generated rules selected rules with maximum support of 30

• final number of generated rules was 381– they were corroborated by biological data

» common substructures in serine endopeptidases were found

Page 70: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 70

References

R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules Between Sets of Items in Large Databases, Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 207-216, 1993

R. Agrawal, and R. Srikant, Fast Algorithms for Mining Association Rules, Proceedings of the 20th VLDB Conference, pp. 487-499, 1994

J. Han, M. Kamber, Data Mining, Morgan Kaufmann, 2001

Jong Soo Park, Member, Ming-Syan Chen and Philip S. Yu, Using a Hash-Based Method with Transaction Trimming for Mining Association Rules, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 5, SEPTEMBER/OCTOBER 1997

• Slides 5, 46 – 56 from Mohamed G. Elfeky, Purdue University• Slide 58, Florian Verhein, School of Information Technologies,The

University of Sydney,Australia

Page 71: Chapter 10 ASSOCIATION RULES

© 2007 Cios / Pedrycz / Swiniarski / Kurgan 71

References

• Amitabha Das, Wee-Keong Ng, Yew-Kwong Woon, Rapid association rule mining, CIKM '01: Proceedings of the tenth international conference on Information and knowledge management, Oct 2001

• Yew-Kwong Woon Wee-Keong Ng Amitabha Das, Fast Online Dynamic Association Rule Mining, Web Information Systems Engineering, 2001. Proceedings of the Second International Conference, 2001