
Mining Multiple-level Association Rules in Large Databases

Authors: Jiawei Han (Simon Fraser University, British Columbia) and Yongjian Fu (University of Missouri-Rolla, Missouri)

Presented by Michael Johnson

IEEE Transactions on Knowledge and Data Engineering, 1999

Outline

1. What is MLAR?
   ● Concepts
   ● Motivation
2. A Method for Mining M-L Association Rules
   ● Problems/Solutions
   ● Definitions
   ● Algorithm Example
   ● Interestingness
   ● Optimizations
3. Conclusions/Future Work
4. Exam Questions


What is MLDM?

What's the difference between the following rules?

Rule A → 70% of customers who bought diapers also bought beer.
Rule B → 45% of customers who bought cloth diapers also bought dark beer.
Rule C → 35% of customers who bought Pampers also bought Samuel Adams beer.

Rule A applies at a generic, higher level of abstraction (product).
Rule B applies at a more specific level of abstraction (category).
Rule C applies at the lowest level of abstraction (brand).

Moving from general rules to specific ones in this way is called drilling down.

What Do We Gain?

Concrete rules allow for:
● More targeted marketing
● New marketing strategies
● More concrete relationships

Hierarchy Types

● Generalization/Specialization ("is-a" relationships)
● Is-a with multiple inheritance
● Whole-part hierarchies (is-part-of / has-part)

Is-A Relationship (generalization to specialization)

Vehicle
├─ 2-Wheels
│  ├─ Motorcycle
│  └─ Bicycle
└─ 4-Wheels
   ├─ SUV
   └─ Sedan

Is-A With Multiple Inheritance

Vehicle
├─ Recreational
│  ├─ Snowmobile
│  └─ Bicycle
└─ Commuting
   ├─ Bicycle
   └─ Car

(Bicycle is both Recreational and Commuting: it inherits from two parents.)

Whole-Part Hierarchies

Computer
├─ Hard Drive
│  ├─ Platter
│  └─ RW Head
├─ CPU
└─ Motherboard
   └─ RAM


MLAR: Main Goal

As usual, we are trying to develop a method to extract non-trivial, interesting, and strong rules from our transactional database. A method which:

● Avoids trivial rules (Milk → Bread): common sense
● Avoids coincidental rules (Toy → Milk): low support

What Do We Need?

1. Data representation at different levels of abstraction
   ● Explicitly stored in databases
   ● Provided by experts or users
   ● Generated via clustering (OLAP)
2. Efficient methods for ML rule mining (the focus of this paper)

Possible Methods

Apply the single-level Apriori algorithm to each of the multiple levels under the same minconf and minsup.

Potential problems?
● Higher levels of abstraction naturally have higher support; support decreases as we drill down.
● So what is the optimal minsup for all levels?
   ● Too high a minsup → too few itemsets for the lower levels
   ● Too low a minsup → too many uninteresting rules

Possible Solutions

● Adapt minsup for each level
● Adapt minconf for each level
● Do both

This paper studies a progressive deepening method, developed as an extension of the Apriori algorithm and focused on minsup.

Assumption

If an item is non-frequent at one level, its descendants no longer figure in further analysis: we explore only the descendants of frequent items as we drill down.

Are there problems that arise from making this assumption?

It may eliminate interesting rules for itemsets at one level whose ancestors are not frequent at the higher levels.

If so, this can be addressed by a two-threshold workaround: use two minsup values at the higher levels, one for filtering infrequent items and the other for passing down frequent items to the lower levels. The latter is called the level passage threshold (lph).

The lph may be adjusted by the user to let the descendants of subfrequent items through (sketched below).
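A minimal sketch of that two-threshold filter in Python (the function and variable names are mine, not the paper's):

```python
def split_by_thresholds(support_counts, minsup, lph):
    """Apply two thresholds at one level, assuming lph <= minsup.

    frequent:    itemsets reported as frequent at this level (>= minsup)
    passed_down: itemsets whose descendants may still be explored at the
                 next level (>= lph), so "subfrequent" items survive
    """
    frequent = {i: s for i, s in support_counts.items() if s >= minsup}
    passed_down = {i: s for i, s in support_counts.items() if s >= lph}
    return frequent, passed_down
```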


Differences From Previous Research

Other studies use the same minsup across the different levels of the hierarchy.

This study:
● Uses different minsup values at different levels of the hierarchy
● Analyzes different optimization techniques
● Studies the use of interestingness measures

Requirements

The transactional database must contain:

1. An item dataset containing the item descriptions: {<Ai>, <description>}
2. A transaction dataset T containing sets of transactions: {Tid, {Ap, ..., Aq}}, where Tid is a transaction identifier (key)
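As a concrete illustration (a sketch: the Python layout and field names are mine; the bar codes come from the sales tables later in the example):

```python
# 1. Item dataset: {<Ai>, <description>}, keyed by item identifier.
item_descriptions = {
    17325: "Foremost 2% milk, 1 Gal",
    # ...
}

# 2. Transaction dataset T: {Tid, {Ap, ..., Aq}}, keyed by transaction id.
transactions = {
    351428: {17325, 92108},
    653234: {23423, 56432},
}
```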


Algorithm Flow

At level 1:
● Generate the frequent itemsets
● Filter the transaction table for frequent items to get T[2]

At subsequent levels:
● Generate candidate itemsets using Apriori
● Calculate support for the generated candidates
● Union the 'passing' itemsets with the existing set of frequent itemsets
● Repeat until no additional frequent itemsets are generated, or the desired level is reached

(See the sketch below.)
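A minimal Python sketch of this progressive-deepening loop (my own function names and simplifications, e.g. the Apriori prune step is omitted; this is not the paper's exact pseudocode):

```python
from itertools import combinations

def generalize(item, level):
    """Roll an encoded item up to `level`: 112 -> '11*' at level 2, '1**' at level 1."""
    s = str(item)
    return s[:level] + "*" * (len(s) - level)

def frequent_itemsets(table, level, k, minsup, candidates=None):
    """Count level-`level` k-itemsets over `table` and keep those meeting minsup."""
    counts = {}
    for items in table.values():
        codes = sorted({generalize(i, level) for i in items})
        for combo in combinations(codes, k):
            if candidates is None or combo in candidates:
                counts[combo] = counts.get(combo, 0) + 1
    return {c: n for c, n in counts.items() if n >= minsup}

def apriori_gen(frequent_k):
    """Apriori join step: (k+1)-candidates from frequent k-itemsets."""
    prev = sorted(frequent_k)
    return {tuple(sorted(set(a) | set(b)))
            for a in prev for b in prev if a < b and a[:-1] == b[:-1]}

def ml_mine(t1, minsup_at, max_level):
    """Mine level 1 on T[1], filter to T[2], then mine each deeper level
    on T[2] with that level's own minsup."""
    result = {}
    l11 = frequent_itemsets(t1, 1, 1, minsup_at[1])
    t2 = {tid: {i for i in items if (generalize(i, 1),) in l11}
          for tid, items in t1.items()}
    t2 = {tid: items for tid, items in t2.items() if items}
    for level in range(1, max_level + 1):
        table = t1 if level == 1 else t2
        lk = frequent_itemsets(table, level, 1, minsup_at[level])
        k = 1
        while lk:
            result.update(lk)
            k += 1
            cands = apriori_gen(lk)
            lk = frequent_itemsets(table, level, k, minsup_at[level], cands)
    return result
```

With the example dataset below and per-level thresholds {1: 4, 2: 3, 3: 3}, this loop reproduces the L[l,k] tables in the walkthrough, e.g. {('1**', '2**'): 4} at level 1 and {('111', '211'): 3} at level 3.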


Definitions

● A pattern or itemset A is one item Ai or a set of conjunctive items Ai Λ ... Λ Aj.
● The support of a pattern A in a set S, σ(A|S), is the number of transactions that contain A versus the total number of transactions.
● The confidence of a rule A → B in S is φ(A→B) = σ(A∪B)/σ(A), i.e. the conditional probability of B given A.
● We specify two thresholds, minsup (σ') and minconf (φ'), with different values at different levels.

● A pattern A is frequent in a set S at level l if the support of A is no less than the corresponding minimum support threshold σ'.
● A rule A → B is strong for a set S if:
   a. each ancestor of every item in A and B is frequent at its corresponding level,
   b. A Λ B is frequent at the current level, and
   c. φ(A→B) ≥ φ' (the minconf criterion).

This ensures that the patterns examined at the lower levels arise from itemsets that have high support at the higher levels.
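In code, the two measures are one-liners (a sketch; `transactions` is a list of item sets):

```python
def support(itemset, transactions):
    """sigma(A|S): fraction of transactions that contain every item of A."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b, transactions):
    """phi(A -> B) = sigma(A u B) / sigma(A), i.e. P(B | A)."""
    return support(a | b, transactions) / support(a, transactions)
```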



Example: Taxonomy

food
├─ milk
│  ├─ 2%
│  └─ chocolate
└─ bread
   ├─ white
   └─ wheat

(Level 1: category; level 2: content; level 3: brand. The level-3 brands include Dairyland and Foremost for milk, and Old Mills and Wonder for bread.)

Generalized ID system: "2% Foremost milk" is coded as GID 112 (1st item in level 1, 1st item in level 2, 2nd item in level 3).
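These rolled-up codes like 11* or 1** can be matched with a simple positional wildcard test (an illustrative sketch, not code from the paper):

```python
def matches(gid, pattern):
    """True if the encoded item `gid` falls under the rolled-up code `pattern`,
    where '*' positions are wildcards: 112 matches '11*' and '1**'."""
    return all(p in ("*", g) for g, p in zip(str(gid), pattern))

assert matches(112, "11*") and matches(112, "1**") and not matches(112, "12*")
```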


Table 1: A sales_transaction table

Trans_id   Bar_code_set
351428     {17325, 92108, ...}
653234     {23423, 56432, ...}

Table 2: A sales_item description relation

Bar_code   GID   Category   Brand      Content   Size    Price
17325      112   milk       Foremost   2%        1 Gal   $3.31
...        ...   ...        ...        ...       ...     ...

Example: Dataset

TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222, 323}
T3    {112, 122, 221, 411}
T4    {111, 121}
T5    {111, 122, 211, 221, 413}
T6    {211, 323, 524}
T7    {323, 411, 524, 713}

Example: Preprocessing

Join the sales_transaction table to the sales_item table to produce the encoded transaction table T[1] (the dataset above).

Then find the level-1 frequent 1-itemsets L[1,1] using minsup = 4.

Example: Step 1

Scanning T[1] with minsup = 4:

L[1,1] (level-1 frequent 1-itemsets):

Itemset   Support
{1**}     5
{2**}     5

L[1,2] (level-1 frequent 2-itemsets):

Itemset      Support
{1**, 2**}   4

Create T[2] by filtering T[1] with L[1,1].
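Using the helpers from the earlier sketch, a few lines reproduce this step on the example dataset (illustrative, with my own variable names):

```python
t1_table = {
    "T1": {111, 121, 211, 221},
    "T2": {111, 211, 222, 323},
    "T3": {112, 122, 221, 411},
    "T4": {111, 121},
    "T5": {111, 122, 211, 221, 413},
    "T6": {211, 323, 524},
    "T7": {323, 411, 524, 713},
}

l11 = frequent_itemsets(t1_table, level=1, k=1, minsup=4)
# -> {('1**',): 5, ('2**',): 5}; 3**, 4**, 5**, 7** fall below minsup

# T[2]: keep only items whose level-1 ancestor is frequent, then drop
# transactions that become empty (T7 disappears).
t2_table = {tid: {i for i in items if (generalize(i, 1),) in l11}
            for tid, items in t1_table.items()}
t2_table = {tid: items for tid, items in t2_table.items() if items}
```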


Example: Step 2

Filtering T[1] with L[1,1] keeps only the items whose level-1 ancestor ({1**} or {2**}) is frequent; T7 drops out entirely.

Filtered T[2]:

TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222}
T3    {112, 122, 221}
T4    {111, 121}
T5    {111, 122, 211, 221}
T6    {211}

Example: Step 3

Find the level-2 frequent itemsets from the filtered T[2] with minsup = 3:

L[2,1] (frequent 1-itemsets):

Itemset   Support
{11*}     5
{12*}     4
{21*}     4
{22*}     4

L[2,2] (frequent 2-itemsets):

Itemset      Support
{11*, 12*}   4
{11*, 21*}   3
{11*, 22*}   4
{12*, 22*}   3
{21*, 22*}   3

L[2,3] (frequent 3-itemsets):

Itemset           Support
{11*, 12*, 22*}   3
{11*, 21*, 22*}   3

Example: Step 4

Find the level-3 frequent itemsets from the filtered T[2] with minsup = 3:

L[3,1] (frequent 1-itemsets):

Itemset   Support
{111}     4
{211}     4
{221}     3

L[3,2] (frequent 2-itemsets):

Itemset      Support
{111, 211}   3

Stop: the lowest level has been reached.


Are All of the Strong Rules Interesting?

MLDM creates unique challenges for rule pruning. The paper defines two filters for interesting rules:

1. Removal of redundant rules
2. Removal of unnecessary rules

Redundant Rules

Consider a strong rule at level 1: Milk → Bread.

Given the taxonomy above (milk specializes into 2% and chocolate, bread into white and wheat), this rule is likely to have descendant rules which may or may not contain additional information, even if they meet our minconf and minsup criteria at their level:

2% Milk → Wheat Bread
2% Milk → White Bread
Chocolate Milk → Wheat Bread

We need a way to distinguish between rules that add information and those that are redundant.

A rule is redundant if its confidence falls within the range expected from an ancestor rule and its items are descendants of that rule's items.
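As a rough sketch of that test (the paper derives the expected range from the ancestor rule's confidence; the fixed tolerance and names here are mine):

```python
def is_redundant(rule_conf, expected_conf, tolerance=0.05):
    """A rule whose items all descend from an ancestor rule's items is
    redundant if its confidence stays within `tolerance` of the confidence
    expected from that ancestor: drilling down added no information."""
    return abs(rule_conf - expected_conf) <= tolerance

# e.g. if Milk -> Bread has confidence 0.70 and 2% Milk -> Wheat Bread
# comes in at 0.68, the descendant rule is pruned as redundant.
assert is_redundant(0.68, 0.70)
```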

Applying redundant-rule reduction eliminates 40-70% of the discovered strong rules.

Unnecessary Rules


Consider the following rules:

R:  Milk → Bread (support = 80%)
R': Milk, Butter → Bread (support = 80%)

How much additional information do we gain from R'?

MLDM can produce very complex rules that meet our minsup and minconf criteria but do not contain much unique or useful information. We need a way to distinguish between rules that add information and those that are unnecessary.

A rule R is unnecessary if there is a simpler rule R' (same consequent, smaller antecedent) and φ(R) is within a given range of φ(R').

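A sketch of that check (the representation and tolerance are mine; rules are (antecedent, consequent) pairs of frozensets and `conf` computes φ):

```python
def is_unnecessary(rule, simpler, conf, tolerance=0.05):
    """R is unnecessary if a simpler rule R' has the same consequent, a proper
    subset of R's antecedent (Milk -> Bread vs. Milk & Butter -> Bread), and
    nearly the same confidence: the extra antecedent items add no value."""
    (a, b), (a2, b2) = rule, simpler
    return b2 == b and a2 < a and abs(conf(rule) - conf(simpler)) <= tolerance
```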


Hardware Setup

Sun Microsystems SPARCstation 20:
1. 32 MB RAM
2. 100 MHz clock
3. CLI

Algorithm Optimizations

The authors proposed three variations of the original algorithm:

1) ML_T1LA: uses only the one encoded table T[1]
2) ML_TML1: generates T[1], T[2], ..., T[n+1]
3) ML_T2LA: uses T[2], but calculates the lower-level supports with a single scan

ML_T1LA

Instead of generating T[2] from T[1], the ML_T1LA algorithm computes the supports for all levels of the hierarchy in a single scan of T[1].

Pros:
● Avoids generating a new transaction table
● Limits the number of scans to the size of the largest transaction

Cons:
● Scanning T[1] means scanning all items, even infrequent ones
● Performance may suffer for databases with many infrequent itemsets
● Requires a large amount of memory (32 MB of RAM = page swapping!)

ML_TML1

Instead of using only T[2] for rule mining, the ML_TML1 algorithm generates a table for each level, using L[i,1] to filter T[i] and create T[i+1].

Pros:
● Saves significant processing time if only a small portion of the data is frequent at each level
● Allows T[i] and L[i,1] to be created in parallel

Cons:
● May not be efficient if only a small number of items is filtered out at each level

ML_T2LA

Like the base algorithm, ML_T2LA creates the T[2] table from the frequent itemsets in T[1], but it allows the L[i,k] to be computed in parallel.

Pros:
● Saves time by limiting the number of scans

Cons:
● May not be efficient if only a small number of items is filtered out at each level

Experimental Results

While the figures show that ML_T2LA is fastest most of the time, the authors preferred ML_T1LA.


Conclusions

This paper demonstrated:
● Extending association rules from single-level to multiple-level mining
● A top-down progressive deepening technique for mining multiple-level association rules
● Filtering of uninteresting association rules
● Performance optimization techniques (not covered here)

Future Work

● Develop efficient algorithms for mining multiple-level sequential patterns
● Mine cross-level associations
● Improve the interestingness measures for rules


Exam Question 1

Q. What is a major drawback of multiple-level data mining using the same minsup at all levels of a concept hierarchy?

A. Support is large at the higher levels of the hierarchy and smaller at the lower levels. To ensure that sufficiently strong association rules are generated at the lower levels, we must reduce the minsup, which in turn can generate many uninteresting rules at the higher levels. We are thus faced with the problem of determining the optimal minsup for all levels.

Exam Question 2

Q. Give an example of a multiple-level association rule.

A.
High level: 80% of people who buy cereal also buy milk.
Low level: 25% of people who buy Cheerios cereal also buy Hood 2% milk.

Exam Question 3

Q. There were three examples of hierarchy types in multiple-level rule mining. Pick one and draw an example.

A. Any of the three hierarchies from earlier in the talk:

Is-A:
Vehicle
├─ 2-Wheels: Motorcycle, Bicycle
└─ 4-Wheels: SUV, Sedan

Is-A with multiple inheritance:
Vehicle
├─ Recreational: Snowmobile, Bicycle
└─ Commuting: Bicycle, Car

Whole-Part:
Computer
├─ Hard Drive: Platter, RW Head
├─ CPU
└─ Motherboard: RAM