Chapter 5: Mining Association Rules
Arif Djunaidy
e-mail: [email protected]
URL: www.its-sby.edu/~arif



Outline

• What is association rule mining?
• The Apriori algorithm
• Iceberg queries
• Methods to improve Apriori's efficiency
• Mining frequent patterns without candidate generation
• Interestingness measurements
• Multiple-level association rule mining

What Is Association Rule Mining?

• Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications: basket data analysis, cross-marketing, catalog design, clustering, classification, etc.
• Examples:
  buys(x, "computer") ⇒ buys(x, "software") [2%, 75%]
  age(x, "mature") ^ takes(x, "DM") ⇒ grade(x, "A") [5%, 75%]

Association Rule Mining: Basic Principle

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
• Also known as market basket analysis.

Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} ⇒ {Beer}, {Milk, Bread} ⇒ {Eggs, Coke}, {Beer, Bread} ⇒ {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

• Itemset: a collection of one or more items.
  Example: {Milk, Bread, Diaper}
• k-itemset: an itemset that contains k items.
• Support count (σ): frequency of occurrence of an itemset.
  E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s): fraction of transactions that contain an itemset.
  E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Definition: Association Rule

• Association rule: an implication expression of the form X ⇒ Y, where X and Y are itemsets.
  Example: {Milk, Diaper} ⇒ {Beer}
• Rule evaluation metrics:
  Support (s): fraction of transactions that contain both X and Y.
  Confidence (c): measures how often items in Y appear in transactions that contain X.

Example: {Milk, Diaper} ⇒ {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
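
To make the two metrics concrete, here is a minimal Python sketch (an illustration, not from the slides) that computes support and confidence for {Milk, Diaper} ⇒ {Beer} over the five transactions above:

# Toy market-basket database from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """sigma(itemset): number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)              # 2/5
c = support_count(X | Y, transactions) / support_count(X, transactions) # 2/3
print(f"support = {s:.2f}, confidence = {c:.2f}")  # support = 0.40, confidence = 0.67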

Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  support ≥ minsup threshold
  confidence ≥ minconf threshold
• High confidence = strong pattern.
• High support = occurs often:
  less likely to be a random occurrence;
  larger potential benefit from acting on the rule.

Application 1 (Retail Stores)

• Real market baskets: chain stores keep terabytes of customer purchase info.
• Value?
  • how typical customers navigate stores
  • positioning tempting items
  • suggests cross-sell opportunities, e.g., hamburger sale while raising ketchup price
• High support needed, or no $$'s.

Application 2 (Information Retrieval)

• Scenario 1:
  baskets = documents
  items = words in documents
  frequent word-groups = linked concepts
• Scenario 2:
  baskets = documents containing sentences
  items = sentences
  frequent sentence-groups = possible plagiarism

Application 3 (Web Search)

• Scenario 1:
  baskets = web pages
  items = outgoing links
  pages with similar references → about the same topic
• Scenario 2:
  baskets = web pages
  items = incoming links
  pages with similar in-links → mirrors, or the same topic

Mining Association Rules

Example rules from the itemset {Milk, Diaper, Beer}:

{Milk, Diaper} ⇒ {Beer}   (s=0.4, c=0.67)
{Milk, Beer} ⇒ {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} ⇒ {Milk}   (s=0.4, c=0.67)
{Beer} ⇒ {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} ⇒ {Milk, Beer}   (s=0.4, c=0.5)
{Milk} ⇒ {Diaper, Beer}   (s=0.4, c=0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}.
• Rules originating from the same itemset have identical support but can have different confidence.
• Thus, we may decouple the support and confidence requirements.

Mining Association Rules (Cont.)

• Goal: find all association rules such that support ≥ s and confidence ≥ c.
• Reduction to the frequent-itemsets problem:
  find all frequent itemsets X;
  given X = {A1, …, Ak}, generate all rules X − Aj ⇒ Aj;
  confidence = sup(X) / sup(X − Aj); support = sup(X);
  exclude rules whose confidence is too low;
  observe that X − Aj is also frequent, so its support is already known.
• Finding all frequent itemsets is the hard part!

Association Rule Mining: A Road Map

• Boolean vs. quantitative associations (based on the types of values handled):
  buys(x, "WINDOWS 2K") ^ buys(x, "SQLServer") ⇒ buys(x, "DBMiner") [0.2%, 50%]
  age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
• Single-dimension vs. multi-dimensional associations (see the examples above)
• Single-level vs. multiple-level analysis

How Are Association Rules Mined from Large Databases?

Association rule mining is a two-step process:
1. Find all frequent itemsets: by definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.

Itemset Lattice: An Example

null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

Given m items, there are 2^m − 1 possible candidate itemsets.

Scale of Problem

• WalMart:
  sells m = 100,000 items;
  tracks n = 1,000,000,000 baskets.
• Web:
  several billion pages;
  approximately one new "word" per page.
• Exponential number of itemsets:
  m items → 2^m − 1 possible itemsets;
  cannot possibly examine all itemsets for large m;
  even itemsets of size 2 may be too many: m = 100,000 → about 5 billion item pairs.

Frequent Itemsets in SQL

• DBMSs are poorly suited to association rule mining.
• Star schema:
  Sales fact table, with transaction ID as a degenerate dimension and an Item dimension.
• Finding frequent 3-itemsets (SalesFact is an assumed name for the fact table):

SELECT Fact1.ItemID, Fact2.ItemID, Fact3.ItemID, COUNT(*)
FROM SalesFact Fact1
JOIN SalesFact Fact2
  ON Fact1.TID = Fact2.TID AND Fact1.ItemID < Fact2.ItemID
JOIN SalesFact Fact3
  ON Fact1.TID = Fact3.TID AND Fact2.ItemID < Fact3.ItemID
GROUP BY Fact1.ItemID, Fact2.ItemID, Fact3.ItemID
HAVING COUNT(*) > 1000

• Finding frequent k-itemsets requires joining k copies of the fact table.
• The joins are non-equijoins.
• Impossibly expensive!

Association Rules and Data Warehouses

• Typical procedure:
  use the data warehouse to apply filters
  • mine association rules for certain regions, dates
  export all fact rows matching the filters to a flat file
  • sort by transaction ID
  • items in the same transaction are grouped together
  perform association rule mining on the flat file.
• An alternative:
  database vendors are beginning to add specialized data mining capabilities;
  efficient algorithms for common data mining tasks are built in to the database system
  • decision trees, association rules, clustering, etc.
  not standardized yet.

Finding Frequent Pairs

• Frequent 2-sets:
  already the hard case;
  focus on pairs for now, later extend to k-sets.
• Naïve algorithm:
  counters for all m(m−1)/2 item pairs (m = number of distinct items);
  a single pass scanning all baskets;
  a basket of size b increments b(b−1)/2 counters.
• Failure? If memory < m(m−1)/2 counters:
  m = 100,000 → about 5 billion item pairs;
  the naïve algorithm is impractical for large m.
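
As an illustration (not from the slides), here is the naïve pair counter in Python; for realistic m, the `counts` dictionary is exactly the memory bottleneck described above:

from itertools import combinations
from collections import Counter

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

counts = Counter()
for basket in baskets:
    # A basket of size b increments b(b-1)/2 pair counters.
    for pair in combinations(sorted(basket), 2):
        counts[pair] += 1

print(counts[("Beer", "Diaper")])  # 3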

Pruning Candidate Itemsets

• Monotonicity principle: if an itemset is frequent, then all of its subsets must also be frequent.
• The monotonicity principle holds due to the following property of the support measure:

  ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• Contrapositive: if an itemset is infrequent, then all of its supersets must also be infrequent.

Illustrating the Monotonicity Principle

[Figure: the itemset lattice over items A–E, shown twice. In the first copy an itemset is found to be infrequent; in the second copy all of its supersets are pruned from the search space.]

Mining Frequent Itemsets: the Key Step

• The Apriori principle: any subset of a frequent itemset must be frequent.
• Find the frequent itemsets: the sets of items that have minimum support.
  A subset of a frequent itemset must also be a frequent itemset:
  i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets.
  Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
• Use the frequent itemsets to generate association rules.

The Apriori Algorithm

• Join step: Ck is generated by joining Lk-1 with itself.
• Prune step: any (k−1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
• Pseudo-code:

  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
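
A compact, runnable Python sketch of this pseudo-code (an illustration, not the author's implementation), applied to the four-transaction database used on the next slide:

from itertools import combinations

def apriori(db, min_sup):
    """Return {frozenset: support_count} for all frequent itemsets."""
    db = [set(t) for t in db]
    # L1: frequent 1-itemsets.
    items = {i for t in db for i in t}
    counts = {frozenset([i]): sum(i in t for t in db) for i in items}
    L = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(L)
    k = 2
    while L:
        # Join step: unions of (k-1)-itemsets that differ in one item.
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan the database and count the surviving candidates.
        counts = {c: sum(c <= t for t in db) for c in candidates}
        L = {s: cnt for s, cnt in counts.items() if cnt >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

db = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
for s, c in sorted(apriori(db, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), c)   # ends with [2, 3, 5] 2, matching L3 on the next slide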

The Apriori Algorithm — Example (min_sup = 2)

Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1:
itemset  sup
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1:
itemset  sup
{1}      2
{2}      3
{3}      3
{5}      3

C2 (from L1): {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}

Scan D → C2 with counts:
itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2:
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3 (from L2): {2 3 5}

Scan D → L3:
itemset  sup
{2 3 5}  2

Generating Association Rules from Frequent Itemsets

From L2: {1, 3}, with sup(1 ∪ 3) = 2:

1 → 3: conf(1 → 3) = sup(1 ∪ 3)/sup(1) = 2/2 = 100%
3 → 1: conf(3 → 1) = sup(1 ∪ 3)/sup(3) = 2/3 = 67%

Generating Association Rules from Frequent Itemsets (Cont.)

From L3: {2, 3, 5}, with sup(2 ∪ 3 ∪ 5) = 2:

2 ∪ 3 → 5: conf = sup(2∪3∪5)/sup(2∪3) = 2/2 = 100%
2 ∪ 5 → 3: conf = sup(2∪3∪5)/sup(2∪5) = 2/3 = 67%
3 ∪ 5 → 2: conf = sup(2∪3∪5)/sup(3∪5) = 2/2 = 100%
2 → 3 ∪ 5: conf = sup(2∪3∪5)/sup(2) = 2/3 = 67%
3 → 2 ∪ 5: conf = sup(2∪3∪5)/sup(3) = 2/3 = 67%
5 → 2 ∪ 3: conf = sup(2∪3∪5)/sup(5) = 2/3 = 67%
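
This enumeration mechanizes naturally. A small Python sketch (illustrative, using the support counts from the example above) that generates all rules from a frequent itemset and keeps those meeting a confidence threshold:

from itertools import combinations

def rules_from_itemset(itemset, support, min_conf):
    """Generate rules X -> itemset-X; `support` maps frozensets to counts."""
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):            # every non-empty proper subset
        for lhs in combinations(sorted(items), r):
            lhs = frozenset(lhs)
            conf = support[items] / support[lhs]   # sup(X)/sup(X - Aj)
            if conf >= min_conf:
                rules.append((set(lhs), set(items - lhs), conf))
    return rules

# Support counts taken from the Apriori example (min_sup = 2).
support = {
    frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}
for lhs, rhs, conf in rules_from_itemset({2, 3, 5}, support, min_conf=0.7):
    print(f"{sorted(lhs)} -> {sorted(rhs)}  conf={conf:.0%}")
# Prints {2, 3} -> {5} and {3, 5} -> {2}, both at 100%.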

How to Generate Candidates?

Suppose the items in Lk-1 are listed in lexicographic order.

Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck

Example of Generating Candidates

• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
  abcd from abc and abd
  acde from acd and ace
• Pruning:
  acde is removed because ade is not in L3
• C4 = {abcd}
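
A sketch of the self-join and prune steps in Python (illustrative; plain lexicographic item order is assumed), reproducing the L3 → C4 example above:

from itertools import combinations

def gen_candidates(L_prev, k):
    """Self-join L_{k-1} on its first k-2 items, then prune by the Apriori property."""
    L_prev = {tuple(sorted(s)) for s in L_prev}
    joined = {p[:-1] + (p[-1], q[-1])            # p and q agree on the first k-2
              for p in L_prev for q in L_prev    # items; p's last item < q's last
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in L_prev for s in combinations(c, k - 1))}

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(gen_candidates(L3, 4))  # {('a','b','c','d')}; acde is pruned (ade not in L3)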

Iceberg Queries

• Iceberg query: compute aggregates over one attribute or a set of attributes, only for those groups whose aggregate value is above a certain threshold.
• Example:

  select P.custID, P.itemID, sum(P.qty)
  from Purchases P
  group by P.custID, P.itemID
  having sum(P.qty) >= 10

• Compute iceberg queries efficiently using the Apriori idea:
  first compute the lower dimensions;
  then compute higher dimensions only when all the lower ones are above the threshold.

Iceberg Queries (Cont.)

• Generate cust_list, a list of customers who bought three or more items in total, for example:

  select P.cust_ID
  from Purchases P
  group by P.cust_ID
  having SUM(P.qty) >= 3

• Generate item_list, a list of items that were purchased by any customer in quantities of three or more, for example:

  select P.item_ID
  from Purchases P
  group by P.item_ID
  having SUM(P.qty) >= 3

• Only customer-item pairs whose customer is in cust_list and whose item is in item_list can satisfy the original query, so only those candidate pairs need to be counted, as the sketch below shows.
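
A minimal Python sketch of this Apriori-style evaluation (illustrative: the rows are made-up sample data, the table is assumed to fit in memory, and one threshold is used for both phases for simplicity):

from collections import Counter

purchases = [  # (custID, itemID, qty) -- made-up sample rows
    ("c1", "i1", 2), ("c1", "i2", 2), ("c2", "i1", 1),
    ("c2", "i1", 3), ("c3", "i2", 4), ("c1", "i1", 8),
]
THRESHOLD = 10

# Lower dimensions first: totals per customer and per item.
by_cust, by_item = Counter(), Counter()
for cust, item, qty in purchases:
    by_cust[cust] += qty
    by_item[item] += qty
cust_list = {c for c, q in by_cust.items() if q >= THRESHOLD}
item_list = {i for i, q in by_item.items() if q >= THRESHOLD}

# Higher dimension: count only the surviving candidate (cust, item) pairs.
pair_qty = Counter()
for cust, item, qty in purchases:
    if cust in cust_list and item in item_list:
        pair_qty[(cust, item)] += qty
print({p: q for p, q in pair_qty.items() if q >= THRESHOLD})  # {('c1', 'i1'): 10}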

Is Apriori Fast Enough? — Performance Bottlenecks

• The core of the Apriori algorithm:
  use frequent (k−1)-itemsets to generate candidate frequent k-itemsets;
  use database scans and pattern matching to collect counts for the candidate itemsets.
• The bottleneck of Apriori: candidate generation.
  Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets;
  • to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
  Multiple scans of the database:
  • needs (n + 1) scans, where n is the length of the longest pattern.

Methods to Improve Apriori's Efficiency

• Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans, because it cannot contain any frequent (k+1)-itemset. Therefore, such a transaction can be removed from further consideration.
• Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.

Partitioning

Phase I:
  Transactions in D
  → divide D into n partitions
  → find the frequent itemsets local to each partition (1 scan)

Phase II:
  → combine all local frequent itemsets to form candidate itemsets
  → find the global frequent itemsets among the candidates (1 scan)
  → frequent itemsets in D

Scan-Once Algorithm (support count = 3)

Table: Boolean relational database D

               a  b  c  d  e
Transaction 1  1  1  0  1  1
Transaction 2  0  1  1  0  1
Transaction 3  1  1  0  1  1
Transaction 4  1  1  1  0  1
Transaction 5  1  1  1  1  1
Transaction 6  0  1  1  1  0

Scan-Once Algorithm (Cont.)

Figure: a complete itemset tree for the five items a, b, c, d, and e in the database shown in the table.

Level 0 (C(5,1)): a b c d e
Level 1 (C(5,2)): ab ac ad ae bc bd be cd ce de
Level 2 (C(5,3)): abc abd abe acd ace ade bcd bce bde cde
Level 3 (C(5,4)): abcd abce abde acde bcde
Level 4 (C(5,5)): abcde

Support Count

[Table: per-transaction support-count bookkeeping for the 31 candidate itemsets (a through abcde) as the itemset tree is updated item by item across transactions T1–T6, together with each itemset's final support count. The extracted layout is too garbled to reproduce faithfully.]

Mining Frequent Patterns Without Candidate Generation

• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:
  highly condensed, but complete for frequent pattern mining;
  avoids costly database scans.
• Develop an efficient, FP-tree-based frequent pattern mining method:
  a divide-and-conquer methodology: decompose mining tasks into smaller ones;
  avoid candidate generation: sub-database tests only!

Construct FP-tree from a Transaction DB

min_support = 0.5

TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Steps:
1. Scan the DB once, find frequent 1-itemsets (single-item patterns).
2. Order frequent items in frequency-descending order.
3. Scan the DB again, construct the FP-tree.

Header table: f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree (node:count, children indented):

{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
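
A minimal Python sketch of steps 1–3 (illustrative: no node-links or header-table plumbing, and ties among equal-frequency items are broken arbitrarily, so siblings may be ordered differently from the figure while the tree remains equivalent for mining):

from collections import Counter

class Node:
    def __init__(self, item, count=0):
        self.item, self.count, self.children = item, count, {}

def build_fptree(db, min_count):
    # Pass 1: frequent items, sorted by descending frequency.
    freq = {i: c for i, c in Counter(i for t in db for i in t).items()
            if c >= min_count}
    order = {i: r for r, i in enumerate(sorted(freq, key=freq.get, reverse=True))}
    # Pass 2: insert each transaction's ordered frequent items as a path.
    root = Node(None)
    for t in db:
        node = root
        for item in sorted((i for i in t if i in freq), key=order.get):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]  # one char per item
show(build_fptree([set(t) for t in db], min_count=3))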

Benefits of the FP-tree Structure

• Completeness: preserves complete information for frequent pattern mining.
• Compactness:
  reduces irrelevant information (infrequent items are gone);
  frequency-descending ordering: more frequent items are more likely to be shared;
  never larger than the original database (not counting node-links and counts).

Mining Frequent Patterns Using the FP-tree

• General idea (divide-and-conquer): recursively grow frequent pattern paths using the FP-tree.
• Method:
  for each item, construct its conditional pattern base, and then its conditional FP-tree;
  repeat the process on each newly created conditional FP-tree;
  stop when the resulting FP-tree is empty, or contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern).

Major Steps to Mine an FP-tree

1) Construct the conditional pattern base for each node in the FP-tree.
2) Construct the conditional FP-tree from each conditional pattern base.
3) Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far.

Step 1: From FP-tree to Conditional Pattern Base

• Start at the frequent-item header table of the FP-tree.
• Traverse the FP-tree by following the link of each frequent item.
• Accumulate all transformed prefix paths of that item to form its conditional pattern base.

Conditional pattern bases (from the FP-tree built earlier):

item  conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

Step 2: Construct the Conditional FP-tree

• For each pattern base:
  accumulate the count for each item in the base;
  construct the FP-tree for the frequent items of the pattern base.

Example (m):
  m-conditional pattern base: fca:2, fcab:1
  m-conditional FP-tree: {} → f:3 → c:3 → a:3
  All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

Mining Frequent Patterns by Creating Conditional Pattern Bases

Item  Conditional pattern base    Conditional FP-tree
f     Empty                       Empty
c     {(f:3)}                     {(f:3)} | c
a     {(fc:3)}                    {(f:3, c:3)} | a
b     {(fca:1), (f:1), (c:1)}     Empty
m     {(fca:2), (fcab:1)}         {(f:3, c:3, a:3)} | m
p     {(fcam:2), (cb:1)}          {(c:3)} | p
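
The recursion over conditional pattern bases can be sketched without the tree plumbing by representing each (conditional) database as a list of (prefix-path, count) pairs. This illustrative Python version mirrors FP-growth's divide-and-conquer but is not the FP-tree-based implementation itself:

from collections import Counter

def pattern_growth(patbase, min_count, suffix=()):
    """patbase: list of (items_tuple, count). Yields (frequent_itemset, count)."""
    # Count items in this conditional database.
    counts = Counter()
    for items, cnt in patbase:
        for i in items:
            counts[i] += cnt
    for item, cnt in counts.items():
        if cnt < min_count:
            continue
        new_suffix = (item,) + suffix
        yield new_suffix, cnt
        # Conditional pattern base of `item`: prefixes of paths containing it.
        cond = [(items[:items.index(item)], c)
                for items, c in patbase if item in items]
        yield from pattern_growth(cond, min_count, new_suffix)

# The transactions reduced to ordered frequent items (from the FP-tree slide).
db = [("f","c","a","m","p"), ("f","c","a","b","m"), ("f","b"),
      ("c","b","p"), ("f","c","a","m","p")]
for itemset, cnt in pattern_growth([(t, 1) for t in db], min_count=3):
    print(itemset, cnt)   # e.g. ('f', 'c', 'a', 'm') 3 appears among the output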

Single FP-tree Path Generation

• Suppose an FP-tree T has a single path P.
• The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.

Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so all frequent patterns concerning m are: m, fm, cm, am, fcm, fam, cam, fcam.
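
The enumeration is just the power set of the path's items, each combined with the suffix; a short illustration in Python:

from itertools import combinations

path, suffix = ["f", "c", "a"], "m"
patterns = ["".join(c) + suffix
            for r in range(len(path) + 1)
            for c in combinations(path, r)]
print(patterns)  # ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']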

Principles of Frequent Pattern Growth

• Pattern growth property: let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB if and only if β is frequent in B.
• Example: "abcdef" is a frequent pattern if and only if "abcde" is a frequent pattern and "f" is frequent in the set of transactions containing "abcde".

Why Is Frequent Pattern Growth Fast?

• Our performance study shows that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection.
• Reasoning:
  no candidate generation, no candidate tests;
  uses a compact data structure;
  eliminates repeated database scans;
  the basic operations are counting and FP-tree building.

Interestingness Measurements

• Objective measures: two popular measurements are support and confidence.
• Subjective measures (Silberschatz & Tuzhilin, KDD'95): a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it).

Criticism of Support and Confidence

Example 1 (Aggarwal & Yu, PODS'98): among 5000 students,
  3000 play basketball,
  3750 eat cereal,
  2000 both play basketball and eat cereal.

• play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
• play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

            basketball  not basketball  sum(row)
cereal      2000        1750            3750
not cereal  1000        250             1250
sum(col.)   3000        2000            5000

Criticism of Support and Confidence (Cont.)

Example 2:
• X and Y are positively correlated; X and Z are negatively correlated.
• Yet the support and confidence of X ⇒ Z dominate.
• We need a measure of dependent or correlated events:

  corr(A,B) = P(A ∪ B) / (P(A) P(B))

• P(B|A)/P(B) is also called the lift of the rule A ⇒ B.

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Rule   Support  Confidence
X ⇒ Y  25%      50%
X ⇒ Z  37.50%   75%

Other Interestingness Measures: Interest

• Interest (correlation, lift):

  interest(A,B) = P(A ∪ B) / (P(A) P(B))

  takes both P(A) and P(B) into consideration;
  P(A ∪ B) = P(A) P(B) if A and B are independent events;
  A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated.

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Itemset  Support  Interest
X,Y      25%      2
X,Z      37.50%   0.9
Y,Z      12.50%   0.57
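
A quick check of these interest values in Python (illustrative; note the X,Z value computes to about 0.86, which the slide rounds to 0.9):

X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def interest(a, b):
    """lift = P(A and B) / (P(A) * P(B)) over the 8 rows."""
    n = len(a)
    p_ab = sum(x and y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    return p_ab / (p_a * p_b)

for name, (a, b) in {"X,Y": (X, Y), "X,Z": (X, Z), "Y,Z": (Y, Z)}.items():
    print(name, round(interest(a, b), 2))   # X,Y 2.0; X,Z 0.86; Y,Z 0.57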

Multiple-Level Association Rules

• Items often form a hierarchy.
• Items at the lower levels are expected to have lower support.
• Rules regarding itemsets at the appropriate levels could be quite useful.
• The transaction database can be encoded based on dimensions and levels.
• We can explore shared multi-level mining.

[Figure: concept hierarchy. All → {Computer, Printer}; Computer → {Desktop, Laptop}; Desktop → {IBM, Compaq}; Printer → {Color, B/W}; brands such as Sony and HP appear at the leaf level.]

TID  Items
T1   {111, 121, 211, 221}
T2   {111, 211, 222, 323}
T3   {112, 122, 221, 411}
T4   {111, 121}
T5   {111, 122, 211, 221, 413}

Mining Multi-Level Associations

• A top-down, progressive deepening approach:
  first find high-level strong rules, e.g.:
    computer ⇒ printer [20%, 60%];
  then find their lower-level "weaker" rules, e.g.:
    desktop ⇒ printer [6%, 50%].
• Variations in mining multiple-level association rules:
  level-crossing association rules:
    desktop ⇒ Sony color printer;
  association rules with multiple, alternative hierarchies:
    desktop ⇒ color printer.

Uniform Support

Multi-level mining with uniform support:

Level 1 (min_sup = 5%): Computer [support = 10%]
Level 2 (min_sup = 5%): Desktop [support = 6%], Laptop [support = 4%]

Reduced Support

Multi-level mining with reduced support:

Level 1 (min_sup = 5%): Computer [support = 10%]
Level 2 (min_sup = 3%): Desktop [support = 6%], Laptop [support = 4%]

Multi-Dimensional Association: Concepts

• Single-dimensional rules:
  buys(X, "milk") ⇒ buys(X, "bread")
• Multi-dimensional rules:
  inter-dimension association rules (no repeated predicates):
    age(X, "19-25") ^ occupation(X, "student") ⇒ buys(X, "coke")
  hybrid-dimension association rules (repeated predicates):
    age(X, "19-25") ^ buys(X, "popcorn") ⇒ buys(X, "coke")

Summary

• Association rule mining is probably the most significant contribution from the database community to KDD.
• A large number of papers have been published, and many interesting issues have been explored.
• An interesting research direction: association analysis in other types of data, such as spatial data, multimedia data, time series data, etc.

End of Chapter 5