Chapter 5: Mining Association Rules
Arif Djunaidy
e-mail: [email protected]
URL: www.its-sby.edu/~arif



Outline

• What is association rule mining?
• The Apriori algorithm
• Iceberg queries
• Methods to improve Apriori's efficiency
• Mining frequent patterns without candidate generation
• Interestingness measurements
• Multiple-level association rule mining

What Is Association Rule Mining?

• Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications: basket data analysis, cross-marketing, catalog design, clustering, classification, etc.
• Examples:
  buys(x, "computer") ⇒ buys(x, "software") [2%, 75%]
  age(x, "mature") ^ takes(x, "DM") ⇒ grade(x, "A") [5%, 75%]

Association Rule Mining: Basic Principle

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
• Also known as market basket analysis.

Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} ⇒ {Beer}, {Milk, Bread} ⇒ {Eggs, Coke}, {Beer, Bread} ⇒ {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

• Itemset: a collection of one or more items.
  Example: {Milk, Bread, Diaper}
• k-itemset: an itemset that contains k items.
• Support count (σ): frequency of occurrence of an itemset.
  E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s): fraction of transactions that contain an itemset.
  E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Definition: Association Rule

• Association rule: an implication expression of the form X ⇒ Y, where X and Y are itemsets.
  Example: {Milk, Diaper} ⇒ {Beer}
• Rule evaluation metrics:
  Support (s): fraction of transactions that contain both X and Y.
  Confidence (c): measures how often items in Y appear in transactions that contain X.

Example: {Milk, Diaper} ⇒ {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
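
To make the two metrics concrete, here is a minimal Python sketch (an illustration, not from the slides) that computes support and confidence for {Milk, Diaper} ⇒ {Beer} over the five transactions above:

# Toy market-basket database from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """sigma(itemset): number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)              # 2/5
c = support_count(X | Y, transactions) / support_count(X, transactions) # 2/3
print(f"support = {s:.2f}, confidence = {c:.2f}")  # support = 0.40, confidence = 0.67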

Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  support ≥ minsup threshold
  confidence ≥ minconf threshold
• High confidence = strong pattern.
• High support = occurs often:
  less likely to be a random occurrence;
  larger potential benefit from acting on the rule.

Application 1 (Retail Stores)

• Real market baskets: chain stores keep terabytes of customer purchase info.
• Value?
  • how typical customers navigate stores
  • positioning tempting items
  • suggests cross-sell opportunities, e.g., hamburger sale while raising ketchup price
• High support needed, or no $$'s.

Application 2 (Information Retrieval)

• Scenario 1:
  baskets = documents
  items = words in documents
  frequent word-groups = linked concepts
• Scenario 2:
  baskets = documents containing sentences
  items = sentences
  frequent sentence-groups = possible plagiarism

Application 3 (Web Search)

• Scenario 1:
  baskets = web pages
  items = outgoing links
  pages with similar references → about the same topic
• Scenario 2:
  baskets = web pages
  items = incoming links
  pages with similar in-links → mirrors, or the same topic

Mining Association Rules

Example rules from the itemset {Milk, Diaper, Beer}:

{Milk, Diaper} ⇒ {Beer}   (s=0.4, c=0.67)
{Milk, Beer} ⇒ {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} ⇒ {Milk}   (s=0.4, c=0.67)
{Beer} ⇒ {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} ⇒ {Milk, Beer}   (s=0.4, c=0.5)
{Milk} ⇒ {Diaper, Beer}   (s=0.4, c=0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}.
• Rules originating from the same itemset have identical support but can have different confidence.
• Thus, we may decouple the support and confidence requirements.

Mining Association Rules (Cont.)

• Goal: find all association rules such that support ≥ s and confidence ≥ c.
• Reduction to the frequent-itemsets problem:
  find all frequent itemsets X;
  given X = {A1, …, Ak}, generate all rules X − Aj ⇒ Aj;
  confidence = sup(X) / sup(X − Aj); support = sup(X);
  exclude rules whose confidence is too low;
  observe that X − Aj is also frequent, so its support is already known.
• Finding all frequent itemsets is the hard part!

Association Rule Mining: A Road Map

• Boolean vs. quantitative associations (based on the types of values handled):
  buys(x, "WINDOWS 2K") ^ buys(x, "SQLServer") ⇒ buys(x, "DBMiner") [0.2%, 50%]
  age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
• Single-dimension vs. multi-dimensional associations (see the examples above)
• Single-level vs. multiple-level analysis

How Are Association Rules Mined from Large Databases?

Association rule mining is a two-step process:
1. Find all frequent itemsets: by definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.

Itemset Lattice: An Example

null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

Given m items, there are 2^m − 1 possible candidate itemsets.

Scale of Problem

• WalMart:
  sells m = 100,000 items;
  tracks n = 1,000,000,000 baskets.
• Web:
  several billion pages;
  approximately one new "word" per page.
• Exponential number of itemsets:
  m items → 2^m − 1 possible itemsets;
  cannot possibly examine all itemsets for large m;
  even itemsets of size 2 may be too many: m = 100,000 → about 5 billion item pairs.

Frequent Itemsets in SQL

• DBMSs are poorly suited to association rule mining.
• Star schema:
  Sales fact table, with transaction ID as a degenerate dimension and an Item dimension.
• Finding frequent 3-itemsets (SalesFact is an assumed name for the fact table):

SELECT Fact1.ItemID, Fact2.ItemID, Fact3.ItemID, COUNT(*)
FROM SalesFact Fact1
JOIN SalesFact Fact2
  ON Fact1.TID = Fact2.TID AND Fact1.ItemID < Fact2.ItemID
JOIN SalesFact Fact3
  ON Fact1.TID = Fact3.TID AND Fact2.ItemID < Fact3.ItemID
GROUP BY Fact1.ItemID, Fact2.ItemID, Fact3.ItemID
HAVING COUNT(*) > 1000

• Finding frequent k-itemsets requires joining k copies of the fact table.
• The joins are non-equijoins.
• Impossibly expensive!

Association Rules and Data Warehouses

• Typical procedure:
  use the data warehouse to apply filters
  • mine association rules for certain regions, dates
  export all fact rows matching the filters to a flat file
  • sort by transaction ID
  • items in the same transaction are grouped together
  perform association rule mining on the flat file.
• An alternative:
  database vendors are beginning to add specialized data mining capabilities;
  efficient algorithms for common data mining tasks are built in to the database system
  • decision trees, association rules, clustering, etc.
  not standardized yet.

Finding Frequent Pairs

• Frequent 2-sets:
  already the hard case;
  focus on pairs for now, later extend to k-sets.
• Naïve algorithm:
  counters for all m(m−1)/2 item pairs (m = number of distinct items);
  a single pass scanning all baskets;
  a basket of size b increments b(b−1)/2 counters.
• Failure? If memory < m(m−1)/2 counters:
  m = 100,000 → about 5 billion item pairs;
  the naïve algorithm is impractical for large m.
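
As an illustration (not from the slides), here is the naïve pair counter in Python; for realistic m, the `counts` dictionary is exactly the memory bottleneck described above:

from itertools import combinations
from collections import Counter

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

counts = Counter()
for basket in baskets:
    # A basket of size b increments b(b-1)/2 pair counters.
    for pair in combinations(sorted(basket), 2):
        counts[pair] += 1

print(counts[("Beer", "Diaper")])  # 3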

Pruning Candidate Itemsets

• Monotonicity principle: if an itemset is frequent, then all of its subsets must also be frequent.
• The monotonicity principle holds due to the following property of the support measure:

  ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• Contrapositive: if an itemset is infrequent, then all of its supersets must also be infrequent.

Illustrating the Monotonicity Principle

[Figure: the itemset lattice over items A–E, shown twice. In the first copy an itemset is found to be infrequent; in the second copy all of its supersets are pruned from the search space.]

Mining Frequent Itemsets: the Key Step

• The Apriori principle: any subset of a frequent itemset must be frequent.
• Find the frequent itemsets: the sets of items that have minimum support.
  A subset of a frequent itemset must also be a frequent itemset:
  i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets.
  Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
• Use the frequent itemsets to generate association rules.

The Apriori Algorithm

• Join step: Ck is generated by joining Lk-1 with itself.
• Prune step: any (k−1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
• Pseudo-code:

  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
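
A compact, runnable Python sketch of this pseudo-code (an illustration, not the author's implementation), applied to the four-transaction database used on the next slide:

from itertools import combinations

def apriori(db, min_sup):
    """Return {frozenset: support_count} for all frequent itemsets."""
    db = [set(t) for t in db]
    # L1: frequent 1-itemsets.
    items = {i for t in db for i in t}
    counts = {frozenset([i]): sum(i in t for t in db) for i in items}
    L = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(L)
    k = 2
    while L:
        # Join step: unions of (k-1)-itemsets that differ in one item.
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan the database and count the surviving candidates.
        counts = {c: sum(c <= t for t in db) for c in candidates}
        L = {s: cnt for s, cnt in counts.items() if cnt >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

db = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
for s, c in sorted(apriori(db, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), c)   # ends with [2, 3, 5] 2, matching L3 on the next slide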

The Apriori Algorithm — Example (min_sup = 2)

Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1:
itemset  sup
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1:
itemset  sup
{1}      2
{2}      3
{3}      3
{5}      3

C2 (from L1): {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}

Scan D → C2 with counts:
itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2:
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3 (from L2): {2 3 5}

Scan D → L3:
itemset  sup
{2 3 5}  2

Generating Association Rules from Frequent Itemsets

From L2: {1, 3}, with sup(1 ∪ 3) = 2:

1 → 3: conf(1 → 3) = sup(1 ∪ 3)/sup(1) = 2/2 = 100%
3 → 1: conf(3 → 1) = sup(1 ∪ 3)/sup(3) = 2/3 = 67%

Generating Association Rules from Frequent Itemsets (Cont.)

From L3: {2, 3, 5}, with sup(2 ∪ 3 ∪ 5) = 2:

2 ∪ 3 → 5: conf = sup(2∪3∪5)/sup(2∪3) = 2/2 = 100%
2 ∪ 5 → 3: conf = sup(2∪3∪5)/sup(2∪5) = 2/3 = 67%
3 ∪ 5 → 2: conf = sup(2∪3∪5)/sup(3∪5) = 2/2 = 100%
2 → 3 ∪ 5: conf = sup(2∪3∪5)/sup(2) = 2/3 = 67%
3 → 2 ∪ 5: conf = sup(2∪3∪5)/sup(3) = 2/3 = 67%
5 → 2 ∪ 3: conf = sup(2∪3∪5)/sup(5) = 2/3 = 67%
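
This enumeration mechanizes naturally. A small Python sketch (illustrative, using the support counts from the example above) that generates all rules from a frequent itemset and keeps those meeting a confidence threshold:

from itertools import combinations

def rules_from_itemset(itemset, support, min_conf):
    """Generate rules X -> itemset-X; `support` maps frozensets to counts."""
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):            # every non-empty proper subset
        for lhs in combinations(sorted(items), r):
            lhs = frozenset(lhs)
            conf = support[items] / support[lhs]   # sup(X)/sup(X - Aj)
            if conf >= min_conf:
                rules.append((set(lhs), set(items - lhs), conf))
    return rules

# Support counts taken from the Apriori example (min_sup = 2).
support = {
    frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}
for lhs, rhs, conf in rules_from_itemset({2, 3, 5}, support, min_conf=0.7):
    print(f"{sorted(lhs)} -> {sorted(rhs)}  conf={conf:.0%}")
# Prints {2, 3} -> {5} and {3, 5} -> {2}, both at 100%.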

How to Generate Candidates?

Suppose the items in Lk-1 are listed in lexicographic order.

Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck

Example of Generating Candidates

• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
  abcd from abc and abd
  acde from acd and ace
• Pruning:
  acde is removed because ade is not in L3
• C4 = {abcd}
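
A sketch of the self-join and prune steps in Python (illustrative; plain lexicographic item order is assumed), reproducing the L3 → C4 example above:

from itertools import combinations

def gen_candidates(L_prev, k):
    """Self-join L_{k-1} on its first k-2 items, then prune by the Apriori property."""
    L_prev = {tuple(sorted(s)) for s in L_prev}
    joined = {p[:-1] + (p[-1], q[-1])            # p and q agree on the first k-2
              for p in L_prev for q in L_prev    # items; p's last item < q's last
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in L_prev for s in combinations(c, k - 1))}

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(gen_candidates(L3, 4))  # {('a','b','c','d')}; acde is pruned (ade not in L3)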

Iceberg Queries

• Iceberg query: compute aggregates over one attribute or a set of attributes, only for those groups whose aggregate value is above a certain threshold.
• Example:

  select P.custID, P.itemID, sum(P.qty)
  from Purchases P
  group by P.custID, P.itemID
  having sum(P.qty) >= 10

• Compute iceberg queries efficiently using the Apriori idea:
  first compute the lower dimensions;
  then compute higher dimensions only when all the lower ones are above the threshold.

Iceberg Queries (Cont.)

• Generate cust_list, a list of customers who bought three or more items in total, for example:

  select P.cust_ID
  from Purchases P
  group by P.cust_ID
  having SUM(P.qty) >= 3

• Generate item_list, a list of items that were purchased by any customer in quantities of three or more, for example:

  select P.item_ID
  from Purchases P
  group by P.item_ID
  having SUM(P.qty) >= 3

• Only customer-item pairs whose customer is in cust_list and whose item is in item_list can satisfy the original query, so only those candidate pairs need to be counted, as the sketch below shows.
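
A minimal Python sketch of this Apriori-style evaluation (illustrative: the rows are made-up sample data, the table is assumed to fit in memory, and one threshold is used for both phases for simplicity):

from collections import Counter

purchases = [  # (custID, itemID, qty) -- made-up sample rows
    ("c1", "i1", 2), ("c1", "i2", 2), ("c2", "i1", 1),
    ("c2", "i1", 3), ("c3", "i2", 4), ("c1", "i1", 8),
]
THRESHOLD = 10

# Lower dimensions first: totals per customer and per item.
by_cust, by_item = Counter(), Counter()
for cust, item, qty in purchases:
    by_cust[cust] += qty
    by_item[item] += qty
cust_list = {c for c, q in by_cust.items() if q >= THRESHOLD}
item_list = {i for i, q in by_item.items() if q >= THRESHOLD}

# Higher dimension: count only the surviving candidate (cust, item) pairs.
pair_qty = Counter()
for cust, item, qty in purchases:
    if cust in cust_list and item in item_list:
        pair_qty[(cust, item)] += qty
print({p: q for p, q in pair_qty.items() if q >= THRESHOLD})  # {('c1', 'i1'): 10}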

Is Apriori Fast Enough? — Performance Bottlenecks

• The core of the Apriori algorithm:
  use frequent (k−1)-itemsets to generate candidate frequent k-itemsets;
  use database scans and pattern matching to collect counts for the candidate itemsets.
• The bottleneck of Apriori: candidate generation.
  Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets;
  • to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
  Multiple scans of the database:
  • needs (n + 1) scans, where n is the length of the longest pattern.

Methods to Improve Apriori's Efficiency

• Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans, because it cannot contain any frequent (k+1)-itemset. Therefore, such a transaction can be removed from further consideration.
• Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.

Partitioning

Phase I:
  Transactions in D
  → divide D into n partitions
  → find the frequent itemsets local to each partition (1 scan)

Phase II:
  → combine all local frequent itemsets to form candidate itemsets
  → find the global frequent itemsets among the candidates (1 scan)
  → frequent itemsets in D

Scan-Once Algorithm (support count = 3)

Table: Boolean relational database D

               a  b  c  d  e
Transaction 1  1  1  0  1  1
Transaction 2  0  1  1  0  1
Transaction 3  1  1  0  1  1
Transaction 4  1  1  1  0  1
Transaction 5  1  1  1  1  1
Transaction 6  0  1  1  1  0

Scan-Once Algorithm (Cont.)

Figure: a complete itemset tree for the five items a, b, c, d, and e in the database shown in the table.

Level 0 (C(5,1)): a b c d e
Level 1 (C(5,2)): ab ac ad ae bc bd be cd ce de
Level 2 (C(5,3)): abc abd abe acd ace ade bcd bce bde cde
Level 3 (C(5,4)): abcd abce abde acde bcde
Level 4 (C(5,5)): abcde

Support Count

[Table: per-transaction support-count bookkeeping for the 31 candidate itemsets (a through abcde) as the itemset tree is updated item by item across transactions T1–T6, together with each itemset's final support count. The extracted layout is too garbled to reproduce faithfully.]

Mining Frequent Patterns Without Candidate Generation

• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:
  highly condensed, but complete for frequent pattern mining;
  avoids costly database scans.
• Develop an efficient, FP-tree-based frequent pattern mining method:
  a divide-and-conquer methodology: decompose mining tasks into smaller ones;
  avoid candidate generation: sub-database tests only!

Construct FP-tree from a Transaction DB

min_support = 0.5

TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Steps:
1. Scan the DB once, find frequent 1-itemsets (single-item patterns).
2. Order frequent items in frequency-descending order.
3. Scan the DB again, construct the FP-tree.

Header table: f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree (node:count, children indented):

{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
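
A minimal Python sketch of steps 1–3 (illustrative: no node-links or header-table plumbing, and ties among equal-frequency items are broken arbitrarily, so siblings may be ordered differently from the figure while the tree remains equivalent for mining):

from collections import Counter

class Node:
    def __init__(self, item, count=0):
        self.item, self.count, self.children = item, count, {}

def build_fptree(db, min_count):
    # Pass 1: frequent items, sorted by descending frequency.
    freq = {i: c for i, c in Counter(i for t in db for i in t).items()
            if c >= min_count}
    order = {i: r for r, i in enumerate(sorted(freq, key=freq.get, reverse=True))}
    # Pass 2: insert each transaction's ordered frequent items as a path.
    root = Node(None)
    for t in db:
        node = root
        for item in sorted((i for i in t if i in freq), key=order.get):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]  # one char per item
show(build_fptree([set(t) for t in db], min_count=3))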

Benefits of the FP-tree Structure

• Completeness: preserves complete information for frequent pattern mining.
• Compactness:
  reduces irrelevant information (infrequent items are gone);
  frequency-descending ordering: more frequent items are more likely to be shared;
  never larger than the original database (not counting node-links and counts).

Mining Frequent Patterns Using the FP-tree

• General idea (divide-and-conquer): recursively grow frequent pattern paths using the FP-tree.
• Method:
  for each item, construct its conditional pattern base, and then its conditional FP-tree;
  repeat the process on each newly created conditional FP-tree;
  stop when the resulting FP-tree is empty, or contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern).

Major Steps to Mine an FP-tree

1) Construct the conditional pattern base for each node in the FP-tree.
2) Construct the conditional FP-tree from each conditional pattern base.
3) Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far.

Step 1: From FP-tree to Conditional Pattern Base

• Start at the frequent-item header table of the FP-tree.
• Traverse the FP-tree by following the link of each frequent item.
• Accumulate all transformed prefix paths of that item to form its conditional pattern base.

Conditional pattern bases (from the FP-tree built earlier):

item  conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

Step 2: Construct the Conditional FP-tree

• For each pattern base:
  accumulate the count for each item in the base;
  construct the FP-tree for the frequent items of the pattern base.

Example (m):
  m-conditional pattern base: fca:2, fcab:1
  m-conditional FP-tree: {} → f:3 → c:3 → a:3
  All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

Mining Frequent Patterns by Creating Conditional Pattern Bases

Item  Conditional pattern base    Conditional FP-tree
f     Empty                       Empty
c     {(f:3)}                     {(f:3)} | c
a     {(fc:3)}                    {(f:3, c:3)} | a
b     {(fca:1), (f:1), (c:1)}     Empty
m     {(fca:2), (fcab:1)}         {(f:3, c:3, a:3)} | m
p     {(fcam:2), (cb:1)}          {(c:3)} | p
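
The recursion over conditional pattern bases can be sketched without the tree plumbing by representing each (conditional) database as a list of (prefix-path, count) pairs. This illustrative Python version mirrors FP-growth's divide-and-conquer but is not the FP-tree-based implementation itself:

from collections import Counter

def pattern_growth(patbase, min_count, suffix=()):
    """patbase: list of (items_tuple, count). Yields (frequent_itemset, count)."""
    # Count items in this conditional database.
    counts = Counter()
    for items, cnt in patbase:
        for i in items:
            counts[i] += cnt
    for item, cnt in counts.items():
        if cnt < min_count:
            continue
        new_suffix = (item,) + suffix
        yield new_suffix, cnt
        # Conditional pattern base of `item`: prefixes of paths containing it.
        cond = [(items[:items.index(item)], c)
                for items, c in patbase if item in items]
        yield from pattern_growth(cond, min_count, new_suffix)

# The transactions reduced to ordered frequent items (from the FP-tree slide).
db = [("f","c","a","m","p"), ("f","c","a","b","m"), ("f","b"),
      ("c","b","p"), ("f","c","a","m","p")]
for itemset, cnt in pattern_growth([(t, 1) for t in db], min_count=3):
    print(itemset, cnt)   # e.g. ('f', 'c', 'a', 'm') 3 appears among the output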

Single FP-tree Path Generation

• Suppose an FP-tree T has a single path P.
• The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.

Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so all frequent patterns concerning m are: m, fm, cm, am, fcm, fam, cam, fcam.
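
The enumeration is just the power set of the path's items, each combined with the suffix; a short illustration in Python:

from itertools import combinations

path, suffix = ["f", "c", "a"], "m"
patterns = ["".join(c) + suffix
            for r in range(len(path) + 1)
            for c in combinations(path, r)]
print(patterns)  # ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']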

Principles of Frequent Pattern Growth

• Pattern growth property: let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB if and only if β is frequent in B.
• Example: "abcdef" is a frequent pattern if and only if "abcde" is a frequent pattern and "f" is frequent in the set of transactions containing "abcde".

Why Is Frequent Pattern Growth Fast?

• Our performance study shows that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection.
• Reasoning:
  no candidate generation, no candidate tests;
  uses a compact data structure;
  eliminates repeated database scans;
  the basic operations are counting and FP-tree building.

Interestingness Measurements

• Objective measures: two popular measurements are support and confidence.
• Subjective measures (Silberschatz & Tuzhilin, KDD'95): a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it).

Criticism of Support and Confidence

Example 1 (Aggarwal & Yu, PODS'98): among 5000 students,
  3000 play basketball,
  3750 eat cereal,
  2000 both play basketball and eat cereal.

• play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
• play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

            basketball  not basketball  sum(row)
cereal      2000        1750            3750
not cereal  1000        250             1250
sum(col.)   3000        2000            5000

Criticism of Support and Confidence (Cont.)

Example 2:
• X and Y are positively correlated; X and Z are negatively correlated.
• Yet the support and confidence of X ⇒ Z dominate.
• We need a measure of dependent or correlated events:

  corr(A,B) = P(A ∪ B) / (P(A) P(B))

• P(B|A)/P(B) is also called the lift of the rule A ⇒ B.

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Rule   Support  Confidence
X ⇒ Y  25%      50%
X ⇒ Z  37.50%   75%

Other Interestingness Measures: Interest

• Interest (correlation, lift):

  interest(A,B) = P(A ∪ B) / (P(A) P(B))

  takes both P(A) and P(B) into consideration;
  P(A ∪ B) = P(A) P(B) if A and B are independent events;
  A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated.

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Itemset  Support  Interest
X,Y      25%      2
X,Z      37.50%   0.9
Y,Z      12.50%   0.57
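
A quick check of these interest values in Python (illustrative; note the X,Z value computes to about 0.86, which the slide rounds to 0.9):

X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def interest(a, b):
    """lift = P(A and B) / (P(A) * P(B)) over the 8 rows."""
    n = len(a)
    p_ab = sum(x and y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    return p_ab / (p_a * p_b)

for name, (a, b) in {"X,Y": (X, Y), "X,Z": (X, Z), "Y,Z": (Y, Z)}.items():
    print(name, round(interest(a, b), 2))   # X,Y 2.0; X,Z 0.86; Y,Z 0.57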

Multiple-Level Association Rules

• Items often form a hierarchy.
• Items at the lower levels are expected to have lower support.
• Rules regarding itemsets at the appropriate levels could be quite useful.
• The transaction database can be encoded based on dimensions and levels.
• We can explore shared multi-level mining.

[Figure: concept hierarchy. All → {Computer, Printer}; Computer → {Desktop, Laptop}; Desktop → {IBM, Compaq}; Printer → {Color, B/W}; brands such as Sony and HP appear at the leaf level.]

TID  Items
T1   {111, 121, 211, 221}
T2   {111, 211, 222, 323}
T3   {112, 122, 221, 411}
T4   {111, 121}
T5   {111, 122, 211, 221, 413}

Mining Multi-Level Associations

• A top-down, progressive deepening approach:
  first find high-level strong rules, e.g.:
    computer ⇒ printer [20%, 60%];
  then find their lower-level "weaker" rules, e.g.:
    desktop ⇒ printer [6%, 50%].
• Variations in mining multiple-level association rules:
  level-crossing association rules:
    desktop ⇒ Sony color printer;
  association rules with multiple, alternative hierarchies:
    desktop ⇒ color printer.

Uniform Support

Multi-level mining with uniform support:

Level 1 (min_sup = 5%): Computer [support = 10%]
Level 2 (min_sup = 5%): Desktop [support = 6%], Laptop [support = 4%]

Reduced Support

Multi-level mining with reduced support:

Level 1 (min_sup = 5%): Computer [support = 10%]
Level 2 (min_sup = 3%): Desktop [support = 6%], Laptop [support = 4%]

Multi-Dimensional Association: Concepts

• Single-dimensional rules:
  buys(X, "milk") ⇒ buys(X, "bread")
• Multi-dimensional rules:
  inter-dimension association rules (no repeated predicates):
    age(X, "19-25") ^ occupation(X, "student") ⇒ buys(X, "coke")
  hybrid-dimension association rules (repeated predicates):
    age(X, "19-25") ^ buys(X, "popcorn") ⇒ buys(X, "coke")

Summary

• Association rule mining is probably the most significant contribution from the database community to KDD.
• A large number of papers have been published, and many interesting issues have been explored.
• An interesting research direction: association analysis in other types of data, such as spatial data, multimedia data, time series data, etc.

End of Chapter 5