Frequent Pattern Mining Using FP-Trees

Upload: venkataraman-kamath

Post on 05-Apr-2018


7/31/2019 Frequent Pattern Mining Using Fp Trees (1)

    SEMINAR BY

    NITHISH PAI B

    UNDER THE GUIDANCE OF

    MRS PUSHPALATA M P


    OVERVIEW

    Basic concepts and a roadmap

    Scalable frequent itemset mining methods:

    APRIORI ALGORITHM: a candidate generation & test approach

    PATTERN GROWTH APPROACH: mining frequent patterns without candidate generation

    CHARM / ECLAT: mining by exploring the vertical data format

    Summary and conclusions


    WHAT IS FREQUENT PATTERN ANALYSIS?

    Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

    First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining

    Motivation: Finding inherent regularities in data. What products were often purchased together? Beer and diapers?! What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug? Can we automatically classify web documents?

    Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click-stream) analysis, and DNA sequence analysis


    WHY IS FREQUENT PATTERN MINING IMPORTANT?

    Freq. pattern: An intrinsic and important property of datasets

    Foundation for many essential data mining tasks

    Association, correlation, and causality analysis

    Sequential, structural (e.g., sub-graph) patterns

    Pattern analysis in spatiotemporal, multimedia, time-series, and stream data

    Classification: discriminative, frequent pattern analysis

    Cluster analysis: frequent pattern-based clustering

    Data warehousing: iceberg cube and cube-gradient

    Semantic data compression: fascicles

    Broad applications


    BASIC CONCEPTS: FREQUENT PATTERNS

    Tid   Items bought
    10    Tea, Nuts, Napkins
    20    Tea, Coffee, Napkins
    30    Tea, Napkins, Eggs
    40    Nuts, Eggs, Milk
    50    Nuts, Coffee, Napkins, Eggs, Milk

    itemset: a set of one or more items

    k-itemset: X = {x1, ..., xk}

    (absolute) support, or support count, of X: the frequency or number of occurrences of an itemset X

    (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)

    An itemset X is frequent if X's support is no less than a minsup threshold
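These definitions can be checked directly on the toy database above. A minimal Python sketch (the database and function names here are illustrative, not from any library):

```python
# Toy transaction database from the slide (Tid -> items bought).
db = {
    10: {"Tea", "Nuts", "Napkins"},
    20: {"Tea", "Coffee", "Napkins"},
    30: {"Tea", "Napkins", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Napkins", "Eggs", "Milk"},
}

def absolute_support(itemset, db):
    """Support count: number of transactions containing every item of X."""
    return sum(1 for items in db.values() if itemset <= items)

def relative_support(itemset, db):
    """Fraction of transactions containing X."""
    return absolute_support(itemset, db) / len(db)

def is_frequent(itemset, db, minsup):
    """X is frequent iff its support count meets the minsup threshold."""
    return absolute_support(itemset, db) >= minsup
```

For example, `absolute_support({"Tea", "Napkins"}, db)` is 3 and its relative support is 3/5 = 60%.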


    BASIC CONCEPTS: ASSOCIATION RULES

    Tid   Items bought
    10    Tea, Nuts, Napkins
    20    Tea, Coffee, Napkins
    30    Tea, Napkins, Eggs
    40    Nuts, Eggs, Milk
    50    Nuts, Coffee, Napkins, Eggs, Milk

    Find all the rules X → Y with minimum support and confidence

    support, s: probability that a transaction contains X ∪ Y

    confidence, c: conditional probability that a transaction having X also contains Y

    Let minsup = 50%, minconf = 50%

    Freq. Pat.: Tea:3, Nuts:3, Napkins:4, Eggs:3, {Tea, Napkins}:3

    Association rules: (many more!)

    Tea → Napkins (60%, 100%)

    Napkins → Tea (60%, 75%)
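The two numbers attached to each rule follow directly from these definitions. A small sketch that computes them over the toy database (the `rule_metrics` name is ours):

```python
# Toy transaction database from the slide (Tid -> items bought).
db = {
    10: {"Tea", "Nuts", "Napkins"},
    20: {"Tea", "Coffee", "Napkins"},
    30: {"Tea", "Napkins", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Napkins", "Eggs", "Milk"},
}

def rule_metrics(X, Y, db):
    """Support and confidence of the association rule X -> Y.

    support    = fraction of transactions containing X union Y
    confidence = fraction of X-transactions that also contain Y
    """
    n_xy = sum(1 for items in db.values() if X | Y <= items)
    n_x = sum(1 for items in db.values() if X <= items)
    return n_xy / len(db), n_xy / n_x
```

`rule_metrics({"Tea"}, {"Napkins"}, db)` returns `(0.6, 1.0)` and the reverse rule returns `(0.6, 0.75)`, matching the (60%, 100%) and (60%, 75%) figures above.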


    CLOSED PATTERNS AND MAX-PATTERNS

    A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, ..., a100} contains (100 choose 1) + (100 choose 2) + ... + (100 choose 100) = 2^100 - 1 ≈ 1.27*10^30 sub-patterns!

    Solution: mine closed patterns and max-patterns instead

    An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)

    An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)

    Closed pattern is a lossless compression of freq. patterns

    Reducing the # of patterns and rules
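Given the full set of frequent itemsets with their supports, the two definitions can be applied mechanically. A sketch under the assumption that supports are already known (`closed_and_max` is our name):

```python
def closed_and_max(freq):
    """Split frequent itemsets into closed and maximal ones.

    freq: dict mapping frozenset itemsets to support counts.
    Closed: no frequent proper superset with the same support.
    Max:    no frequent proper superset at all.
    """
    closed, maximal = set(), set()
    for X, s in freq.items():
        supersets = [Y for Y in freq if X < Y]
        if all(freq[Y] != s for Y in supersets):
            closed.add(X)
        if not supersets:
            maximal.add(X)
    return closed, maximal
```

On the slide-6 example, {Tea}:3 is not closed (its superset {Tea, Napkins} also has support 3), while Napkins:4 is closed but not maximal; every max-pattern is also closed.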


    COMPUTATIONAL COMPLEXITY OF FREQUENT ITEMSET MINING

    How many itemsets are potentially to be generated in the worst case?

    The number of frequent itemsets to be generated is sensitive to the minsup threshold

    When minsup is low, there exist potentially an exponential number of frequent itemsets

    The worst case: M^N, where M: # distinct items, and N: max length of transactions

    The worst-case complexity vs. the expected probability

    Ex. Suppose Walmart has 10^4 kinds of products

    The chance to pick up one product: 10^-4

    The chance to pick up a particular set of 10 products: ~10^-40

    What is the chance for this particular set of 10 products to be frequent 10^3 times in 10^9 transactions?


    THE DOWNWARD CLOSURE PROPERTY

    AND SCALABLE MINING METHODS

    The downward closure property of frequent patterns states that:

    Any subset of a frequent itemset must be frequent

    If {tea, napkins, nuts} is frequent, so is {tea, napkins}

    i.e., every transaction having {tea, napkins, nuts} also contains {tea, napkins}

    Scalable mining methods: Three major approaches

    Apriori algorithm

    Freq. pattern growth

    Vertical data format approach


    APRIORI ALGORITHM: A CANDIDATE GENERATION & TEST APPROACH

    Apriori pruning principle: If there is any itemset which is infrequent,its superset should not be generated/tested!

    It uses a breadth-first search strategy to count the support of itemsets, and uses a candidate generation function which exploits the downward closure property of support.

    Method: Initially, scan DB once to get frequent 1-itemset

    Generate length (k+1) candidate itemsets from length k frequent itemsets

    Test the candidates against DB

    Terminate when no frequent or candidate set can be generated
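The level-wise method can be sketched in Python. This is a minimal illustration of the join, prune, and test steps, not an optimized implementation; the `apriori` name and structure are ours:

```python
from itertools import combinations

def apriori(db, minsup):
    """Level-wise candidate generation and test.

    db: list of transaction sets; minsup: absolute support count.
    Returns {frozenset itemset: support count} for all frequent itemsets.
    """
    def supports(cands):
        return {c: sum(1 for t in db if c <= t) for c in cands}

    # Scan DB once to get the frequent 1-itemsets.
    items = {i for t in db for i in t}
    Lk = {c: s for c, s in supports({frozenset({i}) for i in items}).items()
          if s >= minsup}
    freq, k = dict(Lk), 1
    while Lk:
        # Join: build (k+1)-candidates from pairs of frequent k-itemsets.
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune: drop any candidate with an infrequent k-subset
        # (the Apriori pruning principle / downward closure).
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Test the surviving candidates against the DB.
        Lk = {c: s for c, s in supports(cands).items() if s >= minsup}
        freq.update(Lk)
        k += 1
    return freq
```

On the slide-5/6 database with minsup = 3 this yields exactly the five patterns listed earlier: Tea:3, Nuts:3, Napkins:4, Eggs:3, {Tea, Napkins}:3.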


    APRIORI ALGORITHM: EXAMPLE


    IMPROVEMENT OF APRIORI METHOD

    DISADVANTAGES:

    Multiple scans of transaction database

    Huge number of candidates

    Tedious workload of support counting for candidates

    Improving Apriori: general ideas

    Reduce passes of transaction database scans

    Shrink number of candidates

    Facilitate support counting of candidates


    Pattern-Growth Approach:

    Mining Frequent Patterns Without Candidate Generation

    The FPGrowth Approach:

    Depth-first search

    Avoid explicit candidate generation

    Major philosophy: grow long patterns from short ones using local frequent items only

    "abc" is a frequent pattern

    Get all transactions having abc, i.e., project DB on abc: DB|abc

    "d" is a local frequent item in DB|abc → abcd is a frequent pattern


    CONSTRUCTING FP-TREE FROM A TRANSACTIONAL DATABASE

    Consider the following set of transactions Let the minimum Support count be 3

    TID   Items bought                 (ordered) frequent items
    100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
    200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
    300   {b, f, h, j, o, w}           {f, b}
    400   {b, c, k, s, p}              {c, b, p}
    500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}


    CONSTRUCTING FP-TREE FROM A TRANSACTIONAL DATABASE

    ALGORITHM FOR CONSTRUCTION OF A FP-TREE:

    1. Scan DB once, find frequent 1-itemset (single item pattern)

    2. Sort frequent items in frequency descending order, f-list

    3. Scan DB again, construct FP-tree

    FP-tree for the example transactions:

    {}
    ├── f:4
    │   ├── c:3
    │   │   └── a:3
    │   │       ├── m:2
    │   │       │   └── p:2
    │   │       └── b:1
    │   │           └── m:1
    │   └── b:1
    └── c:1
        └── b:1
            └── p:1

    Header Table

    Item   frequency
    f      4
    c      4
    a      3
    b      3
    m      3
    p      3

    F-list = f-c-a-b-m-p
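The two-scan construction above can be sketched in Python. This is an illustrative data structure of our own (ties in the f-list are broken by first occurrence, so the exact shape may differ from the figure, but all node counts are preserved):

```python
from collections import Counter, defaultdict

class Node:
    """FP-tree node with a parent pointer for prefix-path traversal."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(db, minsup):
    """Two-scan FP-tree construction, following the slide's three steps."""
    # Scan 1: find frequent single items, sorted by descending frequency.
    counts = Counter(i for t in db for i in t)
    flist = [i for i, c in counts.most_common() if c >= minsup]
    root = Node(None, None)
    node_links = defaultdict(list)  # header table: item -> its nodes
    # Scan 2: insert each transaction's frequent items in f-list order,
    # sharing existing prefixes and incrementing their counts.
    for t in db:
        node = root
        for item in (i for i in flist if i in t):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                node_links[item].append(child)
            child.count += 1
            node = child
    return root, flist, node_links
```

Because every path is shared by all transactions with the same frequent-item prefix, the counts along each item's node-links always sum to that item's global support.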


    PARTITION PATTERNS AND DATABASES

    Frequent patterns can be partitioned into subsets according to f-list

    F-list = f-c-a-b-m-p

    Patterns containing p

    Patterns having m but no p

    Patterns having c but no a nor b, m, p

    Pattern f

    Completeness and non-redundancy


    FINDING PATTERNS HAVING P FROM P-CONDITIONAL DATABASE

    Starting at the frequent item header table in the FP-tree

    Traverse the FP-tree by following the link of each frequent item p

    Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

    (Using the FP-tree and header table constructed on the previous slide)

    Conditional pattern bases:

    item   cond. pattern base
    c      f:3
    a      fc:3
    b      fca:1, f:1, c:1
    m      fca:2, fcab:1
    p      fcam:2, cb:1
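Since each FP-tree path is just a merged run of ordered transactions, the table above can be reproduced directly from the ordered transactions of slide 14; each prefix preceding an item, with its multiplicity, is one entry of that item's conditional pattern base. A small sketch (names are ours):

```python
from collections import Counter

# Ordered frequent-item transactions from slide 14 (f-list order f,c,a,b,m,p).
ordered = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

def conditional_pattern_base(item, ordered):
    """Prefix paths preceding `item`, with multiplicities.

    Equivalent to following `item`'s node-links in the FP-tree and
    collecting each node's prefix path together with its count.
    """
    base = Counter()
    for t in ordered:
        if item in t:
            prefix = tuple(t[: t.index(item)])
            if prefix:
                base[prefix] += 1
    return base
```

For example, `conditional_pattern_base("p", ordered)` gives {fcam: 2, cb: 1}, matching the last row of the table.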


    FROM CONDITIONAL PATTERN BASES TO CONDITIONAL FP-TREES

    For each pattern base:

    Accumulate the count for each item in the base

    Construct the FP-tree for the frequent items of the pattern base

    m-conditional pattern base: fca:2, fcab:1

    m-conditional FP-tree (b is dropped, since its count of 1 is below the minimum support count of 3):

    {}
    └── f:3
        └── c:3
            └── a:3

    All frequent patterns related to m:

    m, fm, cm, am, fcm, fam, cam, fcam

    → associations


    RECURSION: MINING EACH CONDITIONAL FP-TREE


    ALGORITHM FOR MINING FREQUENT ITEMSETS USING FP-TREE BY FREQUENT PATTERN GROWTH

    Input: D, a transaction database; minsup, the minimum support count threshold.

    Output: The complete set of frequent patterns.

    Method:

    1. The FP-tree is constructed in the following steps:

    a) Scan the transaction database D once. Collect F, the set of frequent items, and their support counts. Sort F in support-count descending order as L, the list of frequent items.

    b) Create the root of an FP-tree, and label it as null. For each transaction Trans in D do the following: select and sort the frequent items in Trans according to the order of L. Let the sorted frequent-item list in Trans be [p|P], where p is the first element and P is the remaining list. Call insert_tree([p|P], T), which is performed as follows: if T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, let its count be 1, its parent link be linked to T, and link it to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.


    ALGORITHM FOR MINING FREQUENT ITEMSETS USING FP-TREE BY FREQUENT PATTERN GROWTH

    2. The FP-tree is mined by calling FP_growth(FP-tree, null), which is implemented as follows.

    procedure FP_growth(Tree, α)
    if Tree contains a single path P then {
        for each combination (denoted as β) of the nodes in the path P {
            generate pattern β ∪ α with support count = minimum support count of the nodes in β;
        }
    } else for each ai in the header of Tree {
        generate pattern β = ai ∪ α with support_count = ai.support_count;
        construct β's conditional pattern base and then β's conditional FP-tree Treeβ;
        if Treeβ ≠ ∅ then
            call FP_growth(Treeβ, β);
    }
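The recursion can be sketched without explicit tree nodes by representing each conditional database as (ordered-prefix, count) pairs; the divide-and-conquer projection logic matches FP_growth, though this simplification of ours skips the single-path shortcut:

```python
from collections import Counter

def fpgrowth(db, minsup):
    """All frequent itemsets by recursive pattern growth, no candidates.

    Simplified sketch: conditional databases are lists of
    (ordered-prefix, count) pairs rather than explicit FP-trees.
    db: list of transactions (item lists); minsup: absolute support.
    """
    counts = Counter(i for t in db for i in t)
    rank = {i: r for r, (i, c) in enumerate(counts.most_common())
            if c >= minsup}
    # Keep frequent items only, in f-list (descending-frequency) order.
    cdb = [(sorted((i for i in set(t) if i in rank), key=rank.get), 1)
           for t in db]
    out = {}
    _mine(cdb, frozenset(), minsup, out)
    return out

def _mine(cdb, suffix, minsup, out):
    # Count locally frequent items in the conditional database.
    local = Counter()
    for items, c in cdb:
        for i in items:
            local[i] += c
    for i, ci in local.items():
        if ci < minsup:
            continue
        pattern = suffix | {i}
        out[pattern] = ci
        # Project on i: the prefix preceding i in each conditional
        # transaction, i.e. i's conditional pattern base.
        proj = [(items[: items.index(i)], c)
                for items, c in cdb
                if i in items and items.index(i) > 0]
        if proj:
            _mine(proj, pattern, minsup, out)
```

On the slide-14 database with minsup = 3 this recovers, among others, all the m-related patterns of slide 18, including fcam with support 3.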


    BENEFITS OF FP-TREE STRUCTURE

    Completeness

    Preserve complete information for frequent pattern mining

    Never break a long pattern of any transaction

    Compactness

    Reduce irrelevant info: infrequent items are gone

    Items in frequency descending order: the more

    frequently occurring, the more likely to be shared

    Never be larger than the original database (not counting node-links and the count fields)


    ADVANTAGES OF PATTERN GROWTH APPROACH

    Divide-and-conquer:

    Decompose both the mining task and DB according to the frequentpatterns obtained so far

    Lead to focused search of smaller databases

    No candidate generation, no candidate test

    Compressed database: FP-tree structure

    No repeated scan of entire database

    Basic ops: counting local frequent items and building sub-FP-trees; no pattern search and matching


    EXTENSION OF PATTERN GROWTH MINING METHODOLOGY

    Mining closed frequent itemsets and max-patterns

    Mining sequential patterns

    Mining graph patterns

    Constraint-based mining of frequent patterns

    Computing iceberg data cubes with complex measures

    Pattern-growth-based clustering

    Pattern-Growth-Based Classification


    CHARM / ECLAT: Mining by Exploring Vertical Data Format

    Vertical format: t(AB) = {T11, T25, ...}

    tid-list: list of transaction ids containing an itemset

    Deriving closed patterns based on vertical intersections:

    t(X) = t(Y): X and Y always happen together

    t(X) ⊂ t(Y): a transaction having X always has Y

    Using diffsets to accelerate mining:

    Only keep track of differences of tids

    t(X) = {T1, T2, T3}, t(XY) = {T1, T3}

    Diffset(XY, X) = {T2}
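The tid-list intersection idea can be sketched as follows. This is a minimal illustration of ours: it mines frequent itemsets by recursively intersecting tid-lists, and omits the diffset optimization (storing only tid differences) for clarity:

```python
def eclat(db, minsup):
    """Frequent itemsets from the vertical data format (tid-lists).

    db: dict of transaction id -> set of items; minsup: absolute support.
    Returns {frozenset itemset: support count}.
    """
    # Build the vertical format: item -> tid-list.
    vertical = {}
    for tid, items in db.items():
        for i in items:
            vertical.setdefault(i, set()).add(tid)
    out = {}

    def extend(prefix, candidates):
        for idx, (i, tids) in enumerate(candidates):
            itemset = prefix | {i}
            out[frozenset(itemset)] = len(tids)
            # t(X u Y) = t(X) n t(Y): intersect tid-lists to grow the prefix.
            ext = []
            for j, tj in candidates[idx + 1:]:
                common = tids & tj
                if len(common) >= minsup:
                    ext.append((j, common))
            if ext:
                extend(itemset, ext)

    frequent_items = sorted(i for i in vertical if len(vertical[i]) >= minsup)
    extend(set(), [(i, vertical[i]) for i in frequent_items])
    return out
```

Support counting never rescans transactions; the length of each intersected tid-list is the support, which is exactly what diffsets shrink when tid-lists are long.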