Frequent Pattern Mining Using FP-Trees

Upload: venkataraman-kamath

Post on 05-Apr-2018


7/31/2019 Frequent Pattern Mining Using Fp Trees (1)

    SEMINAR BY

    NITHISH PAI B

    UNDER THE GUIDANCE OF

    MRS PUSHPALATA M P


    OVERVIEW

    Basic concepts and a roadmap

    Scalable frequent itemset mining methods:

    APRIORI ALGORITHM: a candidate generation & test approach

    PATTERN GROWTH APPROACH: mining frequent patterns without candidate generation

    CHARM / ECLAT: mining by exploring the vertical data format

    Summary and conclusions


    WHAT IS FREQUENT PATTERN ANALYSIS?

    Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

    First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining

    Motivation: Finding inherent regularities in data. What products were often purchased together? Beer and diapers?! What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug? Can we automatically classify web documents?

    Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click-stream) analysis, and DNA sequence analysis


    WHY IS FREQUENT PATTERN MINING IMPORTANT?

    Freq. pattern: An intrinsic and important property of datasets

    Foundation for many essential data mining tasks

    Association, correlation, and causality analysis

    Sequential, structural (e.g., sub-graph) patterns

    Pattern analysis in spatiotemporal, multimedia, time-series, and stream data

    Classification: discriminative, frequent pattern analysis

    Cluster analysis: frequent pattern-based clustering

    Data warehousing: iceberg cube and cube-gradient

    Semantic data compression: fascicles

    Broad applications


    BASIC CONCEPTS: FREQUENT PATTERNS

    Tid   Items bought
    10    Tea, Nuts, Napkins
    20    Tea, Coffee, Napkins
    30    Tea, Napkins, Eggs
    40    Nuts, Eggs, Milk
    50    Nuts, Coffee, Napkins, Eggs, Milk

    itemset: a set of one or more items

    k-itemset: X = {x1, ..., xk}

    (absolute) support, or support count, of X: the frequency or number of occurrences of an itemset X

    (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)

    An itemset X is frequent if X's support is no less than a minsup threshold
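These definitions can be checked directly on the toy database above. A minimal Python sketch (the database and function names here are illustrative, not from any library):

```python
# Toy transaction database from the slide (Tid -> items bought).
db = {
    10: {"Tea", "Nuts", "Napkins"},
    20: {"Tea", "Coffee", "Napkins"},
    30: {"Tea", "Napkins", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Napkins", "Eggs", "Milk"},
}

def absolute_support(itemset, db):
    """Support count: number of transactions containing every item of X."""
    return sum(1 for items in db.values() if itemset <= items)

def relative_support(itemset, db):
    """Fraction of transactions containing X."""
    return absolute_support(itemset, db) / len(db)

def is_frequent(itemset, db, minsup):
    """X is frequent iff its support count meets the minsup threshold."""
    return absolute_support(itemset, db) >= minsup
```

For example, `absolute_support({"Tea", "Napkins"}, db)` is 3 and its relative support is 3/5 = 60%.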


    BASIC CONCEPTS: ASSOCIATION RULES

    Tid   Items bought
    10    Tea, Nuts, Napkins
    20    Tea, Coffee, Napkins
    30    Tea, Napkins, Eggs
    40    Nuts, Eggs, Milk
    50    Nuts, Coffee, Napkins, Eggs, Milk

    Find all the rules X → Y with minimum support and confidence

    support, s: probability that a transaction contains X ∪ Y

    confidence, c: conditional probability that a transaction having X also contains Y

    Let minsup = 50%, minconf = 50%

    Freq. Pat.: Tea:3, Nuts:3, Napkins:4, Eggs:3, {Tea, Napkins}:3

    Association rules: (many more!)

    Tea → Napkins (60%, 100%)

    Napkins → Tea (60%, 75%)
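The two numbers attached to each rule follow directly from these definitions. A small sketch that computes them over the toy database (the `rule_metrics` name is ours):

```python
# Toy transaction database from the slide (Tid -> items bought).
db = {
    10: {"Tea", "Nuts", "Napkins"},
    20: {"Tea", "Coffee", "Napkins"},
    30: {"Tea", "Napkins", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Napkins", "Eggs", "Milk"},
}

def rule_metrics(X, Y, db):
    """Support and confidence of the association rule X -> Y.

    support    = fraction of transactions containing X union Y
    confidence = fraction of X-transactions that also contain Y
    """
    n_xy = sum(1 for items in db.values() if X | Y <= items)
    n_x = sum(1 for items in db.values() if X <= items)
    return n_xy / len(db), n_xy / n_x
```

`rule_metrics({"Tea"}, {"Napkins"}, db)` returns `(0.6, 1.0)` and the reverse rule returns `(0.6, 0.75)`, matching the (60%, 100%) and (60%, 75%) figures above.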


    CLOSED PATTERNS AND MAX-PATTERNS

    A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, ..., a100} contains (100 choose 1) + (100 choose 2) + ... + (100 choose 100) = 2^100 - 1 ≈ 1.27*10^30 sub-patterns!

    Solution: mine closed patterns and max-patterns instead

    An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)

    An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)

    Closed pattern is a lossless compression of freq. patterns

    Reducing the # of patterns and rules
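Given the full set of frequent itemsets with their supports, the two definitions can be applied mechanically. A sketch under the assumption that supports are already known (`closed_and_max` is our name):

```python
def closed_and_max(freq):
    """Split frequent itemsets into closed and maximal ones.

    freq: dict mapping frozenset itemsets to support counts.
    Closed: no frequent proper superset with the same support.
    Max:    no frequent proper superset at all.
    """
    closed, maximal = set(), set()
    for X, s in freq.items():
        supersets = [Y for Y in freq if X < Y]
        if all(freq[Y] != s for Y in supersets):
            closed.add(X)
        if not supersets:
            maximal.add(X)
    return closed, maximal
```

On the slide-6 example, {Tea}:3 is not closed (its superset {Tea, Napkins} also has support 3), while Napkins:4 is closed but not maximal; every max-pattern is also closed.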


    COMPUTATIONAL COMPLEXITY OF FREQUENT ITEMSET MINING

    How many itemsets are potentially to be generated in the worst case?

    The number of frequent itemsets to be generated is sensitive to the minsup threshold

    When minsup is low, there exist potentially an exponential number of frequent itemsets

    The worst case: M^N, where M: # distinct items, and N: max length of transactions

    The worst-case complexity vs. the expected probability

    Ex. Suppose Walmart has 10^4 kinds of products

    The chance to pick up one product: 10^-4

    The chance to pick up a particular set of 10 products: ~10^-40

    What is the chance for this particular set of 10 products to be frequent 10^3 times in 10^9 transactions?


    THE DOWNWARD CLOSURE PROPERTY

    AND SCALABLE MINING METHODS

    The downward closure property of frequent patterns states that:

    Any subset of a frequent itemset must be frequent

    If {tea, napkins, nuts} is frequent, so is {tea, napkins}

    i.e., every transaction having {tea, napkins, nuts} also contains {tea, napkins}

    Scalable mining methods: Three major approaches

    Apriori algorithm

    Freq. pattern growth

    Vertical data format approach


    APRIORI ALGORITHM: A CANDIDATE GENERATION & TEST APPROACH

    Apriori pruning principle: If there is any itemset which is infrequent,its superset should not be generated/tested!

    It uses a breadth-first search strategy to count the support of itemsets, and uses a candidate generation function which exploits the downward closure property of support.

    Method: Initially, scan DB once to get frequent 1-itemset

    Generate length (k+1) candidate itemsets from length k frequent itemsets

    Test the candidates against DB

    Terminate when no frequent or candidate set can be generated
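The level-wise method can be sketched in Python. This is a minimal illustration of the join, prune, and test steps, not an optimized implementation; the `apriori` name and structure are ours:

```python
from itertools import combinations

def apriori(db, minsup):
    """Level-wise candidate generation and test.

    db: list of transaction sets; minsup: absolute support count.
    Returns {frozenset itemset: support count} for all frequent itemsets.
    """
    def supports(cands):
        return {c: sum(1 for t in db if c <= t) for c in cands}

    # Scan DB once to get the frequent 1-itemsets.
    items = {i for t in db for i in t}
    Lk = {c: s for c, s in supports({frozenset({i}) for i in items}).items()
          if s >= minsup}
    freq, k = dict(Lk), 1
    while Lk:
        # Join: build (k+1)-candidates from pairs of frequent k-itemsets.
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune: drop any candidate with an infrequent k-subset
        # (the Apriori pruning principle / downward closure).
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Test the surviving candidates against the DB.
        Lk = {c: s for c, s in supports(cands).items() if s >= minsup}
        freq.update(Lk)
        k += 1
    return freq
```

On the slide-5/6 database with minsup = 3 this yields exactly the five patterns listed earlier: Tea:3, Nuts:3, Napkins:4, Eggs:3, {Tea, Napkins}:3.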


    APRIORI ALGORITHM: EXAMPLE


    IMPROVEMENT OF APRIORI METHOD

    DISADVANTAGES:

    Multiple scans of transaction database

    Huge number of candidates

    Tedious workload of support counting for candidates

    Improving Apriori: general ideas

    Reduce passes of transaction database scans

    Shrink number of candidates

    Facilitate support counting of candidates


    Pattern-Growth Approach:

    Mining Frequent Patterns Without Candidate Generation

    The FPGrowth Approach:

    Depth-first search

    Avoid explicit candidate generation

    Major philosophy: grow long patterns from short ones using local frequent items only

    "abc" is a frequent pattern

    Get all transactions having abc, i.e., project DB on abc: DB|abc

    "d" is a local frequent item in DB|abc → abcd is a frequent pattern


    CONSTRUCTING FP-TREE FROM A TRANSACTIONAL DATABASE

    Consider the following set of transactions Let the minimum Support count be 3

    TID   Items bought                 (ordered) frequent items
    100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
    200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
    300   {b, f, h, j, o, w}           {f, b}
    400   {b, c, k, s, p}              {c, b, p}
    500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}


    CONSTRUCTING FP-TREE FROM A TRANSACTIONAL DATABASE

    ALGORITHM FOR CONSTRUCTION OF A FP-TREE:

    1. Scan DB once, find frequent 1-itemset (single item pattern)

    2. Sort frequent items in frequency descending order, f-list

    3. Scan DB again, construct FP-tree

    FP-tree for the example transactions:

    {}
    ├── f:4
    │   ├── c:3
    │   │   └── a:3
    │   │       ├── m:2
    │   │       │   └── p:2
    │   │       └── b:1
    │   │           └── m:1
    │   └── b:1
    └── c:1
        └── b:1
            └── p:1

    Header Table

    Item   frequency
    f      4
    c      4
    a      3
    b      3
    m      3
    p      3

    F-list = f-c-a-b-m-p
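The two-scan construction above can be sketched in Python. This is an illustrative data structure of our own (ties in the f-list are broken by first occurrence, so the exact shape may differ from the figure, but all node counts are preserved):

```python
from collections import Counter, defaultdict

class Node:
    """FP-tree node with a parent pointer for prefix-path traversal."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(db, minsup):
    """Two-scan FP-tree construction, following the slide's three steps."""
    # Scan 1: find frequent single items, sorted by descending frequency.
    counts = Counter(i for t in db for i in t)
    flist = [i for i, c in counts.most_common() if c >= minsup]
    root = Node(None, None)
    node_links = defaultdict(list)  # header table: item -> its nodes
    # Scan 2: insert each transaction's frequent items in f-list order,
    # sharing existing prefixes and incrementing their counts.
    for t in db:
        node = root
        for item in (i for i in flist if i in t):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                node_links[item].append(child)
            child.count += 1
            node = child
    return root, flist, node_links
```

Because every path is shared by all transactions with the same frequent-item prefix, the counts along each item's node-links always sum to that item's global support.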


    PARTITION PATTERNS AND DATABASES

    Frequent patterns can be partitioned into subsets according to f-list

    F-list = f-c-a-b-m-p

    Patterns containing p

    Patterns having m but no p

    Patterns having c but no a nor b, m, p

    Pattern f

    Completeness and non-redundancy


    FINDING PATTERNS HAVING P FROM P-CONDITIONAL DATABASE

    Starting at the frequent item header table in the FP-tree

    Traverse the FP-tree by following the link of each frequent item p

    Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

    (Using the FP-tree and header table constructed on the previous slide)

    Conditional pattern bases:

    item   cond. pattern base
    c      f:3
    a      fc:3
    b      fca:1, f:1, c:1
    m      fca:2, fcab:1
    p      fcam:2, cb:1
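Since each FP-tree path is just a merged run of ordered transactions, the table above can be reproduced directly from the ordered transactions of slide 14; each prefix preceding an item, with its multiplicity, is one entry of that item's conditional pattern base. A small sketch (names are ours):

```python
from collections import Counter

# Ordered frequent-item transactions from slide 14 (f-list order f,c,a,b,m,p).
ordered = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

def conditional_pattern_base(item, ordered):
    """Prefix paths preceding `item`, with multiplicities.

    Equivalent to following `item`'s node-links in the FP-tree and
    collecting each node's prefix path together with its count.
    """
    base = Counter()
    for t in ordered:
        if item in t:
            prefix = tuple(t[: t.index(item)])
            if prefix:
                base[prefix] += 1
    return base
```

For example, `conditional_pattern_base("p", ordered)` gives {fcam: 2, cb: 1}, matching the last row of the table.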


    FROM CONDITIONAL PATTERN BASES TO CONDITIONAL FP-TREES

    For each pattern base:

    Accumulate the count for each item in the base

    Construct the FP-tree for the frequent items of the pattern base

    m-conditional pattern base: fca:2, fcab:1

    m-conditional FP-tree (b is dropped, since its count of 1 is below the minimum support count of 3):

    {}
    └── f:3
        └── c:3
            └── a:3

    All frequent patterns related to m:

    m, fm, cm, am, fcm, fam, cam, fcam

    → associations


    RECURSION: MINING EACH CONDITIONAL FP-TREE


    ALGORITHM FOR MINING FREQUENT ITEMSETS USING FP-TREE BY FREQUENT PATTERN GROWTH

    Input: D, a transaction database; minsup, the minimum support count threshold.

    Output: The complete set of frequent patterns.

    Method:

    1. The FP-tree is constructed in the following steps:

    a) Scan the transaction database D once. Collect F, the set of frequent items, and their support counts. Sort F in support-count descending order as L, the list of frequent items.

    b) Create the root of an FP-tree, and label it as null. For each transaction Trans in D do the following: select and sort the frequent items in Trans according to the order of L. Let the sorted frequent-item list in Trans be [p|P], where p is the first element and P is the remaining list. Call insert_tree([p|P], T), which is performed as follows: if T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, let its count be 1, its parent link be linked to T, and link it to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.


    ALGORITHM FOR MINING FREQUENT ITEMSETS USING FP-TREE BY FREQUENT PATTERN GROWTH

    2. The FP-tree is mined by calling FP_growth(FP-tree, null), which is implemented as follows.

    procedure FP_growth(Tree, α)
    if Tree contains a single path P then {
        for each combination (denoted as β) of the nodes in the path P {
            generate pattern β ∪ α with support count = minimum support count of the nodes in β;
        }
    } else for each ai in the header of Tree {
        generate pattern β = ai ∪ α with support_count = ai.support_count;
        construct β's conditional pattern base and then β's conditional FP-tree Treeβ;
        if Treeβ ≠ ∅ then
            call FP_growth(Treeβ, β);
    }
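The recursion can be sketched without explicit tree nodes by representing each conditional database as (ordered-prefix, count) pairs; the divide-and-conquer projection logic matches FP_growth, though this simplification of ours skips the single-path shortcut:

```python
from collections import Counter

def fpgrowth(db, minsup):
    """All frequent itemsets by recursive pattern growth, no candidates.

    Simplified sketch: conditional databases are lists of
    (ordered-prefix, count) pairs rather than explicit FP-trees.
    db: list of transactions (item lists); minsup: absolute support.
    """
    counts = Counter(i for t in db for i in t)
    rank = {i: r for r, (i, c) in enumerate(counts.most_common())
            if c >= minsup}
    # Keep frequent items only, in f-list (descending-frequency) order.
    cdb = [(sorted((i for i in set(t) if i in rank), key=rank.get), 1)
           for t in db]
    out = {}
    _mine(cdb, frozenset(), minsup, out)
    return out

def _mine(cdb, suffix, minsup, out):
    # Count locally frequent items in the conditional database.
    local = Counter()
    for items, c in cdb:
        for i in items:
            local[i] += c
    for i, ci in local.items():
        if ci < minsup:
            continue
        pattern = suffix | {i}
        out[pattern] = ci
        # Project on i: the prefix preceding i in each conditional
        # transaction, i.e. i's conditional pattern base.
        proj = [(items[: items.index(i)], c)
                for items, c in cdb
                if i in items and items.index(i) > 0]
        if proj:
            _mine(proj, pattern, minsup, out)
```

On the slide-14 database with minsup = 3 this recovers, among others, all the m-related patterns of slide 18, including fcam with support 3.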


    BENEFITS OF FP-TREE STRUCTURE

    Completeness

    Preserve complete information for frequent pattern mining

    Never break a long pattern of any transaction

    Compactness

    Reduce irrelevant info: infrequent items are gone

    Items in frequency descending order: the more

    frequently occurring, the more likely to be shared

    Never be larger than the original database (not counting node-links and the count fields)


    ADVANTAGES OF PATTERN GROWTH APPROACH

    Divide-and-conquer:

    Decompose both the mining task and DB according to the frequentpatterns obtained so far

    Lead to focused search of smaller databases

    No candidate generation, no candidate test

    Compressed database: FP-tree structure

    No repeated scan of entire database

    Basic ops: counting local frequent items and building sub-FP-trees; no pattern search and matching


    EXTENSION OF PATTERN GROWTH MINING METHODOLOGY

    Mining closed frequent itemsets and max-patterns

    Mining sequential patterns

    Mining graph patterns

    Constraint-based mining of frequent patterns

    Computing iceberg data cubes with complex measures

    Pattern-growth-based clustering

    Pattern-Growth-Based Classification


    CHARM / ECLAT: Mining by Exploring Vertical Data Format

    Vertical format: t(AB) = {T11, T25, ...}

    tid-list: list of transaction ids containing an itemset

    Deriving closed patterns based on vertical intersections:

    t(X) = t(Y): X and Y always happen together

    t(X) ⊂ t(Y): a transaction having X always has Y

    Using diffsets to accelerate mining:

    Only keep track of differences of tids

    t(X) = {T1, T2, T3}, t(XY) = {T1, T3}

    Diffset(XY, X) = {T2}
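The tid-list intersection idea can be sketched as follows. This is a minimal illustration of ours: it mines frequent itemsets by recursively intersecting tid-lists, and omits the diffset optimization (storing only tid differences) for clarity:

```python
def eclat(db, minsup):
    """Frequent itemsets from the vertical data format (tid-lists).

    db: dict of transaction id -> set of items; minsup: absolute support.
    Returns {frozenset itemset: support count}.
    """
    # Build the vertical format: item -> tid-list.
    vertical = {}
    for tid, items in db.items():
        for i in items:
            vertical.setdefault(i, set()).add(tid)
    out = {}

    def extend(prefix, candidates):
        for idx, (i, tids) in enumerate(candidates):
            itemset = prefix | {i}
            out[frozenset(itemset)] = len(tids)
            # t(X u Y) = t(X) n t(Y): intersect tid-lists to grow the prefix.
            ext = []
            for j, tj in candidates[idx + 1:]:
                common = tids & tj
                if len(common) >= minsup:
                    ext.append((j, common))
            if ext:
                extend(itemset, ext)

    frequent_items = sorted(i for i in vertical if len(vertical[i]) >= minsup)
    extend(set(), [(i, vertical[i]) for i in frequent_items])
    return out
```

Support counting never rescans transactions; the length of each intersected tid-list is the support, which is exactly what diffsets shrink when tid-lists are long.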