multilevel association

A Novel Algorithm for Cross Level Frequent Pattern Mining in Multi datasets:

Upload: shoni17

Post on 09-Feb-2016

DESCRIPTION

mining frequent data in a multi-level format

TRANSCRIPT

Page 1: Multilevel Association

A Novel Algorithm for Cross Level Frequent Pattern Mining in Multi datasets:

Table of Contents:

Abstract

List of Keywords

Introduction

Literature Survey

Existing System

Drawbacks in Existing System

Proposed System

System Design

Advantage of Proposed System

Requirement Specification

Modules

Modules description

Conclusion

References

Abstract

We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. Frequent pattern mining has become one of the most popular data mining approaches for the analysis of purchasing patterns. Techniques such as Apriori and FP-Growth have typically been restricted to a single concept level. We extend our research to discover cross-level frequent patterns in multi-level environments. Unfortunately, little research attention has been paid to this area. Mining cross-level frequent patterns may lead to the discovery of patterns at different levels of the hierarchy. In this study, a transaction reduction technique with an FP-tree-based bottom-up approach is used for mining cross-level patterns. This method uses the concept of reduced support.

Introduction

This thesis discusses the theories and algorithms for the maintenance of the frequent pattern space. "Frequent patterns", also known as frequent item sets, are patterns that appear frequently in a particular dataset. Frequent patterns are defined based on a user-defined threshold, called the "support threshold". Given a dataset, we say a pattern is a frequent pattern if and only if its occurrence frequency is greater than or equal to the support threshold. We also define the collection of all frequent patterns as the "frequent pattern space" or the "space of frequent patterns".
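The definition above can be illustrated with a minimal sketch. It assumes that support is counted as an absolute number of transactions; the function names `support` and `is_frequent` and the toy dataset are hypothetical and do not come from the original report.

```python
from typing import FrozenSet, Iterable, List


def support(pattern: Iterable[str], dataset: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of the pattern."""
    items = frozenset(pattern)
    return sum(1 for transaction in dataset if items <= transaction)


def is_frequent(pattern: Iterable[str], dataset: List[FrozenSet[str]],
                min_support: int) -> bool:
    """A pattern is frequent iff its support is >= the support threshold."""
    return support(pattern, dataset) >= min_support


# Hypothetical toy dataset: one frozenset of items per transaction.
toy_dataset = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
```

With a support threshold of 3, the single item "bread" would be frequent in this toy dataset, while the pair {"bread", "milk"} would not.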

Frequent patterns are a very important type of pattern in data mining. They play an essential role in various knowledge discovery tasks, such as the discovery of association rules, correlations, causality, sequential patterns, partial periodicity, and emerging patterns. In the last decade, the discovery of frequent patterns has attracted tremendous research attention, and a large number of discovery algorithms have been proposed. The maintenance of the frequent pattern space is as crucial as the discovery of the pattern space, because data is dynamic in nature. Due to advances in data generation and collection technologies, databases are constantly updated with newly collected data. Data updates are also used in interactive data mining to gauge the impact of hypothetical changes to the data and to detect the emergence and disappearance of trends. When a database is frequently updated or modified for interactive mining, repeating the pattern discovery process from scratch causes significant computational and I/O overheads. Therefore, effective maintenance algorithms are needed to update and maintain the frequent pattern space. This thesis focuses on the maintenance of the frequent pattern space for transactional datasets. We observe that most prior works on frequent pattern maintenance are proposed as extensions of particular frequent pattern discovery algorithms or the data structures they use. Unlike the prior works, this thesis lays a theoretical foundation for the development of effective maintenance algorithms by analyzing the evolution of the frequent pattern space in response to data changes. We study the evolution of the pattern space using the concept of equivalence classes. Inspired by the evolution analysis, novel maintenance algorithms are proposed to handle various data updates.

Apriori-based algorithms

Apriori is the most influential algorithm for frequent pattern discovery, and many discovery algorithms are inspired by it. Apriori employs a "candidate-generation-verification" framework. The algorithm generates its candidate patterns using a "level-wise" search. The essential idea of the level-wise search is to iteratively enumerate the set of candidate patterns of length (k + 1) from the set of frequent patterns of length k. The support of the candidate patterns is then counted by scanning the dataset. One major drawback of Apriori is that it leads to the enumeration of a huge number of candidate patterns. For example, if a dataset has 100 items, Apriori may need to generate up to 2^100 (about 10^30) candidates. Another drawback is that it requires multiple scans of the dataset to count the support of candidate patterns. Different variations of Apriori have been proposed to address these limitations. One variation introduces a hash-based technique to reduce the number of candidate patterns. Another speeds up the support counting process by reducing the number of transactions scanned in future iterations. The idea is that a transaction that does not contain any frequent pattern of length k cannot contain any frequent pattern of length greater than k. Therefore, such transactions can be ignored in subsequent iterations.
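The level-wise search and the transaction-skipping idea described above can be sketched as follows. This is a simplified illustration, not the implementation used in the report: the function name `apriori` and the toy data are assumptions, supports are absolute counts, and no hash-based pruning is included.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent pattern (frozenset): support count}."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count the single items in one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {p: c for p, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 1
    while frequent:
        # Transaction reduction: a transaction without any frequent
        # k-pattern cannot contain a longer frequent pattern.
        transactions = [t for t in transactions
                        if any(p <= t for p in frequent)]

        # Candidate generation: join frequent k-patterns into
        # (k + 1)-patterns whose k-subsets are all frequent.
        prev = set(frequent)
        candidates = set()
        for a, b in combinations(prev, 2):
            union = a | b
            if len(union) == k + 1 and all(
                    frozenset(s) in prev for s in combinations(union, k)):
                candidates.add(union)

        # Verification: scan the (reduced) dataset to count supports.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {p: c for p, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy data.
toy_transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
]
toy_result = apriori(toy_transactions, min_support=3)
```

Each pass of the loop rescans the remaining transactions, which is the multiple-scan cost the text mentions; the reduction step only shrinks what is scanned.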

FP-tree-based algorithms:

To address the shortcomings of the candidate-generation-verification framework, FP-tree-based algorithms, which involve no candidate generation, have been proposed. FP-growth is the state-of-the-art FP-tree-based discovery algorithm. FP-growth mines frequent patterns based on a structure called the Frequent Pattern Tree (FP-tree). The FP-tree is a compact representation of all relevant frequency information in a database. Every branch of the FP-tree represents a "projected transaction" and also a candidate pattern. The nodes along each branch are stored in descending order of the support values of the corresponding items, so the leaves represent the least frequent items. Compression is achieved by building the tree in such a way that overlapping transactions share prefixes of the corresponding branches. To construct the FP-tree for a dataset and a given support threshold, the dataset is first transformed into the "projected dataset". With the FP-tree, FP-growth generates frequent patterns using a "fragment growth technique". The fragment growth technique enumerates frequent patterns based on the support information stored in the FP-tree, which effectively avoids the generation of unnecessary candidate patterns. Inspired by the idea of divide-and-conquer, the fragment growth technique decomposes the mining task into subtasks that mine frequent patterns for conditional datasets, which greatly reduces the search space.
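A minimal sketch of FP-tree construction, as described above, is given below. It covers only the tree-building step (two scans, projection, prefix sharing) and not the fragment growth mining itself. The class `FPNode`, the function `build_fp_tree`, and the alphabetical tie-break between items of equal support are assumptions made for illustration.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and its links."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    # First scan: support of every item.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in item_counts.items() if c >= min_support}

    root = FPNode(None)
    # Second scan: project each transaction onto the frequent items,
    # order them by descending support, and insert the result as a path.
    for t in transactions:
        projected = sorted((i for i in set(t) if i in frequent),
                           key=lambda i: (-frequent[i], i))
        node = root
        for item in projected:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1  # shared prefixes only increase the count
            node = child
    return root, frequent


# Hypothetical toy data.
toy_transactions = [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["b"]]
toy_root, toy_frequent = build_fp_tree(toy_transactions, min_support=2)
```

In this toy example every projected transaction starts with "b", the most frequent item, so all four transactions share a single first-level node.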

Details of the technique can be found in the FP-growth literature. FP-growth significantly outperforms both the Apriori-based and partition-based algorithms. The advantages of FP-growth are: first, the FP-tree effectively compresses and summarizes the dataset so that multiple scans of the dataset are no longer needed to obtain the support of patterns; second, the fragment growth technique ensures that no unnecessary candidate patterns are enumerated; lastly, the search task is simplified with a divide-and-conquer method. However, FP-growth, like other prefix-tree-based algorithms, still suffers from the undesirably large size of the frequent pattern space. To break this bottleneck, algorithms have been proposed to discover concise representations of the frequent pattern space.

Data mining Technology:

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

While large-scale information technology has been evolving separate

transaction and analytical systems, data mining provides the link

between the two. Data mining software analyzes relationships and

patterns in stored transaction data based on open-ended user queries.

Several types of analytical software are available: statistical, machine

learning, and neural networks. Generally, any of four types of

relationships are sought:

• Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
• Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
• Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
• Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:

• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data by application software.
• Present the data in a useful format, such as a graph or table.

Different levels of analysis are available:

• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits while CHAID segments using chi square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
• Rule induction: The extraction of useful if-then rules from data based on statistical significance.
• Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.

EXISTING SYSTEM:

The existing system uses a top-down approach. It finds the large (frequent) 1-patterns for all levels using a new method, the CCB-tree.

Drawbacks in Existing System:

1) It uses a top-down approach.

2) The algorithm cannot reduce the search space without losing patterns.

3) There is no transaction-reduction-based frequent pattern mining for a single concept level.

PROPOSED SYSTEM:

Level-Crossing:

One approach to multilevel mining would be to directly exploit the standard algorithms in this area, Apriori and FP-Growth, by iteratively applying them in a level-by-level manner to each concept level. In this paper, we introduce a new study of the discovery of frequent patterns based on the FP-tree. Our approach differs from the FP-Growth algorithm, which needs to recursively generate conditional FP-trees and therefore uses a large amount of memory.

Our approach minimizes I/O costs by applying a transaction reduction technique and using the resulting transactions in the FP-tree as input to subsequent iterations of the mining process. Our method adopts a bottom-up approach, with a leaf-to-root traversal, so as to identify frequent patterns existing between arbitrary classification levels. Our method reduces the search space without losing any patterns.

A new approach to mining frequent patterns in multi-datasets has to be considered. Work has been done on adapting approaches originally made for single-level datasets into techniques usable on multilevel datasets.

In this work, we attempt to eliminate unwanted patterns and transactions using a transaction reduction technique and use the resulting transactions in the FP-tree as input to subsequent iterations of the mining process. Our method adopts a bottom-up approach, with a leaf-to-root traversal over a single generated FP-tree, so as to identify frequent patterns existing between arbitrary classification levels. Our method reduces the I/O costs and the search space without losing any patterns.
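To make the notion of a cross-level pattern concrete, the sketch below counts the support of a pattern whose items sit at different levels of a concept hierarchy. It is a generic illustration and not the FP-tree-based method proposed here: it assumes the hierarchy is available as a child-to-parent mapping, and the names `parent_of`, `ancestors`, `extend_transaction`, and `cross_level_support`, as well as the toy taxonomy, are hypothetical.

```python
def ancestors(item, parent_of):
    """All higher-level concepts of an item, nearest first."""
    chain = []
    while item in parent_of:
        item = parent_of[item]
        chain.append(item)
    return chain


def extend_transaction(transaction, parent_of):
    """Add every ancestor of every item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item, parent_of))
    return frozenset(extended)


def cross_level_support(pattern, transactions, parent_of):
    """Count transactions that match a pattern mixing concept levels."""
    pattern = frozenset(pattern)
    return sum(1 for t in transactions
               if pattern <= extend_transaction(t, parent_of))


# Hypothetical taxonomy (child -> parent) and transactions.
toy_parent_of = {
    "skim milk": "milk", "2% milk": "milk", "milk": "food",
    "wheat bread": "bread", "white bread": "bread", "bread": "food",
}
toy_transactions = [
    {"skim milk", "wheat bread"},
    {"2% milk", "wheat bread"},
    {"skim milk", "white bread"},
]
```

A pattern such as {"milk", "wheat bread"} pairs a higher-level concept with a leaf-level item, which single-level mining would not report.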

ADVANTAGES:

1) It uses a bottom-up approach.

2) The algorithm reduces the search space without losing any patterns.

3) It provides a new algorithm for transaction-reduction-based frequent pattern mining at a single concept level.

SYSTEM DESIGN

[Block diagram. Recoverable block labels: Data set; Finding the frequent pattern tree; CCB tree construction; Reduced transaction table; FP tree generation; Frequent pattern generation; Performance evaluation; Find support and count; Apply cross level set analysis algorithm; Extraction; Apply association rule by Apriori; Delete min support count; Find frequent item set; Ordered item set.]

Requirement Specification:

Hardware Requirements:

• System : Pentium IV 2.4 GHz

• Hard Disk : 160 GB

• Monitor : 15 VGA color

• Mouse : Logitech.

• Keyboard : 110 keys enhanced

• RAM : 1 GB

Software Requirements:

• OS : Windows XP, 7

• Language : .NET

• Database : SQL Server 2005

Modules

Multilevel Association mining

Find frequent item set(Apriori algorithm)

CCB- Tree mining

FP-Tree generation

Frequent pattern generation

Performance evaluation

Modules Description

Multilevel Association mining:

In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. Such rules can support decisions about marketing activities, e.g., promotional pricing or product placements. In addition to market basket analysis, association rules are employed today in many application areas, including Web usage mining, intrusion detection, continuous production, and bioinformatics. As opposed to sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.
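Two widely used measures of interestingness are support and confidence. The sketch below computes both for a rule "antecedent implies consequent", assuming relative support (a fraction of all transactions); the function name `rule_support_confidence` and the toy data are hypothetical.

```python
def rule_support_confidence(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    count_a = sum(1 for t in transactions if a <= set(t))
    count_both = sum(1 for t in transactions if both <= set(t))
    # Support: share of all transactions containing both sides.
    support = count_both / len(transactions) if transactions else 0.0
    # Confidence: share of antecedent transactions that also
    # contain the consequent.
    confidence = count_both / count_a if count_a else 0.0
    return support, confidence


# Hypothetical toy data.
toy_transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
toy_support, toy_confidence = rule_support_confidence(
    {"bread"}, {"milk"}, toy_transactions)
```

Because item order inside a transaction is irrelevant here, transactions are treated as plain sets.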

Find frequent item set

Apriori is the most influential algorithm for frequent pattern discovery, and many discovery algorithms are inspired by it. Apriori employs a "candidate-generation-verification" framework. The algorithm generates its candidate patterns using a "level-wise" search. The essential idea of the level-wise search is to iteratively enumerate the set of candidate patterns of length (k + 1) from the set of frequent patterns of length k. Using this search, we can find the frequent item sets.

CCB-Tree mining:

The CCB-tree algorithm is used to find the multilevel frequent 1-patterns for all levels. The CCB-tree starts from the leftmost initial node and deletes the items whose support count is below the minimum support, producing the reduced transaction table.
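The report does not specify the CCB-tree structure itself, so the sketch below only illustrates the outcome described above: items below the minimum support are deleted and a reduced transaction table remains. The function name `reduce_transactions` and the toy data are assumptions, and dropping transactions that become empty is a choice made for this illustration.

```python
from collections import Counter


def reduce_transactions(transactions, min_support):
    """Delete infrequent items; return (reduced table, frequent supports)."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i for i, c in counts.items() if c >= min_support}

    reduced = []
    for t in transactions:
        kept = [i for i in t if i in frequent_items]
        if kept:  # transactions left empty are dropped entirely
            reduced.append(kept)
    return reduced, {i: counts[i] for i in frequent_items}


# Hypothetical toy data.
toy_transactions = [["a", "b", "x"], ["a", "c"], ["y"], ["a", "b"]]
toy_reduced, toy_supports = reduce_transactions(toy_transactions,
                                                min_support=2)
```

The reduced table is smaller in both rows and columns, which is what lowers the cost of building the FP-tree in the next step.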

FP-Tree generation:

An FP-tree is a compact data structure that represents the data set in tree form. Each transaction is read and then mapped onto a path in the FP-tree. This is done until all transactions have been read. Different transactions that have common subsets allow the tree to remain compact because their paths overlap.

The diagram illustrates the best-case scenario, which occurs when all transactions have exactly the same item set; the FP-tree is then only a single branch of nodes.

Frequent pattern generation:

Once the FP-tree has been built, the next phase is to generate candidate item sets and find frequent patterns. Cross-level frequent pattern mining with the bottom-up approach starts from the leaf nodes of an existing FP-tree and traverses each branch upwards until it reaches the root. We begin by scanning the tree and identifying its leaf nodes. A pointer to each leaf is then inserted into the leaf node array. We then perform a bottom-up scan from each leaf node until we reach the root. Meanwhile, each node visited is saved in a temporary buffer, which records the path traversed whenever a node with a support count is visited. Candidate generation keeps the path from the starting node.
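The leaf-to-root traversal described above can be sketched as follows. The sketch only collects the leaf node array and records each upward path in a buffer; how candidates are then derived from those paths is not detailed in the report and is left out. The class `Node`, the functions `collect_leaves` and `path_to_root`, and the hand-built toy tree are assumptions.

```python
class Node:
    """Minimal tree node with a parent pointer for upward traversal."""

    def __init__(self, item, count, parent=None):
        self.item = item
        self.count = count
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def collect_leaves(root):
    """Scan the tree once and return the leaf node array."""
    leaves, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        elif node is not root:
            leaves.append(node)
    return leaves


def path_to_root(leaf):
    """Walk upwards from a leaf, buffering (item, count) until the root."""
    path = []
    node = leaf
    while node.parent is not None:  # the root itself carries no item
        path.append((node.item, node.count))
        node = node.parent
    return path


# Hypothetical toy tree: root -> b(4) -> a(2) -> c(1), and b(4) -> c(1).
toy_root = Node(None, 0)
toy_b = Node("b", 4, toy_root)
toy_a = Node("a", 2, toy_b)
Node("c", 1, toy_a)
Node("c", 1, toy_b)
toy_paths = [path_to_root(leaf) for leaf in collect_leaves(toy_root)]
```

Only one tree is built and it is never copied, in line with the single FP-tree generation the proposed system aims for.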

Performance evaluation:

In this module we evaluate the performance of the process and present the results in a graph. The graph summarizes the results clearly and is easy for users to analyze.

LITERATURE SURVEY:

Mining frequent item sets without candidate generation

Implements

In many cases, the Apriori algorithm significantly reduces the size of candidate sets using the Apriori principle. However, it can suffer from two nontrivial costs:

(1) generating a huge number of candidate sets, and

(2) repeatedly scanning the database and checking the candidates by pattern matching.

To avoid these costs, the authors devised an FP-growth method that mines the complete set of frequent item sets without candidate generation. FP-growth works in a divide-and-conquer way. The first scan of the database derives a list of frequent items in which items are ordered by descending frequency.

Algorithm for Efficient Multilevel Association Rule Mining

Implements

Over the years, a variety of algorithms for finding frequent item sets in very large transaction databases have been developed. Because finding frequent item sets is basic to multilevel association rule mining, fast algorithms for solving this problem are needed. This paper presents an efficient version of the Apriori algorithm for mining multilevel association rules in large databases, finding the maximum frequent item sets at lower levels of abstraction. We propose a new, fast, and efficient algorithm (SC-BF Multilevel) that mines the complete frequent item sets with a single scan of the database. The new method aims to reduce execution time and increase throughput. Our proposed algorithm performs well in comparison with the general approach to multilevel association rules.

An Efficient Algorithm for Mining Multilevel Association Rule Based on Pincer Search

Implements

Discovering frequent item sets is a key problem in significant data mining applications, such as the discovery of association rules, strong rules, episodes, and minimal keys. The problem of developing models and algorithms for multilevel association mining poses new challenges for mathematics and computer science. In this paper, we present a model for mining multilevel association rules that satisfies a different minimum support at each level. We employ Pincer Search concepts, a multilevel taxonomy, and different minimum supports to find multilevel association rules in a given transaction data set. This search is used only for maintaining and updating a new data structure. It is used to prune early the candidates that would normally be encountered in the top-down search. A main characteristic of the algorithm is that it does not require explicit examination of every frequent item set. An example is also given to demonstrate that the proposed mining algorithm can derive multiple-level association rules under different supports in a simple and effective manner.

Fast Algorithm for Mining Multi-Level Association Rules in Large Databases

Implements

Association rule mining finds interesting associations among a large set of data items. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining association rules from their databases. The discovery of interesting association relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalogue design, cross-marketing, and loss-leader analysis.

An Efficient Approach for Incremental Association Rule Mining

Implements

We study the issue of maintaining association rules in a large database of sales transactions. The maintenance of association rules can be mapped into the problem of maintaining large itemsets in the database. Because the mining of association rules is time-consuming, we need an efficient approach to maintain the large itemsets when the database is updated. In this paper, we present efficient approaches to solve the problem. Our approaches store the itemsets that are not large at present but may become large itemsets after updating the database, so that the cost of processing the updated database can be reduced. Moreover, we discuss the cases where the large itemsets can be obtained without scanning the original database. Experimental results show that our algorithms outperform other algorithms, especially when the original database need not be scanned.

Conclusion

Transaction databases in many applications contain data that has built-in hierarchy information. In such databases, users may be interested in finding associations among items only at the same level; we extended the scope of study to mining level-crossing association rules from large databases. A transaction-reduction-based method is used to eliminate unwanted candidates and transactions, and the resulting transactions are used in the FP-tree as input to subsequent iterations of the mining process. We adopted a bottom-up approach, with a leaf-to-root traversal over a single generated FP-tree, so as to identify frequent patterns existing between arbitrary classification levels. Our method reduces the I/O costs and the search space without losing any patterns. Performance evaluation demonstrates the viability of our new method. In future, an efficient algorithm can be developed to reduce the redundancy in cross-level association rules.

References

[1] T. Eavis and Xi Zheng, "Multi-Level Frequent Pattern Mining", Springer-Verlag Berlin Heidelberg, 2009, pp. 369-383.

[2] K. Duraiswamy and B. Jayanthi, "A Novel Preprocessing Algorithm for Frequent Pattern Mining in Multidatasets", International Journal of Data Engineering, Vol. 2, No. 3, Aug 2011.

[3] J. Han and Y. Fu, "Discovery of Multiple-Level Association Rules from Large Databases", in Proceedings of the 21st Very Large Data Bases Conference, Morgan Kaufmann, pp. 420-431, 1995.

[4] Yinbo Wan, Yong Liang, and Liya Ding, "Mining Multilevel Association Rules from Primitive Frequent Item sets", Journal of Macau University of Science and Technology, Vol. 3, No. 1, 2009.

[5] R. S. Thakur, R. C. Jain, and K. R. Pardasani, "Mining Level-Crossing Association Rules from Large Databases", Journal of Computer Science 2(1), pp. 76-81, 2006.

[6] R. E. Thevar and R. Krishnamoorthy, "A New Approach of Modified Transaction Reduction Algorithm for Mining Frequent Item set", Proceedings of IEEE Workshop on Data Mining and Artificial Intelligence, 2008.

[7] Rajkumar N., Karthik M. R., and Sivanada S. N., "Fast Algorithm for Mining Multilevel Association Rules", IEEE Trans. Knowledge and Data Engg., Vol. 2, pp. 688-692, 2003.

[8] Pratima Gautham and K. R. Pardasani, "Algorithm for Efficient Multilevel Association Rule Mining", International Journal of Computer Science and Engineering, Vol. 2, pp. 1700-1704, 2010.