sowmya review

7/29/2019 Sowmya Review


A New Two-Phase Sampling Algorithm for Discovering Association Rules



Data mining techniques are widely used across many applications. Data mining extracts novel and useful knowledge from large repositories of data and has become an effective means of analysis and decision-making in corporations. Sharing data for mining can bring many advantages for research and business collaboration, and data mining is becoming an increasingly important tool for transforming data into information.

The volume of electronically accessible data in warehouses and on the internet is growing ever faster, so the scalability of mining is a major concern. Classical mining algorithms require one or more computationally intensive passes over the entire database and can take hours or even days to execute, and the problem will become worse in the future. To avoid this problem, using a sample of the data as a synopsis is a popular technique that scales well as the data grow.

In data mining, association rule mining is a popular and well-researched method for discovering relations between variables in a large database. The information can be used as the basis for decisions about marketing activities such as market basket analysis and product placement.

This project uses the Apriori, SRS (Simple Random Sampling) and FAST (Finding Associations from Sampled Transactions) algorithms to discover association rules in a large database. By applying all three algorithms to the same database, the user can identify the strong and weak rules of the dataset and compare the algorithms by running time and accuracy to find the most efficient one for discovering association rules.
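As a rough illustration of what "strong" and "weak" rules mean here, the sketch below computes support and confidence for a rule and labels it against two thresholds. The function names, the sample baskets and the threshold values are assumptions made for illustration; the slides do not state which thresholds the project uses.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= frozenset(t))
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and C together) / support(A); 0.0 when A never occurs."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    sup_a = support(a, transactions)
    return support(both, transactions) / sup_a if sup_a > 0 else 0.0


def classify_rule(antecedent, consequent, transactions,
                  min_support=0.5, min_confidence=0.6):
    """Label a rule 'strong' if it meets both thresholds, else 'weak'.

    The default thresholds are illustrative assumptions only.
    """
    sup = support(frozenset(antecedent) | frozenset(consequent), transactions)
    conf = confidence(antecedent, consequent, transactions)
    return "strong" if sup >= min_support and conf >= min_confidence else "weak"


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

label = classify_rule({"bread"}, {"milk"}, baskets)
```

With these baskets, the rule bread -> milk has support 2/4 and confidence 2/3, so it counts as strong under the assumed thresholds and would become weak if the confidence threshold were raised to 0.7.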



HARDWARE CONFIGURATION:

Processor : Pentium IV

Processor Speed : 1.7 GHz

Memory (RAM) : 256 MB

Hard Disk : 10 GB

Floppy Drive : 3.5-inch 1.44 MB Drive

Monitor : Samsung Color Monitor

Keyboard : 104 keys Intel Keyboard

Mouse : Intel Optical Mouse

SOFTWARE CONFIGURATION

Operating System : Windows XP

Front End Tool : Microsoft Visual Basic .NET 2008

Back End Tool : Microsoft SQL Server 2000



EXISTING SYSTEM:

A study of the existing system revealed its limitations and so paved the way for the proposed system. Finding relationships between variables in a large database is not an easy problem.

LIMITATION OF EXISTING SYSTEM:

Limited amount of memory

Needs a complete listing of the database

Data may be scattered and poorly accessible

Requires many database scans

Expensive

Lossy compressed synopsis (sketch) of data

Scalability of mining algorithm is a major concern



PROPOSED SYSTEM:

The proposed system is motivated by the need to improve the existing system and aims to overcome its drawbacks. An important aspect of the new system is that it should be easy to change: the user should be able to make changes at any time without difficulty. The proposed system discovers association rules using the Apriori, Simple Random Sampling, FAST and EASE algorithms. It is developed using Visual Basic .NET as the front end and MS SQL Server as the back end.

FEATURES OF PROPOSED SYSTEM:

Uses the large itemset property

Saves memory space

Easily implemented

Reduced costs

Reduced field time

Increases accuracy

Provides security

Excellent user friendliness

Simple

Errors can be easily measured



Modules Description :

This project is based on the FAST, EASE, Apriori and Simple Random Sampling algorithms for discovering association rules in a large database.

Apriori Algorithm :

The Apriori algorithm is a classic algorithm for learning association rules. It is designed to operate on databases containing transactions.
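The level-wise idea behind Apriori can be sketched as follows. This is a minimal, unoptimized illustration in Python rather than the project's Visual Basic .NET code; the function name `apriori`, the sample baskets and the `min_support` value are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting `min_support`."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    if n == 0:
        return {}

    def frequent(candidates):
        # One pass over the database per level to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent single items.
    level = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent_itemsets = apriori(baskets, min_support=0.5)
```

Each level needs another pass over all transactions, which is the cost that the sampling-based algorithms in this project try to reduce.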

Simple Random Sampling :

Simple Random Sampling is considered separately: it draws a random sample of the database and checks support and confidence on that sample in order to find the best rule. Simple Random Sampling can make sampling a viable means of attaining both high performance and acceptably accurate results.
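A simple random sample and a support estimate taken from it might look like the sketch below. The helper names, the sample size and the fixed seed are illustrative assumptions; sampling is done without replacement using the standard library.

```python
import random


def simple_random_sample(transactions, sample_size, seed=None):
    """Draw up to `sample_size` transactions uniformly, without replacement."""
    rng = random.Random(seed)
    pool = list(transactions)
    return rng.sample(pool, min(sample_size, len(pool)))


def estimate_support(itemset, sample):
    """Support of `itemset` measured on the sample only."""
    if not sample:
        return 0.0
    itemset = frozenset(itemset)
    return sum(1 for t in sample if itemset <= frozenset(t)) / len(sample)


# Hypothetical database of ten transactions.
database = [frozenset(t) for t in (
    {"bread", "milk"}, {"bread"}, {"milk"}, {"bread", "butter"}, {"butter"},
    {"bread", "milk", "butter"}, {"milk", "butter"}, {"bread"}, {"milk"},
    {"bread", "milk"},
)]

sample = simple_random_sample(database, 4, seed=42)
estimated = estimate_support({"bread"}, sample)
```

The estimate generally differs from the true support, and the gap tends to shrink as the sample grows; mining the sample instead of the full database trades that accuracy for speed.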



FAST Algorithm :

FAST (Finding Associations from Sampled Transactions) is a refined sampling-based mining algorithm that is distinguished from prior algorithms by its novel two-phase approach to sample collection. In Phase I, a large sample is collected to quickly and accurately estimate the support of each item in the database. In Phase II, a small final sample is obtained by excluding outlier transactions in such a manner that the support of each item in the final sample is as close as possible to the estimated support of the item in the entire database. Indeed, our numerical experiments indicate that, for any fixed computing budget, FAST identifies more frequent itemsets and fewer false itemsets than other sampling-based algorithms. FAST can identify most frequent itemsets in a database at an overall cost that is much lower than that of classical algorithms.

In this project, A New Two-Phase Sampling Algorithm for Discovering Association Rules, the user can compare the running time and the variation in results between the algorithms. First, the Apriori algorithm is applied to the large dataset to compute support and confidence and so find the strong and weak rules. Next, a random sample of the dataset is drawn and the strong and weak rules are found from the support and confidence in that sample.

Finally, the FAST algorithm is applied to the large dataset to find the strong and weak rules based on support and confidence. By applying the three algorithms, the user can measure the time and accuracy of each and identify the best algorithm from the time difference.
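The two phases of FAST can be sketched roughly as below. This is a deliberately naive illustration, not the published algorithm: it assumes a sum of absolute differences between 1-item supports as the distance and removes one transaction at a time, whereas a practical implementation would be far more economical. All names and sizes are hypothetical.

```python
import random
from collections import Counter


def item_supports(transactions):
    """Support of every single item, as {item: fraction of transactions}."""
    n = len(transactions)
    if n == 0:
        return {}
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c / n for item, c in counts.items()}


def distance(supports_a, supports_b):
    """Sum of absolute differences between two sets of item supports."""
    items = set(supports_a) | set(supports_b)
    return sum(abs(supports_a.get(i, 0.0) - supports_b.get(i, 0.0)) for i in items)


def fast_trim(database, phase1_size, final_size, seed=None):
    """Return (final_sample, estimated_supports) using a greedy trimming pass."""
    rng = random.Random(seed)
    # Phase I: a large simple random sample estimates each item's support.
    phase1 = rng.sample(list(database), min(phase1_size, len(database)))
    target = item_supports(phase1)
    # Phase II: repeatedly drop the transaction whose removal leaves the
    # sample's item supports closest to the Phase I estimates.
    sample = list(phase1)
    while len(sample) > final_size:
        best = min(
            range(len(sample)),
            key=lambda i: distance(item_supports(sample[:i] + sample[i + 1:]), target),
        )
        sample.pop(best)
    return sample, target


# Hypothetical database: the item "b" is rare, so dropping its only
# transaction would distort the sample more than dropping a plain {"a"}.
database = [frozenset({"a"})] * 4 + [frozenset({"a", "b"})]
final_sample, estimated = fast_trim(database, phase1_size=5, final_size=4, seed=0)
```

In this toy run the trimming step keeps the single transaction containing "b", because removing it would push the sample's support for "b" from 0.2 to 0, further from the estimate than the 0.25 that results from removing a plain {"a"}.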



EASE Algorithm :

In this paper we introduce a novel data-reduction method, called EASE (Epsilon Approximation: Sampling Enabled), that is especially designed for categorical count data. This algorithm is an outgrowth of earlier work by Chen, et al. on the FAST data-reduction method. Both EASE and FAST start with a relatively large simple random sample of transactions and deterministically trim the sample to create a final subsample whose "distance" from the complete database is as small as possible. For reasons of computational efficiency, both algorithms treat a subsample as "close" to the original database if the high-level aggregates of the subsample, normalized by the total number of data points, are "close" to the normalized aggregates in the database. These normalized aggregates typically correspond to 1-itemset or 2-itemset supports in the association-rule setting or, in the setting of a contingency table, relative marginal or cell frequencies.
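One way to make "close" concrete is to compare the 1-itemset and 2-itemset supports of a subsample with those of the database. The sketch below uses the largest absolute difference as the measure; that choice, and the function names, are assumptions made for illustration rather than the exact criterion used by EASE.

```python
from itertools import combinations


def normalized_aggregates(transactions, max_size=2):
    """Supports of all itemsets with 1..max_size items, as {tuple: fraction}."""
    n = len(transactions)
    if n == 0:
        return {}
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    # Normalize by the number of data points so samples of any size compare.
    return {combo: c / n for combo, c in counts.items()}


def discrepancy(subsample, database, max_size=2):
    """Largest absolute gap between subsample and database supports."""
    sub = normalized_aggregates(subsample, max_size)
    full = normalized_aggregates(database, max_size)
    keys = set(sub) | set(full)
    return max((abs(sub.get(k, 0.0) - full.get(k, 0.0)) for k in keys), default=0.0)


# Hypothetical data: the subsample over-represents "a" and under-represents "b".
database = [{"a", "b"}, {"a"}, {"b"}, {"a", "b"}]
subsample = [{"a", "b"}, {"a"}]
gap = discrepancy(subsample, database)
```

Here the supports of "a" and "b" each differ by 0.25 while the support of the pair matches exactly, so the gap is 0.25; a subsample would count as close when this value falls below a chosen tolerance.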



Apply EASE Algorithm

Highlight with Blue and Red Color



TABLE NAME : Dataset_master | Primary Key : DS_ID

COLUMN NAME     DATATYPE    DESCRIPTION
DS_ID           Numeric     Dataset Identification
DS_TRANS        Text        Dataset Transaction data



TABLE NAME : Result_analysis | Primary Key : TRAN_NO

COLUMN NAME     DATATYPE    DESCRIPTION
TRAN_NO         Numeric     Transaction Number
TYPE            Text        Transaction Type
SNO             Numeric     Serial Number
STARTED_TIME    Datetime    Started Time
ELAPSED_TIME    Datetime    Elapsed time
RULES           Text        Rules
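A row for the Result_analysis table could be produced by timing one algorithm run, as in the sketch below. The function `run_and_record`, the placeholder miner and the rule text are hypothetical. The elapsed time is kept as seconds in a float here, whereas the table stores it as Datetime, and nothing is written to SQL Server.

```python
import time
from datetime import datetime


def run_and_record(tran_no, sno, algo_type, algorithm, *args):
    """Run `algorithm(*args)` and return a dict shaped like a Result_analysis row."""
    started = datetime.now()
    t0 = time.perf_counter()
    rules = algorithm(*args)
    elapsed = time.perf_counter() - t0
    return {
        "TRAN_NO": tran_no,
        "TYPE": algo_type,        # e.g. "APRIORI", "SRS", "FAST" (assumed labels)
        "SNO": sno,
        "STARTED_TIME": started,
        "ELAPSED_TIME": elapsed,  # seconds as a float (assumption, see lead-in)
        "RULES": "; ".join(rules),
    }


def placeholder_miner(transactions):
    """Stand-in for a real mining routine; returns a fixed, made-up rule."""
    return ["bread -> milk"]


row = run_and_record(1, 1, "APRIORI", placeholder_miner, [])
```

Recording one such row per algorithm over the same dataset gives the elapsed times needed to compare the algorithms.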



APRIORI



FINDING RULES



SIMPLE RANDOM SAMPLE



FINDING RULES



FAST TESTING



FINDING RULES



APPLY EASE ALGORITHM



RESULT ANALYSIS

