Techniques and Technologies



    1. Introduction

Big data, a term coined by Roger Magoulas of O'Reilly Media in 2005 [1], refers to massive data sets with a larger, more varied, and more complex structure, which pose challenges in storing, analyzing, and visualizing the data to extract meaningful results. Big data analytics is the process of researching massive amounts of data to reveal hidden patterns and correlations.

Big data is generated from various sources such as astronomy, atmospheric science, genomics, biogeochemical and biological science and research, life sciences, medical records, scientific research, government, natural disaster and resource management, military surveillance, the private sector, financial services, retail, social networks, web logs, text, documents, photography, audio, video, click streams, search indexing, call detail records, POS information, RFID, mobile phones, sensor networks, and telecommunications [2].

    2. Overview of Big Data

    2.1 Benefits

The benefits of big data in various fields include [4]: better-targeted marketing, more direct business insights, customer-based segmentation, recognition of sales and market opportunities, automated decision making, definitions of customer behaviors, greater return on investment, quantification of risks and market trends, comprehension of business changes, better planning and forecasting, identification of consumer behavior, increased production yield, predictive analytics on traffic flows, and identification of threats from video, audio, and data feeds.

2.2 Potential of Big Data

The McKinsey Global Institute identified the potential of big data in the following main areas [3].

Healthcare: It has three main pools of big data.
(1) Clinical data: optimal treatment pathways, computerized physician order entry, transparency about medical data, remote patient monitoring, and advanced analytics applied to patient profiles.
(2) Pharmaceutical R&D data: predictive modelling for new drugs, suggesting trial sites with large numbers of potentially eligible patients and strong track records, pharmacovigilance (discovering adverse effects), and developing personalized medicine.
(3) Activity (claims) and cost data: automated systems (e.g., machine learning techniques such as neural networks) for fraud detection and for checking the accuracy and consistency of payors' claims, based on real-world patient outcomes data to arrive at fair economic compensation.

Public sector: In the public sector the five main categories for big data are:
(1) Creating transparency: making data more accessible.


(2) Enabling experimentation to discover needs, expose variability, and improve performance.
(3) Segmenting populations to customize actions.
(4) Replacing/supporting human decision making with automated algorithms.
(5) Innovating new business models, products, and services with big data.

Retail: Here, the five main categories are:
(1) Marketing: cross-selling, location-based marketing, customer behavioural segmentation, sentiment analysis, and integrated promotion and pricing.
(2) Merchandising: placement and design optimization, price optimization.
(3) Operations: performance transparency, optimization of labour inputs, automated time and attendance tracking, and improved labour scheduling.
(4) Supply chain: stock forecasting by combining multiple datasets such as sales histories, weather predictions, and seasonal sales cycles.
(5) New business models: price comparison services, web-based markets.

Manufacturing:
(1) Research and development and product design.
(2) Product lifecycle management.
(3) Design to value.
(4) Open innovation.
(5) Supply chain.
(6) Production: digital factory, sensor-driven operations.

Personal location data: smart routing, geo-targeted advertising or emergency response, urban planning, and new business models.

Social network analysis: understanding user intelligence for more targeted advertising, marketing campaigns and capacity planning, customer behavior and buying patterns, and sentiment analytics.

2.3 Challenges and Obstacles

The major hurdles for the implementation of big data analytics are [4]:
(1) Data representation: Data representation aims to make data more meaningful for computer analysis and user interpretation. Many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility.
(2) Redundancy reduction and data compression: These are effective in reducing the indirect cost of the entire system, on the premise that the potential values of the data are not affected.
(3) Data life cycle management: A data importance principle related to analytical value should be developed to decide which data shall be stored and which data shall be discarded.
(4) Data confidentiality: A transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information such as credit card numbers.


[…] creation, and availability for access and delivery. It includes web site response time, inventory availability analysis, transaction execution, order tracking updates, and product/service delivery.

3. Big Data Techniques and Technologies

3.1 Big data techniques

1. A/B testing [3]
Method: A control group is compared with a variety of test groups in order to determine what changes will improve a given objective variable.
Example: Determining what copy text, layouts, images, or colors will improve conversion rates on an e-commerce web site (see the code sketch after this list).

2. Association rule learning [5]
Method: A set of techniques to discover interesting relationships, i.e. "association rules", among variables in large databases.
Example: Market basket analysis, in which a retailer can determine which products are frequently bought together and use this information for marketing.

3. Data fusion and data integration [6]
Method: A set of techniques that integrate and analyze data from multiple sources in order to develop insights in ways that are more efficient and potentially more accurate than if they were developed by analyzing a single source of data.
Example: Data from social media, analyzed by natural language processing, can be combined with real-time sales data in order to determine what effect a marketing campaign is having on customer sentiment and purchasing behavior.

4. Data mining [7]
Method: A set of techniques to extract patterns from large datasets by combining methods from statistics and machine learning with database management. These techniques include cluster analysis, classification, and regression.
Example: Mining customer data to determine the segments most likely to respond to an offer, or mining human resources data to identify the characteristics of the most successful employees.

5. Natural language processing [8]
Method: Uses computer algorithms to analyze human (natural) language.
Example: Using sentiment analysis on social media to determine how prospective customers are reacting to a branding campaign.

6. Predictive modeling [9]
Method: A mathematical model is created or chosen to best predict the probability of an outcome.
Example: Estimating the likelihood that a customer can be cross-sold another product.

7. Spatial analysis [10]
Method: Techniques that analyze the topological, geometric, or geographic properties encoded in a data set.
Example: How is consumer willingness to purchase a product correlated with location? How would a manufacturing supply chain network perform with sites in different locations?
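To make technique 1 concrete, the following is a minimal A/B-testing sketch in Python. The conversion counts and the choice of a two-proportion z-test with a normal approximation are illustrative assumptions, not a method prescribed here.

```python
# Minimal A/B-test sketch: compare the conversion rate of a control layout (A)
# against a test layout (B) with a two-proportion z-test. All numbers are
# hypothetical; a real test would also fix the sample size in advance.
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Return (z statistic, two-sided p-value) for the rate difference."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 200/5000 conversions on layout A vs. 260/5000 on layout B (hypothetical).
z, p = ab_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")  # a small p suggests B genuinely converts better
```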

    3.2 Big data technologies

1. Big Table [11]
Overview: Bigtable is a distributed storage system for managing structured data at Google. It can reliably scale to petabytes of data and thousands of machines.
Application: More than 60 Google products, including Google Earth, Google Finance, Google web indexing, Orkut, and Google Analytics.

2. Cassandra [12]
Overview: Cassandra is a massively scalable open source NoSQL database. Cassandra has a masterless "ring" design that is elegant, easy to set up, and easy to maintain. Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure.
Application: Accenture, eBay, Netflix, GoDaddy, Instagram, Reddit, Yahoo! Japan, NASA.

3. Hadoop [13]
Overview: Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Application: Amazon, AOL, Facebook, IBM, New York Times, Yahoo!, Microsoft, Google.

4. Association Rule

An association is a rule of the form LHS → RHS. The goal of association rule discovery is to find associations among items from a set of transactions, each of which contains a set of items [5]. Generally, the algorithm finds a subset of the association rules that satisfy certain constraints:

(1) Minimum support: The support of a rule is defined as the support of the item-set consisting of both the LHS and the RHS. The support of an item-set is the percentage of transactions in the transaction set that contain the item-set. An item-set with a support higher than a given minimum support is called a frequent item-set.

(2) Minimum confidence: It is the minimum acceptable ratio of the support of the rule to the support of the LHS.
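As a small illustration of these two measures, the following Python sketch computes support and confidence over a toy transaction set; the item names and transactions are invented for illustration.

```python
# Support and confidence as defined above, over a toy list of transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter"},
]

def support(itemset, transactions):
    # fraction of transactions containing every item of the item-set
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # support(LHS ∪ RHS) / support(LHS)
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"bread", "butter"}, transactions))       # 0.5
print(confidence({"bread"}, {"butter"}, transactions))  # 2/3 ≈ 0.667
```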


Most association rule algorithms generate association rules in two steps:
(1) generate all frequent item-sets;
(2) construct all rules using these item-sets.

Fig. 4.1. General block diagram of association rule mining.

4.1 Association Rule in Big Data

It has been experimentally demonstrated that for support levels that generate fewer than 100,000 rules (a very conservative upper bound for humans to sift through, even after pruning uninteresting rules), Apriori finishes on all datasets in less than 1 minute. For support levels that generate fewer than 1,000,000 rules, which are sufficient for prediction purposes where data is loaded into RAM, Apriori finishes processing in less than 10 minutes [14].

4.2 Association Rule Algorithms

4.2.1 Apriori Algorithm

The Apriori algorithm finds frequent item-sets from databases by iteration. In each iteration i the algorithm determines the set of frequent patterns with i items, and this set is used to generate the set of candidate item-sets of the next iteration. The iterations are repeated until no candidate patterns can be discovered. It uses a bottom-up approach, where frequent subsets are extended one item at a time. The input datasets are treated as transactions composed of items, and the output of Apriori is a set of rules describing the links between these items [15].

Apriori is an algorithm for finding frequent item-sets using candidate generation. Given a minimum required support S as the interestingness criterion [18]:


(1) Search for all individual elements (1-element item-sets) that have a minimum support of S.
(2) From the results of the previous search for i-element item-sets, search for all (i+1)-element item-sets that have a minimum support of S. This becomes the set of all frequent (i+1)-element item-sets that are interesting.
(3) Repeat step 2 until the maximum item-set size is reached.

Association rules are of the form A → B, where A → B is different from B → A. A → B implies that if a customer purchases item A then he also purchases item B. For association rule mining, two threshold values are required:

(1) Minimum support: Support is the percentage of the population which satisfies the rule; in other words, the support of a rule R is the ratio of the number of occurrences of R to all occurrences of all rules. The support of an association pattern is the percentage of task-relevant data transactions for which the pattern is true:

Support(A → B) = (number of tuples with both A and B) / (total number of tuples)

(2) Minimum confidence: The confidence of a rule A → B is the ratio of the number of occurrences of B given A among all occurrences given A. Confidence is a measure of the certainty or trustworthiness associated with each discovered pattern A → B:

Confidence(A → B) = (number of tuples with both A and B) / (number of tuples with A)

Association rules are generated by the following method:
(1) Use Apriori to generate item-sets of different sizes.
(2) At each iteration, divide each frequent item-set X into two parts, an antecedent (LHS) and a consequent (RHS); this represents a rule of the form LHS → RHS.
(3) Discard all rules whose confidence is less than the minimum confidence.
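A compact Python sketch of this procedure follows: level-wise frequent item-set generation, then splitting each frequent item-set into LHS → RHS rules. The candidate-join shortcut below is a simplification of the full Apriori join-and-prune step, not a faithful reproduction of it.

```python
# Apriori sketch: level-wise frequent item-set mining plus rule generation.
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset(item-set): support} for all frequent item-sets."""
    n = len(transactions)
    level = {frozenset([i]) for t in transactions for i in t}
    freq = {}
    while level:
        # count support of the current candidate level in one pass
        counts = {c: sum(c <= t for t in transactions) for c in level}
        current = {c: k / n for c, k in counts.items() if k / n >= minsup}
        freq.update(current)
        keys = list(current)
        # join step: combine frequent i-item-sets into (i+1)-item candidates
        level = {a | b for a in keys for b in keys if len(a | b) == len(a) + 1}
    return freq

def rules(freq, minconf):
    """Split each frequent item-set into antecedent (LHS) and consequent (RHS)."""
    out = []
    for itemset in freq:
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = freq[itemset] / freq[lhs]
                if conf >= minconf:
                    out.append((set(lhs), set(itemset - lhs), conf))
    return out
```

Because every subset of a frequent item-set is itself frequent (the Apriori property), `freq[lhs]` is always available when a rule's confidence is computed.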

4.2.2 FP-Growth Algorithm

FP-growth generates all frequent item-sets satisfying a given minimum support by growing a frequent-pattern tree (FP-tree) structure that stores compressed information about the frequent patterns. In this way, FP-growth can avoid repeated database scans and also avoid the generation of a large number of candidate item-sets. FP-growth takes transactional data in the form of one row for each single complete transaction. Implementations of FP-growth only generate the frequent item-sets, not the association rules [16]. The mining task as well as the database are decomposed using a divide-and-conquer approach, and a fragment-pattern method is used to avoid the costly process of candidate generation and testing required by the Apriori algorithm [15].

A frequent-pattern tree is a structure consisting of [17]:
(1) one root labeled as "null",
(2) a set of item-prefix subtrees as the children of the root, and
(3) a frequent-item-header table.

Item-prefix subtrees: Each node in an item-prefix subtree consists of three fields: item-name, count, and node-link.
(1) Item-name: registers which item this node represents.
(2) Count: registers the number of transactions represented by the portion of the path reaching this node.
(3) Node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.

Frequent-item-header table: Each entry consists of two fields:
(1) item-name, and
(2) head of node-link (a pointer to the first node in the FP-tree carrying the item-name).

Fig. 4.2. FP-tree structure.

Algorithm for FP-tree construction:

Input: a transaction database DB and a minimum support threshold ξ.
Output: FP-tree, the frequent-pattern tree of DB.

Steps:
(1) Scan the transaction database DB once. Collect F, the set of frequent items, and the support of each frequent item, then sort F in support-descending order.
(2) Create the root of the FP-tree, T, and label it "null". For each transaction Trans in DB, select the frequent items in Trans, sort them according to the order of F, and insert them along a path from the root, as follows.
(3) For each item being inserted at the current node T: if T has a child N carrying that item-name, increment N's count by 1; otherwise create a new node N with its count initialized to 1, its parent link pointing to T, and its node-link linked to the nodes with the same item-name via the node-link structure.
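The following Python sketch mirrors this construction; the class layout is an assumption based on the node fields listed above, not code from the cited papers.

```python
# Minimal FP-tree construction sketch (item-name, count, node-link fields).
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item-name this node represents
        self.count = 1          # transactions sharing the path to this node
        self.parent = parent
        self.children = {}      # item-name -> child FPNode
        self.node_link = None   # next node carrying the same item-name

def build_fp_tree(transactions, minsup_count):
    # Pass 1: collect F, the frequent items, with their supports.
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= minsup_count}
    header = {i: None for i in freq}     # frequent-item-header table
    root = FPNode(None, None)            # the root labeled "null"
    # Pass 2: insert each transaction, items sorted by descending support.
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: -freq[i])
        node = root
        for i in items:
            if i in node.children:
                node.children[i].count += 1   # shared prefix: bump the count
            else:
                child = FPNode(i, node)
                node.children[i] = child
                child.node_link = header[i]   # thread into the header table
                header[i] = child
            node = node.children[i]
    return root, header
```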

    FP-growth is about an order of

    magnitude faster than Apriori, especially

    when the data set is dense (containing many

     patterns) and/or when the frequent patterns

    are long.

    4.2.3 Charm

    Charm is an algorithm for generating

    closed frequent item-sets for association

    rules from transactional data. The closed

    frequent item-sets are the smallest

    representative subset of frequent item-sets

    without loss of information. Charm takes

    transactional data in the form of one row for

    each single complete transaction. [16]

4.2.4 Magnum Opus

The main unique technique used in Magnum Opus is its search algorithm based on OPUS, a systematic search method with pruning. It considers the whole search space, but during the search it effectively prunes a large area of the search space without missing search targets, provided that the targets can be measured using certain criteria [16].

4.3 Real-Life Applications of Association Rules

Government sector: researchers at King's College London [19].
Problem statement: fraud at Consignia, the UK's Post Office group.
Method applied: "if…then" association rules, e.g. the normal-behavior rule "IF time < 1200 AND item = stamps THEN $2 < cost < $4."
Outcome: detectors that successfully spot abnormal transactions. The detectors also copy themselves, so CIFD adapts itself to create detectors that correspond to the most prevalent patterns of fraud.

Government sector [20].
Problem statement: issues concerning the accessibility of an urban area.
Method applied: spatial association rule mining applied to geo-referenced U.K. census data of 1991.
Outcome: helped transportation planning in the area near the local Stepping Hill Hospital.

Health care sector [21].
Problem statement: anomaly detection and classification in breast cancer.
Method applied: in training, the Apriori algorithm was applied and association rules were extracted; the support was set to 10% and the confidence to 0%.
Outcome: the success rate of the classifier was 69.11%, and the time required for training was much less than for a neural network.

Retail sector [22].
Problem statement: purchasing behavior of customers.
Method applied: on a dataset of 353,421 records from 1,903 households, about 1,022,812 association rules were generated for promotion sensitivity analysis, i.e., analysis of customer responses to various types of promotions, including advertisements, coupons, and various types of discounts.
Outcome: in a duration of 1.5 hours about 2.6% of the discovered rules were accepted and the rest were rejected, reducing the total from 537 rules per household to about 14 rules per household.

Telecom sector [23].
Problem statement: which country pairs, triples, or quadruples customers are currently calling.
Method applied: association rules, treating the top-k country item-set as a market basket for each account and exploiting the temporal nature of the data by using traffic from the last month as a baseline for the current month.
Outcome: successful in detecting high-rate fraud-call trends associated with adult entertainment services that move from country to country through time.

Manufacturing sector: VAM Drilling industries, France [24].
Problem statement: setting up a system which provides results identical to human observation concerning performance and dysfunctions during forging.
Method applied: RuleGrowth, which mines sequential rules by FP-growth, varying the minimum support and minimum confidence parameters.
Outcome: found the main dysfunction responsible for delay; found that the generator is the cause of exceeding the maximum time in the starting phase; the third major problem was the lack of effectiveness of metal strippers.

    5. Big Table [11]

5.1 Data Model

Data is organized into three dimensions: rows, columns, and timestamps. We refer to the storage referenced by a particular row key, column key, and timestamp as a cell. In the Webtable example, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they are fetched. Rows with consecutive keys are grouped into tablets.

Fig. 5.1. Bigtable architecture.
Fig. 5.2. Data model for Bigtable.
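As a toy illustration of this (row key, column key, timestamp) → cell map, consider the Python sketch below; a real Bigtable keeps the map sorted, sparse, and distributed across tablets, so the in-memory dictionary is only a stand-in.

```python
# Toy model of Bigtable's three-dimensional map: (row, column, timestamp) -> value.
webtable = {}

def put(row, column, timestamp, value):
    webtable[(row, column, timestamp)] = value

def get_latest(row, column):
    """Return the most recent cell version for a (row, column) pair."""
    versions = [(ts, v) for (r, c, ts), v in webtable.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

# As in the Webtable example, row keys are URLs and page contents live in
# the "contents:" column, versioned by fetch timestamp.
put("com.cnn.www", "contents:", 3, "<html>... (version at t=3)")
put("com.cnn.www", "contents:", 5, "<html>... (version at t=5)")
print(get_latest("com.cnn.www", "contents:"))  # prints the t=5 version
```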

5.2 Building Blocks

Bigtable depends on a Google cluster management system for scheduling jobs, managing resources on shared machines, monitoring machine status, and dealing with machine failures.

The Google SSTable immutable-file format is used internally to store Bigtable data files. An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings.

Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data; to discover tablet servers and finalize tablet server deaths; and to store Bigtable schemas. Chubby is a distributed lock service. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serves requests. The service is live when a majority of the replicas are running and can communicate with each other.

5.3 Bigtable Implementation

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers.


The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage-collecting files. In addition, it handles schema changes such as table and column family creations and deletions. Each tablet server manages a set of tablets. The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large. A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all of the data associated with a row range.

5.3.1 Tablet Location

Bigtable uses a three-level hierarchy. The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the locations of all of the tablets of a special METADATA table, and each METADATA tablet contains the locations of a set of user tablets. Secondary information, such as a log of all events pertaining to each tablet (for example, when a server begins serving it), is also stored in METADATA; this information is helpful for debugging and performance analysis.

Fig. 5.3. Tablet location hierarchy.
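A hedged sketch of this three-level lookup follows; the dictionary-backed "tablets", the key names, and the simplified level-2 step are illustrative assumptions standing in for Chubby reads and tablet-server RPCs.

```python
# Toy three-level location lookup: Chubby file -> root tablet -> METADATA
# tablet -> user tablet. All names and mappings here are invented.
CHUBBY = {"/bigtable/root-location": "root-tablet"}   # level 1: file in Chubby

TABLETS = {
    # each tablet maps the end row key of a range to a location
    "root-tablet": {"meta1": "meta-tablet-1"},        # level 2: root METADATA tablet
    "meta-tablet-1": {"m": "user-tablet-A", "z": "user-tablet-B"},
}

def range_lookup(tablet, key):
    """Return the location for the first row range whose end key >= key."""
    for end_key, location in sorted(TABLETS[tablet].items()):
        if key <= end_key:
            return location
    return None

def locate(row_key):
    root = CHUBBY["/bigtable/root-location"]   # level 1: read the Chubby file
    meta = range_lookup(root, "meta1")         # level 2: root tablet (simplified)
    return range_lookup(meta, row_key)         # level 3: METADATA tablet

print(locate("cats"))    # -> user-tablet-A
print(locate("trains"))  # -> user-tablet-B
```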

5.3.2 Tablet Assignment

Each tablet is assigned to at most one tablet server at a time. The master keeps track of the set of live tablet servers and the current assignment of tablets to tablet servers, including which tablets are unassigned. When a tablet is unassigned and a tablet server with sufficient room for the tablet is available, the master assigns the tablet by sending a tablet load request to the tablet server. Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory. The master monitors this directory (the servers directory) to discover tablet servers.

The set of existing tablets changes only when a table is created or deleted, when two existing tablets are merged to form one larger tablet, or when an existing tablet is split into two smaller tablets. The master is able to keep track of these changes because it initiates all but the last. Tablet splits are treated specially since they are initiated by tablet servers. A tablet server commits a split by recording information for the new tablet in the METADATA table. After committing the split, the tablet server notifies the master.

5.3.3 Tablet Serving

The persistent state of a tablet is stored in GFS. Updates are committed to a commit log that stores redo records. The recently committed updates are stored in memory in a sorted buffer called a memtable; older updates are stored in a sequence of SSTables.

To recover a tablet, a tablet server reads its metadata from the METADATA table. This metadata contains the list of SSTables that comprise the tablet and a set of redo points, which are pointers into any commit logs that may contain data for the tablet. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.

When a write operation arrives at a tablet server, the server checks that it is well-formed (i.e., not sent from a buggy or obsolete client) and that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file. A valid mutation is written to the commit log. After the write has been committed, its contents are inserted into the memtable.
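The write path just described (commit log first, then memtable) can be sketched as follows; the class, file name, and record format are illustrative assumptions, not Bigtable's actual code.

```python
# Sketch of a log-then-memtable write path: a mutation becomes durable in the
# redo log before it is applied to the in-memory sorted buffer.
import bisect

class TabletWriter:
    def __init__(self, log_path="commit.log"):
        self.log = open(log_path, "a")  # append-only redo records
        self.memtable = []              # sorted list of (key, value) pairs

    def write(self, key, value):
        # 1. Append a redo record to the commit log and flush it to disk.
        self.log.write(f"{key}\t{value}\n")
        self.log.flush()
        # 2. Only after the log write succeeds, insert into the memtable;
        #    after a crash, replaying the log rebuilds this buffer.
        bisect.insort(self.memtable, (key, value))

writer = TabletWriter()
writer.write("com.cnn.www", "<html>...")
print(writer.memtable)
```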

5.3.4 Schema Management

Bigtable schemas are stored in Chubby. Chubby is an effective communication substrate for Bigtable schemas because it provides atomic whole-file writes and consistent caching of small files. For example, suppose a client wants to delete some column families from a table. The master performs access control checks, verifies that the resulting schema is well-formed, and then installs the new schema by rewriting the corresponding schema file in Chubby. Whenever tablet servers need to determine what column families exist, they simply read the appropriate schema file from Chubby, which is almost always available in the server's Chubby client cache. Because Chubby caches are consistent, tablet servers are guaranteed to see all changes to that file.

6. Market Basket Analysis: Implementation and Results

In retail, each customer purchases a different set of products, in different quantities, and at different times. Retailers use this information to:
(1) Gain insight about their merchandise (products): fast and slow movers, products which are purchased together, and products which might benefit from promotion.
(2) Take action: store layouts, and which products to put on special, promote, or offer coupons for.

6.1 Apriori Algorithm

The small database used to test this algorithm is [18]:

S.No. | Item 1 | Item 2 | Item 3
1 | Bread | Butter | Milk
2 | Ice-cream | Bread | Butter
3 | Bread | Butter | Noodles
4 | Bread | Noodles | Ice-cream
5 | Butter | Milk | Bread
6 | Bread | Noodles | Ice-cream
7 | Milk | Butter | Bread
8 | Ice-cream | Milk | Bread
9 | Butter | Milk | Noodles
10 | Noodles | Butter | Ice-cream

Table 6.1. Database for testing the Apriori algorithm.

In the given dataset every item occurs three or more times and the total number of transactions is ten, so:

Minimum support = 0.3

Item-set | Support
Bread | 0.8
Butter | 0.7
Noodles | 0.5
Ice-cream | 0.5
Milk | 0.5

Table 6.2. Interestingness of 1-element item-sets.

Item-set | Support
{Bread, Butter} | 0.5
{Bread, Milk} | 0.4
{Bread, Noodles} | 0.3
{Bread, Ice-cream} | 0.4
{Butter, Milk} | 0.4
{Butter, Noodles} | 0.3
{Butter, Ice-cream} | 0.2
{Noodles, Milk} | 0.1
{Noodles, Ice-cream} | 0.3
{Milk, Ice-cream} | 0.1

Table 6.3. Interestingness of 2-element item-sets.

Item-set | Support
{Bread, Butter, Milk} | 0.3
{Bread, Ice-cream, Noodles} | 0.2
{Bread, Butter, Noodles} | 0.1

Table 6.4. Interestingness of 3-element item-sets.

The main advantage of the Apriori algorithm is that it generates each round of candidates only from the frequent item-sets of the previous iteration, rather than from the whole data.

Rule | Confidence (%)
{Bread} → {Butter, Milk} | 37.5
{Butter} → {Bread, Milk} | 42.9
{Milk} → {Bread, Butter} | 60
{Bread, Butter} → {Milk} | 60
{Bread, Milk} → {Butter} | 75
{Butter, Milk} → {Bread} | 75

Table 6.5. Rules based on the Apriori algorithm.

If the minimum confidence threshold is 70 percent and the minimum support is 30 percent, then the discovered rules are:
{Bread, Milk} → {Butter}
{Butter, Milk} → {Bread}
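Feeding the transactions of Table 6.1 to the Apriori sketch from Section 4.2.1 reproduces these supports and confidences:

```python
# Transactions of Table 6.1; apriori() and rules() are the earlier sketches.
db = [
    {"Bread", "Butter", "Milk"},    {"Ice-cream", "Bread", "Butter"},
    {"Bread", "Butter", "Noodles"}, {"Bread", "Noodles", "Ice-cream"},
    {"Butter", "Milk", "Bread"},    {"Bread", "Noodles", "Ice-cream"},
    {"Milk", "Butter", "Bread"},    {"Ice-cream", "Milk", "Bread"},
    {"Butter", "Milk", "Noodles"},  {"Noodles", "Butter", "Ice-cream"},
]
freq = apriori(db, minsup=0.3)
print(freq[frozenset({"Bread"})])                    # 0.8, as in Table 6.2
print(freq[frozenset({"Bread", "Butter", "Milk"})])  # 0.3, as in Table 6.4
for lhs, rhs, conf in rules(freq, minconf=0.7):
    print(lhs, "->", rhs, round(conf, 2))
# Among the 3-item rules, only {Bread, Milk} -> {Butter} and
# {Butter, Milk} -> {Bread} clear the 70% threshold, matching Table 6.5.
```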

The algorithm was then run on a bakery database consisting of 50 different items and 75,000 receipts. The minimum support was found to be 0.04. The items were labeled with letters of the English alphabet.

Item-set | Support
{A, AU} | 0.0440
{D, S} | 0.0434
{D, AJ} | 0.0430
{E, J} | 0.0431
{F, W} | 0.0439
{Q, AG} | 0.0435
{S, AH} | 0.0531
{AB, AC} | 0.0509
{AH, AQ} | 0.0431

Table 6.6. Interestingness of 2-element item-sets.

Item-set | Support
{D, S, AJ} | 0.0411

Table 6.7. Interestingness of 3-element item-sets.

6.2 FP-Growth Algorithm

This algorithm was implemented on three datasets: the previous two and a new one. The smallest dataset used was [19]:

S.No. | Item 1 | Item 2 | Item 3 | Item 4
1 | A | B | |
2 | B | C | D |
3 | A | C | D | E
4 | A | D | E |
5 | A | B | C |

Table 6.8. Small dataset for the FP-growth algorithm.

S.No. | E | B | C | D | A
1 | | 1 | | | 1
2 | | 1 | 1 | 1 |
3 | 1 | | 1 | 1 | 1
4 | 1 | | | 1 | 1
5 | | 1 | 1 | | 1
Freq. | 2 | 3 | 3 | 3 | 4

Table 6.9. Ascending-order arrangement with respect to frequency.

Fig. 6.1. FP-tree construction.
Table 6.10. Conditional pattern base and conditional FP-tree generation.

Frequent pattern | Support count

2-item sets:
{E, A} | 2
{E, D} | 2
{B, A} | 2
{C, A} | 2
{C, B} | 2
{D, A} | 2
{D, C} | 2

3-item sets:
{E, A, D} | 2

Table 6.11. Item-sets generated.
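Building the FP-tree for the dataset of Table 6.8 with the sketch from Section 4.2.2 (minimum support count 2, so every item including E is frequent) shows the shared-prefix compression at work:

```python
# Transactions of Table 6.8; build_fp_tree() is the sketch from Section 4.2.2.
db = [{"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"},
      {"A", "D", "E"}, {"A", "B", "C"}]
root, header = build_fp_tree(db, minsup_count=2)

def walk(node, depth=0):
    """Print each tree path with its shared-prefix counts."""
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
    for child in node.children.values():
        walk(child, depth + 1)

walk(root)  # A:4 heads most paths, since A is the most frequent item;
            # ties among B, C, D may order differently between runs
```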


Similarly, the FP-growth algorithm was implemented on the other two databases, and results identical to those given by the Apriori algorithm were obtained.

REFERENCES

1. G. Halevi and H. Moed. "The evolution of big data as a research and scientific topic: Overview of the literature." Research Trends (2012): 3-6.

    2. http://en.wikipedia.org/wiki/Big_data

    3. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs,

    C. Roxburgh and A.H. Byers, "Big data: The next

    frontier for innovation, competition, and

    productivity", McKinsey Global Institute, 2011.

    4. Chen, Min, Shiwen Mao, and Yunhao Liu. "Big

    data: A survey." Mobile Networks and Applications 

    19.2 (2014): 171-209

5. Agrawal, Rakesh, Tomasz Imielinski, and Arun Swami. "Mining association rules between sets of items in large databases." In ACM SIGMOD Record, vol. 22, no. 2, pp. 207-216. ACM, 1993.

    6. Lohr, Steve. "The age of big data." New York Times 

    11 (2012).

    7. Rygielski, Chris, Jyun-Cheng Wang, and David C.

    Yen. "Data mining techniques for customer

    relationship management." Technology in society  24,

    no. 4 (2002): 483-502.

8. Hennig-Thurau, Thorsten, Edward C. Malthouse, Christian Friege, Sonja Gensler, Lara Lobschat, Arvind Rangaswamy, and Bernd Skiera. "The impact of new media on customer relationships." Journal of Service Research 13, no. 3 (2010): 311-330.

    9. Kamakura, Wagner A., Michel Wedel, Fernando

    De Rosa, and Jose Afonso Mazzon. "Cross-selling

    through database marketing: A mixed data factor

    analyzer for data augmentation and prediction."

    International Journal of Research in marketing 20,

    no. 1 (2003): 45-65.

    10. Meixell, Mary J., and Vidyaranya B. Gargeya.

    "Global supply chain design: A literature review and

    critique." Transportation Research Part E: Logistics

    and Transportation Review  41, no. 6 (2005): 531-

    550.

    11. Chang, Fay, Jeffrey Dean, Sanjay Ghemawat,

    Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows,

    Tushar Chandra, Andrew Fikes, and Robert E.

    Gruber. "Bigtable: A distributed storage system for

    structured data." ACM Transactions on Computer

    Systems (TOCS) 26, no. 2 (2008): 4.

12. Apache Cassandra 2.1 Documentation, October 27, 2015.

    13. Shvachko, Konstantin, Hairong Kuang, Sanjay

    Radia, and Robert Chansler. "The hadoop distributed

    file system." In Mass Storage Systems and

    Technologies (MSST), 2010 IEEE 26th Symposium on,

    pp. 1-10. IEEE, 2010.

14. Liu, B., Hsu, W., & Ma, Y. (1999, August). Pruning and summarizing the discovered associations. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 125-134). ACM.

    15. Kamsu-Foguem, Bernard, Fabien Rigal, and Félix

    Mauget. "Mining association rules for the quality

    improvement of the production process." Expert

    Systems with Applications 40.4 (2013): 1034-1045

16. Zheng, Z., Kohavi, R., & Mason, L. (2001, August). Real world performance of association rule algorithms. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 401-406). ACM.

17. Han, Jiawei, Jian Pei, Yiwen Yin, and Runying Mao. "Mining frequent patterns without candidate generation: A frequent-pattern tree approach." Data Mining and Knowledge Discovery 8, no. 1 (2004): 53-87.

    18. Dongre, Jugendra, Gend Lal Prajapati, and S. V.

    Tokekar. "The role of Apriori algorithm for finding

    the association rules in Data mining." In Issues and

    Challenges in Intelligent Computing Techniques

    (ICICT), 2014 International Conference on, pp. 657-

    660. IEEE, 2014.

    19. Weatherford, M. (2002). Mining for fraud.

    Intelligent Systems, IEEE , 17 (4), 4-6.

20. Appice, A., Ceci, M., Lanza, A., et al. "Discovery of spatial association rules in geo-referenced census data: a relational mining approach." Intelligent Data Analysis 7 (2003): 541-566.

21. M.-L. Antonie, O. R. Zaïane, and A. Coman. "Application of data mining techniques for medical image classification." In Second International ACM SIGKDD Workshop on Multimedia Data Mining, pp. 94-101, San Francisco, USA, August 2001.

    22. Adomavicius, G., & Tuzhilin, A. (2001). Expert-

    driven validation of rule-based user models in

    personalization applications. Data Mining and

    Knowledge Discovery , 5(1-2), 33-58

    23. Cortes, C., & Pregibon, D. (2001). Signature-

    based methods for data streams. Data Mining and

    Knowledge Discovery , 5(3), 167-182.

    24. Kamsu-Foguem, Bernard, Fabien Rigal, and Félix

    Mauget. "Mining association rules for the quality

    improvement of the production process." Expert

    Systems with Applications 40.4 (2013): 1034-1045.