Techniques and Technologies



    1. Introduction

Big data, a term coined by Roger Magoulas of O'Reilly Media in 2005 [1], refers to massive data sets with a larger, more varied, and more complex structure, which pose challenges in storing, analyzing, and visualizing the data to extract meaningful results. Big data analytics is the process of researching massive amounts of data to reveal hidden patterns and correlations.

Big data is generated from various sources such as astronomy, atmospheric science, genomics, biogeochemical and biological science and research, life sciences, medical records, scientific research, government, natural disaster and resource management, military surveillance, the private sector, financial services, retail, social networks, web logs, text, documents, photography, audio, video, click streams, search indexing, call detail records, POS information, RFID, mobile phones, sensor networks, and telecommunications [2].

    2. Overview of Big Data

    2.1 Benefits

The benefits of big data in various fields include [4]: better-targeted marketing, more direct business insights, customer-based segmentation, recognition of sales and market opportunities, automated decision making, definitions of customer behaviors, greater return on investment, quantification of risks and market trends, comprehension of business changes, better planning and forecasting, identification of consumer behavior, increased production yield, predictive analytics on traffic flows, and identification of threats from video, audio, and data feeds.

2.2 Potential of Big Data

The McKinsey Global Institute identified the potential of big data in the following main areas [3].

Healthcare: It has three main pools of big data.
(1) Clinical data: optimal treatment pathways, computerized physician order entry, transparency about medical data, remote patient monitoring, and advanced analytics applied to patient profiles.
(2) Pharmaceutical R&D data: predictive modelling for new drugs, suggesting trial sites with large numbers of potentially eligible patients and strong track records, pharmacovigilance (discovering adverse effects), and developing personalized medicine.
(3) Activity (claims) and cost data: automated systems (e.g., machine learning techniques such as neural networks) for fraud detection and for checking the accuracy and consistency of payors' claims, based on real-world patient outcomes data to arrive at fair economic compensation.

Public sector: In the public sector the five main categories for big data are:
(1) Creating transparency: making data more accessible.


(2) Enabling experimentation to discover needs, expose variability, and improve performance.
(3) Segmenting populations to customize actions.
(4) Replacing/supporting human decision making with automated algorithms.
(5) Innovating new business models, products, and services with big data.

Retail: Here, the five main categories are:
(1) Marketing: cross-selling, location-based marketing, customer behavioural segmentation, sentiment analysis, and integrated promotion and pricing.
(2) Merchandising: placement and design optimization, price optimization.
(3) Operations: performance transparency, optimization of labour inputs, automated time and attendance tracking, and improved labour scheduling.
(4) Supply chain: stock forecasting by combining multiple datasets such as sales histories, weather predictions, and seasonal sales cycles.
(5) New business models: price comparison services, web-based markets.

Manufacturing:
(1) Research and development and product design.
(2) Product lifecycle management.
(3) Design to value.
(4) Open innovation.
(5) Supply chain.
(6) Production: digital factory, sensor-driven operations.

Personal location data: smart routing, geo-targeted advertising or emergency response, urban planning, and new business models.

Social network analysis: understanding user intelligence for more targeted advertising, marketing campaigns and capacity planning, customer behavior and buying patterns, and sentiment analytics.

2.3 Challenges and Obstacles

The major hurdles for the implementation of big data analytics are [4]:
(1) Data representation: Data representation aims to make data more meaningful for computer analysis and user interpretation. Many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility.
(2) Redundancy reduction and data compression: These are effective in reducing the indirect cost of the entire system, on the premise that the potential values of the data are not affected.
(3) Data life cycle management: A data importance principle related to analytical value should be developed to decide which data shall be stored and which data shall be discarded.
(4) Data confidentiality: A transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information such as credit card numbers.


[…] creation, and availability for access and delivery. It includes web site response time, inventory availability analysis, transaction execution, order tracking updates, and product/service delivery.

3. Big Data Techniques and Technologies

3.1 Big data techniques

1. A/B testing [3]
Method: A control group is compared with a variety of test groups in order to determine what changes will improve a given objective variable.
Example: Determining what copy text, layouts, images, or colors will improve conversion rates on an e-commerce web site (see the code sketch after this list).

2. Association rule learning [5]
Method: A set of techniques to discover interesting relationships, i.e. "association rules", among variables in large databases.
Example: Market basket analysis, in which a retailer can determine which products are frequently bought together and use this information for marketing.

3. Data fusion and data integration [6]
Method: A set of techniques that integrate and analyze data from multiple sources in order to develop insights in ways that are more efficient and potentially more accurate than if they were developed by analyzing a single source of data.
Example: Data from social media, analyzed by natural language processing, can be combined with real-time sales data in order to determine what effect a marketing campaign is having on customer sentiment and purchasing behavior.

4. Data mining [7]
Method: A set of techniques to extract patterns from large datasets by combining methods from statistics and machine learning with database management. These techniques include cluster analysis, classification, and regression.
Example: Mining customer data to determine the segments most likely to respond to an offer, or mining human resources data to identify the characteristics of the most successful employees.

5. Natural language processing [8]
Method: Uses computer algorithms to analyze human (natural) language.
Example: Using sentiment analysis on social media to determine how prospective customers are reacting to a branding campaign.

6. Predictive modeling [9]
Method: A mathematical model is created or chosen to best predict the probability of an outcome.
Example: Estimating the likelihood that a customer can be cross-sold another product.

7. Spatial analysis [10]
Method: Techniques that analyze the topological, geometric, or geographic properties encoded in a data set.
Example: How is consumer willingness to purchase a product correlated with location? How would a manufacturing supply chain network perform with sites in different locations?
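To make technique 1 concrete, the following is a minimal A/B-testing sketch in Python. The conversion counts and the choice of a two-proportion z-test with a normal approximation are illustrative assumptions, not a method prescribed here.

```python
# Minimal A/B-test sketch: compare the conversion rate of a control layout (A)
# against a test layout (B) with a two-proportion z-test. All numbers are
# hypothetical; a real test would also fix the sample size in advance.
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Return (z statistic, two-sided p-value) for the rate difference."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 200/5000 conversions on layout A vs. 260/5000 on layout B (hypothetical).
z, p = ab_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")  # a small p suggests B genuinely converts better
```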

    3.2 Big data technologies

1. Big Table [11]
Overview: Bigtable is a distributed storage system for managing structured data at Google. It can reliably scale to petabytes of data and thousands of machines.
Application: More than 60 Google products, including Google Earth, Google Finance, Google web indexing, Orkut, and Google Analytics.

2. Cassandra [12]
Overview: Cassandra is a massively scalable open source NoSQL database. Cassandra has a masterless "ring" design that is elegant, easy to set up, and easy to maintain. Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure.
Application: Accenture, eBay, Netflix, GoDaddy, Instagram, Reddit, Yahoo! Japan, NASA.

3. Hadoop [13]
Overview: Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Application: Amazon, AOL, Facebook, IBM, New York Times, Yahoo!, Microsoft, Google.

4. Association Rule

An association is a rule of the form LHS → RHS. The goal of association rule discovery is to find associations among items from a set of transactions, each of which contains a set of items [5]. Generally, the algorithm finds a subset of the association rules that satisfy certain constraints:

(1) Minimum support: The support of a rule is defined as the support of the item-set consisting of both the LHS and the RHS. The support of an item-set is the percentage of transactions in the transaction set that contain the item-set. An item-set with a support higher than a given minimum support is called a frequent item-set.

(2) Minimum confidence: It is the minimum acceptable ratio of the support of the rule to the support of the LHS.
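As a small illustration of these two measures, the following Python sketch computes support and confidence over a toy transaction set; the item names and transactions are invented for illustration.

```python
# Support and confidence as defined above, over a toy list of transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter"},
]

def support(itemset, transactions):
    # fraction of transactions containing every item of the item-set
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # support(LHS ∪ RHS) / support(LHS)
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"bread", "butter"}, transactions))       # 0.5
print(confidence({"bread"}, {"butter"}, transactions))  # 2/3 ≈ 0.667
```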


Most association rule algorithms generate association rules in two steps:
(1) generate all frequent item-sets;
(2) construct all rules using these item-sets.

Fig. 4.1. General block diagram of association rule mining.

4.1 Association Rule in Big Data

It has been experimentally demonstrated that for support levels that generate fewer than 100,000 rules (a very conservative upper bound for humans to sift through, even after pruning uninteresting rules), Apriori finishes on all datasets in less than 1 minute. For support levels that generate fewer than 1,000,000 rules, which are sufficient for prediction purposes where data is loaded into RAM, Apriori finishes processing in less than 10 minutes [14].

4.2 Association Rule Algorithms

4.2.1 Apriori Algorithm

The Apriori algorithm finds frequent item-sets from databases by iteration. In each iteration i the algorithm determines the set of frequent patterns with i items, and this set is used to generate the set of candidate item-sets of the next iteration. The iterations are repeated until no candidate patterns can be discovered. It uses a bottom-up approach, where frequent subsets are extended one item at a time. The input datasets are treated as transactions composed of items, and the output of Apriori is a set of rules describing the links between these items [15].

Apriori is an algorithm for finding frequent item-sets using candidate generation. Given a minimum required support S as the interestingness criterion [18]:


(1) Search for all individual elements (1-element item-sets) that have a minimum support of S.
(2) From the results of the previous search for i-element item-sets, search for all (i+1)-element item-sets that have a minimum support of S. This becomes the set of all frequent (i+1)-element item-sets that are interesting.
(3) Repeat step 2 until the maximum item-set size is reached.

Association rules are of the form A → B, where A → B is different from B → A. A → B implies that if a customer purchases item A then he also purchases item B. For association rule mining, two threshold values are required:

(1) Minimum support: Support is the percentage of the population which satisfies the rule; in other words, the support of a rule R is the ratio of the number of occurrences of R to all occurrences of all rules. The support of an association pattern is the percentage of task-relevant data transactions for which the pattern is true:

Support(A → B) = (number of tuples with both A and B) / (total number of tuples)

(2) Minimum confidence: The confidence of a rule A → B is the ratio of the number of occurrences of B given A among all occurrences given A. Confidence is a measure of the certainty or trustworthiness associated with each discovered pattern A → B:

Confidence(A → B) = (number of tuples with both A and B) / (number of tuples with A)

Association rules are generated by the following method:
(1) Use Apriori to generate item-sets of different sizes.
(2) At each iteration, divide each frequent item-set X into two parts, an antecedent (LHS) and a consequent (RHS); this represents a rule of the form LHS → RHS.
(3) Discard all rules whose confidence is less than the minimum confidence.
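A compact Python sketch of this procedure follows: level-wise frequent item-set generation, then splitting each frequent item-set into LHS → RHS rules. The candidate-join shortcut below is a simplification of the full Apriori join-and-prune step, not a faithful reproduction of it.

```python
# Apriori sketch: level-wise frequent item-set mining plus rule generation.
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset(item-set): support} for all frequent item-sets."""
    n = len(transactions)
    level = {frozenset([i]) for t in transactions for i in t}
    freq = {}
    while level:
        # count support of the current candidate level in one pass
        counts = {c: sum(c <= t for t in transactions) for c in level}
        current = {c: k / n for c, k in counts.items() if k / n >= minsup}
        freq.update(current)
        keys = list(current)
        # join step: combine frequent i-item-sets into (i+1)-item candidates
        level = {a | b for a in keys for b in keys if len(a | b) == len(a) + 1}
    return freq

def rules(freq, minconf):
    """Split each frequent item-set into antecedent (LHS) and consequent (RHS)."""
    out = []
    for itemset in freq:
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = freq[itemset] / freq[lhs]
                if conf >= minconf:
                    out.append((set(lhs), set(itemset - lhs), conf))
    return out
```

Because every subset of a frequent item-set is itself frequent (the Apriori property), `freq[lhs]` is always available when a rule's confidence is computed.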

4.2.2 FP-Growth Algorithm

FP-growth generates all frequent item-sets satisfying a given minimum support by growing a frequent-pattern tree (FP-tree) structure that stores compressed information about the frequent patterns. In this way, FP-growth can avoid repeated database scans and also avoid the generation of a large number of candidate item-sets. FP-growth takes transactional data in the form of one row for each single complete transaction. Implementations of FP-growth only generate the frequent item-sets, not the association rules [16]. The mining task as well as the database are decomposed using a divide-and-conquer approach, and a fragment-pattern method is used to avoid the costly process of candidate generation and testing required by the Apriori algorithm [15].

A frequent-pattern tree is a structure consisting of [17]:
(1) one root labeled as "null",
(2) a set of item-prefix subtrees as the children of the root, and
(3) a frequent-item-header table.

Item-prefix subtrees: Each node in an item-prefix subtree consists of three fields: item-name, count, and node-link.
(1) Item-name: registers which item this node represents.
(2) Count: registers the number of transactions represented by the portion of the path reaching this node.
(3) Node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.

Frequent-item-header table: Each entry consists of two fields:
(1) item-name, and
(2) head of node-link (a pointer to the first node in the FP-tree carrying the item-name).

Fig. 4.2. FP-tree structure.

Algorithm for FP-tree construction:

Input: a transaction database DB and a minimum support threshold ξ.
Output: FP-tree, the frequent-pattern tree of DB.

Steps:
(1) Scan the transaction database DB once. Collect F, the set of frequent items, and the support of each frequent item, then sort F in support-descending order.
(2) Create the root of the FP-tree, T, and label it "null". For each transaction Trans in DB, select the frequent items in Trans, sort them according to the order of F, and insert them along a path from the root, as follows.
(3) For each item being inserted at the current node T: if T has a child N carrying that item-name, increment N's count by 1; otherwise create a new node N with its count initialized to 1, its parent link pointing to T, and its node-link linked to the nodes with the same item-name via the node-link structure.
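The following Python sketch mirrors this construction; the class layout is an assumption based on the node fields listed above, not code from the cited papers.

```python
# Minimal FP-tree construction sketch (item-name, count, node-link fields).
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item-name this node represents
        self.count = 1          # transactions sharing the path to this node
        self.parent = parent
        self.children = {}      # item-name -> child FPNode
        self.node_link = None   # next node carrying the same item-name

def build_fp_tree(transactions, minsup_count):
    # Pass 1: collect F, the frequent items, with their supports.
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= minsup_count}
    header = {i: None for i in freq}     # frequent-item-header table
    root = FPNode(None, None)            # the root labeled "null"
    # Pass 2: insert each transaction, items sorted by descending support.
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: -freq[i])
        node = root
        for i in items:
            if i in node.children:
                node.children[i].count += 1   # shared prefix: bump the count
            else:
                child = FPNode(i, node)
                node.children[i] = child
                child.node_link = header[i]   # thread into the header table
                header[i] = child
            node = node.children[i]
    return root, header
```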

    FP-growth is about an order of

    magnitude faster than Apriori, especially

    when the data set is dense (containing many

     patterns) and/or when the frequent patterns

    are long.

    4.2.3 Charm

    Charm is an algorithm for generating

    closed frequent item-sets for association

    rules from transactional data. The closed

    frequent item-sets are the smallest

    representative subset of frequent item-sets

    without loss of information. Charm takes

    transactional data in the form of one row for

    each single complete transaction. [16]

4.2.4 Magnum Opus

The main unique technique used in Magnum Opus is its search algorithm based on OPUS, a systematic search method with pruning. It considers the whole search space, but during the search it effectively prunes a large area of the search space without missing search targets, provided that the targets can be measured using certain criteria [16].

4.3 Real-Life Applications of Association Rules

Government sector: researchers at King's College London [19].
Problem statement: fraud at Consignia, the UK's Post Office group.
Method applied: "if…then" association rules, e.g. the normal-behavior rule "IF time < 1200 AND item = stamps THEN $2 < cost < $4."
Outcome: detectors that successfully spot abnormal transactions. The detectors also copy themselves, so CIFD adapts itself to create detectors that correspond to the most prevalent patterns of fraud.

Government sector [20].
Problem statement: issues concerning the accessibility of an urban area.
Method applied: spatial association rule mining applied to geo-referenced U.K. census data of 1991.
Outcome: helped transportation planning in the area near the local Stepping Hill Hospital.

Health care sector [21].
Problem statement: anomaly detection and classification in breast cancer.
Method applied: in training, the Apriori algorithm was applied and association rules were extracted; the support was set to 10% and the confidence to 0%.
Outcome: the success rate of the classifier was 69.11%, and the time required for training was much less than for a neural network.

Retail sector [22].
Problem statement: purchasing behavior of customers.
Method applied: on a dataset of 353,421 records from 1,903 households, about 1,022,812 association rules were generated for promotion sensitivity analysis, i.e., analysis of customer responses to various types of promotions, including advertisements, coupons, and various types of discounts.
Outcome: in a duration of 1.5 hours about 2.6% of the discovered rules were accepted and the rest were rejected, reducing the total from 537 rules per household to about 14 rules per household.

Telecom sector [23].
Problem statement: which country pairs, triples, or quadruples customers are currently calling.
Method applied: association rules, treating the top-k country item-set as a market basket for each account and exploiting the temporal nature of the data by using traffic from the last month as a baseline for the current month.
Outcome: successful in detecting high-rate fraud-call trends associated with adult entertainment services that move from country to country through time.

Manufacturing sector: VAM Drilling industries, France [24].
Problem statement: setting up a system which provides results identical to human observation concerning performance and dysfunctions during forging.
Method applied: RuleGrowth, which mines sequential rules by FP-growth, varying the minimum support and minimum confidence parameters.
Outcome: found the main dysfunction responsible for delay; found that the generator is the cause of exceeding the maximum time in the starting phase; the third major problem was the lack of effectiveness of metal strippers.

    5. Big Table [11]

5.1 Data Model

Data is organized into three dimensions: rows, columns, and timestamps. We refer to the storage referenced by a particular row key, column key, and timestamp as a cell. In the Webtable example, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they are fetched. Rows with consecutive keys are grouped into tablets.

Fig. 5.1. Bigtable architecture.
Fig. 5.2. Data model for Bigtable.
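As a toy illustration of this (row key, column key, timestamp) → cell map, consider the Python sketch below; a real Bigtable keeps the map sorted, sparse, and distributed across tablets, so the in-memory dictionary is only a stand-in.

```python
# Toy model of Bigtable's three-dimensional map: (row, column, timestamp) -> value.
webtable = {}

def put(row, column, timestamp, value):
    webtable[(row, column, timestamp)] = value

def get_latest(row, column):
    """Return the most recent cell version for a (row, column) pair."""
    versions = [(ts, v) for (r, c, ts), v in webtable.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

# As in the Webtable example, row keys are URLs and page contents live in
# the "contents:" column, versioned by fetch timestamp.
put("com.cnn.www", "contents:", 3, "<html>... (version at t=3)")
put("com.cnn.www", "contents:", 5, "<html>... (version at t=5)")
print(get_latest("com.cnn.www", "contents:"))  # prints the t=5 version
```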

5.2 Building Blocks

Bigtable depends on a Google cluster management system for scheduling jobs, managing resources on shared machines, monitoring machine status, and dealing with machine failures.

The Google SSTable immutable-file format is used internally to store Bigtable data files. An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings.

Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data; to discover tablet servers and finalize tablet server deaths; and to store Bigtable schemas. Chubby is a distributed lock service. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serves requests. The service is live when a majority of the replicas are running and can communicate with each other.

5.3 Bigtable Implementation

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers.


The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage-collecting files. In addition, it handles schema changes such as table and column family creations and deletions. Each tablet server manages a set of tablets. The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large. A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all of the data associated with a row range.

5.3.1 Tablet Location

Bigtable uses a three-level hierarchy. The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the locations of all of the tablets of a special METADATA table, and each METADATA tablet contains the locations of a set of user tablets. Secondary information, such as a log of all events pertaining to each tablet (for example, when a server begins serving it), is also stored in METADATA; this information is helpful for debugging and performance analysis.

Fig. 5.3. Tablet location hierarchy.
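A hedged sketch of this three-level lookup follows; the dictionary-backed "tablets", the key names, and the simplified level-2 step are illustrative assumptions standing in for Chubby reads and tablet-server RPCs.

```python
# Toy three-level location lookup: Chubby file -> root tablet -> METADATA
# tablet -> user tablet. All names and mappings here are invented.
CHUBBY = {"/bigtable/root-location": "root-tablet"}   # level 1: file in Chubby

TABLETS = {
    # each tablet maps the end row key of a range to a location
    "root-tablet": {"meta1": "meta-tablet-1"},        # level 2: root METADATA tablet
    "meta-tablet-1": {"m": "user-tablet-A", "z": "user-tablet-B"},
}

def range_lookup(tablet, key):
    """Return the location for the first row range whose end key >= key."""
    for end_key, location in sorted(TABLETS[tablet].items()):
        if key <= end_key:
            return location
    return None

def locate(row_key):
    root = CHUBBY["/bigtable/root-location"]   # level 1: read the Chubby file
    meta = range_lookup(root, "meta1")         # level 2: root tablet (simplified)
    return range_lookup(meta, row_key)         # level 3: METADATA tablet

print(locate("cats"))    # -> user-tablet-A
print(locate("trains"))  # -> user-tablet-B
```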

5.3.2 Tablet Assignment

Each tablet is assigned to at most one tablet server at a time. The master keeps track of the set of live tablet servers and the current assignment of tablets to tablet servers, including which tablets are unassigned. When a tablet is unassigned and a tablet server with sufficient room for the tablet is available, the master assigns the tablet by sending a tablet load request to the tablet server. Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory. The master monitors this directory (the servers directory) to discover tablet servers.

The set of existing tablets changes only when a table is created or deleted, when two existing tablets are merged to form one larger tablet, or when an existing tablet is split into two smaller tablets. The master is able to keep track of these changes because it initiates all but the last. Tablet splits are treated specially since they are initiated by tablet servers. A tablet server commits a split by recording information for the new tablet in the METADATA table. After committing the split, the tablet server notifies the master.

5.3.3 Tablet Serving

The persistent state of a tablet is stored in GFS. Updates are committed to a commit log that stores redo records. The recently committed updates are stored in memory in a sorted buffer called a memtable; older updates are stored in a sequence of SSTables.

To recover a tablet, a tablet server reads its metadata from the METADATA table. This metadata contains the list of SSTables that comprise the tablet and a set of redo points, which are pointers into any commit logs that may contain data for the tablet. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.

When a write operation arrives at a tablet server, the server checks that it is well-formed (i.e., not sent from a buggy or obsolete client) and that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file. A valid mutation is written to the commit log. After the write has been committed, its contents are inserted into the memtable.
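The write path just described (commit log first, then memtable) can be sketched as follows; the class, file name, and record format are illustrative assumptions, not Bigtable's actual code.

```python
# Sketch of a log-then-memtable write path: a mutation becomes durable in the
# redo log before it is applied to the in-memory sorted buffer.
import bisect

class TabletWriter:
    def __init__(self, log_path="commit.log"):
        self.log = open(log_path, "a")  # append-only redo records
        self.memtable = []              # sorted list of (key, value) pairs

    def write(self, key, value):
        # 1. Append a redo record to the commit log and flush it to disk.
        self.log.write(f"{key}\t{value}\n")
        self.log.flush()
        # 2. Only after the log write succeeds, insert into the memtable;
        #    after a crash, replaying the log rebuilds this buffer.
        bisect.insort(self.memtable, (key, value))

writer = TabletWriter()
writer.write("com.cnn.www", "<html>...")
print(writer.memtable)
```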

5.3.4 Schema Management

Bigtable schemas are stored in Chubby. Chubby is an effective communication substrate for Bigtable schemas because it provides atomic whole-file writes and consistent caching of small files. For example, suppose a client wants to delete some column families from a table. The master performs access control checks, verifies that the resulting schema is well-formed, and then installs the new schema by rewriting the corresponding schema file in Chubby. Whenever tablet servers need to determine what column families exist, they simply read the appropriate schema file from Chubby, which is almost always available in the server's Chubby client cache. Because Chubby caches are consistent, tablet servers are guaranteed to see all changes to that file.

6. Market Basket Analysis: Implementation and Results

In retail, each customer purchases a different set of products, in different quantities, and at different times. Retailers use this information to:
(1) Gain insight about their merchandise (products): fast and slow movers, products which are purchased together, and products which might benefit from promotion.
(2) Take action: store layouts, and which products to put on special, promote, or offer coupons for.

6.1 Apriori Algorithm

The small database used to test this algorithm is [18]:

S.No. | Item 1 | Item 2 | Item 3
1 | Bread | Butter | Milk
2 | Ice-cream | Bread | Butter
3 | Bread | Butter | Noodles
4 | Bread | Noodles | Ice-cream
5 | Butter | Milk | Bread
6 | Bread | Noodles | Ice-cream
7 | Milk | Butter | Bread
8 | Ice-cream | Milk | Bread
9 | Butter | Milk | Noodles
10 | Noodles | Butter | Ice-cream

Table 6.1. Database for testing the Apriori algorithm.

In the given dataset every item occurs three or more times and the total number of transactions is ten, so:

Minimum support = 0.3

Item-set | Support
Bread | 0.8
Butter | 0.7
Noodles | 0.5
Ice-cream | 0.5
Milk | 0.5

Table 6.2. Interestingness of 1-element item-sets.

Item-set | Support
{Bread, Butter} | 0.5
{Bread, Milk} | 0.4
{Bread, Noodles} | 0.3
{Bread, Ice-cream} | 0.4
{Butter, Milk} | 0.4
{Butter, Noodles} | 0.3
{Butter, Ice-cream} | 0.2
{Noodles, Milk} | 0.1
{Noodles, Ice-cream} | 0.3
{Milk, Ice-cream} | 0.1

Table 6.3. Interestingness of 2-element item-sets.

Item-set | Support
{Bread, Butter, Milk} | 0.3
{Bread, Ice-cream, Noodles} | 0.2
{Bread, Butter, Noodles} | 0.1

Table 6.4. Interestingness of 3-element item-sets.

The main advantage of the Apriori algorithm is that it generates each round of candidates only from the frequent item-sets of the previous iteration, rather than from the whole data.

Rule | Confidence (%)
{Bread} → {Butter, Milk} | 37.5
{Butter} → {Bread, Milk} | 42.9
{Milk} → {Bread, Butter} | 60
{Bread, Butter} → {Milk} | 60
{Bread, Milk} → {Butter} | 75
{Butter, Milk} → {Bread} | 75

Table 6.5. Rules based on the Apriori algorithm.

If the minimum confidence threshold is 70 percent and the minimum support is 30 percent, then the discovered rules are:
{Bread, Milk} → {Butter}
{Butter, Milk} → {Bread}
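Feeding the transactions of Table 6.1 to the Apriori sketch from Section 4.2.1 reproduces these supports and confidences:

```python
# Transactions of Table 6.1; apriori() and rules() are the earlier sketches.
db = [
    {"Bread", "Butter", "Milk"},    {"Ice-cream", "Bread", "Butter"},
    {"Bread", "Butter", "Noodles"}, {"Bread", "Noodles", "Ice-cream"},
    {"Butter", "Milk", "Bread"},    {"Bread", "Noodles", "Ice-cream"},
    {"Milk", "Butter", "Bread"},    {"Ice-cream", "Milk", "Bread"},
    {"Butter", "Milk", "Noodles"},  {"Noodles", "Butter", "Ice-cream"},
]
freq = apriori(db, minsup=0.3)
print(freq[frozenset({"Bread"})])                    # 0.8, as in Table 6.2
print(freq[frozenset({"Bread", "Butter", "Milk"})])  # 0.3, as in Table 6.4
for lhs, rhs, conf in rules(freq, minconf=0.7):
    print(lhs, "->", rhs, round(conf, 2))
# Among the 3-item rules, only {Bread, Milk} -> {Butter} and
# {Butter, Milk} -> {Bread} clear the 70% threshold, matching Table 6.5.
```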

The algorithm was then run on a bakery database consisting of 50 different items and 75,000 receipts. The minimum support was found to be 0.04. The items were labeled with letters of the English alphabet.

Item-set | Support
{A, AU} | 0.0440
{D, S} | 0.0434
{D, AJ} | 0.0430
{E, J} | 0.0431
{F, W} | 0.0439
{Q, AG} | 0.0435
{S, AH} | 0.0531
{AB, AC} | 0.0509
{AH, AQ} | 0.0431

Table 6.6. Interestingness of 2-element item-sets.

Item-set | Support
{D, S, AJ} | 0.0411

Table 6.7. Interestingness of 3-element item-sets.

6.2 FP-Growth Algorithm

This algorithm was implemented on three datasets: the previous two and a new one. The smallest dataset used was [19]:

S.No. | Item 1 | Item 2 | Item 3 | Item 4
1 | A | B | |
2 | B | C | D |
3 | A | C | D | E
4 | A | D | E |
5 | A | B | C |

Table 6.8. Small dataset for the FP-growth algorithm.

S.No. | E | B | C | D | A
1 | | 1 | | | 1
2 | | 1 | 1 | 1 |
3 | 1 | | 1 | 1 | 1
4 | 1 | | | 1 | 1
5 | | 1 | 1 | | 1
Freq. | 2 | 3 | 3 | 3 | 4

Table 6.9. Ascending-order arrangement with respect to frequency.

Fig. 6.1. FP-tree construction.
Table 6.10. Conditional pattern base and conditional FP-tree generation.

Frequent pattern | Support count

2-item sets:
{E, A} | 2
{E, D} | 2
{B, A} | 2
{C, A} | 2
{C, B} | 2
{D, A} | 2
{D, C} | 2

3-item sets:
{E, A, D} | 2

Table 6.11. Item-sets generated.
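Building the FP-tree for the dataset of Table 6.8 with the sketch from Section 4.2.2 (minimum support count 2, so every item including E is frequent) shows the shared-prefix compression at work:

```python
# Transactions of Table 6.8; build_fp_tree() is the sketch from Section 4.2.2.
db = [{"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"},
      {"A", "D", "E"}, {"A", "B", "C"}]
root, header = build_fp_tree(db, minsup_count=2)

def walk(node, depth=0):
    """Print each tree path with its shared-prefix counts."""
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
    for child in node.children.values():
        walk(child, depth + 1)

walk(root)  # A:4 heads most paths, since A is the most frequent item;
            # ties among B, C, D may order differently between runs
```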


Similarly, the FP-growth algorithm was implemented on the other two databases, and results identical to those given by the Apriori algorithm were obtained.

REFERENCES

1. G. Halevi and H. Moed. "The evolution of big data as a research and scientific topic: Overview of the literature." Research Trends (2012): 3-6.

    2. http://en.wikipedia.org/wiki/Big_data

    3. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs,

    C. Roxburgh and A.H. Byers, "Big data: The next

    frontier for innovation, competition, and

    productivity", McKinsey Global Institute, 2011.

    4. Chen, Min, Shiwen Mao, and Yunhao Liu. "Big

    data: A survey." Mobile Networks and Applications 

    19.2 (2014): 171-209

5. Agrawal, Rakesh, Tomasz Imielinski, and Arun Swami. "Mining association rules between sets of items in large databases." In ACM SIGMOD Record, vol. 22, no. 2, pp. 207-216. ACM, 1993.

    6. Lohr, Steve. "The age of big data." New York Times 

    11 (2012).

    7. Rygielski, Chris, Jyun-Cheng Wang, and David C.

    Yen. "Data mining techniques for customer

    relationship management." Technology in society  24,

    no. 4 (2002): 483-502.

8. Hennig-Thurau, Thorsten, Edward C. Malthouse, Christian Friege, Sonja Gensler, Lara Lobschat, Arvind Rangaswamy, and Bernd Skiera. "The impact of new media on customer relationships." Journal of Service Research 13, no. 3 (2010): 311-330.

    9. Kamakura, Wagner A., Michel Wedel, Fernando

    De Rosa, and Jose Afonso Mazzon. "Cross-selling

    through database marketing: A mixed data factor

    analyzer for data augmentation and prediction."

    International Journal of Research in marketing 20,

    no. 1 (2003): 45-65.

    10. Meixell, Mary J., and Vidyaranya B. Gargeya.

    "Global supply chain design: A literature review and

    critique." Transportation Research Part E: Logistics

    and Transportation Review  41, no. 6 (2005): 531-

    550.

    11. Chang, Fay, Jeffrey Dean, Sanjay Ghemawat,

    Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows,

    Tushar Chandra, Andrew Fikes, and Robert E.

    Gruber. "Bigtable: A distributed storage system for

    structured data." ACM Transactions on Computer

    Systems (TOCS) 26, no. 2 (2008): 4.

12. Apache Cassandra 2.1 Documentation, October 27, 2015.

    13. Shvachko, Konstantin, Hairong Kuang, Sanjay

    Radia, and Robert Chansler. "The hadoop distributed

    file system." In Mass Storage Systems and

    Technologies (MSST), 2010 IEEE 26th Symposium on,

    pp. 1-10. IEEE, 2010.

14. Liu, B., Hsu, W., & Ma, Y. (1999, August). Pruning and summarizing the discovered associations. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 125-134). ACM.

    15. Kamsu-Foguem, Bernard, Fabien Rigal, and Félix

    Mauget. "Mining association rules for the quality

    improvement of the production process." Expert

    Systems with Applications 40.4 (2013): 1034-1045

16. Zheng, Z., Kohavi, R., & Mason, L. (2001, August). Real world performance of association rule algorithms. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 401-406). ACM.

17. Han, Jiawei, Jian Pei, Yiwen Yin, and Runying Mao. "Mining frequent patterns without candidate generation: A frequent-pattern tree approach." Data Mining and Knowledge Discovery 8, no. 1 (2004): 53-87.

    18. Dongre, Jugendra, Gend Lal Prajapati, and S. V.

    Tokekar. "The role of Apriori algorithm for finding

    the association rules in Data mining." In Issues and

    Challenges in Intelligent Computing Techniques

    (ICICT), 2014 International Conference on, pp. 657-

    660. IEEE, 2014.

    19. Weatherford, M. (2002). Mining for fraud.

    Intelligent Systems, IEEE , 17 (4), 4-6.

20. Appice, A., Ceci, M., Lanza, A., et al. "Discovery of spatial association rules in geo-referenced census data: a relational mining approach." Intelligent Data Analysis 7 (2003): 541-566.

21. M.-L. Antonie, O. R. Zaïane, and A. Coman. "Application of data mining techniques for medical image classification." In Second International ACM SIGKDD Workshop on Multimedia Data Mining, pp. 94-101, San Francisco, USA, August 2001.

    22. Adomavicius, G., & Tuzhilin, A. (2001). Expert-

    driven validation of rule-based user models in

    personalization applications. Data Mining and

    Knowledge Discovery , 5(1-2), 33-58

    23. Cortes, C., & Pregibon, D. (2001). Signature-

    based methods for data streams. Data Mining and

    Knowledge Discovery , 5(3), 167-182.

    24. Kamsu-Foguem, Bernard, Fabien Rigal, and Félix

    Mauget. "Mining association rules for the quality

    improvement of the production process." Expert

    Systems with Applications 40.4 (2013): 1034-1045.