Techniques and Technologies
TRANSCRIPT
8/19/2019
1. Introduction
Big data, a term coined by Roger
Magoulas from O'Reilly Media in 2005 [1],
represents massive data sets with a larger,
more varied, and more complex structure,
and with the challenges of storing, analyzing,
and visualizing them to extract meaningful results.
Big data analytics is the process of researching
massive amounts of data to reveal
hidden patterns and correlations.
Big data is generated from various
sources, such as astronomy, atmospheric science,
genomics, biogeochemistry, biological
science and research, life sciences, medical
records, scientific research, government,
natural disaster and resource management,
the private sector, military surveillance,
financial services, retail, social
networks, web logs, text, documents,
photography, audio, video, click streams,
search indexing, call detail records, POS
information, RFID, mobile phones, sensor networks, and telecommunications [2].
2. Overview of Big Data
2.1 Benefits
Following are the benefits of Big
data in various fields: better-targeted
marketing, more direct business insights,
client-based segmentation, recognition of
sales and market opportunities, automated decision making,
definitions of customer behaviors,
greater return on investment,
quantification of risks and market trends,
comprehension of business change, better
planning and forecasting, identification of
consumer behaviour and production yield
extension, predictive analytics on traffic
flows, and identification of threats from different
video, audio, and data feeds [4].
2.2 Potential of Big Data
McKinsey Global Institute specified the
potential of big data in the following main
areas [3].
Healthcare: It has three main pools
of big data.
(1) Clinical data: Optimal treatment pathways, computerized physician order
entry, transparency about medical data,
remote patient monitoring, and advanced
analytics applied to patient profiles.
(2) Pharmaceutical R&D data:
Predictive modelling for new drugs,
suggesting trial sites with large numbers of
potentially eligible patients and strong track
records, pharmacovigilance (discovering
adverse effects), and developing personalized
medicine.
(3) Activity (claims) and cost data:
Automated systems (e.g., machine learning
techniques such as neural networks) for
fraud detection and for checking the accuracy
and consistency of payors' claims, based on
real-world patient outcome data, to arrive at
fair economic compensation.
Public sector: In the public sector, five
main categories for big data are:
(1) Creating transparency: Making
data more accessible
(2) Enabling experimentation to
discover needs, expose variability, and
improve performance
(3) Segmenting populations to
customize actions
(4) Replacing/supporting human
decision making with automated algorithms
(5) Innovating new business models,
products, and services with big data
Retail: Here, five main categories are
(1) Marketing: Cross-selling,
location-based marketing, customer
behavioural segmentation, sentiment
analysis, and integrated promotion and pricing.
(2) Merchandising: Placement and
design optimization, price optimization.
(3) Operations: Performance
transparency, optimization of labour inputs,
automated time and attendance tracking, and
improved labour scheduling.
(4) Supply chain: Stock forecasting
by combining multiple datasets such as sales
histories, weather predictions, and seasonal
sales cycles.
(5) New business models: Price
comparison services, web-based markets.
Manufacturing:
(1) Research and development and product design
(2) Product lifecycle management
(3) Design to value
(4) Open innovation
(5) Supply chain
(6) Production: digital factory, sensor-driven operations
Personal location data: Smart
routing, geo targeted advertising or
emergency response, urban planning, new
business models.
Social network analysis:
Understanding user intelligence for more
targeted advertising, marketing campaigns
and capacity planning, customer behavior
and buying patterns, sentiment analytics.
2.3 Challenges and Obstacles
Major hurdles for the implementation of
Big data analytics are [4]:
(1) Data representation: Data representation
aims to make data more
meaningful for computer analysis and user
interpretation. Many datasets have certain
levels of heterogeneity in type, structure,
semantics, organization, granularity, and
accessibility.
(2) Redundancy reduction and data
compression: These are effective ways to reduce the
indirect cost of the entire system, on the premise that the
potential value of the data is not affected.
(3) Data life cycle management: A
data-importance principle related to
analytical value should be developed to
decide which data shall be stored and which
data shall be discarded.
(4) Data confidentiality: The
transactional dataset generally includes a set
of complete operating data that drives key
business processes. Such data contains
details at the lowest granularity and some
sensitive information, such as credit card
numbers.
creation, and availability for access and
delivery. It includes web site response time,
inventory availability analysis, transaction
execution, order tracking updates, and
product/service delivery.
3. Big Data Techniques and Technologies
3.1 Big data techniques
1. A/B testing [3]
Method: A control group is compared with a variety of test groups in order to determine what changes will improve a given objective variable.
Example: Determining what copy text, layouts, images, or colors will improve conversion rates on an e-commerce web site.

2. Association rule learning [5]
Method: A set of techniques to discover interesting relationships, i.e. "association rules", among variables in large databases.
Example: Market basket analysis, in which a retailer can determine which products are frequently bought together and use this information for marketing.

3. Data fusion and data integration [6]
Method: A set of techniques that integrate and analyze data from multiple sources in order to develop insights in ways that are more efficient and potentially more accurate than if they were developed by analyzing a single source of data.
Example: Data from social media, analyzed by natural language processing, can be combined with real-time sales data in order to determine what effect a marketing campaign is having on customer sentiment and purchasing behavior.

4. Data mining [7]
Method: A set of techniques to extract patterns from large datasets by combining methods from statistics and machine learning with database management. These techniques include cluster analysis, classification, and regression.
Example: Mining customer data to determine segments most likely to respond to an offer; mining human resources data to identify characteristics of the most successful employees.

5. Natural language processing [8]
Method: Uses computer algorithms to analyze human (natural) language.
Example: Using sentiment analysis on social media to determine how prospective customers are reacting to a branding campaign.

6. Predictive modeling [9]
Method: A mathematical model is created or chosen to best predict the probability of an outcome.
Example: Estimating the likelihood that a customer can be cross-sold another product.

7. Spatial analysis [10]
Method: Techniques that analyze the topological, geometric, or geographic properties encoded in a data set.
Example: How is consumer willingness to purchase a product correlated with location? How would a manufacturing supply chain network perform with sites in different locations?
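As a concrete illustration of technique 1 (A/B testing), the conversion rates of a control and a test group can be compared with a standard two-proportion z-test. This is a minimal sketch; the counts below are invented for demonstration, and the normal-approximation formula is the textbook one, not something taken from the cited sources:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# hypothetical counts: control layout 200/5000 conversions, test layout 260/5000
z = two_proportion_z(200, 5000, 260, 5000)
significant = abs(z) > 1.96  # |z| > 1.96 -> significant at the 5% level
```

With these made-up numbers the test layout's lift is statistically significant, which is the kind of decision an e-commerce A/B test is meant to support.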
3.2 Big data technologies
1. Big Table [11]
Overview: Big table is a distributed storage system for managing structured data at Google. It can reliably scale to petabytes of data and thousands of machines.
Application: More than 60 Google products, including Google Earth, Google Finance, Google web indexing, Orkut, and Google Analytics.

2. Cassandra [12]
Overview: Cassandra is a massively scalable open-source NoSQL database. Cassandra has a masterless "ring" design that is elegant, easy to set up, and easy to maintain. Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure.
Application: Accenture, eBay, Netflix, GoDaddy, Instagram, Reddit, Yahoo! Japan, NASA.

3. Hadoop [13]
Overview: Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Application: Amazon, AOL, Facebook, IBM, New York Times, Yahoo!, Microsoft, Google.
4. Association rule
An association is a rule of the
form LHS → RHS. The goal of association
rule discovery is to find associations among
items from a set of transactions, each of
which contains a set of items [5]. Generally,
the algorithm finds a subset of association
rules that satisfy certain constraints.
(1) Minimum support: The support
of a rule is defined as the support of the
item-set consisting of both the LHS and the
RHS. The support of an item-set is the
percentage of transactions in the transaction
set that contain the item-set. An item-set
with a support higher than a given minimum
support is called a frequent item-set.
(2) Minimum confidence: It is the
minimum ratio of the support of the rule to
the support of the LHS.
Most association rule algorithms
generate association rules in two steps:
(1) Generate all frequent item-sets.
(2) Construct all rules using these
item-sets.
Fig. 4.1 General block diagram of association rule mining.
4.1. Association rule in Big data
It has been experimentally
demonstrated that for support levels that
generate fewer than 100,000 rules, which is a
very conservative upper bound for humans
to sift through even considering pruning of
uninteresting rules, Apriori finishes on all
datasets in less than 1 minute. For support
levels that generate fewer than 1,000,000
rules, which are sufficient for prediction
purposes where data is loaded into RAM,
Apriori finishes processing in less than 10
minutes [14].
4.2. Association rule Algorithms
4.2.1 Apriori Algorithm
The Apriori algorithm finds frequent
item-sets from databases by iteration. In
each iteration i, the algorithm determines
the set of frequent patterns with i
items, and this set is used to generate the
set of candidate item-sets of the next
iteration. The iteration is repeated until no
candidate patterns can be discovered. It uses
a bottom-up approach, where frequent
subsets are extended one item at a time.
The input datasets are treated as
sequences (transactions) composed of more or
fewer items. The output of Apriori is a set of
rules explaining the links these items have in
their sets [15].
Apriori is an algorithm for finding frequent
item-sets using candidate generation. Given
a minimum required support 'S' as the
interestingness criterion [18]:
(1) Search for all individual elements
(1-element item-sets) that have a minimum
support of 'S'.
(2) From the results of the previous
search for i-element item-sets, search for all
(i+1)-element item-sets that have a
minimum support of 'S'. This becomes the
set of all frequent (i+1)-item-sets that are
interesting.
(3) Repeat step 2 until the item-set size
reaches its maximum.
Association rules are of the form
A → B, where A → B is different from
B → A. A → B implies that if a customer
purchases item A, then he also purchases item
B. For association rule mining, two
threshold values are required:
(1) Minimum support: Support is
the percentage of the population which
satisfies the rule; in other words, the
support for a rule R is the ratio of the
number of occurrences of R to the total
number of transactions.
The support of an association pattern
is the percentage of task-relevant data
transactions for which the pattern is true.
(2) Minimum confidence: The
confidence of a rule A → B is the ratio of
the number of occurrences of both A and B
to the number of occurrences of A.
Confidence is defined as the measure of
certainty or trustworthiness associated with
each discovered pattern A → B.
Association rules are generated as
follows:
(1) Use Apriori to generate item-sets
of different sizes.
(2) At each iteration, divide each
frequent item-set X into two parts, the
antecedent (LHS) and the consequent (RHS);
this represents a rule of the form
LHS → RHS.
(3) Discard all rules whose
confidence is less than the minimum
confidence.
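The level-wise search and the rule-generation steps above can be sketched in Python. This is a minimal brute-force illustration under assumed inputs (transactions given as sets of items), not the tuned implementation of [18]; all names are illustrative:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: frequent (i+1)-item-sets are built from frequent i-item-sets."""
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n
    # Step 1: frequent 1-element item-sets
    singles = {item for t in transactions for item in t}
    frequent = [{frozenset([i]) for i in singles if support({i}) >= min_support}]
    # Step 2: extend frequent i-item-sets by one item until none survive
    while frequent[-1]:
        candidates = {a | b for a in frequent[-1] for b in frequent[0] if not b <= a}
        frequent.append({c for c in candidates if support(c) >= min_support})
    return [s for level in frequent for s in level]

def rules(transactions, min_support, min_confidence):
    """Split each frequent item-set into LHS -> RHS and keep high-confidence rules."""
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n
    out = []
    for itemset in apriori(transactions, min_support):
        for k in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, k)):
                rhs = itemset - lhs
                conf = support(itemset) / support(lhs)
                if conf >= min_confidence:
                    out.append((set(lhs), set(rhs), conf))
    return out
```

Recomputing support for every candidate keeps the sketch short; a real implementation would count supports in a single pass per level.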
4.2.2 FP Algorithm
It generates all frequent item-sets
satisfying a given minimum support by
growing a frequent-pattern tree structure that
stores compressed information about the
frequent patterns. In this way, FP-growth
can avoid repeated database scans and also
avoid the generation of a large number of
candidate item-sets. FP-growth takes
transactional data in the form of one row for
each single complete transaction.
Implementations of FP-growth only generate
the frequent item-sets, not the
association rules [16]. The mining task as
well as the database is decomposed using a
divide-and-conquer strategy, and a
pattern-fragment growth method is used to avoid the
costly process of candidate generation and
testing required by the Apriori algorithm [15].
A frequent pattern tree is a structure
consisting of [17]
(1) One root labeled as “null”,
(2) A set of item-prefix subtrees as
the children of the root,
Support(A → B) = (Number of tuples with both A and B) / (Total number of tuples)

Confidence(A → B) = (Number of tuples with both A and B) / (Number of tuples with A)
(3) A frequent-item-header table.
Item-prefix subtrees: Each node in
the item-prefix subtree consists of three
fields: item-name, count, and node-link.
(1) Item-name: registers which
item this node represents.
(2) Count: registers the number
of transactions represented by the
portion of the path reaching this node.
(3) Node-link: links to the next
node in the FP-tree carrying the same
item-name, or null if there is none.
Frequent-item-header table: Each
entry consists of two fields:
(1) Item-name.
(2) Head of node-link (a pointer
to the first node in the FP-tree
carrying the item-name).
Fig. 4.2 FP tree structure
Algorithm for FP-tree construction:
Input: A transaction database DB
and a minimum support threshold ξ.
Output: FP-tree, the frequent-pattern
tree of DB.
Steps: (1) Scan the transaction
database DB once. Collect F, the set of
frequent items, and the support of each
frequent item. Then sort F in
support-descending order.
(2) Create the root of an FP-tree, T,
and label it as "null". For each transaction
Trans in DB, do the following.
(3) If T has a child N whose
item-name matches the current item, then
increment N's count by 1; else create a new
node N, with its count initialized to 1, its
parent link linked to T, and its node-link
linked to the nodes with the same item-name
via the node-link structure.
FP-growth is about an order of
magnitude faster than Apriori, especially
when the data set is dense (containing many
patterns) and/or when the frequent patterns
are long.
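The node fields and the insertion step described above can be sketched as follows. This is a minimal, assumed illustration of the tree-building phase only: transactions are expected to be pre-sorted in support-descending order, and the header table threads node-links as described; the example transactions are made up:

```python
class FPNode:
    """FP-tree node with the three fields described in the text."""
    def __init__(self, item, parent):
        self.item = item        # item-name ("null" for the root)
        self.count = 0          # transactions represented by the path to this node
        self.parent = parent    # link toward the root
        self.node_link = None   # next node carrying the same item-name, or None
        self.children = {}      # item-name -> child node

def insert_transaction(root, items, header):
    """Walk/extend one root-to-leaf path; maintain the frequent-item-header table."""
    node = root
    for item in items:  # items must already be sorted in support-descending order
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
            # thread the new node onto this item's node-link chain
            child.node_link = header.get(item)
            header[item] = child
        child.count += 1
        node = child

root, header = FPNode("null", None), {}
for t in [["f", "c", "a", "m", "p"], ["f", "c", "a", "b", "m"], ["f", "b"]]:
    insert_transaction(root, t, header)
```

After the three insertions, shared prefixes such as f-c-a are stored once with incremented counts, which is exactly the compression the text describes.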
4.2.3 Charm
Charm is an algorithm for generating
closed frequent item-sets for association
rules from transactional data. The closed
frequent item-sets are the smallest
representative subset of frequent item-sets
without loss of information. Charm takes
transactional data in the form of one row for
each single complete transaction. [16]
4.2.4 Magnum Opus
The main unique technique used in
Magnum Opus is the search algorithm based
on OPUS, a systematic search method with
pruning. It considers the whole search space,
but during the search, effectively prunes a
large area of search space without missing
search targets provided that the targets can
be measured using certain criteria. [16]
4.3 Real-life applications of association rules

1. Government sector: researchers of King's College London [19]
Problem statement: Fraud at Consignia, the UK's Post Office group.
Method applied: Use of "if…then" association rules, e.g. the normal-behaviour rule "IF time < 1200 AND item = stamps THEN $2 < cost < $4."
Outcome: Detectors that successfully spot abnormal transactions. They also copy themselves, so CIFD adapts itself to create detectors that correspond to the most prevalent patterns of fraud.

2. Government sector [20]
Problem statement: Issues concerning the accessibility of an urban area.
Method applied: Spatial association rule mining applied to geo-referenced U.K. census data of 1991.
Outcome: Helped in transportation planning in the area near the local Stepping Hill Hospital.

3. Health care sector [21]
Problem statement: Anomaly detection and classification in breast cancer.
Method applied: In training, the Apriori algorithm was applied and association rules were extracted. The support was set to 10% and the confidence to 0%.
Outcome: The success rate of the classifier was 69.11%. The time required for training was much less than for a neural network.

4. Retail sector [22]
Problem statement: Purchasing behavior of customers.
Method applied: On a dataset of 353,421 records from 1,903 households, about 1,022,812 association rules were generated for promotion sensitivity analysis, i.e., analysis of customer responses to various types of promotions, including advertisements, coupons, and various types of discounts.
Outcome: In a time duration of about 1.5 hours, about 2.6% of discovered rules were accepted and the rest were rejected, reducing the total from 537 to about 14 rules per household.

5. Telecom sector [23]
Problem statement: Which country pairs, triples, or quadruples customers are currently calling.
Method applied: Use of association rules, treating the top-k country item-set as a market basket for each account and exploiting the temporal nature of the data by using traffic from the last month as a baseline for the current month.
Outcome: Successful in detecting a high rate of fraud-call trends associated with adult entertainment services that move from country to country through time.

6. Manufacturing sector: VAM Drilling industries, France [24]
Problem statement: Setting up a system which provides results identical to human observation related to performance and dysfunctions during forging.
Method applied: Use of RuleGrowth, which mines sequential rules via FP-growth, varying the parameters minimum support and minimum confidence.
Outcome: Found the main dysfunction responsible for delay; found that the generator is the cause of exceeding the maximum time in the starting phase; the third major problem was the lack of effectiveness of metal strippers.
5. Big Table [11]
5.1 Data Model
Data is organized into three
dimensions: rows, columns, and timestamps.
We refer to the storage referenced by a
particular row key, column key, and
timestamp as a cell. In Webtable, we would
use URLs as row keys, various aspects of
web pages as column names, and store the
contents of the web pages in the contents:
column under the timestamps at which they
are fetched. Rows with consecutive keys are
grouped into tablets.

Fig. 5.1 Big table architecture

Fig. 5.2 Data model for Big table
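The three-dimensional (row key, column key, timestamp) → value map can be illustrated with a nested dictionary. This is only a conceptual sketch of the data model, with made-up keys, not how Bigtable stores data internally:

```python
# cell: (row key, column key, timestamp) -> value, as in the Webtable example
table = {}

def put(table, row, column, timestamp, value):
    """Store one versioned cell."""
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get_latest(table, row, column):
    """Cells are versioned by timestamp; return the newest value."""
    versions = table[row][column]
    return versions[max(versions)]

# URLs as row keys, page contents under the "contents:" column, versioned by fetch time
put(table, "com.cnn.www", "contents:", 1, "<html>v1</html>")
put(table, "com.cnn.www", "contents:", 2, "<html>v2</html>")
```

Grouping rows with consecutive keys into tablets then amounts to range-partitioning the outer dictionary's sorted keys.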
5.2 Building Blocks
Big-Table depends on a Google
cluster management system for scheduling
jobs, managing resources on shared
machines, monitoring machine status, and
dealing with machine failures.
The Google SSTable immutable-file
format is used internally to store Big-Table
data files. An SSTable provides a persistent,
ordered immutable map from keys to values,
where both keys and values are arbitrary
byte strings.
Big-Table uses Chubby for a variety
of tasks: to ensure that there is at most one
active master at any time; to store the
bootstrap location of Big-Table data; to
discover tablet servers and finalize tablet
server deaths; and to store Big-Table schemas. Chubby is a distributed lock
service. A Chubby service consists of five
active replicas, one of which is elected to be
the master and actively serve requests. The
service is live when a majority of the
replicas are running and can communicate
with each other.
5.3 Big Table Implementation
The Big-Table implementation has
three major components: a library that is
linked into every client, one master server,
and many tablet servers.
The master is responsible for
assigning tablets to tablet servers, detecting
the addition and expiration of tablet servers,
balancing tablet-server load, and garbage
collecting files. In addition, it handles schema changes such as table and column
family creations and deletions. Each tablet
server manages a set of tablets. The tablet
server handles read and write requests to the
tablets that it has loaded, and also splits
tablets that have grown too large. A Bigtable
cluster stores a number of tables. Each table
consists of a set of tablets, and each tablet
contains all of the data associated with a row
range.
5.3.1 Tablet Location: -
It uses a three-level hierarchy, the
first level is a file stored in Chubby that
contains the location of the root tablet. The
root tablet contains the locations of all of the
tablets of a special METADATA table. Each
METADATA tablet contains the location of
a set of user tablets. Secondary information like a log of all events pertaining to each
tablet (such as when a server begins serving
it) is also stored in METADATA. This
information is helpful for debugging and
performance analysis.
Fig. 5.3 Tablet location
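The three-level location hierarchy can be sketched as a chain of range lookups: the Chubby file names the root tablet, the root tablet points at METADATA tablets, and those point at user tablets. The structures and names below are invented purely for illustration:

```python
# Level 1: a Chubby file names the root tablet.
# Level 2: the root tablet maps row-key ranges to METADATA tablets.
# Level 3: each METADATA tablet maps row-key ranges to user tablets.
chubby_root_location = "root-tablet"
root_tablet = [("m", "meta-tablet-1"), ("z", "meta-tablet-2")]  # (end key, METADATA tablet)
metadata_tablets = {
    "meta-tablet-1": [("g", "user-tablet-A"), ("m", "user-tablet-B")],
    "meta-tablet-2": [("z", "user-tablet-C")],
}

def find_range(ranges, key):
    """Pick the first range whose end key covers the lookup key."""
    for end, name in ranges:
        if key <= end:
            return name
    raise KeyError(key)

def locate_tablet(row_key):
    meta = find_range(root_tablet, row_key)              # hop 1: root tablet
    return find_range(metadata_tablets[meta], row_key)   # hop 2: METADATA tablet
```

Two hops resolve any row key to a user tablet, which is why the METADATA indirection keeps the root tablet small.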
5.3.2 Tablet assignment
Each tablet is assigned to at most one
tablet server at a time. The master keeps
track of the set of live tablet servers, and the current assignment of tablets to tablet
servers, including which tablets are
unassigned. When a tablet is unassigned,
and a tablet server with sufficient room for
the tablet is available, the master assigns the
tablet by sending a tablet load request to the
tablet server. Bigtable uses Chubby to keep
track of tablet servers. When a tablet server
starts, it creates and acquires an exclusive
lock on a uniquely named file in a specific Chubby directory. The master monitors this
directory (the server’s directory) to discover
tablet servers.
The set of existing tablets changes only
when a table is created or deleted, two
existing tablets are merged to form one
larger tablet, or an existing tablet is split into
two smaller tablets. The master is able to
keep track of these changes because it initiates all but the last. Tablet splits are
treated specially since they are initiated by
tablet servers. A tablet server commits a
split by recording information for the new
tablet in the METADATA table. After
committing the split, the tablet server
notifies the master.
5.3.3 Tablet Serving: -
The persistent state of a tablet is
stored in GFS. Updates are committed to a
commit log that stores redo records. The
recently committed ones are stored in
memory in a sorted buffer called a
memtable. Older updates are stored in a
sequence of SSTables.
To recover a tablet, a tablet server
reads its metadata from the METADATA
table. This metadata contains the list of SSTables that comprise a tablet and a set of
redo points, which are pointers into any
commit logs that may contain data for the
tablet. The server reads the indices of the
SSTables into memory and reconstructs the
memtable by applying all of the updates that
have committed since the redo points.
When a write operation arrives at a
tablet server, the server checks that it is well-formed (i.e., not sent from a buggy or
obsolete client), and that the sender is
authorized to perform the mutation.
Authorization is performed by reading the
list of permitted writers from a Chubby file.
A valid mutation is written to the commit
log. After the write has been committed, its
contents are inserted into the memtable.
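The write and recovery paths described above (commit log, then memtable, with older updates in SSTables) can be sketched as a toy model. This is an assumed simplification that ignores redo points, authorization, and SSTable indices; it only shows how the redo log and memtable interact:

```python
class TabletSketch:
    """Toy model: commit log holds redo records; memtable holds recent writes."""
    def __init__(self):
        self.commit_log = []   # durable redo records
        self.memtable = {}     # recently committed writes, in memory
        self.sstables = []     # older updates, as immutable flushed maps

    def write(self, key, value):
        self.commit_log.append((key, value))  # committed to the log first
        self.memtable[key] = value            # then inserted into the memtable

    def flush(self):
        # freeze the memtable into an immutable SSTable
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            if key in table:
                return table[key]
        return None

    def recover(self):
        # replay redo records from the commit log to rebuild the memtable
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value
```

Because every write hits the log before the memtable, replaying the log after a crash reconstructs exactly the in-memory state that was lost.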
5.3.4. Schema Management
Bigtable schemas are stored in
Chubby. Chubby is an effective
communication substrate for Bigtable
schemas because it provides atomic whole-
file writes and consistent caching of small
files. For example, suppose a client wants to
delete some column families from a table.
The master performs access control checks,
verifies that the resulting schema is well-formed, and then installs the new schema by
rewriting the corresponding schema file in
Chubby. Whenever tablet servers need to
determine what column families exist, they
simply read the appropriate schema file from
Chubby, which is almost always available in
the server's Chubby client cache. Because
Chubby caches are consistent, tablet servers
are guaranteed to see all changes to that file.
6. Market Basket Analysis: -
Implementation and Results
In retail, each customer purchases a different
set of products, in different quantities, at
different times. Retailers use this
information to:
(1) Gain insight about their
merchandise (products):
Fast and slow movers
Products which are purchased
together
Products which might benefit
from promotion
(2) Take action:
Store layouts
Which products to put on
specials, promote, or offer coupons for.
6.1 Apriori Algorithm: -
The small database used to test this
algorithm is [18]
S.No.  Item 1     Item 2   Item 3
1      Bread      Butter   Milk
2      Ice-cream  Bread    Butter
3      Bread      Butter   Noodles
4      Bread      Noodles  Ice-cream
5      Butter     Milk     Bread
6      Bread      Noodles  Ice-cream
7      Milk       Butter   Bread
8      Ice-cream  Milk     Bread
9      Butter     Milk     Noodles
10     Noodles    Butter   Ice-cream
Table 6.1 Database for testing the Apriori algorithm
In the given dataset, every item occurs three
or more times and the total number of
transactions is ten, so:
Minimum support = 0.3
Item-set    Support
Bread       0.8
Butter      0.7
Noodles     0.5
Ice-cream   0.5
Milk        0.5
Table 6.2 Interestingness of 1-element item-sets
Item-sets             Support
{Bread, Butter}       0.5
{Bread, Milk}         0.4
{Bread, Noodles}      0.3
{Bread, Ice-cream}    0.4
{Butter, Milk}        0.4
{Butter, Noodles}     0.3
{Butter, Ice-cream}   0.2
{Noodles, Milk}       0.1
{Noodles, Ice-cream}  0.3
{Milk, Ice-cream}     0.1
Table 6.3 Interestingness of 2-element item-sets
Item-sets Support
{Bread, Butter, Milk} 0.3
{Bread, Ice-cream, Noodles} 0.2
{Bread, Butter, Noodles} 0.1
Table 6.4 Interestingness of 3-element item sets
The main advantage of the Apriori
algorithm is that it only uses the item-sets from
the previous iteration, not the whole dataset,
to generate candidates.
Rule                        Confidence (%)
{Bread} → {Butter, Milk}    37.5
{Butter} → {Bread, Milk}    42.9
{Milk} → {Bread, Butter}    60
{Bread, Butter} → {Milk}    60
{Bread, Milk} → {Butter}    75
{Butter, Milk} → {Bread}    75
Table 6.5 Rules based on the Apriori algorithm
If the minimum confidence threshold is 70
percent and the minimum support is 30
percent, then the discovered rules are
{Bread, Milk} → {Butter}
{Butter, Milk} → {Bread}
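The figures above can be re-checked directly from the ten transactions of Table 6.1; the short script below recomputes representative supports and rule confidences (note, from the data, that conf({Milk} → {Bread, Butter}) = 0.3 / 0.5 = 0.6):

```python
# the ten transactions of Table 6.1
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Ice-cream", "Bread", "Butter"},
    {"Bread", "Butter", "Noodles"},
    {"Bread", "Noodles", "Ice-cream"},
    {"Butter", "Milk", "Bread"},
    {"Bread", "Noodles", "Ice-cream"},
    {"Milk", "Butter", "Bread"},
    {"Ice-cream", "Milk", "Bread"},
    {"Butter", "Milk", "Noodles"},
    {"Noodles", "Butter", "Ice-cream"},
]

def support(itemset):
    """Fraction of transactions containing the whole item-set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """support(LHS and RHS) / support(LHS)."""
    return support(lhs | rhs) / support(lhs)

print(support({"Bread"}))                                   # 0.8
print(support({"Bread", "Butter"}))                         # 0.5
print(support({"Bread", "Butter", "Milk"}))                 # 0.3
print(round(confidence({"Bread", "Milk"}, {"Butter"}), 2))  # 0.75
print(round(confidence({"Milk"}, {"Bread", "Butter"}), 2))  # 0.6
```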
The algorithm was then run on a bakery
database. The database consisted of 50
different items and 75,000 receipts. The
minimum support was found to be 0.04. The
items were renamed using letters of the
English alphabet.
Item-sets Support
{A,AU} 0.0440
{D,S} 0.0434
{D,AJ} 0.0430
{E,J} 0.0431
{F,W} 0.0439
{Q,AG} 0.0435
{S,AH} 0.0531
{AB,AC} 0.0509
{AH,AQ} 0.0431
Table 6.6. Interestingness of 2-element item sets
Item-sets Support
{D,S,AJ} 0.0411
Table 6.7. Interestingness of 3-element item sets
6.2 FP Growth Algorithm: -
This algorithm was implemented on three
datasets: the previous two and a new one. The
smallest dataset used was [19]
S.No.  Item 1  Item 2  Item 3  Item 4
1      A       B
2      B       C       D
3      A       C       D       E
4      A       D       E
5      A       B       C
Table 6.8 Small dataset for FP algorithm
S.No.  E  B  C  D  A
1      0  1  0  0  1
2      0  1  1  1  0
3      1  0  1  1  1
4      1  0  0  1  1
5      0  1  1  0  1
FREQ.  2  3  3  3  4
Table 6.9 Ascending-order arrangement w.r.t. frequency
Fig. 6.1 FP Tree construction
Table 6.10 Conditional pattern base and conditional
FP tree generation
Frequent Pattern  Support Count
2-item sets:
E,A               2
E,D               2
B,A               2
C,A               2
C,B               2
D,A               2
D,C               2
3-item set:
E,A,D             2
Table 6.11 Item-sets generated
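The item-sets in Table 6.11 can be cross-checked by brute force against the five transactions of Table 6.8. The sketch below counts every candidate pair and triple and keeps those meeting the minimum support count of 2 assumed in this worked example:

```python
from itertools import combinations

# the five transactions of Table 6.8
transactions = [{"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"},
                {"A", "D", "E"}, {"A", "B", "C"}]
min_count = 2  # minimum support count assumed for this example

items = sorted(set().union(*transactions))
frequent = {}
for size in (2, 3):
    for candidate in combinations(items, size):
        # support count = number of transactions containing the whole candidate
        count = sum(set(candidate) <= t for t in transactions)
        if count >= min_count:
            frequent[candidate] = count
```

The surviving sets are exactly the seven pairs and the single triple {E, A, D} listed in Table 6.11.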
Similarly, the FP-growth algorithm
was implemented on the other two databases,
and results identical to those given by the
Apriori algorithm were obtained.
REFERENCES: -
1. G. Halevi and H. Moed. "The evolution of big data as a
research and scientific topic: Overview of the
literature." Res. Trends (2012): 3-6.
2. http://en.wikipedia.org/wiki/Big_data
3. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs,
C. Roxburgh and A.H. Byers, "Big data: The next
frontier for innovation, competition, and
productivity", McKinsey Global Institute, 2011.
4. Chen, Min, Shiwen Mao, and Yunhao Liu. "Big
data: A survey." Mobile Networks and Applications
19.2 (2014): 171-209
5. Agrawal, Rakesh, Tomasz Imielinski, and Arun
Swami. "Mining association rules between sets of
items in large databases." In ACM SIGMOD Record,
vol. 22, no. 2, pp. 207-216. ACM, 1993.
6. Lohr, Steve. "The age of big data." New York Times
11 (2012).
7. Rygielski, Chris, Jyun-Cheng Wang, and David C.
Yen. "Data mining techniques for customer
relationship management." Technology in society 24,
no. 4 (2002): 483-502.
8. Hennig-Thurau, Thorsten, Edward C. Malthouse,
Christian Friege, Sonja Gensler, Lara Lobschat, Arvind
Rangaswamy, and Bernd Skiera. "The impact of new
media on customer relationships." Journal of Service Research 13, no. 3 (2010): 311-330.
9. Kamakura, Wagner A., Michel Wedel, Fernando
De Rosa, and Jose Afonso Mazzon. "Cross-selling
through database marketing: A mixed data factor
analyzer for data augmentation and prediction."
International Journal of Research in marketing 20,
no. 1 (2003): 45-65.
10. Meixell, Mary J., and Vidyaranya B. Gargeya.
"Global supply chain design: A literature review and
critique." Transportation Research Part E: Logistics
and Transportation Review 41, no. 6 (2005): 531-
550.
11. Chang, Fay, Jeffrey Dean, Sanjay Ghemawat,
Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows,
Tushar Chandra, Andrew Fikes, and Robert E.
Gruber. "Bigtable: A distributed storage system for
structured data." ACM Transactions on Computer
Systems (TOCS) 26, no. 2 (2008): 4.
12. Apache Cassandra 2.1 Documentation, October 27, 2015.
13. Shvachko, Konstantin, Hairong Kuang, Sanjay
Radia, and Robert Chansler. "The hadoop distributed
file system." In Mass Storage Systems and
Technologies (MSST), 2010 IEEE 26th Symposium on,
pp. 1-10. IEEE, 2010.
14. Liu, B., Hsu, W., & Ma, Y. (1999, August). Pruning
and summarizing the discovered associations. In
Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
(pp. 125-134). ACM.
15. Kamsu-Foguem, Bernard, Fabien Rigal, and Félix
Mauget. "Mining association rules for the quality
improvement of the production process." Expert
Systems with Applications 40.4 (2013): 1034-1045
16. Zheng, Z., Kohavi, R., & Mason, L. (2001, August).
Real world performance of association rule
algorithms. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge
discovery and data mining (pp. 401-406). ACM.
17. Han, Jiawei, Jian Pei, Yiwen Yin, and Runying
Mao. "Mining frequent patterns without candidate
generation: A frequent-pattern tree approach." Data
mining and knowledge discovery 8, no. 1 (2004): 53-
87.
18. Dongre, Jugendra, Gend Lal Prajapati, and S. V.
Tokekar. "The role of Apriori algorithm for finding
the association rules in Data mining." In Issues and
Challenges in Intelligent Computing Techniques
(ICICT), 2014 International Conference on, pp. 657-
660. IEEE, 2014.
19. Weatherford, M. (2002). Mining for fraud.
Intelligent Systems, IEEE , 17 (4), 4-6.
20. Appice A, Ceci M, Lanza A, et al. Discovery of
spatial association rules in geo-referenced census
data: a relational mining approach. Intell Data Anal 2003; 7: 541-566.
21. M.-L. Antonie, O. R. Zaïane, and A. Coman.
Application of data mining techniques for medical
image classification. In Second International ACM
SIGKDD Workshop on Multimedia Data Mining,
pages 94 –101, San Francisco, USA, August 2001.
22. Adomavicius, G., & Tuzhilin, A. (2001). Expert-
driven validation of rule-based user models in
personalization applications. Data Mining and
Knowledge Discovery , 5(1-2), 33-58
23. Cortes, C., & Pregibon, D. (2001). Signature-
based methods for data streams. Data Mining and
Knowledge Discovery , 5(3), 167-182.
24. Kamsu-Foguem, Bernard, Fabien Rigal, and Félix
Mauget. "Mining association rules for the quality
improvement of the production process." Expert
Systems with Applications 40.4 (2013): 1034-1045.