covers the following pages chapter 4: data...
Post on 26-Apr-2018
220 Views
Preview:
TRANSCRIPT
1
Dr. Chen, Data Base Management
Chapter 4:
Data Mining
Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99258
chen@jepson.gonzaga.edu
Dr. Chen, Business Intelligence
2
Covers the following pages
• pp. 145 – 171
• p. 176 (Decision Trees only)
• pp. 186 – 195
• AC 4.6 (p.189-192)
• NO ECC
Dr. Chen, Business Intelligence 3
Learning Objectives
• Define data mining as an enabling technology for
business intelligence
• Understand the objectives and benefits of business
analytics and data mining
• Recognize the wide range of applications of data
mining
• Learn the standardized data mining processes
CRISP-DM
SEMMA
KDD (Continued…) Dr. Chen, Business Intelligence
4
Learning Objectives
• Understand the steps involved in data
preprocessing for data mining
• Learn different methods and algorithms of data
mining
• Build awareness of the existing data mining
software tools
Commercial versus free/open source
• Understand the pitfalls and myths of data mining
Dr. Chen, Business Intelligence 5
Questions for the Opening Vignette
1. Why should retailers, especially omni-channel retailers, pay extra
attention to advanced analytics and data mining?
2. What are the top challenges for multi-channel retailers? Can you think
of other industry segments that face similar problems/challenges?
3. What are the sources of data that retailers such as Cabela’s use for
their data mining projects?
4. What does it mean to have a “single view of the customer”? How can
it be accomplished?
5. What type of analytics help did Cabela’s get from their efforts? Can
you think of any other potential benefits of analytics for large-scale
retailers like Cabela’s?
6. What was the reason for Cabela’s to bring together SAS and
Teradata, the two leading vendors in the analytics marketplace?
7. What is in-database analytics, and why would you need it?
Dr. Chen, Business Intelligence 6
• 1. Why should retailers, especially omni-channel retailers, pay extra
attention to advanced analytics and data mining?
• Utilizing large and information-rich transactional and customer data
(that they collect on a daily basis) to optimize their business processes is
not a choice for large-scale retailers anymore, but a necessity to stay
competitive.
• 2. What are the top challenges for multi-channel retailers? Can you
think of other industry segments that face similar problems?
• The retail industry is amongst the most challenging because of the
change that they have to deal with constantly. Understanding customer
needs, wants, likes, and dislikes is an ongoing challenge. As the volume
and complexity of data increase, so does the time spent on preparing and
analyzing it.
• Prior to the integration of SAS and Teradata, data for modeling and
scoring customers was stored in a data mart. This process required a
large amount of time to construct, bringing together disparate data
sources and keeping statisticians from working on analytics.
2
Dr. Chen, Business Intelligence 7
• 3. What are the sources of data that retailers such as
Cabela’s use for their data mining projects?
• Cabela’s uses large and information-rich transactional and
customer data (that they collect on a daily basis). In
addition, through Web mining they track clickstream
patterns of customers shopping online.
• 4. What does it mean to have “a single view of the
customer”? How can it be accomplished?
• Having a single view of the customer means treating the
customer as a single entity across whichever channels the
customer utilizes. Shopping channels include brick-and-
mortar, television, catalog, and e-commerce (through
computers and mobile devices). Achieving this single view
helps to better focus marketing efforts and drive increased
sales.
Dr. Chen, Business Intelligence
8
• 5. What type of analytics help did Cabela’s get from their efforts?
Can you think of any other potential benefits of analytics for large-
scale retailers like Cabela’s?
• Cabela’s has long relied on SAS statistics and data mining tools to
help analyze the data it gathers from sales transactions, market
research, and demographic data associated with its large database of
customers. Using SAS data mining tools, Cabela’s analysts create
predictive models to optimize customer selection for all customer
contacts. Cabela’s uses these prediction scores to maximize
marketing spending across channels and within each customer’s
personal contact strategy.
• These efforts have allowed Cabela’s to continue its growth in a
profitable manner. In addition, dismantling the information silos, and
integration of SAS and Teradata, enabled them to create “a holistic
view of the customer.” Since this works so well for the
sales/customer side of the business, it could also work in other areas
as well. Supply chain is one example. Analytics could help produce a
“holistic view of the vendor” as well.
Dr. Chen, Business Intelligence 9
• 6. What was the reason for Cabela’s to bring
together SAS and Teradata, the two leading
vendors in the analytics marketplace?
• Cabela’s was already using both for different
elements of their business. Each of the two systems
was producing actionable analysis of data.
• But by being separate, too much time was required
to construct data marts, bringing together disparate
data sources and keeping statisticians from working
on analytics. Now, with the integration of the two
systems, statisticians can leverage the power of
SAS using the Teradata warehouse as one source of
information.
Dr. Chen, Business Intelligence 10
• 7. What is in-database analytics, and why would
you need it?
• In-database analytics refers to the practice of
applying analytics directly to a database or data
warehouse rather than the traditional practice of
first transforming into the analytics application’s
data format. The time it takes to transform
production data into a data warehouse format can
be very long. In-database analytics eliminates this
need.
Dr. Chen, Business Intelligence 11
Event Driven Alert - A Scenario
• Three transactions were posted in a credit
account while the account holder was
traveling in summer 2010:
Ranch 99 San Jose, California $102.33 Aug. 1, 2010
Exxon Austin, Texas $99.12 August 3, 2010
Exxon Houston, Texas $120.44 August 5, 2010
• What action will you (the credit company)
take and why?
• How the scenario can be detected?
Dr. Chen, Business Intelligence 12
Q/Quotes
• “What will be the killer applications in the
corporation?” – an “age-old” question
• “Data Mining”
replied by “Dr. Penzias” (Nobel laureate and former chief
scientist of Bell Labs
• “Data mining will become much more important and
companies will throw away nothing about their
customers because it will be so valuable. If you’re
not doing this, you’re out of business”
• “The latest strategic weapon for companies is
analytical decision making (e.g., data mining) …”
Thomas Davenport (2006) in Harvard Business Review.
3
Dr. Chen, Business Intelligence 13
Database vs. Datawarehouse
DBMS Database
Datawarehouse ???
Dr. Chen, Business Intelligence 14
Database vs. Datawarehouse
DBMS Database
Datawarehouse
Data Mining
Dr. Chen, Business Intelligence 15
What is Data Mining?
• Data mining – the process of analyzing data to extract
information (unknown patterns) not offered by the raw
data alone
• To perform data mining users need data-mining tools
Data-mining tool – uses a variety of techniques to find patterns
and relationships in large volumes of information and infers
rules that predict future behavior and guide decision making
A wide range of data mining techniques are being used by
organization to gain a better understanding of their customers
and their operations and to solve complex organizational
problems.
• An example
Grocery Store in UK (see next slide) Dr. Chen, Business Intelligence
16
CRM and Data Mining (BI)Example • A Grocery store in U.K. with the following “patterns” found:
• Every Thursday afternoon
• Young Fathers (why?) shopping at store
• Two of the followings are always included in their shopping list
Diapers and
Beers
• What other decisions should be made as a store manager (in terms
of store layout)?
• Short term vs. Long term
This is an example of cross-selling
Other types of promotion: up-sell, bundled-sell
• IT (e.g., BI) helps to find valuable information then decision
makers make a timely/right decision for improving/creating
competitive advantages.
Dr. Chen, Business Intelligence 17
Q/A
• Can the “patterns” in the grocery store
example be produced from its Database?
• Y/N
• Why?
• It only can be produced from its “Data
Warehouse” using a kind of “data mining”
software.
17
Dr. Chen, Business Intelligence 18
Why Data Mining?
• More intense competition at the global scale
• Recognition of the value in data sources
• Availability of quality data on customers, vendors,
transactions, Web, etc.
• Consolidation and integration of data repositories
into data warehouses
• The exponential increase in data processing and
storage capabilities; and decrease in cost
• Movement toward conversion of information
resources into nonphysical form
4
Dr. Chen, Business Intelligence 19
Definitions of Data Mining –
Summary • Technical def.: a process that uses statistical, mathematical,
and artificial intelligence techniques to extract and identify
useful information and subsequent knowledge (or patterns)
from large sets of data.
• General def.: the nontrivial process of identifying valid,
novel, potentially useful, and ultimately understandable
patterns in data stored in structured databases - Fayyad et al., (1996)
• Keywords in this definition: Process, nontrivial, valid, novel,
potentially useful, understandable
• Other names: knowledge extraction, pattern analysis,
knowledge discovery, information harvesting, pattern
searching, data dredging.
Dr. Chen, Business Intelligence 20
Sta
tistic
s
Management Science &
Information Systems
Artificial Intelligence
Databases
Pattern
Recognition
Machine
Learning
Mathematical
Modeling
DATA
MINING
Data Mining at the Intersection of Many Disciplines
Fig 4. 1 Data Mining as a Blend of Multiple Disciplines
Dr. Chen, Business Intelligence 21
• Source of data for DM is often a consolidated data
warehouse (not always!).
• DM environment is usually a client-server or a Web-
based information systems architecture.
• Data is the most critical ingredient for DM which may
include soft/unstructured data.
• The miner is often an end user.
• Striking it rich requires creative thinking.
• Data mining tools’ capabilities and ease of use are
essential (Web, Parallel processing, etc.).
Data Mining
Characteristics/Objectives
Dr. Chen, Business Intelligence 22
Data in Data Mining
• Data: a collection of facts usually obtained as the result of
experiences, observations, or experiments.
• Data may consist of numbers, words, images, …
• Data: lowest level of abstraction (from which information and
knowledge are derived). - DM with different
data types?
- Other data types? Data
Categorical Numerical
Nominal Ordinal Interval Ratio
StructuredUnstructured or
Semi-Structured
MultimediaTextual HTML/XML
martial status: (1)
single, (2)
married, (3)
divorced
rank order: credit
score: (1) low, (2)
medium, (3) high
continuous data
temperature on the
Celsius scale: 1/100 ratio between a
magnitude of a
continuous quantity
discrete data
Fig 4.3 A Simple Taxonomy of Data in
Data Mining
Dr. Chen, Business Intelligence 23
What Does DM Do?
How Does it Work?
• DM builds models to extract/identify patterns
from data
Pattern?
A mathematical (numeric and/or symbolic) relationship
among data items
• Types of patterns
1. Prediction
2. Association
3. Cluster (segmentation)
4. Sequential (or time series) relationships Dr. Chen, Business Intelligence
24
Fig 4.3 A Simple Taxonomy for Data Mining Tasks
Data Mining
Prediction
Classification
Regression
Clustering
Association
Link analysis
Sequence analysis
Learning Method Popular Algorithms
Supervised
Supervised
Supervised
Unsupervised
Unsupervised
Unsupervised
Unsupervised
Decision trees, ANN/MLP, SVM, Rough
sets, Genetic Algorithms
Linear/Nonlinear Regression, Regression
trees, ANN/MLP, SVM
Expectation Maximization, Apriory
Algorithm, Graph-based Matching
Apriory Algorithm, FP-Growth technique
K-means, ANN/SOM
Outlier analysis Unsupervised K-means, Expectation Maximization (EM)
Apriory, OneR, ZeroR, Eclat
Classification and Regression Trees,
ANN, SVM, Genetic Algorithms
No models or hypothesis exists (e.g., with training data)
With models or hypothesis
e.g., decision trees
e.g., Market-Basket Analysis
5
Dr. Chen, Business Intelligence 25
How do BI Tools Obtain Data?
On
lin
e
Da
taS
tore
BS
tore
A
Operational/
external DB
Database
in Data
Warehouse
Metadata
Data warehouse
Extraction
Transformation
Cleansing
Loading
Summarization
Data reconciliation
process
Basic Reporting
RFM Analysis
OLAP
Data mining
Web clickstream
Web analytics
Decision support
Data
mart
Data
mart
Data
mart
Data marts
Dr. Chen, Business Intelligence 26 Fig Extra Convergence Disciplines for Data Mining
Businesses use statistical techniques to find patterns and relationships
among data and use it for classification and prediction. Data mining
techniques are a blend of statistics and mathematics, artificial intelligence
(AI) and machine-learning.
What are typical data-mining applications?
Data Warehouse
Dr. Chen, Business Intelligence 27
What are typical data-mining applications?
• Data mining is an automated process of discovery and extraction of hidden
and/or unexpected patterns of collected data in order to create models for decision
making that predict future behavior based on analyses of past activity.
• There are two types of data-mining techniques:
Unsupervised data-mining characteristics: No model or hypothesis exists before running the analysis (with “training data” set)
Analysts apply data-mining techniques and then observe the results
Analysts create a hypotheses after analysis is completed
Cluster analysis, a common technique in this category groups entities together that
have similar characteristics
Supervised data-mining characteristics:
Analysts develop a model prior to their analysis
Classification and Decision tree
Apply statistical techniques to estimate parameters of a model
Regression analysis is a technique in this category that measures the impact of a set
of variables on another variable
Neural networks predict values and make classifications.
Used for making predictions
Dr. Chen, Business Intelligence 28
Decision Tree Example for MIS Classes (hypothetical data)
• A decision tree is a hierarchical arrangement of criteria that predicts a
classification or value. It’s an unsupervised data-mining technique that
selects the most useful attributes for classifying entities on some criterion.
It uses if…then rules in the decision process. Here are two examples.
Fig Decision Tree Examples for MIS Class (Hypothetical Data)
If student is a junior and works in a
restaurant, then predict grade 3.0
If student is a senior and is a nonbusiness
major, then predict grade 3.0
If student is a junior and does not work in
a restaurant, then predict grade 3.0
If student is a senior and is a business
major, then make _____ prediction
>
< ---
< ---
no
Dr. Chen, Business Intelligence 29
What are typical data-mining applications?
• A decision tree is a hierarchical arrangement of criteria that predicts a
classification or value. It’s a supervised data-mining technique that selects
the most useful attributes for classifying entities on some criterion. It uses
if…then rules in the decision process. Here are two examples.
Extra Figure: Grades of Students from Past MIS Class (Hypothetical Data)
Fig Credit Score Decision Tree
>
Dr. Chen, Business Intelligence 30
Market-Basket Analysis is a unsupervised data-mining tool for determining sales
patterns. It helps businesses create cross-selling opportunities (i.e., buying relevant
products together). Two terms used with this type of analysis are:
Support: the probability that two items will be purchased together (e.g., Fins and Mask will be
purchased together)
Confidence: a conditional probability estimate (e.g., proportion of the customers who bought a
mask also bought fins)
Lift: ratio of confidence to the base probability (e.g., ratio between customers of buying fins
after buying mask and those buying fins of walking into the store) The more the lift value is
close to 1 the more independent between these item (i.e., unrelated).
A: Fins
B: Mask
Lift is almost double
Extra Figure: Market-Basket Example
P(Fins)=280/1000=.28
(customers walking into
the store and buying fins)
You will use
RapidMiner to
explore this
scenario in the
HW.
6
Dr. Chen, Business Intelligence 31
Other Market Basket Application
• 75% of customers who buy Coke also by
corn chips
Good for CRM analysis
Dr. Chen, Business Intelligence 32
DM Capabilities Description Example
Associations/Affinity
(Unsupervised):
Association between items
Discover rules that
correlate one set of events
or items with another set
of events or items.
Market Basket Analysis:
75% of customers who buy Coke also buy corn
chips (good for CRM analysis)
Clustering:
Grouping items according to
statistical similarities
(Unsupervised)
Create partitions so that
all members of each set
are similar according to
some metric or set of
metrics.
Customer Segmentation:
Meals charged on a business-issued gold card are
typically purchased on weekdays and have a
mean value of greater than $250, whereas meals
purchased using a personal platinum card occur
predominately on weekends, have a mean value
of $175 and include a bottle of wine more than
65% of the time.
Sequence/Temporal
Patterns (Supervised):
Time-based Affinity
(Statistical Analysis)
Relate events in time
based on a series of
preceding events.
Time-Based Analysis:
60% of customers buy TVs followed by digital
camcorders
Classification:
Assigns new records to
existing classes
(Supervised)
Discover rules that define
whether an item or event
belongs to a particular
subset or class of data
Decision Tree Analysis (Customer
Segmentation):
Customers with excellent credit history have a
debt/equity ratio of less than 10%
What are typical data-mining applications?
Dr. Chen, Business Intelligence 33
Other Data Mining Tasks
• These are in addition to the primary DM tasks (prediction,
association, clustering)
Time-series forecasting
Data can be used to develop models to extrapolate the future values
of the same phenomenon.
Part of sequence or link analysis?
Visualization
Another data mining task that can be used with other DM
techniques to gain a clearer understanding of underlying
relationships.
• Types of DM
Hypothesis-driven data mining
– begin with a proposition by the user.
Discovery-driven data mining
– Uncover hidden patterns, associations, relationships by different
means Dr. Chen, Business Intelligence
34
Data Mining Applications
• Customer Relationship Management (CRM)
Grocery example in UK
• Banking & Other Financial
• Retailing and Logistics
• Manufacturing and Maintenance
• Brokerage and Securities Trading
• Insurance
Dr. Chen, Business Intelligence 35
Data Mining Applications (cont.)
• Computer hardware and software
• Science and engineering
• Government and defense
• Homeland security and law enforcement
• Travel industry
• Healthcare
• Medicine
• Entertainment industry
• Sports
• Etc.
Highly popular application
areas for data mining
Dr. Chen, Business Intelligence 36
Data Mining Process
• A manifestation of best practices
• A systematic way to conduct DM projects
• Different groups have different versions
• Most common standard processes:
1. CRISP-DM (Cross-Industry Standard Process for Data
Mining)
2. SEMMA (Sample, Explore, Modify, Model, and
Assess)
3. KDD (Knowledge Discovery in Databases)
7
Dr. Chen, Business Intelligence 37
Data Mining Process
Fig. 4.7 Ranking of Data Mining Methodologies/Processes. Source: KDNuggets.com
Dr. Chen, Business Intelligence 38
1. Data Mining Process: CRISP-DM
Data Sources
Business
Understanding
Data
Preparation
Model
Building
Testing and
Evaluation
Deployment
Data
Understanding
6
1 2
3
5
4
Fig 4.4 The Six Step CRISP-DM Data Mining Process
Dr. Chen, Business Intelligence 39
Data Mining Process: CRISP-DM
Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation (!)
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment
• The process is highly repetitive and
experimental (DM: art versus science?)
Accounts for
~85% of total
project time
How is it
related to
SDLC?
Dr. Chen, Business Intelligence 40
Data Preparation – A Critical DM Task
Data Consolidation
Data Cleaning
Data Transformation
Data Reduction
Well-formed
Data
Real-world
Data
· Collect data
· Select data
· Integrate data
· Impute missing values
· Reduce noise in data
· Eliminate inconsistencies
· Normalize data
· Discretize/aggregate data
· Construct new attributes
· Reduce number of variables
· Reduce number of cases
· Balance skewed data
See Table 4.4 for more details (p.153).
Fig 4.5 Data Preprocessing Steps
Dr. Chen, Business Intelligence 41
2. Data Mining Process: SEMMA
Sample(Generate a representative
sample of the data)
Modify(Select variables, transform
variable representations)
Explore(Visualization and basic
description of the data)
Model(Use variety of statistical and
machine learning models )
Assess(Evaluate the accuracy and
usefulness of the models)
SEMMA
Fig 4.6 SEMMA Data Mining Process Dr. Chen, Business Intelligence
42
3. Knowledge Discovery in Databases (KDD)
• KDD is a process of using data mining methods to find
useful information and patterns in the data, as opposed
to data mining, which involves using algorithms to
identify patterns in data derived through the KDDS
PROCESS.
It is a comprehensive process that encompasses data mining.
• KDD consists of the following steps (Dunham (2003):
1) data selection,
2) data preprocessing,
3) data transformation,
4) data mining, and
5) interpretation/evaluation.
8
Dr. Chen, Business Intelligence 43
Data Mining Methods: Classification
• Most frequently used DM method
• Part of the machine-learning family
• Employ supervised learning
• Learn from past data, classify new data
• The output variable is categorical (nominal or
ordinal) in nature
• Classification versus regression?
• Classification versus clustering?
Dr. Chen, Business Intelligence 44
• Predictive accuracy
Hit rate
• Speed
Model building; predicting
• Robustness
• Scalability
• Interpretability
Transparency, explainability
Assessment Methods for Classification
Dr. Chen, Business Intelligence 45
Accuracy of Classification Models
• In classification problems, the primary source for
accuracy estimation is the confusion matrix
True
Positive
Count (TP)
False
Positive
Count (FP)
True
Negative
Count (TN)
False
Negative
Count (FN)
True Class
Positive Negative
Po
sitiv
eN
eg
ativ
e
Pre
dic
ted
Cla
ss
FNTP
TPRatePositiveTrue
FPTN
TNRateNegativeTrue
FNFPTNTP
TNTPAccuracy
FPTP
TPrecision
P
FNTP
TPcallRe
Fig 4.8 A simple Confusion Matrix for Tabulation of Two-Class Classification Results
Dr. Chen, Business Intelligence 46
Estimation Methodologies for Classification
• Simple split (or holdout or test sample estimation)
Split the data into 2 mutually exclusive sets training
(~70%) and testing (30%)
For ANN, the data is split into three sub-sets (training
[~60%], validation [~20%], testing [~20%])
Preprocessed
Data
Training Data
Testing Data
Model
Development
Model
Assessment
(scoring)
2/3
1/3
Classifier
Prediction
Accuracy
Fig 4.9 Simple Random Data Splitting
Dr. Chen, Business Intelligence 47
Estimation Methodologies for Classification
• k-Fold Cross Validation (rotation estimation)
Split the data into k mutually exclusive subsets
Use each subset as testing while using the rest of the
subsets as training
Repeat the experimentation for k times
Aggregate the test results for true estimation of
prediction accuracy training
• Other estimation methodologies
Leave-one-out, bootstrapping, jackknifing
Area under the ROC curve
Dr. Chen, Business Intelligence 48
Estimation Methodologies for
Classification – ROC Curve
10.90.80.70.60.50.40.30.20.10
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
1
0.9
0.8
False Positive Rate (1 - Specificity)
Tru
e P
ositi
ve R
ate
(Sen
sitiv
ity) A
B
C
9
Dr. Chen, Business Intelligence 49
Classification Techniques
• Decision tree analysis
• Statistical analysis
• Neural networks
• Support vector machines
• Case-based reasoning
• Bayesian classifiers
• Genetic algorithms
• Rough sets
Dr. Chen, Business Intelligence 50
Decision Trees
• Employs the and method
• Recursively divides a training set until each
division consists of examples from one class
1. Create a root node and assign all of the training data
to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the
split. Split the data into mutually exclusive subsets
along the lines of the specific split.
4. Repeat the steps 2 and 3 for each and every leaf
node until the stopping criteria is reached.
A general
algorithm
for
decision
tree
building
conquer divide
Dr. Chen, Business Intelligence 51
Decision Trees
• DT algorithms mainly differ on
Splitting criteria
Which variable to split first?
What values to use to split?
How many splits to form for each node?
Stopping criteria
When to stop building the tree
Pruning (generalization method)
Pre-pruning versus post-pruning
• Most popular DT algorithms include
ID3, C4.5, C5; CART; CHAID; M5
Dr. Chen, Business Intelligence 52
Decision Trees
• Alternative splitting criteria
Gini index determines the purity of a specific
class as a result of a decision to branch along a
particular attribute/value
Used in CART
Information gain uses entropy to measure the
extent of uncertainty or randomness of a particular
attribute/value split
Used in ID3, C4.5, C5
Chi-square statistics (used in CHAID)
Dr. Chen, Business Intelligence 53
Cluster Analysis for Data Mining
• Used for automatic identification of natural groupings of things
• Part of the machine-learning family
• Employs unsupervised learning
• Learns the clusters of things from past data, then assigns new instances
• There is not an output variable
• Also known as segmentation
Dr. Chen, Business Intelligence 54
Cluster Analysis for Data Mining
• Clustering results may be used to
Identify natural groupings of customers
Identify rules for assigning new cases to classes for targeting/diagnostic purposes
Provide characterization, definition, labeling of populations
Decrease the size and complexity of problems for other data mining methods
Identify outliers in a specific domain (e.g., rare-event detection)
10
Dr. Chen, Business Intelligence 55
Cluster Analysis for Data Mining
• Analysis methods
Statistical methods (including both hierarchical and
nonhierarchical), such as k-means, k-modes, and so
on
Neural networks (adaptive resonance theory [ART],
self-organizing map [SOM])
Fuzzy logic (e.g., fuzzy c-means algorithm)
Genetic algorithms
Dr. Chen, Business Intelligence 56
Cluster Analysis for Data Mining
• How many clusters?
There is no “truly optimal” way to calculate it
Heuristics are often used
Look at the sparseness of clusters
Number of clusters = (n/2)1/2 (n: no of data points)
Use Akaike information criterion (AIC)
Use Bayesian information criterion (BIC)
• Most cluster analysis methods involve the use of a
distance measure to calculate the closeness
between pairs of items.
Euclidian versus Manhattan (rectilinear) distance
Dr. Chen, Business Intelligence 57
Cluster Analysis for Data Mining
• k-Means Clustering Algorithm
k : pre-determined number of clusters
Algorithm (Step 0: determine value of k)
Step 1: Randomly generate k random points as initial
cluster centers.
Step 2: Assign each point to the nearest cluster center.
Step 3: Re-compute the new cluster centers.
Repetition step: Repeat steps 3 and 4 until some
convergence criterion is met (usually that the
assignment of points to clusters becomes stable).
Dr. Chen, Business Intelligence 58
Cluster Analysis for Data Mining -
k-Means Clustering Algorithm
Step 1 Step 2 Step 3
Fig 4.11 A Graphical illustration of the Steps in k-Means Algorithm
Dr. Chen, Business Intelligence 59
Association Rule Mining
• A very popular DM method in business
• Finds interesting relationships (affinities) between variables (items or events)
• Part of machine learning family
• Employs unsupervised learning
• There is no output variable
• Also known as market basket analysis
• Often used as an example to describe DM to ordinary people, such as the famous “relationship between diapers and beers!”
Dr. Chen, Business Intelligence 60
Association Rule Mining
• Input: the simple point-of-sale transaction data
• Output: Most frequent affinities among items
• Example: according to the transaction data…
“Customer who bought a laptop computer and a virus
protection software, also bought extended service plan 70
percent of the time"
• How do you use such a pattern/knowledge?
Put the items next to each other for ease of finding
Promote the items as a package (do not put one on sale if the other(s)
are on sale)
Place items far apart from each other so that the customer has to
walk the aisles to search for it, and by doing so potentially see and
buy other items
11
Dr. Chen, Business Intelligence 61
Association Rule Mining
• A representative application of association rule mining
includes
In business: cross-marketing, cross-selling, store design,
catalog design, e-commerce site design, optimization of
online advertising, product pricing, and sales/promotion
configuration
In medicine: relationships between symptoms and
illnesses; diagnosis and patient characteristics and
treatments (to be used in medical DSS); and genes and
their functions (to be used in genomics projects)
…
Dr. Chen, Business Intelligence 62
Association Rule Mining
• Are all association rules interesting and useful?
A Generic Rule: X Y [S%, C%]
X, Y: products and/or services
X: Left-hand-side (LHS)
Y: Right-hand-side (RHS)
S: Support: how often X and Y go together
C: Confidence: how often Y goes together with X
Example: {Laptop Computer, Antivirus Software}
{Extended Service Plan} [30%, 70%]
Dr. Chen, Business Intelligence 63
Association Rule Mining
• Algorithms are available for generating association rules
Apriori
Eclat
FP-Growth
+ Derivatives and hybrids of the three
• The algorithms help identify the frequent item sets, which are then converted to association rules
Dr. Chen, Business Intelligence 64
Association Rule Mining
• Apriori Algorithm
Finds subsets that are common to at least a minimum number of the itemsets
Uses a bottom-up approach
frequent subsets are extended one item at a time (the size of frequent subsets increases from one-item subsets to two-item subsets, then three-item subsets, and so on), and
groups of candidates at each level are tested against the data for minimum support.
(see the figure) --
Dr. Chen, Business Intelligence 65
Association Rule Mining
Apriori Algorithm
Itemset
(SKUs)Support
Transaction
No
SKUs
(Item No)
1
1
1
1
1
1
1, 2, 3, 4
2, 3, 4
2, 3
1, 2, 4
1, 2, 3, 4
2, 4
Raw Transaction Data
1
2
3
4
3
6
4
5
Itemset
(SKUs)Support
1, 2
1, 3
1, 4
2, 3
3
2
3
4
3, 4
5
3
2, 4
Itemset
(SKUs)Support
1, 2, 4
2, 3, 4
3
3
One-item Itemsets Two-item Itemsets Three-item Itemsets
1001
1002
1003
1004
1005
1006
Fig 4.12 Identification of Frequent Itemsets in Apriori Algorithm
Dr. Chen, Business Intelligence 66
Artificial Neural Networks
for Data Mining
• Artificial neural networks (ANN or NN) are a brain
metaphor for information processing
• a.k.a. Neural Computing
• Very good at capturing highly complex non-linear
functions!
• Many uses – prediction (regression, classification),
clustering/segmentation
• Many application areas - finance, medicine, marketing,
manufacturing, service operations, information systems, …
12
Dr. Chen, Business Intelligence 67
Biological
versus
Artificial
Neural
Networks
Neuron
Axon
Axon
SynapseSynapse
Dendrites
Dendrites Neuron
w1
w2
wn
x1
x2
xn
.
.
.
Y
Y1
Yn
Y2
Inputs
Weights
Outputs
.
.
.
Processing
Element (PE)
n
i
iiWXS1
)( Sf
Summation
Transfer
Function
Biological Artificial
Neuron
Dendrites
Axon
Synapse
Slow
Many (109)
Node (or PE)
Input
Output
Weight
Fast
Few (102)
Biological NN
Artificial NN
Dr. Chen, Business Intelligence 68
Elements/Concepts of ANN
• Processing element (PE)
• Information processing
• Network structure
Feedforward vs. recurrent vs. multi-layer…
• Learning parameters
Supervised/unsupervised, backpropagation, learning rate, momentum
• ANN Software – NN shells, integrated modules in comprehensive DM software, …
Dr. Chen, Business Intelligence 69
Data Mining
Software
• Commercial
IBM SPSS Modeler (formerly Clementine)
SAS - Enterprise Miner
IBM - Intelligent Miner
StatSoft – Statistica Data Miner
… many more
• Free and/or Open Source
R
RapidMiner
Weka…
0 50 100 150 200 250 300
Predixion Software (3)
WordStat (3)
11 Ants Analytics (4)
Teradata Miner (4)
RapidInsight/Veera (5)
Angoss (7)
SAP (BusinessObjects/Sybase/Hana)(7)
XLSTAT (7)
Salford SPM/CART/MARS/TreeNet/RF (9)
Revolution Computing (11)
C4.5/C5.0/See5 (13)
Bayesia (14)
KXEN (14)
Zementis (14)
Stata (15)
IBM Cognos (16)
Miner3D (19)
Mathematica (23)
JMP (32)
Other commercial software (32)
Oracle Data Miner (35)
Tableau (35)
TIBCO Spotfire / S+ / Miner (37)
Other free software (39)
Microsoft SQL Server (40)
Orange (42)
SAS Enterprise Miner (46)
IBM SPSS Modeler (54)
IBM SPSS Statistics (62)
MATLAB (80)
Rapid-I RapidAnalytics (83)
SAS (101)
StatSoft Statistica (112)
Weka / Pentaho (118)
KNIME (174)
Rapid-I RapidMiner (213)
Excel (238)
R (245)
Fig. 4.13 Popular Data Mining Software Tools (Poll Results). Source: KDNuggets.com Dr. Chen, Business Intelligence 70
Big Data Software Tools and Platforms
0 10 20 30 40 50 60 70 80
Other Hadoop-based tools (10)
Other Big Data software (21)
NoSQL databases (33)
Amazon Web Services (AWS) (36)
Apache Hadoop/Hbase/Pig/Hive (67)
0 50 100 150 200 250 300
F# (5)
Awk/Gawk/Shell (31)
Perl (37)
Other languages (57)
C/C++ (66)
Python (119)
Java (138)
SQL (185)
R (245)
Dr. Chen, Business Intelligence 71
Application Case 4.6
Data Mining Goes to Hollywood:
Predicting Financial Success of Movies
Questions for Discussion
• Decision situation
• Problem
• Proposed solution
• Results
• Answer & discuss the case questions. Dr. Chen, Business Intelligence
72
Application Case 4.6
Data Mining Goes to Hollywood!
Independent Variable Number of
Values Possible Values
MPAA Rating 5 G, PG, PG-13, R, NR
Competition 3 High, Medium, Low
Star value 3 High, Medium, Low
Genre 10
Sci-Fi, Historic Epic Drama,
Modern Drama, Politically
Related, Thriller, Horror,
Comedy, Cartoon, Action,
Documentary
Special effects 3 High, Medium, Low
Sequel 1 Yes, No
Number of screens 1 Positive integer
Class No. 1 2 3 4 5 6 7 8 9
Range
(in $Millions)
< 1
(Flop)
> 1
< 10
> 10
< 20
> 20
< 40
> 40
< 65
> 65
< 100
> 100
< 150
> 150
< 200
> 200
(Blockbuster)
Dependent
Variable
Independent
Variables
A Typical
Classification
Problem
13
Dr. Chen, Business Intelligence 73
Application Case 4.6
Data Mining Goes to Hollywood!
Model
Development
process
Model
Assessment
process
The DM
Process
Map in
IBM
SPSS
Modeler
Dr. Chen, Business Intelligence 74
Application Case 4.6
Data Mining Goes to Hollywood!
Prediction Models
Individual Models Ensemble Models
Performance
Measure SVM ANN C&RT
Random
Forest
Boosted
Tree
Fusion
(Average)
Count (Bingo) 192 182 140 189 187 194
Count (1-Away) 104 120 126 121 104 120
Accuracy (% Bingo) 55.49% 52.60% 40.46% 54.62% 54.05% 56.07%
Accuracy (% 1-Away) 85.55% 87.28% 76.88% 89.60% 84.10% 90.75%
Standard deviation 0.93 0.87 1.05 0.76 0.84 0.63
* Training set: 1998 – 2005 movies; Test set: 2006 movies
Dr. Chen, Business Intelligence 75
• 1. Why is it important for many Hollywood
professionals to predict the financial success of
movies?
•
• The movie industry is the “land of hunches and
wild guesses” due to the difficulty associated with
forecasting product demand, making the movie
business in Hollywood a risky endeavor.
• If Hollywood could better predict financial
success, this would mitigate some of the financial
risk.
Dr. Chen, Business Intelligence 76
• 2. How can data mining be used for predicting financial
success of movies before the start of their production
process?
•
• The way Sharda and Delen did it was to use data from movies
between 1998 and 2005 as training data, and movies of 2006
as test data. They applied individual and ensemble prediction
models, and were able to identify significant variables
impacting financial success.
• They also showed that by using sensitivity analysis, decision
makers can predict with fairly high accuracy how much value
a specific actor (or a specific release date, or the addition of
more technical effects, etc.) brings to the financial success of a
film, making the underlying system an invaluable decision aid.
Dr. Chen, Business Intelligence 77
• 3. How do you think Hollywood did, and perhaps
still is performing, this task without the help of
data mining tools and techniques?
•
• Most is done by gut feel and trial-and-error. This
may keep the movie business as a financially risky
endeavor, but also allows for creativity.
Sometimes uncertainty is a good thing.
Dr. Chen, Business Intelligence 78
Data Mining Myths
• Data mining …
provides instant solutions/predictions
is not yet viable for business applications
requires a separate, dedicated database
can only be done by those with advanced degrees
is only for large firms that have lots of customer data
is another name for the good-old statistics
14
Dr. Chen, Business Intelligence 79
Data Mining Myths
Myth Reality
DM provides instant, crystal-ball-like
solutions/predictions
DM is a multi-step process that requires
deliberate, proactive design and use.
DM is not yet viable for business
applications
The current state of the art is ready to go for
almost any business.
DM requires a separate, dedicated
database.
Because of advances in database technology,
a dedicated database is not required, even
though it may be desirable.
DM is only for large firms that have lots
of customer data.
Newer Web-based tools enable managers at
all educational levels to do data mining.
DM can only be done by those with
advanced degrees.
If the data accurately reflect the business or
its customers, a company can use data
mining.
DM is another name for the good-old
statistics.
DM is now available for easy use.
Data mining is a powerful analytical tool that enables business executives to
advance from describing the nature of the past to predicting the future. DM’s
myths and realities are listed below:
Dr. Chen, Business Intelligence 80
Common Data Mining Blunders
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is and
what it really can/cannot do
3. Not leaving sufficient time for data acquisition,
selection, and preparation
4. Looking only at aggregated results and not at
individual records/predictions
5. Being sloppy about keeping track of the data mining
procedure and results
6. …more in the book
Dr. Chen, Business Intelligence 81
Extra Example on Correlation
Business Scenario:
Sarah is a regional sales manager for a nationwide supplier of
fossil fuels for home heating. Recent volatility in market
prices for heating oil specifically, coupled with wide
variability in the size of each order for home heating oil, has
Sarah concerned. She feels a need to understand the types of
behaviors and other factors that may influence the demand for
heating oil in the domestic market. What factors are related to
heating oil usage, and how might she use a knowledge of
such factors to better manage her inventory, and anticipate
demand? Sarah believes that data mining can help her begin
to formulate an understanding of these factors and
interactions.
Dr. Chen, Business Intelligence 82
The following attributes are included in the data set:
Insulation: This is a density rating, ranging from one to ten, indicating the
thickness of each home’s insulation. A home with a density rating of one is poorly
insulated, while a home with a density of ten has excellent insulation.
Temperature: This is the average outdoor ambient temperature at each home for
the most recent year, measure in degree Fahrenheit.
Heating_Oil: This is the total number of units of heating oil purchased by the
owner of each home in the most recent year.
Num_Occupants: This is the total number of occupants living in each home.
Avg_Age: This is the average age of those occupants.
Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size.
The higher the number, the larger the home.
Dr. Chen, Business Intelligence 83
Step 1. File New Process “Read CSV” then select
“Chapter04DataSet”
Dr. Chen, Business Intelligence 84
Step 2. Change from “Semicolon” to “Comma”
15
Dr. Chen, Business Intelligence 85
Dr. Chen, Business Intelligence 86
Dr. Chen, Business Intelligence 87
The “Read CSV” with input data has been added to the process.
Dr. Chen, Business Intelligence 88
Next, add “correlation matrix” operator and connect all ports
as shown. Click “Run”
Dr. Chen, Business Intelligence 89
The result is obtained (with “Meta Data View” shown.)
Dr. Chen, Business Intelligence 90
Result:
What do we obtain from this result?
16
Dr. Chen, Business Intelligence 91
• We learned through our investigation, that the two most strongly
correlated attributes in our data set are Heating_Oil (or
Insulation) and Avg_Age, with a coefficient of 0.848 (or 0.643).
Thus, we know that in this data set, as the average age of the
occupants in a home increases, so too does the heating oil usage
in that home.
Illustration of positive correlation Dr. Chen, Business Intelligence
92
• We learned through our investigation, that the two most strongly
correlated attributes in our data set are Heating_Oil and
Avg_Age, with a coefficient of 0.848. Thus, we know that in this
data set, as the average age of the occupants in a home increases,
so too does the heating oil usage in that home.
• What we do not know is why that occurs. Data analysts often
make the mistake of interpreting correlation as causation.
• The assumption that correlation proves causation is
dangerous and often false.
Dr. Chen, Business Intelligence 93
Correlation Strengths
Fig. Correlation strengths between -1 and 1
Illustration of positive correlation
Illustration of negative correlation Dr. Chen, Business Intelligence 94
Summary on Correlation
• Correlation can be a quick and easy way to see how elements of
a given problem may be interacting with one another. Whenever
you find yourself asking how certain factors in a problem
you’re trying to solve interact with one another, consider
building a correlation matrix to find out.
• For example,
1) does customer satisfaction change based on time of year?
2) Does the amount of rainfall change the price of a crop?
3) Does household income influence which restaurants a
person patronizes?
• The answer to each of these questions is probably ‘yes’, but
correlation can not only help us know if that’s true, it can also
help us learn how strongly the interactions are when, and if,
they occur.
Dr. Chen, Business Intelligence 95
End of the Chapter
• Questions, comments
top related