© 2003, Prentice-Hall Chapter 3 - 1
Chapter 3: Data Mining and
Data Visualization
Modern Data Warehousing, Mining,
and Visualization: Core Concepts
by George M. Marakas
http://alainmaterials.webs.com/
© 2003, Prentice-Hall Chapter 3 - 2
3-1: A Picture is Worth a
Thousand Words
Data mining is the set of activities used to find
new, hidden, or unexpected patterns in data.
These techniques are often called knowledge
data discovery (KDD), and include statistical
analysis, neural networks and fuzzy logic,
intelligent agents, and data visualization.
The KDD techniques not only discover useful
patterns in the data, but also can be used to
develop predictive models.
© 2003, Prentice-Hall Chapter 3 - 3
Verification Versus Discovery
In the past, decision support activities were
primarily based on the concept of verification.
This required a great deal of prior knowledge
on the decision-maker’s part in order to verify
a suspected relationship.
With the advance of technology, the concept
of verification began to turn into discovery.
© 2003, Prentice-Hall Chapter 3 - 4
Data Mining’s Growth in Popularity
One reason is that we keep getting more and
more data all the time and need tools to
understand it.
We also are aware that the human brain has
trouble processing multidimensional data.
A third reason is that machine learning
techniques are becoming more affordable
and more refined at the same time.
© 2003, Prentice-Hall Chapter 3 - 5
Making Accurate Predictions with
Data Mining
Although the literature contains statements such as “data mining will allow us to predict who will buy a particular product,” individual human behavior is not that predictable.
In situations where data mining is used to predict response to a marketing campaign, only about 5% of the people selected as “likely respondents” actually respond.
© 2003, Prentice-Hall Chapter 3 - 6
Making Accurate Predictions with
Data Mining (cont.)
Although the accuracy of predicting
individual behavior is not so good, it is
better than it seems, since direct
marketing efforts often have “hit rates”
of only about 1% without data mining.
© 2003, Prentice-Hall Chapter 3 - 7
3-2: Online Analytical Processing
(OLAP)
Codd developed a set of 12 rules for the
development of multidimensional databases:
1. Multidimensional view: for example, profits could be viewed by region,
product, or time period.
2. Transparent to user: supports heterogeneous data sources.
3. Accessible: the OLAP tool should present the user with a single
logical schema of the data.
4. Consistent reporting performance: performance of the OLAP tool should not
suffer significantly as the number of dimensions is increased.
5. Client-server architecture: clients can be attached with minimum
effort.
6. Generic dimensionality: not limited to 3-D; a function applied to one
dimension should also be applicable to another.
© 2003, Prentice-Hall Chapter 3 - 8
3-2: Online Analytical Processing
(OLAP) (cont.)
7. Dynamic sparse matrix handling: a sparse matrix is one in which not every
cell contains data.
8. Multiuser support.
9. Cross-dimensional operations: must allow calculation and data
manipulation across any number of data dimensions, and must not restrict
any relationship between data cells.
10. Intuitive manipulation: data manipulation such as drilling down or
zooming out should happen by direct action on the data, not through multiple steps.
11. Flexible reporting: presents information in any way the user wants to view it.
12. Unlimited dimensions and aggregation levels.
© 2003, Prentice-Hall Chapter 3 - 9
OLAP as Implemented
To date, it does not appear that any
implementation exists that satisfies all 12
rules.
Some people argue it might not even be
possible to attain all of them.
More recently, the term OLAP has come to
represent the broad category of software
technology that enables multidimensional
analysis of enterprise data.
© 2003, Prentice-Hall Chapter 3 - 10
Multidimensional OLAP (MOLAP)
Data can be viewed across several dimensions. Here sales are arrayed by region and product.
A fourth dimension could be added by using several graphs -- perhaps at different time points.
Most analyses have many more dimensions than this. MOLAP handles data as an n-dimensional hypercube.
[Figure: 3-D chart of Sales by Product and Region]
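As a rough illustration of the hypercube idea, the sketch below stores one measure in cells addressed by one coordinate per dimension and rolls it up along any subset of dimensions. The dimension names, members, and sales figures are invented for the example, and a plain dictionary stands in for the specialized array storage a real MOLAP server would use.

```python
from collections import defaultdict

# Hypothetical cube: (region, product, quarter) -> sales. All values are invented.
DIMENSIONS = ("region", "product", "quarter")
cube = {
    ("East", "Widgets", "Q1"): 30,
    ("East", "Widgets", "Q2"): 40,
    ("East", "Gadgets", "Q1"): 50,
    ("West", "Widgets", "Q1"): 20,
    ("West", "Gadgets", "Q2"): 60,
}


def roll_up(cells, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    kept_positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for coords, value in cells.items():
        totals[tuple(coords[i] for i in kept_positions)] += value
    return dict(totals)


def slice_cube(cells, dimension, member):
    """Keep only the cells whose coordinate on `dimension` equals `member`."""
    position = DIMENSIONS.index(dimension)
    return {c: v for c, v in cells.items() if c[position] == member}


by_region = roll_up(cube, ["region"])
by_region_product = roll_up(cube, ["region", "product"])
grand_total = roll_up(cube, [])[()]
q1_only = slice_cube(cube, "quarter", "Q1")

print(by_region)
print(by_region_product)
print(grand_total)
```

Adding a fourth dimension here only means lengthening the coordinate tuples, which is the sense in which the hypercube is n-dimensional.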
© 2003, Prentice-Hall Chapter 3 - 11
Relational OLAP (ROLAP)
A large relational database server replaces the multidimensional one.
The database contains both detailed and summarized data, allowing “drill down” techniques to be applied.
SQL interfaces allow vendors to build tools that are both portable and scalable.
This approach does require databases with many relational tables, which may lead to substantial processor overhead on complex joins.
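The ROLAP approach can be sketched with an ordinary relational engine. The example below uses Python's built-in sqlite3 module with a tiny invented schema (one fact table, two dimension tables); the table names, column names, and figures are assumptions for illustration only. It runs a summary query by region and then "drills down" into one region by product, with the joins doing the work a multidimensional server would otherwise do.

```python
import sqlite3

# In-memory database with a hypothetical fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region  (region_id  INTEGER PRIMARY KEY, region_name  TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE sales   (region_id INTEGER, product_id INTEGER, amount INTEGER);
    INSERT INTO region  VALUES (1, 'East'), (2, 'West');
    INSERT INTO product VALUES (1, 'Widgets'), (2, 'Gadgets');
    INSERT INTO sales   VALUES (1, 1, 30), (1, 1, 40), (1, 2, 50),
                               (2, 1, 20), (2, 2, 60);
""")

# Summary level: total sales per region.
summary = conn.execute("""
    SELECT r.region_name, SUM(s.amount)
    FROM sales AS s
    JOIN region AS r ON r.region_id = s.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()

# Drill down: the same measure for one region, broken out by product.
drill_down = conn.execute("""
    SELECT p.product_name, SUM(s.amount)
    FROM sales AS s
    JOIN region  AS r ON r.region_id  = s.region_id
    JOIN product AS p ON p.product_id = s.product_id
    WHERE r.region_name = ?
    GROUP BY p.product_name
    ORDER BY p.product_name
""", ("East",)).fetchall()

print(summary)
print(drill_down)
```

Each extra dimension adds another join, which is where the processor overhead mentioned on the slide comes from.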
© 2003, Prentice-Hall Chapter 3 - 12
A Typical Relational Schema
© 2003, Prentice-Hall Chapter 3 - 13
3-3: Techniques Used to Mine the Data
Paralleling the popularity of data mining itself, the development of new techniques is exploding as well.
Many innovations are vendor-specific, which sometimes does little to advance the state of the art.
Regardless, data-mining techniques tend to fall into four major categories:
1. classification
2. association
3. sequencing
4. clustering
© 2003, Prentice-Hall Chapter 3 - 14
Classification methods
The goal is to discover rules that define
whether an item belongs to a particular
subset or class of data.
For example, if we are trying to determine
which households will respond to a direct mail
campaign, we will want rules that separate
the “probables” from the “not probables.”
These IF-THEN rules often are portrayed in a
tree-like structure.
© 2003, Prentice-Hall Chapter 3 - 15
Classification Example
Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
[Figure: Training Set --> Learn Classifier --> Model; Test Set --> Model]
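To make the train-then-apply flow concrete, here is a minimal sketch that encodes one possible set of IF-THEN rules for this slide's data. The rules are hand-written, not learned by any algorithm, and the 80K income split is an assumption chosen only because it happens to fit the ten training records; the slide itself does not give the model.

```python
# (refund, marital_status, taxable_income_in_thousands, cheat) from the training set.
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# (refund, marital_status, taxable_income_in_thousands) from the test set.
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def classify(refund, marital_status, income_k):
    """Hand-written rule tree; the 80K threshold is an illustrative assumption."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= 80 else "No"


# How many training records do the rules reproduce?
correct = sum(
    classify(refund, status, income) == label
    for refund, status, income, label in training_set
)

# Apply the same rules to the unlabeled test records.
predictions = [classify(*record) for record in test_set]

print(correct, "of", len(training_set), "training records matched")
print(predictions)
```

Fitting the training set perfectly says nothing about accuracy on new data; a real project would hold out labeled records to measure that.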
© 2003, Prentice-Hall Chapter 3 - 16
Association Methods
These techniques search all transactions
from a system for patterns of occurrence.
A common method is market basket analysis,
in which the sets of products purchased by
thousands of consumers are examined.
Results are then represented as percentages;
for example, “30% of the people who buy
steaks also buy charcoal”.
© 2003, Prentice-Hall Chapter 3 - 17
Association Methods
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
© 2003, Prentice-Hall Chapter 3 - 18
Sequencing Methods
These methods are applied to time series
data in an attempt to find hidden trends.
If found, these can be useful predictors of
future events.
For example, customer groups that tend to
purchase products tied-in with hit movies
would be targeted with promotional
campaigns timed to release dates.
© 2003, Prentice-Hall Chapter 3 - 19
Sequential method: Examples
In point-of-sale transaction sequences,
Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) -->
(Perl_for_dummies,Tcl_Tk)
Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)
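A sequential rule such as the bookstore one can be checked by counting how many customer histories contain its itemsets in order. The sketch below assumes each history is a list of visits (each visit a set of items); the four customer histories and the extra item names are invented for illustration.

```python
def contains_pattern(history, pattern):
    """True if the pattern's itemsets occur, in order, within the visits of `history`.

    Each pattern itemset must be a subset of some visit, and the matching
    visits must appear in the same order as the pattern.
    """
    position = 0
    for visit in history:
        if position < len(pattern) and pattern[position] <= visit:
            position += 1
    return position == len(pattern)


# Hypothetical point-of-sale histories, one list of visits per customer.
histories = {
    "c1": [{"Intro_To_Visual_C"}, {"C++_Primer"}, {"Perl_for_dummies", "Tcl_Tk"}],
    "c2": [{"Intro_To_Visual_C"}, {"C++_Primer", "Mouse_Pad"}, {"Tcl_Tk"}],
    "c3": [{"C++_Primer"}, {"Intro_To_Visual_C"}],
    "c4": [{"Intro_To_Visual_C"}, {"Magazine"}, {"C++_Primer"},
           {"Perl_for_dummies", "Tcl_Tk", "Magazine"}],
}

antecedent = [{"Intro_To_Visual_C"}, {"C++_Primer"}]
full_rule = antecedent + [{"Perl_for_dummies", "Tcl_Tk"}]

antecedent_count = sum(contains_pattern(h, antecedent) for h in histories.values())
full_count = sum(contains_pattern(h, full_rule) for h in histories.values())

# Confidence: of the customers who followed the antecedent, how many completed the rule?
confidence = full_count / antecedent_count

print(antecedent_count, full_count, round(confidence, 2))
```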
© 2003, Prentice-Hall Chapter 3 - 20
Clustering Techniques
Clustering techniques attempt to create
partitions in the data according to some
distance metric.
The clusters formed are data grouped
together simply by their similarity to their
neighbors.
By examining the characteristics of each
cluster, it may be possible to establish rules
for classification.
© 2003, Prentice-Hall Chapter 3 - 21
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances
are minimized
Intercluster distances
are maximized
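The slide does not name a clustering algorithm. As one possible instance, the sketch below runs a basic k-means loop with Euclidean distance on six invented 3-D points; the points and the choice of starting centroids are assumptions for illustration.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-centering."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster; leave it alone if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids


# Invented 3-D points forming two visibly separate groups.
points = [(1, 1, 1), (1, 2, 1), (2, 1, 2), (8, 8, 8), (9, 8, 9), (8, 9, 9)]

# Start from one point in each group (an arbitrary choice for this example).
clusters, centroids = k_means(points, [points[0], points[3]])

print(clusters)
print(centroids)
```

Small distances to a cluster's own centroid and large distances between centroids correspond to the intracluster and intercluster goals named on the slide.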
© 2003, Prentice-Hall Chapter 3 - 22
Data Mining Technologies
Statistics – the most mature of the data mining
technologies, but often not applicable
because statistical methods need clean data. In addition,
many statistical procedures assume linear
relationships, which limits their use.
Neural networks, genetic algorithms, fuzzy
logic – these technologies are able to work
with complicated and imprecise data. Their
broad applicability has made them popular in
the field.
© 2003, Prentice-Hall Chapter 3 - 23
Data Mining Technologies (cont.)
Decision trees – these technologies are
conceptually simple and have gained in
popularity as better tree growing
software was introduced. Because of
the way they are used, they are perhaps
better called “classification” trees.
© 2003, Prentice-Hall Chapter 3 - 24
The Knowledge Discovery
Search Process
Table 3-2 contains a more detailed outline
of the process, but the major steps are:
Define the business problem and
obtain the data to study it.
Use data mining software to model
the problem.
Mine the data to search for patterns
of interest.
© 2003, Prentice-Hall Chapter 3 - 25
The Knowledge Discovery
Search Process (cont.)
Review the mining results and refine
them by respecifying the model.
Once validated, make the model
available to other users of the DW.
© 2003, Prentice-Hall Chapter 3 - 26
Creating a Data-Mining Model
Although syntax differs from vendor to vendor, building a model on top of a database is much like creating a table:
CREATE MODEL mail_list
Income character input, Age integer input, Respond character input
To populate it with data, use an SQL INSERT:
INSERT INTO mail_list
SELECT income, age, respond
FROM client_list
WHERE region = 'Southeast'
© 2003, Prentice-Hall Chapter 3 - 27
Creating a Data-Mining Model (cont.)
The process automatically created additional views of the model (mail_list_UNDERSTAND and mail_list_PREDICT). These can be examined:
SELECT * FROM mail_list_UNDERSTAND
WHERE input_column_name = 'income' AND
input_column_value = 'high' AND
output_column_name = 'respond' AND
output_column_value = 'yes'
Once these are created, they are treated as tables in the database so they can be viewed and joined by other users.
© 2003, Prentice-Hall Chapter 3 - 28
New Applications for Data Mining
As the technology matures, new applications
emerge, especially in two new categories,
text mining and web mining. Some text
mining examples are:
Distilling the meaning of a text
Accurate summarization of a text
Explanation of the text theme structure
Clustering of texts
© 2003, Prentice-Hall Chapter 3 - 29
Web mining
Web mining is a special case of text mining where the mining occurs over a website.
It enhances the website with intelligent behavior, such as suggesting related links or recommending new products.
It allows you to unobtrusively learn the interests of the visitors and modify their user profiles in real time.
It also allows you to match resources to the interests of the visitor.
© 2003, Prentice-Hall Chapter 3 - 30
3-4: Market Basket Analysis: The King of
Algorithms
This is the most widely used and, in many ways, most successful data mining algorithm.
It essentially determines what products people purchase together.
Stores can use this information to place these products in the same area.
Direct marketers can use this information to determine which new products to offer to their current customers.
Inventory policies can be improved if reorder points reflect the demand for the complementary products.
© 2003, Prentice-Hall Chapter 3 - 31
Association Rules for
Market Basket Analysis
Rules are written in the form “left-hand side
implies right-hand side” and an example is:
Yellow Peppers IMPLIES Red Peppers, Bananas, Bakery
To make effective use of a rule, three numeric
measures about that rule must be considered:
(1) support, (2) confidence and (3) lift
© 2003, Prentice-Hall Chapter 3 - 32
Measures of Predictive Ability
1. Support refers to the percentage of baskets
where the rule was true (both left and right
side products were present).
2. Confidence measures what percentage of
baskets that contained the left-hand product
also contained the right.
3. Lift measures how much more frequently the
right-hand item is found with the left-hand item
than it is found overall (the ratio of confidence to
the expected confidence). A lift above 1 means
the left-hand side raises the chance of the right.
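Stated as formulas, the three measures for a rule X --> Y can be written as follows. The notation is an assumption introduced here: σ(Z) is the number of baskets that contain every item in Z, and |T| is the total number of baskets.

```latex
\[
\text{support}(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{|T|}
\qquad
\text{confidence}(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
\]
\[
\text{lift}(X \Rightarrow Y)
  = \frac{\text{confidence}(X \Rightarrow Y)}{\sigma(Y)/|T|}
  = \frac{\sigma(X \cup Y)\,|T|}{\sigma(X)\,\sigma(Y)}
\]
```

The denominator σ(Y)/|T| is the expected confidence: how often the right-hand side shows up in any basket, regardless of the left-hand side.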
Association Rule: example
Association Rule
– An implication expression of the form X --> Y, where X and Y are itemsets
– Example: {Milk, Diaper} --> {Beer}
Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example: {Milk, Diaper} --> {Beer}
$s = \frac{\sigma(\text{Milk, Diaper, Beer})}{|T|} = \frac{2}{5} = 0.4$
$c = \frac{\sigma(\text{Milk, Diaper, Beer})}{\sigma(\text{Milk, Diaper})} = \frac{2}{3} \approx 0.67$
where $\sigma(\cdot)$ is the number of transactions containing the itemset and $|T|$ is the total number of transactions.
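The support and confidence of {Milk, Diaper} --> {Beer} can be recomputed directly from the five transactions on this slide. The sketch below is one straightforward way to do it; the function and variable names are illustrative, and exact fractions are used so the results are not blurred by rounding.

```python
from fractions import Fraction

# The five transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}


def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)


left = {"Milk", "Diaper"}
right = {"Beer"}

both = support_count(left | right)
support = Fraction(both, len(transactions))
confidence = Fraction(both, support_count(left))

print(f"support = {support} = {float(support):.2f}")
print(f"confidence = {confidence} = {float(confidence):.2f}")
```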
© 2003, Prentice-Hall Chapter 3 - 34
Association Rule: example
[Swimsuits] + [Beach towels] ==> [Sun glasses]
Support=24% Confidence=60% Lift=2.0
This rule means that when customers buy swimsuits and beach towels, in 60% of the cases they also buy sun glasses.
The combination of swimsuits, beach towels, and sun glasses occurs in 24% of all transactions.
Swimsuits and beach towels have a strong positive effect on buying sun glasses: a lift of 2.0 means the three are bought together twice as often as would be expected if the purchases were independent.
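The three figures for the swimsuit rule also pin down two quantities the slide does not state. Assuming the usual definitions (confidence = support / P(left-hand side), lift = confidence / P(right-hand side)), a short back-calculation gives:

```latex
\[
P(\text{sun glasses}) = \frac{\text{confidence}}{\text{lift}} = \frac{0.60}{2.0} = 0.30
\qquad
P(\text{swimsuits and beach towels}) = \frac{\text{support}}{\text{confidence}} = \frac{0.24}{0.60} = 0.40
\]
```

So sun glasses would appear in 30% of all baskets, but in 60% of the baskets that already hold swimsuits and beach towels, which is the doubling that the lift of 2.0 expresses.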
© 2003, Prentice-Hall Chapter 3 - 35
Market Basket Analysis Methodology
We first need a list of transactions and what
was purchased. This is pretty easily obtained
these days from scanning cash registers.
Next, we choose a list of products to analyze,
and tabulate how many times each was
purchased with the others.
The diagonal of the table shows how often each
product is purchased in any combination, and
the off-diagonal cells show how often each pair
of products was bought together.
© 2003, Prentice-Hall Chapter 3 - 36
A Convenience Store Example
(5 transactions)
Consider the following simple example about five transactions at a convenience store:
Transaction 1: Frozen pizza, cola, milk
Transaction 2: Milk, potato chips
Transaction 3: Cola, frozen pizza
Transaction 4: Milk, pretzels
Transaction 5: Cola, pretzels
These need to be cross tabulated and displayed in a table.
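The cross tabulation can be sketched as a co-occurrence count over the five transactions. Product names are shortened here (Pizza for frozen pizza, Chips for potato chips), and the variable names are illustrative.

```python
# The five convenience-store transactions, with shortened product names.
transactions = [
    {"Pizza", "Cola", "Milk"},
    {"Milk", "Chips"},
    {"Cola", "Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]
products = ["Pizza", "Milk", "Cola", "Chips", "Pretzels"]

# table[a][b] = number of transactions containing both a and b.
# When a == b this is simply how many transactions contain the product (the diagonal).
table = {
    a: {b: sum(1 for t in transactions if a in t and b in t) for b in products}
    for a in products
}

# Print the table with one row per product.
print(f"{'':10}" + "".join(f"{name:>10}" for name in products))
for a in products:
    print(f"{a:10}" + "".join(f"{table[a][b]:>10}" for b in products))
```

The table is symmetric, since "bought A with B" and "bought B with A" count the same transactions.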
© 2003, Prentice-Hall Chapter 3 - 37
A Convenience Store Example (5 transactions)
Pizza and Cola sell together more often than any other combo; a cross-marketing opportunity?
Milk sells well with everything – people probably come here specifically to buy it.
Product Bought  Pizza also  Milk also  Cola also  Chips also  Pretzels also
Pizza           2           1          2          0           0
Milk            1           3          1          1           1
Cola            2           1          3          0           1
Chips           0           1          0          1           0
Pretzels        0           1          1          0           2
© 2003, Prentice-Hall Chapter 3 - 38
Using the Results
The tabulations can immediately be
translated into association rules and the
numerical measures computed.
Comparing this week’s table to last week’s
table can immediately show the effect of this
week’s promotional activities.
Some rules are going to be trivial (hot dogs
and buns sell together) or inexplicable (toilet
rings sell only when a new hardware store is
opened).
© 2003, Prentice-Hall Chapter 3 - 39
Limitations to Market Basket Analysis
A large number of real transactions are
needed to do an effective basket analysis, but
the data’s accuracy is compromised if all the
products do not occur with similar frequency.
The analysis can sometimes capture results
that were due to the success of previous
marketing campaigns (and not natural
tendencies of customers).
© 2003, Prentice-Hall Chapter 3 - 40
Performing Analysis with Virtual Items
The sales data can be increased with the
addition of virtual items. For example, we
could record that the customer was new to
us, or had children.
The transaction record might look like:
Item 1: Sweater Item 2: Jacket Item 3: New
This might allow us to see what patterns new
customers have versus old customers.
© 2003, Prentice-Hall Chapter 3 - 41
Computing Measures of Association
Let’s do some of the textbook’s example computations here ……
Pizza Milk Cola Chips Pretzels
Pizza 2 1 2 0 0
Milk 1 3 1 1 1
Cola 2 1 3 0 1
Chips 0 1 0 1 0
Pretzels 0 1 1 0 2
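As a sketch of how such computations go, the code below derives support, confidence, and lift for a few rules straight from the cross-tabulation, using the fact that the table covers 5 transactions. These are not necessarily the textbook's own worked examples; the choice of rules and the function name are assumptions.

```python
from fractions import Fraction

products = ["Pizza", "Milk", "Cola", "Chips", "Pretzels"]
counts = [
    [2, 1, 2, 0, 0],
    [1, 3, 1, 1, 1],
    [2, 1, 3, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 2],
]
TOTAL_TRANSACTIONS = 5


def measures(lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs IMPLIES rhs."""
    i, j = products.index(lhs), products.index(rhs)
    together = counts[i][j]          # baskets with both products
    lhs_total = counts[i][i]         # baskets with the left-hand product (diagonal)
    rhs_total = counts[j][j]         # baskets with the right-hand product (diagonal)
    support = Fraction(together, TOTAL_TRANSACTIONS)
    confidence = Fraction(together, lhs_total)
    expected_confidence = Fraction(rhs_total, TOTAL_TRANSACTIONS)
    return support, confidence, confidence / expected_confidence


pizza_cola = measures("Pizza", "Cola")
cola_pizza = measures("Cola", "Pizza")
milk_cola = measures("Milk", "Cola")

for name, (s, c, l) in [("Pizza -> Cola", pizza_cola),
                        ("Cola -> Pizza", cola_pizza),
                        ("Milk -> Cola", milk_cola)]:
    print(f"{name}: support={float(s):.2f} confidence={float(c):.2f} lift={float(l):.2f}")
```

Note that a rule and its reverse share the same support and lift but can differ in confidence, and that Milk IMPLIES Cola has a lift below 1, meaning milk buyers take cola less often than shoppers in general.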
© 2003, Prentice-Hall Chapter 3 - 42
Taxonomies
The presence of items not purchased very
frequently is an obstacle to a good market
basket analysis.
One way to deal with this is to eliminate
products that occur with a frequency less than
some threshold.
A better idea would be to try to form groups of
products that fall below the threshold. Four
flavors of popsicle occur 9% of the time all
together, but no more than 3% individually.
© 2003, Prentice-Hall Chapter 3 - 43
Multidimensional Market
Basket Analysis
Rules can involve more than two items, for
example Plant and Clay Pot IMPLIES Soil.
These rules are built iteratively. First, pairs
are found, then relevant sets of three or four.
These are then pruned by removing those
that occur infrequently.
In an environment like a grocery store, where
customers commonly buy over 100 items,
rules could involve as many as 10 items.
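The iterative build-then-prune procedure can be sketched as a level-wise search in the style of the Apriori algorithm (the slide does not name a specific algorithm, so that label is an assumption). The garden-store transactions and the minimum count of 2 are invented for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count, max_size=3):
    """Level-wise search: size-k candidates are built only from frequent size-(k-1) sets."""

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items if count(frozenset([i])) >= min_count]
    result = {s: count(s) for s in current}
    size = 2
    while current and size <= max_size:
        # Join step: union pairs of frequent sets that produce a set of the target size.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        # Prune step: drop candidates with an infrequent subset, then those that are too rare.
        current = [
            c for c in candidates
            if all(frozenset(sub) in result for sub in combinations(c, size - 1))
            and count(c) >= min_count
        ]
        result.update({c: count(c) for c in current})
        size += 1
    return result


# Hypothetical garden-store baskets.
transactions = [
    {"Plant", "Clay Pot", "Soil"},
    {"Plant", "Clay Pot", "Soil", "Fertilizer"},
    {"Plant", "Soil"},
    {"Clay Pot", "Soil"},
    {"Plant", "Clay Pot"},
    {"Fertilizer"},
]

result = frequent_itemsets(transactions, min_count=2)
triple = frozenset({"Plant", "Clay Pot", "Soil"})
pair = frozenset({"Plant", "Clay Pot"})

# Confidence of "Plant and Clay Pot IMPLIES Soil" from the stored counts.
confidence = result[triple] / result[pair]

for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
print(round(confidence, 2))
```

Pruning early matters because the number of possible item combinations grows very quickly with the number of products; pairs containing Fertilizer are dropped here before any triple is considered.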
© 2003, Prentice-Hall Chapter 3 - 44
3-5: Current Limitations and
Challenges to Data Mining
Despite its potential power and value, data mining is still a new field. Some things that have thus far limited its advancement are:
Identification of missing information – not all knowledge gets stored in a database
Data noise and missing values – future systems need better ways to handle this
Large databases and high dimensionality – future applications need ways to partition data into more manageable chunks
© 2003, Prentice-Hall Chapter 3 - 45
3-6: Data Visualization:
“Seeing” the Data
© 2003, Prentice-Hall Chapter 3 - 46
Visual Presentation
For any kind of high dimensional data set,
displaying predictive relationships is a
challenge.
The picture on the previous slide uses 3-D
graphics to portray the weather balloon data
in text Table 11-4. We learn very
little from just examining the numbers.
Shading is used to represent relative degrees
of thunderstorm activity, with the darkest
regions showing the heaviest activity.
© 2003, Prentice-Hall Chapter 3 - 47
Human Visual Perception and
Data Visualization
Data visualization is so powerful because the
human visual cortex converts objects into
information so quickly.
The next three slides show (1) usage of
global private networks, (2) flow through
natural gas pipelines, and (3) a risk analysis
report that permits the user to draw an
interactive yield curve.
All three use height or shading to add
dimensions to the figure.
© 2003, Prentice-Hall Chapter 3 - 48
Global Private Network Activity
High Activity
Low Activity
© 2003, Prentice-Hall Chapter 3 - 49
Natural Gas Pipeline Analysis
Note: Height shows total flow through compressor stations.
© 2003, Prentice-Hall Chapter 3 - 50
An “Enlivened” Risk Analysis Report
© 2003, Prentice-Hall Chapter 3 - 51
Geographical Information Systems
A GIS is a special purpose database that
contains a spatial coordinate system. A
comprehensive GIS requires:
1. Data input from maps, aerial photos, etc.
2. Data storage, retrieval and query
3. Data transformation and modeling
4. Data reporting (maps, reports and plans)
© 2003, Prentice-Hall Chapter 3 - 52
The Special Capabilities of a GIS
In general, a GIS contains two types of data:
Spatial data: these elements correspond to a uniquely-defined location on earth. They could be in point, line or polygon form.
Attribute data: These are the data that will be portrayed at the geographic references established by spatial data.
Example: Data from an opinion poll is displayed for multiple regions in the United States. Clicking on an area allows the user to drill down to the results for smaller areas.
© 2003, Prentice-Hall Chapter 3 - 53
Telephone Polling Results
Note: On the “live” map, clicking on an area allows the user
to drill down and see results for smaller areas.