dr. chen, data mining a/w & dr. chen, data mining chapter 2 data mining: a closer look jason c....
TRANSCRIPT
A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Chapter 2Data Mining: A Closer Look
Jason C. H. Chen, Ph.D.Professor of MIS
School of Business AdministrationGonzaga UniversitySpokane, WA 99223
A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
2.1 Data Mining Strategies
3A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Figure 2.1 - A hierarchy of data mining strategies
Data Mining Strategies
Unsupervised Clustering
Supervised Learning
Market Basket Analysis
Classification EstimationPrediction
Categorical/discrete(current behavior)
NumericFuture outcome
(categorical/numeric)
No output attributes
A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Classification
• Learning is supervised.
• The dependent variable is categorical.
• Well-defined classes.
• Current rather than future behavior.
Estimation• Learning is supervised.
• The dependent variable is numeric.
• Well-defined classes.
• Current rather than future behavior.
Prediction
• The emphasis is on predicting future rather than current outcomes.
• The output attribute may be categorical or numeric.
5A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
The Cardiology Patient Dataset
6A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Attribute Mixed Numeric
Name Values Values Comments
Age Numeric Numeric Age in years
Sex Male, Female 1, 0 Patient gender
Chest Pain Type Angina, Abnormal Angina, 1–4 NoTang =Nonanginal
NoTang, Asymptomatic pain
Blood Pressure Numeric Numeric Resting blood pressure
upon hospital admission
Cholesterol Numeric Numeric Serum cholesterol
Fasting Blood True, False 1, 0 Is fasting blood sugar less
Sugar < 120 than 120?
Resting ECG Normal, Abnormal, Hyp 0, 1, 2 Hyp = Left ventricular
hypertrophy
Maximum Heart Numeric Numeric Maximum heart rate
Rate achieved
Induced Angina? True, False 1, 0 Does the patient experience angina
as a result of exercise?
Old Peak Numeric Numeric ST depression induced by exercise
relative to rest
Slope Up, f lat, dow n 1–3 Slope of the peak exercise ST
segment
Number Colored 0, 1, 2, 3 0, 1, 2, 3 Number of major vessels
Vessels colored by f luorosopy
Thal Normal f ix, rev 3, 6, 7 Normal, f ixed defect,
reversible defect
Concept Class Healthy, Sick 1, 0 Angiographic disease status
Table 2.1 • Cardiology Patient Data
7A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Attribute Most Typical Least Typical Most Typical Least Typical
Name Healthy Class Healthy Class Sick Class Sick Class
Age 52 63 60 62
Sex Male Male Male Female
Chest Pain Type NoTang Angina Asymptomatic Asymptomatic
Blood Pressure 138 145 125 160
Cholesterol 223 233 258 164
Fasting Blood Sugar < 120 FALSE TRUE FALSE FALSE
Resting ECG Normal Hyp Hyp Hyp
Maximum Heart Rate 169 150 141 145
Induced Angina? FALSE FALSE TRUE FALSE
Old Peak 0 2.3 2.8 6.2
Slope Up Down Flat Down
Number of Colored Vessels 0 0 1 3
Thal Normal Fix Rev Rev
Table 2.2 • Most and Least Typical Instances from the Cardiology Domain
A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
A Healthy Class Rule for the Cardiology Patient Dataset
IF 169 <= Maximum Heart Rate <=202
THEN Concept Class = Healthy
Rule accuracy: 85.07%
Rule coverage: 34.55%
A Sick Class Rule for the Cardiology Patient Dataset
IF Thal = Rev & Chest Pain Type = Asymptomatic
THEN Concept Class = Sick
Rule accuracy: 91.14%
Rule coverage: 52.17%
A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Unsupervised Clustering
• Determine if concepts can be found in the data.• Evaluate the likely performance of a supervised model.• Determine a best set of input attributes for supervised learning.• Detect Outliers.
10A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Market Basket Analysis
• Find interesting relationships among retail products.
• Uses association rule algorithms.
• The results of a market basket analysis help retailers– Design promotions,
– Arrange shelf or catalog items, and
– Develop cross-marketing strategies
11A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
2.2 Supervised Data Mining Techniques
12A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
The Credit Card Promotion Database
13A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Income Magazine Watch Life Insurance Credit CardRange ($) Promotion Promotion Promotion Insurance Sex Age
40–50K Yes No No No Male 45
30–40K Yes Yes Yes No Female 40
40–50K No No No No Male 42
30–40K Yes Yes Yes Yes Male 43
50–60K Yes No Yes No Female 38
20–30K No No No No Female 55
30–40K Yes No Yes Yes Male 35
20–30K No Yes No No Male 27
30–40K Yes No No No Male 43
30–40K Yes Yes Yes No Female 41
40–50K No Yes Yes No Female 43
20–30K No Yes Yes No Male 29
50–60K Yes Yes Yes No Female 39
40–50K No Yes No No Male 55
20–30K No No Yes Yes Female 19
Table 2.3 • The Credit Card Promotion Database
14A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Income RangeMagazine PromoWatch PromoLife Ins PromoCredit Card Ins.Sex AgeC C C C C C RI I I I I I I40-50,000 Yes No No No Male 4530-40,000 Yes Yes Yes No Female 4040-50,000 No No No No Male 4230-40,000 Yes Yes Yes Yes Male 4350-60,000 Yes No Yes No Female 3820-30,000 No No No No Female 5530-40,000 Yes No Yes Yes Male 3520-30,000 No Yes No No Male 2730-40,000 Yes No No No Male 4330-40,000 Yes Yes Yes No Female 4140-50,000 No Yes Yes No Female 4320-30,000 No Yes Yes No Male 2950-60,000 Yes Yes Yes No Female 3940-50,000 No Yes No No Male 5520-30,000 No No Yes Yes Female 19
Data file: CreditCardPromotion.xls
A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
A Hypothesis for the Credit Card Promotion Database
A combination of one or more of the dataset attributes differentiate Acme Credit Card Company card holders who have taken advantage of the life insurance promotion and those card holders who have chosen not to participate in the promotional offer.
A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
A Production Rule for theCredit Card Promotion Database
IF Sex = Female & 19 <=Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00%
Rule Coverage: 66.67%
Question: Can we assume that two-thirds of all females in the specified age range will take advantage of the
promotion?
• Rule accuracy is a between-class measure.• Rule coverage is a within-class measure.
17A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Neural Networks
• A neural network is a set of interconnected nodes designed to imitate the functioning of the human brain.
• Two phases of operations– Learning phase at the input layer until it reaches a
predetermined minimum error rate– Fixing weights and recompute output values for new
instances• A major shortcoming of the neural network approach
is a lack of explanation about what has been learned.
18A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
InputLayer
OutputLayer
HiddenLayer
19A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Table 2.3 • The Credit Card Promotion Database
Income Magazine Watch Life Insurance Credit Card
Range ($) Promotion Promotion Promotion Insurance Sex Age
40–50K Yes No No No Male 45
30–40K Yes Yes Yes No Female 40
(Note that blue: input attributes, red: output attributes)Therefore, there four input nodes and one output node and chose five hidden-layer nodes.
20A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Table 2.4 - Neural Network Training: Actual and Computed Output
Instance Number Life Insurance Promotion Computed Output
1 0 0.024
2 1 0.998
3 0 0.023
4 1 0.986
5 1 0.999
6 0 0.05
7 1 0.999
8 0 0.262
9 0 0.06
10 1 0.997
11 1 0.999
12 1 0.776
13 1 0.999
14 0 0.023
15 1 0.999
21A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Table 2.4 - Neural Network Training: Actual and Computed Output
Instance Number Life Insurance Promotion Computed Output
1 0 0.024
2 1 0.998
3 0 0.023
4 1 0.986
5 1 0.999
6 0 0.05
7 1 0.999
8 0 0.262
9 0 0.06
10 1 0.997
11 1 0.999
12 1 0.776
13 1 0.999
14 0 0.023
15 1 0.999
22A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Statistical Regression
• Statistical regression is a supervised learning technique that generalizes a set of numeric data by creating a mathematical equation relating one or more input attributes to a single numeric output attributes.
• Linear regression model is characterized by an output attribute whose value is determined by a linear sum of weighted input attribute values.
A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Statistical RegressionLife insurance promotion =
0.5909 (credit card insurance) -
0.5455 (sex) + 0.7727
Example, a female who does not have credit card insurance is a likely candidate for the life insurance promotion.
Life insurance promotion = 0.5909 (0) –
0.5455 (0) + 0.7727 = 0.7727
Because the value 0.7727 is close to 1.0, we conclude that the individual is likely to take advantage of the promotional offer.
24A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
2.3 Association Rules
• Association rule mining techniques are used to discover interesting associations between attributes contained in a database.
• Unlike traditional association rules, association rules can have one or several output attributes.
• An output attribute for one rule can be an input attribute for another rule.
• A popular technique for ‘market basket’ analysis.• Problem: we may have several rules with little value.
A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
An Association Rule for the Credit Card Promotion
DatabaseIF Sex = Female & Age = over40 &
Credit Card Insurance = No
THEN Life Insurance Promotion = Yes
26A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
2.4 Clustering Techniques
• By applying unsupervised clustering to the instances of the Acme Credit Card Company database, we will find a subset of input attributes that differentiate card holders who have taken advantage of the life insurance promotion from those cardholders who have not accepted the promotion offer.
27A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
# Instances: 5Sex: Male => 3
Female => 2Age: 37.0Credit Card Insurance: Yes => 1
No => 4Life Insurance Promotion: Yes => 2
No => 3
Cluster 1
Cluster 2
Cluster 3
# Instances: 3Sex: Male => 3
Female => 0Age: 43.3Credit Card Insurance: Yes => 0
No => 3Life Insurance Promotion: Yes => 0
No => 3
# Instances: 7Sex: Male => 2
Female => 5Age: 39.9Credit Card Insurance: Yes => 2
No => 5Life Insurance Promotion: Yes => 7
No => 0
IF Sex=Female & 43>=Age>=35 & Credit Card Insurance=NOTHEN Class = 3 Rule Accuracy: 100% Rule Coverage: 66.67%
28A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
End here for now!
29A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
2.5 Evaluating Performance
• Performance evaluation is probably the most critical of all the steps in the data mining process. Three general questions:– 1. Will the benefits received from a data mining
project more than offset the cost of the data mining process?
– 2. How do we interpret the results of a data mining session?
– 3. Can we use the results of a data mining process with confidence?
30A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Evaluating Supervised Learner Models
• Supervised learner models are designed to classify, estimate, and/or predict future outcome.
• Applications on classification correctness:– Develop a model to accept to reject credit card
applications– Develop a model to accept or reject home mortgage
applicants– Develop a model to decide whether or not to drill for
oil
A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Confusion Matrix• A matrix used to summarize the results of a supervised classification. • Entries along the main diagonal are correct classifications.• Entries other than those on the main diagonal are classification errors.
Classification correctness is best calculated by presenting previously unseen data in the form of a test to the model being evaluated.A confusion matrix is of little use for evaluating supervised learner models offering numeric output.
32A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Table 2.5 • A Three-Class Confusion Matrix
Computed Decision C1 C2 C3C1 C11 C12 C13C2 C21 C22 C23C3 C31 C32 C33
Rule 2: for C2, C21,C22,C23 are all actually members of C2; but C21 and C23 are incorrectly classified as members of another class.Rule 3: for C2, C12 and C32 are instances are incorrectly classified as members of class C2.
Rule 1: Values along the main diagonal represent correct classificatione.g., C11 represents the total number of class C1 instance correctly classified by the model
Computation questions: #1
33A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Computed Computed Accept Reject
Accept True False Accept Reject Reject False True Accept Reject
Table 2.6 • A Simple Confusion Matrix
Two-Class Error Analysis
The more the better
The less the better
34A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Model Computed Computed Model Computed Computed A Accept Reject B Accept Reject
Accept 600 25 Accept 600 75 Reject 75 300 Reject 25 300
Table 2.7 • Two Confusion Matrices Each Showing a 10% Error Rate
Which model is better?
A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Evaluating Numeric
• Mean absolute error
• Mean squared error
• Root mean squared error
36A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Comparing Models by Measuring Lift
• Marketing applications that focus on response rates from mass mailings are less concerned with test set classification error and more interested in building models able to extract bias samples from large populations.
• Supervised learner models designed for extracting bias samples from a general population are often evaluated by a measure that comes directly from marketing known as lift.
A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Computing Lift
)|(
)|(
PopulationCP
SampleCPLift
i
i
Ci is the class of al zero-balance customers who, given the
opportunity, will take advantage of the promotional offer.
38A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
0
200
400
600
800
1000
1200
0 10 20 30 40 50 60 70 80 90 100
NumberResponding
% Sampled
Figure 2.4 – A lift chart (Targeted vs. mass mailing)
39A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
No Computed Computed Ideal Computed Computed Model Accept Reject Model Accept Reject
Accept 1,000 0 Accept 1,000 0 Reject 99,000 0 Reject 0 99,000
Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model
Model Computed Computed Model Computed Computed X Accept Reject Y Accept Reject
Accept 540 460 Accept 450 550 Reject 23,460 75,540 Reject 19,550 79,450
Table 2.9 • Two Confusion Matrices for Alternative Models with Lift Equal to 2.25
40A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Unsupervised Model Evaluation
• Evaluating unsupervised data mining is, in general, a more difficult task than supervised evaluation. This is true because the goals of an unsupervised data mining session are frequently not as clear as the goals for supervised learning.
41A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Data MiningStrategies
SupervisedLearning
Market BasketAnalysis
UnsupervisedClustering
PredictionEstimationClassification