data mining: a closer look chapter 2. 2.1 data mining strategies (p35) moh!
Post on 21-Dec-2015
215 views
TRANSCRIPT
Data Mining: A Closer Look
Chapter 2
2.1 Data Mining Strategies (p35)Data MiningStrategies
SupervisedLearning
Market BasketAnalysis
UnsupervisedClustering
PredictionEstimationClassification
Moh!
Classification
• Learning is supervised.
• The dependent variable is categorical.
• Well-defined classes.
• Current rather than future behavior.
Estimation
• Learning is supervised.
• The dependent variable is numeric.
• Well-defined output classes or variable.
• Current rather than future behavior.(???)
Prediction• The emphasis is on predicting future rather than current outcomes.
• The output attribute may be categorical or numeric.
• The output variable must correspond to the variable to be predicted (the dependent variable). The input variables are the predictor variables, (or independent variables).
• Hence any supervised classification model, or supervised estimation model may be used for prediction if the variables are suitably chosen. That is:
– if the output variable is “current” and
– the input variables are previous attribute values
• If you can classify/estimate the present from the past, then you can predict the future from the present!!!
3.5 Choosing a Data Mining Technique
Initial Considerations
• Is learning supervised or unsupervised?
• Is explanation required?
• What is the interaction between input and output attributes?
• What are the data types of the input and output attributes?
Further Considerationswhich we might prefer to ignore
• Do We Know the Distribution of the Data?
• Do We Know Which Attributes Best Define the Data?• Does the Data Contain Missing Values?• Is Time an Issue?• Which Technique Is Most Likely to Give a Best Test
Set Accuracy?
Methods of Supervised classification
• Decision Trees• Production Rules• Instance based methods• Multiple Discriminant Analysis• Naïve Bayes methods• Neural methodsToday we consider only the first three, which are
machine learning based; the last three are statistically based.
Table 2.1 • Cardiology Patient Data Attribute Mixed Numeric Name Values Values Comments
Age Numeric Numeric Age in years
Sex Male, Female 1, 0 Patient gender
Chest PainType Angina,
Abnormal Angina, 1–4 NoTang =
NoTang Nonanginal Asymptomatic pain
Blood Pressure Numeric Numeric Resting Bp at Hosp.
Cholesterol Numeric Numeric Serum cholesterol
Fasting Blood True, False 1, 0 Is fasting blood sugar less Sugar < 120 than 120?
Resting ECG Normal, Abnormal, Hyp 0, 1, 2 Hyp = LVT
Maximum Heart Numeric Numeric Max rate
Rate achieved
Table 2.2 • Most and Least Typical Instances from the Cardiology Domain
Attribute Most Typical Least Typical Most Typical Least TypicalName Healthy Class Healthy Class Sick Class Sick Class
Age 52 63 60 62Sex Male Male Male FemaleChest Pain Type NoTang Angina Asymptomatic AsymptomaticBlood Pressure 138 145 125 160Cholesterol 223 233 258 164Fasting Blood Sugar < 120 False True False FalseResting ECG Normal Hyp Hyp HypMaximum Heart Rate 169 150 141 145Induced Angina? False False True FalseOld Peak 0 2.3 2.8 6.2Slope Up Down Flat DownNumber of Colored Vessels 0 0 1 3Thal Normal Fix Rev Rev
Concept Class = output variable = Healthy/Sick = 1/0
A Healthy Class Rule for the Cardiology Patient Dataset
IF 169 <= Maximum Heart Rate <=202
THEN Concept Class = Healthy
Rule accuracy: 85.07% High Heart rate is quite a good predictor of health
Rule coverage: 34.55% But there are other ways of being healthy.– Rule accuracy is a between-class measure.
– Rule coverage is a within-class measure.
A Sick Class Rule for the Cardiology Patient Dataset
IF Thal = Rev & Chest Pain Type = Asymptomatic
THEN Concept Class = Sick
Rule accuracy: 91.14%
Rule coverage: 52.17%
Healthy
High Heart Rate
Acceptance/rejection of the “Life Insurance Promotion” offer is the output variable.
A Hypothesis for the Insurance Promotion
For credit card holders,
A combination of one or more of the attributes
can differentiate those who say yes to the life insurance promotion
from those who say no.
2.5 Evaluating Supervised Model Performance
The Confusion Matrix
•A matrix used to summarize the results of a supervised classification.
• Entries along the main diagonal are correct classifications.
• Entries other than those on the main diagonal are classification errors.
Table 2.5 • A Three-Class Confusion Matrix
Computed Decision
C1 C2 C3C1 C11 C12 C13
C2 C21 C22 C23
C3 C31 C32 C33
True Classes
c11 is the number with true class “1” which are correctly classified as
class “1”
c12 is the number with true class “1” which are mis-classified as
class “2”
Etc..
Two-Class Error Analysis
Table 2.6 • A Simple Confusion Matrix
Computed Computed
Accept
Reject
Accept True False Accept Reject Reject False True Accept Reject
True
Table 2.7 • Two Confusion Matrices Each Showing a 10% Error Rate
Model Computed Computed Model Computed ComputedA Accept Reject B Accept Reject
Accept 600 25 Accept 600 75Reject 75 300 Reject 25 300
Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model
No Computed Computed Ideal Computed ComputedModel Accept Reject Model Accept Reject
Accept 1,000 0 Accept 1,000 0Reject 99,000 0 Reject 0 99,000
Figure 2.4 Targeted vs. mass mailing
0
200
400
600
800
1000
1200
0 10 20 30 40 50 60 70 80 90 100
NumberResponding
% Sampled
Comparing Models by Measuring Lift
Representative sample
Targetted Sample
Computing Lift
)Random|(
)Targetted|(
SamplerespondingP
SamplerespondingPLift
Table 2.9 • Two Confusion Matrices for Alternative Models with Lift Equal to 2.25
Model Computed Computed Model Computed ComputedX Accept Reject Y Accept Reject
Accept 540 460 Accept 450 550Reject 23,460 75,540 Reject 19,550 79,450
The population is 100,000. Consider the first Confusion Matrix.
The acceptance rate of those predicted to accept is 540/23,460 = 2.3%
The overall acceptance rate in the population is 1000/100,000
= 1%
Therefore the lift in the response rate from using the classification model for targetted sampling/marketting is 2.3/1 = 2.3.
Basic Data Mining Techniques : Chapter 3
3.1 Decision TreesAn Algorithm for Building Decision Trees
1. Let T be the set of training instances.2. Choose an attribute that best differentiates the instances in T.3. Create a tree node whose value is the chosen attribute. Create child links from this node where each link represents a unique value for the chosen attribute.Use the child link values to further subdivide the instances into subclasses. 4. For each subclass created in step 3: If the instances in the subclass satisfy predefined criteria or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path. If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
Don’t worry too much about this. It is just “algorithm speak”, which we do not concern ourselves with.
2.2/3.1 Supervised Data Mining TechniquesAnother (pseudo) Dataset
Table 2.3 • The Credit Card Promotion Database
Income Magazine Watch Life Insurance Credit CardRange ($) Promotion Promotion Promotion Insurance Sex Age
40–50K Yes No No No Male 4530–40K Yes Yes Yes No Female 4040–50K No No No No Male 4230–40K Yes Yes Yes Yes Male 4350–60K Yes No Yes No Female 3820–30K No No No No Female 5530–40K Yes No Yes Yes Male 3520–30K No Yes No No Male 2730–40K Yes No No No Male 4330–40K Yes Yes Yes No Female 4140–50K No Yes Yes No Female 4320–30K No Yes Yes No Male 2950–60K Yes Yes Yes No Female 3940–50K No Yes No No Male 5520–30K No No Yes Yes Female 19
Figure 3.1 A partial decision tree with root node = income range
IncomeRange
30-40K
4 Yes1 No
2 Yes2 No
1 Yes3 No
2 Yes
50-60K40-50K20-30K
Figure 3.2 A partial decision tree with root node = credit card insurance
CreditCard
Insurance
No Yes
3 Yes0 No
6 Yes6 No
Figure 3.3 A partial decision tree with root node = age
Age
<= 43 > 43
0 Yes3 No
9 Yes3 No
Decision Trees for the Credit Card Promotion Database
Figure 3.4 A three-node decision tree for the credit card database
Age
Sex
<= 43
Male
Yes (6/0)
Female
> 43
CreditCard
Insurance
YesNo
No (4/1) Yes (2/0)
No (3/0)
Figure 3.5 A two-node decision treee for the credit card database
CreditCard
Insurance
Sex
No
Male
Yes (6/1)
Female
Yes
Yes (3/0)
No (6/1)
Table 3.2 • Training Data Instances Following the Path in Figure 3.4 to Credit CardInsurance = No
Income Life Insurance Credit CardRange Promotion Insurance Sex Age
40–50K No No Male 4220–30K No No Male 2730–40K No No Male 4320–30K Yes No Male 29
Decision Tree Rules
Rules for the Tree in Figure 3.4
IF Age <=43 & Sex = Male & Credit Card Insurance = NoTHEN Life Insurance Promotion = No
• IF Sex = Female & 19 <=Age <= 43• THEN Life Insurance Promotion = Yes• Rule Accuracy: 100.00%• Rule Coverage: 66.67%
A Production Rule for theCredit Card Promotion Database
IF Sex = Female & 19 <=Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00%
Rule Coverage: 66.67%
A Simplified Rule Obtained by Removing Attribute Age
IF Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No
• CART
• CHAID
Other Methods for Building Decision Trees
Advantages of Decision Trees
• Easy to understand.
• Map nicely to a set of production rules.• Applied to real problems.• Make no prior assumptions about the data.• Able to process both numerical and categorical data.
Disadvantages of Decision Trees
• Output attribute must be categorical.
• Limited to one output attribute.• Decision tree algorithms are unstable.• Trees created from numeric datasets can be complex.
An Excel-based Data Mining Tool
Chapter 4
4.1 The iData Analyzer
4.2 ESX: A Multipurpose Tool for Data Mining
A Live Demonstration
Laboratory Exercise.
Figure 4.2 A successful installation
4.5 A Six-Step Approach for Supervised Learning
Step 1: Choose an Output Attribute
Step 2: Perform the Mining Session
Step 3: Read and Interpret Summary Results
Step 4: Read and Interpret Test Set Results
Step 5: Read and Interpret Class Results
Step 6: Visualize and Interpret Class Rules
Figure 4.12 Test set instance classification
Read and Interpret Test Set Results
4.7 Instance Typicality
• The typicality of instance I is the “average” “similarity” of I to the other members of its cluster or class.
• definitions of “average” and “similarity” in iDA are secret!!!
• Typicality values lie between 0 and 1.
• 1 indicates the class “prototype”
• 0 indicates the class “outlier”
Typicality Scores
• Identify prototypical and outlier instances.
• Select a best set of training instances.
• Used to compute individual instance classification confidence scores.
• CLASS SIMILARITY is the average similarity of members of a class with other members of the same class.
Figure 4.13 Instance typicality
Other Definitions
Given class C and categorical attribute A with values v1, v2,…vn, then the
• Class C predictability score for A = v2 (say) is the proportion of instances in C with A = v2. This is concerned with the predictability of A = v2 in the class C.
• Class C predictiveness is the proportion of instances with A = v2 which are in class C. This is concerned with the predictabiity of the Class C from A = v2.