TRANSCRIPT
Section 2.1
Introduction to Enterprise Miner
Objectives
– Open Enterprise Miner.
– Explore the workspace components of Enterprise Miner.
– Set up a project in Enterprise Miner.
– Conduct initial data exploration using Enterprise Miner.
Demonstration
This demonstration illustrates opening Enterprise Miner and exploring its workspace components.
The Scenario
Determine who should be approved for a home equity loan. The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan. The input variables include the amount of the loan, the amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.
Demonstration
This demonstration illustrates setting up a project in Enterprise Miner and conducting initial data exploration.
Section 2.2
Modeling Issues and Data Difficulties
Objectives
– Discuss data difficulties inherent in data mining.
– Examine common pitfalls in model building.
Time Line
[Figure: four time lines (projected, actual, dreaded, needed) dividing the allotted time between data preparation (including data acquisition) and data analysis.]
Data Arrangement

Long-Narrow:

Acct  Type
2133  MTG
2133  SVG
2133  CK
2653  CK
2653  SVG
3544  MTG
3544  CK
3544  MMF
3544  CD
3544  LOC

Short-Wide:

Acct  CK  SVG  MMF  CD  LOC  MTG
2133   1    1    0   0    0    1
2653   1    1    0   0    0    0
3544   1    0    1   1    1    1
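The long-narrow to short-wide conversion above can be sketched with pandas (the column names Acct and Type come from the slide; the library choice is an assumption, since the course itself does this inside Enterprise Miner):

```python
import pandas as pd

# Long-narrow arrangement: one row per account-product pair (data from the slide).
long_narrow = pd.DataFrame({
    "Acct": [2133, 2133, 2133, 2653, 2653, 3544, 3544, 3544, 3544, 3544],
    "Type": ["MTG", "SVG", "CK", "CK", "SVG", "MTG", "CK", "MMF", "CD", "LOC"],
})

# Pivot to the short-wide arrangement: one row per account,
# one 0/1 indicator column per product type.
short_wide = (
    pd.crosstab(long_narrow["Acct"], long_narrow["Type"])
      .clip(upper=1)                                  # indicators, not counts
      .reindex(columns=["CK", "SVG", "MMF", "CD", "LOC", "MTG"], fill_value=0)
)
print(short_wide)
```

The short-wide form is what most modeling nodes expect: one row per analysis unit, one column per input.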
Derived Inputs

Claim     Accident Date/Time   Delay   Season   Dark
11nov96   102396/12:38            19   fall        0
22dec95   012395/01:42           333   winter      1
26apr95   042395/03:05             3   spring      1
02jul94   070294/06:25             0   summer      0
08mar96   123095/18:33            69   winter      0
15dec96   061296/18:12           186   summer      0
09nov94   110594/22:14             4   fall        1
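The three derived inputs can be computed from the two raw fields. A minimal sketch, where the date formats and the 7 p.m. to 6 a.m. "dark" window are assumptions reverse-engineered from the sample rows (a real derivation would use actual sunrise and sunset times):

```python
from datetime import datetime

SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "fall", 10: "fall", 11: "fall"}

def derive_inputs(claim_date, accident):
    """Turn the raw Claim date ("11nov96") and Accident date/time
    ("102396/12:38", read as MMDDYY/HH:MM) into Delay, Season, Dark."""
    claim = datetime.strptime(claim_date, "%d%b%y")
    acc = datetime.strptime(accident, "%m%d%y/%H:%M")
    delay = (claim.date() - acc.date()).days    # days from accident to claim
    season = SEASONS[acc.month]
    dark = int(acc.hour >= 19 or acc.hour < 6)  # assumed 7 p.m.-6 a.m. window
    return delay, season, dark
```

For example, `derive_inputs("11nov96", "102396/12:38")` reproduces the first row of the table: a 19-day delay, fall, not dark.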
Roll Up

Account-level data:

HH    Acct  Sales
4461  2133    160
4461  2244     42
4461  2773    212
4461  2653    250
4461  2801    122
4911  3544    786
5630  2496    458
5630  2635    328
6225  4244     27
6225  4165    759

Rolled up, one record per household:

HH    Acct  Sales
4461  2133      ?
4911  3544      ?
5630  2496      ?
6225  4244      ?
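A roll-up like this reduces to a group-and-aggregate. A plain-Python sketch; the choice of summing sales is an assumption, since the slide deliberately leaves the rolled-up Sales column as "?" (counting accounts, or taking the maximum, are equally valid roll-ups):

```python
from collections import defaultdict

# Account-level rows from the slide: (household, account, sales).
rows = [
    (4461, 2133, 160), (4461, 2244, 42), (4461, 2773, 212),
    (4461, 2653, 250), (4461, 2801, 122),
    (4911, 3544, 786),
    (5630, 2496, 458), (5630, 2635, 328),
    (6225, 4244, 27), (6225, 4165, 759),
]

# Roll up to one record per household by summing account-level sales.
household_sales = defaultdict(int)
for hh, acct, sales in rows:
    household_sales[hh] += sales

print(dict(household_sales))
```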
Rolling Up Longitudinal Data

Frequent Flying:

Flier   Month   Mileage   VIP Member
10621   Jan         650   No
10621   Feb           0   No
10621   Mar           0   No
10621   Apr         250   No
33855   Jan         350   No
33855   Feb         300   No
33855   Mar        1200   Yes
33855   Apr         850   Yes
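Rolling up longitudinal data turns the repeated monthly rows into one record per flier, with each month (plus any summaries) becoming its own input. A plain-Python sketch; the derived column names are illustrative:

```python
from collections import defaultdict

# Longitudinal rows from the slide: (flier, month, mileage, vip).
rows = [
    (10621, "Jan", 650, "No"),  (10621, "Feb", 0, "No"),
    (10621, "Mar", 0, "No"),    (10621, "Apr", 250, "No"),
    (33855, "Jan", 350, "No"),  (33855, "Feb", 300, "No"),
    (33855, "Mar", 1200, "Yes"), (33855, "Apr", 850, "Yes"),
]

# One record per flier: each month's mileage becomes its own input,
# alongside summary inputs such as total mileage and ever-VIP status.
rolled = defaultdict(dict)
for flier, month, mileage, vip in rows:
    rec = rolled[flier]
    rec[f"mileage_{month.lower()}"] = mileage
    rec["total"] = rec.get("total", 0) + mileage
    rec["ever_vip"] = rec.get("ever_vip", False) or vip == "Yes"
```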
Hard Target Search
[Figure: a mass of Transactions in which only a handful are marked Fraud: the target event is rare.]
Oversampling
[Figure: a modeling sample in which the rare Fraud cases are deliberately overrepresented relative to the OK cases.]
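Oversampling can be sketched as: keep every rare-event case and subsample the common ones. The data below are synthetic and the 3-to-1 ratio is an arbitrary illustration; note that probabilities from a model fit to such a sample must later be adjusted back to the population priors:

```python
import random

random.seed(1)

# Synthetic transaction file with a rare target: roughly 2% Fraud.
population = [{"id": i, "fraud": random.random() < 0.02} for i in range(10_000)]

frauds = [r for r in population if r["fraud"]]
oks = [r for r in population if not r["fraud"]]

# Keep all Fraud cases; subsample OK cases at 3 per Fraud,
# giving a modeling sample that is 25% Fraud instead of ~2%.
sample = frauds + random.sample(oks, 3 * len(frauds))
random.shuffle(sample)
```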
Undercoverage
[Figure: past applicants split into Accepted Good, Accepted Bad, and Rejected (no follow-up); because the rejects were never observed, the training data undercover the next generation of applicants.]
Errors, Outliers, and Missings
[Table: raw bank account fields (cking, #cking, ADB, NSF, dirdep, SVG, bal) riddled with problems: inconsistent codes (a lowercase "y" among "Y"s), an extreme outlier (an ADB of 89981.12), and missing values recorded as periods.]
Missing Value Imputation
[Figure: a cases-by-inputs data matrix with scattered missing cells, each marked "?", to be filled in by imputation.]
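A minimal imputation sketch in plain Python: fill the "?" cells of one input with the median of its observed values, and keep a missing-value indicator as well. The sample values echo the raw banking table above; the indicator input is a common companion technique, not something the slide prescribes:

```python
from statistics import median

# One input with scattered missing values (None), as in the slide's matrix.
values = [468.11, 68.75, None, 212.04, None, 585.05, 47.69, None]

observed = [v for v in values if v is not None]
fill = median(observed)          # the median is robust to outliers

# Impute, and record which cells were missing: the fact that a value
# was missing can itself be predictive.
imputed = [fill if v is None else v for v in values]
missing_flag = [int(v is None) for v in values]
```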
The Curse of Dimensionality
[Figure: the same number of points scattered in 1-D, 2-D, and 3-D, growing ever sparser as the dimension increases.]
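The sparsity can be quantified: for data uniform on the unit hypercube, a sub-cube that captures a fixed 10% of the cases must span 0.1^(1/d) of each input's range, so "local" neighborhoods stop being local as the dimension grows:

```python
# Side length of a sub-cube capturing 10% of a unit hypercube's volume.
for d in (1, 2, 3, 10):
    print(f"{d}-D: {0.10 ** (1 / d):.2f}")
```

In 1-D the neighborhood spans 10% of the axis; by 10 dimensions it must span about 79% of every axis.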
Dimension Reduction
[Figure: two panels. Redundancy: Input 3 plotted against Input 1, with the points falling along a line, so the two inputs carry the same information. Irrelevancy: E(Target) plotted over Input 1 and Input 2 is flat in one input's direction, so that input tells nothing about the target.]
Fool’s Gold
“My model fits the training data perfectly... I’ve struck it rich!”
Data Splitting
Model Complexity
[Figure: two fits to the same data: one too flexible, chasing individual points, and one not flexible enough, missing the underlying pattern.]
Overfitting
[Figure: an overly flexible fit tracks the training set closely but fails on the test set.]
Better Fitting
[Figure: a smoother fit performs similarly on the training set and the test set.]
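The data-splitting idea above, as a minimal sketch: partition the cases before modeling, fit on the training set, and judge flexibility only on the held-out test set (the 67/33 split is one common choice; inside Enterprise Miner this is handled by its data-partitioning facility):

```python
import random

random.seed(2)

# Honest assessment: split the cases before any modeling.
cases = list(range(1000))
random.shuffle(cases)

cut = int(0.67 * len(cases))        # a common two-thirds / one-third split
train, test = cases[:cut], cases[cut:]

print(len(train), len(test))
```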
Section 2.3
Introduction to Decision Trees
Objectives
– Explore the general concept of decision trees.
– Understand the different decision tree algorithms.
– Discuss the benefits and drawbacks of decision tree models.
Fitted Decision Tree
[Figure: a fitted tree for the home equity data. The root splits on DEBTINC (< 45 versus ≥ 45); deeper nodes split on DELINQ and NINQ (with groupings such as 0, 1-2, and > 2), ending in leaves whose BAD proportions are 2%, 10%, 21%, 45%, and 75%. A new case with DEBTINC = 20, NINQ = 2, DELINQ = 0, and Income = $42K is dropped down the splits and lands in a leaf predicting BAD = 45%.]
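Scoring a new case amounts to a chain of if/else tests. The sketch below is one plausible reading of the slide's split structure (the branch arrangement is hypothetical, reconstructed from the leaf percentages), and it does route the slide's new case to the 45% leaf:

```python
def score(debtinc, delinq, ninq):
    """Score a case by dropping it down the fitted splits.

    Split structure is an assumed reading of the slide's tree;
    the leaf values are the slide's leaf BAD percentages.
    """
    if debtinc < 45:
        if delinq > 2:
            return 0.21                        # many delinquencies
        return 0.45 if ninq > 1 else 0.02      # recent credit inquiries
    else:
        return 0.75 if ninq > 1 else 0.10

# The slide's new case: DEBTINC = 20, NINQ = 2, DELINQ = 0.
print(score(debtinc=20, delinq=0, ninq=2))
```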
Divide and Conquer

Root node: n = 5,000, 10% BAD

Debt-to-Income Ratio < 45?
yes: n = 3,350,  5% BAD
no:  n = 1,650, 21% BAD
The Cultivation of Trees
Split Search
– Which splits are to be considered?
Splitting Criterion
– Which split is best?
Stopping Rule
– When should the splitting stop?
Pruning Rule
– Should some branches be lopped off?
Possible Splits to Consider
[Figure: the number of possible splits versus the number of input levels (2 to 20): roughly linear for an ordinal input, but explosive for a nominal input, exceeding 500,000 splits by 20 levels.]
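The gap between the two curves follows from a simple count: an ordinal input with L levels allows only the L − 1 cut points, while a nominal input allows any two-way grouping of its levels, 2^(L−1) − 1 splits:

```python
# Number of binary splits a tree can form from one input with `levels` levels.
def ordinal_splits(levels):
    return levels - 1               # only order-preserving cut points

def nominal_splits(levels):
    return 2 ** (levels - 1) - 1    # any grouping of levels into two branches

print(nominal_splits(20))           # past 500,000, as in the slide
```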
Splitting Criteria

A Perfect Split:

          Left   Right
Not Bad   4500       0
Bad          0     500

Debt-to-Income Ratio < 45:

          Left   Right
Not Bad   3196    1304
Bad        154     346

A Competing Three-Way Split:

          Left   Center   Right
Not Bad   2521     1188     791
Bad        115      162     223
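One common splitting criterion scores each candidate by the Pearson chi-square of its branches-by-target table (larger means stronger separation between branches). A sketch, applied to the slide's candidate splits:

```python
def chi_square(table):
    """Pearson chi-square for an r x c contingency table
    (rows = target levels, columns = branches)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    return sum(
        (obs - row[i] * col[j] / total) ** 2 / (row[i] * col[j] / total)
        for i, r in enumerate(table)
        for j, obs in enumerate(r)
    )

perfect = [[4500, 0], [0, 500]]              # the perfect split
dti = [[3196, 1304], [154, 346]]             # Debt-to-Income Ratio < 45
three_way = [[2521, 1188, 791], [115, 162, 223]]

print(chi_square(perfect), chi_square(dti), chi_square(three_way))
```

A perfect split reaches the maximum possible value (the sample size, 5,000 here); the candidates are compared by how close they come.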
The Right-Sized Tree
Stunting
Pruning
A Field Guide to Tree Algorithms
CART
AID, THAID, CHAID
ID3, C4.5, C5.0
Benefits of Trees
Interpretability
– tree-structured presentation
Mixed Measurement Scales
– nominal, ordinal, interval
Regression trees
Robustness
Missing Values
Benefits of Trees
Automatically
– detects interactions (AID)
– accommodates nonlinearity
– selects input variables
[Figure: the fitted probability is a multivariate step function of the inputs.]
Drawbacks of Trees
Roughness
Linear, Main Effects
Instability
Section 2.4
Building and Interpreting Decision Trees
Objectives
– Explore the types of decision tree models available in Enterprise Miner.
– Build a decision tree model.
– Examine the model results and interpret these results.
– Choose a decision threshold theoretically and empirically.
Demonstration
This demonstration illustrates building a decision tree model with Enterprise Miner and examining the results.
Consequences of a Decision

           Decision 1       Decision 0
Actual 1   True Positive    False Negative
Actual 0   False Positive   True Negative
Example
Recall the home equity line of credit scoring example. Presume that every two dollars loaned eventually returns three dollars if the loan is paid off in full.
Consequences of a Decision

           Decision 1                 Decision 0
Actual 1   True Positive              False Negative (cost=$2)
Actual 0   False Positive (cost=$1)   True Negative
Bayes Rule

Classify a case as 1 when its predicted probability of the event exceeds the threshold

    θ = 1 / (1 + cost of false negative / cost of false positive)

With the costs above (false negative = $2, false positive = $1), θ = 1 / (1 + 2) = 1/3.
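The rule can be checked numerically; a minimal sketch using the example's costs, where decision 1 means classifying the applicant as a likely defaulter:

```python
def bayes_threshold(cost_fn, cost_fp):
    """Probability cutoff that minimizes expected cost:
    decide 1 when p * cost_fn > (1 - p) * cost_fp."""
    return 1 / (1 + cost_fn / cost_fp)

# Home equity example: a missed defaulter loses the $2 loaned,
# a wrongly rejected good loan forgoes $1 of profit.
theta = bayes_threshold(cost_fn=2, cost_fp=1)
print(theta)    # 1/3: reject applicants whose P(default) exceeds this
```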
Consequences of a Decision

           Decision 1                   Decision 0
Actual 1   True Positive (profit=$2)    False Negative
Actual 0   False Positive (profit=$-1)  True Negative
Demonstration
This demonstration illustrates using the target profile to select a decision threshold.