Decision Trees: CHAID and CART
TRANSCRIPT
-
Classification Problems
Introduction to classification.
Decision Tree approach for classification.
Chi-Square Automatic Interaction Detection (CHAID)
Classification and Regression Tree (CART)
-
CHAID Chi-square Automatic Interaction Detection
-
Introduction to CHAID
CHAID is a decision tree algorithm used in classification problems.
CHAID uses the chi-square test of independence for splitting. CHAID was first presented in the article "An Exploratory Technique for Investigating Large Quantities of Categorical Data" by G. V. Kass in Applied Statistics (1980).
-
CHAID
CHAID partitions the data into mutually exclusive, exhaustive subsets that best describe the dependent categorical variable.
CHAID is an iterative procedure that examines the predictors (or classification variables) and uses them in the order of their statistical significance.
-
Chi-Square test of independence
Chi-square test of independence starts with an assumption that there is no relationship between two variables.
For example, we assume that there is no relationship between checking account balance and default.
-
Chi-Square test of Independence German Credit Case
H0: There is no relationship between checking account balance and default.
HA: There is a relationship between checking account balance and default.
-
Chi-Square test of Independence German Credit Case
H0: Checking account balance and default are independent.
HA: Checking account balance and default are dependent.
-
Contingency Table
Checking account balance | Default = 1 | Default = 0 | Total
0 DM                     | 135         | 139         | 274
Other than 0 DM          | 165         | 561         | 726
Total                    | 300         | 700         | 1000
-
Chi-Square test of Independence: Test Statistic
Expected frequency: E = (row sum × column sum) / total
Test statistic: χ² = Σ (O − E)² / E
-
Chi-Square test of Independence

Cell                   | Observed (O) | Expected (E) | (O − E)²/E
0 DM, Default = 1      | 135          | 82.2         | 33.91
0 DM, Default = 0      | 139          | 191.8        | 14.53
Not 0 DM, Default = 1  | 165          | 217.8        | 12.80
Not 0 DM, Default = 0  | 561          | 508.2        | 5.48
Total                  | 1000         | 1000         | 66.73

P-value = 3.10E-16. Since the p-value is less than 0.05, we reject the null hypothesis.
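The table above can be reproduced in a few lines; a minimal sketch in Python (standard library only), using the fact that for one degree of freedom the chi-square upper-tail p-value equals erfc(√(χ²/2)):

```python
import math

# German Credit contingency table from the slides:
# rows = checking account balance (0 DM / other), columns = default (1 / 0).
observed = [[135, 139],   # 0 DM
            [165, 561]]   # Other than 0 DM

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

# Expected frequency under independence: E = row sum * column sum / total.
expected = [[r * c / total for c in col_sums] for r in row_sums]

# Test statistic: chi-square = sum over cells of (O - E)^2 / E.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# A 2x2 table has (2-1)(2-1) = 1 degree of freedom, and the upper-tail
# p-value of a chi-square(1) variable is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), p_value)  # 66.74 (66.73 on the slide, which sums
                                # the rounded cells) and p ~ 3.1e-16
```

The same result is available from `scipy.stats.chi2_contingency` if SciPy is installed.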
-
CHAID Procedure Step 1: Examine each predictor variable for its statistical
significance with the dependent variable.
Step 2: Determine the most significant among the predictors (predictor with smallest p value).
Step 3: Divide the data by levels of the most significant predictor (using chi-square test of independence). Each of these groups will be examined individually further.
Step 4: For each sub-group, determine the most significant variable from the remaining predictors and divide the data again.
Step 5: Repeat step 4 until all statistically significant predictors have been identified.
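Steps 1 and 2 above — testing every predictor against the target and picking the one with the smallest p-value — can be sketched as follows. This is a simplified illustration assuming binary predictors (so each 2x2 test has one degree of freedom); the variable names and data are hypothetical:

```python
import math
from collections import Counter

def chi2_p_value(pairs):
    """Chi-square test of independence for two binary variables.

    `pairs` is a list of (predictor_value, target_value) tuples.
    Returns the p-value (df = 1 for a 2x2 table)."""
    counts = Counter(pairs)
    xs = sorted({x for x, _ in pairs})
    ys = sorted({y for _, y in pairs})
    total = len(pairs)
    chi2 = 0.0
    for x in xs:
        for y in ys:
            o = counts[(x, y)]
            e = (sum(counts[(x, yy)] for yy in ys) *
                 sum(counts[(xx, y)] for xx in xs)) / total
            chi2 += (o - e) ** 2 / e
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical data: two binary predictors and a binary default flag.
data = [
    {"low_balance": 1, "young": 1, "default": 1},
    {"low_balance": 1, "young": 0, "default": 1},
    {"low_balance": 1, "young": 1, "default": 1},
    {"low_balance": 0, "young": 0, "default": 0},
    {"low_balance": 0, "young": 1, "default": 0},
    {"low_balance": 0, "young": 0, "default": 1},
    {"low_balance": 1, "young": 0, "default": 0},
    {"low_balance": 0, "young": 1, "default": 0},
]

# Step 1: p-value of each predictor against the target.
p_values = {
    name: chi2_p_value([(row[name], row["default"]) for row in data])
    for name in ("low_balance", "young")
}
# Step 2: the most significant predictor has the smallest p-value.
best = min(p_values, key=p_values.get)
print(best)  # low_balance
```

Step 3 would then split the data by the levels of `best` and recurse on each sub-group with the remaining predictors.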
-
CHAID
CHAID uses both splitting and merging steps.
In merging, the least significantly different groups are merged to form one class.
In splitting, the values of a predictor that result in the most significantly different classes are used.
The split selection is based on the chi-square test of independence between the grouped predictor and the dependent variable.
-
Chi-Square test of independence
H0: Two groups are independent with respect to a dependent variable.
HA: Two groups are not independent with respect to a dependent variable.
We accept the null hypothesis when the p-value is greater than 0.05.
Degrees of freedom = (r − 1)(c − 1)
Test statistic: χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ
-
CHAID Input
Significance level for partitioning a variable.
Significance level for merging.
Minimum number of records for the cells.
-
CHAID Example: Breaking Barriers
-
        | OBSERVED       |       | EXPECTED
LTV     | 0     | 1      | Total | 0        | 1
80      | 77    | 83     | 160   | 32.10    | 127.90
Total   | 315   | 1255   | 1570  |          |

CHI STATISTIC = 10105.29
P-VALUE ≈ 0
The chi-square test confirms the relationship.
-
/* Node 1 */.
DO IF (VALUE(LTV) EQ 1).
COMPUTE nod_001 = 1.
COMPUTE pre_001 = 1.
COMPUTE prb_001 = 0.8686.
END IF.
EXECUTE.
/* Node 2 */.
DO IF (SYSMIS(@0DM) OR VALUE(@0DM) NE 1).
COMPUTE nod_001 = 2.
COMPUTE pre_001 = 1.
COMPUTE prb_001 = 0.78431.
END IF.
EXECUTE.
BUSINESS RULES
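The SPSS business rules above amount to a plain decision rule that can be applied outside SPSS; a minimal Python sketch (field names follow the SPSS syntax, `None` stands in for a system-missing value, and the two rules are treated as mutually exclusive for simplicity):

```python
def score(ltv_flag, zero_dm_flag):
    """Assign a CHAID terminal node, predicted class, and probability.

    Mirrors the two business rules above: `ltv_flag` is the LTV
    indicator and `zero_dm_flag` the 0 DM checking-account indicator
    (None = missing, as with SPSS SYSMIS)."""
    if ltv_flag == 1:
        return {"node": 1, "predicted": 1, "probability": 0.8686}
    if zero_dm_flag is None or zero_dm_flag != 1:
        return {"node": 2, "predicted": 1, "probability": 0.78431}
    return None  # no rule fires for the remaining combinations

print(score(1, 0))     # lands in node 1
print(score(0, None))  # lands in node 2
```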
-
CHAID optimal cut
For a given variable (say LTV), find a cut-off value x such that the chi-square statistic is maximized:

statistic = Max over x ∈ LTV of  Σᵢ₌₁ⁿ Σⱼ₌₁ᵐ [Oᵢⱼ(x) − Eᵢⱼ(x)]² / Eᵢⱼ(x)

Check whether the maximum chi-square is significant.
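The search above can be done by brute force over candidate cut-offs; a minimal sketch assuming a numeric LTV column and a binary default flag (the sample data are hypothetical):

```python
def chi2_statistic(table):
    """Chi-square statistic for a 2x2 contingency table [[a, b], [c, d]]."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    return sum((table[i][j] - row[i] * col[j] / total) ** 2
               / (row[i] * col[j] / total)
               for i in range(2) for j in range(2))

def best_cut(ltv, default):
    """Find the cut-off x on LTV maximizing the chi-square statistic
    of the split (LTV <= x) vs (LTV > x) against the default flag."""
    best_x, best_stat = None, -1.0
    for x in sorted(set(ltv))[:-1]:   # cutting at the max leaves no split
        table = [[0, 0], [0, 0]]
        for v, d in zip(ltv, default):
            table[0 if v <= x else 1][d] += 1
        stat = chi2_statistic(table)
        if stat > best_stat:
            best_x, best_stat = x, stat
    return best_x, best_stat

# Hypothetical sample: low LTV rarely defaults, high LTV mostly does.
ltv     = [40, 45, 50, 55, 60, 80, 85, 90, 95, 100]
default = [ 0,  0,  0,  0,  1,  1,  1,  1,  1,   1]
print(best_cut(ltv, default))  # cut at 55 separates the classes perfectly
```

The maximum statistic would then be checked for significance before the split is accepted.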
-
Classification and Regression Trees (CART)
For regression trees, splits are chosen to minimize the SSE between the observations and the mean value of the node.
CART is a binary tree, whereas CHAID can split the initial node into more than 2 branches.
For classification trees, CART uses the Gini Index to minimize the classification error.
-
Gini Index (Classification Impurity)
The Gini Index is used to measure the impurity at a node (in a classification problem) and is given by:

Gini(k) = Σⱼ₌₁ᴷ P(j|k) · (1 − P(j|k))

where P(j|k) is the proportion of category j in node k.
A smaller Gini Index implies less impurity.
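The formula above takes only a few lines; a minimal sketch where `labels` is any sequence of class labels:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: sum over classes of p * (1 - p)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini([1, 1, 1, 1]))                 # 0.0 -> pure node, no impurity
print(gini([0, 0, 1, 1]))                 # 0.5 -> maximum for two classes
print(round(gini([0]*700 + [1]*300), 2))  # 0.42 -> the German Credit root
```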
-
Classification Tree Logic
Node t splits into a left node t_L and a right node t_R.
N   = number of observations in node t
N_L = number of observations in the left node
N_R = number of observations in the right node, with N = N_L + N_R
i(.) = impurity at node (.)

Reduction in impurity = Max [ N·i(t) − N_L·i(t_L) − N_R·i(t_R) ]
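The impurity-reduction criterion can be evaluated directly; a minimal sketch using the Gini impurity (the node contents below are hypothetical):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: sum over classes of p * (1 - p)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def impurity_reduction(parent, left, right):
    """N*i(t) - N_L*i(t_L) - N_R*i(t_R): what a candidate split gains."""
    return (len(parent) * gini(parent)
            - len(left) * gini(left)
            - len(right) * gini(right))

# Hypothetical node of 10 observations split by some candidate rule.
parent = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
left, right = parent[:4], parent[4:]   # left: all 0s; right: one 0, five 1s

print(round(impurity_reduction(parent, left, right), 3))  # 3.333
```

CART evaluates this quantity for every candidate split and keeps the one with the largest reduction.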
-
SHUBHAM Classification and Regression Tree
-
SHUBHAM Classification and Regression Tree
Reduction in impurity = i(0) − P₁·i(1) − P₂·i(2), evaluated with each node's Gini impurity P·(1 − P) summed over its classes (root node: 1570 observations).
-
Entropy
Entropy is a measure of impurity and is given by:
Entropy(k) = − Σ꜀₌₁ᶜ P(c|k) · log(P(c|k))

where P(c|k) is the proportion of category c in node k.
-
Entropy
[Figure: entropy (scaled) plotted against the class proportion p from 0 to 1]
-
[Figure: Gini Index and Entropy (scaled) plotted against the class proportion p from 0 to 1; both peak at p = 0.5]
-
Tree Pruning
Based on criteria such as the percentage of data in each node (say, each node must contain at least 5% of the observations).
Based on the level of the tree (say, 4 levels from the root node).
Based on impurity functions (such as the Gini Index and Entropy).
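The first two stopping rules combine into a single check; a minimal sketch using the illustrative thresholds above (5% minimum node size, 4 levels):

```python
def should_stop(node_size, total_size, depth,
                min_fraction=0.05, max_depth=4):
    """Stop growing the tree when a node is too small or too deep."""
    if node_size / total_size < min_fraction:   # node holds < 5% of the data
        return True
    if depth >= max_depth:                      # 4 levels from the root node
        return True
    return False

print(should_stop(node_size=40, total_size=1000, depth=2))   # True: only 4%
print(should_stop(node_size=100, total_size=1000, depth=2))  # False: keep going
```

An impurity-based rule would additionally require the best split's impurity reduction to exceed some threshold before splitting further.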