Decision Trees: CHAID and CART
TRANSCRIPT
-
Classification Problems
Introduction to classification.
Decision Tree approach for classification.
Chi-Square Automatic Interaction Detection (CHAID)
Classification and Regression Tree (CART)
-
CHAID Chi-square Automatic Interaction Detection
-
Introduction to CHAID
CHAID is a decision tree algorithm used in classification problems.
CHAID uses the chi-square test of independence for splitting. CHAID was first presented in the article "An Exploratory Technique for Investigating Large Quantities of Categorical Data" by G. V. Kass in Applied Statistics (1980).
-
CHAID
CHAID partitions the data into mutually exclusive, exhaustive subsets that best describe the dependent categorical variable.
CHAID is an iterative procedure that examines the predictors (or classification variables) and uses them in the order of their statistical significance.
-
Chi-Square test of independence
Chi-square test of independence starts with an assumption that there is no relationship between two variables.
For example, we assume that there is no relationship between checking account balance and default.
-
Chi-Square test of Independence German Credit Case
H0: There is no relationship between checking account balance and default.
HA: There is a relationship between checking account balance and default.
-
Chi-Square test of Independence German Credit Case
H0: Checking account balance and default are independent.
HA: Checking account balance and default are dependent.
-
Contingency Table
Checking account balance | Default = 1 | Default = 0 | Total
0 DM                     | 135         | 139         | 274
Other than 0 DM          | 165         | 561         | 726
Total                    | 300         | 700         | 1000
-
Chi-Square test of Independence: Test Statistic
Expected frequency: E = (row sum × column sum) / total
Test statistic: χ² = Σ (O − E)² / E
-
Chi-Square test of Independence

Cell                   | Observed (O) | Expected (E) | (O − E)²/E
0 DM, Default = 1      | 135          | 82.2         | 33.91
0 DM, Default = 0      | 139          | 191.8        | 14.53
Not 0 DM, Default = 1  | 165          | 217.8        | 12.80
Not 0 DM, Default = 0  | 561          | 508.2        | 5.48
Total                  | 1000         | 1000         | 66.73

P-value = 3.10E-16. Since the p-value is less than 0.05, we reject the null hypothesis.
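The table above can be reproduced in a few lines; a minimal sketch in Python (standard library only), using the fact that for one degree of freedom the chi-square upper-tail p-value equals erfc(√(χ²/2)):

```python
import math

# German Credit contingency table from the slides:
# rows = checking account balance (0 DM / other), columns = default (1 / 0).
observed = [[135, 139],   # 0 DM
            [165, 561]]   # Other than 0 DM

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

# Expected frequency under independence: E = row sum * column sum / total.
expected = [[r * c / total for c in col_sums] for r in row_sums]

# Test statistic: chi-square = sum over cells of (O - E)^2 / E.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# A 2x2 table has (2-1)(2-1) = 1 degree of freedom, and the upper-tail
# p-value of a chi-square(1) variable is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), p_value)  # 66.74 (66.73 on the slide, which sums
                                # the rounded cells) and p ~ 3.1e-16
```

The same result is available from `scipy.stats.chi2_contingency` if SciPy is installed.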
-
CHAID Procedure Step 1: Examine each predictor variable for its statistical
significance with the dependent variable.
Step 2: Determine the most significant among the predictors (predictor with smallest p value).
Step 3: Divide the data by levels of the most significant predictor (using chi-square test of independence). Each of these groups will be examined individually further.
Step 4: For each sub-group, determine the most significant variable from the remaining predictors and divide the data again.
Step 5: Repeat step 4 until all statistically significant predictors have been identified.
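Steps 1 and 2 above — testing every predictor against the target and picking the one with the smallest p-value — can be sketched as follows. This is a simplified illustration assuming binary predictors (so each 2x2 test has one degree of freedom); the variable names and data are hypothetical:

```python
import math
from collections import Counter

def chi2_p_value(pairs):
    """Chi-square test of independence for two binary variables.

    `pairs` is a list of (predictor_value, target_value) tuples.
    Returns the p-value (df = 1 for a 2x2 table)."""
    counts = Counter(pairs)
    xs = sorted({x for x, _ in pairs})
    ys = sorted({y for _, y in pairs})
    total = len(pairs)
    chi2 = 0.0
    for x in xs:
        for y in ys:
            o = counts[(x, y)]
            e = (sum(counts[(x, yy)] for yy in ys) *
                 sum(counts[(xx, y)] for xx in xs)) / total
            chi2 += (o - e) ** 2 / e
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical data: two binary predictors and a binary default flag.
data = [
    {"low_balance": 1, "young": 1, "default": 1},
    {"low_balance": 1, "young": 0, "default": 1},
    {"low_balance": 1, "young": 1, "default": 1},
    {"low_balance": 0, "young": 0, "default": 0},
    {"low_balance": 0, "young": 1, "default": 0},
    {"low_balance": 0, "young": 0, "default": 1},
    {"low_balance": 1, "young": 0, "default": 0},
    {"low_balance": 0, "young": 1, "default": 0},
]

# Step 1: p-value of each predictor against the target.
p_values = {
    name: chi2_p_value([(row[name], row["default"]) for row in data])
    for name in ("low_balance", "young")
}
# Step 2: the most significant predictor has the smallest p-value.
best = min(p_values, key=p_values.get)
print(best)  # low_balance
```

Step 3 would then split the data by the levels of `best` and recurse on each sub-group with the remaining predictors.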
-
CHAID
CHAID uses both splitting and merging steps.
In merging, the least significantly different groups are merged to form one class.
In splitting, the values of a predictor that result in the most significantly different classes are used.
The split selection is based on the chi-square test of independence between the grouped predictor and the dependent variable.
-
Chi-Square test of independence
H0: Two groups are independent with respect to a dependent variable.
HA: Two groups are not independent with respect to a dependent variable.
We accept the null hypothesis when the p-value is greater than 0.05.
Degrees of freedom = (r − 1)(c − 1)
Test statistic: χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ
-
CHAID Input
Significance level for partitioning a variable.
Significance level for merging.
Minimum number of records for the cells.
-
CHAID Example: Breaking Barriers
-
        | OBSERVED       |       | EXPECTED
LTV     | 0     | 1      | Total | 0        | 1
80      | 77    | 83     | 160   | 32.10    | 127.90
Total   | 315   | 1255   | 1570  |          |

CHI STATISTIC = 10105.29
P-VALUE ≈ 0
The chi-square test confirms the relationship.
-
/* Node 1 */.
DO IF (VALUE(LTV) EQ 1).
COMPUTE nod_001 = 1.
COMPUTE pre_001 = 1.
COMPUTE prb_001 = 0.8686.
END IF.
EXECUTE.
/* Node 2 */.
DO IF (SYSMIS(@0DM) OR VALUE(@0DM) NE 1).
COMPUTE nod_001 = 2.
COMPUTE pre_001 = 1.
COMPUTE prb_001 = 0.78431.
END IF.
EXECUTE.
BUSINESS RULES
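The SPSS business rules above amount to a plain decision rule that can be applied outside SPSS; a minimal Python sketch (field names follow the SPSS syntax, `None` stands in for a system-missing value, and the two rules are treated as mutually exclusive for simplicity):

```python
def score(ltv_flag, zero_dm_flag):
    """Assign a CHAID terminal node, predicted class, and probability.

    Mirrors the two business rules above: `ltv_flag` is the LTV
    indicator and `zero_dm_flag` the 0 DM checking-account indicator
    (None = missing, as with SPSS SYSMIS)."""
    if ltv_flag == 1:
        return {"node": 1, "predicted": 1, "probability": 0.8686}
    if zero_dm_flag is None or zero_dm_flag != 1:
        return {"node": 2, "predicted": 1, "probability": 0.78431}
    return None  # no rule fires for the remaining combinations

print(score(1, 0))     # lands in node 1
print(score(0, None))  # lands in node 2
```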
-
CHAID optimal cut
For a given variable (say LTV), find a cut-off value x such that the chi-square statistic is maximized:

statistic = Max over x ∈ LTV of  Σᵢ₌₁ⁿ Σⱼ₌₁ᵐ [Oᵢⱼ(x) − Eᵢⱼ(x)]² / Eᵢⱼ(x)

Check whether the maximum chi-square is significant.
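The search above can be done by brute force over candidate cut-offs; a minimal sketch assuming a numeric LTV column and a binary default flag (the sample data are hypothetical):

```python
def chi2_statistic(table):
    """Chi-square statistic for a 2x2 contingency table [[a, b], [c, d]]."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    return sum((table[i][j] - row[i] * col[j] / total) ** 2
               / (row[i] * col[j] / total)
               for i in range(2) for j in range(2))

def best_cut(ltv, default):
    """Find the cut-off x on LTV maximizing the chi-square statistic
    of the split (LTV <= x) vs (LTV > x) against the default flag."""
    best_x, best_stat = None, -1.0
    for x in sorted(set(ltv))[:-1]:   # cutting at the max leaves no split
        table = [[0, 0], [0, 0]]
        for v, d in zip(ltv, default):
            table[0 if v <= x else 1][d] += 1
        stat = chi2_statistic(table)
        if stat > best_stat:
            best_x, best_stat = x, stat
    return best_x, best_stat

# Hypothetical sample: low LTV rarely defaults, high LTV mostly does.
ltv     = [40, 45, 50, 55, 60, 80, 85, 90, 95, 100]
default = [ 0,  0,  0,  0,  1,  1,  1,  1,  1,   1]
print(best_cut(ltv, default))  # cut at 55 separates the classes perfectly
```

The maximum statistic would then be checked for significance before the split is accepted.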
-
Classification and Regression Trees (CART)
For regression trees, splits are chosen to minimize the SSE between the observations and the mean value of the node.
CART is a binary tree, whereas CHAID can split the initial node into more than 2 branches.
For classification trees, CART uses the Gini Index to minimize the classification error.
-
Gini Index (Classification Impurity)
The Gini Index is used to measure the impurity at a node (in a classification problem) and is given by:

Gini(k) = Σⱼ₌₁ᴷ P(j|k) · (1 − P(j|k))

where P(j|k) is the proportion of category j in node k.
A smaller Gini Index implies less impurity.
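The formula above takes only a few lines; a minimal sketch where `labels` is any sequence of class labels:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: sum over classes of p * (1 - p)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini([1, 1, 1, 1]))                 # 0.0 -> pure node, no impurity
print(gini([0, 0, 1, 1]))                 # 0.5 -> maximum for two classes
print(round(gini([0]*700 + [1]*300), 2))  # 0.42 -> the German Credit root
```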
-
Classification Tree Logic
Node t splits into a left node t_L and a right node t_R.
N   = number of observations in node t
N_L = number of observations in the left node
N_R = number of observations in the right node, with N = N_L + N_R
i(.) = impurity at node (.)

Reduction in impurity = Max [ N·i(t) − N_L·i(t_L) − N_R·i(t_R) ]
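The impurity-reduction criterion can be evaluated directly; a minimal sketch using the Gini impurity (the node contents below are hypothetical):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: sum over classes of p * (1 - p)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def impurity_reduction(parent, left, right):
    """N*i(t) - N_L*i(t_L) - N_R*i(t_R): what a candidate split gains."""
    return (len(parent) * gini(parent)
            - len(left) * gini(left)
            - len(right) * gini(right))

# Hypothetical node of 10 observations split by some candidate rule.
parent = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
left, right = parent[:4], parent[4:]   # left: all 0s; right: one 0, five 1s

print(round(impurity_reduction(parent, left, right), 3))  # 3.333
```

CART evaluates this quantity for every candidate split and keeps the one with the largest reduction.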
-
SHUBHAM Classification and Regression Tree
-
SHUBHAM Classification and Regression Tree
Reduction in impurity = i(0) − P₁·i(1) − P₂·i(2), evaluated with each node's Gini impurity P·(1 − P) summed over its classes (root node: 1570 observations).
-
Entropy
Entropy is a measure of impurity and is given by:
Entropy(k) = − Σ꜀₌₁ᶜ P(c|k) · log(P(c|k))

where P(c|k) is the proportion of category c in node k.
-
Entropy
[Figure: entropy (scaled) plotted against the class proportion p from 0 to 1]
-
[Figure: Gini Index and Entropy (scaled) plotted against the class proportion p from 0 to 1; both peak at p = 0.5]
-
Tree Pruning
Based on criteria such as the percentage of data in each node (say, each node must contain at least 5% of the observations).
Based on the level of the tree (say, 4 levels from the root node).
Based on impurity functions (such as the Gini Index and Entropy).
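The first two stopping rules combine into a single check; a minimal sketch using the illustrative thresholds above (5% minimum node size, 4 levels):

```python
def should_stop(node_size, total_size, depth,
                min_fraction=0.05, max_depth=4):
    """Stop growing the tree when a node is too small or too deep."""
    if node_size / total_size < min_fraction:   # node holds < 5% of the data
        return True
    if depth >= max_depth:                      # 4 levels from the root node
        return True
    return False

print(should_stop(node_size=40, total_size=1000, depth=2))   # True: only 4%
print(should_stop(node_size=100, total_size=1000, depth=2))  # False: keep going
```

An impurity-based rule would additionally require the best split's impurity reduction to exceed some threshold before splitting further.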