
Decision Trees

Content

Examples
What is a decision tree?
How to build a decision tree?
Stopping rule and tree pruning
Confusion matrix (binary)

Classification Example: Fisher’s Iris Data
3 species of iris flowers, 50 observations per species
4 predictor variables: petal length and width, sepal length and width
Objective: predict the species class based on the four measurements
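As a quick illustration of this example, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on the built-in iris data (scikit-learn is an assumption; it is not the software referenced in these slides):

```python
# Minimal sketch: fit a small classification tree to Fisher's iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()                      # 150 observations, 4 inputs, 3 species
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=iris.feature_names))   # the fitted splits
print("test accuracy:", tree.score(X_test, y_test))
```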

Classification Example: Stock Selection
Objective: predict whether a stock is underperformed or overperformed.
Underperformed: its monthly return is less than the median stock return for the month.
Overperformed: otherwise.

Classification Example: In-Patient Data
1,756,484 records of hospital in-patient statistics in NSW, Australia, in 1996–97
Aim: identify risk factors for an adverse event (AE)
An adverse event (AE) is an unintended injury or complication which results in disability, death or prolongation of hospital stay, and is caused by health care management rather than the patient’s disease. E.g. an accidental cut during surgery, an incorrect drug dosage.
Potential predictors: comorbidity (multiple diagnoses), procedures (multiple procedures), gender, insurance, psychiatric status, age, day-only, readmitted, etc.
3.4% of the records in the dataset are AE cases

Classification Example: In-Patient Data
Model performance: confusion matrix

                     Predicted AE    Predicted no AE
Actual AE                47,403             12,601
Actual no AE            542,874          1,153,606

Misclassification rate = (12,601 + 542,874) / 1,756,484 = 31.6%
Sensitivity = 47,403 / 60,004 = 79.0%
Specificity = 1,153,606 / 1,696,480 = 68.0%
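These figures follow directly from the confusion-matrix counts; a small plain-Python check:

```python
# Recompute the quoted in-patient performance measures from the counts above.
tp, fn = 47_403, 12_601          # actual AE cases: correctly / incorrectly predicted
fp, tn = 542_874, 1_153_606      # actual non-AE cases: incorrectly / correctly predicted
total = tp + fn + fp + tn        # 1,756,484 records

print("misclassification rate:", (fp + fn) / total)   # ~0.316
print("sensitivity:", tp / (tp + fn))                  # ~0.790
print("specificity:", tn / (tn + fp))                  # ~0.680
```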

Pen-Digits Data: Binary Tree (figure)
Tree with Multiway Splits (figure)
Boston Housing Data: Regression Tree (figure)
Multivariate Step Function (figure)

Content

Examples
What is a decision tree?
How to build a decision tree?
Stopping rule and tree pruning
Confusion matrix (binary)

What is a decision tree

Variations of Decision Trees
Classification tree: the target is discrete (binary, nominal); the leaves give the predicted class as well as the probability of class membership.
Regression tree: the target is continuous; the leaves give the predicted value of the target (a small sketch follows below).
Tree with binary splits
Tree with multiway splits
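The regression-tree sketch referred to above: a DecisionTreeRegressor (scikit-learn, an assumption) fitted to made-up sine data, showing that the leaves predict constant values, i.e. a step function.

```python
# Minimal sketch of a regression tree: leaves predict the node's mean target value.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 6, 200).reshape(-1, 1)   # one continuous input (toy data)
y = np.sin(X).ravel()                       # continuous target

reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(np.unique(reg.predict(X)).size, "distinct predicted values (at most one per leaf)")
```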

Illustrating Classification Task (figure)
Example of a Decision Tree (figure)
Decision Tree Classification Task (figure)
Apply Model to Test Data (sequence of figures)

Content

Examples
What is a decision tree?
How to build a decision tree?
Stopping rule and tree pruning
Confusion matrix (binary)

How to build a decision tree

Recursive partitioning: a top-down, greedy algorithm for fitting a decision tree to the data.
Top-down: starting at the root node, split the data into subgroups that are as homogeneous as possible with respect to the target.
Greedy: always make a locally optimal choice, in the hope that this will lead to a globally optimal solution.
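A toy, runnable sketch of the idea (binary splits on a single continuous input, Gini impurity, pure-node stopping); the function and variable names are my own, not from any particular package:

```python
# Greedy top-down recursive partitioning on a single continuous input.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(xs, ys, min_leaf=2):
    # Stopping rule: pure node or too few observations -> leaf = majority class.
    if len(set(ys)) == 1 or len(ys) <= min_leaf:
        return Counter(ys).most_common(1)[0][0]
    best = None
    for cut in sorted(set(xs))[1:]:                       # candidate cut points
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        # Greedy choice: maximise the impurity reduction of this single split.
        drop = gini(ys) - (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or drop > best[0]:
            best = (drop, cut)
    if best is None:                                      # no usable cut point
        return Counter(ys).most_common(1)[0][0]
    cut = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x < cut]
    right = [(x, y) for x, y in zip(xs, ys) if x >= cut]
    return {"cut": cut,
            "left": grow(*zip(*left), min_leaf),
            "right": grow(*zip(*right), min_leaf)}

print(grow([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))
```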

Root-Node Split, 1-Deep Space, Depth 2, 2-Deep Space (figures illustrating successive splits of the input space)

Three steps in tree construction

1) Selection of the best split: which input variable could give the ‘best’ split? ‘Best’ according to which splitting criterion?
2) Stop-splitting rule: when should the splitting stop?
3) Assignment of each leaf node to a class: predict the value of the target variable (discrete or continuous) at each leaf node.

No. of possible splits
Split on a nominal input with L distinct levels. No. of possible splits into B branches:
S(L, B) = B · S(L − 1, B) + S(L − 1, B − 1)

No. of possible splits
Split on an ordinal input with L distinct levels. No. of possible splits into B branches: C(L − 1, B − 1), i.e. choose the B − 1 cut points among the L − 1 gaps between adjacent levels.

No. of possible splits
Split on a continuous input: treat it as an ordinal input whose distinct observed values are its levels.

No. of possible splits (summary; a small counting sketch follows)
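The sketch below evaluates these counts; the nominal recursion is the one from the earlier slide, and the ordinal count is C(L − 1, B − 1):

```python
# Count candidate splits of an input with L levels into B branches.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def nominal_splits(L, B):
    # S(L, B) = B*S(L-1, B) + S(L-1, B-1), with S(L, 1) = S(L, L) = 1
    if B < 1 or B > L:
        return 0
    if B == 1 or B == L:
        return 1
    return B * nominal_splits(L - 1, B) + nominal_splits(L - 1, B - 1)

def ordinal_splits(L, B):
    return comb(L - 1, B - 1)     # choose B-1 cut points among L-1 gaps

print(nominal_splits(4, 2))       # 7 = 2^(4-1) - 1 binary splits (nominal, L = 4)
print(ordinal_splits(4, 2))       # 3 = L - 1 binary splits (ordinal, L = 4)
```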

Selection of the best splits

Exhaustively examining all possible splits is time consuming.

By default, software uses an exhaustive search if the no. of possible splits is less than 5,000.
Otherwise, a clustering of the levels of an input is used to limit the possible splits to consider.
An alternative is to consider binary splits only (B = 2): a nominal input has 2^(L−1) − 1 possible splits; an ordinal input has L − 1.

Splitting Criterion

After a set of candidate splits is determined, a splitting criterion is used to determine the best one.

Splitting criterion for discrete target

Two approaches for a discrete target.
Method 1: a statistical test of independence between the input and target variables
Chi-squared test
Likelihood ratio test
The best split is the one that is most significant (i.e. the one with the smallest p-value).

Statistical approach to splitting
Any split in a classification tree can be arranged as a contingency table.

Test of independence between the target (rows) and the input (columns):
Chi-squared test: X² = Σ (O − E)² / E
Likelihood ratio test: G² = 2 Σ O ln(O / E)
where O = observed frequency and E = expected frequency.
X² and G² ~ chi-squared distribution with d.f. = (r − 1)(B − 1), where r = no. of target levels and B = no. of branches.

Example revisited …
X² = 266.67 (d.f. = 1), G² = 345.22 (d.f. = 1)
A smaller p-value ⇒ a stronger association between input and target.
The split with the smallest p-value, or equivalently the largest logworth = −log10(p-value), will be chosen.
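A sketch of the p-value and logworth computation for the quoted statistic, using SciPy (SciPy is an assumption; the slides do not name a package):

```python
# Convert a chi-squared statistic into a p-value and a logworth.
import math
from scipy.stats import chi2

x2, df = 266.67, 1
p_value = chi2.sf(x2, df)            # upper-tail probability of the chi-squared dist.
logworth = -math.log10(p_value)
print(p_value, logworth)             # a tiny p-value gives a large logworth
```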

Pen-Digits Data: Chi-Squared Test (figure)

Splitting criterion for discrete target

Method 2: based on an impurity function of a node
Gini index: 1 − Σj pj²
Entropy: −Σj pj log2 pj, where log2(x) = ln(x) / ln(2)
Misclassification error: 1 − max pj
The best split is the one that gives the maximum reduction in impurity (ΔIP), e.g. ΔIP = 0.4 − (6/10)(0.33) − (4/10)(0) = 0.202.

Gini Index
The Gini index is a measure of diversity for discrete data.
Gini = 1 − 2(3/8)² − 2(1/8)² = 0.69
Gini = 1 − (6/7)² − (1/7)² = 0.24
Minimum G = 0 if one of the pj’s is 1. Maximum G = 1 − 1/k if p1 = … = pk = 1/k.

Entropy (figure)

Impurity function
Properties of an impurity function of a node: it is nonnegative, and it decreases as the node becomes more “pure”, i.e. as one class dominates.
For node 1 (class proportions 0.5, 0.5):
Gini = 1 − 0.5² − 0.5² = 0.5
Entropy = −0.5 log2(0.5) − 0.5 log2(0.5) = 1
Misclassification error = 1 − 0.5 = 0.5
For node 2 (class proportions 0.75, 0.25):
Gini = 1 − 0.75² − 0.25² = 0.375
Entropy = −0.75 log2(0.75) − 0.25 log2(0.25) = 0.811
Misclassification error = 1 − 0.75 = 0.25
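The node figures above can be checked with a few lines of Python:

```python
# Gini, entropy and misclassification error for a node with class proportions ps.
from math import log2

def gini(ps):      return 1 - sum(p * p for p in ps)
def entropy(ps):   return -sum(p * log2(p) for p in ps if p > 0)
def misclass(ps):  return 1 - max(ps)

for ps in [(0.5, 0.5), (0.75, 0.25)]:                 # node 1 and node 2
    print(ps, round(gini(ps), 3), round(entropy(ps), 3), round(misclass(ps), 3))
# (0.5, 0.5)   -> 0.5    1.0    0.5
# (0.75, 0.25) -> 0.375  0.811  0.25
```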

Remarks
The process of selecting the best split on a node:
1) select the best split on each input variable (i.e. choose the number of branches and the cut-off points);
2) select the best of these.
Comparing splits on the same input variable: Gini, entropy, and misclassification error favour splits into greater numbers of branches (large B), so they are not appropriate for evaluating multiway splits. The p-values of the chi-squared and likelihood ratio tests automatically adjust for this bias through the degrees of freedom.

Problem with Impurity Reduction
Impurity reduction tends to prefer splits that result in a large number of partitions, each small but pure.
For example, splitting on a customer ID has the highest information gain, because the entropy of every child node is zero.

Remarks
Comparing splits on different input variables: the p-values of the chi-squared and likelihood ratio tests tend to be smaller as the number of possible splits, m, increases.
Kass (1980) proposed Bonferroni adjustments of the p-values to account for this bias: logworth = −log10(m · p-value), where m is the number of possible splits of that input.
If all the splits have logworth < −log10(0.2), then do not split. Otherwise, the split with the largest logworth is selected as the best split.
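A sketch of this selection rule; the two candidate inputs and their (p-value, m) pairs below are hypothetical, purely for illustration:

```python
# Bonferroni-adjusted logworth and the stop/split decision.
import math

def logworth(p_value, m):
    return -math.log10(m * p_value)          # adjust p by the number of possible splits m

candidates = {                                # hypothetical (p-value, m) per input
    "input A": (1e-6, 7),
    "input B": (0.03, 1),
}
threshold = -math.log10(0.2)                  # do not split if nothing exceeds this
best = max(candidates, key=lambda k: logworth(*candidates[k]))
if logworth(*candidates[best]) < threshold:
    print("no split")
else:
    print("split on", best, "with logworth", round(logworth(*candidates[best]), 2))
```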

Which splitting criterion is the best? There is no single best choice: try each and compare the resulting models.

P-Value Adjustments in Chi-Square Test

Splitting criterion for continuous target
Two approaches for a continuous target:
Based on an impurity function of a node: sample variance
Based on a statistical test: one-way ANOVA F test

Boston Housing Data, NOX input (figure)

The F test is better than (sample) variance reduction because it has a p-value adjustment for different numbers of branches.
The F test is relatively robust to departures from the normality assumption.
However, the F test is sensitive to departures from the constant-variance assumption.
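A sketch of the F-test approach with SciPy's one-way ANOVA; the target values in the two branches below are made-up numbers:

```python
# One-way ANOVA F test comparing the target values in the branches of a candidate split.
from scipy.stats import f_oneway

branch_1 = [21.0, 23.5, 19.8, 22.1, 24.0]    # hypothetical target values in branch 1
branch_2 = [33.2, 35.0, 31.8, 36.4, 34.1]    # hypothetical target values in branch 2

f_stat, p_value = f_oneway(branch_1, branch_2)
print(f_stat, p_value)                        # smaller p-value -> better split
```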

Assignment of each leaf node to a class
For a classification tree: classify an observation in node t to the class j with the maximum posterior probability
p(j | t) = p(j) p(t | j) / Σk p(k) p(t | k)
where p(j) is the prior probability of class j and p(t | j) is the proportion of class j observations going to node t.
If p(j) = proportion of all observations belonging to class j, then p(j | t) = proportion of observations in node t belonging to class j.
For a regression tree: predict an observation in a node by the sample mean of the target values in the node.

Example
Prior probabilities: p(1) = p(7) = 364/1064, p(9) = 336/1064
Conditional probabilities: p(t|1) = 285/364, p(t|7) = 143/364, p(t|9) = 41/336
Show the following results for the posterior probabilities: p(1|t) = 285/469, p(7|t) = 143/469, p(9|t) = 41/469
Classify to class 1.
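The requested posteriors follow from Bayes' rule, p(j|t) = p(j) p(t|j) / Σk p(k) p(t|k); a quick check:

```python
# Posterior class probabilities in node t, from the priors and conditionals above.
priors = {1: 364/1064, 7: 364/1064, 9: 336/1064}   # p(j)
cond   = {1: 285/364, 7: 143/364, 9: 41/336}       # p(t | j)

joint = {j: priors[j] * cond[j] for j in priors}   # p(j) p(t | j)
total = sum(joint.values())
posterior = {j: joint[j] / total for j in joint}   # p(j | t)
print(posterior)                                   # {1: 285/469, 7: 143/469, 9: 41/469}
print("classify to class", max(posterior, key=posterior.get))
```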

Content

Examples
What is a decision tree?
How to build a decision tree?
Stopping rule and tree pruning
Confusion matrix (binary)

Stop splitting rule
A simple method: continue splitting until every node is pure or contains only one observation. This fits the training data perfectly but may predict poorly on new data.
Two approaches:
Top-down stopping rules (pre-pruning)
Bottom-up assessment criteria (post-pruning)
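Both approaches have direct counterparts in scikit-learn (an assumption, not the slides' software): stopping parameters such as max_depth and min_samples_leaf for pre-pruning, and cost-complexity pruning (ccp_alpha) for post-pruning. A minimal sketch:

```python
# Pre-pruning via stopping rules vs post-pruning via cost-complexity pruning.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Top-down stopping rules (pre-pruning)
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Bottom-up pruning (post-pruning): grow a full tree, then prune back with ccp_alpha
# (in practice ccp_alpha would be chosen by validation, not hard-coded like this).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
post = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0).fit(X, y)

print("pre-pruned depth:", pre.get_depth(), " post-pruned depth:", post.get_depth())
```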

Advantages of Trees
Easy to interpret: tree-structured presentation
Allow mixed input data types: nominal, ordinal, interval
Allow a discrete (binary or nominal) or continuous target (an ordinal target is not allowed)
Robust to outliers in the inputs
No problem with missing values
Automatically:
detects interactions (AID)
accommodates nonlinearity
selects input variables

Disadvantages of Trees
Most algorithms use univariate splits. Solution: linear combination splits (a1·x1 + a2·x2 < c?)
Unstable fitted tree: often a small change in the data results in a very different series of splits. Solution: bagging
Lack of smoothness (step function) in regression trees: splitting turns continuous input variables into discrete variables. Solution: tree-based regression
Splitting uses a “greedy” algorithm: while each split is optimal, the overall tree is not.

Content

Examples
What is a decision tree?
How to build a decision tree?
Stopping rule and tree pruning
Confusion matrix (binary)

Confusion Matrix

Misclassification rate = (false positives + false negatives) / (total cases)

Accuracy (or correct classification rate) = (true positives + true negatives) / (total cases)

Captured Response Curve or Target Concentration Curve
The proportion of all responders in the full sample that is captured in the top 10% (20%, …) of people as ranked by the model.
The goal is to locate all positive targets (all responders).

Response rate

Response Rate = true positives / total predicted positives

Gains Chart or Response Chart
The proportion of responders in the top 10% (20%, …) of people as ranked by the model.
Lift chart: lift = response rate / baseline response rate
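A sketch of how these quantities are computed from model scores; the scores and responses below are randomly generated toy data, not the slides' example:

```python
# Captured response, gains (response rate) and lift for the top fractions by model score.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(1000)                          # predicted probability of responding
actual = (rng.random(1000) < scores).astype(int)   # toy "true" responses

order = np.argsort(-scores)                        # rank people by score, best first
baseline = actual.mean()                           # overall response rate
for pct in (0.1, 0.2, 0.3):
    top = actual[order[: int(pct * len(actual))]]
    captured = top.sum() / actual.sum()            # captured response curve
    gains = top.mean()                             # gains / response chart
    print(f"top {pct:.0%}: captured={captured:.2f}, response rate={gains:.2f}, lift={gains / baseline:.2f}")
```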

Predictive Power