Decision Tree
Dr. Jieh-Shan George Yeh [email protected]
TRANSCRIPT
Decision Tree
• Recursive partitioning is a fundamental tool in data mining.
• It helps us explore the structure of a set of data, while developing easy to visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome.
• Decision tree is an algorithm that can have either continuous or categorical dependent (DV) and independent variables (IV).
Decision Tree
Advantages to using trees
Simple to understand and interpret.
People are able to understand decision tree models after a brief explanation.
Requires little data preparation.
Other techniques often require data normalization, creation of dummy variables, and removal of blank values.
Able to handle both numerical and categorical data.
Advantages to using trees
Uses a white box model.
If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic.
Possible to validate a model using statistical tests.
That makes it possible to account for the reliability of the model.
Performs well with large data in a short time.
Some things to consider when coding the model…
Splits. Gini or information.
Type of DV (method). Classification (class), regression (anova), count (poisson), survival (exp).
Minimum of observations for a split (minsplit).
Minimum of observations in a node (minbucket).
Cross validation (xval). Used more in model building than in exploration.
Complexity parameter (cp). This value is used for pruning. A smaller tree is perhaps less detailed, but may have less (cross-validated) error.
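As a sketch of how these options map onto a single rpart() call (the argument names are from the rpart package; the parameter values here are illustrative, not recommendations):

```r
library(rpart)

# Classification tree (method = "class") on iris, wiring up the
# options discussed above. Values are illustrative only.
fit <- rpart(Species ~ ., data = iris,
             method = "class",                 # class / anova / poisson / exp
             parms  = list(split = "gini"),    # or split = "information"
             control = rpart.control(
               minsplit  = 20,   # minimum observations to attempt a split
               minbucket = 7,    # minimum observations in a terminal node
               xval      = 10,   # 10-fold cross validation
               cp        = 0.01  # complexity parameter used for pruning
             ))
printcp(fit)  # cross-validated error (xerror) for each value of cp
```

printcp() shows the cp table, which is the basis for the pruning step demonstrated later with prune().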
R has many packages for similar/same endeavors
party.
rpart. Comes with R.
C50.
Cubist.
rpart.plot. Makes rpart plots much nicer.
Dataset iris
• The iris dataset has been used for classification in many research publications. It consists of 50 samples from each of three classes of iris flowers [Frank and Asuncion, 2010]. One class is linearly separable from the other two, while the latter are not linearly separable from each other.
• There are five attributes in the dataset:
– Sepal.Length in cm,
– Sepal.Width in cm,
– Petal.Length in cm,
– Petal.Width in cm, and
– Species: Iris Setosa, Iris Versicolour, and Iris Virginica.
• Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are used to predict the Species of flowers.
str(iris)
• head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
CTREE: CONDITIONAL INFERENCE TREE
http://cran.r-project.org/web/packages/party/party.pdf
Conditional Inference Trees
Description
Recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework.
Usage
ctree(formula, data, subset = NULL, weights = NULL,
      controls = ctree_control(), xtrafo = ptrafo,
      ytrafo = ptrafo, scores = NULL)
Arguments
formula   a symbolic description of the model to be fit. Note that symbols like : and - will not work and the tree will make use of all variables listed on the rhs of formula.
data      a data frame containing the variables in the model.
subset    an optional vector specifying a subset of observations to be used in the fitting process.
weights   an optional vector of weights to be used in the fitting process. Only non-negative integer valued weights are allowed.
controls  an object of class TreeControl, which can be obtained using ctree_control.
• Before modeling, the iris data is split into two subsets: training (70%) and test (30%).
• The random seed is set to a fixed value below to make the results reproducible.
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
trainData <- iris[ind==1,]
testData <- iris[ind==2,]
library(party)
# Species is the target variable and all other variables are independent variables.
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_ctree <- ctree(myFormula, data=trainData)
Prediction Table
# check the prediction
table(predict(iris_ctree), trainData$Species)
print(iris_ctree)

 Conditional inference tree with 4 terminal nodes

Response:  Species
Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations:  112

1) Petal.Length <= 1.9; criterion = 1, statistic = 104.643
  2)* weights = 40
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 48.939
    4) Petal.Length <= 4.4; criterion = 0.974, statistic = 7.397
      5)* weights = 21
    4) Petal.Length > 4.4
      6)* weights = 19
  3) Petal.Width > 1.7
    7)* weights = 32
plot(iris_ctree)
plot(iris_ctree, type="simple")
# predict on test data
testPred <- predict(iris_ctree, newdata = testData)
table(testPred, testData$Species)
Issues on ctree()
• The current version of ctree() does not handle missing values well, in that an instance with a missing value may sometimes go to the left sub-tree and sometimes to the right. This might be caused by surrogate rules.
• When a variable exists in the training data fed into ctree() but does not appear in the built decision tree, the test data must still contain that variable for prediction to work. Otherwise, a call to predict() would fail.
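One defensive check for this (a sketch using base R's all.vars(), and the myFormula/testData objects from the earlier slides) is to confirm the test set contains every variable named in the model formula before calling predict():

```r
# Variables the formula refers to; the first is the response (Species)
needed <- all.vars(myFormula)[-1]
missingVars <- setdiff(needed, names(testData))
if (length(missingVars) > 0)
  stop("test data lacks: ", paste(missingVars, collapse = ", "))
```

This catches the missing-column case with a clear message instead of an opaque failure inside predict().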
Issues on ctree()
• If the levels of a categorical variable in the test data differ from those in the training data, prediction on the test data would also fail.
• One way around the above issue is, after building a decision tree, to call ctree() again to build a new tree using only those variables that appear in the first tree, and to explicitly set the levels of categorical variables in the test data to the levels of the corresponding variables in the training data.
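The level-alignment step can be sketched as follows (using Species merely as an example of a categorical column shared by trainData and testData):

```r
# Re-declare the test-set factor with the training-set levels.
# Any test value not among the training levels becomes NA and
# should be inspected before prediction.
testData$Species <- factor(testData$Species,
                           levels = levels(trainData$Species))
```

After this, the factor coding in the test data matches what the tree saw during training.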
More info
# Edgar Anderson's Iris Data
help("iris")
# Conditional Inference Trees
help("ctree")
# Class "BinaryTree"
help("BinaryTree-class")
# Visualization of Binary Regression Trees
help("plot.BinaryTree")
RPART: RECURSIVE PARTITIONING AND REGRESSION TREES
http://cran.r-project.org/web/packages/rpart/rpart.pdf
Recursive partitioning for classification, regression and survival trees
data("bodyfat", package="TH.data")
dim(bodyfat)
set.seed(1234)
ind <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3))
bodyfat.train <- bodyfat[ind==1,]
bodyfat.test <- bodyfat[ind==2,]
# train a decision tree
library(rpart)
myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth
bodyfat_rpart <- rpart(myFormula, data = bodyfat.train,
                       control = rpart.control(minsplit = 10))
attributes(bodyfat_rpart)
print(bodyfat_rpart$cptable)
print(bodyfat_rpart)
plot(bodyfat_rpart)
text(bodyfat_rpart, use.n=T)
• Select the tree with the minimum prediction error
opt <- which.min(bodyfat_rpart$cptable[,"xerror"])
cp <- bodyfat_rpart$cptable[opt, "CP"]
bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
print(bodyfat_prune)
plot(bodyfat_prune)
text(bodyfat_prune, use.n=T)
• After that, the selected tree is used to make predictions, and the predicted values are compared with the actual values.
• Function abline() draws the diagonal line. The predictions of a good model are expected to be equal or very close to the actual values; that is, most points should lie on or close to the diagonal line.
DEXfat_pred <- predict(bodyfat_prune, newdata=bodyfat.test)
xlim <- range(bodyfat$DEXfat)
plot(DEXfat_pred ~ DEXfat, data=bodyfat.test, xlab="Observed",
     ylab="Predicted", ylim=xlim, xlim=xlim)
abline(a=0, b=1)
More info
# Recursive Partitioning and Regression Trees
help("rpart")
# Control for Rpart Fits
help("rpart.control")
# Prediction of Body Fat by Skinfold Thickness, Circumferences, and Bone Breadths
??TH.data::bodyfat
C5.0
http://cran.r-project.org/web/packages/C50/C50.pdf
C50
library(C50)
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_C5.0 <- C5.0(myFormula, data=trainData)
summary(iris_C5.0)
C5imp(iris_C5.0)
C5.0testPred <- predict(iris_C5.0, testData)
table(C5.0testPred, testData$Species)
predict(iris_C5.0, testData, type = "prob")
More info
# C5.0 Decision Trees and Rule-Based Models
help("C5.0")
# Control for C5.0 Models
help("C5.0Control")
# Summaries of C5.0 Models
help("summary.C5.0")
# Variable Importance Measures for C5.0 Models
help("C5imp")
PLOT RPART MODELS. AN ENHANCED VERSION OF PLOT.RPART
http://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf
rpart.plot
library(rpart.plot)
data(ptitanic)  # Titanic data
tree <- rpart(survived ~ ., data=ptitanic, cp=.02)
# cp=.02 because want small tree for demo
rpart.plot(tree, main="default rpart.plot\n(type = 0, extra = 0)")
prp(tree, main="type = 4, extra = 6", type=4, extra=6, faclen=0)
# faclen=0 to print full factor names
rpart.plot
rpart.plot(tree, main="extra = 106, under = TRUE", extra=106, under=TRUE, faclen=0)
# the old way for comparison
plot(tree, uniform=TRUE, compress=TRUE, branch=.2)
text(tree, use.n=TRUE, cex=.6, xpd=NA)  # cex is a guess, depends on your window size
title("rpart.plot for comparison", cex=.6)
rpart.plot(tree, box.col=3, xflip=FALSE)
More info
# Titanic data with passenger names and other details removed.
help("ptitanic")
# Plot an rpart model.
help("rpart.plot")
# Plot an rpart model. A superset of rpart.plot.
help("prp")