Decision Tree
Dr. Jieh-Shan George Yeh [email protected]
TRANSCRIPT
Decision Tree
• Recursive partitioning is a fundamental tool in data mining.
• It helps us explore the structure of a set of data, while developing easy to visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome.
• Decision tree is an algorithm that can have either continuous or categorical dependent (DV) and independent variables (IV).
Decision Tree
Advantages to using trees
Simple to understand and interpret.
People are able to understand decision tree models after a brief explanation.
Requires little data preparation.
Other techniques often require data normalization, creation of dummy variables, and removal of blank values.
Able to handle both numerical and categorical data.
Advantages to using trees
Uses a white box model.
If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic.
Possible to validate a model using statistical tests.
That makes it possible to account for the reliability of the model.
Performs well with large data in a short time.
Some things to consider when coding the model…
Splits. Gini or information.
Type of DV (method). Classification (class), regression (anova), count (poisson), survival (exp).
Minimum of observations for a split (minsplit).
Minimum of observations in a node (minbucket).
Cross validation (xval). Used more in model building than in exploration.
Complexity parameter (cp). This value is used for pruning. A smaller tree is perhaps less detailed, but may have less (cross-validated) error.
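As a sketch of how these options map onto a single rpart() call (the argument names are from the rpart package; the parameter values here are illustrative, not recommendations):

```r
library(rpart)

# Classification tree (method = "class") on iris, wiring up the
# options discussed above. Values are illustrative only.
fit <- rpart(Species ~ ., data = iris,
             method = "class",                 # class / anova / poisson / exp
             parms  = list(split = "gini"),    # or split = "information"
             control = rpart.control(
               minsplit  = 20,   # minimum observations to attempt a split
               minbucket = 7,    # minimum observations in a terminal node
               xval      = 10,   # 10-fold cross validation
               cp        = 0.01  # complexity parameter used for pruning
             ))
printcp(fit)  # cross-validated error (xerror) for each value of cp
```

printcp() shows the cp table, which is the basis for the pruning step demonstrated later with prune().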
R has many packages for similar/same endeavors
party.
rpart. Comes with R.
C50.
Cubist.
rpart.plot. Makes rpart plots much nicer.
Dataset iris
• The iris dataset has been used for classification in many research publications. It consists of 50 samples from each of three classes of iris flowers [Frank and Asuncion, 2010]. One class is linearly separable from the other two, while the latter are not linearly separable from each other.
• There are five attributes in the dataset:
– Sepal.Length in cm,
– Sepal.Width in cm,
– Petal.Length in cm,
– Petal.Width in cm, and
– Species: Iris Setosa, Iris Versicolour, and Iris Virginica.
• Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are used to predict the Species of flowers.
str(iris)
• head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
CTREE: CONDITIONAL INFERENCE TREE
http://cran.r-project.org/web/packages/party/party.pdf
Conditional Inference Trees
Description
Recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework.
Usage
ctree(formula, data, subset = NULL, weights = NULL,
      controls = ctree_control(), xtrafo = ptrafo,
      ytrafo = ptrafo, scores = NULL)
Arguments
formula   a symbolic description of the model to be fit. Note that symbols like : and - will not work and the tree will make use of all variables listed on the rhs of formula.
data      a data frame containing the variables in the model.
subset    an optional vector specifying a subset of observations to be used in the fitting process.
weights   an optional vector of weights to be used in the fitting process. Only non-negative integer valued weights are allowed.
controls  an object of class TreeControl, which can be obtained using ctree_control.
• Before modeling, the iris data is split into two subsets: training (70%) and test (30%).
• The random seed is set to a fixed value below to make the results reproducible.
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
trainData <- iris[ind==1,]
testData <- iris[ind==2,]
library(party)
# Species is the target variable and all other variables are independent variables.
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_ctree <- ctree(myFormula, data=trainData)
Prediction Table
# check the prediction
table(predict(iris_ctree), trainData$Species)
print(iris_ctree)

 Conditional inference tree with 4 terminal nodes

Response:  Species
Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations:  112

1) Petal.Length <= 1.9; criterion = 1, statistic = 104.643
  2)* weights = 40
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 48.939
    4) Petal.Length <= 4.4; criterion = 0.974, statistic = 7.397
      5)* weights = 21
    4) Petal.Length > 4.4
      6)* weights = 19
  3) Petal.Width > 1.7
    7)* weights = 32
plot(iris_ctree)
plot(iris_ctree, type="simple")
# predict on test data
testPred <- predict(iris_ctree, newdata = testData)
table(testPred, testData$Species)
Issues on ctree()
• The current version of ctree() does not handle missing values well, in that an instance with a missing value may sometimes go to the left sub-tree and sometimes to the right. This might be caused by surrogate rules.
• When a variable exists in the training data fed into ctree() but does not appear in the built decision tree, the test data must still contain that variable for prediction to work. Otherwise, a call to predict() would fail.
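One defensive check for this (a sketch using base R's all.vars(), and the myFormula/testData objects from the earlier slides) is to confirm the test set contains every variable named in the model formula before calling predict():

```r
# Variables the formula refers to; the first is the response (Species)
needed <- all.vars(myFormula)[-1]
missingVars <- setdiff(needed, names(testData))
if (length(missingVars) > 0)
  stop("test data lacks: ", paste(missingVars, collapse = ", "))
```

This catches the missing-column case with a clear message instead of an opaque failure inside predict().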
Issues on ctree()
• If the levels of a categorical variable in the test data differ from those in the training data, prediction on the test data would also fail.
• One way around the above issue is, after building a decision tree, to call ctree() again to build a new tree using only those variables that appear in the first tree, and to explicitly set the levels of categorical variables in the test data to the levels of the corresponding variables in the training data.
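The level-alignment step can be sketched as follows (using Species merely as an example of a categorical column shared by trainData and testData):

```r
# Re-declare the test-set factor with the training-set levels.
# Any test value not among the training levels becomes NA and
# should be inspected before prediction.
testData$Species <- factor(testData$Species,
                           levels = levels(trainData$Species))
```

After this, the factor coding in the test data matches what the tree saw during training.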
More info
# Edgar Anderson's Iris Data
help("iris")
# Conditional Inference Trees
help("ctree")
# Class "BinaryTree"
help("BinaryTree-class")
# Visualization of Binary Regression Trees
help("plot.BinaryTree")
RPART: RECURSIVE PARTITIONING AND REGRESSION TREES
http://cran.r-project.org/web/packages/rpart/rpart.pdf
Recursive partitioning for classification, regression and survival trees
data("bodyfat", package="TH.data")
dim(bodyfat)
set.seed(1234)
ind <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3))
bodyfat.train <- bodyfat[ind==1,]
bodyfat.test <- bodyfat[ind==2,]
# train a decision tree
library(rpart)
myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth
bodyfat_rpart <- rpart(myFormula, data = bodyfat.train,
                       control = rpart.control(minsplit = 10))
attributes(bodyfat_rpart)
print(bodyfat_rpart$cptable)
print(bodyfat_rpart)
plot(bodyfat_rpart)
text(bodyfat_rpart, use.n=T)
• Select the tree with the minimum prediction error
opt <- which.min(bodyfat_rpart$cptable[,"xerror"])
cp <- bodyfat_rpart$cptable[opt, "CP"]
bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
print(bodyfat_prune)
plot(bodyfat_prune)
text(bodyfat_prune, use.n=T)
• After that, the selected tree is used to make predictions, and the predicted values are compared with the actual values.
• Function abline() draws the diagonal line. The predictions of a good model are expected to be equal or very close to the actual values; that is, most points should lie on or close to the diagonal line.
DEXfat_pred <- predict(bodyfat_prune, newdata=bodyfat.test)
xlim <- range(bodyfat$DEXfat)
plot(DEXfat_pred ~ DEXfat, data=bodyfat.test, xlab="Observed",
     ylab="Predicted", ylim=xlim, xlim=xlim)
abline(a=0, b=1)
More info
# Recursive Partitioning and Regression Trees
help("rpart")
# Control for Rpart Fits
help("rpart.control")
# Prediction of Body Fat by Skinfold Thickness, Circumferences, and Bone Breadths
??TH.data::bodyfat
C5.0
http://cran.r-project.org/web/packages/C50/C50.pdf
C50
library(C50)
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_C5.0 <- C5.0(myFormula, data=trainData)
summary(iris_C5.0)
C5imp(iris_C5.0)
C5.0testPred <- predict(iris_C5.0, testData)
table(C5.0testPred, testData$Species)
predict(iris_C5.0, testData, type = "prob")
More info
# C5.0 Decision Trees and Rule-Based Models
help("C5.0")
# Control for C5.0 Models
help("C5.0Control")
# Summaries of C5.0 Models
help("summary.C5.0")
# Variable Importance Measures for C5.0 Models
help("C5imp")
PLOT RPART MODELS. AN ENHANCED VERSION OF PLOT.RPART
http://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf
rpart.plot
library(rpart.plot)
data(ptitanic)  # Titanic data
tree <- rpart(survived ~ ., data=ptitanic, cp=.02)
# cp=.02 because want small tree for demo
rpart.plot(tree, main="default rpart.plot\n(type = 0, extra = 0)")
prp(tree, main="type = 4, extra = 6", type=4, extra=6, faclen=0)
# faclen=0 to print full factor names
rpart.plot
rpart.plot(tree, main="extra = 106, under = TRUE", extra=106, under=TRUE, faclen=0)
# the old way for comparison
plot(tree, uniform=TRUE, compress=TRUE, branch=.2)
text(tree, use.n=TRUE, cex=.6, xpd=NA)  # cex is a guess, depends on your window size
title("rpart.plot for comparison", cex=.6)
rpart.plot(tree, box.col=3, xflip=FALSE)
More info
# Titanic data with passenger names and other details removed.
help("ptitanic")
# Plot an rpart model.
help("rpart.plot")
# Plot an rpart model. A superset of rpart.plot.
help("prp")