
Page 1:

Decision Tree

Dr. Jieh-Shan George YEH (jsyeh@pu.edu.tw)

Page 2:

Decision Tree

• Recursive partitioning is a fundamental tool in data mining.

• It helps us explore the structure of a set of data while developing easy-to-visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome.

• A decision tree algorithm can have either a continuous or a categorical dependent variable (DV), as well as continuous or categorical independent variables (IV), as sketched below.
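As a minimal sketch of that flexibility (using the rpart package, which these slides cover later, and the built-in iris data):

library(rpart)

# categorical DV (Species) -> classification tree
class_tree <- rpart(Species ~ ., data = iris, method = "class")

# continuous DV (Sepal.Length) -> regression tree
reg_tree <- rpart(Sepal.Length ~ ., data = iris, method = "anova")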

Page 3:

Decision Tree

Page 4:

Advantages to using trees

• Simple to understand and interpret. People are able to understand decision tree models after a brief explanation.

• Requires little data preparation. Other techniques often require data normalization, creation of dummy variables, and removal of blank values.

• Able to handle both numerical and categorical data.

Page 5:

Advantages to using trees

• Uses a white-box model. If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic.

• Possible to validate a model using statistical tests, which makes it possible to account for the reliability of the model.

• Performs well with large data in a short time.

Page 6:

Some things to consider when coding the model…

• Splits: Gini or information.

• Type of DV (method): classification (class), regression (anova), count (poisson), survival (exp).

• Minimum number of observations for a split (minsplit).

• Minimum number of observations in a node (minbucket).

• Cross-validation (xval): used more in model building than in exploration.

• Complexity parameter (cp): this value is used for pruning. A smaller tree is perhaps less detailed, but with less error. These options map onto rpart() arguments, as sketched below.
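A minimal sketch, with illustrative (not prescriptive) parameter values on the built-in iris data, of how these options map onto rpart():

library(rpart)
fit <- rpart(Species ~ ., data = iris,
             method = "class",                       # class / anova / poisson / exp
             parms = list(split = "information"),    # splitting criterion: "gini" or "information"
             control = rpart.control(minsplit = 10,  # minimum observations needed to attempt a split
                                     minbucket = 5,  # minimum observations in a terminal node
                                     xval = 10,      # number of cross-validations
                                     cp = 0.01))     # complexity parameter used for pruning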

Page 7:

R has many packages for similar/same endeavors (an install sketch follows the list):

• party

• rpart. Comes with R.

• C50

• Cubist

• rpart.plot. Makes rpart plots much nicer.
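Only rpart ships with base R; a one-line sketch (assuming the current CRAN package names) to install the others:

install.packages(c("party", "C50", "Cubist", "rpart.plot"))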

Page 8:

Dataset iris

• The iris dataset has been used for classification in many research publications. It consists of 50 samples from each of three classes of iris flowers [Frank and Asuncion, 2010]. One class is linearly separable from the other two, while the latter are not linearly separable from each other.

• There are five attributes in the dataset:
  – Sepal.Length in cm,
  – Sepal.Width in cm,
  – Petal.Length in cm,
  – Petal.Width in cm, and
  – Species: Iris Setosa, Iris Versicolour, and Iris Virginica.

• Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are used to predict the Species of flowers.

str(iris)
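For reference, the structure reported for the built-in dataset (output abbreviated; the leading values match the head(iris) rows on the next slide):

# 'data.frame':   150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 ...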

Page 9:

• head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Page 10:

CTREE: CONDITIONAL INFERENCE TREE

http://cran.r-project.org/web/packages/party/party.pdf

Page 11:

Conditional Inference Trees

Description
Recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework.

Usage
ctree(formula, data, subset = NULL, weights = NULL, controls = ctree_control(), xtrafo = ptrafo, ytrafo = ptrafo, scores = NULL)

Arguments
formula    a symbolic description of the model to be fit. Note that symbols like : and - will not work and the tree will make use of all variables listed on the rhs of formula.
data       a data frame containing the variables in the model.
subset     an optional vector specifying a subset of observations to be used in the fitting process.
weights    an optional vector of weights to be used in the fitting process. Only non-negative integer valued weights are allowed.
controls   an object of class TreeControl, which can be obtained using ctree_control.

Page 12:

• Before modeling, the iris data is split below into two subsets: training (70%) and test (30%).

• The random seed is set to a fixed value below to make the results reproducible.

set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
trainData <- iris[ind==1,]
testData <- iris[ind==2,]

Page 13:

library(party)
# Species is the target variable and all other variables are independent variables.
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_ctree <- ctree(myFormula, data=trainData)

Page 14:

Prediction Table

# check the prediction
table(predict(iris_ctree), trainData$Species)
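A small follow-up sketch (not on the original slide) that turns the same comparison into a training-set accuracy:

# proportion of training observations predicted correctly
mean(predict(iris_ctree) == trainData$Species)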

Page 15:

print(iris_ctree)

	 Conditional inference tree with 4 terminal nodes

Response:  Species
Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations:  112

1) Petal.Length <= 1.9; criterion = 1, statistic = 104.643
  2)* weights = 40
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 48.939
    4) Petal.Length <= 4.4; criterion = 0.974, statistic = 7.397
      5)* weights = 21
    4) Petal.Length > 4.4
      6)* weights = 19
  3) Petal.Width > 1.7
    7)* weights = 32

Page 16:

plot(iris_ctree)

Page 17:

plot(iris_ctree, type="simple")

Page 18:

# predict on test data
testPred <- predict(iris_ctree, newdata = testData)
table(testPred, testData$Species)

Page 19:

Issues on ctree()

• The current version of ctree() does not handle missing values well: an instance with a missing value may sometimes go to the left sub-tree and sometimes to the right, which might be caused by surrogate rules.

• When a variable exists in the training data and is fed into ctree() but does not appear in the built decision tree, the test data must still contain that variable in order to make predictions; otherwise, a call to predict() will fail.

Page 20:

Issues on ctree()

• If the levels of a categorical variable in the test data differ from those in the training data, prediction on the test data will also fail.

• One way to get around this issue is, after building a decision tree, to call ctree() again to build a new tree with data containing only the variables that appear in the first tree, and to explicitly set the levels of categorical variables in the test data to the levels of the corresponding variables in the training data, as sketched below.
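A hedged sketch of the level-alignment part of that workaround, using a hypothetical categorical column named Group that exists in both data frames:

# make the factor levels in the test data match those seen during training
testData$Group <- factor(testData$Group, levels = levels(trainData$Group))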

Page 21:

More info

# Edgar Anderson's Iris Data
help("iris")
# Conditional Inference Trees
help("ctree")
# Class "BinaryTree"
help("BinaryTree-class")
# Visualization of Binary Regression Trees
help("plot.BinaryTree")

Page 22:

RPART: RECURSIVE PARTITIONING AND REGRESSION TREES

http://cran.r-project.org/web/packages/rpart/rpart.pdf

Page 23:

Recursive partitioning for classification, regression and survival trees

data("bodyfat", package="TH.data")dim(bodyfat)set.seed(1234)ind <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3))bodyfat.train <- bodyfat[ind==1,]bodyfat.test <- bodyfat[ind==2,]

# train a decision treelibrary(rpart)myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadthbodyfat_rpart <- rpart(myFormula, data = bodyfat.train, control = rpart.control(minsplit = 10))attributes(bodyfat_rpart)

Page 24:

print(bodyfat_rpart$cptable)

Page 25:

print(bodyfat_rpart)

Page 26:

plot(bodyfat_rpart)
text(bodyfat_rpart, use.n=T)

Page 27:

• Select the tree with the minimum prediction error.

opt <- which.min(bodyfat_rpart$cptable[,"xerror"])
cp <- bodyfat_rpart$cptable[opt, "CP"]
bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
print(bodyfat_prune)

plot(bodyfat_prune)
text(bodyfat_prune, use.n=T)
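As a related aid (not shown on the original slides), rpart also provides printcp() and plotcp() to inspect the cross-validation results behind this choice of cp:

printcp(bodyfat_rpart)  # table of CP, nsplit, rel error, xerror and xstd
plotcp(bodyfat_rpart)   # plot of cross-validated error against cp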

Page 28:

Page 29:

• After that, the selected tree is used to make predictions on the test data, and the predicted values are compared with the observed values.

• Function abline(a=0, b=1) draws the diagonal line y = x. The predictions of a good model are expected to be equal, or very close, to the actual values; that is, most points should lie on or close to the diagonal line.

Page 30:

DEXfat_pred <- predict(bodyfat_prune, newdata=bodyfat.test)
xlim <- range(bodyfat$DEXfat)
plot(DEXfat_pred ~ DEXfat, data=bodyfat.test, xlab="Observed", ylab="Predicted", ylim=xlim, xlim=xlim)
abline(a=0, b=1)
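A small follow-up sketch (not on the original slide) to quantify the test-set error numerically:

# root-mean-squared error on the test set
sqrt(mean((DEXfat_pred - bodyfat.test$DEXfat)^2))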

Page 31:

Page 32:

More info

# Recursive Partitioning and Regression Trees
help("rpart")

# Control for Rpart Fits
help("rpart.control")

# Prediction of Body Fat by Skinfold Thickness, Circumferences, and Bone Breadths
??TH.data::bodyfat

Page 33:

C5.0
http://cran.r-project.org/web/packages/C50/C50.pdf

Page 34:

C50

library(C50)
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

iris_C5.0 <- C5.0(myFormula, data=trainData)

summary(iris_C5.0)
C5imp(iris_C5.0)
C5.0testPred <- predict(iris_C5.0, testData)
table(C5.0testPred, testData$Species)
predict(iris_C5.0, testData, type = "prob")

Page 35:

More info

# C5.0 Decision Trees and Rule-Based Models
help("C5.0")
# Control for C5.0 Models
help("C5.0Control")
# Summaries of C5.0 Models
help("summary.C5.0")
# Variable Importance Measures for C5.0 Models
help("C5imp")

Page 36:

PLOT RPART MODELS. AN ENHANCED VERSION OF PLOT.RPART

http://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf

Page 37:

rpart.plot

library(rpart.plot)
data(ptitanic)  # Titanic data
tree <- rpart(survived ~ ., data=ptitanic, cp=.02)  # cp=.02 because we want a small tree for the demo

rpart.plot(tree, main="default rpart.plot\n(type = 0, extra = 0)")

prp(tree, main="type = 4, extra = 6", type=4, extra=6, faclen=0)  # faclen=0 to print full factor names

Page 38:

rpart.plot

rpart.plot(tree, main="extra = 106, under = TRUE", extra=106, under=TRUE, faclen=0)# the old way for comparison

plot(tree, uniform=TRUE, compress=TRUE, branch=.2)text(tree, use.n=TRUE, cex=.6, xpd=NA) # cex is a guess, depends on your window sizetitle("rpart.plot for comparison", cex=.6)

rpart.plot(tree, box.col=3, xflip=FALSE)

Page 39:

More info

# Titanic data with passenger names and other details removed.
help("ptitanic")

# Plot an rpart model.
help("rpart.plot")

# Plot an rpart model. A superset of rpart.plot.
help("prp")