Predicting Wine Quality Using Different Implementations of Decision Tree Algorithm in R
MOHAMMED ALHAMADI - PROJECT 1
Acknowledgement
This project was completed in partial fulfillment of the requirements for the course Introduction to Machine Learning, offered online in Fall 2016 through Tandon Online, Tandon School of Engineering, NYU.
Outline
1. Data set
2. Data exploration and visualization
3. Factorizing a variable
4. Splitting data into training and testing
5. Using (C50) Library
6. Using (Tree) library
7. Using (rpart) library
8. Results Comparison
Data Set
• The data set contains 4898 observations of white wine varieties, with quality ranked by wine tasters
• The data set contains 11 independent variables and 1 dependent variable
• The independent variables are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol
• The dependent variable is the quality of the wine, ranked from 3 (lowest quality) to 9 (highest quality)
Data Exploration and Visualization
wine_data <- read.csv("C:/Users/Mohammed/Google Drive/R_code/Project1/winequality-white.csv", header=TRUE, sep=";")
dim(wine_data)
[1] 4898 12
names(wine_data)
[1] "fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar"
[5] "chlorides" "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
[9] "pH" "sulphates" "alcohol" "quality"
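As a small illustration of where the dotted column names come from: read.csv() runs headers through make.names() by default (check.names=TRUE), so a header such as "fixed acidity" becomes fixed.acidity. The sketch below assumes the file's header uses spaces and feeds a made-up three-column extract through the text argument instead of the real file, so the values are placeholders.

```r
# Hypothetical two-row extract in the same semicolon-separated layout
csv_text <- paste(
  '"fixed acidity";"volatile acidity";"quality"',
  '7;0.27;6',
  '6.3;0.3;6',
  sep = "\n"
)

# Same call shape as on the slide, but reading from a string
toy <- read.csv(text = csv_text, header = TRUE, sep = ";")

dim(toy)    # rows, then columns
names(toy)  # spaces in the header are replaced by dots
```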
Data Exploration and Visualization (cont.)
str(wine_data)
'data.frame': 4898 obs. of 12 variables:
 $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
 $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
 $ density : num 1.001 0.994 0.995 0.996 0.996 ...
 $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
 $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Data Exploration and Visualization (cont.)
cor(wine_data)
Fixed acid Vol. acid Citric acid Res.Sugar Chlorides FS dioxide TS dioxide Density pH Sulphates Alcohol Quality
Fixed acid 1 -0.02 0.29 0.09 0.02 -0.05 0.1 0.27 -0.43 -0.02 -0.12 -0.11
Vol. acid -0.02 1 -0.15 0.06 0.07 -0.1 0.09 0.03 -0.03 -0.04 0.07 -0.19
Citric acid 0.29 -0.15 1 0.09 0.11 0.09 0.12 0.15 -0.16 0.06 -0.08 -0.01
Res.Sugar 0.09 0.06 0.09 1 0.09 0.3 0.4 0.83 -0.2 -0.02 -0.05 -0.1
Chlorides 0.02 0.07 0.11 0.09 1 0.1 0.2 0.26 -0.1 0.02 -0.36 -0.21
FS dioxide -0.05 -0.1 0.09 0.3 0.1 1 0.62 0.3 -0.01 0.01 -0.25 0.01
TS dioxide 0.1 0.09 0.12 0.4 0.2 0.62 1 0.53 0.01 0.13 -0.45 -0.17
Density 0.27 0.03 0.15 0.83 0.26 0.3 0.53 1 -0.1 0.07 -0.8 -0.31
pH -0.43 -0.03 -0.16 -0.2 -0.1 -0.01 0.01 -0.1 1 0.16 0.12 0.1
Sulphates -0.02 -0.04 0.06 -0.02 0.02 0.01 0.13 0.07 0.16 1 -0.02 0.05
Alcohol -0.12 0.07 -0.08 -0.05 -0.36 -0.25 -0.45 -0.8 0.12 -0.02 1 0.44
Quality -0.11 -0.19 -0.01 -0.1 -0.21 0.01 -0.17 -0.31 0.1 0.05 0.44 1
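To read the Quality row of the matrix more easily, the correlations can be ranked by absolute value. The sketch below simply retypes the rounded values from that row into a named vector (it does not recompute them from the data) and sorts them; the vector name quality_cor is introduced here for illustration.

```r
# Rounded correlations with quality, copied from the last row of the matrix
quality_cor <- c(
  fixed.acidity = -0.11, volatile.acidity = -0.19, citric.acid = -0.01,
  residual.sugar = -0.10, chlorides = -0.21, free.sulfur.dioxide = 0.01,
  total.sulfur.dioxide = -0.17, density = -0.31, pH = 0.10,
  sulphates = 0.05, alcohol = 0.44
)

# Rank by strength of association, ignoring the sign
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
head(ranked, 3)
```

On these rounded values, alcohol (0.44), density (-0.31), and chlorides (-0.21) show the strongest linear association with quality.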
Data Exploration and Visualization (cont.)
hist(wine_data$alcohol, col="#EE3B3B", main="Histogram of Alcohol Percent in Wine", xlab="Alcohol Percent", ylab="Number of samples", las=1)
hist(wine_data$density, col="#BCEE6B", main="Histogram of Wine Density", xlab="Density", ylab="Number of samples", las=1)
hist(wine_data$chlorides, col="#CDB79E", main="Histogram of Chlorides in Wine", xlab="Chlorides", ylab="Number of samples", las=1)
Data Exploration and Visualization (cont.)
hist(wine_data$quality, col="#458B74", main="Wine Quality Histogram", xlab="Quality", ylab="Number of samples")
Data Exploration and Visualization (cont.)
• Explore the exact type of each column in the data
typeof(wine_data$fixed.acidity)
[1] "double"
typeof(wine_data$volatile.acidity)
[1] "double"
typeof(wine_data$citric.acid)
[1] "double"
typeof(wine_data$residual.sugar)
[1] "double"
typeof(wine_data$chlorides)
[1] "double"
typeof(wine_data$free.sulfur.dioxide)
[1] "double"
typeof(wine_data$total.sulfur.dioxide)
[1] "double"
typeof(wine_data$density)
[1] "double"
typeof(wine_data$pH)
[1] "double"
typeof(wine_data$sulphates)
[1] "double"
typeof(wine_data$alcohol)
[1] "double"
typeof(wine_data$quality)
[1] "integer"
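The twelve typeof() calls can also be collapsed into a single sapply() over the columns. This is only a compact alternative, shown on a made-up two-column data frame (toy) rather than the real wine_data:

```r
# Hypothetical stand-in for wine_data with one double and one integer column
toy <- data.frame(alcohol = c(8.8, 9.5), quality = c(6L, 6L))

# Apply typeof() to every column and collect the results in a named vector
column_types <- sapply(toy, typeof)
column_types
```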
Factorizing a variable
• Frequency of each quality level:
table(wine_data$quality)
3 4 5 6 7 8 9
20 163 1457 2198 880 175 5
• 45% of the scores are at score 6
• The categorical variable we want is either High or Low, so we have 2 options for where to cut:
Option 1: High: scores from 6 to 9 (67%); Low: scores from 1 to 5 (33%)
Option 2: High: scores from 7 to 9 (22%); Low: scores from 1 to 6 (78%)
• For a better score distribution, we'll choose low scores to be from 1 to 5 and high scores from 6 to 9
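The percentages for the two candidate cut-offs follow directly from the frequency table. A minimal sketch, retyping the counts from the table above; the helper name high_share is made up for this example:

```r
# Frequency of each quality score, copied from table(wine_data$quality)
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
scores <- as.integer(names(counts))

# Share of wines labelled "high" when high means score >= threshold
high_share <- function(threshold) {
  sum(counts[scores >= threshold]) / sum(counts)
}

round(100 * high_share(6))  # high = 6 to 9
round(100 * high_share(7))  # high = 7 to 9
```

Cutting at 6 gives roughly a 67/33 split, while cutting at 7 gives roughly 22/78, which is why the first option yields the more balanced classes.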
Factorizing a variable (cont.)
# factor() keeps the label categorical; since R 4.0, data.frame() no longer converts strings to factors
quality_fac <- factor(ifelse(wine_data$quality >= 6, "high", "low"))
wine_data <- data.frame(wine_data, quality_fac)
table(wine_data$quality_fac)
high low
3258 1640
• We can now remove the old integer quality variable
wine_data <- wine_data[,-12]
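One detail worth spelling out: ifelse() returns a character vector, and classification functions generally expect the response to be a factor. From R 4.0 onward, data.frame() no longer converts strings to factors by default, so an explicit factor() call is the safer route. A small sketch on a made-up score vector:

```r
# Hypothetical quality scores
quality <- c(3L, 5L, 6L, 7L, 9L, 6L)

# ifelse() on its own yields plain character labels
labels <- ifelse(quality >= 6, "high", "low")
class(labels)

# Wrapping in factor() gives a two-level categorical response
quality_fac <- factor(labels, levels = c("high", "low"))
table(quality_fac)
```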
Splitting data into training and testing
set.seed(71)
training_size <- round(0.8 * dim(wine_data)[1])
training_sample <- sample(dim(wine_data)[1], training_size, replace=FALSE)
training_data <- wine_data[training_sample,]
testing_data <- wine_data[-training_sample,]
• We set a seed so that we can reproduce results
• 80% of the data set will be training data
• 20% will be testing data
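The same split logic can be checked on row indices alone, without loading the data. The sketch below assumes only the row count of 4898 from the data set slide; testing_rows is a name introduced here for illustration.

```r
n <- 4898  # number of observations in the wine data set

set.seed(71)
training_size <- round(0.8 * n)
training_sample <- sample(n, training_size, replace = FALSE)

# Every row index not drawn for training ends up in the test set
testing_rows <- setdiff(seq_len(n), training_sample)

c(train = length(training_sample), test = length(testing_rows))
```

This gives 3918 training rows and 980 testing rows, which matches the 980 test cases (657 high plus 323 low) in the confusion matrices of the results comparison.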
Using (C50) Library
library(C50)
C50_model <- C5.0(quality_fac~., data=training_data)
predict_C50 <- predict(C50_model, testing_data[,-12])
# actual labels of the held-out rows (those not drawn into training_sample)
testing_high <- quality_fac[-training_sample]
# misclassification error
mean(predict_C50 != testing_high)
[1] 0.2265306
• So the misclassification error for this model is almost 23%
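The misclassification error is simply the fraction of test cases where the predicted label differs from the actual one. A toy illustration with made-up labels (not the project's predictions):

```r
# Hypothetical predicted and actual labels for five test cases
lv <- c("high", "low")
predicted <- factor(c("high", "high", "low", "low", "high"), levels = lv)
actual    <- factor(c("high", "low",  "low", "high", "high"), levels = lv)

# TRUE where the prediction is wrong; the mean of a logical vector is a proportion
error_rate <- mean(predicted != actual)
error_rate
```

Here two of the five predictions are wrong, so the error rate is 0.4.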
Using (C50) Library (cont.)
library(ROCR)
predict_C50_num <- as.numeric(predict_C50)
actual_num <- as.numeric(testing_data$quality_fac)
pr <- prediction(predict_C50_num, actual_num)
auc_data1 <- performance(pr, "tpr", "fpr")
plot(auc_data1, main="ROC Curve for C50 Model")
Using (C50) Library (cont.)
aucval1 <- performance(pr, measure="auc")
aucval1@y.values[[1]] # area under the curve value = 0.7444854
[1] 0.7444854
• So, the area under the curve value for the C50 model = 0.7444854
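Because the predictions passed to ROCR are hard class labels rather than probabilities, the ROC curve has only one interior point, and the area under it can be reproduced with the trapezoid rule. The sketch below is not ROCR's implementation; it assumes a three-point curve through (0, 0), (FPR, TPR), and (1, 1), using the C50 rates 110/323 and 545/657 from the results comparison. The function name trapezoid_auc is made up for this example.

```r
# Area under a piecewise-linear ROC curve via the trapezoid rule
trapezoid_auc <- function(fpr, tpr) {
  ord <- order(fpr, tpr)
  x <- fpr[ord]
  y <- tpr[ord]
  # width of each segment times the average height of its two endpoints
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# C50 model: FPR = 110/323, TPR = 545/657
fpr <- c(0, 110 / 323, 1)
tpr <- c(0, 545 / 657, 1)

auc_c50 <- trapezoid_auc(fpr, tpr)
round(auc_c50, 7)
```

For a single interior point this area reduces to (1 + TPR - FPR) / 2, and the result agrees with the 0.7444854 reported above.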
Using (Tree) Library
library(tree)
tree_model <- tree(quality_fac~., data=training_data)
predict_tree <- predict(tree_model, testing_data[,-12], type="class")
mean(predict_tree != testing_high)
[1] 0.2663265
• So the misclassification error for the tree model is almost 27%
Using (Tree) Library (cont.)
plot(tree_model)
text(tree_model, pretty=0)
Using (Tree) Library (cont.)
predict_tree_num <- as.numeric(predict_tree)
pr2 <- prediction(predict_tree_num, actual_num)
auc_data2 <- performance(pr2, "tpr", "fpr")
plot(auc_data2, main="ROC Curve for Tree Model")
Using (Tree) Library (cont.)
aucval2 <- performance(pr2, measure="auc")
aucval2@y.values[[1]]
[1] 0.6439793
• So, the area under the curve value for the tree model = 0.6439793
Using (rpart) library
library(rpart)
rpart_model <- rpart(quality_fac~., data=training_data, method="class")
predict_rpart <- predict(rpart_model, testing_data[,-12], type="class")
mean(predict_rpart != testing_high)
[1] 0.2428571
• So the misclassification error for the rpart model is almost 24%
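For intuition about what a classification tree does at each node, here is a minimal sketch of choosing one split by Gini impurity, which is the default splitting criterion rpart documents for method="class". It is an illustration only, not rpart's actual code; the six alcohol values and labels are made up, and the helper names gini and split_impurity are hypothetical.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class shares
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Size-weighted impurity of the two children produced by a threshold split
split_impurity <- function(x, labels, threshold) {
  left  <- labels[x <  threshold]
  right <- labels[x >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Hypothetical data: alcohol content and a high/low quality label
alcohol <- c(8.8, 9.5, 9.9, 10.1, 11.0, 12.2)
quality_fac <- factor(c("low", "low", "low", "high", "high", "high"))

# Candidate thresholds are the midpoints between consecutive sorted values
sorted <- sort(alcohol)
candidates <- (head(sorted, -1) + tail(sorted, -1)) / 2

impurities <- sapply(candidates, split_impurity, x = alcohol, labels = quality_fac)
best <- candidates[which.min(impurities)]
best
```

On this toy data the best threshold is 10, which separates the two classes perfectly (impurity 0). A full tree repeats this search over every predictor and then recurses into each child node.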
Using (rpart) library (cont.)
• We can plot the tree and show the correctly and incorrectly classified instances
library(rpart.plot)
rpart.plot(rpart_model, extra=101)
Using (rpart) library (cont.)
predict_rpart_num <- as.numeric(predict_rpart)
pr3 <- prediction(predict_rpart_num, actual_num)
auc_data3 <- performance(pr3, "tpr", "fpr")
plot(auc_data3, main="ROC Curve for RPART Model")
Using (rpart) library (cont.)
aucval3 <- performance(pr3, measure="auc")
aucval3@y.values[[1]]
[1] 0.7118481
• So, the area under the curve value for the rpart model = 0.7118481
Results Comparison
C50 Model
table(testing=testing_high, predicted=predict_C50)
Predicted
Testing High Low
High 545 112
Low 110 213
• 758 correctly classified (77%)
• 222 incorrectly classified (23%)
• TPR (Sensitivity) = 545/657 = 83%
• FPR (Fall-out) = 110/323 = 34%
Tree Model
table(testing=testing_high, predicted=predict_tree)
Predicted
Testing High Low
High 596 61
Low 200 123
• 719 correctly classified (73%)
• 261 incorrectly classified (27%)
• TPR (Sensitivity) = 596/657 = 91%
• FPR (Fall-out) = 200/323 = 62%
RPART Model
table(testing=testing_high, predicted=predict_rpart)
Predicted
Testing High Low
High 555 102
Low 136 187
• 742 correctly classified (76%)
• 238 incorrectly classified (24%)
• TPR (Sensitivity) = 555/657 = 84%
• FPR (Fall-out) = 136/323 = 42%
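The accuracy, TPR, and FPR figures can be recomputed from the four cells of each confusion matrix, treating "high" as the positive class. A short sketch using the counts shown above; the function name confusion_metrics is introduced here for illustration.

```r
# tp: actual high, predicted high   fn: actual high, predicted low
# fp: actual low,  predicted high   tn: actual low,  predicted low
confusion_metrics <- function(tp, fn, fp, tn) {
  c(accuracy = (tp + tn) / (tp + fn + fp + tn),
    tpr      = tp / (tp + fn),
    fpr      = fp / (fp + tn))
}

results <- rbind(
  C50   = confusion_metrics(tp = 545, fn = 112, fp = 110, tn = 213),
  Tree  = confusion_metrics(tp = 596, fn = 61,  fp = 200, tn = 123),
  RPART = confusion_metrics(tp = 555, fn = 102, fp = 136, tn = 187)
)

# Percentages rounded to whole numbers, as on the slide
round(100 * results)
```

The Tree model has the highest sensitivity (91%) but also by far the highest fall-out (62%), which explains why its overall accuracy is the lowest of the three.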
Results Comparison (cont.)
C50 Model: Area Under Curve = 0.7444854
Tree Model: Area Under Curve = 0.6439793
RPART Model: Area Under Curve = 0.7118481