Predicting Wine Quality Using Different Implementations of Decision Tree Algorithm in R

MOHAMMED ALHAMADI - PROJECT 1

Acknowledgement

This project was completed in partial fulfillment of the requirements for the course Introduction to Machine Learning, offered online in fall 2016 by Tandon Online, Tandon School of Engineering, NYU.

Outline

1. Data set

2. Data exploration and visualization

3. Factorizing a variable

4. Splitting data into training and testing

5. Using (C50) Library

6. Using (Tree) library

7. Using (rpart) library

8. Results Comparison

Data Set

• The data set contains 4898 observations on white wine varieties and quality ranked by the wine tasters

• The data set contains 11 independent variables and 1 dependent variable

• The independent variables include: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol

• The dependent variable is the quality of the wine ranked from 3 (lowest quality) to 9 (highest quality)

Data Exploration and Visualization

wine_data <- read.csv("C:/Users/Mohammed/Google Drive/R_code/Project1/winequality-white.csv", header=TRUE, sep=";")

dim(wine_data)

[1] 4898 12

names(wine_data)

 [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"          "residual.sugar"
 [5] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"
 [9] "pH"                   "sulphates"            "alcohol"              "quality"

Data Exploration and Visualization (cont.)

str(wine_data)

'data.frame': 4898 obs. of 12 variables:
 $ fixed.acidity       : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile.acidity    : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric.acid         : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual.sugar      : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides           : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
 $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
 $ density             : num 1.001 0.994 0.995 0.996 0.996 ...
 $ pH                  : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates           : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol             : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
 $ quality             : int 6 6 6 6 6 6 6 6 6 6 ...

Data Exploration and Visualization (cont.)

cor(wine_data)

Fixed acid Vol. acid Citric acid Res.Sugar Chlorides FS dioxide TS dioxide Density pH Sulphates Alcohol Quality

Fixed acid 1 -0.02 0.29 0.09 0.02 -0.05 0.1 0.27 -0.43 -0.02 -0.12 -0.11

Vol. acid -0.02 1 -0.15 0.06 0.07 -0.1 0.09 0.03 -0.03 -0.04 0.07 -0.19

Citric acid 0.29 -0.15 1 0.09 0.11 0.09 0.12 0.15 -0.16 0.06 -0.08 -0.01

Res.Sugar 0.09 0.06 0.09 1 0.09 0.3 0.4 0.83 -0.2 -0.02 -0.05 -0.1

Chlorides 0.02 0.07 0.11 0.09 1 0.1 0.2 0.26 -0.1 0.02 -0.36 -0.21

FS dioxide -0.05 -0.1 0.09 0.3 0.1 1 0.62 0.3 -0.01 0.01 -0.25 0.01

TS dioxide 0.1 0.09 0.12 0.4 0.2 0.62 1 0.53 0.01 0.13 -0.45 -0.17

Density 0.27 0.03 0.15 0.83 0.26 0.3 0.53 1 -0.1 0.07 -0.8 -0.31

pH -0.43 -0.03 -0.16 -0.2 -0.1 -0.01 0.01 -0.1 1 0.16 0.12 0.1

Sulphates -0.02 -0.04 0.06 -0.02 0.02 0.01 0.13 0.07 0.16 1 -0.02 0.05

Alcohol -0.12 0.07 -0.08 -0.05 -0.36 -0.25 -0.45 -0.8 0.12 -0.02 1 0.44

Quality -0.11 -0.19 -0.01 -0.1 -0.21 0.01 -0.17 -0.31 0.1 0.05 0.44 1
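In the table above, alcohol (0.44) and density (-0.31) show the strongest linear relationship with quality. A minimal sketch of how such a rounded matrix and a ranking of predictors could be produced is given below. It runs on a small synthetic data frame (`toy`, hypothetical values), not on the real wine data, so the numbers it prints are illustrative only.

```r
# Synthetic stand-in for the wine data (hypothetical values, not the real data set).
set.seed(1)
n <- 200
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.01 - 0.0015 * alcohol + rnorm(n, sd = 0.001)
quality <- round(6 + 0.4 * (alcohol - 10.5) + rnorm(n, sd = 0.8))
toy <- data.frame(alcohol, density, quality)

# Full correlation matrix, rounded to two decimals for readability.
cor_mat <- round(cor(toy), 2)
print(cor_mat)

# Correlation of each predictor with the response, strongest (by absolute value) first.
cor_with_quality <- cor_mat[rownames(cor_mat) != "quality", "quality"]
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(ranked)
```

On the real data, the same two steps would be applied to `wine_data` instead of `toy`.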

Data Exploration and Visualization (cont.)

hist(wine_data$alcohol, col="#EE3B3B", main="Histogram of Alcohol Percent in Wine", xlab="Alcohol Percent", ylab="Number of samples", las=1)

hist(wine_data$density, col="#BCEE6B", main="Histogram of Wine Density", xlab="Density", ylab="Number of samples", las=1)

hist(wine_data$chlorides, col="#CDB79E", main="Histogram of Chlorides in Wine", xlab="Chlorides", ylab="Number of samples", las=1)

Data Exploration and Visualization (cont.)

hist(wine_data$quality, col="#458B74", main="Wine Quality Histogram", xlab="Quality", ylab="Number of samples")

Data Exploration and Visualization (cont.)

• Explore the exact types of each column in the data

typeof(wine_data$fixed.acidity)
[1] "double"

typeof(wine_data$volatile.acidity)
[1] "double"

typeof(wine_data$citric.acid)
[1] "double"

typeof(wine_data$residual.sugar)
[1] "double"

typeof(wine_data$chlorides)
[1] "double"

typeof(wine_data$free.sulfur.dioxide)
[1] "double"

typeof(wine_data$total.sulfur.dioxide)
[1] "double"

typeof(wine_data$density)
[1] "double"

typeof(wine_data$pH)
[1] "double"

typeof(wine_data$sulphates)
[1] "double"

typeof(wine_data$alcohol)
[1] "double"

typeof(wine_data$quality)
[1] "integer"
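The twelve separate calls can also be collapsed into one. The sketch below uses `sapply` on a small hypothetical data frame (`toy`) that mimics one double column and one integer column; on the real data the same call would take `wine_data`.

```r
# Small hypothetical data frame: one double column and one integer column.
toy <- data.frame(alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 5L))

# Apply typeof() to every column and collect the results in a named vector.
col_types <- sapply(toy, typeof)
print(col_types)
```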

Factorizing a variable

• Frequency of each quality level:

table(wine_data$quality)

   3    4    5    6    7    8    9
  20  163 1457 2198  880  175    5

• 45% of the scores are at score 6

• The categorical variable we want is either High or Low, so we have 2 options:

Option 1: High: scores from 6 to 9 (67%), Low: scores from 3 to 5 (33%)

Option 2: High: scores from 7 to 9 (22%), Low: scores from 3 to 6 (78%)

• For a more balanced class distribution, we'll choose option 1: low scores from 3 to 5 and high scores from 6 to 9
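The percentages for the two candidate cutoffs follow directly from the frequency table. A minimal sketch, using only the counts printed by `table(wine_data$quality)` (the helper `high_share` is a hypothetical name introduced for illustration):

```r
# Frequency of each quality score, copied from the table(wine_data$quality) output.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198, "7" = 880, "8" = 175, "9" = 5)
scores <- as.integer(names(counts))

# Share of wines labelled "high" when every score >= cutoff counts as high.
high_share <- function(cutoff) sum(counts[scores >= cutoff]) / sum(counts)

share_6 <- round(100 * high_share(6))  # high = 6 to 9
share_7 <- round(100 * high_share(7))  # high = 7 to 9
print(c(cutoff_6 = share_6, cutoff_7 = share_7))
```

A cutoff of 6 gives roughly a two-thirds/one-third split, while a cutoff of 7 leaves the high class with about a fifth of the samples.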

Factorizing a variable (cont.)

# store the label as a factor so the tree libraries treat it as a class variable
quality_fac <- factor(ifelse(wine_data$quality >= 6, "high", "low"))

wine_data <- data.frame(wine_data, quality_fac)

table(wine_data$quality_fac)

high  low
3258 1640

• We can now remove the old integer quality variable

wine_data <- wine_data[,-12]

Splitting data into training and testing

set.seed(71)
training_size <- round(0.8 * dim(wine_data)[1])
training_sample <- sample(dim(wine_data)[1], training_size, replace=FALSE)

training_data <- wine_data[training_sample,]
testing_data <- wine_data[-training_sample,]

• We set a seed so that we can reproduce results

• 80% of the data set will be training data

• 20% will be testing data
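The same splitting steps can be checked on a small example. The sketch below applies them to a hypothetical 50-row data frame (`toy`) and confirms that the two parts are disjoint and together cover every row.

```r
set.seed(71)

# Hypothetical stand-in for wine_data: 50 rows with an id column.
toy <- data.frame(id = 1:50, x = rnorm(50))

n <- nrow(toy)
training_size <- round(0.8 * n)

# Draw row indices without replacement so no row is picked twice.
training_sample <- sample(n, training_size, replace = FALSE)

training_data <- toy[training_sample, ]
testing_data <- toy[-training_sample, ]   # negative index keeps the remaining rows

# Count ids that ended up in both parts (should be none).
overlap <- length(intersect(training_data$id, testing_data$id))
print(c(train = nrow(training_data), test = nrow(testing_data), overlap = overlap))
```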

Using (C50) Library

library(C50)
C50_model <- C5.0(quality_fac~., data=training_data)
predict_C50 <- predict(C50_model, testing_data[,-12])
# actual labels of the held-out rows
testing_high <- testing_data$quality_fac

# misclassification error
mean(predict_C50 != testing_high)

[1] 0.2265306

• So the misclassification error for this model is almost 23%

Using (C50) Library (cont.)

library(ROCR)
predict_C50_num <- as.numeric(predict_C50)
actual_num <- as.numeric(testing_data$quality_fac)
pr <- prediction(predict_C50_num, actual_num)
auc_data1 <- performance(pr, "tpr", "fpr")
plot(auc_data1, main="ROC Curve for C50 Model")

Using (C50) Library (cont.)

aucval1 <- performance(pr, measure="auc")
aucval1@y.values[[1]] # area under the curve value

[1] 0.7444854

• So, the area under the curve value for the C50 model = 0.7444854

Using (Tree) Library

library(tree)
tree_model <- tree(quality_fac~., data=training_data)
predict_tree <- predict(tree_model, testing_data[,-12], type="class")

mean(predict_tree != testing_high)

[1] 0.2663265

• So the misclassification error for the tree model is almost 27%

Using (Tree) Library (cont.)

plot(tree_model)
text(tree_model, pretty=0)

Using (Tree) Library (cont.)

predict_tree_num <- as.numeric(predict_tree)
pr2 <- prediction(predict_tree_num, actual_num)
auc_data2 <- performance(pr2, "tpr", "fpr")
plot(auc_data2, main="ROC Curve for Tree Model")

Using (Tree) Library (cont.)

aucval2 <- performance(pr2, measure="auc")
aucval2@y.values[[1]]

[1] 0.6439793

• So, the area under the curve value for the tree model = 0.6439793

Using (rpart) library

library(rpart)
rpart_model <- rpart(quality_fac~., data=training_data, method="class")
predict_rpart <- predict(rpart_model, testing_data[,-12], type="class")

mean(predict_rpart != testing_high)

[1] 0.2428571

• So the misclassification error for the rpart model is almost 24%

Using (rpart) library (cont.)

library(rpart.plot)
rpart.plot(rpart_model, extra=101)

• We can plot the tree and show the correctly and incorrectly classified instances

Using (rpart) library (cont.)

predict_rpart_num <- as.numeric(predict_rpart)
pr3 <- prediction(predict_rpart_num, actual_num)
auc_data3 <- performance(pr3, "tpr", "fpr")
plot(auc_data3, main="ROC Curve for RPART Model")

Using (rpart) library (cont.)

aucval3 <- performance(pr3, measure="auc")
aucval3@y.values[[1]]

[1] 0.7118481

• So, the area under the curve value for the rpart model = 0.7118481
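The ROC curves in these slides are drawn from hard class labels, so each curve has only one operating point between (0,0) and (1,1). A possible refinement, sketched below as an assumption and not as part of the original project, is to score each test row with its predicted class probability. The example uses `rpart` on synthetic data (`toy`, with a hypothetical labelling rule) and computes the AUC with the rank-based Mann-Whitney formula in base R, so it does not depend on ROCR.

```r
library(rpart)  # recommended package normally shipped with R

set.seed(71)
n <- 400

# Synthetic predictors and a hypothetical rule that makes "high" more likely with more alcohol.
toy <- data.frame(alcohol = rnorm(n, 10.5, 1.2), chlorides = rnorm(n, 0.045, 0.02))
score <- toy$alcohol - 10.5 + rnorm(n, sd = 1)
toy$quality_fac <- factor(ifelse(score > 0, "high", "low"))

# 80/20 split, as in the slides.
train_idx <- sample(n, round(0.8 * n))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

model <- rpart(quality_fac ~ ., data = train, method = "class")

# type = "prob" returns one column of class probabilities per factor level.
prob_high <- predict(model, test, type = "prob")[, "high"]

# Rank-based AUC: probability that a random "high" row is scored above a random "low" row.
is_high <- test$quality_fac == "high"
r <- rank(prob_high)
n_pos <- sum(is_high)
n_neg <- sum(!is_high)
auc <- (sum(r[is_high]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)
```

Because a tree assigns the same probability to every row in a leaf, the resulting curve has one point per distinct leaf probability instead of a single point.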

Results Comparison

C50 Model

table(testing_high, predicted=predict_C50)

          Predicted
Testing   High   Low
High       545   112
Low        110   213

Tree Model

table(testing_high, predicted=predict_tree)

          Predicted
Testing   High   Low
High       596    61
Low        200   123

RPART Model

table(testing_high, predicted=predict_rpart)

          Predicted
Testing   High   Low
High       555   102
Low        136   187

For the C50 model:

• 758 correctly classified (77%)

• 222 incorrectly classified (23%)

• TPR (Sensitivity) = 545/657 = 83%

• FPR (Fall-out) = 110/323 = 34%
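The accuracy, TPR, and FPR figures can be derived for all three models from their confusion tables. The sketch below copies the counts from the tables in this section; the helper name `summarise_confusion` is hypothetical.

```r
# Confusion counts from the three tables (rows = actual, columns = predicted).
conf <- list(
  C50   = matrix(c(545, 112, 110, 213), nrow = 2, byrow = TRUE),
  Tree  = matrix(c(596,  61, 200, 123), nrow = 2, byrow = TRUE),
  RPART = matrix(c(555, 102, 136, 187), nrow = 2, byrow = TRUE)
)

# Accuracy, true positive rate, and false positive rate, treating "High" as the positive class.
summarise_confusion <- function(m) {
  dimnames(m) <- list(actual = c("High", "Low"), predicted = c("High", "Low"))
  c(accuracy = sum(diag(m)) / sum(m),
    tpr = m["High", "High"] / sum(m["High", ]),
    fpr = m["Low", "High"] / sum(m["Low", ]))
}

# One row per model, one column per metric, shown as whole percentages.
metrics <- t(sapply(conf, summarise_confusion))
print(round(100 * metrics))
```

Read this way, the tree model has the highest TPR but also the highest FPR, which is consistent with it having the largest misclassification error of the three.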



Results Comparison (cont.)

C50 Model: Area Under Curve = 0.7444854

Tree Model: Area Under Curve = 0.6439793

RPART Model: Area Under Curve = 0.7118481
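Since each ROC curve here comes from hard class predictions, it consists of the segments (0,0) to (FPR, TPR) to (1,1), and its area reduces to (TPR + 1 - FPR) / 2. The sketch below checks this against the reported values using the confusion counts from the Results Comparison tables; `auc_single_point` is a hypothetical helper name.

```r
# Area under a ROC curve that has a single operating point, via the trapezoid rule.
auc_single_point <- function(tp, fn, fp, tn) {
  tpr <- tp / (tp + fn)
  fpr <- fp / (fp + tn)
  x <- c(0, fpr, 1)
  y <- c(0, tpr, 1)
  # Sum of trapezoid areas between consecutive points.
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Counts: actual High predicted High/Low, then actual Low predicted High/Low.
auc_c50   <- auc_single_point(tp = 545, fn = 112, fp = 110, tn = 213)
auc_tree  <- auc_single_point(tp = 596, fn = 61,  fp = 200, tn = 123)
auc_rpart <- auc_single_point(tp = 555, fn = 102, fp = 136, tn = 187)

print(round(c(C50 = auc_c50, Tree = auc_tree, RPART = auc_rpart), 7))
```

The three values agree with the areas reported by ROCR, which confirms that the AUC ranking of the models is determined entirely by their confusion tables.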

