Predicting Wine Quality Using Different Implementations of Decision Tree Algorithm in R MOHAMMED ALHAMADI - PROJECT 1

Upload: mohammed-al-hamadi

Post on 21-Jan-2018


Acknowledgement

This project was completed in partial fulfillment of the requirements for the course Introduction to Machine Learning, offered online in Fall 2016 by Tandon Online, Tandon School of Engineering, NYU.

Outline

1. Data set

2. Data exploration and visualization

3. Factorizing a variable

4. Splitting data into training and testing

5. Using (C50) Library

6. Using (Tree) library

7. Using (rpart) library

8. Results Comparison

Data Set

• The data set contains 4898 observations of white wine samples, each with a quality score assigned by wine tasters

• The data set contains 11 independent variables and 1 dependent variable

• The independent variables are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol

• The dependent variable is the quality of the wine; in this data set the scores range from 3 (lowest quality) to 9 (highest quality)

Data Exploration and Visualization

wine_data <- read.csv("C:/Users/Mohammed/Google Drive/R_code/Project1/winequality-white.csv", header=TRUE, sep=";")
dim(wine_data)

[1] 4898 12

names(wine_data)

 [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"          "residual.sugar"
 [5] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"
 [9] "pH"                   "sulphates"            "alcohol"              "quality"

Data Exploration and Visualization (cont.)

str(wine_data)

'data.frame': 4898 obs. of 12 variables:
 $ fixed.acidity       : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile.acidity    : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric.acid         : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual.sugar      : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides           : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
 $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
 $ density             : num 1.001 0.994 0.995 0.996 0.996 ...
 $ pH                  : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates           : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol             : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
 $ quality             : int 6 6 6 6 6 6 6 6 6 6 ...

Data Exploration and Visualization (cont.)

cor(wine_data)

(Values are shown rounded to two decimals. FS dioxide = free sulfur dioxide, TS dioxide = total sulfur dioxide.)

             Fixed acid  Vol. acid  Citric acid  Res. sugar  Chlorides  FS dioxide  TS dioxide  Density     pH  Sulphates  Alcohol  Quality
Fixed acid         1.00      -0.02         0.29        0.09       0.02       -0.05        0.10     0.27  -0.43      -0.02    -0.12    -0.11
Vol. acid         -0.02       1.00        -0.15        0.06       0.07       -0.10        0.09     0.03  -0.03      -0.04     0.07    -0.19
Citric acid        0.29      -0.15         1.00        0.09       0.11        0.09        0.12     0.15  -0.16       0.06    -0.08    -0.01
Res. sugar         0.09       0.06         0.09        1.00       0.09        0.30        0.40     0.83  -0.20      -0.02    -0.05    -0.10
Chlorides          0.02       0.07         0.11        0.09       1.00        0.10        0.20     0.26  -0.10       0.02    -0.36    -0.21
FS dioxide        -0.05      -0.10         0.09        0.30       0.10        1.00        0.62     0.30  -0.01       0.01    -0.25     0.01
TS dioxide         0.10       0.09         0.12        0.40       0.20        0.62        1.00     0.53   0.01       0.13    -0.45    -0.17
Density            0.27       0.03         0.15        0.83       0.26        0.30        0.53     1.00  -0.10       0.07    -0.80    -0.31
pH                -0.43      -0.03        -0.16       -0.20      -0.10       -0.01        0.01    -0.10   1.00       0.16     0.12     0.10
Sulphates         -0.02      -0.04         0.06       -0.02       0.02        0.01        0.13     0.07   0.16       1.00    -0.02     0.05
Alcohol           -0.12       0.07        -0.08       -0.05      -0.36       -0.25       -0.45    -0.80   0.12      -0.02     1.00     0.44
Quality           -0.11      -0.19        -0.01       -0.10      -0.21        0.01       -0.17    -0.31   0.10       0.05     0.44     1.00
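To read the matrix for the question this project cares about, one can pull out the quality column and rank the predictors by the absolute size of their correlation. The following is a minimal sketch that uses the rounded values from the table above, typed in by hand; the names `cor_quality` and `ranked` are assumptions made for illustration and do not appear in the original slides.

```r
# Correlations of each predictor with quality, copied from the rounded table above
cor_quality <- c(
  fixed.acidity = -0.11, volatile.acidity = -0.19, citric.acid = -0.01,
  residual.sugar = -0.10, chlorides = -0.21, free.sulfur.dioxide = 0.01,
  total.sulfur.dioxide = -0.17, density = -0.31, pH = 0.10,
  sulphates = 0.05, alcohol = 0.44
)

# Rank predictors by the strength (absolute value) of their correlation with quality
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
print(head(ranked, 3))
```

On these values, alcohol, density, and chlorides come out as the three predictors most strongly correlated with quality. With the full data frame loaded, the same vector could be taken from `cor(wine_data)[, "quality"]`, provided the integer quality column is still present.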

Data Exploration and Visualization (cont.)

hist(wine_data$alcohol, col="#EE3B3B", main="Histogram of Alcohol Percent in Wine", xlab="Alcohol Percent", ylab="Number of samples", las=1)

hist(wine_data$density, col="#BCEE6B", main="Histogram of Wine Density", xlab="Density", ylab="Number of samples", las=1)

hist(wine_data$chlorides, col="#CDB79E", main="Histogram of Chlorides in Wine", xlab="Chlorides", ylab="Number of samples", las=1)

Data Exploration and Visualization (cont.)

hist(wine_data$quality, col="#458B74", main="Wine Quality Histogram", xlab="Quality", ylab="Number of samples")

Data Exploration and Visualization (cont.)

• Explore the exact type of each column in the data

typeof(wine_data$fixed.acidity)
[1] "double"
typeof(wine_data$volatile.acidity)
[1] "double"
typeof(wine_data$citric.acid)
[1] "double"
typeof(wine_data$residual.sugar)
[1] "double"
typeof(wine_data$chlorides)
[1] "double"
typeof(wine_data$free.sulfur.dioxide)
[1] "double"
typeof(wine_data$total.sulfur.dioxide)
[1] "double"
typeof(wine_data$density)
[1] "double"
typeof(wine_data$pH)
[1] "double"
typeof(wine_data$sulphates)
[1] "double"
typeof(wine_data$alcohol)
[1] "double"
typeof(wine_data$quality)
[1] "integer"
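The twelve separate calls can also be collapsed into one. The sketch below shows the idea on a small stand-in data frame; `toy` and `col_types` are hypothetical names, and the values are simply the first few alcohol and quality entries shown in the `str()` output.

```r
# Stand-in data frame with the same column types as two of the wine columns
toy <- data.frame(alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 6L))

# One call returns the storage type of every column as a named character vector
col_types <- sapply(toy, typeof)
print(col_types)
```

Applied to the real data, `sapply(wine_data, typeof)` should report the same types as the individual calls.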

Factorizing a variable

• Frequency of each quality level:

table(wine_data$quality)

   3    4    5    6    7    8    9
  20  163 1457 2198  880  175    5

• 45% of the wines have a score of 6

• We want a categorical variable with two levels, High or Low, so there are 2 options for the cut-off:

Option 1. High: scores from 6 to 9 (67%), Low: scores from 1 to 5 (33%)

Option 2. High: scores from 7 to 9 (22%), Low: scores from 1 to 6 (78%)

• For a more balanced class distribution, we'll choose option 1: low scores from 1 to 5 and high scores from 6 to 9
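The percentages for the two cut-offs follow directly from the frequency table. As a quick check, here is a small sketch that recomputes them; `counts`, `scores`, and `high_share` are names introduced here for illustration only.

```r
# Frequency of each quality score, copied from the table() output above
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198, "7" = 880, "8" = 175, "9" = 5)
scores <- as.integer(names(counts))

# Share of wines labelled "high" for a given cut-off (score >= cutoff)
high_share <- function(cutoff) sum(counts[scores >= cutoff]) / sum(counts)

print(round(100 * c(cutoff_6 = high_share(6), cutoff_7 = high_share(7)), 1))
```

A cut-off at 6 puts 3258 of the 4898 wines (about 67%) in the high class, while a cut-off at 7 leaves only 1060 (about 22%), which is why the first option gives the more balanced split.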

Factorizing a variable (cont.)

# factor() keeps the outcome categorical in R >= 4.0, where data.frame() no longer converts strings to factors by default
quality_fac <- factor(ifelse(wine_data$quality >= 6, "high", "low"))

wine_data <- data.frame(wine_data, quality_fac)

table(wine_data$quality_fac)

high  low
3258 1640

• We can now remove the old integer quality variable (column 12)

wine_data <- wine_data[,-12]
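Because later slides convert the class labels with `as.numeric()`, it helps to know how R codes the factor levels. The sketch below uses a handful of made-up scores (hypothetical values, not rows from the data set) to show the coding.

```r
# Toy quality scores (hypothetical values, not rows from the data set)
quality <- c(5L, 6L, 7L, 4L, 6L)

# Levels are sorted alphabetically, so "high" is level 1 and "low" is level 2
quality_fac <- factor(ifelse(quality >= 6, "high", "low"))

print(levels(quality_fac))
print(as.numeric(quality_fac))
```

So a numeric code of 1 means "high" and 2 means "low", which is worth keeping in mind when reading the ROC code that follows.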

Splitting data into training and testing

set.seed(71)
training_size <- round(0.8 * dim(wine_data)[1])
training_sample <- sample(dim(wine_data)[1], training_size, replace=FALSE)

training_data <- wine_data[training_sample,]
testing_data <- wine_data[-training_sample,]

• We set a seed so that we can reproduce results

• 80% of the data set will be training data

• 20% will be testing data
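The split sizes can be checked without the data itself, since they depend only on the number of rows. A minimal sketch, assuming the 4898 rows reported by `dim(wine_data)`; `n` and `testing_rows` are names added here for illustration.

```r
n <- 4898                      # number of rows reported by dim(wine_data)
set.seed(71)                   # same seed as the slide

training_size <- round(0.8 * n)
training_sample <- sample(n, training_size, replace = FALSE)

# Rows that negative indexing (wine_data[-training_sample, ]) would keep
testing_rows <- setdiff(seq_len(n), training_sample)

print(c(training = length(training_sample), testing = length(testing_rows)))
```

This gives 3918 training rows and 980 testing rows with no overlap, which matches the 980 test cases counted in the confusion matrices later on.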

Using (C50) Library

library(C50)

C50_model <- C5.0(quality_fac~., data=training_data)
predict_C50 <- predict(C50_model, testing_data[,-12])
# actual class labels of the test rows
testing_high <- testing_data$quality_fac

# misclassification error
mean(predict_C50 != testing_high)

[1] 0.2265306

• So the misclassification error for this model is almost 23%
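The misclassification error is simply the share of test cases where the predicted label differs from the actual one. A small sketch on six made-up labels (hypothetical, not taken from the wine data) shows how `mean()` of a logical comparison yields that share.

```r
# Hypothetical predicted and actual labels for six test wines
predicted <- factor(c("high", "low", "high", "high", "low", "high"), levels = c("high", "low"))
actual    <- factor(c("high", "high", "high", "low", "low", "high"), levels = c("high", "low"))

# predicted != actual is TRUE for each mistake; mean() of a logical vector is the share of TRUEs
error_rate <- mean(predicted != actual)
print(error_rate)
```

Here two of the six predictions are wrong, so the error rate is one third.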

Using (C50) Library (cont.)

library(ROCR)

# as.numeric() on a factor returns the level codes: 1 for "high", 2 for "low"
predict_C50_num <- as.numeric(predict_C50)
actual_num <- as.numeric(testing_data$quality_fac)
pr <- prediction(predict_C50_num, actual_num)
auc_data1 <- performance(pr, "tpr", "fpr")
plot(auc_data1, main="ROC Curve for C50 Model")

Using (C50) Library (cont.)

aucval1 <- performance(pr, measure="auc")
aucval1@y.values[[1]] # area under the curve value = 0.7444854

[1] 0.7444854

• So, the area under the curve value for the C50 model = 0.7444854
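Because `prediction()` is given hard class labels rather than scores, the ROC curve here has a single interior point, and the area under it reduces to the average of the true positive rate and the true negative rate. The sketch below reproduces the reported value from the C50 confusion-matrix counts given in the Results Comparison slide. The variable names are assumptions for illustration, and the choice of "low" as the positive class reflects my understanding of ROCR's default ordering for numeric labels.

```r
# C50 confusion-matrix counts from the Results Comparison slide
actual_high_pred_high <- 545
actual_high_pred_low  <- 112
actual_low_pred_high  <- 110
actual_low_pred_low   <- 213

# With labels coded high = 1, low = 2, the higher code ("low") is assumed to be the positive class
tpr <- actual_low_pred_low / (actual_low_pred_low + actual_low_pred_high)
fpr <- actual_high_pred_low / (actual_high_pred_low + actual_high_pred_high)

# Trapezoid area under the three-point curve (0,0) -> (fpr,tpr) -> (1,1)
auc <- (tpr + (1 - fpr)) / 2
print(round(auc, 7))
```

The result agrees with the 0.7444854 reported above. The same formula gives the same area if "high" is treated as the positive class instead, since it is symmetric in the two rates.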

Using (Tree) Library

library(tree)

tree_model <- tree(quality_fac~., data=training_data)
predict_tree <- predict(tree_model, testing_data[,-12], type="class")

mean(predict_tree != testing_high)

[1] 0.2663265

• So the misclassification error for the tree model is almost 27%

Using (Tree) Library (cont.)

plot(tree_model)
text(tree_model, pretty=0)

Using (Tree) Library (cont.)

predict_tree_num <- as.numeric(predict_tree)
pr2 <- prediction(predict_tree_num, actual_num)
auc_data2 <- performance(pr2, "tpr", "fpr")
plot(auc_data2, main="ROC Curve for Tree Model")

Using (Tree) Library (cont.)

aucval2 <- performance(pr2, measure="auc")
aucval2@y.values[[1]]

[1] 0.6439793

• So, the area under the curve value for the tree model = 0.6439793

Using (rpart) library

library(rpart)

rpart_model <- rpart(quality_fac~., data=training_data, method="class")
predict_rpart <- predict(rpart_model, testing_data[,-12], type="class")

mean(predict_rpart != testing_high)

[1] 0.2428571

• So the misclassification error for the rpart model is about 24%

Using (rpart) library (cont.)

• We can plot the tree and show the correctly and incorrectly classified instances

library(rpart.plot)

rpart.plot(rpart_model, extra=101)

Using (rpart) library (cont.)

predict_rpart_num <- as.numeric(predict_rpart)
pr3 <- prediction(predict_rpart_num, actual_num)
auc_data3 <- performance(pr3, "tpr", "fpr")
plot(auc_data3, main="ROC Curve for RPART Model")

Using (rpart) library (cont.)

aucval3 <- performance(pr3, measure="auc")
aucval3@y.values[[1]]

[1] 0.7118481

• So, the area under the curve value for the rpart model = 0.7118481

Results Comparison

C50 Model

table(testing=testing_high, predicted=predict_C50)

         Predicted
Testing  High  Low
High      545  112
Low       110  213

• 758 correctly classified (77%)

• 222 incorrectly classified (23%)

• TPR (Sensitivity) = 545/657 = 83%

• FPR (Fall-out) = 110/323 = 34%

Tree Model

table(testing=testing_high, predicted=predict_tree)

         Predicted
Testing  High  Low
High      596   61
Low       200  123

• 719 correctly classified (73%)

• 261 incorrectly classified (27%)

• TPR (Sensitivity) = 596/657 = 91%

• FPR (Fall-out) = 200/323 = 62%

RPART Model

table(testing=testing_high, predicted=predict_rpart)

         Predicted
Testing  High  Low
High      555  102
Low       136  187

• 742 correctly classified (76%)

• 238 incorrectly classified (24%)

• TPR (Sensitivity) = 555/657 = 84%

• FPR (Fall-out) = 136/323 = 42%
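The summary figures for all three models can be recomputed from the confusion-matrix counts in one pass. The sketch below does this, treating "high" as the positive class as the slide does; `models` and `summarise` are names introduced here for illustration.

```r
# Confusion-matrix counts from the tables above, in the order:
# actual high & predicted high, actual high & predicted low,
# actual low & predicted high,  actual low & predicted low
models <- list(
  C50   = c(545, 112, 110, 213),
  tree  = c(596,  61, 200, 123),
  rpart = c(555, 102, 136, 187)
)

# Accuracy, sensitivity (TPR) and fall-out (FPR), treating "high" as the positive class
summarise <- function(m) {
  c(accuracy = (m[1] + m[4]) / sum(m),
    tpr      = m[1] / (m[1] + m[2]),
    fpr      = m[3] / (m[3] + m[4]))
}

res <- round(100 * sapply(models, summarise))
print(res)
```

The tree model has the highest sensitivity but also by far the highest fall-out, so it labels many low-quality wines as high. C50 has the best accuracy and the lowest fall-out of the three.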

Results Comparison (cont.)

• C50 Model: Area Under Curve = 0.7444854

• Tree Model: Area Under Curve = 0.6439793

• RPART Model: Area Under Curve = 0.7118481

• The C50 model has both the lowest misclassification error (about 23%) and the highest area under the curve of the three
