red wine quality assessment

Red Wine Quality Evaluation: Weiyang Bi, Shilin Wang, Zheng Xue

Upload: weiyang-abbie-bi

Post on 22-Jan-2018


Page 1: Red Wine Quality Assessment

Red Wine Quality Evaluation

Weiyang Bi, Shilin Wang, Zheng Xue

Page 2: Red Wine Quality Assessment

Data Description

• Source:

Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez

A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal, 2009

Page 3: Red Wine Quality Assessment

Data Description

• The dataset covers the red variant of the Portuguese "Vinho Verde" wine.

• Due to privacy and logistical issues, only the physicochemical variables (inputs) and the sensory variable (output) are available.

Page 4: Red Wine Quality Assessment

Dataset

Missing values check:

> nrow(data[!complete.cases(data),])
[1] 0

Page 5: Red Wine Quality Assessment

Attribute information

Input variables (based on physicochemical tests):

1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

Page 6: Red Wine Quality Assessment

Correlation matrix
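A correlation matrix of this kind can be produced with base R's cor(). The sketch below is only illustrative: it uses a small simulated data frame named wine as a stand-in for the real data, and the column subset and all values are assumptions, not figures from the slides.

```r
# Simulated stand-in for the wine data (values are made up for illustration).
set.seed(1)
n <- 200
wine <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.5, sd = 0.2),
  sulphates        = rnorm(n, mean = 0.65, sd = 0.15)
)
# Hypothetical quality score: rises with alcohol, falls with volatile acidity.
wine$quality <- round(5.6 + 0.4 * (wine$alcohol - 10.4) -
                      1.5 * (wine$volatile.acidity - 0.5) + rnorm(n, sd = 0.6))

# Pairwise Pearson correlations between all numeric columns.
cor_mat <- cor(wine)
print(round(cor_mat, 2))
```

With the real data, the same call on the twelve columns gives the full 12 x 12 matrix.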

Page 7: Red Wine Quality Assessment

R code for training set and test set

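B <- 20                                  # number of random train/test splits
for (i in 1:B) {
  set.seed(i)                            # reproducible split for repetition i
  # draw 1000 distinct row indexes for the training set
  indexes <- sample(1:nrow(data), size = 1000, replace = FALSE)
  train <- data[indexes, ]               # 1000 training rows
  test <- data[-indexes, ]               # all remaining rows
  # fit and evaluate the models on train/test here;
  # otherwise each iteration overwrites the previous split
}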

Page 8: Red Wine Quality Assessment

Methods

Three methods were applied to the data set:

1) CART

2) Bagging

3) Random Forest

Page 9: Red Wine Quality Assessment

Classification and Regression Trees (CART)

Page 10: Red Wine Quality Assessment

CP Table

Page 11: Red Wine Quality Assessment

Pruned Tree
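The CP table and pruning steps can be sketched as follows. This is a minimal example under stated assumptions: the slides do not name the package, so rpart is assumed; the data frame wine is simulated; and the binary good/bad outcome is a hypothetical simplification of the quality score.

```r
library(rpart)  # assumption: the slides do not say which package was used

# Simulated stand-in for the wine data (made-up values, illustrative only).
set.seed(1)
n <- 600
wine <- data.frame(
  alcohol          = rnorm(n, 10.4, 1),
  volatile.acidity = rnorm(n, 0.5, 0.2),
  sulphates        = rnorm(n, 0.65, 0.15)
)
score <- 0.9 * (wine$alcohol - 10.4) -
  3 * (wine$volatile.acidity - 0.5) + rnorm(n, sd = 0.8)
wine$quality <- factor(ifelse(score > 0, "good", "bad"))  # hypothetical binary label

train <- wine[1:400, ]
test  <- wine[401:600, ]

# Grow a deliberately large classification tree, then inspect the CP table.
fit <- rpart(quality ~ ., data = train, method = "class",
             control = rpart.control(cp = 0.001))
printcp(fit)

# Prune at the complexity parameter with the lowest cross-validated error.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Misclassification rate of the pruned tree on the held-out rows.
pred <- predict(pruned, newdata = test, type = "class")
test_error <- mean(pred != test$quality)
print(test_error)
```

Choosing the CP with the lowest xerror is one common rule; the one-standard-error rule is an alternative that tends to give smaller trees.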

Page 12: Red Wine Quality Assessment
Page 13: Red Wine Quality Assessment

Variable Selection

• Total sulfur dioxide

• Volatile acidity

• Sulphates

• Residual sugar

• Alcohol

Page 14: Red Wine Quality Assessment

Number of splits

Page 15: Red Wine Quality Assessment

Error rate

Page 16: Red Wine Quality Assessment

Bagging

Page 17: Red Wine Quality Assessment

Merging data

Page 18: Red Wine Quality Assessment

CP Table

Page 19: Red Wine Quality Assessment

Misclassification Rate

Page 20: Red Wine Quality Assessment

Misclassification rate of 100 bagged trees
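Bagging 100 trees and measuring the misclassification rate can be sketched by hand: fit one tree per bootstrap resample and take a majority vote. This is an assumed reconstruction, not the authors' code; it uses rpart as the base learner, a simulated data frame wine, and a hypothetical good/bad label.

```r
library(rpart)  # assumption: the base learner package is not named in the slides

# Simulated stand-in for the wine data (made-up values, illustrative only).
set.seed(1)
n <- 600
wine <- data.frame(
  alcohol          = rnorm(n, 10.4, 1),
  volatile.acidity = rnorm(n, 0.5, 0.2),
  sulphates        = rnorm(n, 0.65, 0.15)
)
score <- 0.9 * (wine$alcohol - 10.4) -
  3 * (wine$volatile.acidity - 0.5) + rnorm(n, sd = 0.8)
wine$quality <- factor(ifelse(score > 0, "good", "bad"))  # hypothetical binary label

train <- wine[1:400, ]
test  <- wine[401:600, ]

n_trees <- 100
# Fit one unpruned tree per bootstrap resample of the training rows.
trees <- lapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  rpart(quality ~ ., data = boot, method = "class",
        control = rpart.control(cp = 0, minsplit = 5))
})

# Each tree votes for a class; rows = test observations, columns = trees.
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, newdata = test, type = "class"))
})

# Majority vote across the ensemble, then the misclassification rate.
bagged_pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
bagged_error <- mean(bagged_pred != test$quality)
print(bagged_error)
```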

Page 21: Red Wine Quality Assessment

ROC Graph

Axes: 1-Specificity (x), Sensitivity (y)

Best single tree: 0.64

Bagged 100 trees: 0.644
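The slide does not label the two values; if they are areas under the ROC curve (an assumption), they could be computed along the lines of the base R sketch below. The labels and scores are toy values made up for illustration.

```r
# Toy true labels and predicted scores (made up for illustration).
labels <- c(1, 1, 0, 1, 0, 0, 1, 0)          # 1 = positive class
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)

# Sweep thresholds from high to low; each threshold gives one ROC point.
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
tpr <- sapply(thresholds, function(th) mean(scores[labels == 1] >= th))  # sensitivity
fpr <- sapply(thresholds, function(th) mean(scores[labels == 0] >= th))  # 1 - specificity

# Area under the curve by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)  # 0.75 for these toy values
```

Calling plot(fpr, tpr, type = "l") draws the curve with the same axes as the slide.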

Page 22: Red Wine Quality Assessment

Frequency Table

Page 23: Red Wine Quality Assessment

Evaluation of Variable Importance

Page 24: Red Wine Quality Assessment

Random Forest

Page 25: Red Wine Quality Assessment

Data Structure

Page 26: Red Wine Quality Assessment

Random Forest Fit

Page 27: Red Wine Quality Assessment

Random Forest Plot

Page 28: Red Wine Quality Assessment

Importance

Page 29: Red Wine Quality Assessment

Relative Variable Importance
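A random forest fit and its variable importance can be sketched as below. The randomForest package is an assumption (the slides do not name it), the data frame wine is simulated, and the good/bad label is hypothetical.

```r
library(randomForest)  # assumption: the slides do not say which package was used

# Simulated stand-in for the wine data (made-up values, illustrative only).
set.seed(1)
n <- 600
wine <- data.frame(
  alcohol          = rnorm(n, 10.4, 1),
  volatile.acidity = rnorm(n, 0.5, 0.2),
  sulphates        = rnorm(n, 0.65, 0.15)
)
score <- 0.9 * (wine$alcohol - 10.4) -
  3 * (wine$volatile.acidity - 0.5) + rnorm(n, sd = 0.8)
wine$quality <- factor(ifelse(score > 0, "good", "bad"))  # hypothetical binary label

train <- wine[1:400, ]
test  <- wine[401:600, ]

# Fit the forest; importance = TRUE also stores permutation importance.
rf <- randomForest(quality ~ ., data = train, ntree = 500, importance = TRUE)
print(rf)  # summary includes the out-of-bag error estimate

# Misclassification rate on the held-out rows.
rf_pred  <- predict(rf, newdata = test)
rf_error <- mean(rf_pred != test$quality)
print(rf_error)

# Variable importance as mean decrease in Gini impurity, largest first.
imp <- importance(rf, type = 2)
print(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE])
```

In the same package, varImpPlot(rf) plots the importance ranking and partialPlot() draws partial dependence curves for a chosen variable.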

Page 30: Red Wine Quality Assessment

Partial Dependence Plot

Panels: Alcohol, Sulphates, Volatile acidity, Total sulfur dioxide (y-axis: partial dependence)

Page 31: Red Wine Quality Assessment

CART, Bagging and RF Comparison

Variable selection

CART: alcohol, total sulfur dioxide, volatile acidity, sulphates, residual sugar

Bagging: alcohol, sulphates, volatile acidity, total sulfur dioxide, density, fixed acidity, residual sugar, citric acid, pH, free sulfur dioxide, chlorides

Random Forest: alcohol, sulphates, volatile acidity, total sulfur dioxide, density, chlorides, fixed acidity, free sulfur dioxide, pH, citric acid, residual sugar

Page 32: Red Wine Quality Assessment

CART, Bagging and RF Comparison

Page 33: Red Wine Quality Assessment

Conclusion

• In this case, random forest is the best prediction tool: it has the lowest estimated test error rate, ahead of both CART and bagging.