red wine quality assessment
Red Wine Quality Evaluation
Weiyang Bi, Shilin Wang, Zheng Xue
Data Description
• Source:
Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez
A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal @2009
Data Description
• The dataset covers the red variant of the Portuguese "Vinho Verde" wine.
• Due to privacy and logistical issues, only physicochemical (input) and sensory (output) variables are available.
Dataset
> nrow(data[!complete.cases(data),])
[1] 0
Missing values check
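The check above counts the rows that contain at least one missing value. A minimal sketch of the same idea on a made-up data frame follows; `toy` and its values are hypothetical and are not taken from the wine data.

```r
# Hypothetical toy data frame with two incomplete rows (not the wine data)
toy <- data.frame(
  alcohol = c(9.4, NA, 10.5, 11.2),
  pH      = c(3.51, 3.20, NA, 3.16)
)

# complete.cases() is TRUE for rows that contain no NA
n_incomplete <- nrow(toy[!complete.cases(toy), ])

# Count of missing values per column, useful when the total is not zero
na_per_column <- colSums(is.na(toy))
```

A result of 0 for the row count, as reported on the slide, means no imputation or row removal is needed before modeling.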
Attribute information
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Correlation matrix
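The correlation matrix itself did not survive in this transcript. As a rough sketch of how such a matrix is usually computed in R, the example below uses simulated columns; the names follow the attribute list, but the values and the relationship between `alcohol` and `density` are assumptions made purely for illustration.

```r
# Simulated stand-in for a few wine attributes (values are illustrative only)
set.seed(1)
n <- 200
alcohol <- rnorm(n, mean = 10.4, sd = 1)
density <- 1 - 0.002 * alcohol + rnorm(n, sd = 0.001)  # assumed negative link
pH      <- rnorm(n, mean = 3.3, sd = 0.15)
toy <- data.frame(alcohol, density, pH)

# Pearson correlation matrix, rounded for readability
corr <- round(cor(toy), 2)
print(corr)
```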
R code for training set and test set
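B <- 20  # number of random train/test splits
for (i in 1:B) {
  set.seed(i)  # a different but reproducible split in each iteration
  # draw 1000 row indices without replacement
  indexes <- sample(1:nrow(data), size = 1000, replace = FALSE)
  train <- data[indexes, ]   # 1000 training rows
  test <- data[-indexes, ]   # remaining rows held out for testing
  # fit and evaluate the models on this split here; train and test are overwritten in the next iteration
}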
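Because the split loop overwrites `train` and `test` on every pass, any model fitting has to happen inside it. The sketch below shows one way to collect a test error per split. It assumes a binary quality label and uses simulated data, a training size of 200, and an `rpart` tree as the model; none of these choices are stated in the slides.

```r
library(rpart)

# Simulated stand-in for the wine data: a binary label driven mostly by alcohol
set.seed(42)
n <- 300
toy <- data.frame(alcohol = rnorm(n, 10.4, 1), sulphates = rnorm(n, 0.66, 0.17))
toy$quality <- factor(ifelse(toy$alcohol + rnorm(n, sd = 0.5) > 10.4, "good", "bad"))

B <- 20
test_error <- numeric(B)  # one misclassification rate per split
for (i in 1:B) {
  set.seed(i)
  idx   <- sample(1:nrow(toy), size = 200, replace = FALSE)
  train <- toy[idx, ]
  test  <- toy[-idx, ]
  fit   <- rpart(quality ~ ., data = train, method = "class")
  pred  <- predict(fit, newdata = test, type = "class")
  test_error[i] <- mean(pred != test$quality)
}

# Average over the B splits gives a more stable estimate than a single split
mean_error <- mean(test_error)
```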
Methods
Three methods were applied to the data set:
1) CART
2) Bagging
3) Random Forest
Classification and Regression Trees (CART)
CP Table
Pruned Tree
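The CP table and the pruned tree were shown as images and are not recoverable here. As a hedged sketch of the usual `rpart` workflow, the example below grows a large tree on simulated data, reads the complexity parameter table, and prunes at the value with the lowest cross-validated error. The data, the `cp = 0.001` setting, and the selection rule are assumptions, not the authors' settings.

```r
library(rpart)

# Simulated stand-in for the wine data (illustrative only)
set.seed(7)
n <- 400
toy <- data.frame(
  alcohol          = rnorm(n, 10.4, 1),
  volatile.acidity = rnorm(n, 0.53, 0.18),
  pH               = rnorm(n, 3.3, 0.15)
)
score <- toy$alcohol - 3 * toy$volatile.acidity + rnorm(n, sd = 0.8)
toy$quality <- factor(ifelse(score > median(score), "good", "bad"))

# Grow a deliberately large tree (small cp) with 10-fold cross-validation
full_tree <- rpart(quality ~ ., data = toy, method = "class",
                   control = rpart.control(cp = 0.001, xval = 10))
cp_table <- full_tree$cptable
print(cp_table)

# Pick the cp with the lowest cross-validated error and prune back to it
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]
pruned   <- prune(full_tree, cp = best_cp)
```

A common alternative is the one-standard-error rule, which picks the smallest tree whose `xerror` is within one `xstd` of the minimum.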
Variable Selection
Selected variables: total sulfur dioxide, volatile acidity, sulphates, residual sugar, alcohol
[Figure: error rate versus number of splits]
Bagging
Merging data
CP Table
Misclassification Rate
Misclassification rate of 100 bagged trees
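The slides do not say which package produced the bagged trees. The sketch below rolls bagging by hand with `rpart` to show the mechanics: 100 bootstrap samples, one tree per sample, and a majority vote on held-out rows. The simulated data, the binary label, and the 300/100 split are assumptions.

```r
library(rpart)

# Simulated stand-in for the wine data (illustrative only)
set.seed(123)
n <- 400
toy <- data.frame(alcohol = rnorm(n, 10.4, 1), sulphates = rnorm(n, 0.66, 0.17))
score <- toy$alcohol + 2 * toy$sulphates + rnorm(n, sd = 0.7)
toy$quality <- factor(ifelse(score > median(score), "good", "bad"))

idx   <- sample(1:n, size = 300)
train <- toy[idx, ]
test  <- toy[-idx, ]

n_trees <- 100
# One column of predicted labels per bootstrap tree, one row per test case
votes <- sapply(1:n_trees, function(b) {
  boot <- train[sample(1:nrow(train), replace = TRUE), ]  # bootstrap sample
  fit  <- rpart(quality ~ ., data = boot, method = "class")
  as.character(predict(fit, newdata = test, type = "class"))
})

# Majority vote across the trees for each test row
bagged_pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
bagged_error <- mean(bagged_pred != test$quality)
```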
ROC Graph
[Figure: ROC curves; x-axis: 1 - Specificity, y-axis: Sensitivity, both from 0.0 to 1.0]
Best single tree: 0.64
Bagged 100 trees: 0.644
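For readers who want to see how the points of an ROC curve are obtained, here is a minimal base R sketch. It assumes a binary outcome and a numeric score per observation; the simulated scores and the trapezoidal area calculation are illustrative and are not a reconstruction of the figure above.

```r
# Simulated scores: "good" cases tend to score higher (illustrative only)
set.seed(2)
n <- 500
truth <- rbinom(n, size = 1, prob = 0.5)  # 1 = good, 0 = bad
score <- ifelse(truth == 1, rnorm(n, 0.6, 0.2), rnorm(n, 0.4, 0.2))

# Sweep thresholds from high to low; each threshold yields one ROC point
thresholds <- sort(unique(score), decreasing = TRUE)
sensitivity <- sapply(thresholds, function(t) mean(score[truth == 1] >= t))
one_minus_specificity <- sapply(thresholds, function(t) mean(score[truth == 0] >= t))

# Anchor the curve at (0, 0) and integrate with the trapezoidal rule
x <- c(0, one_minus_specificity)
y <- c(0, sensitivity)
auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
```

Calling `plot(x, y, type = "l", xlab = "1 - Specificity", ylab = "Sensitivity")` draws the curve with the same axes as the slide.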
Frequency Table
Evaluation of Variable Importance
Random Forest
Data Structure
Random Forest Fit
Random Forest Plot
Importance
Relative Variable Importance
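The fit output and importance plots are missing from this transcript. The following sketch shows how a forest and its variable importance are typically obtained with the `randomForest` package. The simulated data, the `noise` column, and `ntree = 200` are assumptions chosen to keep the example small.

```r
library(randomForest)

# Simulated stand-in for the wine data (illustrative only)
set.seed(99)
n <- 400
toy <- data.frame(
  alcohol   = rnorm(n, 10.4, 1),
  sulphates = rnorm(n, 0.66, 0.17),
  noise     = rnorm(n)  # unrelated to the label by construction
)
score <- toy$alcohol + rnorm(n, sd = 0.5)
toy$quality <- factor(ifelse(score > median(score), "good", "bad"))

rf_fit <- randomForest(quality ~ ., data = toy, ntree = 200, importance = TRUE)

# For classification, importance() includes MeanDecreaseAccuracy and MeanDecreaseGini
imp    <- importance(rf_fit)
ranked <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]
print(ranked)

# Out-of-bag error after the last tree, an internal estimate of test error
oob_error <- rf_fit$err.rate[rf_fit$ntree, "OOB"]
```

`varImpPlot(rf_fit)` produces the usual dot chart of these importance scores.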
Partial Dependence Plot
[Figure: partial dependence plots for alcohol, sulphates, volatile acidity, and total sulfur dioxide; y-axis: partial dependence]
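Partial dependence is model-agnostic: fix one variable at a grid value for every row, predict, and average. The sketch below computes it by hand for a single `rpart` tree on simulated data to keep dependencies light. The helper `partial_dependence` is hypothetical, and the result is on the probability scale, which may differ from the scale used in the slides.

```r
library(rpart)

# Simulated stand-in for the wine data (illustrative only)
set.seed(5)
n <- 400
toy <- data.frame(alcohol = rnorm(n, 10.4, 1), sulphates = rnorm(n, 0.66, 0.17))
score <- toy$alcohol + rnorm(n, sd = 0.5)
toy$quality <- factor(ifelse(score > median(score), "good", "bad"))

fit <- rpart(quality ~ ., data = toy, method = "class")

# Hypothetical helper: average predicted P(good) with one variable held fixed
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value  # overwrite the variable in every row
    mean(predict(model, newdata = modified, type = "prob")[, "good"])
  })
}

grid <- seq(min(toy$alcohol), max(toy$alcohol), length.out = 20)
pd_alcohol <- partial_dependence(fit, toy, "alcohol", grid)
```

Plotting `pd_alcohol` against `grid` gives one panel of the kind shown on the slide.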
CART Bagging and RF Comparison
Variable selection
CART: alcohol, total sulfur dioxide, volatile acidity, sulphates, residual sugar
Bagging: alcohol, sulphates, volatile acidity, total sulfur dioxide, density, fixed acidity, residual sugar, citric acid, pH, free sulfur dioxide, chlorides
Random Forest: alcohol, sulphates, volatile acidity, total sulfur dioxide, density, chlorides, fixed acidity, free sulfur dioxide, pH, citric acid, residual sugar
CART Bagging and RF Comparison
Conclusion
• In this case, random forest is the best prediction tool: it has the lowest estimated test error rate, ahead of CART and bagging.