Knowledge Discovery in Databases MIS 637 Professor Mahmoud Daneshmand Fall 2012 Final Project: Red Wine Recipe Data Mining By Jorge Madrazo


Profound Questions
What basic properties form the formula for a good wine? Wine making is believed to be an art, but is there a formula for a quality wine?
The providers of the data set published a paper on modeling wine preferences by data mining. How do my results compare with the paper's?

Procedure
Follow a data mining process, using SAS and SAS Enterprise Miner to execute it.
The SAS Enterprise Miner tool is modeled on the SAS Institute's data mining process, SEMMA: Sample, Explore, Modify, Model, Assess. SEMMA is similar to the CRISP-DM process.
Sample: 1,599 records. Set up a data partition:
Training 40%
Validation 30%
Test 30%
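The partition itself is done by SAS EM's Data Partition node; as a rough illustration of the same 40/30/30 random split (a plain-Python sketch, not the project's tooling; the function name and seed are my own):

```python
import random

def partition(n_records, seed=637):
    """Randomly assign record indices to train/validation/test
    in the 40/30/30 proportions used in the project."""
    rng = random.Random(seed)
    idx = list(range(n_records))
    rng.shuffle(idx)
    n_train = int(n_records * 0.40)
    n_valid = int(n_records * 0.30)
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

train, valid, test = partition(1599)
print(len(train), len(valid), len(test))  # 639 479 481
```

With 1,599 records the three slices come out to 639, 479, and 481 rows, matching the 40/30/30 proportions up to rounding.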

Explore: Data Background
Data source: UCI Machine Learning Repository, Wine Quality Data Set.
There are a red and a white wine data set; I focused on the red wine set only.
There are 11 input variables and one target variable:
fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol.
Output variable (based on sensory data): quality (score between 0 and 10).
Regarding the preferences, each sample was evaluated by a minimum of three sensory assessors (using blind tastes), who graded the wine on a scale that ranges from 0 (very bad) to 10 (excellent). The final sensory score is the median of these evaluations.

Explore: Target = Quality
People gave a quality assessment of different wines on a scale of 0-10; the actual range is 3-8.
Quality is an ordinal target.

Explore: Inputs
Correlation analysis: there is some correlation among the inputs, but not enough to discard any of them.
SAS code:

ods graphics on;
ods select MatrixPlot;
proc corr data=wino.red plots(maxpoints=100000)=matrix(histogram nvar=all);
  var quality alcohol ph fixed_acidity density volatile_acidity sulphates citric_acid;
run;
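PROC CORR computes pairwise Pearson correlations. For readers without SAS, the coefficient behind the matrix can be sketched in plain Python (an illustration on toy data, not the project's code):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy check: a perfectly linear relationship gives r close to 1.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```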

Explore: Correlation Graphs

Explore: Chi2 Statistics of Inputs

Explore: Worth of Inputs

Explore: Worth Graph
The worth tracks closely with the chi-square statistic.

Modify
At this stage, no modifications are done.

Model: Selection
Because I want to list the important elements in what is considered a quality wine, I chose a Decision Tree.
Configuration: the splitting rule is Entropy, and the maximum branch is set to 5. Therefore a C4.5-type algorithm is being implemented.

Assess: Initial Results
The resulting tree is bushy and too intricate for a simple recommendation: over 20 leaf nodes.
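The Entropy splitting rule chooses, at each node, the split that most reduces class impurity (information gain). A minimal sketch of that computation (plain Python, with my own helper names, not SAS EM internals):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, branches):
    """Entropy reduction when `labels` are split into `branches`
    (a list of label sublists), as an entropy-based splitter scores it."""
    n = len(labels)
    return entropy(labels) - sum(len(b) / n * entropy(b) for b in branches)

# A 50/50 node has entropy 1 bit; a perfectly pure split recovers all of it.
labels = ["good"] * 5 + ["bad"] * 5
print(information_gain(labels, [["good"] * 5, ["bad"] * 5]))  # 1.0
```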

Modify: Target
Change the target so that it becomes binary: a new variable in the model called isGood. Any rating over 6 is categorized as isGood.
SAS code:

data wino.xx;
  set wino.red;
  if (quality > 6) then isgood = 1;
  else isgood = 0;
run;

proc print data = wino.xx;
  title 'xx';
run;
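The same recoding can be expressed in one line outside SAS; a sketch mirroring the DATA step above (function name is my own):

```python
def derive_isgood(qualities):
    """Recode the 0-10 quality score into the binary target isGood:
    1 when quality is above 6, else 0 (mirrors the SAS DATA step)."""
    return [1 if q > 6 else 0 for q in qualities]

print(derive_isgood([3, 5, 6, 7, 8]))  # [0, 0, 0, 1, 1]
```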

Explore: Target = isGood

Model Strategy for isGood
Model with a Decision Tree, in the hope of more descriptive results.
Also model with a Neural Network, to aid in assessment and allow comparison.

Model: Decision Tree
ProbF splitting criterion at significance level 0.2.
Maximum branch size = 5.

Assess: Decision Tree Results
A much simpler tree.

Assess: Decision Tree Results 2
Leaf statistics.

Assess: Variable Importance

Variable Name        | Splitting Rules | Surrogate Rules | Importance  | Validation Importance | Validation/Training Ratio
alcohol              | 1 | 0 | 1           | 1           | 1
density              | 0 | 1 | 0.77055175  | 0.77055175  | 1
volatile_acidity     | 0 | 1 | 0.728868987 | 0.728868987 | 1
sulphates            | 1 | 0 | 0.671675628 | 0.477710505 | 0.711222032
fixed_acidity        | 0 | 1 | 0.553719729 | 0.393817671 | 0.711222032
citric_acid          | 0 | 1 | 0.549750361 | 0.390994569 | 0.711222032
free_sulfur_dioxide  | 0 | 0 | 0 | 0 | NaN
pH                   | 0 | 0 | 0 | 0 | NaN
chlorides            | 0 | 0 | 0 | 0 | NaN
total_sulfur_dioxide | 0 | 0 | 0 | 0 | NaN
residual_sugar       | 0 | 0 | 0 | 0 | NaN

Event Classification Table (Target = isgood)

Data Role | False Negative | True Negative | False Positive | True Positive
TRAIN     | 53 | 539 | 14 | 34
VALIDATE  | 43 | 403 | 12 | 21

Model: Neural Network
Positive: better at predicting. Negative: the model is hard to interpret.
Configured with 3 hidden nodes.

Modify: Input Variables to NN
Because of the complexity of the NN, it is recommended to prune variables prior to running the network.

Modify: R2 Filter

Variable Name        | Role     | Measurement Level | Reason for Rejection
alcohol              | INPUT    | INTERVAL          |
chlorides            | INPUT    | INTERVAL          |
citric_acid          | REJECTED | INTERVAL          | Varsel: small R-square value
density              | INPUT    | INTERVAL          |
fixed_acidity        | INPUT    | INTERVAL          |
free_sulfur_dioxide  | INPUT    | INTERVAL          |
pH                   | REJECTED | INTERVAL          | Varsel: small R-square value
residual_sugar       | REJECTED | INTERVAL          | Varsel: small R-square value
sulphates            | INPUT    | INTERVAL          |
total_sulfur_dioxide | REJECTED | INTERVAL          | Varsel: small R-square value
volatile_acidity     | INPUT    | INTERVAL          |
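The Variable Selection node rejects interval inputs whose simple R-square against the target is too small. A sketch of that filter in plain Python (the cutoff value and function names are illustrative assumptions, not SAS EM's defaults):

```python
def r_squared(xs, ys):
    """R-square of a simple linear regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

def r2_filter(inputs, target, cutoff=0.005):
    """Split variable names into (kept, rejected) by R-square vs. target."""
    kept, rejected = [], []
    for name, xs in inputs.items():
        (kept if r_squared(xs, target) >= cutoff else rejected).append(name)
    return kept, rejected

# Toy data: one strongly related input, one unrelated input.
inputs = {"strong": [2, 4, 6, 8], "weak": [1, 2, 2, 1]}
print(r2_filter(inputs, [1, 2, 3, 4]))  # (['strong'], ['weak'])
```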

Model: NN
Specify 3 hidden units in the hidden layer.

Assess: NN Results
The results are hard to interpret into a recipe.

The NEURAL Procedure: Optimization Results, Parameter Estimates

N | Parameter               | Estimate  | Gradient of Objective Function
1 | alcohol_H11             | 3.679818  | -0.001411
2 | chlorides_H11           | 0.520190  | -0.000479
3 | density_H11             | -2.171623 | 0.000883
4 | fixed_acidity_H11       | -0.055929 | 0.000179
5 | free_sulfur_dioxide_H11 | 0.403412  | 0.000139
6 | sulphates_H11           | -4.954290 | -0.000224
7 | volatile_acidity_H11    | 2.686209  | 0.000205
8 | alcohol_H12             | -0.313005 | 0.001209
9 | chlorides_H12           | 0.200973  | 0.000759

Assess: Comparative Results
Receiver Operating Characteristic (ROC) chart for NN vs Decision Tree.

ROC is useful for binary results.
Sensitivity is the True Positive Rate: true positives / all actual positives.
Specificity is the True Negative Rate: true negatives / all actual negatives.
1 - Specificity is the False Positive Rate.
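Applying these definitions to the decision tree's TRAIN classification counts reported earlier (TP = 34, FP = 14, TN = 539, FN = 53) gives one point on the ROC curve; a small sketch (function name is my own):

```python
def roc_point(tp, fp, tn, fn):
    """Sensitivity (TPR) and 1 - specificity (FPR) for one cutoff."""
    sensitivity = tp / (tp + fn)   # true positives / all actual positives
    specificity = tn / (tn + fp)   # true negatives / all actual negatives
    return sensitivity, 1 - specificity

# TRAIN counts reported for the decision tree on the isGood target.
tpr, fpr = roc_point(tp=34, fp=14, tn=539, fn=53)
print(round(tpr, 3), round(fpr, 3))  # 0.391 0.025
```

So on the training data the tree catches about 39% of the good wines while misclassifying only about 2.5% of the others.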

Each point on the curves represents a cutoff probability. Points closer to the upper-right corner correspond to low cutoff probabilities; points closer to the lower-left corner correspond to high cutoff probabilities. The extreme points (1,1) and (0,0) represent rules where all cases are classified into either class 1 (event) or class 0 (non-event). For a given false positive rate (the probability of a non-event being predicted as an event), the curve indicates the corresponding true positive rate, the probability of an event being correctly predicted as an event. Therefore, for a given false positive rate on the 1 - Specificity axis, the true positive rate should be as high as possible. The different curves in the chart exhibit various degrees of concavity: the higher the degree of concavity, the better the model is expected to be. A poor model of random predictions appears as a flat 45-degree line, while curves that push upward and to the left represent better models.

Assess: Comparative Results
Cumulative Lift for NN vs Decision Tree.

The Cumulative Lift Chart shows the lift factor: how many times better it is to use the model than not to use one. The x-coordinate shows the percentage of the cumulated number of sorted data records for the current model. The data records are sorted in descending order by the confidence that the model assigns to a prediction of the selected value of the target field.

Assess: Comparison with Reference Paper
The reference paper used R-Miner, applying a Support Vector Machine (SVM) and a Neural Network. The author applied techniques to extract the relative importance of variables, and attempted to predict every quality level. He noted the importance of alcohol and sulphates: an increase in sulphates might be related to the fermenting nutrition, which is very important to improve the wine aroma.

Assess: Paper Variable Importance
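Cumulative lift at a given depth is the event rate in the top-scored slice of records divided by the overall event rate. A minimal sketch of that computation (my own function name; toy scores, not the project's model output):

```python
def cumulative_lift(scores, actuals, depth):
    """Cumulative lift at a depth in (0, 1]: event rate among the
    top-scored fraction of records vs. the overall event rate."""
    ranked = sorted(zip(scores, actuals), key=lambda p: -p[0])
    n_top = max(1, int(len(ranked) * depth))
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    base_rate = sum(actuals) / len(actuals)
    return top_rate / base_rate

# A model that scores every event above every non-event:
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1, 0.05, 0.04, 0.03, 0.02]
actuals = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(cumulative_lift(scores, actuals, 0.2))  # 5.0
```

With a 20% event rate, a perfect ranking concentrates all events in the top 20% of records, so the lift there is 1/0.2 = 5; at depth 1.0 the lift of any model is 1.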

Overall Project in SAS EM

References
UCI Machine Learning Repository, Wine Quality Data Set: http://archive.ics.uci.edu/ml/datasets/Wine
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. http://www3.dsi.uminho.pt/pcortez/wine5.pdf