© galit shmueli and peter bruce 2010 chapter 6: multiple linear regression data mining for business...
DESCRIPTION
Explanatory Modeling Goal: Explain relationship between predictors (explanatory variables) and target Familiar use of regression in data analysis Model Goal: Fit the data well and understand the contribution of explanatory variables to the model “goodness-of-fit”: R 2, residual analysis, p-valuesTRANSCRIPT
© Galit Shmueli and Peter Bruce 2010
Chapter 6: Multiple Linear Regression
Data Mining for Business AnalyticsShmueli, Patel & Bruce
TopicsExplanatory vs. predictive modeling with
regressionExample: prices of Toyota CorollasFitting a predictive modelAssessing predictive accuracySelecting a subset of predictors
Explanatory ModelingGoal: Explain relationship between predictors (explanatory variables) and target
Familiar use of regression in data analysis
Model Goal: Fit the data well and understand the contribution of explanatory variables to the model
“goodness-of-fit”: R2, residual analysis, p-values
Predictive ModelingGoal: predict target values in other data where we have predictor values, but not target valuesClassic data mining contextModel Goal: Optimize predictive accuracyTrain model on training dataAssess performance on validation (hold-
out) dataExplaining role of predictors is not primary
purpose (but useful)
Example: Prices of Toyota CorollaToyotaCorolla.xls
Goal: predict prices of used Toyota Corollas based on their specification
Data: Prices of 1442 used Toyota Corollas, with their specification information
Data Sample(showing only the variables to be used in analysis)
Price Age KM Fuel_Type HP
Metallic
Automatic cc Doors
Quart_Tax Weight
13500 23 46986 Diesel 90 1 0 2000 3 210 116513750 23 72937 Diesel 90 1 0 2000 3 210 116513950 24 41711 Diesel 90 1 0 2000 3 210 116514950 26 48000 Diesel 90 0 0 2000 3 210 116513750 30 38500 Diesel 90 0 0 2000 3 210 117012950 32 61000 Diesel 90 0 0 2000 3 210 117016900 27 94612 Diesel 90 1 0 2000 3 210 124518600 30 75889 Diesel 90 1 0 2000 3 210 124521500 27 19700 Petrol 192 0 0 1800 3 100 118512950 23 71138 Diesel 69 0 0 1900 3 185 110520950 25 31461 Petrol 192 0 0 1800 3 100 1185
Variables Used
Price in EurosAge in months as of 8/04KM (kilometers)Fuel Type (diesel, petrol, CNG)HP (horsepower)Metallic color (1=yes, 0=no)Automatic transmission (1=yes,
0=no)CC (cylinder volume)DoorsQuart_Tax (road tax)Weight (in kg)
PreprocessingFuel type is categorical, must be transformed into binary variables
Diesel (1=yes, 0=no)
CNG (1=yes, 0=no)
None needed for “Petrol” (reference category)
Subset of the records selected for training partition (limited # of variables shown)
60% training data / 40% validation data
Price Age KM HP MetallicAutoma
tic cc Doors Quart_Tax WeightFuel_Type_CNG
Fuel_Type_Dies
el
Fuel_Type_Petr
ol13500 23 46986 90 1 0 2000 3 210 1165 0 1 013750 30 38500 90 0 0 2000 3 210 1170 0 1 018600 30 75889 90 1 0 2000 3 210 1245 0 1 020950 25 31461 192 0 0 1800 3 100 1185 0 0 119950 22 43610 192 0 0 1800 3 100 1185 0 0 122500 32 34131 192 1 0 1800 3 100 1185 0 0 122000 28 18739 192 0 0 1800 3 100 1185 0 0 117950 24 21716 110 1 0 1600 3 85 1105 0 0 116950 30 64359 110 1 0 1600 3 85 1105 0 0 115950 30 67660 110 1 0 1600 3 85 1105 0 0 116950 29 43905 110 0 1 1600 3 100 1170 0 0 115950 28 56349 110 1 0 1600 3 85 1120 0 0 1
The Fitted Regression ModelInputVariables Coefficient Std. Error t-Statistic P-ValueIntercept -1833.3 1363.468725 -1.34458531 0.179118Age -121.607 3.320627169 -36.6218141 5.8E-177KM -0.01928 0.001699871 -11.3431708 7.2E-28HP 29.75675 4.224060494 7.044584823 3.84E-12Metallic 101.1609 97.8079189 1.034280977 0.301299Automatic 456.6505 199.4455759 2.289599302 0.022289cc -0.0223 0.091510026 -0.2437426 0.807489Doors -118.378 50.63479867 -2.33788041 0.019625Quart_Tax 11.75023 2.201777534 5.336700883 1.21E-07Weight 16.06755 1.414469773 11.35941787 6.13E-28Fuel_Type_CNG -2367.26 448.7481365 -5.27524242 1.68E-07Fuel_Type_Diesel -1077.4 354.0357916 -3.04319608 0.002413
Error reports
Predicted Values
Predicted price computed using regression coefficients
Residuals = difference between actual and predicted prices
PredictedValue
ActualValue Residual
16952.1212 13500 -3452.1211816243.6727 13750 -2493.6727316828.9683 18600 1771.0316820052.9734 20950 897.02659820183.5395 19950 -233.53950719251.3998 22500 3248.6001919933.4558 22000 2066.5441616566.3936 17950 1383.6064115014.5103 16950 1935.4897514950.8606 15950 999.13937217106.644 16950 -156.644035
15653.1865 15950 296.81347416118.44 16950 831.559982
Distribution of Residuals
Symmetric distributionSome outliers
Selecting Subsets of PredictorsGoal: Find parsimonious model (the simplest model that performs sufficiently well)
More robustHigher predictive accuracy
Exhaustive Search
Partial Search AlgorithmsForwardBackwardStepwise
Exhaustive SearchAll possible subsets of predictors assessed
(single, pairs, triplets, etc.)Computationally intensiveJudge by “adjusted R2”
)1(111 22 RpnnRadj
Penalty for number of predictors
Forward SelectionStart with no predictorsAdd them one by one (add the one with
largest contribution)Stop when the addition is not statistically
significant
Backward EliminationStart with all predictorsSuccessively eliminate least useful
predictors one by oneStop when all remaining predictors have
statistically significant contribution
StepwiseLike Forward SelectionExcept at each step, also consider dropping
non-significant predictors
Backward elimination (showing last 7 models)
1 2 3 4 5 6 7 8Constant Age_08_04 * * * * * *Constant Age_08_04 Weight * * * * *Constant Age_08_04 KM Weight * * * *Constant Age_08_04 KMFuel_Type_Petrol Weight * * *Constant Age_08_04 KMFuel_Type_Petrol Quarterly_Tax Weight * *Constant Age_08_04 KMFuel_Type_Petrol HP Quarterly_Tax Weight *Constant Age_08_04 KMFuel_Type_Petrol HP Automatic Quarterly_Tax Weight
Top model has a single predictor (Age_08_04)Second model has two predictors, etc.
All 12 Models
Diagnostics for the 12 models
Good model has:High adj-R2, Cp = # predictors
Next stepSubset selection methods give candidate
models that might be “good models”Do not guarantee that “best” model is
indeed bestAlso, “best” model can still have insufficient
predictive accuracyMust run the candidates and assess
predictive accuracy (click “choose subset”)
Model with only 6 predictors
The Regression Model
Coefficient Std. Error p-value SS-3874.492188 1415.003052 0.00640071 97276411904-123.4366303 3.33806777 0 8033339392-0.01749926 0.00173714 0 2515745282409.154297 319.5795288 0 504956719.70204735 4.22180223 0.00000394 29133657616.88731384 2.08484554 0 19239086415.91809368 1.26474357 0 281026176
HPQuarterly_TaxWeight
Constant termAge_08_04KMFuel_Type_Petrol
Input variables
Training Data scoring - Summary Report
Total sum of squared errors RMS Error Average Error
1516825972 1326.521353 -0.000143957
Validation Data scoring - Summary Report
Total sum of squared errors RMS Error Average Error
1021510219 1334.029433 118.4483556
Model Fit
Predictive performance(compare to 12-predictor model!)
SummaryLinear regression models are very popular
tools, not only for explanatory modeling, but also for prediction
A good predictive model has high predictive accuracy (to a useful practical level)
Predictive models are built using a training data set, and evaluated on a separate validation data set
Removing redundant predictors is key to achieving predictive accuracy and robustness
Subset selection methods help find “good” candidate models. These should then be run and assessed.