© galit shmueli and peter bruce 2010 chapter 6: multiple linear regression data mining for business...

© Galit Shmueli and Peter Bruce 2010

Chapter 6: Multiple Linear Regression

Data Mining for Business Analytics

Shmueli, Patel & Bruce

Topics

Explanatory vs. predictive modeling with regression

Example: prices of Toyota Corollas

Fitting a predictive model

Assessing predictive accuracy

Selecting a subset of predictors

Explanatory Modeling

Goal: Explain the relationship between predictors (explanatory variables) and target

Familiar use of regression in data analysis

Model Goal: Fit the data well and understand the contribution of explanatory variables to the model

“Goodness-of-fit”: R², residual analysis, p-values

Predictive Modeling

Goal: Predict target values in other data where we have predictor values, but not target values

Classic data mining context

Model Goal: Optimize predictive accuracy

Train model on training data

Assess performance on validation (hold-out) data

Explaining role of predictors is not primary purpose (but useful)

Example: Prices of Toyota Corolla

ToyotaCorolla.xls

Goal: predict prices of used Toyota Corollas based on their specification

Data: Prices of 1442 used Toyota Corollas, with their specification information

Data Sample (showing only the variables to be used in analysis)

Price  Age  KM     Fuel_Type  HP   Metallic  Automatic  cc    Doors  Quart_Tax  Weight
13500  23   46986  Diesel     90   1         0          2000  3      210        1165
13750  23   72937  Diesel     90   1         0          2000  3      210        1165
13950  24   41711  Diesel     90   1         0          2000  3      210        1165
14950  26   48000  Diesel     90   0         0          2000  3      210        1165
13750  30   38500  Diesel     90   0         0          2000  3      210        1170
12950  32   61000  Diesel     90   0         0          2000  3      210        1170
16900  27   94612  Diesel     90   1         0          2000  3      210        1245
18600  30   75889  Diesel     90   1         0          2000  3      210        1245
21500  27   19700  Petrol     192  0         0          1800  3      100        1185
12950  23   71138  Diesel     69   0         0          1900  3      185        1105
20950  25   31461  Petrol     192  0         0          1800  3      100        1185

Variables Used

Price in Euros

Age in months as of 8/04

KM (kilometers)

Fuel Type (diesel, petrol, CNG)

HP (horsepower)

Metallic color (1=yes, 0=no)

Automatic transmission (1=yes, 0=no)

CC (cylinder volume)

Doors

Quart_Tax (road tax)

Weight (in kg)

Preprocessing

Fuel type is categorical, must be transformed into binary variables

Diesel (1=yes, 0=no)

CNG (1=yes, 0=no)

None needed for “Petrol” (reference category)
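The dummy coding described above can be sketched in a few lines of plain Python. The function name `encode_fuel_type` and the output column names are illustrative assumptions, not part of the original slides; the point is only that Petrol, as the reference category, is represented by both indicators being 0.

```python
def encode_fuel_type(fuel_type):
    """Return dummy variables for Fuel_Type, with Petrol as the reference category.

    Hypothetical helper for illustration; the column names are assumptions.
    """
    fuel = fuel_type.strip().lower()
    if fuel not in {"diesel", "cng", "petrol"}:
        raise ValueError(f"unknown fuel type: {fuel_type!r}")
    return {
        "Fuel_Type_Diesel": 1 if fuel == "diesel" else 0,
        "Fuel_Type_CNG": 1 if fuel == "cng" else 0,
    }


print(encode_fuel_type("Diesel"))
print(encode_fuel_type("Petrol"))  # reference category: both indicators are 0
```

Two indicators suffice for three categories: a third indicator would always equal 1 minus the sum of the other two, so it carries no extra information.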

Subset of the records selected for training partition (limited # of variables shown)

60% training data / 40% validation data
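A random 60/40 partition might be sketched as follows. This is a minimal illustration using only the standard library; the function name `partition`, the seed, and the use of record indices in place of real rows are all assumptions, and the slides do not say how the software draws its partition.

```python
import random


def partition(records, train_fraction=0.6, seed=1):
    """Shuffle the records and split them into training and validation lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in data: 1000 record indices instead of actual car records.
train, valid = partition(range(1000))
print(len(train), len(valid))
```

Every record lands in exactly one partition, so the validation data play no part in fitting the model.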

Price  Age  KM     HP   Metallic  Automatic  cc    Doors  Quart_Tax  Weight  Fuel_Type_CNG  Fuel_Type_Diesel  Fuel_Type_Petrol
13500  23   46986  90   1         0          2000  3      210        1165    0              1                 0
13750  30   38500  90   0         0          2000  3      210        1170    0              1                 0
18600  30   75889  90   1         0          2000  3      210        1245    0              1                 0
20950  25   31461  192  0         0          1800  3      100        1185    0              0                 1
19950  22   43610  192  0         0          1800  3      100        1185    0              0                 1
22500  32   34131  192  1         0          1800  3      100        1185    0              0                 1
22000  28   18739  192  0         0          1800  3      100        1185    0              0                 1
17950  24   21716  110  1         0          1600  3      85         1105    0              0                 1
16950  30   64359  110  1         0          1600  3      85         1105    0              0                 1
15950  30   67660  110  1         0          1600  3      85         1105    0              0                 1
16950  29   43905  110  0         1          1600  3      100        1170    0              0                 1
15950  28   56349  110  1         0          1600  3      85         1120    0              0                 1

The Fitted Regression Model

Input Variables   Coefficient  Std. Error   t-Statistic   P-Value
Intercept         -1833.3      1363.468725  -1.34458531   0.179118
Age               -121.607     3.320627169  -36.6218141   5.8E-177
KM                -0.01928     0.001699871  -11.3431708   7.2E-28
HP                29.75675     4.224060494  7.044584823   3.84E-12
Metallic          101.1609     97.8079189   1.034280977   0.301299
Automatic         456.6505     199.4455759  2.289599302   0.022289
cc                -0.0223      0.091510026  -0.2437426    0.807489
Doors             -118.378     50.63479867  -2.33788041   0.019625
Quart_Tax         11.75023     2.201777534  5.336700883   1.21E-07
Weight            16.06755     1.414469773  11.35941787   6.13E-28
Fuel_Type_CNG     -2367.26     448.7481365  -5.27524242   1.68E-07
Fuel_Type_Diesel  -1077.4      354.0357916  -3.04319608   0.002413

Error reports

Predicted Values

Predicted price computed using regression coefficients

Residual = actual price minus predicted price

Predicted Value  Actual Value  Residual
16952.1212       13500         -3452.12118
16243.6727       13750         -2493.67273
16828.9683       18600         1771.03168
20052.9734       20950         897.026598
20183.5395       19950         -233.539507
19251.3998       22500         3248.60019
19933.4558       22000         2066.54416
16566.3936       17950         1383.60641
15014.5103       16950         1935.48975
14950.8606       15950         999.139372
17106.644        16950         -156.644035
15653.1865       15950         296.813474
16118.44         16950         831.559982
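As a worked check, the first predicted value can be reproduced by plugging the first training record into the fitted coefficients. The sketch below copies the coefficients as printed in the fitted-model slide (so they are rounded) and the predictor values of the first training record; the names `COEF`, `predict_price` and `first_car` are illustrative assumptions.

```python
# Coefficients as printed in the fitted-model slide (rounded as shown there).
COEF = {
    "Intercept": -1833.3, "Age": -121.607, "KM": -0.01928, "HP": 29.75675,
    "Metallic": 101.1609, "Automatic": 456.6505, "cc": -0.0223,
    "Doors": -118.378, "Quart_Tax": 11.75023, "Weight": 16.06755,
    "Fuel_Type_CNG": -2367.26, "Fuel_Type_Diesel": -1077.4,
}


def predict_price(car):
    """Intercept plus the sum of coefficient * predictor value."""
    return COEF["Intercept"] + sum(COEF[name] * value for name, value in car.items())


# First training record (actual price 13500 Euros).
first_car = {
    "Age": 23, "KM": 46986, "HP": 90, "Metallic": 1, "Automatic": 0,
    "cc": 2000, "Doors": 3, "Quart_Tax": 210, "Weight": 1165,
    "Fuel_Type_CNG": 0, "Fuel_Type_Diesel": 1,
}
actual = 13500
predicted = predict_price(first_car)
residual = actual - predicted  # actual minus predicted
print(round(predicted, 2), round(residual, 2))
```

The result agrees with the tabulated prediction of 16952.12 to within the rounding of the printed coefficients, and the negative residual shows that the model overpredicts this car's price.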

Distribution of Residuals

Symmetric distribution

Some outliers

Selecting Subsets of Predictors

Goal: Find parsimonious model (the simplest model that performs sufficiently well)

More robust

Higher predictive accuracy

Exhaustive Search

Partial Search Algorithms

Forward

Backward

Stepwise

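Exhaustive Search

All possible subsets of predictors assessed (single, pairs, triplets, etc.)

Computationally intensive

Judge by “adjusted R²”

$$R^2_{adj} = 1 - \frac{n-1}{n-p-1}\,(1 - R^2)$$

where n is the number of records and p is the number of predictors

Penalty for number of predictors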
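To make the cost of exhaustive search and the effect of the penalty concrete, here is a small sketch. It enumerates every non-empty subset of the 11 predictors from the fitted model and evaluates adjusted R² (one minus (n − 1)/(n − p − 1) times (1 − R²)) for made-up inputs. The R², n and p values passed to `adjusted_r2` are hypothetical, chosen only to show the penalty, and both function names are assumptions.

```python
from itertools import combinations


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted on n records."""
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)


def all_subsets(predictors):
    """Yield every non-empty subset (singles, pairs, triplets, ...)."""
    for size in range(1, len(predictors) + 1):
        yield from combinations(predictors, size)


predictors = ["Age", "KM", "HP", "Metallic", "Automatic", "cc", "Doors",
              "Quart_Tax", "Weight", "Fuel_Type_CNG", "Fuel_Type_Diesel"]

# Number of candidate models an exhaustive search must fit: 2**11 - 1.
print(sum(1 for _ in all_subsets(predictors)))

# Hypothetical numbers: same R^2, more predictors -> lower adjusted R^2.
print(adjusted_r2(0.90, n=101, p=5))
print(adjusted_r2(0.90, n=101, p=10))
```

The subset count doubles with each additional predictor, which is why the partial search algorithms exist.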

Forward Selection

Start with no predictors

Add them one by one (add the one with largest contribution)

Stop when the addition is not statistically significant
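The forward selection loop can be sketched in pure Python as below. Several things here are assumptions rather than the method used by the software in the slides: the "largest contribution" is measured by the reduction in the sum of squared errors, significance is judged by a partial F-statistic with a cutoff of 4.0 (roughly a 5% level for large samples), and the data are synthetic. The helper names `ols_sse` and `forward_selection` are hypothetical.

```python
import random


def ols_sse(rows, y, cols):
    """SSE of a least-squares fit of y on an intercept plus the columns in `cols`."""
    X = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    M = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(M[r][i]))
        M[i], M[pivot] = M[pivot], M[i]
        for r in range(k):
            if r != i:
                factor = M[r][i] / M[i][i]
                M[r] = [a - factor * b for a, b in zip(M[r], M[i])]
    beta = [M[i][k] / M[i][i] for i in range(k)]
    return sum((yi - sum(b * xi for b, xi in zip(beta, x))) ** 2
               for x, yi in zip(X, y))


def forward_selection(rows, y, candidates, f_to_enter=4.0):
    """Add predictors one at a time until the best addition is not significant."""
    selected, remaining = [], list(candidates)
    sse_current = ols_sse(rows, y, selected)  # intercept-only model
    while remaining:
        # Candidate giving the largest drop in SSE when added.
        sse_new, best = min((ols_sse(rows, y, selected + [c]), c) for c in remaining)
        df_resid = len(y) - (len(selected) + 2)  # intercept + selected + candidate
        f_stat = (sse_current - sse_new) / (sse_new / df_resid)
        if f_stat < f_to_enter:  # assumed cutoff, about the 5% level
            break
        selected.append(best)
        remaining.remove(best)
        sse_current = sse_new
    return selected


# Synthetic data: columns 0 and 1 drive y, column 2 is unrelated noise.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [3 * r[0] - 2 * r[1] + rng.gauss(0, 1) for r in rows]
print(forward_selection(rows, y, candidates=[0, 1, 2]))
```

The two informative columns should enter first, strongest first. Whether the noise column also slips in depends on the random draw, which reflects the false-positive risk of any significance-based stopping rule.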

Backward Elimination

Start with all predictors

Successively eliminate least useful predictors one by one

Stop when all remaining predictors have statistically significant contribution
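The first step of backward elimination can be illustrated with the p-values printed for the full fitted model: the least useful predictor is taken to be the one with the largest p-value. This sketch covers that single step only. In a real run the model is refitted after each removal and the p-values change. The 0.05 cutoff and the function name `next_to_drop` are assumptions.

```python
# P-values of the full model as printed in the fitted-model slide (intercept omitted).
P_VALUES = {
    "Age": 5.8e-177, "KM": 7.2e-28, "HP": 3.84e-12, "Metallic": 0.301299,
    "Automatic": 0.022289, "cc": 0.807489, "Doors": 0.019625,
    "Quart_Tax": 1.21e-07, "Weight": 6.13e-28,
    "Fuel_Type_CNG": 1.68e-07, "Fuel_Type_Diesel": 0.002413,
}


def next_to_drop(p_values, alpha=0.05):
    """Return the least significant predictor, or None if all pass the cutoff."""
    worst = max(p_values, key=p_values.get)
    return worst if p_values[worst] > alpha else None


print(next_to_drop(P_VALUES))  # cc has the largest p-value (0.807489)
```

Elimination stops once `next_to_drop` returns `None`, that is, when every remaining predictor is significant at the chosen level.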

Stepwise

Like Forward Selection

Except at each step, also consider dropping non-significant predictors

Backward elimination (showing last 7 models)

1         2          3       4                 5              6              7              8
Constant  Age_08_04  *       *                 *              *              *              *
Constant  Age_08_04  Weight  *                 *              *              *              *
Constant  Age_08_04  KM      Weight            *              *              *              *
Constant  Age_08_04  KM      Fuel_Type_Petrol  Weight         *              *              *
Constant  Age_08_04  KM      Fuel_Type_Petrol  Quarterly_Tax  Weight         *              *
Constant  Age_08_04  KM      Fuel_Type_Petrol  HP             Quarterly_Tax  Weight         *
Constant  Age_08_04  KM      Fuel_Type_Petrol  HP             Automatic      Quarterly_Tax  Weight

Top model has a single predictor (Age_08_04)

Second model has two predictors, etc.

All 12 Models

Diagnostics for the 12 models

A good model has: high adjusted R², Cp close to the number of predictors (counting the constant term, i.e. p + 1)
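For reference, one common textbook definition of Mallows' Cp for a subset with p predictors is given below. The notation is an assumption (the slides do not define Cp), and the software used for the diagnostics may follow a slightly different convention.

$$C_p = \frac{SSE_p}{\hat{\sigma}^2_{full}} - n + 2(p + 1)$$

Here SSE_p is the sum of squared errors of the subset model, \hat{\sigma}^2_{full} is the estimated error variance from the model with all predictors, and n is the number of records. Under this definition a subset with little bias has Cp close to p + 1, the number of fitted coefficients including the constant.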

Next step

Subset selection methods give candidate models that might be “good models”

Do not guarantee that “best” model is indeed best

Also, “best” model can still have insufficient predictive accuracy

Must run the candidates and assess predictive accuracy (click “choose subset”)

Model with only 6 predictors

The Regression Model

Input variables   Coefficient   Std. Error   p-value     SS
Constant term     -3874.492188  1415.003052  0.00640071  97276411904
Age_08_04         -123.4366303  3.33806777   0           8033339392
KM                -0.01749926   0.00173714   0           251574528
Fuel_Type_Petrol  2409.154297   319.5795288  0           5049567
HP                19.70204735   4.22180223   0.00000394  291336576
Quarterly_Tax     16.88731384   2.08484554   0           192390864
Weight            15.91809368   1.26474357   0           281026176

Training Data scoring - Summary Report

Total sum of squared errors RMS Error Average Error

1516825972 1326.521353 -0.000143957

Validation Data scoring - Summary Report

Total sum of squared errors RMS Error Average Error

1021510219 1334.029433 118.4483556

Model Fit

Predictive performance (compare to 12-predictor model!)
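The three quantities in these summary reports can be computed from actual and predicted values as sketched below. The function name `error_summary` is an assumption, and errors are taken as actual minus predicted. The partition sizes 862 and 574 used in the consistency check are not stated in the slides. They are inferred here from the reported totals, since RMS error should equal the square root of (total squared error / n).

```python
import math


def error_summary(actual, predicted):
    """Total squared error, RMS error and average error (actual minus predicted)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)
    return {
        "sse": sse,
        "rmse": math.sqrt(sse / len(errors)),
        "avg_error": sum(errors) / len(errors),
    }


# Three records taken from the predicted-values slide.
print(error_summary([13500, 13750, 18600], [16952.1212, 16243.6727, 16828.9683]))

# Consistency check against the reported totals (n values are inferred, see above).
print(math.sqrt(1516825972 / 862))  # compare with the training RMS error
print(math.sqrt(1021510219 / 574))  # compare with the validation RMS error
```

An average error near zero on the training data is expected for least squares with an intercept. The validation RMS error is the figure to use when comparing candidate models.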

Summary

Linear regression models are very popular tools, not only for explanatory modeling, but also for prediction

A good predictive model has high predictive accuracy (to a useful practical level)

Predictive models are built using a training data set, and evaluated on a separate validation data set

Removing redundant predictors is key to achieving predictive accuracy and robustness

Subset selection methods help find “good” candidate models. These should then be run and assessed.
