Chapter 15
Multiple Regression
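Multiple Regression Model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$
Multiple Regression Equation
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$
Estimated Multiple Regression Equation
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$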
Car Data
MPG   Weight   Year   Cylinders
18    3504     70     8
15    3693     70     8
18    3436     70     8
16    3433     70     8
17    3449     70     8
15    4341     70     8
14    4354     70     8
14    4312     70     8
14    4425     70     8
15    3850     70     8
.     .        .      .
Continuing on for 397 observations
Multiple Regression, Example
             Coefficients   Standard Error   t Stat
Intercept    46.3           0.800            57.8
Weight       -0.00765       0.000259         -29.4
R Square 0.687

             Coefficients   Standard Error   t Stat
Intercept    -14.7          3.96             -3.71
Weight       -0.00665       0.000214         -31.0
Year         0.763          0.0490           15.5
R Square 0.807
Multiple Regression, Example
             Coefficients   Standard Error   t Stat
Intercept    -14.4          4.03             -3.58
Weight       -0.00652       0.000460         -14.1
Year         0.760          0.0498           15.2
Cylinders    -0.0741        0.232            -0.319
R Square 0.807
Predicted MPG for a car weighing 4000 lbs, built in 1980, with 6 cylinders:
-14.4 - 0.00652(4000) + 0.760(80) - 0.0741(6) = -14.4 - 26.08 + 60.8 - 0.4446 = 19.88
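The prediction arithmetic can be checked with a short Python sketch. The coefficients are the rounded values from the three-variable output; the function name `predict_mpg` is made up for illustration and is not part of the original slides.

```python
# Rounded coefficients from the three-variable regression output.
INTERCEPT = -14.4
B_WEIGHT = -0.00652
B_YEAR = 0.760
B_CYLINDERS = -0.0741


def predict_mpg(weight, year, cylinders):
    """Evaluate the estimated regression equation (hypothetical helper)."""
    return INTERCEPT + B_WEIGHT * weight + B_YEAR * year + B_CYLINDERS * cylinders


# Year is coded as two digits in the data, so 1980 is entered as 80.
predicted = predict_mpg(weight=4000, year=80, cylinders=6)
print(round(predicted, 2))
```

Because the inputs are rounded coefficients, the result should be read as approximate.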
Sums of Squares
$SSE = \sum (y_i - \hat{y}_i)^2$
$SSR = \sum (\hat{y}_i - \bar{y})^2$
$SST = \sum (y_i - \bar{y})^2$
$SST = SSR + SSE$
Multiple Coefficient of Determination
The share of the variation explained by the estimated model.
$R^2 = SSR/SST$
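As a rough illustration of the sums of squares and R Square, the sketch below fits a one-variable model (MPG on Weight) by least squares. It assumes only the ten rows shown in the Car Data table, so its numbers will not match the 397-observation output; the variable names are illustrative.

```python
# Ten rows shown in the Car Data table (not the full data set).
mpg = [18, 15, 18, 16, 17, 15, 14, 14, 14, 15]
weight = [3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 3850]

n = len(mpg)
x_bar = sum(weight) / n
y_bar = sum(mpg) / n

# Least-squares slope and intercept for a single independent variable.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, mpg))
sxx = sum((x - x_bar) ** 2 for x in weight)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Predicted values and the three sums of squares.
y_hat = [b0 + b1 * x for x in weight]
sse = sum((y - yh) ** 2 for y, yh in zip(mpg, y_hat))
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sst = sum((y - y_bar) ** 2 for y in mpg)

r_square = ssr / sst
print(f"SSE={sse:.3f} SSR={ssr:.3f} SST={sst:.3f} R Square={r_square:.3f}")
```

For a least-squares fit with an intercept, SSR and SSE should add up to SST apart from floating-point error.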
Multiple Correlation Coefficient
$R = \sqrt{R^2} = r_{y\hat{y}}$
The correlation coefficient of the actual and predicted values
Adjusted Multiple Coefficient of Determination
$R_a^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$
Regression Statistics
Multiple R           0.898
R Square             0.807
Adjusted R Square    0.805
Standard Error       3.44
Observations         397
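A minimal sketch of how Multiple R and Adjusted R Square follow from R Square, n, and p. It assumes the rounded R Square of 0.807 from the output, so the last digit may differ slightly from the reported values, which were computed from unrounded sums of squares.

```python
import math

r_square = 0.807  # rounded value from the regression output
n = 397           # observations
p = 3             # independent variables: Weight, Year, Cylinders

# Multiple R is the square root of R Square.
multiple_r = math.sqrt(r_square)

# Adjusted R Square penalizes for the number of independent variables.
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - p - 1)

print(f"Multiple R = {multiple_r:.4f}")
print(f"Adjusted R Square = {adjusted_r_square:.4f}")
```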
F Test for Overall Significance
$H_0$: $\beta_1 = \beta_2 = \dots = \beta_p = 0$
$H_a$: One or more of the parameters is not equal to zero
Reject $H_0$ if: $F > F_\alpha$
Or reject $H_0$ if: p-value $< \alpha$
F = MSR/MSE
ANOVA Table for Multiple Regression Model
Source       Sum of Squares   Degrees of Freedom   Mean Squares          F
Regression   SSR              p                    MSR = SSR/p           F = MSR/MSE
Error        SSE              n-p-1                MSE = SSE/(n-p-1)
Total        SST              n-1
ANOVA Example
ANOVA
             df     SS      MS      F     Significance F
Regression   3      19382   6460    547   6.42E-140
Residual     393    4638    11.8
Total        396    24021
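The mean squares and F statistic in the example can be reproduced from the sums of squares and degrees of freedom. This is a sketch using the rounded table entries, so results agree with the output only up to rounding.

```python
# Rounded entries from the ANOVA example.
ssr, df_regression = 19382, 3
sse, df_residual = 4638, 393
sst = 24021

msr = ssr / df_regression  # mean square due to regression
mse = sse / df_residual    # mean square error
f_stat = msr / mse

# R Square can also be recovered from the same table.
r_square = ssr / sst

print(f"MSR = {msr:.0f}, MSE = {mse:.1f}, F = {f_stat:.0f}")
print(f"R Square = {r_square:.3f}")
```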
t Test for Coefficients
$H_0$: $\beta_1 = 0$
$H_a$: $\beta_1 \neq 0$
Reject $H_0$ if: $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$
Or if: p-value $< \alpha$
$t = b_1 / s_{b_1}$
where t follows a t distribution with n-p-1 degrees of freedom
t Test Example
             Coefficients   Standard Error   t Stat     P-value
Intercept    -14.48         4.038            -3.587     0.0003769
Weight       -0.006525      0.0004603        -14.18     3.892E-37
Year         0.7608         0.04985          15.26      1.258E-41
Cylinders    -0.07420       0.2322           -0.3196    0.7494
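Each t statistic in the table is the coefficient divided by its standard error. The sketch below recomputes them from the printed values; small differences from the reported t Stat column are expected because the printed inputs are rounded.

```python
# (coefficient, standard error) pairs copied from the t test example.
estimates = {
    "Intercept": (-14.48, 4.038),
    "Weight": (-0.006525, 0.0004603),
    "Year": (0.7608, 0.04985),
    "Cylinders": (-0.07420, 0.2322),
}

# t = b / s_b for each term in the model.
t_stats = {name: b / se for name, (b, se) in estimates.items()}

for name, t in t_stats.items():
    print(f"{name:10s} t = {t:.3f}")
```

Cylinders has a t statistic close to zero and a p-value of 0.7494, so the hypothesis that its coefficient is zero would not be rejected at conventional significance levels.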
Multicollinearity
When two or more independent variables are highly correlated.
When multicollinearity is severe, the estimated values of the coefficients will be unreliable.
Multicollinearity
Two guidelines for identifying multicollinearity:
• If the absolute value of the correlation coefficient for two independent variables exceeds 0.7
• If the correlation between an independent variable and another independent variable is greater than its correlation with the dependent variable
Multicollinearity
Table of correlation coefficients:
             MPG      Weight   Year     Cylinders
MPG          1
Weight       -0.829   1
Year         0.578    -0.300   1
Cylinders    -0.773   0.895    -0.344   1
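The two guidelines can be applied mechanically to the correlation table. This sketch assumes both guidelines compare absolute values of the correlations (the second guideline does not say so explicitly), and the helper name `r` is illustrative.

```python
from itertools import combinations

# Correlations copied from the table above.
corr = {
    ("MPG", "Weight"): -0.829,
    ("MPG", "Year"): 0.578,
    ("MPG", "Cylinders"): -0.773,
    ("Weight", "Year"): -0.300,
    ("Weight", "Cylinders"): 0.895,
    ("Year", "Cylinders"): -0.344,
}


def r(a, b):
    """Look up a correlation regardless of the order of the pair."""
    return corr[(a, b)] if (a, b) in corr else corr[(b, a)]


independent = ["Weight", "Year", "Cylinders"]
flagged = []
for a, b in combinations(independent, 2):
    r_ab = abs(r(a, b))
    # Guideline 1: two independent variables correlated above 0.7.
    rule1 = r_ab > 0.7
    # Guideline 2: stronger correlation with each other than with MPG.
    rule2 = r_ab > abs(r(a, "MPG")) or r_ab > abs(r(b, "MPG"))
    if rule1 or rule2:
        flagged.append((a, b))

print(flagged)
```

Only the Weight and Cylinders pair is flagged, which fits the unstable Cylinders coefficient seen when both variables are in the model.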
Multicollinearity
             Coefficients   Standard Error   t Stat
Intercept    -14.4          4.03             -3.58
Weight       -0.00652       0.000460         -14.1
Year         0.760          0.0498           15.2
Cylinders    -0.0741        0.232            -0.319
R Square 0.807

             Coefficients   Standard Error   t Stat
Intercept    -16.9          4.95             -3.42
Year         0.747          0.0612           12.21
Cylinders    -2.99          0.133            -22.46
R Square 0.708
Qualitative Variables and Regression
Quantitative variable – A variable that can be measured numerically (interval or ratio scale of measurement)
Qualitative variable – A variable where labels or names are used to identify some attribute (nominal or ordinal scale of measurement)
Qualitative Variables and Regression
The effect of a qualitative variable can be estimated using a dummy variable.
A dummy variable can equal 0 or 1. It creates different y-intercepts for groups with different attributes.
Qualitative Variables and Regression
Assume we estimate a regression model for the number of sick days an employee takes per year. A dummy variable is included that equals 1 if the individual smokes and 0 if they do not. Age is also included in the model.
Qualitative Variables and Regression
Estimated model:
Sick days taken = -1 + (3)Smoker + (0.1)Age
Example of how data would be coded:
Sick Days   Smoker   Age
3           0        45
6           1        50
0           0        20
5           0        65
10          1        60
Dummy Variables
Sick days taken = -1 + (3)Smoker + (0.1)Age
What is the y-intercept for nonsmokers? -1
What is the y-intercept for smokers? 2
What is the predicted number of sick days for a 40-year-old smoker? 6
What is the average difference in the number of sick days taken by smokers and nonsmokers? 3
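The answers above follow directly from plugging 0 or 1 into the Smoker dummy. A small sketch, with `predict_sick_days` as a hypothetical helper name:

```python
def predict_sick_days(smoker, age):
    """Estimated model: Sick days = -1 + 3*Smoker + 0.1*Age (smoker is 0 or 1)."""
    return -1 + 3 * smoker + 0.1 * age


# The y-intercept for each group is the prediction at Age = 0.
intercept_nonsmoker = predict_sick_days(smoker=0, age=0)
intercept_smoker = predict_sick_days(smoker=1, age=0)

# Prediction for a 40-year-old smoker.
smoker_40 = predict_sick_days(smoker=1, age=40)

# At any given age the smoker/nonsmoker gap equals the dummy coefficient.
gap = predict_sick_days(1, 40) - predict_sick_days(0, 40)

print(intercept_nonsmoker, intercept_smoker, smoker_40, gap)
```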
Dummy Variables
If an attribute has three or more possible values, you must include k-1 dummy variables in the model, where k is the number of possible values.
Dummy Variables
Suppose we have three job classifications: manager, operator, and secretary
Operator dummy equals 1 if the person is an operator, 0 otherwise
Secretary dummy equals 1 if the person is a secretary, 0 otherwise
Manager is the omitted group (choice of omitted group will not alter the predicted values)
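One way to picture the k-1 coding is a function that maps a job classification to its two dummy values. This is an illustrative sketch; `job_dummies` is not from the slides, and statistical packages typically do this coding for you.

```python
JOBS = ("manager", "operator", "secretary")


def job_dummies(job):
    """Return (operator, secretary) dummies; manager is the omitted group."""
    if job not in JOBS:
        raise ValueError(f"unknown job classification: {job}")
    operator = 1 if job == "operator" else 0
    secretary = 1 if job == "secretary" else 0
    return operator, secretary


for job in JOBS:
    print(job, job_dummies(job))
```

With k = 3 classifications, two dummies are enough: a manager is identified by both dummies being 0.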
Dummy Variables
Sick days taken = -1 + (1)Operator + (1.5)Secretary + (0.1)Age
What are the y-intercepts for each job classification? Managers = -1, Operators = 0, Secretaries = 0.5
What is the predicted number of sick days for a 40-year-old secretary? 4.5
What is the average difference in the number of sick days taken by operators and secretaries? 0.5
Dummy Variables
In some cases there will be multiple sets of dummy variables, such as:
Sick days taken = -1 + (3)Smoker + (1)Operator + (1.5)Secretary + (0.1)Age
Note that there are now 6 different intercepts:
Nonsmoker, Manager: -1 (omitted group)
Smoker, Manager: 2
Nonsmoker, Operator: 0
Smoker, Operator: 3
Nonsmoker, Secretary: 0.5
Smoker, Secretary: 3.5
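The six intercepts are every combination of the smoking effect and the job effect added to the base intercept of -1. A short sketch that enumerates them, with illustrative variable names:

```python
BASE_INTERCEPT = -1.0

# Dummy coefficients; the omitted groups (nonsmoker, manager) contribute 0.
smoking_effects = {"Nonsmoker": 0.0, "Smoker": 3.0}
job_effects = {"Manager": 0.0, "Operator": 1.0, "Secretary": 1.5}

intercepts = {}
for job, job_effect in job_effects.items():
    for status, smoke_effect in smoking_effects.items():
        intercepts[(status, job)] = BASE_INTERCEPT + smoke_effect + job_effect

for (status, job), value in intercepts.items():
    print(f"{status}, {job}: {value}")
```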
Dummy Variables
Note that when dummy variables are used we are assuming that the coefficients of the other variables are the same for all groups.
In this example the increase in sick days used from aging a year is equal to 0.1 for all of the groups.
If there is reason to believe the effect of an independent variable differs by group, you may want to estimate separate equations for each group.
Nonlinear Relationships
Nonlinear relationships can be modeled by including a variable that is a nonlinear function of an independent variable.
For example it is usually assumed that health care expenditures increase at an increasing rate as people age.
Nonlinear Relationships
In that case you might try including age squared in the model:
Health expend = 500 + (5)Age + (0.5)AgeSQ
Age   Health Expend
10    600
20    800
30    1100
40    1500
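The table values come from evaluating the quadratic model at each age. The sketch below reproduces them and also prints the change per ten years of age, which grows as age rises; `health_expend` is an illustrative function name.

```python
def health_expend(age):
    """Health expend = 500 + 5*Age + 0.5*Age^2 (model from the slide)."""
    return 500 + 5 * age + 0.5 * age ** 2


ages = [10, 20, 30, 40]
values = [health_expend(a) for a in ages]

# Differences between successive ages show the increasing rate of increase.
increments = [later - earlier for earlier, later in zip(values, values[1:])]

print(values)
print(increments)
```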
Nonlinear Relationships
If the dependent variable increases at a decreasing rate as the independent variable rises you might want to include the square root of the independent variable.
If you are unsure of the nature of the relationship you can use dummy variables for different ranges of values of the independent variable.
Non-continuous Relationships
If the relationship between the dependent variable and an independent variable is non-continuous, a slope dummy variable can be used to estimate two sets of coefficients for the independent variable.
For example, if natural gas usage is not affected by temperature when the temperature rises above 60 degrees, we could have:
Gas usage = b0 + b1(GT60) + b2(Temp) + b3(GT60)(Temp)
where GT60 is a dummy variable equal to 1 when the temperature is above 60 degrees and 0 otherwise.
Non-continuous Relationships
Note that at temperatures above 60 degrees the net effect of a 1 degree increase in temperature on gas usage is -0.056 (-.866+.810)
                Coefficients   Standard Error   t Stat    P-value
Intercept       53.002         2.415            21.95     7.48E-18
GT60            -46.623        16.682           -2.79     0.0098
Temp            -0.866         0.0595           -14.56    1.02E-13
(GT60)(Temp)    0.810          0.255            3.18      0.0039
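A sketch of how the estimated coefficients combine into two different lines, one for each side of 60 degrees. It assumes GT60 equals 1 when the temperature is above 60 and 0 otherwise; the constant and function names are illustrative.

```python
# Coefficients from the regression output.
B_INTERCEPT = 53.002
B_GT60 = -46.623
B_TEMP = -0.866
B_GT60_TEMP = 0.810


def gas_usage(temp):
    """Predicted gas usage; GT60 is assumed to be 1 when temp > 60."""
    gt60 = 1 if temp > 60 else 0
    return B_INTERCEPT + B_GT60 * gt60 + B_TEMP * temp + B_GT60_TEMP * gt60 * temp


# Slope and intercept that apply at or below 60 degrees.
slope_below = B_TEMP
intercept_below = B_INTERCEPT

# Above 60 degrees both the intercept and the slope shift.
slope_above = B_TEMP + B_GT60_TEMP
intercept_above = B_INTERCEPT + B_GT60

print(f"At or below 60: usage = {intercept_below:.3f} + ({slope_below:.3f})Temp")
print(f"Above 60:       usage = {intercept_above:.3f} + ({slope_above:.3f})Temp")
print(gas_usage(50), gas_usage(70))
```

The slope above 60 degrees is close to zero, which matches the idea that gas usage barely responds to temperature once it is warm.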