statistics- multiple regression
TRANSCRIPT
-
8/3/2019 Statistics- Multiple Regression
1/23
2/10/2012
Multiple Regression: Predicting One Factor from Several Others
-
Predicting a single Y variable from two or more X variables
Describe and Understand the Relationship
    Understand the effect of one X variable while holding the others fixed
Forecast (Predict) a New Observation
    Lets you use all available information (the X variables) to find out about what you don't know (the Y variable for this new situation)
Adjust and Control a Process
    Because the regression equation (you hope) tells you what would happen if you made a change
-
n cases (elementary units)
k explanatory X variables

          Y                 X1                     ...   Xk
          (dependent        (first independent           (last independent
          variable to be    or explanatory               or explanatory
          explained)        variable)                    variable)
Case 1    10.9              2.0                    ...   12.5
Case 2    23.6              4.0                    ...   12.3
...       ...               ...                    ...   ...
Case n    6.0               0.5                    ...   7.0
-
Intercept: a
    Predicted value for Y when every X is 0
Regression Coefficients: b1, b2, ..., bk
    The effect of each X on Y, holding all other X variables constant
Prediction Equation or Regression Equation
    (Predicted Y) = a + b1X1 + b2X2 + ... + bkXk
    The predicted Y, given the values for all X variables
Prediction Errors or Residuals
    (Actual Y) - (Predicted Y)
-
Standard Error of Estimate: Se or S
    Approximate size of the errors made in predicting Y
Coefficient of Determination: R2
    Percentage of the variability in Y explained by the X variables as a group
F Test: Significant or Not Significant
    Tests whether the X variables, as a group, can predict Y better than just randomly
-
t Tests for Individual Regression Coefficients
    Significant or not significant, for each X variable
    Tests whether a particular X variable has an effect on Y, holding the other X variables constant
    Should be performed only if the F test is significant
Standard Errors of the Regression Coefficients: Sb1, Sb2, ..., Sbk (with n - k - 1 degrees of freedom)
    Indicate the estimated sampling standard deviation of each regression coefficient
    Used in the usual way to find confidence intervals and hypothesis tests for individual regression coefficients
-
Input Data
    To predict the cost of ads from magazine characteristics

                Y             X1            X2         X3
                Page Costs    Audience      Percent    Median
                (color ad)    (thousands)   Male       Income
Audubon         $25,315       1,645         51.1       $38,787
Better Homes    198,000       34,797        22.1       41,933
...             ...           ...           ...        ...
YM              73,270        3,109         14.4       43,696
-
Predicted Page Costs = a + b1X1 + b2X2 + b3X3
    = $4,043 + 3.79(Audience) - 124(Percent Male) + 0.903(Median Income)
Intercept a = $4,043
    Essentially a base rate, representing the cost of advertising in a magazine that has no audience, no male readers, and zero income level
    But there are no such magazines
    The intercept a is merely there to help achieve the best predictions
-
Predicted Page Costs = a + b1X1 + b2X2 + b3X3
    = $4,043 + 3.79(Audience) - 124(Percent Male) + 0.903(Median Income)
Regression coefficient b1 = 3.79
    All else equal: the effect of Audience on Page Costs, while holding Percent Male and Median Income constant
    The effect of Audience on Page Costs, adjusted for Percent Male and Median Income
    On average, Page Costs are estimated to be $3.79 higher for a magazine with one more (thousand) Audience, as compared to another magazine with the same Percent Male and Median Income
-
Predicted Page Costs = a + b1X1 + b2X2 + b3X3
    = $4,043 + 3.79(Audience) - 124(Percent Male) + 0.903(Median Income)
Regression coefficient b2 = -124
    All else equal: the effect of Percent Male on Page Costs, while holding Audience and Median Income constant
    The effect of Percent Male on Page Costs, adjusted for Audience and Median Income
    On average, Page Costs are estimated to be $124 lower for a magazine with one more percentage point of male readers, as compared to another magazine with the same Audience and Median Income
    But don't believe it! We will see that it is not significant
-
Predicted Page Costs = a + b1X1 + b2X2 + b3X3
    = $4,043 + 3.79(Audience) - 124(Percent Male) + 0.903(Median Income)
Regression coefficient b3 = 0.903
    All else equal: the effect of Median Income on Page Costs, while holding Audience and Percent Male constant
    The effect of Median Income on Page Costs, adjusted for Audience and Percent Male
    On average, Page Costs are estimated to be $0.903 higher for a magazine with one more dollar of Median Income, as compared to another magazine with the same Audience and Percent Male
-
Predicted Page Costs for Audubon
    = a + b1X1 + b2X2 + b3X3
    = $4,043 + 3.79(Audience) - 124(Percent Male) + 0.903(Median Income)
    = $4,043 + 3.79(1,645) - 124(51.1) + 0.903(38,787)
    = $38,966
Actual Page Costs are $25,315
Residual is $25,315 - $38,966 = -$13,651
    Audubon has Page Costs $13,651 lower than you would expect for a magazine with its characteristics (Audience, Percent Male, and Median Income)
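The Audubon arithmetic can be checked with a few lines of Python. This is a minimal sketch: the coefficients and Audubon's values are the ones quoted on the slides, with the Percent Male coefficient taken as negative because the slides describe Page Costs as $124 lower per percentage point. The function name `predict_page_cost` and the variable names are illustrative assumptions.

```python
# Coefficients quoted on the slides (rounded as printed there).
INTERCEPT = 4043.0
B_AUDIENCE = 3.79        # dollars per thousand readers
B_PERCENT_MALE = -124.0  # dollars per percentage point of male readers
B_MEDIAN_INCOME = 0.903  # dollars per dollar of median income


def predict_page_cost(audience, percent_male, median_income):
    """Evaluate the prediction equation a + b1*X1 + b2*X2 + b3*X3."""
    return (INTERCEPT
            + B_AUDIENCE * audience
            + B_PERCENT_MALE * percent_male
            + B_MEDIAN_INCOME * median_income)


# Audubon's row from the input data slide.
predicted = predict_page_cost(audience=1645, percent_male=51.1, median_income=38787)
actual = 25315
residual = actual - predicted  # residual = actual Y minus predicted Y

print(f"Predicted: ${predicted:,.0f}")
print(f"Residual:  ${residual:,.0f}")
```

A negative residual means the magazine charges less than the equation predicts for its characteristics.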
-
Standard Error of Estimate Se
    Indicates the approximate size of the prediction errors
    About how far are the Y values from their predictions?
For the magazine data
    Se = S = $21,578
    Actual Page Costs are about $21,578 from their predictions for this group of magazines (using regression)
    Compare to SY = $45,446: Actual Page Costs are about $45,446 from their average (not using regression)
    Using the regression equation to predict Page Costs ...
-
Coefficient of Determination R2
    Indicates the percentage of the variation in Y that is explained by (or attributed to) all of the X variables
    How well do the X variables explain Y?
For the magazine data
    R2 = 0.787 = 78.7%
    The X variables (Audience, Percent Male, and Median Income) taken together explain 78.7% of the variance of Page Costs
    This leaves 100% - 78.7% = 21.3% of the variation in Page Costs unexplained
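As a rough consistency check, R2, Se, and SY are tied together. The sketch below assumes the standard definitions Se^2 = SSE/(n - k - 1), SY^2 = SST/(n - 1), and R2 = 1 - SSE/SST, and takes n = 55 and k = 3 from the R2 table lookup quoted for the magazine data. The variable names are illustrative.

```python
import math

n, k = 55, 3          # cases and X variables for the magazine data
r_squared = 0.787     # coefficient of determination from the slide
s_y = 45446.0         # standard deviation of Page Costs (not using regression)

# Se^2 = SSE/(n-k-1), SY^2 = SST/(n-1), R2 = 1 - SSE/SST
# => Se = SY * sqrt((1 - R2) * (n - 1) / (n - k - 1))
s_e = s_y * math.sqrt((1 - r_squared) * (n - 1) / (n - k - 1))

print(f"Implied Se: about ${s_e:,.0f}")
```

The implied value lands close to the reported Se = $21,578; the small gap comes from R2 being rounded to three decimals.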
-
Linear Model for the Population
    Y = (α + β1X1 + β2X2 + ... + βkXk) + ε
      = (Population relationship) + Randomness
    where ε has a normal distribution with mean 0 and constant standard deviation σ, and this randomness is independent from one case to another
An assumption needed for statistical inference
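One way to get a feel for this assumption is to simulate it. The sketch below is only an illustration: the population intercept, coefficients, and standard deviation are made-up values (in practice they are fixed but unknown), and `simulate_case` is a hypothetical helper.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population parameters (fixed, and unknown in practice).
alpha = 10.0
betas = [2.0, -1.5]   # population coefficients for X1 and X2
sigma = 3.0           # constant standard deviation of the randomness


def simulate_case(x_values):
    """Return (Y, randomness) for one case under the linear model."""
    relationship = alpha + sum(b * x for b, x in zip(betas, x_values))
    epsilon = random.gauss(0.0, sigma)  # normal, mean 0, independent per case
    return relationship + epsilon, epsilon


# Made-up X values for 5000 independent cases.
cases = [[random.uniform(0, 10), random.uniform(0, 10)] for _ in range(5000)]
results = [simulate_case(x) for x in cases]
errors = [eps for _, eps in results]

mean_error = sum(errors) / len(errors)
print(f"Average randomness over {len(errors)} cases: {mean_error:.3f}")
```

The average of the random terms should sit near 0, which is what the mean-0 part of the assumption says.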
-
Table 12.1.7

                           Population               Sample
                           (parameters: fixed       (estimators: random
                           and unknown)             and known)
Intercept or constant      α                        a
Regression coefficients    β1                       b1
                           β2                       b2
                           ...                      ...
                           βk                       bk
Uncertainty in Y           σ                        S or Se
-
Is the regression significant?
    Do the X variables, taken together, explain a significant amount of the variation in Y?
The null hypothesis claims that, in the population, the X variables do not help explain Y; all coefficients are 0
    H0: β1 = β2 = ... = βk = 0
The research hypothesis claims that, in the population, at least one of the X variables does help explain Y
    H1: At least one of β1, β2, ..., βk ≠ 0
-
Three equivalent methods for performing the F test; they always give the same result
Use the p-value
    If p < 0.05, then the test is significant
    Same interpretation as p-values in Chapter 10
Use the R2 value
    If R2 is larger than the value in the R2 table, then the result is significant
    Do the X variables explain more than just randomness?
Use the F statistic
    If the F statistic is larger than the value in the F table, then the result is significant
-
For the magazine data, the X variables (Audience, Percent Male, and Median Income) explain a very highly significant percentage of the variation in Page Costs
    The p-value, listed as 0.000, is less than 0.0005, and is therefore very highly significant (since it is less than 0.001)
    The R2 value, 78.7%, is greater than 27.1% (from the R2 table at level 0.1% with n = 55 and k = 3), and is therefore very highly significant
    The F statistic, 62.84, is greater than the value (between 7.054 and 6.171) from the F table at level 0.1%, and is therefore very highly significant
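The R2 and F methods agree because, once n and k are fixed, each statistic determines the other. A minimal sketch, assuming the standard identity F = (R2/k) / ((1 - R2)/(n - k - 1)) and the magazine values n = 55, k = 3, R2 = 0.787 (variable names are illustrative):

```python
n, k = 55, 3
r_squared = 0.787

# F compares explained variation per X variable with
# unexplained variation per residual degree of freedom.
f_statistic = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# The slide places the 0.1% F table value between 6.171 and 7.054;
# clearing the larger bound is enough to call the result significant.
critical_upper_bound = 7.054
very_highly_significant = f_statistic > critical_upper_bound

print(f"F = {f_statistic:.2f}, very highly significant: {very_highly_significant}")
```

The computed F comes out near the reported 62.84; the difference is due to R2 being rounded to three decimals.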
-
A t test for each regression coefficient
    To be used only if the F test is significant
    If F is not significant, you should not look at the t tests
Does the jth X variable have a significant effect on Y, holding the other X variables constant?
Hypotheses are
    H0: βj = 0, H1: βj ≠ 0
Test using the confidence interval
    bj ± t Sbj, using the t table with n - k - 1 degrees of freedom
Or use the t statistic
    t statistic = bj / Sbj
    Compare to the t table value with n - k - 1 degrees of freedom
-
Testing b1, the coefficient for Audience
    b1 = 3.79, t = 13.5, p = 0.000
    Audience has a very highly significant effect on Page Costs, after adjusting for Percent Male and Median Income
Testing b2, the coefficient for Percent Male
    b2 = -124, t = -0.90, p = 0.374
    Percent Male does not have a significant effect on Page Costs, after adjusting for Audience and Median Income
Testing b3, the coefficient for Median Income
    b3 = 0.903, t = 2.44, p = 0.018
    Median Income has a significant effect on Page Costs, after adjusting for Audience and Percent Male
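The confidence-interval and t-statistic approaches can be compared side by side. This sketch backs out each standard error from t = b / Sb using the rounded values on the slide, so the results are approximate. It takes b2 and its t statistic as negative (the slides describe Page Costs as lower for more male readers), and it assumes a two-sided 5% t table value of about 2.008 for n - k - 1 = 51 degrees of freedom; that value is an assumption, not a number from the slides.

```python
# (coefficient, t statistic) pairs quoted on the slide.
coefficients = {
    "Audience": (3.79, 13.5),
    "Percent Male": (-124.0, -0.90),
    "Median Income": (0.903, 2.44),
}

# Assumed two-sided 5% t table value for 51 degrees of freedom.
T_TABLE = 2.008

summary = {}
for name, (b, t_stat) in coefficients.items():
    std_error = b / t_stat                       # since t = b / Sb
    half_width = T_TABLE * std_error
    interval = (b - half_width, b + half_width)  # 95% confidence interval
    significant = abs(t_stat) > T_TABLE
    summary[name] = (std_error, interval, significant)
    print(f"{name}: Sb = {std_error:.3g}, "
          f"CI = ({interval[0]:.3g}, {interval[1]:.3g}), significant: {significant}")
```

A coefficient is significant exactly when its confidence interval excludes 0, so the two methods give the same verdict for each X variable.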
-
Standardized Regression Coefficients
    Indicate the relative importance of the information each X variable brings in addition to the others
    Ordinary regression coefficients are in different units
        And cannot be compared without standardization
    Defined as bj SXj / SY for the jth X variable
    Compare the absolute values
Correlation Coefficients
    Indicate the relative importance of the information each X variable brings without adjusting for the other X variables
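The standardization step, multiplying each coefficient by the standard deviation of its X variable and dividing by the standard deviation of Y, can be sketched as follows. The slides do not give the standard deviations of the X variables, so every number below is hypothetical, as is the helper name `standardized_coefficients`.

```python
def standardized_coefficients(b, s_x, s_y):
    """Return b_j * S_Xj / S_Y for each X variable."""
    return [b_j * s_xj / s_y for b_j, s_xj in zip(b, s_x)]


# Hypothetical values for two X variables measured in different units.
b = [2.0, -0.5]     # ordinary regression coefficients
s_x = [3.0, 8.0]    # standard deviations of X1 and X2
s_y = 10.0          # standard deviation of Y

standardized = standardized_coefficients(b, s_x, s_y)

# Rank by absolute value: the sign shows direction, the size shows importance.
ranking = sorted(range(len(b)), key=lambda j: abs(standardized[j]), reverse=True)
print(standardized, ranking)
```

In this made-up case X1 ranks first even though the two raw coefficients are in different units and are not directly comparable.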
-
Multicollinearity
    When some X variables are too similar to one another
    Might do a good job of explaining and predicting Y
    But t tests might not be significant because no X variable is bringing new information
Variable Selection
    How to choose from a long list of X variables?
    Too many: waste the information in the data
    Too few: risk ignoring useful predictive information
Model Misspecification
    Perhaps the multiple regression linear model is ...