Irwin/McGraw-Hill © Andrew F. Siegel, 1997 and 2000

Chapter 12
Multiple Regression: Predicting One Factor from Several Others




Multiple Regression
Predicting a single Y variable from two or more X variables

Describe and Understand the Relationship
    Understand the effect of one X variable while holding the others fixed
Forecast (Predict) a New Observation
    Lets you use all available information (the X variables) to find out about what you don’t know (the Y variable for this new situation)
Adjust and Control a Process
    The regression equation (you hope) tells you what would happen if you made a change


Input Data
n cases (elementary units), k explanatory X variables

            Y                      X1                         …    Xk
            (dependent variable    (first independent or           (last independent or
            to be explained)       explanatory variable)           explanatory variable)
Case 1      10.9                   2.0                        …    12.5
Case 2      23.6                   4.0                        …    12.3
. . .       . . .                  . . .                           . . .
Case n      6.0                    0.5                        …    7.0


Results
Intercept: a
    Predicted value for Y when every X is 0
Regression Coefficients: b1, b2, …, bk
    The effect of each X on Y, holding all other X variables constant
Prediction Equation or Regression Equation
    (Predicted Y) = a + b1 X1 + b2 X2 + … + bk Xk
    The predicted Y, given the values for all X variables
Prediction Errors or Residuals
    (Actual Y) – (Predicted Y)
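
As a rough illustration of how the intercept, coefficients, predictions, and residuals fit together, here is a minimal Python sketch using NumPy's least-squares solver. The data values are made up for illustration and do not come from the slides; the variable names are assumptions as well.

```python
import numpy as np

# Made-up data: 6 cases, k = 2 explanatory variables (not from the slides).
X = np.array([[2.0, 12.5],
              [4.0, 12.3],
              [0.5,  7.0],
              [3.0, 10.0],
              [1.5,  9.0],
              [5.0, 14.0]])
y = np.array([10.9, 23.6, 6.0, 15.2, 9.8, 27.1])

# Prepend a column of ones so the first fitted value is the intercept a.
design = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b = coef[0], coef[1:]

predicted = a + X @ b        # (Predicted Y) = a + b1 X1 + b2 X2
residuals = y - predicted    # (Actual Y) - (Predicted Y)

print("intercept a:", a)
print("coefficients b:", b)
print("residuals:", residuals)
```

Because the fit includes an intercept, the residuals sum to zero (up to rounding), which is a quick sanity check on any least-squares output.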


Results (continued)
Standard Error of Estimate: Se or S
    Approximate size of the errors made predicting Y
Coefficient of Determination: R²
    Percentage of the variability in Y explained by the X variables as a group
F Test: Significant or Not Significant
    Tests whether the X variables, as a group, predict Y better than would be expected by chance alone
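
The three summary quantities above can be computed from the actual and predicted Y values. The sketch below uses the standard sum-of-squares formulas and assumes the predictions come from a least-squares fit with an intercept; the function name and the numbers are hypothetical and serve only to exercise the arithmetic.

```python
import math

def fit_summary(actual, predicted, k):
    """Return (Se, R2, F) for a regression with k X variables.

    Assumes `predicted` comes from a least-squares fit with an intercept.
    """
    n = len(actual)
    mean_y = sum(actual) / n
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # unexplained variation
    sst = sum((y - mean_y) ** 2 for y in actual)                # total variation
    se = math.sqrt(sse / (n - k - 1))             # standard error of estimate
    r2 = 1 - sse / sst                            # coefficient of determination
    f = ((sst - sse) / k) / (sse / (n - k - 1))   # F statistic
    return se, r2, f

# Hypothetical values, not from the slides.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
se, r2, f = fit_summary(actual, predicted, k=1)
print(se, r2, f)
```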


Results (continued)
t Tests for Individual Regression Coefficients
    Significant or not significant, for each X variable
    Tests whether a particular X variable has an effect on Y, holding the other X variables constant
    Should be performed only if the F test is significant
Standard Errors of the Regression Coefficients: Sb1, Sb2, …, Sbk (with n – k – 1 degrees of freedom)
    Indicate the estimated sampling standard deviation of each regression coefficient
    Used in the usual way to find confidence intervals and hypothesis tests for individual regression coefficients


Example: Magazine Ads
Input Data: to predict cost of ads from magazine characteristics

                Y               X1              X2              X3
                Page Costs      Audience        Percent Male    Median Income
                (color ad)      (thousands)
Audubon         $25,315         1,645           51.1            $38,787
Better Homes    198,000         34,797          22.1            41,933
. . .           . . .           . . .           . . .           . . .
YM              73,270          3,109           14.4            43,696


Example: Prediction, Intercept a
Predicted Page Costs
    = a + b1 X1 + b2 X2 + b3 X3
    = $4,043 + 3.79(Audience) – 124(Percent Male) + 0.903(Median Income)

• Intercept a = $4,043
    Essentially a base rate, representing the cost of advertising in a magazine that has no audience, no male readers, and zero income level
    But there are no such magazines, so the intercept a is merely there to help achieve the best predictions


Example: Coefficient b1
Predicted Page Costs
    = a + b1 X1 + b2 X2 + b3 X3
    = $4,043 + 3.79(Audience) – 124(Percent Male) + 0.903(Median Income)

• Regression coefficient b1 = 3.79
    All else equal: the effect of Audience on Page Costs, while holding Percent Male and Median Income constant
    The effect of Audience on Page Costs, adjusted for Percent Male and Median Income
    On average, Page Costs are estimated to be $3.79 higher for a magazine with one more (thousand) Audience, as compared to another magazine with the same Percent Male and Median Income
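
The "all else equal" reading of b1 can be checked numerically: two magazines that differ only by one (thousand) in Audience should differ by 3.79 in predicted Page Costs. This is a small sketch; the function name and the two comparison magazines are hypothetical, while the coefficients are the ones shown on the slide.

```python
def predicted_page_cost(audience, percent_male, median_income):
    """Prediction equation from the slide; audience is in thousands."""
    return 4043 + 3.79 * audience - 124 * percent_male + 0.903 * median_income

# Two hypothetical magazines, identical except for 1 (thousand) more Audience.
base = predicted_page_cost(1000, 50.0, 40000)
one_more = predicted_page_cost(1001, 50.0, 40000)
difference = one_more - base

print(difference)  # equals b1 = 3.79, up to floating-point rounding
```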


Example: Coefficient b2
Predicted Page Costs
    = a + b1 X1 + b2 X2 + b3 X3
    = $4,043 + 3.79(Audience) – 124(Percent Male) + 0.903(Median Income)

• Regression coefficient b2 = –124
    All else equal: the effect of Percent Male on Page Costs, while holding Audience and Median Income constant
    The effect of Percent Male on Page Costs, adjusted for Audience and Median Income
    On average, Page Costs are estimated to be $124 lower for a magazine with one more percentage point of male readers, as compared to another magazine with the same Audience and Median Income
    But don’t believe it! We will see that it is not significant


Example: Coefficient b3
Predicted Page Costs
    = a + b1 X1 + b2 X2 + b3 X3
    = $4,043 + 3.79(Audience) – 124(Percent Male) + 0.903(Median Income)

• Regression coefficient b3 = 0.903
    All else equal: the effect of Median Income on Page Costs, while holding Audience and Percent Male constant
    The effect of Median Income on Page Costs, adjusted for Audience and Percent Male
    On average, Page Costs are estimated to be $0.903 higher for a magazine with one more dollar of Median Income, as compared to another magazine with the same Audience and Percent Male


Example: Prediction and Residual
Predicted Page Costs for Audubon
    = a + b1 X1 + b2 X2 + b3 X3
    = $4,043 + 3.79(Audience) – 124(Percent Male) + 0.903(Median Income)
    = $4,043 + 3.79(1,645) – 124(51.1) + 0.903(38,787)
    = $38,966
Actual Page Costs are $25,315
Residual = Actual – Predicted = $25,315 – $38,966 = –$13,651
    Audubon has Page Costs $13,651 lower than you would expect for a magazine with its characteristics (Audience, Percent Male, and Median Income)
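
The Audubon arithmetic can be reproduced in a few lines. The coefficients and Audubon's values are taken from the slides; the variable names are illustrative.

```python
# Coefficients and Audubon's values as given on the slides.
a, b1, b2, b3 = 4043, 3.79, -124, 0.903
audience, percent_male, median_income = 1645, 51.1, 38787
actual = 25315

predicted = a + b1 * audience + b2 * percent_male + b3 * median_income
residual = actual - predicted   # Residual = Actual - Predicted

print(round(predicted), round(residual))
```

Rounded to whole dollars, the prediction and the residual agree with the $38,966 and –$13,651 shown above.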


Example: Standard Error
Standard Error of Estimate Se
    Indicates the approximate size of the prediction errors
    About how far are the Y values from their predictions?
For the magazine data
    Se = S = $21,578
    Actual Page Costs are about $21,578 from their predictions for this group of magazines (using regression)
    Compare to SY = $45,446: Actual Page Costs are about $45,446 from their average (not using regression)
    Using the regression equation to predict Page Costs (instead of simply using the average, Ȳ), the typical error is reduced from $45,446 to $21,578


Example: Coefficient of Determination
Coefficient of Determination R²
    Indicates the percentage of the variation in Y that is explained by (or attributed to) all of the X variables
    How well do the X variables explain Y?
For the magazine data
    R² = 0.787 = 78.7%
    The X variables (Audience, Percent Male, and Median Income) taken together explain 78.7% of the variance of Page Costs
    This leaves 100% – 78.7% = 21.3% of the variation in Page Costs unexplained
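
As a consistency check, R² can be recovered from the two standard deviations quoted in this chapter. The sketch assumes SY is the usual sample standard deviation (n – 1 divisor) and uses n = 55 and k = 3 from the F test example slide; under those assumptions the unexplained variation is (n – k – 1) Se² and the total variation is (n – 1) SY².

```python
n, k = 55, 3            # number of magazines and X variables (from the F test example)
se, sy = 21578, 45446   # Se and SY as quoted in the chapter

sse = (n - k - 1) * se ** 2   # unexplained variation, assuming Se uses n - k - 1
sst = (n - 1) * sy ** 2       # total variation, assuming SY uses n - 1
r2 = 1 - sse / sst

print(round(r2, 3))
```

The result agrees with the reported 78.7% to three decimal places.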


Multiple Regression Linear Model
Linear Model for the Population
    Y = (α + β1 X1 + β2 X2 + … + βk Xk) + ε
      = (Population relationship) + Randomness
    where ε has a normal distribution with mean 0 and constant standard deviation σ, and this randomness is independent from one case to another
An assumption needed for statistical inference


Population and Sample Quantities (Table 12.1.7)

                            Population                  Sample
                            (parameters: fixed          (estimators: random
                            and unknown)                and known)
Intercept or constant       α                           a
Regression coefficients     β1                          b1
                            β2                          b2
                            . . .                       . . .
                            βk                          bk
Uncertainty in Y            σ                           S or Se


The F Test
Is the regression significant?
    Do the X variables, taken together, explain a significant amount of the variation in Y?
The null hypothesis claims that, in the population, the X variables do not help explain Y; all coefficients are 0
    H0: β1 = β2 = … = βk = 0
The research hypothesis claims that, in the population, at least one of the X variables does help explain Y
    H1: At least one of β1, β2, …, βk ≠ 0


Performing the F Test
Three equivalent methods for performing the F test; they always give the same result
Use the p-value
    If p < 0.05, then the test is significant
    Same interpretation as p-values in Chapter 10
Use the R² value
    If R² is larger than the value in the R² table, then the result is significant
    Do the X variables explain more than just randomness?
Use the F statistic
    If the F statistic is larger than the value in the F table, then the result is significant


Example: F Test
For the magazine data, the X variables (Audience, Percent Male, and Median Income) explain a very highly significant percentage of the variation in Page Costs
    The p-value, listed as 0.000, is less than 0.0005, and is therefore very highly significant (since it is less than 0.001)
    The R² value, 78.7%, is greater than 27.1% (from the R² table at level 0.1% with n = 55 and k = 3), and is therefore very highly significant
    The F statistic, 62.84, is greater than the value (between 6.171 and 7.054) from the F table at level 0.1%, and is therefore very highly significant
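
The R² and F methods agree because F is a function of R², n, and k. The sketch below applies the standard identity F = (R² / k) / ((1 – R²) / (n – k – 1)) to the rounded R² from the slides, so it reproduces the reported F statistic only approximately.

```python
n, k = 55, 3
r2 = 0.787   # rounded value from the slides

f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(f_stat, 1))
```

This gives about 62.8, close to the reported 62.84; the small gap is presumably due to R² being rounded to three decimals.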


t Tests
A t test for each regression coefficient
    To be used only if the F test is significant
    If F is not significant, you should not look at the t tests
Does the jth X variable have a significant effect on Y, holding the other X variables constant?
Hypotheses are
    H0: βj = 0, H1: βj ≠ 0
Test using the confidence interval bj ± t Sbj
    Use the t table with n – k – 1 degrees of freedom
    Significant if 0 is not in the interval
Or use the t statistic, t = bj / Sbj
    Compare to the t table value with n – k – 1 degrees of freedom


Example: t Tests
Testing b1, the coefficient for Audience
    b1 = 3.79, t = 13.5, p = 0.000 (p < 0.001)
    Audience has a very highly significant effect on Page Costs, after adjusting for Percent Male and Median Income
Testing b2, the coefficient for Percent Male
    b2 = –124, t = –0.90, p = 0.374 (p > 0.05)
    Percent Male does not have a significant effect on Page Costs, after adjusting for Audience and Median Income
Testing b3, the coefficient for Median Income
    b3 = 0.903, t = 2.44, p = 0.018 (p < 0.05)
    Median Income has a significant effect on Page Costs, after adjusting for Audience and Percent Male
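
The confidence-interval version of these tests can be sketched from the reported numbers. Since t = b / Sb, each standard error can be backed out as b / t (only approximately, because b and t are rounded). The critical value 2.008 is an assumed two-sided 5% t table value for about 51 degrees of freedom (n – k – 1 = 55 – 3 – 1); the function name is hypothetical.

```python
T_CRIT = 2.008  # assumed two-sided 5% t table value for about 51 degrees of freedom

def t_test(b, t_stat, t_crit=T_CRIT):
    """Back out Sb from b and t, then build the interval b +/- t_crit * Sb."""
    sb = b / t_stat                       # since t = b / Sb
    half_width = t_crit * abs(sb)
    low, high = b - half_width, b + half_width
    significant = not (low <= 0 <= high)  # significant if 0 is outside the interval
    return sb, (low, high), significant

results = {
    "Audience": t_test(3.79, 13.5),
    "Percent Male": t_test(-124, -0.90),
    "Median Income": t_test(0.903, 2.44),
}

for name, (sb, interval, significant) in results.items():
    print(name, round(sb, 3), interval, significant)
```

Under these assumptions, the intervals for Audience and Median Income exclude 0, while the interval for Percent Male includes it, matching the conclusions above.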


Comparing the X Variables
Standardized Regression Coefficients
    Indicate the relative importance of the information each X variable brings in addition to the others
    Ordinary regression coefficients are in different units and cannot be compared without standardization
    Defined as bj SXj / SY for the jth X variable
    Compare the absolute values
Correlation Coefficients
    Indicate the relative importance of the information each X variable brings without adjusting for the other X variables
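
The standardization formula is easy to apply once the coefficients are known. The sketch below is illustrative only: the function name and the data are hypothetical, and sample standard deviations (n – 1 divisor) are assumed.

```python
import numpy as np

def standardized_coefficients(X, y, b):
    """Return b_j * S_Xj / S_Y for each column j of X."""
    s_x = np.std(X, axis=0, ddof=1)   # sample standard deviation of each X variable
    s_y = np.std(y, ddof=1)           # sample standard deviation of Y
    return np.asarray(b) * s_x / s_y

# Hypothetical data: Y depends on X1 exactly (Y = 3 + 2 X1), so b1 = 2.
X = np.array([[1.0], [2.0], [4.0], [7.0]])
y = 3 + 2 * X[:, 0]
std_b = standardized_coefficients(X, y, [2.0])
print(std_b)
```

With a single X variable that determines Y exactly, the standardized coefficient comes out as 1 regardless of the units of X and Y, which is the point of standardizing.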


Problems with Multiple Regression
Multicollinearity
    When some X variables are too similar to one another
    The regression might do a good job of explaining and predicting Y
    But the t tests might not be significant because no X variable is bringing new information
Variable Selection
    How to choose from a long list of X variables?
    Too many: waste the information in the data
    Too few: risk ignoring useful predictive information
Model Misspecification
    Perhaps the multiple regression linear model is wrong
    Unequal variability? Nonlinearity? Interaction?
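
One simple way to look for X variables that are "too similar" is to inspect their pairwise correlations. This is only a rough screening sketch, not a method prescribed by the slides, and the data are made up: x2 is nearly a copy of x1, while x3 is unrelated.

```python
import numpy as np

# Hypothetical X variables (not from the slides).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = x1 + np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05])
x3 = np.array([5.0, 1.0, 4.0, 2.0, 6.0, 3.0, 2.0, 5.0])

# Each row passed to corrcoef is treated as one variable.
corr = np.corrcoef([x1, x2, x3])
print(np.round(corr, 3))
```

A correlation near 1 (or –1) between two X variables, as between x1 and x2 here, suggests that they carry almost the same information and that their individual t tests may be hard to interpret.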