ibm401 lecture 5
TRANSCRIPT
QUANTITATIVE ANALYSIS FOR BUSINESS
Lecture 5
August 9th, 2010
MULTIPLE REGRESSION ANALYSIS
Multiple regression models are extensions to the simple linear model and allow the creation of models with several independent variables
Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where
Y = dependent variable (response variable)
Xi = ith independent variable (predictor or explanatory variable)
β0 = intercept (value of Y when all Xi = 0)
βi = coefficient of the ith independent variable
k = number of independent variables
ε = random error
MULTIPLE REGRESSION ANALYSIS
To estimate these values, a sample is taken and the following equation is developed:

Ŷ = b0 + b1X1 + b2X2 + … + bkXk

where
Ŷ = predicted value of Y
b0 = sample intercept (an estimate of β0)
bi = sample coefficient of the ith variable (an estimate of βi)
INTERPRETING MULTIPLE REGRESSION RESULTS
Intercept – the value of the dependent variable when the independent variables are all equal to zero
Each slope coefficient – the estimated change in the dependent variable for a one-unit change in that independent variable, holding the other independent variables constant
These are sometimes called "partial slope coefficients"
INTERPRETING EXAMPLE
10-year real earnings growth of the S&P 500 (EG10)
Intercept term: if the dividend payout ratio (PR) is zero and the slope of the yield curve (YC) is zero, we would expect the subsequent 10-year real earnings growth rate to be -11.6%
Slope coefficient of PR: if the payout ratio increases by 1%, we would expect the subsequent 10-year earnings growth rate to increase by 0.25%, holding YC constant
Slope coefficient of YC: if the yield curve slope increases by 1%, we would expect the subsequent 10-year earnings growth rate to increase by 0.14%, holding PR constant
HYPOTHESIS TESTING OF REGRESSION COEFFICIENTS
t-statistic – used to test the significance of an individual coefficient in a multiple regression
The t-statistic has n - k - 1 degrees of freedom:

t = (b̂j - bj) / s_b̂j

where the numerator is the estimated regression coefficient minus the hypothesized value, and s_b̂j is the coefficient standard error of bj
EX: TESTING THE STATISTICAL SIGNIFICANCE OF A REGRESSION COEFFICIENT
Test the statistical significance of the independent variable PR in the real earnings growth example at the 10% significance level.

            Coefficient   Standard Error
Intercept   -11.6%        1.657%
PR          0.25          0.032
YC          0.14          0.28

Data based on 46 observations
EX: TESTING THE STATISTICAL SIGNIFICANCE OF A REGRESSION COEFFICIENT
We are testing the following hypotheses:

H0: bPR = 0
Ha: bPR ≠ 0

The 10% two-tailed critical t-value with 43 degrees of freedom (46 - 2 - 1) is approximately 1.68. We should reject the null hypothesis if the t-statistic is greater than 1.68 or less than -1.68.

t = (b̂j - bj) / s_b̂j = 0.25 / 0.032 = 7.8

Since 7.8 is greater than 1.68, we can reject the null hypothesis and conclude that the PR regression coefficient is statistically significant at the 10% significance level
EXAMPLE: JENNY WILSON REALTY
Jenny Wilson wants to develop a model to determine the suggested listing price for houses based on the size and age of the house:

Ŷ = b0 + b1X1 + b2X2

where
Ŷ = predicted value of the dependent variable (selling price)
b0 = Y intercept
X1 and X2 = values of the two independent variables (square footage and age) respectively
b1 and b2 = slopes for X1 and X2 respectively
She selects a sample of houses that have sold recently and records the data shown in Table 4.5
JENNY WILSON REALTY

SELLING PRICE ($)   SQUARE FOOTAGE   AGE   CONDITION
95,000              1,926            30    Good
119,000             2,069            40    Excellent
124,800             1,720            30    Excellent
135,000             1,396            15    Good
142,000             1,706            32    Mint
145,000             1,847            38    Mint
159,000             1,950            27    Mint
165,000             2,323            30    Excellent
182,000             2,285            26    Mint
183,000             3,752            35    Good
200,000             2,300            18    Good
211,000             2,525            17    Good
215,000             3,800            40    Excellent
219,000             1,740            12    Mint

Table 4.5
JENNY WILSON REALTY

Ŷ = 146,631 + 44X1 - 2,899X2
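The fitted equation can be reproduced directly from Table 4.5 with ordinary least squares. numpy is an assumption here (the lecture does not name software), and the printed coefficients should come out close to the slide's values:

```python
import numpy as np

# Table 4.5 data: selling price, square footage, age
price = np.array([95000, 119000, 124800, 135000, 142000, 145000, 159000,
                  165000, 182000, 183000, 200000, 211000, 215000, 219000])
sqft = np.array([1926, 2069, 1720, 1396, 1706, 1847, 1950,
                 2323, 2285, 3752, 2300, 2525, 3800, 1740])
age = np.array([30, 40, 30, 15, 32, 38, 27, 30, 26, 35, 18, 17, 40, 12])

# Design matrix with an intercept column; least-squares fit
X = np.column_stack([np.ones(len(price)), sqft, age])
b0, b1, b2 = np.linalg.lstsq(X, price, rcond=None)[0]
print(f"price_hat = {b0:,.0f} + {b1:.2f}*sqft + {b2:,.0f}*age")
```

Note the signs: larger houses raise the predicted price, older houses lower it, which is what the slide's equation encodes.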
EVALUATING MULTIPLE REGRESSION MODELS
Evaluation is similar to simple linear regression models
The p-value for the F-test and r2 are interpreted the same way
The hypothesis is different because there is more than one independent variable
The F-test investigates whether all the coefficients are equal to 0
p-value – the smallest level of significance for which the null hypothesis can be rejected
If p-value < significance level: reject the null hypothesis
If p-value > significance level: the null hypothesis cannot be rejected
EVALUATING MULTIPLE REGRESSION MODELS
To determine which independent variables are significant, a test is performed for each variable:

H0: βi = 0
H1: βi ≠ 0

The test statistic is calculated, and if the p-value is lower than the level of significance (α), the null hypothesis is rejected
EX: INTERPRETING P-VALUES
Given the following regression results, determine which regression parameters for the independent variables are statistically significantly different from zero at the 1% significance level, assuming the sample size is 60.

            Coefficient   Standard Error   t-statistic   p-value
Intercept   0.40          0.40             1.0           0.3215
X1          8.2           2.05             4.0           0.0002
X2          0.40          0.18             2.2           0.0319
X3          -1.80         0.56             -3.2          0.0022

An independent variable is statistically significant if its p-value is less than 1%, or 0.01: X1 and X3 are statistically significantly different from zero
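The reported p-values can be recovered from the t-statistics and the degrees of freedom. This sketch uses scipy (an assumption, not part of the lecture):

```python
from scipy import stats

# Two-tailed p-values from the reported t-statistics (n = 60, k = 3)
df = 60 - 3 - 1  # 56 degrees of freedom
t_stats = {"Intercept": 1.0, "X1": 4.0, "X2": 2.2, "X3": -3.2}

# p = 2 * P(T > |t|) for a two-tailed test
p_values = {name: 2 * stats.t.sf(abs(t), df) for name, t in t_stats.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}, significant at 1%: {p < 0.01}")
```

Only X1 and X3 come out below 0.01, matching the slide's conclusion.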
F-STATISTIC
The F-test assesses how well the set of independent variables, as a group, explains the variation of the dependent variable
The F-statistic is used to test whether at least one of the independent variables explains a significant portion of the variation of the dependent variable:

H0: b1 = b2 = b3 = b4 = 0
Ha: at least one bj ≠ 0

F-STATISTIC
The F-statistic is calculated as

F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))

where
SSR = sum of squares of the regression
SSE = sum of squares of the errors
MSR = mean regression sum of squares
MSE = mean squared error

Reject H0 if F-statistic > Fc (critical value)
EX: CALCULATING AND INTERPRETING F-STATISTIC An analyst runs a regression of monthly
value-stock returns on five independent variables over 60 months. The total sum of squares is 460, and the sum of squared errors is 170. Test the null hypothesis at the 5% significance level that all five of the independent variables are equal to zero
The critical F-value for 5 and 54 degrees of freedom at 5% significance level is approximately 2.40
EX: CALCULATING AND INTERPRETING F-STATISTIC
The null and alternative hypotheses are:

H0: b1 = b2 = b3 = b4 = b5 = 0
Ha: at least one bj ≠ 0

Calculations:

SSR = SST - SSE = 460 - 170 = 290
MSR = SSR / k = 290 / 5 = 58.0
MSE = SSE / (n - k - 1) = 170 / (60 - 5 - 1) = 3.15
F = MSR / MSE = 58.0 / 3.15 = 18.41

F-statistic > F-critical, so we reject the null hypothesis: at least one independent variable is significantly different from zero
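The same calculation in Python, with scipy (an assumption) supplying the critical F-value instead of a table lookup:

```python
from scipy import stats

# F-test from the example: SST = 460, SSE = 170, n = 60, k = 5
n, k = 60, 5
sst, sse = 460.0, 170.0
ssr = sst - sse                 # 290
msr = ssr / k                   # 58.0
mse = sse / (n - k - 1)         # ≈ 3.15
f_stat = msr / mse              # ≈ 18.4
f_crit = stats.f.ppf(0.95, k, n - k - 1)   # ≈ 2.4 at the 5% level
print(f_stat, f_crit, f_stat > f_crit)
```

Since 18.4 exceeds the critical value of about 2.4, the null hypothesis is rejected, as on the slide.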
EX: JENNY WILSON REALTY
The model is statistically significant: the p-value for the F-test is 0.002
r2 = 0.6719, so the model explains about 67% of the variation in selling price (Y)
But the F-test is for the entire model, and we can't tell if one or both of the independent variables are significant
By calculating the p-value of each variable, we can assess the significance of the individual variables
Since the p-values for X1 (square footage) and X2 (age) are both less than the significance level of 0.05, both null hypotheses can be rejected
COEFFICIENT OF DETERMINATION (R2)
The multiple coefficient of determination, R2, can be used to test the overall effectiveness of the entire set of independent variables in explaining the dependent variable:

R2 = SSR / SST = (SST - SSE) / SST = (Total variation - Unexplained variation) / Total variation
ADJUSTED R2
Unfortunately, R2 by itself may not be a reliable measure of the multiple regression model
R2 almost always increases as variables are added to the model
We need to take new variables into account:

Ra2 = 1 - [(n - 1) / (n - k - 1)] (1 - R2)

where
n = number of observations
k = number of independent variables
Ra2 = adjusted R2
ADJUSTED R2
Whenever there is more than one independent variable, Ra2 is less than or equal to R2
So adding new variables to the model will increase R2 but may increase or decrease Ra2
Ra2 may be less than 0 if R2 is low enough
EX: ADJUSTED R2
An analyst runs a regression model of monthly value-stock returns on five independent variables over 60 months. The total sum of squares for the regression is 460, and the sum of squared errors is 170. Calculate the adjusted R2.

R2 = (460 - 170) / 460 = 0.63 = 63%
Ra2 = 1 - [(60 - 1) / (60 - 5 - 1)] (1 - 0.63) = 0.596 = 59.6%

The R2 of 63% suggests that the five independent variables together explain 63% of the variation in monthly value-stock returns
EX: ADJUSTED R2
Suppose the analyst now adds four more independent variables to the regression and the R2 increases to 65%. Which model would the analyst most likely prefer?

Ra2 = 1 - [(60 - 1) / (60 - 9 - 1)] (1 - 0.65) = 0.587 = 58.7%

The analyst would prefer the first model because its adjusted R2 is higher and the model has five independent variables as opposed to nine
BINARY OR DUMMY VARIABLES
Binary (or dummy or indicator) variables are special variables created for qualitative data
A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise
The number of dummy variables must equal one less than the number of categories of the qualitative variable
Jenny's qualitative category in Table 4.5 is Condition: Excellent, Mint, or Good
JENNY WILSON REALTY
Jenny believes a better model can be developed if she includes information about the condition of the property
X3 = 1 if the house is in excellent condition, 0 otherwise
X4 = 1 if the house is in mint condition, 0 otherwise

Two dummy variables are used to describe the three categories of condition
No variable is needed for "good" condition: if both X3 and X4 = 0, the house must be in good condition
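The coding rule can be written as a tiny helper (the function name is just for illustration):

```python
# Dummy coding for the three-category "condition" variable:
# two dummies, with "good" as the omitted base category
def condition_dummies(condition):
    x3 = 1 if condition == "Excellent" else 0  # excellent condition
    x4 = 1 if condition == "Mint" else 0       # mint condition
    return x3, x4

print(condition_dummies("Excellent"))  # (1, 0)
print(condition_dummies("Mint"))       # (0, 1)
print(condition_dummies("Good"))       # (0, 0)
```

With three categories, exactly two dummies are created; "Good" is fully identified by X3 = X4 = 0, so a third dummy would be redundant.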
JENNY WILSON REALTY
Program 4.3
Model explains about 90% of the variation in selling price
F-value indicates significance
Low p-values indicate each variable is significant
Ŷ = 121,658 + 56.43X1 - 3,962X2 + 33,162X3 + 47,369X4
MODEL BUILDING
The best model is a statistically significant model with a high r2 and few variables
As more variables are added to the model, the r2-value usually increases
For this reason, the adjusted r2 value is often used to determine the usefulness of an additional variable
The adjusted r2 takes into account the number of independent variables in the model
MODEL BUILDING
The formula for r2:

r2 = SSR / SST = 1 - SSE / SST

The formula for adjusted r2:

Adjusted r2 = 1 - [SSE / (n - k - 1)] / [SST / (n - 1)]
As the number of variables increases, the adjusted r2 gets smaller unless the increase due to the new variable is large enough to offset the change in k
MODEL BUILDING
In general, if a new variable increases the adjusted r2, it should probably be included in the model
In some cases, variables contain duplicate information
When two independent variables are correlated, they are said to be collinear
When more than two independent variables are correlated, multicollinearity exists
When multicollinearity is present, hypothesis tests for the individual coefficients are not valid but the model may still be useful
NONLINEAR REGRESSION
In some situations, variables are not linear
Transformations may be used to turn a nonlinear model into a linear model
[Scatter plots: a linear relationship vs. a nonlinear relationship]
EX: COLONEL MOTORS
The engineers want to use regression analysis to improve fuel efficiency
They have been asked to study the impact of weight on miles per gallon (MPG)
MPG   WEIGHT (1,000 LBS.)   MPG   WEIGHT (1,000 LBS.)
12    4.58                  20    3.18
13    4.66                  23    2.68
15    4.02                  24    2.65
18    2.53                  33    1.70
19    3.09                  36    1.95
19    3.11                  42    1.92

Table 4.6
COLONEL MOTORS
Figure 4.6A: Scatter plot of MPG vs. weight (1,000 lb.) with a fitted linear model

Linear model: Ŷ = b0 + b1X1
COLONEL MOTORS
A useful model, with a small p-value for the F-test of significance and a good r2 value
COLONEL MOTORS
Figure 4.6B: Scatter plot of MPG vs. weight (1,000 lb.) with a fitted nonlinear (quadratic) model

Nonlinear model: MPG = b0 + b1(weight) + b2(weight)²
COLONEL MOTORS
The nonlinear model is a quadratic model
The easiest way to work with this model is to develop a new variable:

X2 = (weight)²

This gives us a model that can be solved with linear regression software:

Ŷ = b0 + b1X1 + b2X2
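The trick of adding a squared variable can be sketched with the Table 4.6 data. numpy is an assumption here; the point is that once X2 = weight² is added as a column, an ordinary linear least-squares solver fits the quadratic model:

```python
import numpy as np

# Table 4.6 data: MPG and weight (1,000 lbs)
mpg = np.array([12, 13, 15, 18, 19, 19, 20, 23, 24, 33, 36, 42])
weight = np.array([4.58, 4.66, 4.02, 2.53, 3.09, 3.11,
                   3.18, 2.68, 2.65, 1.70, 1.95, 1.92])

# New variable X2 = weight**2 turns the quadratic model into a
# linear regression in two variables
X = np.column_stack([np.ones(len(mpg)), weight, weight**2])
b0, b1, b2 = np.linalg.lstsq(X, mpg, rcond=None)[0]
print(f"MPG_hat = {b0:.1f} + {b1:.1f}*weight + {b2:.1f}*weight^2")
```

The fitted coefficients should land close to the model on the next slide: a negative linear term (heavier cars get fewer MPG) and a small positive quadratic term.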
COLONEL MOTORS
A better model, with a smaller p-value for the F-test of significance and a larger adjusted r2 value:

Ŷ = 79.8 - 30.2X1 + 3.4X2
CAUTIONS AND PITFALLS
If the assumptions are not met, the statistical test may not be valid
Correlation does not necessarily mean causation
Multicollinearity makes interpreting coefficients problematic, but the model may still be good
Using a regression model beyond the range of X is questionable; the relationship may not hold outside the sample data
CAUTIONS AND PITFALLS
t-tests for the intercept (b0) may be ignored as this point is often outside the range of the model
A linear relationship may not be the best relationship, even if the F-test returns an acceptable value
A nonlinear relationship can exist even if a linear relationship does not
Just because a relationship is statistically significant doesn't mean it has any practical value