ibm401 lecture 5
TRANSCRIPT
QUANTITATIVE ANALYSIS FOR BUSINESS
Lecture 5
August 9th, 2010
MULTIPLE REGRESSION ANALYSIS
Multiple regression models are extensions to the simple linear model and allow the creation of models with several independent variables
Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where
Y = dependent variable (response variable)
Xi = ith independent variable (predictor or explanatory variable)
β0 = intercept (value of Y when all Xi = 0)
βi = coefficient of the ith independent variable
k = number of independent variables
ε = random error
MULTIPLE REGRESSION ANALYSIS
To estimate these values, a sample is taken and the following equation is developed:

Ŷ = b0 + b1X1 + b2X2 + … + bkXk

where
Ŷ = predicted value of Y
b0 = sample intercept (an estimate of β0)
bi = sample coefficient of the ith variable (an estimate of βi)
INTERPRETING MULTIPLE REGRESSION RESULTS
Intercept – the value of the dependent variable when the independent variables are all equal to zero
Each slope coefficient – the estimated change in the dependent variable for a one-unit change in that independent variable, holding the other independent variables constant
These are sometimes called "partial slope coefficients"
INTERPRETING EXAMPLE
10-year real earnings growth of the S&P 500 (EG10)
Intercept term: if the dividend payout ratio (PR) is zero and the slope of the yield curve (YC) is zero, we would expect the subsequent 10-year real earnings growth rate to be -11.6%
Slope coefficient of PR: if the payout ratio increases by 1%, we would expect the subsequent 10-year earnings growth rate to increase by 0.25%, holding YC constant
Slope coefficient of YC: if the yield curve slope increases by 1%, we would expect the subsequent 10-year earnings growth rate to increase by 0.14%, holding PR constant
HYPOTHESIS TESTING OF REGRESSION COEFFICIENTS
t-statistic – used to test the significance of an individual coefficient in a multiple regression
The t-statistic has n - k - 1 degrees of freedom:

t = (b̂j - bj) / s_b̂j

where the numerator is the estimated regression coefficient minus the hypothesized value, and s_b̂j is the coefficient standard error of bj
EX: TESTING THE STATISTICAL SIGNIFICANCE OF A REGRESSION COEFFICIENT
Test the statistical significance of the independent variable PR in the real earnings growth example at the 10% significance level.

            Coefficient   Standard Error
Intercept   -11.6%        1.657%
PR          0.25          0.032
YC          0.14          0.28

Data based on 46 observations
EX: TESTING THE STATISTICAL SIGNIFICANCE OF A REGRESSION COEFFICIENT
We are testing the following hypotheses:

H0: bPR = 0
Ha: bPR ≠ 0

The 10% two-tailed critical t-value with 43 degrees of freedom (46 - 2 - 1) is approximately 1.68. We should reject the null hypothesis if the t-statistic is greater than 1.68 or less than -1.68.

t = (b̂j - bj) / s_b̂j = 0.25 / 0.032 = 7.8

Since 7.8 is greater than 1.68, we can reject the null hypothesis and conclude that the PR regression coefficient is statistically significant at the 10% significance level
EXAMPLE: JENNY WILSON REALTY
Jenny Wilson wants to develop a model to determine the suggested listing price for houses based on the size and age of the house:

Ŷ = b0 + b1X1 + b2X2

where
Ŷ = predicted value of the dependent variable (selling price)
b0 = Y intercept
X1 and X2 = values of the two independent variables (square footage and age) respectively
b1 and b2 = slopes for X1 and X2 respectively
She selects a sample of houses that have sold recently and records the data shown in Table 4.5
JENNY WILSON REALTY

SELLING PRICE ($)   SQUARE FOOTAGE   AGE   CONDITION
95,000              1,926            30    Good
119,000             2,069            40    Excellent
124,800             1,720            30    Excellent
135,000             1,396            15    Good
142,000             1,706            32    Mint
145,000             1,847            38    Mint
159,000             1,950            27    Mint
165,000             2,323            30    Excellent
182,000             2,285            26    Mint
183,000             3,752            35    Good
200,000             2,300            18    Good
211,000             2,525            17    Good
215,000             3,800            40    Excellent
219,000             1,740            12    Mint

Table 4.5
JENNY WILSON REALTY

Ŷ = 146,631 + 44X1 - 2,899X2
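The fitted equation can be reproduced directly from Table 4.5 with ordinary least squares. numpy is an assumption here (the lecture does not name software), and the printed coefficients should come out close to the slide's values:

```python
import numpy as np

# Table 4.5 data: selling price, square footage, age
price = np.array([95000, 119000, 124800, 135000, 142000, 145000, 159000,
                  165000, 182000, 183000, 200000, 211000, 215000, 219000])
sqft = np.array([1926, 2069, 1720, 1396, 1706, 1847, 1950,
                 2323, 2285, 3752, 2300, 2525, 3800, 1740])
age = np.array([30, 40, 30, 15, 32, 38, 27, 30, 26, 35, 18, 17, 40, 12])

# Design matrix with an intercept column; least-squares fit
X = np.column_stack([np.ones(len(price)), sqft, age])
b0, b1, b2 = np.linalg.lstsq(X, price, rcond=None)[0]
print(f"price_hat = {b0:,.0f} + {b1:.2f}*sqft + {b2:,.0f}*age")
```

Note the signs: larger houses raise the predicted price, older houses lower it, which is what the slide's equation encodes.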
EVALUATING MULTIPLE REGRESSION MODELS
Evaluation is similar to simple linear regression models
The p-value for the F-test and r2 are interpreted the same way
The hypothesis is different because there is more than one independent variable
The F-test investigates whether all the coefficients are equal to 0
p-value – the smallest level of significance for which the null hypothesis can be rejected
If p-value < significance level: reject the null hypothesis
If p-value > significance level: the null hypothesis cannot be rejected
EVALUATING MULTIPLE REGRESSION MODELS
To determine which independent variables are significant, a test is performed for each variable:

H0: βi = 0
H1: βi ≠ 0

The test statistic is calculated, and if the p-value is lower than the level of significance (α), the null hypothesis is rejected
EX: INTERPRETING P-VALUES
Given the following regression results, determine which regression parameters for the independent variables are statistically significantly different from zero at the 1% significance level, assuming the sample size is 60.

            Coefficient   Standard Error   t-statistic   p-value
Intercept   0.40          0.40             1.0           0.3215
X1          8.2           2.05             4.0           0.0002
X2          0.40          0.18             2.2           0.0319
X3          -1.80         0.56             -3.2          0.0022

An independent variable is statistically significant if its p-value is less than 1%, or 0.01: X1 and X3 are statistically significantly different from zero
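The reported p-values can be recovered from the t-statistics and the degrees of freedom. This sketch uses scipy (an assumption, not part of the lecture):

```python
from scipy import stats

# Two-tailed p-values from the reported t-statistics (n = 60, k = 3)
df = 60 - 3 - 1  # 56 degrees of freedom
t_stats = {"Intercept": 1.0, "X1": 4.0, "X2": 2.2, "X3": -3.2}

# p = 2 * P(T > |t|) for a two-tailed test
p_values = {name: 2 * stats.t.sf(abs(t), df) for name, t in t_stats.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}, significant at 1%: {p < 0.01}")
```

Only X1 and X3 come out below 0.01, matching the slide's conclusion.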
F-STATISTIC
The F-test assesses how well the set of independent variables, as a group, explains the variation of the dependent variable
The F-statistic is used to test whether at least one of the independent variables explains a significant portion of the variation of the dependent variable:

H0: b1 = b2 = b3 = b4 = 0
Ha: at least one bj ≠ 0

F-STATISTIC
The F-statistic is calculated as

F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))

where
SSR = sum of squares of the regression
SSE = sum of squares of the errors
MSR = mean regression sum of squares
MSE = mean squared error

Reject H0 if F-statistic > Fc (critical value)
EX: CALCULATING AND INTERPRETING F-STATISTIC An analyst runs a regression of monthly
value-stock returns on five independent variables over 60 months. The total sum of squares is 460, and the sum of squared errors is 170. Test the null hypothesis at the 5% significance level that all five of the independent variables are equal to zero
The critical F-value for 5 and 54 degrees of freedom at 5% significance level is approximately 2.40
EX: CALCULATING AND INTERPRETING F-STATISTIC
The null and alternative hypotheses are:

H0: b1 = b2 = b3 = b4 = b5 = 0
Ha: at least one bj ≠ 0

Calculations:

SSR = SST - SSE = 460 - 170 = 290
MSR = SSR / k = 290 / 5 = 58.0
MSE = SSE / (n - k - 1) = 170 / (60 - 5 - 1) = 3.15
F = MSR / MSE = 58.0 / 3.15 = 18.41

F-statistic > F-critical, so we reject the null hypothesis: at least one independent variable is significantly different from zero
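The same calculation in Python, with scipy (an assumption) supplying the critical F-value instead of a table lookup:

```python
from scipy import stats

# F-test from the example: SST = 460, SSE = 170, n = 60, k = 5
n, k = 60, 5
sst, sse = 460.0, 170.0
ssr = sst - sse                 # 290
msr = ssr / k                   # 58.0
mse = sse / (n - k - 1)         # ≈ 3.15
f_stat = msr / mse              # ≈ 18.4
f_crit = stats.f.ppf(0.95, k, n - k - 1)   # ≈ 2.4 at the 5% level
print(f_stat, f_crit, f_stat > f_crit)
```

Since 18.4 exceeds the critical value of about 2.4, the null hypothesis is rejected, as on the slide.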
EX: JENNY WILSON REALTY
The model is statistically significant: the p-value for the F-test is 0.002
r2 = 0.6719, so the model explains about 67% of the variation in selling price (Y)
But the F-test is for the entire model, and we can't tell if one or both of the independent variables are significant
By calculating the p-value of each variable, we can assess the significance of the individual variables
Since the p-values for X1 (square footage) and X2 (age) are both less than the significance level of 0.05, both null hypotheses can be rejected
COEFFICIENT OF DETERMINATION (R2)
The multiple coefficient of determination, R2, can be used to test the overall effectiveness of the entire set of independent variables in explaining the dependent variable:

R2 = SSR / SST = (SST - SSE) / SST = (Total variation - Unexplained variation) / Total variation
ADJUSTED R2
Unfortunately, R2 by itself may not be a reliable measure of the multiple regression model
R2 almost always increases as variables are added to the model
We need to take new variables into account:

Ra2 = 1 - [(n - 1) / (n - k - 1)] (1 - R2)

where
n = number of observations
k = number of independent variables
Ra2 = adjusted R2
ADJUSTED R2
Whenever there is more than one independent variable, Ra2 is less than or equal to R2
So adding new variables to the model will increase R2 but may increase or decrease Ra2
Ra2 may be less than 0 if R2 is low enough
EX: ADJUSTED R2
An analyst runs a regression model of monthly value-stock returns on five independent variables over 60 months. The total sum of squares for the regression is 460, and the sum of squared errors is 170. Calculate the adjusted R2.

R2 = (460 - 170) / 460 = 0.63 = 63%
Ra2 = 1 - [(60 - 1) / (60 - 5 - 1)] (1 - 0.63) = 0.596 = 59.6%

The R2 of 63% suggests that the five independent variables together explain 63% of the variation in monthly value-stock returns
EX: ADJUSTED R2
Suppose the analyst now adds four more independent variables to the regression and the R2 increases to 65%. Which model would the analyst most likely prefer?

Ra2 = 1 - [(60 - 1) / (60 - 9 - 1)] (1 - 0.65) = 0.587 = 58.7%

The analyst would prefer the first model because its adjusted R2 is higher and the model has five independent variables as opposed to nine
BINARY OR DUMMY VARIABLES
Binary (or dummy or indicator) variables are special variables created for qualitative data
A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise
The number of dummy variables must equal one less than the number of categories of the qualitative variable
Jenny's qualitative category in Table 4.5 is Condition: Excellent, Mint, or Good
JENNY WILSON REALTY
Jenny believes a better model can be developed if she includes information about the condition of the property
X3 = 1 if the house is in excellent condition, 0 otherwise
X4 = 1 if the house is in mint condition, 0 otherwise

Two dummy variables are used to describe the three categories of condition
No variable is needed for "good" condition: if both X3 and X4 = 0, the house must be in good condition
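The coding rule can be written as a tiny helper (the function name is just for illustration):

```python
# Dummy coding for the three-category "condition" variable:
# two dummies, with "good" as the omitted base category
def condition_dummies(condition):
    x3 = 1 if condition == "Excellent" else 0  # excellent condition
    x4 = 1 if condition == "Mint" else 0       # mint condition
    return x3, x4

print(condition_dummies("Excellent"))  # (1, 0)
print(condition_dummies("Mint"))       # (0, 1)
print(condition_dummies("Good"))       # (0, 0)
```

With three categories, exactly two dummies are created; "Good" is fully identified by X3 = X4 = 0, so a third dummy would be redundant.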
JENNY WILSON REALTY
Program 4.3
Model explains about 90% of the variation in selling price
F-value indicates significance
Low p-values indicate each variable is significant
Ŷ = 121,658 + 56.43X1 - 3,962X2 + 33,162X3 + 47,369X4
MODEL BUILDING
The best model is a statistically significant model with a high r2 and few variables
As more variables are added to the model, the r2-value usually increases
For this reason, the adjusted r2 value is often used to determine the usefulness of an additional variable
The adjusted r2 takes into account the number of independent variables in the model
MODEL BUILDING
The formula for r2:

r2 = SSR / SST = 1 - SSE / SST

The formula for adjusted r2:

Adjusted r2 = 1 - [SSE / (n - k - 1)] / [SST / (n - 1)]
As the number of variables increases, the adjusted r2 gets smaller unless the increase due to the new variable is large enough to offset the change in k
MODEL BUILDING
In general, if a new variable increases the adjusted r2, it should probably be included in the model
In some cases, variables contain duplicate information
When two independent variables are correlated, they are said to be collinear
When more than two independent variables are correlated, multicollinearity exists
When multicollinearity is present, hypothesis tests for the individual coefficients are not valid but the model may still be useful
NONLINEAR REGRESSION
In some situations, variables are not linear
Transformations may be used to turn a nonlinear model into a linear model
[Scatter plots: a linear relationship vs. a nonlinear relationship]
EX: COLONEL MOTORS
The engineers want to use regression analysis to improve fuel efficiency
They have been asked to study the impact of weight on miles per gallon (MPG)
MPG   WEIGHT (1,000 LBS.)   MPG   WEIGHT (1,000 LBS.)
12    4.58                  20    3.18
13    4.66                  23    2.68
15    4.02                  24    2.65
18    2.53                  33    1.70
19    3.09                  36    1.95
19    3.11                  42    1.92

Table 4.6
COLONEL MOTORS
Figure 4.6A: Scatter plot of MPG vs. weight (1,000 lb.) with a fitted linear model

Linear model: Ŷ = b0 + b1X1
COLONEL MOTORS
A useful model, with a small p-value for the F-test of significance and a good r2 value
COLONEL MOTORS
Figure 4.6B: Scatter plot of MPG vs. weight (1,000 lb.) with a fitted nonlinear (quadratic) model

Nonlinear model: MPG = b0 + b1(weight) + b2(weight)²
COLONEL MOTORS
The nonlinear model is a quadratic model
The easiest way to work with this model is to develop a new variable:

X2 = (weight)²

This gives us a model that can be solved with linear regression software:

Ŷ = b0 + b1X1 + b2X2
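The trick of adding a squared variable can be sketched with the Table 4.6 data. numpy is an assumption here; the point is that once X2 = weight² is added as a column, an ordinary linear least-squares solver fits the quadratic model:

```python
import numpy as np

# Table 4.6 data: MPG and weight (1,000 lbs)
mpg = np.array([12, 13, 15, 18, 19, 19, 20, 23, 24, 33, 36, 42])
weight = np.array([4.58, 4.66, 4.02, 2.53, 3.09, 3.11,
                   3.18, 2.68, 2.65, 1.70, 1.95, 1.92])

# New variable X2 = weight**2 turns the quadratic model into a
# linear regression in two variables
X = np.column_stack([np.ones(len(mpg)), weight, weight**2])
b0, b1, b2 = np.linalg.lstsq(X, mpg, rcond=None)[0]
print(f"MPG_hat = {b0:.1f} + {b1:.1f}*weight + {b2:.1f}*weight^2")
```

The fitted coefficients should land close to the model on the next slide: a negative linear term (heavier cars get fewer MPG) and a small positive quadratic term.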
COLONEL MOTORS
A better model, with a smaller p-value for the F-test of significance and a larger adjusted r2 value:

Ŷ = 79.8 - 30.2X1 + 3.4X2
CAUTIONS AND PITFALLS
If the assumptions are not met, the statistical test may not be valid
Correlation does not necessarily mean causation
Multicollinearity makes interpreting coefficients problematic, but the model may still be good
Using a regression model beyond the range of X is questionable; the relationship may not hold outside the sample data
CAUTIONS AND PITFALLS
t-tests for the intercept (b0) may be ignored as this point is often outside the range of the model
A linear relationship may not be the best relationship, even if the F-test returns an acceptable value
A nonlinear relationship can exist even if a linear relationship does not
Just because a relationship is statistically significant doesn't mean it has any practical value