
Module: 34, Multiple Regression Model

Paper: 15, QUANTITATIVE TECHNIQUES FOR MANAGEMENT DECISIONS


Items            Description of Module

Subject Name     Management
Paper Name       Quantitative Techniques for Management
Module Title     Multiple Regression Model
Module Id        Module No. 34
Pre-requisites   Basic knowledge of correlation and simple regression analysis
Objectives       To appreciate a multiple regression model
Keywords         Multiple regression, regression coefficients, prediction, multicollinearity

QUADRANT-I

Module: 34, Multiple Regression Model

1. Learning Outcome

2. Introduction to Multiple Regression Model and Its Applications

3. Developing a Multiple Linear Regression Model

4. Estimating the Parameters of Proposed Population Regression Model

5. Testing Goodness of Fit of Model

6. Assumptions of Multiple Regression Model

7. Summary


1. Learning Outcome:

After completing this module the students will be able to:

• Develop a multiple regression model

• Understand hypothesis tests about the regression coefficients

• Calculate a prediction interval for the dependent variable

• Analyze residuals in order to check the validity of the model

• Calculate the multiple coefficient of determination

• Perform a Durbin-Watson test for autocorrelation in residuals

• Understand the concept of multicollinearity

• Test the goodness of fit of the model

2. Introduction to Multiple Linear Regression

The term “Regression” was first used by Sir Francis Galton in 1877 while studying the

relationship between the height of fathers and sons. The dictionary meaning of regression is

the act of returning back to the average. According to Morris Hamburg, regression analysis refers to the methods by which estimates are made of the values of one variable from a knowledge of the values of one or more other variables, and to the measurement of the errors involved in this estimation process. Ya Lun Chou elaborates further, adding that regression analysis basically attempts to establish the nature of the relationship between the variables and thereby provides a mechanism for prediction/estimation.

In simple linear regression analysis, we basically attempt to predict the value of one variable

from known values of another variable. The variable that is used to estimate the variable of

interest is known as “independent variable” or “explanatory variable” and the variable which

we are trying to predict is termed as “dependent variable” or “explained variable”. Usually,

dependent variable is denoted by Y and independent variable as X. Having established a

strong correlation between two variables, we may predict the value of dependent variable

with the help of the given value of independent variable. For example, if we know that yield

of wheat and amount of rainfall are closely related to each other, we can estimate the amount

of rainfall to achieve a particular wheat production level. This estimation can be made easily

using simple regression analysis that reveals average relationship between the variables.


In real life, we come across many situations where a large number of variables may be

affecting one dependent variable. For example, the production of wheat may be affected by

rainfall, use of urea, use of fertilizers, method of farming, humidity, temperature and so forth.

In this case, if we take cognizance of only a single independent variable in explaining the variations of the dependent variable (production of wheat), then the magnitude of errors in the results would be very high. In this situation, it is suggested to take all important independent variables into account in the estimation equation.

In multiple regression analysis, we predict the value of Y for the given values of two or more independent variables Xi (i = 1, 2, …, n). The analysis is termed 'multiple' because there are two or more independent variables, and 'linear' because it assumes a linear relationship between the dependent and independent variables. The population regression model of a dependent variable on n independent variables, showing an average relationship, is proposed below:

Y = β0 + β1X1 + β2X2 + β3X3 + … + βnXn + ε ………………………….(1)

In this regression equation, Y is the dependent variable and X1, X2, X3, …, Xn are the independent variables. β0 is the Y-intercept of the regression surface, and β1, β2, β3, …, βn represent the slopes of the regression surface (also known as the response surface) with respect to the Xi. One can easily understand from the following figure that any three points (P, Q and R), or an intercept (β0) and the coefficients of x1 and x2 (β1 and β2), define a plane in three-dimensional space.

[Figure: A regression plane in three-dimensional space with axes x1, x2 and y. Any three points (P, Q and R), or an intercept (β0) and the coefficients of x1 and x2 (β1 and β2), define the plane.]
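To make the plane-plus-error idea concrete, here is a minimal Python sketch (all parameter values are hypothetical, not from this module) that generates observations scattered about such a plane:

```python
# Minimal sketch: data generated from Y = b0 + b1*X1 + b2*X2 + eps,
# i.e. a plane in (x1, x2, y) space plus random error.
import numpy as np

rng = np.random.default_rng(0)
n = 100
beta0, beta1, beta2 = 5.0, 2.0, -1.5        # hypothetical population parameters

x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
eps = rng.normal(0, 1.0, n)                 # pure random error term

y = beta0 + beta1 * x1 + beta2 * x2 + eps   # observations around the plane
print(y[:5])
```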


Parameters: Unknown constants in a model are called parameters. For instance, β0, β1, β2,

β3,…… βn are parameters in the above model.

Estimators: An estimator is a rule, formula, or algorithm that is applied to the data in a specific sample to calculate an estimate of a population parameter. β0 + β1X1 + β2X2 + β3X3 + … + βnXn is the estimator in the above model.

Estimates: An estimate is a number or specific value computed or obtained through an

estimator.

Applications of Multiple Regression analysis:

Regression analysis is a specialized branch of statistical theory and is of immense use in almost all scientific disciplines. It is particularly useful in economics, as it is predominantly used in estimating the relationships among the various economic variables that comprise the epitome of economic life. Its applications extend to almost all the natural, physical and social sciences. It attempts to accomplish mainly the following:

With the help of a multiple regression model, we may predict the unknown value of the dependent variable for given values of the independent variables;

Multiple regression analysis, used for estimation, never makes a one hundred percent correct estimate for the given values of the independent variables. There is always some difference between the actual value and the estimated value of the dependent variable, known as 'estimation error'. Multiple regression analysis also computes this error, as the standard error of estimate, and thereby reveals the accuracy of prediction. The amount of error depends upon the spread of the scatter diagram, which is prepared by plotting the actual observations of the Y, X1 and X2 variables. In the following figure, one can easily note that the estimation error is larger when the actual observations are more spread out, and smaller when they are less so.

Multiple regression analysis also depicts the relationship/ association between the

dependent variable Y and independent variables Xi. Multiple coefficient of determination

(R2) assesses the variance in dependent variable that has been accounted for by the

regression equation (estimator). In general, the greater the value of R2 the better is the fit

and the more useful the regression equation as a predictive instrument.

[Figure: The population regression model Y = β0 + β1X1 + β2X2 + … + βnXn + ε shown as a response surface in three-dimensional space (axes x1, x2, y), where β0 is the Y-intercept of the regression surface and βi, i = 1, 2, …, n, are its slopes with respect to the Xi.]

Multiple Coefficient of Determination (R2) = Explained Variation/ Total Variation

For example, if the value of R² = 0.60, i.e. R² = 60/100, it may be interpreted as follows: out of the total variations (100 percent), 60 percent of the variations in Y are explained by the Xi in the suggested regression model.

3. Developing a Multiple Regression Model

A statistical model is a set of mathematical formulas and assumptions related to a real world

situation. We wish to develop our regression model in such a way that it explains the process underlying our data as much as possible. Since it is almost impossible for any model to explain everything, owing to the inherent uncertainty of the real world, we will always have some remaining errors, which arise from the many unknown outside factors affecting the process generating our data.

A good statistical model is parsimonious: it uses a few mathematical terms that explain the real situation as much as possible. The model attempts to capture the systematic behaviour of our data set and leaves out the factors that are nonsystematic and cannot be predicted/estimated. The following figure depicts a well-defined statistical model.


The errors (ε), also termed residuals, are that part of our data-generating process that cannot be estimated by the model because it is not systematic. Hence, the errors (ε) constitute a random component in the model. It is easy to understand that a good statistical model splits our data process into two components: a systematic component, which is well explained by the mathematical terms contained in the model, and a pure random component, whose source is absolutely unknown and which therefore cannot be estimated by the model. The effectiveness of a statistical model depends upon the amount of error associated with it. There is an inverse relationship between the effectiveness of the model and the amount of error: the less the error, the more effective the model, and the more the error, the less effective the model.

Proposed Population Multiple Regression Model

As discussed above, the population regression model may be given as below:

Y= β0 + β1X1+ β2X2 + ε ………………………….(2)

Here;

Y is the dependent variable and X1, X2 are independent variables. These variables make a response surface (the analogue of the estimation line in simple regression). The above model contains two components: first, a systematic (nonrandom) component, which is the response surface (β0 + β1X1 + β2X2) itself, and second, a pure random component, the error term ε. The systematic component is the equation for the mean of Y, given X. We may represent the conditional mean of Y, given X, by E(Y) as below:

[Figure: Data = Systematic component + Random errors, as split by a statistical model.]

E(Y) = β0 + β1X1 + β2X2 ………………………….(3)

One may note that E(Y) represents the expected value of Y. Taking the expected value of equation (2) for given values of X1 and X2 makes the error term (ε) vanish, since its mean is zero.

Comparing equations (2) and (3), we can see that each value of Y comprises the average Y for the given values of X1 and X2, plus a random error. Thus, the actual value of Y equals the average Y conditional on X, plus a random error ε:

Y = Average Y for given Xi + Error

Until now, we have described the population model which is assumed true based on the

relationship between X1, X2 and Y. Now we wish to get an idea of this unknown population relationship and estimate it from sample information. For this, we take a random sample of observations on the variables X1, X2 and Y, and then compute the coefficients b0, b1, b2 of the sample regression surface, which are analogous to the population parameters β0, β1 and β2. This is done with the help of the method of least squares, as discussed below.

4. Estimating Parameters of the Proposed Model

We wish to estimate the parameters (β0, β1 and β2) of the proposed model so as to estimate the value of Y for the given values of Xi. For our model to be effective, we wish to keep its random component at a minimum. For this, we use the 'method of least squares', which chooses the estimates so that the sum of squared errors is minimized; as a consequence, the average of the estimated Y values equals the average of the actual Y values. Under the standard assumptions, the method of least squares yields the best linear unbiased estimators (BLUE).

Now we will use Ŷ to denote the individual estimated values, which lie on the estimation surface for given values of Xi. The best-fitted estimation equation will be as follows:

In a multiple regression model, the least-squares estimators minimize the sum of squared

errors from the estimated regression plane.

Ŷ = b0 + b1X1 + b2X2 + …………..+ bnXn


Where Ŷ is the estimated value of Y, the value lying on the estimated regression surface. The terms b0, b1, b2, b3, …, bn are the least-squares estimates of β0, β1, β2, …, βn (the population parameters).

The actual value of Y is the estimated value plus an error:

Y = b0 + b1X1 + b2X2 + b3X3 + ……….+ bnXn + ε

Minimizing the sum of squared errors with respect to the estimated coefficients b0, b1 and b2 yields the following normal equations:

Ʃy = nb0 + b1Ʃx1 + b2Ʃx2
Ʃx1y = b0Ʃx1 + b1Ʃx1² + b2Ʃx1x2
Ʃx2y = b0Ʃx2 + b1Ʃx1x2 + b2Ʃx2²

We can get the values of b0, b1 and b2 (the sample coefficients) by solving these three linear equations. These values provide the estimates of β0, β1 and β2 (the population parameters).
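As an illustration, the following minimal Python sketch (the sample values are made up) builds and solves these three normal equations with numpy; each row of the matrix corresponds to one equation above:

```python
# Sketch: estimating b0, b1, b2 by solving the three normal equations.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical sample data
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y  = np.array([5.0, 6.0, 9.0, 10.0, 14.0])
n  = len(y)

# Coefficient matrix and right-hand side of the normal equations
A = np.array([
    [n,        x1.sum(),      x2.sum()],
    [x1.sum(), (x1**2).sum(), (x1*x2).sum()],
    [x2.sum(), (x1*x2).sum(), (x2**2).sum()],
])
rhs = np.array([y.sum(), (x1*y).sum(), (x2*y).sum()])

b0, b1, b2 = np.linalg.solve(A, rhs)        # least-squares estimates
print(b0, b1, b2)
```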

Search Procedure for Multiple Regressions:

The main objective of multiple regression analysis is to study the impact of various predictors

(independent variables) on the dependent variable. This is done by measuring the amount of

variations in dependent variable that can be explained by a group of independent variables.

There are various regression selection criteria that assist us in evaluating the independent

variables, thus improving the efficiency and effectiveness of the analysis. In a search procedure, more than one regression model is developed for a given data set, and all models are compared on the basis of the selection criteria of the chosen procedure.

All Possible Regressions:

When we have k independent variables, this procedure considers all possible regressions. The number of possible regressions is (2^k − 1), and the procedure takes all of them into account. For example, when k = 2, there are 3 possible regressions, and when k = 4, there are 15.
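A short sketch of the counting involved (the predictor names are hypothetical):

```python
# Sketch: all (2**k - 1) non-empty subsets of k predictors,
# one candidate regression per subset.
from itertools import combinations

predictors = ["x1", "x2", "x3", "x4"]            # k = 4
subsets = [c for r in range(1, len(predictors) + 1)
           for c in combinations(predictors, r)]
print(len(subsets))                              # 2**4 - 1 = 15 candidate models
```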


Entry Method of Regression:

This is an appropriate method particularly when we have a small number of independent variables. In this method, all independent variables are entered into the multiple regression equation at the same time. Each independent variable is evaluated as if it had been entered after all other independent variables. The evaluation of each predictor is made by analyzing its individual contribution to explaining the variations in the dependent variable.

Methods of Selection:

This method enables us to make a selection of predictors (independent variables) which are

actually necessary in regression equation and explain almost as much of the variations in

dependent variable as is explained by the complete set. Thus, it helps us in designing an

optimal regression equation along with level of importance of each predictor. It also helps us

in studying the effect of one independent variable by removing the effect of other

independent variables.

Mainly, there are three selection procedures that tend to give the most appropriate regression equation, namely stepwise selection, forward selection and backward elimination; a sketch of forward selection follows the descriptions below.

Stepwise selection: involves, at each step, an analysis of the contribution of the independent variables entered previously into the equation. This enables us to reassess the contribution of the previous variables once another independent variable has been added. We can then decide whether to retain or delete independent variables from the regression equation on the basis of their statistical contribution.

Forward selection: always starts with an empty regression equation, and independent variables are then added one at a time. First, the independent variable that has the highest correlation with the dependent variable is added, then the second, and so forth. Independent variables, once in the equation, remain there.

Backward elimination: is just the opposite of forward selection. All independent variables are entered into the equation first, and then those that do not contribute to the regression equation are deleted one by one.
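The following hedged sketch illustrates the forward-selection idea on simulated data; the R²-improvement threshold of 0.01 is an arbitrary illustrative stopping rule, not a prescription from this module:

```python
# Sketch of forward selection: start empty and, at each step, add the
# predictor that most improves R-squared; stop when the gain is negligible.
import numpy as np

def r_squared(X, y):
    """R-squared of a least-squares fit of y on X (with an intercept)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ coef
    return 1 - (resid**2).sum() / ((y - y.mean())**2).sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                     # 4 candidate predictors
y = 2 * X[:, 0] - X[:, 2] + rng.normal(size=50)  # only two really matter

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    gains = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
    best = max(gains, key=gains.get)
    if selected and gains[best] - r_squared(X[:, selected], y) < 0.01:
        break                                    # negligible improvement: stop
    selected.append(best)
    remaining.remove(best)
print("selected predictors:", selected)
```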

5. Testing Goodness of Fit of Model:

Once we have developed a multiple regression model and obtained the estimates of the regression coefficients β0, β1, β2, etc., we also need to check its goodness/adequacy. Adequacy


can be ascertained by testing the significance of the regression coefficients, computing the standard error of estimate, determining the multiple coefficient of determination, testing the overall significance of the model, and performing residual analysis to verify the assumptions of the model.

Test for Significance of Coefficients:

Here, we also need to test the significance of the parameters β0, β1 and β2 in order to decide whether they should be retained in the proposed model or not. For that, we may consider the null and alternative hypotheses as follows:

H0: β0 = 0

Ha: β0 ≠ 0

This means that if the value of b0 is significantly different (higher or lower) from zero, we may reject the null hypothesis. This implies that β0 is significant and must be included in our model. This is actually done using a t-test, with (n − k − 1) degrees of freedom:

t = (b0 − β0) / S.E.(b0)

If β0 is found to be significant at the 5% level of significance, we can say that β0 must be retained in the model, and its value actually lies in the range {b0 ± t · S.E.(b0)}.

Similarly, we can also test the significance for other regression coefficients.
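A minimal sketch of this t-test, assuming scipy is available; the estimate, standard error, n and k below are hypothetical values:

```python
# Sketch: two-sided t test of H0: beta = 0 for one regression coefficient.
from scipy import stats

b, se_b = 7.69, 2.80          # hypothetical estimate and its standard error
n, k = 15, 2                  # observations and independent variables

t = (b - 0) / se_b            # test statistic under H0: beta = 0
p = 2 * stats.t.sf(abs(t), df=n - k - 1)
print(t, p)                   # retain the coefficient at 5% level if p < 0.05
```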

Determination of Standard Error of Estimation:

The next stage in building the model is to assess its reliability. From the discussion made so far, we are able to develop an understanding that a regression plane is a more accurate estimator when the actual observations lie close to it. Here, the 'standard error of estimate' is the tool usually used to check the reliability of the model. The standard error is similar to the standard deviation, which measures the variability or dispersion of given observations about the mean of a data set; analogously, the standard error of estimate (Se) measures the dispersion/variation of the actual observations around the regression plane. It can be computed as below:

Standard Error of Estimate (Se)


Se = √(SSE / (n − k − 1))

Where
SSE = sum of squared errors
n = number of actual data points (observations)
k = number of independent variables

Total Deviation = Regression Deviation + Error Deviation

SST = SSR + SSE

[Figure: Decomposition of deviations around the regression plane: total deviation (Y − ӯ) = regression deviation (Ŷ − ӯ) + error deviation (Y − Ŷ).]

From the above figure, it is easy to understand that:

Total deviation = (y − ӯ)
Regression deviation (explained by our model) = (Ŷ − ӯ)
Error deviation (not explained by our model) = (y − Ŷ)

SSE = Sum of Squared Errors = Ʃ(y − Ŷ)²
SSR = Sum of Squares due to Regression = Ʃ(Ŷ − ӯ)²
SST = Total Sum of Squares = Ʃ(y − ӯ)²

Se = √(SSE / (n − k − 1)) = √(Ʃ(Y − Ŷ)² / (n − k − 1))

Mean Square Error (MSE) is an unbiased estimator of the variance of the population errors and may be calculated as:

MSE = SSE / (n − k − 1)


Therefore, Se = √MSE

It is obvious that the larger the standard error of estimate, the greater the dispersion of the data points around the estimation plane. On the other hand, if the error is zero, our model is a perfect estimator of Y and all data points will lie exactly on the plane.

If we assume that all observations are normally distributed around the plane, they will lie in the following pattern:

± 1 Se: about 68% of data points
± 2 Se: about 95% of data points
± 3 Se: about 99.7% of data points

Here it is important to note again two important assumptions:

1. The observed values for Y are normally distributed around each estimated value of Y

2. The dispersion of the distribution around each estimated value is the same.

The standard error calculated as above is a good instrument for constructing a prediction interval around an estimated value Ŷ, within which the actual value of Y lies. For instance, we may be about 68% confident that the actual value of Y lies within ± 1 Se of the estimated plane. Since the prediction interval is based on a normal distribution of data points, a larger sample size (n ≥ 30) is required; for small samples we cannot get accurate intervals. One may keep in mind that these prediction intervals are only approximate, as in the sketch below.
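A minimal sketch, under the normality assumption above, computing Se and a rough 95% interval (Ŷ ± 2 Se); all data values are hypothetical:

```python
# Sketch: standard error of estimate and an approximate 95% prediction
# interval around an estimated value of Y.
import numpy as np

y     = np.array([10.0, 12.0, 9.0, 15.0, 11.0, 13.0, 14.0, 10.0])  # actual
y_hat = np.array([10.5, 11.5, 9.5, 14.0, 11.5, 12.5, 13.5, 10.5])  # estimated
k = 2                                          # independent variables

sse = ((y - y_hat)**2).sum()                   # sum of squared errors
se = np.sqrt(sse / (len(y) - k - 1))           # standard error of estimate

y_new = 12.0                                   # some estimated value of Y
print(y_new - 2 * se, y_new + 2 * se)          # approximate 95% interval
```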

Computation of Multiple Coefficient of Determination (R²):

The multiple coefficient of determination (R²) measures the proportion of variations in

dependent variable Y that is explained by a set of independent variables.

R² = SSR / SST = 1 − SSE / SST = Explained Variations / Total Variations

As the value of R² increases, the goodness of fit of the model also increases, as the model tends to be more and more accurate in its estimates of Y. R² may be viewed as a measure of the explanatory power of an estimator: it shows the percentage of variation explained by all the independent variables together.

The Adjusted R2:


If one compares the explanatory power of different regression models, one should use the adjusted R², because it is adjusted for the degrees of freedom, which may differ across models:

Adjusted R² = R² − [K / (n − K − 1)] (1 − R²)

Here, one may take note that:

Adjusted R² is always less than or equal to R².
It is possible to get a negative value of adjusted R².
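A small sketch computing R² and adjusted R² from hypothetical sums of squares, using the formulas above:

```python
# Sketch: multiple coefficient of determination and adjusted R-squared.
ssr, sse = 60.0, 40.0                          # hypothetical SSR and SSE
sst = ssr + sse
n, k = 30, 3                                   # observations, predictors

r2 = ssr / sst                                 # explained / total variation
adj_r2 = r2 - (k / (n - k - 1)) * (1 - r2)     # penalised for extra predictors
print(r2, adj_r2)                              # adj_r2 <= r2 always
```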

Computing F-Statistics:

We use the F-statistic to determine the overall significance of the model. Whenever we perform multiple regression analysis using software, we usually get an ANOVA table like the one below as output.

Source of variation   Sum of squares   Degrees of freedom (df)   Mean squares              F-ratio
Regression            SSR              k                         MSR = SSR / k             F = MSR / MSE
Residual              SSE              n - (k + 1)               MSE = SSE / (n - k - 1)
Total                 SST              n - 1                     SST / (n - 1)

Here, the calculation of the F-ratio is based on the following hypotheses:

H0: β1 = β2 = β3 = … = βk = 0 (there is no relationship between the dependent and independent variables)

Ha: at least one regression coefficient ≠ 0

Actually, in multiple regression, the F-test determines whether at least one independent variable is significantly correlated with the dependent variable. From the above ANOVA table, it is obvious that:

F = MSR / MSE, where MSR = SSR / k and MSE = SSE / (n − k − 1)

Where
n = number of observations
k = number of independent variables


If the value of F is found to be significant, it leads to rejection of H0. It means that the dependent variable is strongly related to at least one independent variable Xi, and our model is significant in estimating the value of Y for the given values of Xi.
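A minimal sketch of the overall F-test, assuming scipy is available; the sums of squares are hypothetical:

```python
# Sketch: overall F test of H0: all slope coefficients are zero.
from scipy import stats

ssr, sse, n, k = 60.0, 40.0, 30, 3             # hypothetical values
msr = ssr / k                                  # mean square due to regression
mse = sse / (n - k - 1)                        # mean square error

F = msr / mse
p = stats.f.sf(F, k, n - k - 1)                # upper-tail probability
print(F, p)                                    # reject H0 at 5% if p < 0.05
```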

6. Assumptions of Multiple Regression Model:

The basic assumptions of the multiple regression model are as follows:

(a) Normality of Errors: Errors must be normally distributed, with zero mean. Symbolically, E[ε] = 0. This can be checked by plotting a histogram of the residuals (their frequency distribution).

(b) Homoscedasticity: This is also known as constant error variance. Symbolically, Var[ε] = σ². It can be checked by plotting the residuals against the fitted values of the dependent variable. If the residuals are randomly scattered around zero, the assumption of constant error variance is met; otherwise it is not.

(c) Uncorrelatedness of Residuals and Independent Variables: The residuals/errors must not be correlated with any independent variable Xi. This can be checked by plotting the residuals against each independent variable separately. If the residuals are randomly scattered around zero, the assumption is met; otherwise it is not.

(d) Independence of Errors: This assumption concerns serial correlation between the values of the error term. Sometimes the value of the error term is correlated with its own values at earlier points in time; this problem is known as autocorrelation. To check for it, we can plot the residuals against time: a visual inspection of this graph may detect the presence of serial correlation. We also have a statistical technique to detect autocorrelation, discussed below.

The Durbin-Watson Test: this test (used for first-order autocorrelation) uses the following test statistic:

DW = Ʃt (et − et−1)² / Ʃt et²

Where
et = residual at time t
et−1 = residual at time (t − 1), i.e. the just previous observation

As a thumb rule, if DW is 2, there is no positive autocorrelation; if it is zero, there is perfect positive autocorrelation; and if it is 4, there is perfect negative autocorrelation.

The Durbin-Watson statistic (d) has a sampling distribution with two critical values, dL and dU:

No autocorrelation if dU < d < (4 − dU)
Positive autocorrelation if 0 < d < dL
Negative autocorrelation if (4 − dL) < d < 4
The test is inconclusive if dL < d < dU or (4 − dU) < d < (4 − dL)

The critical d values are a function of the number of observations, the number of parameters estimated and the level of significance.
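A short sketch computing the Durbin-Watson statistic directly from its definition; the residual series is hypothetical:

```python
# Sketch: Durbin-Watson statistic for first-order autocorrelation.
import numpy as np

e = np.array([0.5, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.3])  # residuals
dw = (np.diff(e)**2).sum() / (e**2).sum()
print(dw)        # values near 2 suggest no positive autocorrelation
```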

All residual plots discussed above are shown in the figure as below:

[Figure: Residual plots for Exports: a normal probability plot of the residuals, residuals versus fitted values, a histogram of the residuals, and residuals versus observation order.]

Violation of Assumptions (Multicollinearity Problem):

When the independent variables exhibit covariation among themselves, this lack of independent movement in the independent variables is called multicollinearity. It simply means that the independent variables have some degree of correlation among themselves.

Effects of Multicollinearity:

Minor changes in the data can produce wide swings in the parameter estimates;

Coefficients may have high standard errors and low significance levels;

Coefficients may have wrong signs or implausible magnitudes.

Therefore, it is necessary to detect multicollinearity before proceeding with multiple regression. It can be detected by:

Studying the correlation among all independent variables;

Regressing each independent variable against other independent variables;

Computing the Variance Inflation Factor (VIF): the ratio 1/(1 − Rk²), where Rk² is the R² obtained by regressing the k-th independent variable on the others, is called the VIF and is an indicator of multicollinearity. The higher it is, the higher the degree of multicollinearity (a computational sketch follows).
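A hedged sketch of the VIF computation via auxiliary regressions, using numpy on simulated, deliberately collinear data:

```python
# Sketch: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the remaining predictors.
import numpy as np

def vif(X):
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        Xd = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(Xd, X[:, j], rcond=None)
        resid = X[:, j] - Xd @ coef
        r2 = 1 - (resid**2).sum() / ((X[:, j] - X[:, j].mean())**2).sum()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(2)
x1 = rng.normal(size=40)
x2 = x1 + rng.normal(scale=0.3, size=40)   # deliberately collinear with x1
X = np.column_stack([x1, x2, rng.normal(size=40)])
print(vif(X))                              # large values flag multicollinearity
```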

The following actions are suggested in case multicollinearity is detected in a data set:

Do nothing

Obtain more or new data

Drop some variables or re-specify the model specification

7. Summary:

Multiple regression analysis is an important technique for developing a model that estimates the value of a dependent variable Y from two or more independent variables Xi. The analysis signifies an average relationship between the dependent and independent variables. It assumes a linear relationship between the variables, expressed as a linear equation. To test the goodness of fit of our proposed model, we may look at the following:

(a) Make a scatter plot of the given observations and fit the regression plane. If the observed data points are highly scattered around the fitted plane, then our model may not fit well.

(b) We can test the significance of the regression coefficients β0, β1, β2, etc. using the t-test. If any parameter is found not significant (that is, not significantly different from zero), it should not be included in our model.

(c) We can ascertain the significance of the overall model by the F-statistic.

F = MSR / MSE = (mean of sum of squares of explained variations) / (mean of sum of squares of unexplained variations)


If the value of F is found to be significant, it means that the dependent variable is strongly related to at least one of the independent variables, and our model fits well.

(d) We can calculate the multiple coefficient of determination (R²): as a rough rule of thumb, if it is greater than 60 percent, our model fits well;

(e) The standard error of estimate is also calculated in order to assess the adequacy and reliability of the model.

(f) The multiple regression model is based on some assumptions. Therefore, we may perform a detailed residual analysis so that we can be sure whether all the necessary assumptions are met.

(g) Multicollinearity is a major problem in multiple regression analysis. It means that the independent variables are correlated with each other, which makes the estimates very unstable. We need to handle this problem carefully before moving ahead.

(h) Many times in multiple regression the dependent variable may not be linearly related to some independent variable. In that case, we first need to transform those variables and then test them for linear correlation. If, after transformation, they are linearly correlated, we can go ahead with multiple regression using the transformed variables. The different types of transformation are beyond the purview of this module.

*******

Case: A paint manufacturing company has found through research that its sales depend mainly upon advertisements in electronic media and sales promotion activities. The company increased its budget on the above two heads considerably and is now interested in seeing the impact of these two initiatives on sales. You are given a random sample of the sales for 15 days. You are required to develop a multiple regression model and predict the impact of the above two techniques on sales.

Also predict the sales when expenditure on advertising in electronic media is Rs 120 thousand and on promotion activities is Rs 60 thousand.


Solution:

The proposed multiple regression model is as below:

Y = β0 + β1X1 + β2X2 + ε

Where

Y = sales of company

X1= Expenditure on advertisement in electronic media

X2 = Expenditure on promotion activities

Substituting the sample data (shown in the data table below) into the three normal equations of Section 4 and solving them yields the sample coefficients b0, b1 and b2, which provide the estimates of β0, β1 and β2 (the population parameters). In practice, this is done with software.

The following table shows the output of SPSS

Predictor     Unstandardized B   Std. Error   Standardized Beta   t value       Sig.
(Constant)    833.9565767        127.8487                         6.52299578    1.93E-05
Adv           7.685798603        2.795796     0.606317935         2.749055591   0.016568
Promotion     -1.733             4.54487      -0.10034            0.465437      0.102409

b0 = 833.956; b1 = 7.685799; b2 = -1.733

Now we will test the significance of the above calculated coefficients so that we can decide whether to include or exclude them in the proposed model. We can do this using the t-test described in this module; here we consider the results obtained directly from the SPSS output.

Days              1     2     3    4    5     6     7     8    9     10    11    12    13   14    15
Sales ('000)      1000  1125  930  710  930   1000  1200  950  1110  1710  1590  1090  995  1400  1450
Adv ('000)        9     9     18   18   27    27    38    38   52    52    58    58    65   65    75
Promotion ('000)  28    28    55   55   10    10    28    28   39    39    45    45    53   53    62
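As a cross-check, the following sketch (assuming numpy is available) fits, by least squares, the reduced advertising-only model adopted later in this solution; it should approximately reproduce b0 ≈ 833.96 and b1 ≈ 7.686, along with the R² and Se computed below:

```python
# Sketch: least-squares fit of sales on advertising for the case data.
import numpy as np

sales = np.array([1000, 1125, 930, 710, 930, 1000, 1200, 950,
                  1110, 1710, 1590, 1090, 995, 1400, 1450], dtype=float)
adv   = np.array([9, 9, 18, 18, 27, 27, 38, 38,
                  52, 52, 58, 58, 65, 65, 75], dtype=float)

X = np.column_stack([np.ones(len(sales)), adv])
(b0, b1), *_ = np.linalg.lstsq(X, sales, rcond=None)

fitted = b0 + b1 * adv
sst = ((sales - sales.mean())**2).sum()
sse = ((sales - fitted)**2).sum()
ssr = sst - sse
k = 1                                      # one predictor in the reduced model
se = np.sqrt(sse / (len(sales) - k - 1))

print(b0, b1)                              # about 833.96 and 7.686
print(ssr / sst, se)                       # R-squared ~0.37, Se ~228
```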


In the above table we can see that the values of b0 and b1 are significant (p < 0.05), while the value of b2 is not significant (p > 0.05). Therefore, we will not include the X2 term in our model, and the model reduces to:

Sales = 833.956 + 7.68579 (expenditure on advertisement) …………………….(1)

The above equation gives us the estimated value of sales for any given value of expenditure

on advertisements.

Y      (Y-ӯ)   (Y-ӯ)²   X1   X2   Ŷ         (Ŷ-ӯ)      (Ŷ-ӯ)²       (Y-Ŷ)      (Y-Ŷ)²
1000   -146    21316    9    28   903.125   -242.875   58988.2656   96.875     9384.76563
1125   -21     441      9    28   903.125   -242.875   58988.2656   221.875    49228.5156
930    -216    46656    18   55   972.29    -173.71    30175.1641   -42.29     1788.4441
710    -436    190096   18   55   972.29    -173.71    30175.1641   -262.29    68796.0441
930    -216    46656    27   10   1041.46   -104.545   10929.657    -111.455   12422.217
1000   -146    21316    27   10   1041.46   -104.545   10929.657    -41.455    1718.51702
1200   54      2916     38   28   1125.99   -20.01     400.4001     74.01      5477.4801
950    -196    38416    38   28   1125.99   -20.01     400.4001     -175.99    30972.4801
1110   -36     1296     52   39   1233.58   87.58      7670.2564    -123.58    15272.0164
1710   564     318096   52   39   1233.58   87.58      7670.2564    476.42     226976.016
1590   444     197136   58   45   1279.69   133.69     17873.0161   310.31     96292.2961
1090   -56     3136     58   45   1279.69   133.69     17873.0161   -189.69    35982.2961
995    -151    22801    65   53   1333.49   187.485    35150.6252   -338.485   114572.095
1400   254     64516    65   53   1333.49   187.485    35150.6252   66.515     4424.24522
1450   304     92416    75   62   1410.34   264.335    69872.9922   39.665     1573.31223

Mean ӯ = 1146.0;  Ʃ(Y - ӯ)² = 1067210.0;  Ʃ(Ŷ - ӯ)² = 392247.8;  Ʃ(Y - Ŷ)² = 674880.7

The above table contains the detailed information about various statistics.

Sum of squares of total variations (SST) = 1067210.0

Sum of squares due to regression (SSR) = 392247.8

Sum of squared errors (SSE) = 674880.7

Multiple Coefficient of Determination (R²):

R² = SSR / SST = Explained Variations / Total Variations
= 392247.8 / 1067210.0 ≈ 0.37

It means the proposed model explains only about 37 percent of the variations in sales. The model is not a good fit for estimation purposes.

Standard Error of Estimate (Se):

Since the promotion variable has been dropped, only one independent variable remains in the fitted model, so k = 1:

Se = √(SSE / (n − k − 1)) = √(Ʃ(Y − Ŷ)² / (n − k − 1))
= √(674880.7 / (15 − 1 − 1)) = √(674880.7 / 13) = 227.846

Calculation of the F-statistic:

The F-statistic is calculated to see whether the dependent variable is correlated with at least one independent variable. If the value of F is found to be significant, it means that our model is significant in estimating the value of Y for the given values of Xi.

F = MSR / MSE = (mean of sum of squares of explained variations) / (mean of sum of squares of unexplained variations)

MSR = SSR / k and MSE = SSE / (n − k − 1)

Where
n = number of observations
k = number of independent variables (k = 1 in the reduced model)

MSR = 392247.8 / 1 = 392247.8
MSE = 674880.7 / (15 − 1 − 1) = 674880.7 / 13 = 51913.9

F = MSR / MSE = 392247.8 / 51913.9 = 7.56 (p < 0.05)

Thus, the value of F is found to be significant, which means that our model may be used for estimation purposes.

Estimation of Sales (Y) when X1 is Rs 120 thousand and X2 is Rs 60 thousand (note that X2 no longer appears in the reduced model):

From equation (1),


Sales = 833.956 + 7.68579 (expenditure on advertisement) …………………….(1)

Sales = 833.956 + 7.68579 (120)

Sales = Rs 1756.25 thousand

Residual Analysis:

We can also check the residuals to check the assumptions of multiple regression analysis. For

example, we have plotted the graph as below. It is obvious from the graph that error is

independent. There is no problem of autocorrelation. Similarly, we can also plot other graphs

for residuals.

*******

[Figure: Plot of residuals (Y − Ŷ) against observation order for the 15 days, scattered randomly around zero.]