TRANSCRIPT
Module: 34, Multiple Regression Model
Paper: 15, QUANTITATIVE TECHNIQUES FOR MANAGEMENT
DECISIONS
Items | Description of Module
Subject Name | Management
Paper Name | Quantitative Techniques for Management
Module Title | Multiple Regression Model
Module Id | Module No. 34
Pre-Requisites | Basic knowledge of correlation and simple regression analysis
Objectives | To appreciate a multiple regression model
Keywords | Multiple regression, regression coefficients, prediction, multicollinearity
QUADRANT-I
Module: 34, Multiple Regression Model
1. Learning Outcome
2. Introduction to Multiple Regression Model and Its Applications
3. Developing a Multiple Linear Regression Model
4. Estimating the Parameters of Proposed Population Regression Model
5. Testing Goodness of Fit of Model
6. Assumptions of Multiple Regression Model
7. Summary
1. Learning Outcome:
After completing this module, the students will be able to:
• Develop a multiple regression model
• Understand hypothesis tests about the regression coefficients
• Calculate a prediction interval for the dependent variable
• Analyze residuals in order to check the validity of model
• Calculate the multiple coefficient of determination
• Perform a Durbin-Watson test for autocorrelation in residuals
• Understand the concept of multicollinearity
• Test the goodness of fit of the model
2. Introduction to Multiple Linear Regression
The term “Regression” was first used by Sir Francis Galton in 1877 while studying the
relationship between the heights of fathers and sons. The dictionary meaning of regression is
the act of returning to the average. According to Morris Hamburg, regression analysis refers
to the methods by which estimates are made of the values of a variable from a knowledge of
the values of one or more other variables, and to the measurement of the errors involved in
this estimation process. Ya Lun Chou elaborates further, adding that regression analysis
basically attempts to establish the nature of the relationship between the variables and
thereby provides a mechanism for prediction/estimation.
In simple linear regression analysis, we basically attempt to predict the value of one variable
from known values of another variable. The variable that is used to estimate the variable of
interest is known as the “independent variable” or “explanatory variable”, and the variable
we are trying to predict is termed the “dependent variable” or “explained variable”. Usually,
the dependent variable is denoted by Y and the independent variable by X. Having established
a strong correlation between two variables, we may predict the value of the dependent variable
from a given value of the independent variable. For example, if we know that the yield of
wheat and the amount of rainfall are closely related, we can estimate the amount of rainfall
required to achieve a particular level of wheat production. Such estimation is easily carried
out using simple regression analysis, which reveals the average relationship between the variables.
In real life, we come across many situations where a large number of variables may affect one
dependent variable. For example, the production of wheat may be affected by rainfall, use of
urea, use of fertilizers, method of farming, humidity, temperature and so forth. In this case,
if we take cognizance of only a single independent variable in explaining the variations of
the dependent variable (production of wheat), then the magnitude of errors in the results
would be very high. In such a situation, it is advisable to take into account all important
independent variables in the estimation equation.
In multiple regression analysis, we predict the value of Y for given values of two or more
independent variables Xi (i = 1, 2, 3, …, k). The analysis is termed 'multiple' as there are
two or more independent variables, and 'linear' as it assumes a linear relationship between
the dependent and independent variables. The population regression model of a dependent
variable on n independent variables, showing an average relationship, is proposed below:
Y= β0 + β1X1+ β2X2 + β3X3 + …….+ βnXn + ε ………………………….(1)
In this regression model, Y is the dependent variable and X1, X2, X3, …, Xn are the
independent variables. β0 is the Y-intercept of the regression surface, and β1, β2, β3, …, βn
are the slopes of the regression surface (also known as the response surface) with respect to
the Xi. One can easily see from the following figure that any three points (P, Q and R), or an
intercept (β0) and slope coefficients, define a plane in three-dimensional space.
[Figure: A regression plane in three dimensions. Any three points (P, Q and R), or an intercept (β0) and the coefficients of X1 and X2 (β1 and β2), define a plane over the (x1, x2) space.]
Parameters: Unknown constants in a model are called parameters. For instance, β0, β1, β2,
β3,…… βn are parameters in the above model.
Estimators: An estimator is a rule, formula, an algorithm that is applied to the data in a
specific sample to calculate an estimate of the population parameter.
Ŷ = b0 + b1X1 + b2X2 + … + bnXn, computed from the sample data, is the estimator in the above model.
Estimates: An estimate is a number or specific value computed or obtained through an
estimator.
Applications of Multiple Regression Analysis:
Regression analysis is a specialized branch of statistical theory and is of immense use in
almost all scientific disciplines. It is particularly useful in economics, where it is
predominantly used to estimate relationships among the economic variables that make up
economic life. Its applications extend to almost all the natural, physical and social
sciences. It attempts to accomplish mainly the following:
• With the help of a multiple regression model, we may predict the unknown value of the
dependent variable for given values of the independent variables.
• Multiple regression analysis never yields one hundred percent correct estimates for given
values of the independent variables: there is always some difference between the actual and
estimated values of the dependent variable, known as the 'estimation error'. Multiple
regression analysis also computes this error, as the standard error of estimation, and thereby
reveals the accuracy of prediction. The amount of error depends upon the spread of the scatter
diagram prepared by plotting the actual observations of Y, X1 and X2 on a plane: the
estimation error is larger when the actual observations are more spread out, and smaller when
they are less spread out.
• Multiple regression analysis also depicts the relationship/association between the dependent
variable Y and the independent variables Xi. The multiple coefficient of determination (R²)
assesses the proportion of the variance in the dependent variable that is accounted for by the
regression equation (estimator). In general, the greater the value of R², the better the fit
and the more useful the regression equation as a predictive instrument.
Multiple Coefficient of Determination (R²) = Explained Variation / Total Variation
For example, if R² = 0.60, i.e., R² = 60/100, it may be interpreted as: out of the total
variation (100 percent), 60 percent of the variation in Y is explained by the Xi in the
suggested regression model.
3. Developing a Multiple Regression Model
A statistical model is a set of mathematical formulas and assumptions related to a real-world
situation. We wish to develop our regression model in such a way that it explains the process
underlying our data as far as possible. Since it is almost impossible for any model to explain
everything, owing to the inherent uncertainty of the real world, there will always be some
remaining error arising from the many unknown outside factors affecting the process that
generates our data.
A good statistical model is parsimonious: it uses few mathematical terms yet explains the real
situation as well as possible. The model attempts to capture the systematic behaviour of our
data and leaves out the factors that are nonsystematic and cannot be predicted/estimated. The
following figure depicts a well-defined statistical model.
The errors (ε), also termed residuals, are the part of our data-generating process that cannot
be estimated by the model because it is not systematic. Hence, the errors (ε) constitute a
random component of the model. A good statistical model thus splits our data-generating
process into two components: a systematic component, which is well explained by the
mathematical terms contained in the model, and a pure random component, whose source is
unknown and which therefore cannot be estimated by the model. The effectiveness of a
statistical model depends upon the amount of error associated with it, and the relationship is
inverse: the smaller the error, the more effective the model, and the larger the error, the
less effective the model.
Proposed Population Multiple Regression Model
As discussed above, the population regression model may be given as below:
Y= β0 + β1X1+ β2X2 + ε ………………………….(2)
Here:
Y is the dependent variable and X1, X2 are the independent variables; together they define a
response surface (the analogue of the estimation line in simple regression). The model
contains two components: first, the systematic (nonrandom) component, which is the response
surface β0 + β1X1 + β2X2 itself; and second, the pure random component, the error term ε. The
systematic component is the equation for the mean of Y for given values of the Xs. We may
represent this conditional mean of Y, given X, by E(Y) as below:
[Figure: A statistical model splits the data-generating process into a systematic component plus random errors.]
E(Y) = β0 + β1X1+ β2X2 -----------------------------------------------------(3)
One may note that E(Y) represents the expected value of Y. Taking the expected value of
equation (2) for given X1 and X2 makes the error term vanish, since E(ε) = 0. Comparing
equations (2) and (3), we notice that each value of Y comprises the average Y for the given
values of X1 and X2, plus a random error. Thus, the actual value of Y equals the average Y
conditional on X, plus a random error ε:
Y = Average Y for given Xi + Error
Until now, we have described the population model, which is assumed to describe the true
relationship between X1, X2 and Y. We now wish to learn about this unknown population
relationship and estimate it from sample information. For this, we draw a random sample of
observations on the variables X1, X2 and Y, and then compute the coefficients b0, b1, b2 of
the sample regression surface, which are analogous to the population parameters β0, β1 and β2.
This is done with the method of least squares, as discussed below.
4. Estimating Parameters of the Proposed Model
We wish to estimate the parameters (β0, β1 and β2) of the proposed model so as to estimate the
value of Y for given values of the Xi. For our model to be effective, we wish to keep its
random component to a minimum. For this we use the 'method of least squares', which fits the
model so that the sum of squared errors is as small as possible (and, with an intercept
included, the average of the estimated Y values equals the average of the actual Y values).
Under the classical assumptions, least squares yields the best linear unbiased estimators
(BLUE).
We will use Ŷ to denote the individual estimated points lying on the estimation surface for
given values of X. The best-fitted estimation surface is as follows:
In a multiple regression model, the least-squares estimators minimize the sum of squared
errors from the estimated regression plane.
Ŷ = b0 + b1X1 + b2X2 + …………..+ bnXn
Where Ŷ is the estimated value of Y, i.e., the value lying on the estimated regression
surface, and b0, b1, b2, b3, …, bn are the least-squares estimates of the population
parameters β0, β1, β2, …, βn.
The actual value of Y is the estimated value plus an error:
Y = b0 + b1X1 + b2X2 + b3X3 + ……….+ bnXn + ε
Minimizing the sum of squared errors with respect to the estimated coefficients b0, b1 and b2
yields the following normal equations, which can be solved for b0, b1 and b2:
Ʃy = n·b0 + b1Ʃx1 + b2Ʃx2
Ʃx1y = b0Ʃx1 + b1Ʃx1² + b2Ʃx1x2
Ʃx2y = b0Ʃx2 + b1Ʃx1x2 + b2Ʃx2²
Solving these three linear equations gives the values of b0, b1 and b2 (the sample
coefficients), which provide the estimates of β0, β1 and β2 (the population parameters).
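To make the computation concrete, here is a minimal Python sketch (an illustrative addition with made-up sample data, not part of the SPSS-based workflow used later in this module) that builds the three normal equations from the data sums and solves them for b0, b1 and b2:

import numpy as np

# Hypothetical sample data: y with two predictors x1 and x2
y  = np.array([10.0, 12.0, 15.0, 18.0, 21.0, 25.0])
x1 = np.array([ 1.0,  2.0,  3.0,  4.0,  5.0,  6.0])
x2 = np.array([ 2.0,  1.0,  4.0,  3.0,  6.0,  5.0])
n = len(y)

# Coefficient matrix and right-hand side of the three normal equations
A = np.array([[n,        x1.sum(),      x2.sum()],
              [x1.sum(), (x1**2).sum(), (x1*x2).sum()],
              [x2.sum(), (x1*x2).sum(), (x2**2).sum()]])
rhs = np.array([y.sum(), (x1*y).sum(), (x2*y).sum()])

b0, b1, b2 = np.linalg.solve(A, rhs)  # sample estimates of beta0, beta1, beta2
print(b0, b1, b2)

The same estimates can be obtained directly with np.linalg.lstsq on the matrix [1, x1, x2], which is numerically preferable for larger problems.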
Search Procedure for Multiple Regressions:
The main objective of multiple regression analysis is to study the impact of various
predictors (independent variables) on the dependent variable. This is done by measuring the
amount of variation in the dependent variable that can be explained by a group of independent
variables. Various selection criteria assist us in evaluating the independent variables, thus
improving the efficiency and effectiveness of the analysis. In a search procedure, more than
one regression model is developed for a given data set, and all the models are compared on the
basis of criteria determined by the procedure chosen.
All Possible Regressions:
When we have k independent variables, this approach considers all possible regressions; there
are (2^k − 1) of them. For example, when k = 2 there are 3 possible regressions, and when
k = 4 there are 15.
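As a quick illustration of how the (2^k − 1) candidate models arise, the following Python sketch (hypothetical predictor names, chosen only for this example) enumerates every non-empty subset of k = 4 predictors:

import itertools

predictors = ["rainfall", "urea", "fertilizer", "humidity"]  # k = 4, hypothetical names
subsets = [c for r in range(1, len(predictors) + 1)
           for c in itertools.combinations(predictors, r)]
print(len(subsets))  # 2**4 - 1 = 15 candidate regressions
for s in subsets:
    print(s)

Each subset would then be fitted and compared on the chosen criterion (for example, adjusted R²).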
Entry Method of Regression:
This method is appropriate particularly when we have a small number of independent variables.
In it, all independent variables are entered into the multiple regression equation at the same
time, and each is evaluated as if it had been entered after all the others. The evaluation of
each predictor is made by analyzing its individual contribution to explaining the variation in
the dependent variable.
Methods of Selection:
This approach enables us to select the predictors (independent variables) that are actually
necessary in the regression equation and that explain almost as much of the variation in the
dependent variable as the complete set does. Thus, it helps us design an optimal regression
equation along with the level of importance of each predictor. It also helps us study the
effect of one independent variable after removing the effects of the other independent
variables.
Mainly, there are three selection procedures that tend to give the most appropriate regression
equation: stepwise selection, forward selection and backward elimination.
Stepwise selection: involves an analysis at each step to measure the contribution of the
independent variables entered previously in the equation. This lets us see the contribution of
the earlier variables once another independent variable has been added, and we can decide
whether to retain or delete independent variables on the basis of their statistical
contribution.
Forward selection: always starts with an empty regression equation, and independent variables
are then added one at a time. First the independent variable with the highest correlation with
the dependent variable is added, then the next, and so forth. Once an independent variable is
in the equation, it remains there (a minimal code sketch of this procedure appears after the
backward-elimination paragraph below).
Backward elimination: is just the opposite of forward selection. All independent variables are
entered into the equation first, and then those that do not contribute to the regression
equation are deleted one at a time.
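The following is a minimal Python sketch of forward selection, assuming numpy is available; the stopping rule (a minimum R² gain, min_gain) is an assumption made for this illustration rather than a prescription from the module:

import numpy as np

def r_squared(y, X):
    # R^2 of an ordinary least-squares fit of y on X
    # (X already contains an intercept column)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def forward_selection(y, X_all, min_gain=0.01):
    # Add one predictor at a time; stop when R^2 improves by less than min_gain
    n, k = X_all.shape
    chosen, remaining, best_r2 = [], list(range(k)), 0.0
    while remaining:
        gains = []
        for j in remaining:
            cols = chosen + [j]
            X = np.column_stack([np.ones(n)] + [X_all[:, c] for c in cols])
            gains.append((r_squared(y, X) - best_r2, j))
        gain, j = max(gains)
        if gain < min_gain:
            break
        chosen.append(j)
        remaining.remove(j)
        best_r2 += gain
    return chosen, best_r2

A stepwise procedure would extend this loop with a re-check of the variables already chosen, and backward elimination would run the loop in reverse, starting from the full set.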
5. Testing Goodness of Fit of Model:
Once we have developed a multiple regression model and obtained estimates of the regression
coefficients β0, β1, β2, etc., we need to check its goodness/adequacy. Adequacy can be
ascertained by testing the significance of the regression coefficients, computing the standard
error of estimate, determining the multiple coefficient of determination, testing the overall
significance of the model, and analyzing the residuals to verify the model's assumptions.
Test for Significance of Coefficients:
Here, we need to assess the significance of the parameters β0, β1 and β2 in order to decide
whether they should be retained in the proposed model. For β0, we may consider the null and
alternative hypotheses:
H0: β0 = 0
Ha: β0 ≠ 0
If the estimate of β0 is significantly different (higher or lower) from zero, we may reject
the null hypothesis; this implies that β0 is significant and must be included in our model.
The test is carried out with a t-test on (n − k − 1) degrees of freedom:
t = (b0 − β0) / S.E.(b0)
If b0 is found significant at the 5% level, β0 must be retained in the model, and a 95%
confidence interval for it is given by b0 ± t × S.E.(b0).
Similarly, we can also test the significance for other regression coefficients.
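A hedged Python sketch of these coefficient t-tests (illustrative; a production analysis would normally rely on the regression output of a statistics package, as the case at the end of this module does with SPSS):

import numpy as np
from scipy import stats

def coefficient_t_tests(y, X_raw):
    # t-statistic and two-sided p-value for each coefficient (H0: beta_i = 0)
    n, k = X_raw.shape
    X = np.column_stack([np.ones(n), X_raw])      # prepend the intercept column
    b = np.linalg.solve(X.T @ X, X.T @ y)         # least-squares estimates
    resid = y - X @ b
    mse = resid @ resid / (n - k - 1)             # SSE / (n - k - 1)
    se_b = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
    t = b / se_b
    p = 2 * stats.t.sf(np.abs(t), df=n - k - 1)   # two-sided p-values
    return b, se_b, t, p

Coefficients whose p-value exceeds the chosen significance level (say 0.05) are candidates for removal from the model.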
Determination of Standard Error of Estimation:
The next stage in building the model is to assess its reliability. From the discussion so far,
we understand that a regression plane is a more accurate estimator when the actual
observations lie close to it. The 'standard error of estimate' is the tool usually used to
check the reliability of the model. The term is analogous to the standard deviation, which
measures the variability or dispersion of observations about the mean of a data set; the
standard error of estimate (Se), by contrast, measures the dispersion/variation of the actual
observations around the regression plane. It can be computed as below:
Standard Error of Estimate (Se)
Se = √( SSE / (n − k − 1) )
Where
SSE = Sum of squared errors
n = Number of actual data points (observations)
k = Number of independent variables
Total Deviation = Regression Deviation + Error Deviation
SST = SSR + SSE
[Figure: Decomposition of deviations around the regression plane: total deviation (y − ӯ) splits into regression deviation (Ŷ − ӯ) and error deviation (y − Ŷ).]
From the above figure, it is easy to see that:
Total deviation = (y − ӯ)
Regression deviation (explained by our model) = (Ŷ − ӯ)
Error deviation (not explained by our model) = (y − Ŷ)
SSE = Error Sum of Squares = Ʃ(y − Ŷ)²
SSR = Regression Sum of Squares = Ʃ(Ŷ − ӯ)²
SST = Total Sum of Squares = Ʃ(y − ӯ)²
Se = √( SSE / (n − k − 1) ) = √( Ʃ(Y − Ŷ)² / (n − k − 1) )
The Mean Square Error (MSE) is an unbiased estimator of the variance of the population errors
and may be calculated as:
MSE = SSE / (n − k − 1)
Therefore, Se = √MSE
It is obvious that the larger the standard error of estimate, the greater the dispersion of
data points around the estimation plane. Conversely, if the error is zero, our model is a
perfect estimator of Y and all data points lie exactly on the plane.
If we assume that the observations are normally distributed around the regression plane, they
will lie in the following pattern:
± 1 Se = about 68% of data points
± 2 Se = about 95% of data points
± 3 Se = about 99.7% of data points
Here it is important to note again two important assumptions:
1. The observed values of Y are normally distributed around each estimated value of Y;
2. The dispersion of the distribution around each estimated value is the same.
The standard error calculated above is a good instrument for constructing a prediction
interval around an estimated value Ŷ within which the actual value of Y lies. From the pattern
above, we may be about 68% confident that the actual value of Y lies within ± 1 Se of the
estimated plane. Since the prediction interval relies on normality of the data points, a
larger sample size (n ≥ 30) is required; for small samples we cannot get accurate intervals.
One should keep in mind that these prediction intervals are only approximate.
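A small Python sketch of the standard error of estimate and the rough ± 2·Se interval described above (the multiplier 2 is the informal normal-approximation value from the pattern above, not an exact t critical value):

import numpy as np

def std_error_of_estimate(y, y_hat, k):
    # Se = sqrt(SSE / (n - k - 1)), with k independent variables
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)
    return np.sqrt(sse / (n - k - 1))

def approx_95_prediction_interval(y0_hat, se):
    # Approximate interval around a new estimate;
    # reasonable only for larger samples (n >= 30)
    return y0_hat - 2 * se, y0_hat + 2 * se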
Computation of Multiple Coefficient of Determination (R²):
The multiple coefficient of determination (R²) measures the proportion of the variation in the
dependent variable Y that is explained by a set of independent variables.
R² = SSR / SST = 1 − SSE / SST = Explained Variation / Total Variation
As the value of R² increases, the goodness of fit also improves, as the model tends to be more
accurate in its estimates of Y. R² may be viewed as a measure of the explanatory power of an
estimator: it shows the percentage of variation explained by all the independent variables
together.
The Adjusted R²:
If one compares the explanatory power of different regression models, one should use the
adjusted R², because it is adjusted for the degrees of freedom, which may differ across
models:
Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
Here, one may take note that:
Adjusted R² is always less than or equal to R²;
It is possible to get a negative value of adjusted R².
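A one-function Python sketch of the adjusted R² just defined:

def adjusted_r2(r2, n, k):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Example: R^2 = 0.60 with n = 15 observations and k = 2 predictors
print(adjusted_r2(0.60, 15, 2))  # about 0.533; can turn negative for very poor fits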
Computing the F-Statistic:
We use the F-statistic to determine the overall significance of the model. Whenever we perform
multiple regression analysis using software, we usually get an ANOVA table like the one below
as output.
Source of variation | Sum of squares | Degrees of freedom (df) | Mean square | F-ratio
Regression | SSR | k | MSR = SSR / k | F = MSR / MSE (ratio of mean squares)
Residual | SSE | n − (k + 1) | MSE = SSE / (n − (k + 1)) |
Total | SST | n − 1 | |
Here, the F-ratio is based on the following hypotheses:
H0: β1 = β2 = β3 = … = βk = 0 (there is no relationship between the dependent and independent variables)
Ha: At least one regression coefficient ≠ 0
In multiple regression, the F-test thus determines whether at least one independent variable
is significantly related to the dependent variable. From the ANOVA table above, it is obvious
that:
F = MSR / MSE, where MSR = SSR / k and MSE = SSE / (n − k − 1)
Here
n = Number of observations
k = Number of independent variables
If the value of F is found to be significant, H0 is rejected. This means the dependent
variable is strongly related to at least one independent variable Xi, and our model is
significant in estimating the value of Y for given values of the Xi.
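The overall F-test can be sketched as follows (illustrative, assuming numpy and scipy; y_hat denotes the fitted values):

import numpy as np
from scipy import stats

def overall_f_test(y, y_hat, k):
    # F = MSR / MSE with MSR = SSR / k and MSE = SSE / (n - k - 1)
    n = len(y)
    ssr = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
    sse = np.sum((y - y_hat) ** 2)         # unexplained sum of squares
    f = (ssr / k) / (sse / (n - k - 1))
    p = stats.f.sf(f, k, n - k - 1)        # upper-tail p-value
    return f, p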
6. Assumptions of Multiple Regression Model:
The basic assumptions of the multiple regression model are as follows:
(a) Normality of Errors: The errors must be normally distributed, with zero mean;
symbolically, E[ε] = 0. This can be checked by plotting a histogram of the residuals.
(b) Homoscedasticity: Also known as constant error variance; symbolically, Var[ε] = σ². It can
be checked by plotting the residuals against the fitted values of the dependent variable: if
the residuals are randomly scattered around zero, the assumption of constant error variance
is met; otherwise it is not.
(c) Uncorrelatedness of Residuals and Independent Variables: The residuals/errors must not be
correlated with any independent variable Xi. This can be checked by plotting the residuals
against each independent variable separately: if the residuals are randomly scattered around
zero, the assumption is met; otherwise it is not.
(d) Independence of Errors: This assumption concerns serial correlation between values of the
error term, i.e., the error term being correlated with its own values at earlier times. This
problem is also known as autocorrelation. To check it, we can plot the residuals against time;
a visual inspection of this graph may reveal serial correlation. There is also a statistical
test for detecting autocorrelation, discussed below.
The Durbin-Watson Test: used for first-order autocorrelation, the test uses the statistic
DW = Ʃ(et − et−1)² / Ʃet²
(the numerator summed over t = 2, …, n and the denominator over t = 1, …, n)
Where
et = residual at time t
et−1 = residual at time (t − 1), i.e., the just-previous observation
As a rule of thumb, a DW value near 2 indicates no autocorrelation, a value near 0 indicates
perfect positive autocorrelation, and a value near 4 indicates perfect negative
autocorrelation.
The Durbin-Watson statistic (d) has a sampling distribution with two critical values, dL and
dU:
No autocorrelation if dU < d < (4 − dU)
Positive autocorrelation if 0 < d < dL
Negative autocorrelation if (4 − dL) < d < 4
The test is inconclusive if dL < d < dU or (4 − dU) < d < (4 − dL)
The critical d values are a function of number of observations, the number of parameters
estimated and the level of significance.
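The Durbin-Watson statistic itself is a one-liner; the following Python sketch assumes the residuals are already available in time order:

import numpy as np

def durbin_watson(residuals):
    # DW = sum((e_t - e_{t-1})^2) / sum(e_t^2);
    # values near 2 suggest no autocorrelation
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

The resulting value is then compared with the tabulated dL and dU for the given number of observations, number of parameters and significance level.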
All residual plots discussed above are shown in the figure below:
[Figure: Residual Plots for Exports, four panels: Normal Probability Plot of the residuals, Residuals Versus Fitted Values, Histogram of the Residuals, and Residuals Versus Observation Order.]
Violation of Assumptions (Multicollinearity Problem):
When the independent variables exhibit covariation among themselves, this lack of independent
movement is called multicollinearity. It simply means that the independent variables have some
degree of correlation among themselves.
Effects of Multicollinearity:
Minor changes in the data can produce wide swings in the parameter estimates;
Coefficients may have high standard errors and low significance levels;
Coefficients may have wrong signs or implausible magnitudes.
Therefore, it is necessary to detect multicollinearity before proceeding with multiple
regression. It can be detected by:
Studying the correlations among all independent variables;
Regressing each independent variable on the other independent variables;
Computing the Variance Inflation Factor (VIF): the ratio 1/(1 − Rk²), where Rk² comes from
regressing the k-th independent variable on the others, is called the VIF and is an indicator
of multicollinearity; the higher it is, the higher the degree of multicollinearity (a minimal
code sketch follows this list).
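A minimal Python VIF sketch, assuming numpy, which regresses each predictor on all the others exactly as described above:

import numpy as np

def variance_inflation_factors(X_raw):
    # VIF_k = 1 / (1 - R_k^2), where R_k^2 is from regressing predictor k
    # on all the other predictors (with an intercept)
    n, k = X_raw.shape
    vifs = []
    for j in range(k):
        y = X_raw[:, j]
        others = np.delete(X_raw, j, axis=1)
        X = np.column_stack([np.ones(n), others])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return vifs

A common rule of thumb treats VIF values above about 10 as a sign of serious multicollinearity.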
In case multicollinearity is detected in a data set, one may take the following actions:
Do nothing;
Obtain more or new data;
Drop some variables or re-specify the model.
7. Summary:
Multiple regression analysis is an important technique for developing a model that estimates
the value of a dependent variable Y from two or more independent variables Xi. The analysis
describes an average relationship between the dependent and independent variables; it assumes
a linear relationship and is expressed as a linear equation. To test the goodness of fit of
our proposed model, we may look at the following:
(a) Make a scatter plot of the given observations and fit the regression plane. If the
observed data points are highly scattered around the fitted surface, our model may not fit
well.
(b) Test the significance of the regression coefficients β0, β1, β2, etc., using t-tests. If
any parameter is found not significant (that is, not distinguishable from zero), the
corresponding term should not be included in our model.
(c) We can ascertain the significance of the overall model with the F-statistic:
F = MSR / MSE = (Mean sum of squares of explained variation) / (Mean sum of squares of unexplained variation)
If the value of F is found to be significant, it means that the dependent variable is strongly
related to at least one of the independent variables and our model fits well.
(d) Calculate the multiple coefficient of determination (R²): the higher it is (say, greater
than 60%), the better the model fits;
(e) The standard error of estimate is also calculated in order to judge the adequacy and
reliability of the model.
(f) The multiple regression model rests on several assumptions. Therefore, we may carry out a
detailed residual analysis to make sure all the necessary assumptions are met.
(g) Multicollinearity is a major problem in multiple regression analysis. It means that the
independent variables are correlated with each other, which makes the predictions very
unstable. We need to handle this problem carefully before moving ahead.
(h) Many times in multiple regression, the dependent variable may not be linearly related to
an independent variable. In that case, we first need to transform those variables and then
test them for linear correlation. If, after transformation, they are linearly correlated, we
can proceed with multiple regression on the transformed variables. There are different types
of transformations, which are beyond the purview of this module.
*******
Case: A paint manufacturing company has found that its sales mainly depend upon advertisements
in electronic media and sales promotion activities. The company increased its budget on these
two heads considerably and is now interested in the impact of the two initiatives on sales.
You are given a random sample of sales for 15 days. You are required to develop a multiple
regression model and assess the impact of the two techniques on sales.
Also predict the sales when expenditure on advertising in electronic media is Rs 120 thousand
and on promotion activities is Rs 60 thousand.
Solution:
The proposed multiple regression model is as below:
Y = β0 + β1X1 + β2X2 + ε
Where
Y = Sales of the company
X1 = Expenditure on advertisement in electronic media
X2 = Expenditure on promotion activities
Substituting the sample data into the three normal equations given earlier and solving them
yields the sample coefficients b0, b1 and b2, which provide the estimates of the population
parameters β0, β1 and β2. In practice this is done with statistical software; the following
table shows the output of SPSS:
Predictor | Unstandardized Coefficient (B) | Std. Error | Standardized Coefficient (Beta) | t value | Sig.
(Constant) | 833.9566 | 127.8487 | | 6.5230 | 1.93E-05
Adv | 7.6858 | 2.7958 | 0.6063 | 2.7491 | 0.016568
Promotion | -1.733 | 4.54487 | -0.10034 | 0.465437 | 0.102409
b0 = 833.956; b1 = 7.685799; b2 = -1.733
Now we will test the significance of the calculated coefficients so that we can decide whether
to include or exclude them from the proposed model. This can be done using the t-test
described in this module; here we use the significance values obtained directly from the SPSS
output.
Day | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15
Sales ('000) | 1000 | 1125 | 930 | 710 | 930 | 1000 | 1200 | 950 | 1110 | 1710 | 1590 | 1090 | 995 | 1400 | 1450
Adv ('000) | 9 | 9 | 18 | 18 | 27 | 27 | 38 | 38 | 52 | 52 | 58 | 58 | 65 | 65 | 75
Promotion ('000) | 28 | 28 | 55 | 55 | 10 | 10 | 28 | 28 | 39 | 39 | 45 | 45 | 53 | 53 | 62
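For readers who want to reproduce the SPSS coefficients, here is a Python sketch using numpy's least squares on the case data above (any small differences from the table come from rounding):

import numpy as np

sales = np.array([1000, 1125, 930, 710, 930, 1000, 1200, 950,
                  1110, 1710, 1590, 1090, 995, 1400, 1450], dtype=float)
adv   = np.array([9, 9, 18, 18, 27, 27, 38, 38, 52, 52, 58, 58, 65, 65, 75], dtype=float)
promo = np.array([28, 28, 55, 55, 10, 10, 28, 28, 39, 39, 45, 45, 53, 53, 62], dtype=float)

X = np.column_stack([np.ones(len(sales)), adv, promo])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(b)  # should be close to b0 = 833.96, b1 = 7.69, b2 = -1.73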
In the SPSS output above, we can see that b0 and b1 are significant (p < 0.05), while b2 is
not (p > 0.05). Therefore, we will not include the X2 term in our model, and the fitted model
becomes:
Sales = 833.956 + 7.68579 (expenditure on advertisement)…………………….(1)
The above equation gives the estimated value of sales for any given value of expenditure on
advertisements.
Y | (y − ӯ) | (y − ӯ)² | X1 | X2 | Ŷ | (Ŷ − ӯ) | (Ŷ − ӯ)² | (Y − Ŷ) | (Y − Ŷ)²
1000 | -146 | 21316 | 9 | 28 | 903.125 | -242.875 | 58988.2656 | 96.875 | 9384.76563
1125 | -21 | 441 | 9 | 28 | 903.125 | -242.875 | 58988.2656 | 221.875 | 49228.5156
930 | -216 | 46656 | 18 | 55 | 972.29 | -173.71 | 30175.1641 | -42.29 | 1788.4441
710 | -436 | 190096 | 18 | 55 | 972.29 | -173.71 | 30175.1641 | -262.29 | 68796.0441
930 | -216 | 46656 | 27 | 10 | 1041.46 | -104.545 | 10929.657 | -111.455 | 12422.217
1000 | -146 | 21316 | 27 | 10 | 1041.46 | -104.545 | 10929.657 | -41.455 | 1718.51702
1200 | 54 | 2916 | 38 | 28 | 1125.99 | -20.01 | 400.4001 | 74.01 | 5477.4801
950 | -196 | 38416 | 38 | 28 | 1125.99 | -20.01 | 400.4001 | -175.99 | 30972.4801
1110 | -36 | 1296 | 52 | 39 | 1233.58 | 87.58 | 7670.2564 | -123.58 | 15272.0164
1710 | 564 | 318096 | 52 | 39 | 1233.58 | 87.58 | 7670.2564 | 476.42 | 226976.016
1590 | 444 | 197136 | 58 | 45 | 1279.69 | 133.69 | 17873.0161 | 310.31 | 96292.2961
1090 | -56 | 3136 | 58 | 45 | 1279.69 | 133.69 | 17873.0161 | -189.69 | 35982.2961
995 | -151 | 22801 | 65 | 53 | 1333.49 | 187.485 | 35150.6252 | -338.485 | 114572.095
1400 | 254 | 64516 | 65 | 53 | 1333.49 | 187.485 | 35150.6252 | 66.515 | 4424.24522
1450 | 304 | 92416 | 75 | 62 | 1410.34 | 264.335 | 69872.9922 | 39.665 | 1573.31223
Mean ӯ = 1146.0 | | SST = 1067210.0 | | | Mean Ŷ = 1146.0 | | SSR = 392247.8 | | SSE = 674880.7
The above table contains the detailed calculations.
Sum of squares of total variation (SST) = 1067210.0
Sum of squares due to regression (SSR) = 392247.8
Sum of squares of error (SSE) = 674880.7
Multiple coefficient of determination (R²):
R² = SSR / SST = Explained Variation / Total Variation = 392247.8 / 1067210.0 ≈ 0.37
This means the proposed model explains only about 37 percent of the variation in sales; the
model is not a good fit for estimation purposes.
Standard Error of Estimate (Se):
Se = √( SSE / (n − k − 1) ) = √( Ʃ(Y − Ŷ)² / (n − k − 1) )
= √( 674880.7 / (15 − 2 − 1) ) = √( 674880.7 / 12 ) = √56240.1 ≈ 237.15
Calculation of the F-statistic:
The F-statistic tests whether the dependent variable is related to at least one independent
variable. If F is significant, the model is significant in estimating the value of Y for given
values of the Xi.
F = MSR / MSE = (Mean sum of squares of explained variation) / (Mean sum of squares of unexplained variation)
where MSR = SSR / k and MSE = SSE / (n − k − 1); n = number of observations, k = number of
independent variables.
MSR = 392247.8 / 2 = 196123.9
MSE = 674880.7 / (15 − 2 − 1) = 674880.7 / 12 = 56240.1
F = MSR / MSE = 196123.9 / 56240.1 ≈ 3.49
With (2, 12) degrees of freedom, the 5% critical value of F is about 3.89, so F ≈ 3.49 falls
just short of significance at the 5% level (p ≈ 0.06). Consistent with the modest R², the
model should therefore be used for estimation only with caution.
Estimation of Sales (Y) when X1 = Rs 120 thousand and X2 = Rs 60 thousand:
Since X2 (promotion expenditure) was dropped from the model, only advertising expenditure
enters the prediction. From equation (1):
Sales = 833.956 + 7.68579 (120)
Sales = Rs 1756.251 thousand
Residual Analysis:
We can also examine the residuals to check the assumptions of multiple regression analysis.
For example, the residuals (Y − Ŷ) are plotted against observation order below; the errors
appear to be independent, so there is no evidence of autocorrelation. Similarly, we can plot
the other residual graphs.
[Figure: Residuals (Y − Ŷ) plotted against observation order for the 15 days; the points scatter randomly around zero.]
*******