Correlation - Regression (Τομέας Μαθηματικών / …fouskakis/5-6.correlation)
TRANSCRIPT

Correlation - Regression

Interdepartmental Postgraduate Program "Occupational and Environmental Health - Management and Economic Valuation"

Δημήτρης Φουσκάκης (Dimitris Fouskakis)
Correlation

We want to assess whether two variables are linearly associated, that is, whether the values of one variable tend to be higher or lower for higher values of the other variable. Methods for studying the association between categorical variables will be introduced in the next chapter. Here we consider continuous variables, using the method known as correlation.
Correlation

Correlation is therefore the method of analysis to use when studying the possible linear association between two continuous variables. The degree of linear association is measured by the sample correlation coefficient (also called Pearson's r), often loosely just called the correlation. It is a coefficient, denoted by r, that does not depend on the units used to measure the data and is bounded between -1 and 1.
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} $$
Correlation

- A negative coefficient means that the data are clustered around lines with a negative slope: as one variable increases, the other decreases.
- The closer r is to 1, the stronger the positive linear association between the variables.
- The closer r is to -1, the stronger the negative linear association between the variables.
- When r equals 1 or -1 there is total linear association between the variables: all points lie on a line.
- When r equals 0 there is no linear association between the two variables.
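As a quick illustration (a minimal sketch, not part of the slides; the function name `pearson_r` and the toy data are my own), the sample correlation can be computed directly from the formula above:

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient: Lxy / sqrt(Lxx * Lyy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    lxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    lxx = sum((a - mx) ** 2 for a in x)
    lyy = sum((b - my) ** 2 for b in y)
    return lxy / math.sqrt(lxx * lyy)

# Points lying exactly on a line with positive slope give r = 1,
# and on a line with negative slope give r = -1, as the slide states.
x = [1.0, 2.0, 3.0, 4.0]
print(pearson_r(x, [2 * v + 1 for v in x]))   # 1.0
print(pearson_r(x, [9 - 2 * v for v in x]))   # -1.0
```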
Correlation - Confidence Interval for ρ

We can obtain a 95% C.I. for the correlation coefficient in the population, ρ. First compute Fisher's transformation

$$ z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right), $$

then

$$ z_1 = z - 1.96/\sqrt{n-3}, \qquad z_2 = z + 1.96/\sqrt{n-3}. $$

The 95% C.I. for ρ is

$$ \left( \frac{e^{2z_1}-1}{e^{2z_1}+1},\; \frac{e^{2z_2}-1}{e^{2z_2}+1} \right). $$
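The three steps above can be sketched in code (my own illustration; the example values r = 0.7, n = 50 are hypothetical):

```python
import math

def fisher_ci(r, n):
    """Approximate 95% CI for rho via Fisher's z transformation."""
    z = 0.5 * math.log((1 + r) / (1 - r))       # Fisher's z
    half = 1.96 / math.sqrt(n - 3)
    z1, z2 = z - half, z + half
    back = lambda w: (math.exp(2 * w) - 1) / (math.exp(2 * w) + 1)
    return back(z1), back(z2)

lo, hi = fisher_ci(0.7, 50)
print(round(lo, 3), round(hi, 3))  # roughly (0.52, 0.82); the interval contains r
```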
Correlation - Hypothesis Testing for ρ

H0: ρ = 0 vs H1: ρ ≠ 0

Under the null hypothesis

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}. $$

Therefore reject H0 if t > t_{n-2,α/2} or t < -t_{n-2,α/2}, where α is the significance level.
If t > 0 the p-value is twice the area of St(n-2) to the right of t.
If t < 0 the p-value is twice the area of St(n-2) to the left of t.
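A small sketch of the test statistic (my own example; r = 0.5 and n = 20 are hypothetical, and the critical value t_{18,0.025} = 2.101 is taken from standard t tables):

```python
import math

def corr_t(r, n):
    """Test statistic t = r*sqrt(n-2)/sqrt(1-r^2) for H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Example: r = 0.5 observed from n = 20 pairs; t_{18,0.025} = 2.101.
t = corr_t(0.5, 20)
print(round(t, 3), abs(t) > 2.101)  # 2.449 True, so H0 is rejected at the 5% level
```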
Correlation - Assumptions

The correlation coefficient can be calculated for any data. However, in order to construct the hypothesis test for ρ, at least one variable should be normally distributed. For the calculation of a valid confidence interval, both variables should be normally distributed. Also, the observations should be independent (if not, you can calculate for example the intraclass correlation, as we mentioned in the previous chapter).
Spearman's Rank Correlation

- Rank in total the values of each variable.
- Calculate Pearson's r on the ranks of the data and call this r_s.
- Confidence interval: the distribution of r_s is similar to that of r for samples larger than about 10, so a CI for ρ_s (the population rank correlation) can be obtained using the same formula as before, with r_s in place of r.
- Hypothesis testing (for large samples only, n > 30):

H0: ρ_s = 0 vs H1: ρ_s ≠ 0

Under the null hypothesis

$$ t = \frac{r_s\sqrt{n-2}}{\sqrt{1-r_s^2}} \sim t_{n-2}. $$

Therefore reject H0 if t > t_{n-2,α/2} or t < -t_{n-2,α/2}, where α is the significance level.
If t > 0 the p-value is twice the area of St(n-2) to the right of t.
If t < 0 the p-value is twice the area of St(n-2) to the left of t.
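The two bullet steps (rank, then apply Pearson's r to the ranks) can be sketched as follows (my own illustration, including average ranks for ties):

```python
import math

def ranks(v):
    """Ranks 1..n; tied values get the average of the tied positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    out = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average 1-based rank of positions i..j
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Pearson's r computed on the ranks of the data."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    lxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    lxx = sum((a - mx) ** 2 for a in rx)
    lyy = sum((b - my) ** 2 for b in ry)
    return lxy / math.sqrt(lxx * lyy)

# A monotone but nonlinear relationship still gives rs = 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spearman(x, [v ** 3 for v in x]))  # 1.0
```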
Ecological Correlations

Correlations based on rates or averages can be misleading.

Example 1: The relationship between the rate of cigarette smoking (per capita) and the rate of deaths from lung cancer in 11 countries gave a correlation of 0.7. However, it is not countries which smoke and get cancer, but people. To measure the strength of the relationship for people, we must have data for individual people.
Ecological Correlations

Example 2: From Current Population Survey data for 1993, you can compute the correlation between income and education for men aged 25-54 in the US: r = 0.44. For each state you can compute the average educational level and income. Finally, you can compute the correlation between the pairs of averages: this is 0.64! If you use the correlation for the states to estimate the correlation for the individuals, you would be way off. The reason is that within each state there is a lot of spread around the averages. Replacing the states by their averages eliminates the spread and gives a misleading impression of tight clustering.
Association vs. Causation

For school children, shoe size is strongly correlated with reading skills. However, learning new words does not make the feet get bigger. Instead, there is a third factor involved: age. As children get older, they learn to read better and they outgrow their shoes.

Correlation measures association. But association does not necessarily show causation. It may only show that both variables are simultaneously influenced by some third variable.
Regression - Introduction

Basic idea: use data to identify relationships among variables, and use these relationships to make predictions. Regression analysis describes the relationship between two (or more) variables.

Examples:
- Income and educational level.
- Demand for electricity and the weather.
- Home sales and interest rates.
Simple Example

A linear model for hours worked:

Hours worked = α + β * per-capita GDP

where:
- Hours of work: dependent variable (Y)
- GDP per capita: independent variable (X)
- α: intercept (or baseline) and β: slope are the regression coefficients.

The slope of this line gives

$$ \beta = \frac{\text{Change in Hours Worked}}{\text{Change in GDP per capita}}. $$

If β > 0, hours worked increase with the level of income. If β < 0, the work week gets shorter as a country develops.
Simple Example

We want to find coefficient values that give a good 'fit' of the data. A plot of the data is called a scatter diagram. It describes the relationship between Hours Worked and GDP per capita for several countries.
[Figure: Scatter Diagram: Hours Worked and GDP per Capita. Weekly Hours Worked (30-55) plotted against GDP per capita ($0-40,000).]
So Many Choices...

[Figure: the same scatter diagram with several candidate lines drawn through the points (e.g. Line C).]
Simple Example

The regression line is the line that best summarizes the data. More precisely, it is the line that minimizes the distance between every point in the scatter diagram and the corresponding point on the line. This method of estimating the regression line is called least squares.
[Figure: Scatter Diagram and Regression Line. Weekly Hours Worked against GDP per capita with the fitted line.]
Simple Example

In our example the regression line is:

Hours Worked = 45.3 - 0.00025 Per-capita GDP
               (1.52)  (0.00007)

A $1,000 increase in GDP per capita reduces weekly hours worked by a quarter of an hour. The standard errors (in parentheses) are a measure of the statistical precision with which the coefficients are estimated.
General Concepts

This is a study relating birthweight to the estriol level of pregnant women. Let x = estriol level and y = birthweight. From a scatter plot of the data there appears to be a linear relationship.

 i   Estriol (mg/24hr) x_i   Birthweight (g/100) y_i     i   Estriol (mg/24hr) x_i   Birthweight (g/100) y_i
 1           7                      25                  17          17                      32
 2           9                      25                  18          25                      32
 3           9                      25                  19          27                      34
 4          12                      27                  20          15                      34
 5          14                      27                  21          15                      34
 6          16                      27                  22          15                      35
 7          16                      24                  23          16                      35
 8          14                      30                  24          19                      34
 9          16                      30                  25          18                      35
10          16                      31                  26          17                      36
11          17                      30                  27          18                      37
12          19                      31                  28          20                      38
13          21                      30                  29          22                      40
14          24                      28                  30          25                      39
15          15                      32                  31          24                      43
16          18                      32
General Concepts

[Figure: scatter plot of birthweight (25-45) against estriol (5-25).]
General Concepts

We can postulate a relationship between y and x of the form E(Y|X) = α + βX. That is, for a given estriol level x, the average birthweight is α + βx. The line y = α + βx is the regression line, where α is the intercept and β is the slope.

The relationship y = α + βx is not expected to hold exactly for every woman. For example, not all women with a given estriol level have babies with identical birthweights. Thus an error term e, which represents the variability of birthweight among all babies of women with a given estriol level x, is introduced into the model. Let's assume that e follows a normal distribution with mean 0 and variance σ². The full linear regression model takes the form

Y = α + βX + e

where Y is the dependent variable and X is the independent variable (or predictor).
General Concepts

The interpretation of the regression line is that for a woman with estriol level x, the corresponding birthweight will be normally distributed with mean α + βx and variance σ². If σ² = 0 then every point falls exactly on the regression line.

The interpretation of β is the following. If β > 0 then as X increases the expected value of Y given X will increase. If β < 0 then as X increases the expected value of Y given X will decrease. If β = 0 then there is no linear relationship between X and Y.
General Concepts

[Figure: two scatter plots: σ² = 0 gives a perfect fit (all points on the line); σ² > 0 gives an imperfect fit.]
General Concepts

[Figure: three regression lines illustrating β > 0, β < 0 and β = 0.]
Fitting Regression Lines - Least Squares Method

[Figure: scatter plot with the line y = α + βx; for each sample point (x_i, y_i) the vertical distance d_i to the corresponding point (x_i, ŷ_i) on the line is shown.]
Fitting Regression Lines - Least Squares Method

The Least Squares line is the line y = a + bx that minimizes the sum of squared distances of the sample points from the line, given by

$$ S = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (y_i - a - bx_i)^2. $$

This method of estimating the parameters of the regression line is known as the method of least squares. The resulting a and b are the estimates of α and β.
Fitting Regression Lines - Least Squares Method

The raw sum of squares for x is $\sum_{i=1}^{n} x_i^2$; the corrected sum of squares for x is

$$ L_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 \Big/ n. $$

Similarly, the raw sum of squares for y is $\sum_{i=1}^{n} y_i^2$; the corrected sum of squares for y is

$$ L_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2 \Big/ n. $$
Fitting Regression Lines - Least Squares Method

Notice that L_xx and L_yy are the numerators of the sample variances of x (i.e. S_X²) and of y (i.e. S_Y²).

The raw sum of cross products is $\sum_{i=1}^{n} x_i y_i$; the corrected sum of cross products is

$$ L_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right) \Big/ n. $$
Fitting Regression Lines - Least Squares Method

The coefficients of the least-squares line y = a + bx are given by:

$$ b = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{L_{xy}}{L_{xx}} \qquad \text{and} \qquad a = \bar{y} - b\bar{x}. $$

The predicted, or average, value of Y for a given value of x, as estimated from the fitted regression line, is denoted by

$$ \hat{Y} = a + bx. $$
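The two formulas can be sketched directly in code (my own illustration; the toy data are generated from a known line so the fit can be checked):

```python
def least_squares(x, y):
    """Least-squares estimates: b = Lxy/Lxx, a = ybar - b*xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    lxx = sum((v - xbar) ** 2 for v in x)
    lxy = sum((v - xbar) * (w - ybar) for v, w in zip(x, y))
    b = lxy / lxx
    return ybar - b * xbar, b

# Data generated from y = 2 + 3x is recovered exactly.
x = [0.0, 1.0, 2.0, 3.0]
a, b = least_squares(x, [2 + 3 * v for v in x])
print(a, b)  # 2.0 3.0
```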
Inference

[Figure: scatter plot with the regression line y = a + bx passing through the point (x̄, ȳ). For a sample point (x_i, y_i), the total deviation (y_i - ȳ) is split into the residual component (y_i - ŷ_i) and the regression component (ŷ_i - ȳ).]
Inference

First notice that the point (x̄, ȳ) always falls on the regression line. For any sample point (x_i, y_i) the residual (or residual component) is defined by (y_i - ŷ_i). For any sample point the regression component is defined by (ŷ_i - ȳ). Notice that the deviation satisfies

$$ (y_i - \bar{y}) = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}). $$

A good fitting regression line will have regression components large in absolute value relative to the residual components, while the opposite is true for poorly fitting regression lines.
Inference

Squaring the decomposition and summing over the sample gives

$$ \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2, $$

that is, Total SS = Res SS + Reg SS, where

$$ \text{Reg SS} = bL_{xy} = b^2 L_{xx} = L_{xy}^2/L_{xx}, \qquad \text{Res SS} = \text{Total SS} - \text{Reg SS} = L_{yy} - L_{xy}^2/L_{xx}. $$

Also

Reg MS = Reg SS / k,
Res MS = Res SS / (n - k - 1) ≡ s²_{y·x},

where k is the number of predictors (in the simple linear regression case k = 1) and s²_{y·x} is an estimate of σ², the variance of y given x.
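The decomposition Total SS = Reg SS + Res SS can be checked numerically (a sketch of my own; the small data set is hypothetical):

```python
def anova_ss(x, y):
    """Total SS, Reg SS and Res SS for the least-squares line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    lxx = sum((v - xbar) ** 2 for v in x)
    lyy = sum((w - ybar) ** 2 for w in y)
    lxy = sum((v - xbar) * (w - ybar) for v, w in zip(x, y))
    reg_ss = lxy ** 2 / lxx               # Reg SS = Lxy^2 / Lxx
    return lyy, reg_ss, lyy - reg_ss      # Total SS, Reg SS, Res SS

total, reg, res = anova_ss([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 9.0])
print(total, reg, res)  # Total SS equals Reg SS + Res SS by construction
```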
Inference - F test for Simple Linear Regression

H0: β = 0 vs H1: β ≠ 0

1. Compute F = Reg MS / Res MS, which follows the F_{1,n-2} distribution under H0.
2. If α is the significance level, reject H0 if F > F_{1,n-2,α}.
3. The p-value = P(F_{1,n-2} > F).

All these results are usually summarized in an ANOVA table.
Inference - Coefficient of Determination

A summary measure of goodness of fit frequently referred to in the literature is the coefficient of determination:

R² = Reg SS / Total SS

R² is the proportion of the total variation of the observed values of Y that is accounted for by the regression equation of the independent variables. Always 0 ≤ R² ≤ 1. If it equals 1 then all variation in Y can be explained by variation in X, and all data points fall on the regression line. In other words, once X is known, Y can be predicted exactly, with no error or variability in the prediction. If R² = 0 then X gives no information about Y, and the variance of Y is the same with or without knowing X.
Coefficient of Determination

[Figure: two scatter plots with fitted lines, one with R² = 0.69 and one with R² = 0.46 (fitted values plotted against the predictor, e.g. advertising expenditures in $ million).]

The value of R² is frequently used to measure the extent to which the regression model fits the data. This is WRONG! There are other ways to determine whether a linear regression is valid or not.
Coefficient of Determination and Sample Correlation

It can be proved that R² = [r(X,Y)]². Thus R² is the square of the sample correlation coefficient between the dependent variable Y and the independent variable X.

The estimate of the slope b in the simple linear regression model can be written as

$$ b = r(x,y) \times \frac{S_y}{S_x} $$

where S_X and S_Y are the sample standard deviations of X and Y respectively.
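This identity is easy to verify numerically (my own sketch; note that S_y/S_x = sqrt(L_yy/L_xx), since the (n-1) factors cancel):

```python
import math

def slope_two_ways(x, y):
    """Return b from least squares and b from r(x,y) * Sy/Sx; they coincide."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    lxx = sum((v - mx) ** 2 for v in x)
    lyy = sum((w - my) ** 2 for w in y)
    lxy = sum((v - mx) * (w - my) for v, w in zip(x, y))
    b_ls = lxy / lxx
    r = lxy / math.sqrt(lxx * lyy)
    b_via_r = r * math.sqrt(lyy / lxx)    # Sy/Sx = sqrt(Lyy/Lxx)
    return b_ls, b_via_r

print(slope_two_ways([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 9.0]))
```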
Inference - T test for Simple Linear Regression

H0: β = 0 vs H1: β ≠ 0

1. Compute t = b / se(b), where se(b) = s_{y·x} / √L_xx. This follows the t distribution with n-2 df under H0.
2. If α is the significance level, reject H0 if t > t_{n-2,α/2} or t < -t_{n-2,α/2}.
3. The p-value is given by
   p = 2 × (area to the left of t under the t distribution with n-2 df) if t < 0,
   p = 2 × (area to the right of t under the t distribution with n-2 df) if t ≥ 0.
Inference - Confidence Intervals

Two-sided 100% × (1-α) CIs for the parameters of the regression line, α and β, are

$$ b \pm t_{n-2,\alpha/2}\,se(b) \qquad \text{and} \qquad a \pm t_{n-2,\alpha/2}\,se(a), $$

where

$$ se(b) = s_{y\cdot x}/\sqrt{L_{xx}} \qquad \text{and} \qquad se(a) = s_{y\cdot x}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{L_{xx}}}. $$
Prediction

Suppose we wish to make predictions from a regression line for an individual observation with independent variable value x that was not used in constructing the regression line. The distribution of observed Y values for the subset of individuals with independent variable value x is normal with mean Ŷ = a + bx and standard error

$$ se(\hat{Y}) = s_{y\cdot x}\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{L_{xx}}}. $$

Furthermore, a two-sided 100% × (1-α) CI for the observed values (prediction interval for Y) is given by

$$ \hat{Y} \pm t_{n-2,\alpha/2}\,se(\hat{Y}). $$
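The prediction interval formula can be sketched as a small function (my own illustration; the toy data and the critical value t_{2,0.025} = 4.303, taken from t tables, are not from the slides):

```python
import math

def prediction_interval(x, y, x0, t_crit):
    """Prediction interval Yhat +/- t_crit * se(Yhat) for a new observation at x0."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    lxx = sum((v - xbar) ** 2 for v in x)
    lyy = sum((w - ybar) ** 2 for w in y)
    lxy = sum((v - xbar) * (w - ybar) for v, w in zip(x, y))
    b = lxy / lxx
    a = ybar - b * xbar
    s = math.sqrt((lyy - lxy ** 2 / lxx) / (n - 2))          # s_{y.x}
    yhat = a + b * x0
    se = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / lxx)   # se(Yhat)
    return yhat - t_crit * se, yhat + t_crit * se

# The interval is centred at the fitted value and is narrowest at x0 = xbar.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
lo, hi = prediction_interval(x, y, 2.5, 4.303)   # t_{2,0.025} = 4.303
print(lo, hi)
```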
Example

n = 15 observations
Y = First Year Sales ($ million)
X = Advertising Expenditures ($ million)
We try to fit a simple linear regression model: Y = α + βX.

Y (First Year Sales, $ million)   X (Advertising expenditures, $ million)
101.8                             1.3
 44.4                             0.7
108.3                             1.4
 85.1                             0.5
 77.1                             0.5
158.7                             1.9
180.4                             1.2
 64.2                             0.4
 74.6                             0.6
143.4                             1.3
120.6                             1.6
 69.7                             1.0
 67.8                             0.8
106.7                             0.6
119.6                             1.1
Example

$$ L_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 \Big/ n = 2.869333 $$

$$ L_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2 \Big/ n = 20405.1 $$

$$ L_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right) \Big/ n = 171.2393 $$
Example

$$ b = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{L_{xy}}{L_{xx}} = 59.67914 \qquad \text{and} \qquad a = \bar{y} - b\bar{x} = 42.21205 $$
Example

[Figure: scatter plot of First Year Sales ($ million, 40-140) against Advertising expenditures ($ million, 0.5-2.5) with the fitted regression line.]
Example

Total SS = L_yy = Σ(y_i - ȳ)² = 20405.01
Reg SS = bL_xy = b²L_xx = L_xy²/L_xx = 10219.42
Res SS = Total SS - Reg SS = L_yy - L_xy²/L_xx = 10185.5919
Reg MS = Reg SS / k = Reg SS = 10219.42
Res MS = Res SS / (n-k-1) = Res SS / (n-2) ≡ s²_{y·x} = 783.507067
R² = Reg SS / Total SS = 0.5008
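These hand computations can be reproduced numerically from the data in the table above (a sketch of my own, checking the slide's values):

```python
sales = [101.8, 44.4, 108.3, 85.1, 77.1, 158.7, 180.4, 64.2,
         74.6, 143.4, 120.6, 69.7, 67.8, 106.7, 119.6]
adv = [1.3, 0.7, 1.4, 0.5, 0.5, 1.9, 1.2, 0.4,
       0.6, 1.3, 1.6, 1.0, 0.8, 0.6, 1.1]

n = len(adv)
xbar, ybar = sum(adv) / n, sum(sales) / n
lxx = sum((x - xbar) ** 2 for x in adv)
lyy = sum((y - ybar) ** 2 for y in sales)
lxy = sum((x - xbar) * (y - ybar) for x, y in zip(adv, sales))

b = lxy / lxx                 # slope, approx. 59.679
a = ybar - b * xbar           # intercept, approx. 42.212
r2 = lxy ** 2 / (lxx * lyy)   # coefficient of determination, approx. 0.5008
print(round(b, 5), round(a, 5), round(r2, 4))
```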
Example

F test: F = Reg MS / Res MS = 13.04; p-value = P(F_{k,n-k-1} > F) = 0.0032.

t test:

$$ se(b) = s_{y\cdot x}/\sqrt{L_{xx}} = 16.5246, \qquad t = b/se(b) = 3.61. $$

p-value = 2 × (area to the right of t under the t distribution with 13 df) = 0.003.
Example

$$ b \pm t_{n-2,\alpha/2}\,se(b) = 59.67914 \pm 2.160369 \times 16.5246 = (23.97991,\ 95.37837) $$

$$ se(a) = s_{y\cdot x}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{L_{xx}}} = 17.93509 $$

$$ a \pm t_{n-2,\alpha/2}\,se(a) = 42.21205 \pm 2.160369 \times 17.93509 = (3.465644,\ 80.95847) $$
Example

Suppose that in the future we want to spend x = 0.9 (million $) on advertising and we wish to predict the first-year sales (in million $). According to the regression line we expect 59.67914 × 0.9 + 42.21205 = 95.92 million $. This estimate has standard error

$$ se(\hat{Y}) = s_{y\cdot x}\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{L_{xx}}} = 28.95. $$

A 95% CI for the first-year sales (in million $) when we spend 0.9 million $ on advertising is

$$ \hat{Y} \pm t_{n-2,\alpha/2}\,se(\hat{Y}) = 95.92 \pm 2.160369 \times 28.95 = (33.37,\ 158.46). $$
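This prediction can be reproduced from the raw data (a sketch of my own; the results match the slide's values up to rounding):

```python
import math

sales = [101.8, 44.4, 108.3, 85.1, 77.1, 158.7, 180.4, 64.2,
         74.6, 143.4, 120.6, 69.7, 67.8, 106.7, 119.6]
adv = [1.3, 0.7, 1.4, 0.5, 0.5, 1.9, 1.2, 0.4,
       0.6, 1.3, 1.6, 1.0, 0.8, 0.6, 1.1]

n = len(adv)
xbar, ybar = sum(adv) / n, sum(sales) / n
lxx = sum((x - xbar) ** 2 for x in adv)
lyy = sum((y - ybar) ** 2 for y in sales)
lxy = sum((x - xbar) * (y - ybar) for x, y in zip(adv, sales))
b = lxy / lxx
a = ybar - b * xbar
s = math.sqrt((lyy - lxy ** 2 / lxx) / (n - 2))          # s_{y.x}

x0 = 0.9                                                 # planned advertising spend
yhat = a + b * x0                                        # approx. 95.92
se = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / lxx)   # approx. 28.95
t_crit = 2.160369                                        # t_{13,0.025} from the slides
lo, hi = yhat - t_crit * se, yhat + t_crit * se          # approx. (33.38, 158.47)
print(round(yhat, 2), round(se, 2), round(lo, 2), round(hi, 2))
```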
Computer Output

      Source |       SS        df        MS             Number of obs =      15
-------------+-------------------------------           F(  1,    13) =   13.04
       Model |  10219.4158      1   10219.4158          Prob > F      =  0.0032
    Residual |  10185.5919     13   783.507067          R-squared     =  0.5008
-------------+-------------------------------           Adj R-squared =  0.4624
       Total |  20405.0077     14   1457.50055          Root MSE      =  27.991

       sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
advertisin~s |   59.67914    16.5246     3.61   0.003     23.97991    95.37837
       _cons |   42.21205    17.93509    2.35   0.035     3.465644    80.95847

Annotations on the output:
- F test: F(1, 13) = 13.04, Prob > F = 0.0032.
- Coefficient of determination: R-squared = 0.5008; Adj R-squared is an attempt to take into account the sampling variability.
- Root MSE = 27.991 is s_{y·x}.
- Regression coefficients with their t tests, p-values and confidence intervals for the coefficients; the fitted line is Y = 59.67914 x + 42.21205.
Multiple Regression

Simple linear regression is a model to predict the value of one variable from another. Multiple regression is a natural extension of this model: we use it to predict values of an outcome from several predictors.
Multiple Regression

Suppose we have k independent variables X_1, ..., X_k and a dependent variable Y. Then the multiple linear regression model is of the form

$$ Y = \alpha + \sum_{j=1}^{k} \beta_j X_j + e, \qquad e \sim N(0, \sigma^2). $$

We estimate α, β_1, ..., β_k by a, b_1, ..., b_k using the method of least squares, where we minimize the sum of

$$ \left[ Y - \left( \alpha + \sum_{j=1}^{k} \beta_j X_j \right) \right]^2. $$
Multiple Regression

In the multiple linear regression of the form Y = α + Σ_{j=1}^{k} β_j X_j + e, the β_j's are referred to as partial regression coefficients. β_j represents the average increase in Y per unit increase in X_j with all other variables held constant (or, stated another way, after adjusting for all other variables in the model), and is estimated by b_j.

Partial regression coefficients differ from simple linear regression coefficients. The latter represent the average increase in Y per unit increase in X, without considering any other independent variables. If there are strong relationships among the independent variables in a multiple regression model, then the partial regression coefficients may differ considerably from the simple linear regression coefficients obtained from considering each independent variable separately.

It is possible that an independent variable X_1 will seem to have an important effect on Y when considered by itself, but will not be significant after adjusting for another variable X_2. This usually occurs when X_1 and X_2 are strongly related to each other and X_2 is related to Y. We refer to X_2 as a confounder of the relationship between Y and X_1.
Inference - F test for Multiple Linear Regression

H0: β_1 = ... = β_k = 0 vs H1: at least one β_j ≠ 0

1. Compute F = Reg MS / Res MS, which follows the F_{k,n-k-1} distribution under H0.
2. If α is the significance level, reject H0 if F > F_{k,n-k-1,α}.
3. The p-value = P(F_{k,n-k-1} > F).

Here

$$ \text{Res SS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \text{Total SS} = \sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad \text{Reg SS} = \text{Total SS} - \text{Res SS}, $$

with

$$ \hat{y}_i = a + \sum_{j=1}^{k} b_j x_{ij}, $$

where x_{ij} is the jth independent variable for the ith subject, j = 1, ..., k, i = 1, ..., n.
Inference - T test for Multiple Linear Regression

H0: β_i = 0, all other β_j ≠ 0 vs H1: β_i ≠ 0, all other β_j ≠ 0

1. Compute t = b_i / se(b_i). This follows the t distribution with n-k-1 df under H0.
2. If α is the significance level, reject H0 if t > t_{n-k-1,α/2} or t < -t_{n-k-1,α/2}.
3. The p-value is given by
   p = 2 × (area to the left of t under the t distribution with n-k-1 df) if t < 0,
   p = 2 × (area to the right of t under the t distribution with n-k-1 df) if t ≥ 0.
Predicting Sales of a Product Based on Multiple Factors

Table: Sales of Nature-Bar, advertising expenditures, promotion expenditures, and competitors' sales, by region, for 1998.

Region         Sales ($M) Y_i   Advertising ($M) X_1i   Promotions ($M) X_2i   Competitors' Sales ($M) X_3i
Selkirk           101.8            1.3                    0.2                    20.40
Susquehanna        44.4            0.7                    0.2                    30.50
Kittery           108.3            1.4                    0.3                    24.60
Acton              85.1            0.5                    0.4                    19.60
Finger Lakes       77.1            0.5                    0.6                    25.50
Berkshire         158.7            1.9                    0.4                    21.70
Central           180.4            1.2                    1.0                     6.80
Providence         64.2            0.4                    0.4                    12.60
Nashua             74.6            0.6                    0.5                    31.30
Dunster           143.4            1.3                    0.6                    18.60
Endicott          120.6            1.6                    0.8                    19.90
Five-Towns         69.7            1.0                    0.3                    25.50
Waldeboro          67.8            0.8                    0.2                    27.40
Jackson           106.7            0.6                    0.5                    24.30
Stowe             119.6            1.1                    0.3                    13.70
Predicting Sales of a product based on Multiple Factors
Y: dependent variable – sales of Nature-Bar
k = 3 independent variables:
X1 = advertising expenditures
X2 = promotional expenditures
X3 = competitors' sales
n = 15 observations
Predicting Sales of a product based on Multiple Factors
With our data it comes out that:
Y = 65.705 + 48.979X1 + 59.654X2 − 1.838X3
Based on the above regression, suppose we want to predict sales of Nature-Bar for next year in the Nashua region, given that we plan to spend $0.7 million on advertising and $0.6 million on promotions, and we estimate that competitors' sales will remain flat at their current level of $31.30 million.
Y = 65.705 + 48.979 × 0.7 + 59.654 × 0.6 − 1.838 × 31.30 = $78.253 million
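The prediction above is just the fitted equation evaluated at the planned values; as a sketch:

```python
# Sketch: plugging the planned Nashua values into the fitted equation.
a, b1, b2, b3 = 65.705, 48.979, 59.654, -1.838   # fitted coefficients
x1, x2, x3 = 0.7, 0.6, 31.30   # advertising, promotions, competitors' sales

sales_pred = a + b1 * x1 + b2 * x2 + b3 * x3
print(round(sales_pred, 3))    # predicted sales in $million
```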
Computer output

      Source |       SS       df       MS           Number of obs =     15
-------------+------------------------------       F(  3,    11) =  18.29
       Model |  16997.5351     3  5665.84503       Prob > F      = 0.0001
    Residual |  3407.47258    11  309.770235       R-squared     = 0.8330
-------------+------------------------------       Adj R-squared = 0.7875
       Total |  20405.0077    14  1457.50055       Root MSE      =   17.6

------------------------------------------------------------------------------
       sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 advertising |   48.97876   10.65787    4.60    0.001     25.52096    72.43657
   promotion |   59.65425    23.6247    2.53    0.028     7.656646    111.6519
 competitors |  -1.837632   .8137517   -2.26    0.045    -3.628687   -.0465762
       _cons |   65.70461   27.73107    2.37    0.037     4.668938    126.7403
------------------------------------------------------------------------------
Validation
Linearity
Normality of the residuals
Heteroscedasticity
Autocorrelation
Linearity
The dependent variable Y depends linearly on the values of the independent variables.
When k = 1, check this with a scatter plot.
When k > 1, rely on common sense; check the value of R², but, as discussed before, with caution.
If there is a problem with linearity, you might need to add a quadratic term, for example, or transform both the dependent and independent variables.
Normality
The linear regression model Y = α + β1X1 + β2X2 + β3X3 + e assumes that e ~ N(0, σ²). To check this, plot a histogram of the regression residuals:
ei = yi − ŷi = yi − (a + b1x1i + b2x2i + b3x3i)
[Figure: histogram of the residuals (Frequency vs Residuals)]
If there is evidence of non-normality, you might need to transform your variables, usually the dependent.
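Each residual entering the histogram is ei = yi − ŷi. As a sketch, here is the residual for the Selkirk region under the fitted Nature-Bar equation:

```python
# Sketch: computing a single residual e_i = y_i - yhat_i for the Selkirk
# region, using the fitted Nature-Bar equation from earlier.
a, b1, b2, b3 = 65.705, 48.979, 59.654, -1.838
y_selkirk = 101.8                    # observed sales ($million)
x1, x2, x3 = 1.3, 0.2, 20.40         # Selkirk's predictor values

yhat = a + b1 * x1 + b2 * x2 + b3 * x3
residual = y_selkirk - yhat
print(round(residual, 2))
```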
Heteroscedasticity
The linear regression model Yi = α + β1x1i + β2x2i + β3x3i + ei assumes that the variance of Yi is constant (i.e., σ²). This property is called homoscedasticity.
Plot the residuals versus the independent variables, or versus the fitted values Ŷi, and check that there is no pattern.
If there is a pattern, you need to transform your dependent variable.
Heteroscedasticity
[Figure: residuals vs advertising_expenditures]
Autocorrelation
The linear regression model Yi = α + β1x1i + β2x2i + β3x3i + ei assumes that ei ~ N(0, σ²), with the ei independent. The phenomenon of autocorrelation can occur if the assumption of independence is violated.
Suppose that the regression model is specified with a time component (e.g., data for the last 14 weeks).
Plot the residuals in the time order of the observations and see if there is any kind of pattern.
If there is such a pattern, then incorporate time as one of the independent variables.
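One common numerical companion to the time-order plot (not covered in these slides) is the Durbin-Watson statistic, which is near 2 when there is no first-order autocorrelation and near 0 or 4 when there is positive or negative autocorrelation. A sketch on made-up residuals that drift smoothly over time:

```python
# Sketch: the Durbin-Watson statistic on hypothetical residuals listed
# in time order (14 "weekly" residuals, made up for illustration).
resid = [3.1, 2.4, 1.8, 0.9, -0.2, -1.1, -1.9, -2.5, -1.7, -0.6,
         0.5, 1.4, 2.2, 2.9]

dw = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid))) \
     / sum(e ** 2 for e in resid)
print(round(dw, 3))   # well below 2: a smooth, trending pattern
```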
Autocorrelation
[Figure: residuals vs observation_number]
Warnings and Issues
1. Overspecification, by the addition of too many independent variables.
Use only the independent variables that make sense. It is tempting to add more, since R² cannot decrease when variables are added, but the simpler your model the better.
A rule of thumb: n ≥ 5(k+2).
Use stepwise multiple regression: start from the null model and add the "best" variable at each step, until R² is quite large or its increase is too small.
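The forward-stepwise idea can be sketched as follows. The dataset, the 0.005 stopping threshold, and the helper function are all hypothetical choices for illustration; x3 is pure noise, so selection should stop after picking x1 and x2:

```python
# Sketch: forward stepwise selection by R-squared gain, on made-up data.
def ols_r2(Xcols, y):
    """R-squared of a least-squares fit with intercept, via the normal
    equations solved by Gaussian elimination."""
    n = len(y)
    X = [[1.0] + [col[i] for col in Xcols] for i in range(n)]
    p = len(X[0])
    A = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(p)] for r in range(p)]
    c = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for s in range(col, p):
                A[r][s] -= f * A[col][s]
            c[r] -= f * c[col]
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (c[r] - sum(A[r][s] * b[s] for s in range(r + 1, p))) / A[r][r]
    yhat = [sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
    ybar = sum(y) / n
    return 1 - sum((y[i] - yhat[i]) ** 2 for i in range(n)) / \
               sum((yi - ybar) ** 2 for yi in y)

x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x2 = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
x3 = [0.3, -0.1, 0.2, 0.0, -0.3, 0.1, -0.2, 0.3, 0.0, -0.1]  # noise
y  = [3 + 2 * u + v + e for u, v, e in zip(x1, x2, x3)]      # true model

allcols = {"x1": x1, "x2": x2, "x3": x3}
chosen, best_r2 = [], 0.0
remaining = list(allcols)
while remaining:
    scores = {name: ols_r2([allcols[c] for c in chosen + [name]], y)
              for name in remaining}
    name = max(scores, key=scores.get)
    if scores[name] - best_r2 < 0.005:   # stop when the gain is too small
        break
    chosen.append(name)
    remaining.remove(name)
    best_r2 = scores[name]

print(chosen, round(best_r2, 3))
```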
Warnings and Issues
2. Extrapolating beyond the Range of the Data.
Y = 65.705 + 48.979X1 + 59.654X2 − 1.838X3
Notice that all of the advertising expenditures (X1) for the regions in the table are between $0.4 and $1.9 million. The regression model is valid in this range. Thus it would be unwise to use the model to predict sales if we were to spend $10 million on advertising.
Warnings and Issues
3. Multicollinearity.
Two independent variables are highly correlated. Suspect it if R² is high but one or more of the variables does not pass the significance test. Check all correlations before running the regression. If multicollinearity occurs, drop one of the independent variables that is highly correlated with another one.
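A minimal sketch of such a pre-regression check, computing Pearson's r between two made-up candidate predictors where one is roughly a multiple of the other:

```python
# Sketch: checking pairwise correlation between two candidate predictors
# before fitting. The data are made up: x2 is roughly 2 * x1, so the two
# carry almost the same information.
from math import sqrt

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.1, 3.9, 6.2, 8.0, 10.1, 11.8, 14.2, 15.9]

def pearson(u, v):
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / sqrt(suu * svv)

r = pearson(x1, x2)
print(round(r, 3))   # close to 1: keep only one of the two predictors
```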
Multicollinearity
Table: Undergraduate grade point average (GPA), GMAT score, and graduate school GPA for 25 MBA students

Student  Undergraduate GPA  GMAT  Graduate School GPA
 1       3.9                640   4.0
 2       3.9                644   4.0
 3       3.1                557   3.1
 4       3.2                550   3.1
 5       3.0                547   3.0
 6       3.5                589   3.5
 7       3.0                533   3.1
 8       3.5                600   3.5
 9       3.2                630   3.1
10       3.2                548   3.2
11       3.2                600   3.8
12       3.7                633   4.1
13       3.9                546   2.9
14       3.0                602   3.7
15       3.7                614   3.8
16       3.8                644   3.9
17       3.9                634   3.6
18       3.7                572   3.1
19       3.0                570   3.3
20       3.2                656   4.0
21       3.9                574   3.1
22       3.1                636   3.7
23       3.7                635   3.7
24       4.0                654   3.9
25       3.8                633   3.8
Multicollinearity
Graduate GPA = 0.09540 + 1.13 (Under. GPA) − 0.0088 (GMAT),   R² = 0.960
(the GMAT coefficient is not significant)
Corr(Under. GPA, GMAT) = 0.895
Graduate GPA = −0.1287 + 1.0413 (Under. GPA),   R² = 0.958
(the Under. GPA coefficient is significant)
Outliers
Observations that lie outside the overall pattern of the other observations.
Observations with large residuals.
Observations falling far from the regression line while not following the pattern of the relationship apparent in the others.
Outliers
Outliers can distort the regression results, so many scientists remove them to obtain a better fit. But be CAREFUL! Remove an outlier only if you are sure that it is a bad data point. Transforming the data is one way to soften the impact of outliers, since the most commonly used transformations, square roots and logarithms, shrink larger values to a much greater extent than they shrink smaller values.
Outliers should be investigated carefully. Often they contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether similar values are likely to continue to appear. Of course, outliers are often bad data points.
Other Types of Regression
Non-linear (e.g., add a quadratic term).
Other Types of Regression
Logistic Regression. The dependent variable Y is binary (common in medical research).
Poisson Regression. The dependent variable Y is a count.
Dummy Variables
We would like to use linear regression to predict the effect that a particular phenomenon has on the value of the dependent variable, where the phenomenon in question either takes place or not.
Dummy Variables
Table: Annual repair costs for 19 vehicles at an automobile dealership

Vehicle  Age of Vehicle (Years)  Automatic Transmission (Yes=1, No=0)  Annual Repair Costs ($)
 1       3                       1                                      956
 2       4                       0                                      839
 3       6                       0                                     1257
 4       5                       1                                     1225
 5       4                       1                                     1288
 6       2                       1                                      728
 7       4                       0                                      961
 8       8                       1                                     1588
 9       7                       0                                     1524
10       4                       0                                      875
11       3                       1                                      999
12       5                       1                                     1295
13       3                       0                                      884
14       2                       1                                      789
15       4                       0                                      785
16       3                       1                                      923
17       4                       1                                     1223
18       9                       0                                     1770
19       2                       1                                      692
Dummy Variables
Repair Cost = α + β1X1 + β2X2 + e,   where e ~ N(0, σ²)
X1: age of the vehicle
X2: dummy variable (X2 = 1 or 0 depending on whether or not the vehicle has an automatic transmission)
R² = 0.913

             Coeff.    St. Err.
Intercept   288.133    72.332
Age         160.730    12.424
Automatic   176.964    48.335

Repair Cost = 288.133 + 160.730 X1 + 176.964 X2
The coefficient 176.964 is the estimate of the additional annual repair cost if the vehicle has an automatic transmission.
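A sketch of a dummy-variable fit on constructed data, where the outcome is built exactly as cost = 300 + 150·age + 180·automatic, so least squares must recover these coefficients (the numbers are hypothetical, not the dealership data):

```python
# Sketch: a regression with one dummy variable, on constructed data where
# the true relationship is exact: cost = 300 + 150*age + 180*automatic.
rows = [  # (age, automatic, cost)
    (2, 0, 600), (3, 0, 750), (4, 0, 900), (5, 0, 1050),
    (2, 1, 780), (3, 1, 930), (4, 1, 1080), (6, 1, 1380),
]
y = [row[2] for row in rows]
X = [[1.0, row[0], row[1]] for row in rows]   # intercept, age, dummy
n, p = len(y), 3

# normal equations, solved by Gaussian elimination with partial pivoting
A = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(p)] for r in range(p)]
c = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
for col in range(p):
    piv = max(range(col, p), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    c[col], c[piv] = c[piv], c[col]
    for r in range(col + 1, p):
        f = A[r][col] / A[col][col]
        for s in range(col, p):
            A[r][s] -= f * A[col][s]
        c[r] -= f * c[col]
b = [0.0] * p
for r in range(p - 1, -1, -1):
    b[r] = (c[r] - sum(A[r][s] * b[s] for s in range(r + 1, p))) / A[r][r]

print([round(v, 3) for v in b])   # intercept, age effect, automatic effect
```

The fitted dummy coefficient (180 here) is read exactly as in the slide: the additional cost associated with an automatic transmission, at any fixed age.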
Dummy Variables
Suppose we have a categorical variable C with k categories. To represent that variable in a multiple regression model we construct k−1 dummy variables of the form
X1 = 1 if the subject is in category 2, 0 otherwise
X2 = 1 if the subject is in category 3, 0 otherwise
…
Xk−1 = 1 if the subject is in category k, 0 otherwise
The category omitted (category 1) is referred to as the reference group. It is arbitrary which group is assigned to be the reference group.
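A minimal sketch of this coding scheme, with a hypothetical k = 4 category variable and category "A" as the reference group:

```python
# Sketch: building k-1 dummy variables for a categorical variable with
# k = 4 categories, using category "A" as the reference group.
categories = ["A", "B", "C", "D"]
reference = "A"
levels = [c for c in categories if c != reference]   # B, C, D -> X1, X2, X3

def encode(value):
    """Return the k-1 dummy values for one subject."""
    return [1 if value == lvl else 0 for lvl in levels]

subjects = ["A", "C", "B", "D", "A"]
dummies = [encode(s) for s in subjects]
print(dummies)   # reference-group subjects get all zeros
```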
Dummy Variables
To relate the categorical variable C to an outcome Y, we use the multiple regression model Y = α + β1X1 + … + βk−1Xk−1 + e.
How can we compare categories from this model? From the above equation, the average Y for subjects in category 2 is α + β1. Thus β1 represents the difference between the average value of Y for subjects in category 2 and the average value of Y for subjects in the reference category. Similarly, βj represents the difference between the average value of Y for subjects in category (j+1) and the average value of Y for subjects in the reference category.
A fixed-effects one-way ANOVA model can be represented by a multiple linear regression model based on a dummy-variable specification for the grouping variable.
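A sketch of this equivalence on made-up data: with dummy coding of the grouping variable, the least-squares solution is exactly the group means, so α is the reference-group mean and each βj is a difference of group means:

```python
# Sketch: one-way ANOVA as dummy-variable regression. For a saturated
# dummy coding, OLS fits each group's mean exactly, so the coefficients
# can be read off directly from the group means. Data are made up.
groups = {
    "g1": [10.0, 12.0, 11.0, 13.0],   # reference group
    "g2": [15.0, 17.0, 16.0],
    "g3": [20.0, 22.0, 21.0, 23.0],
}
means = {g: sum(v) / len(v) for g, v in groups.items()}

alpha = means["g1"]                            # intercept = reference mean
beta = {g: means[g] - alpha for g in ("g2", "g3")}   # mean differences
print(alpha, beta)
```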