Correlation - Regression (Τομέας Μαθηματικών / …fouskakis/5-6.correlation)
TRANSCRIPT

Correlation - Regression

Interdepartmental Postgraduate Program "Occupational and Environmental Health - Management and Economic Valuation"

Δημήτρης Φουσκάκης (Dimitris Fouskakis)
Correlation

We want to assess whether two variables are linearly associated, that is, whether the values of one variable tend to be higher or lower for higher values of the other variable. Methods for studying the association between categorical variables will be introduced in the next chapter. Here we consider continuous variables, using the method known as correlation.
Correlation

Correlation is therefore the method of analysis to use when studying the possible linear association between two continuous variables. The degree of linear association is measured by the sample correlation coefficient (also called Pearson's r), often loosely just called the correlation. It is a coefficient, denoted by r, that does not depend on the units used to measure the data and is bounded between -1 and 1.
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} $$
Correlation

- A negative coefficient means that the data are clustered around lines with a negative slope: as one variable increases, the other decreases.
- The closer r is to 1, the stronger the positive linear association between the variables.
- The closer r is to -1, the stronger the negative linear association between the variables.
- When r equals 1 or -1 there is total linear association between the variables: all points lie on a line.
- When r equals 0 there is no linear association between the two variables.
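As a quick illustration (a minimal sketch, not part of the slides; the function name `pearson_r` and the toy data are my own), the sample correlation can be computed directly from the formula above:

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient: Lxy / sqrt(Lxx * Lyy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    lxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    lxx = sum((a - mx) ** 2 for a in x)
    lyy = sum((b - my) ** 2 for b in y)
    return lxy / math.sqrt(lxx * lyy)

# Points lying exactly on a line with positive slope give r = 1,
# and on a line with negative slope give r = -1, as the slide states.
x = [1.0, 2.0, 3.0, 4.0]
print(pearson_r(x, [2 * v + 1 for v in x]))   # 1.0
print(pearson_r(x, [9 - 2 * v for v in x]))   # -1.0
```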
Correlation - Confidence Interval for ρ

We can obtain a 95% C.I. for the correlation coefficient in the population, ρ. First compute Fisher's transformation

$$ z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right), $$

then

$$ z_1 = z - 1.96/\sqrt{n-3}, \qquad z_2 = z + 1.96/\sqrt{n-3}. $$

The 95% C.I. for ρ is

$$ \left( \frac{e^{2z_1}-1}{e^{2z_1}+1},\; \frac{e^{2z_2}-1}{e^{2z_2}+1} \right). $$
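The three steps above can be sketched in code (my own illustration; the example values r = 0.7, n = 50 are hypothetical):

```python
import math

def fisher_ci(r, n):
    """Approximate 95% CI for rho via Fisher's z transformation."""
    z = 0.5 * math.log((1 + r) / (1 - r))       # Fisher's z
    half = 1.96 / math.sqrt(n - 3)
    z1, z2 = z - half, z + half
    back = lambda w: (math.exp(2 * w) - 1) / (math.exp(2 * w) + 1)
    return back(z1), back(z2)

lo, hi = fisher_ci(0.7, 50)
print(round(lo, 3), round(hi, 3))  # roughly (0.52, 0.82); the interval contains r
```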
Correlation - Hypothesis Testing for ρ

H0: ρ = 0 vs H1: ρ ≠ 0

Under the null hypothesis

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}. $$

Therefore reject H0 if t > t_{n-2,α/2} or t < -t_{n-2,α/2}, where α is the significance level.
If t > 0 the p-value is twice the area of St(n-2) to the right of t.
If t < 0 the p-value is twice the area of St(n-2) to the left of t.
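A small sketch of the test statistic (my own example; r = 0.5 and n = 20 are hypothetical, and the critical value t_{18,0.025} = 2.101 is taken from standard t tables):

```python
import math

def corr_t(r, n):
    """Test statistic t = r*sqrt(n-2)/sqrt(1-r^2) for H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Example: r = 0.5 observed from n = 20 pairs; t_{18,0.025} = 2.101.
t = corr_t(0.5, 20)
print(round(t, 3), abs(t) > 2.101)  # 2.449 True, so H0 is rejected at the 5% level
```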
Correlation - Assumptions

The correlation coefficient can be calculated for any data. However, in order to construct the hypothesis test for ρ, at least one variable should be normally distributed. For the calculation of a valid confidence interval, both variables should be normally distributed. Also, the observations should be independent (if not, you can calculate for example the intraclass correlation, as we mentioned in the previous chapter).
Spearman's Rank Correlation

- Rank in total the values of each variable.
- Calculate Pearson's r on the ranks of the data and call this r_s.
- Confidence interval: the distribution of r_s is similar to that of r for samples larger than about 10, so a CI for ρ_s (the population rank correlation) can be obtained using the same formula as before, with r_s in place of r.
- Hypothesis testing (for large samples only, n > 30):

H0: ρ_s = 0 vs H1: ρ_s ≠ 0

Under the null hypothesis

$$ t = \frac{r_s\sqrt{n-2}}{\sqrt{1-r_s^2}} \sim t_{n-2}. $$

Therefore reject H0 if t > t_{n-2,α/2} or t < -t_{n-2,α/2}, where α is the significance level.
If t > 0 the p-value is twice the area of St(n-2) to the right of t.
If t < 0 the p-value is twice the area of St(n-2) to the left of t.
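The two bullet steps (rank, then apply Pearson's r to the ranks) can be sketched as follows (my own illustration, including average ranks for ties):

```python
import math

def ranks(v):
    """Ranks 1..n; tied values get the average of the tied positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    out = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average 1-based rank of positions i..j
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Pearson's r computed on the ranks of the data."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    lxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    lxx = sum((a - mx) ** 2 for a in rx)
    lyy = sum((b - my) ** 2 for b in ry)
    return lxy / math.sqrt(lxx * lyy)

# A monotone but nonlinear relationship still gives rs = 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spearman(x, [v ** 3 for v in x]))  # 1.0
```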
Ecological Correlations

Correlations based on rates or averages can be misleading.

Example 1: The relationship between the rate of cigarette smoking (per capita) and the rate of deaths from lung cancer in 11 countries gave a correlation of 0.7. However, it is not countries which smoke and get cancer, but people. To measure the strength of the relationship for people, we must have data for individual people.
Ecological Correlations

Example 2: From Current Population Survey data for 1993, you can compute the correlation between income and education for men aged 25-54 in the US: r = 0.44. For each state you can compute the average educational level and income. Finally, you can compute the correlation between the pairs of averages: this is 0.64! If you use the correlation for the states to estimate the correlation for the individuals, you would be way off. The reason is that within each state there is a lot of spread around the averages. Replacing the states by their averages eliminates the spread and gives a misleading impression of tight clustering.
Association vs. Causation

For school children, shoe size is strongly correlated with reading skills. However, learning new words does not make the feet get bigger. Instead, there is a third factor involved: age. As children get older, they learn to read better and they outgrow their shoes.

Correlation measures association. But association does not necessarily show causation. It may only show that both variables are simultaneously influenced by some third variable.
Regression - Introduction

Basic idea: use data to identify relationships among variables, and use these relationships to make predictions. Regression analysis describes the relationship between two (or more) variables.

Examples:
- Income and educational level.
- Demand for electricity and the weather.
- Home sales and interest rates.
Simple Example

A linear model for hours worked:

Hours worked = α + β * per-capita GDP

where:
- Hours of work: dependent variable (Y)
- GDP per capita: independent variable (X)
- α: intercept (or baseline) and β: slope are the regression coefficients.

The slope of this line gives

$$ \beta = \frac{\text{Change in Hours Worked}}{\text{Change in GDP per capita}}. $$

If β > 0, hours worked increase with the level of income. If β < 0, the work week gets shorter as a country develops.
Simple Example

We want to find coefficient values that give a good 'fit' of the data. A plot of the data is called a scatter diagram. It describes the relationship between Hours Worked and GDP per capita for several countries.
[Figure: Scatter Diagram: Hours Worked and GDP per Capita. Weekly Hours Worked (30-55) plotted against GDP per capita ($0-40,000).]
So Many Choices...

[Figure: the same scatter diagram with several candidate lines drawn through the points (e.g. Line C).]
Simple Example

The regression line is the line that best summarizes the data. More precisely, it is the line that minimizes the distance between every point in the scatter diagram and the corresponding point on the line. This method of estimating the regression line is called least squares.
[Figure: Scatter Diagram and Regression Line. Weekly Hours Worked against GDP per capita with the fitted line.]
Simple Example

In our example the regression line is:

Hours Worked = 45.3 - 0.00025 Per-capita GDP
               (1.52)  (0.00007)

A $1,000 increase in GDP per capita reduces weekly hours worked by a quarter of an hour. The standard errors (in parentheses) are a measure of the statistical precision with which the coefficients are estimated.
General Concepts

This is a study relating birthweight to the estriol level of pregnant women. Let x = estriol level and y = birthweight. From a scatter plot of the data there appears to be a linear relationship.

 i   Estriol (mg/24hr) x_i   Birthweight (g/100) y_i     i   Estriol (mg/24hr) x_i   Birthweight (g/100) y_i
 1           7                      25                  17          17                      32
 2           9                      25                  18          25                      32
 3           9                      25                  19          27                      34
 4          12                      27                  20          15                      34
 5          14                      27                  21          15                      34
 6          16                      27                  22          15                      35
 7          16                      24                  23          16                      35
 8          14                      30                  24          19                      34
 9          16                      30                  25          18                      35
10          16                      31                  26          17                      36
11          17                      30                  27          18                      37
12          19                      31                  28          20                      38
13          21                      30                  29          22                      40
14          24                      28                  30          25                      39
15          15                      32                  31          24                      43
16          18                      32
General Concepts

[Figure: scatter plot of birthweight (25-45) against estriol (5-25).]
General Concepts

We can postulate a relationship between y and x of the form E(Y|X) = α + βX. That is, for a given estriol level x, the average birthweight is α + βx. The line y = α + βx is the regression line, where α is the intercept and β is the slope.

The relationship y = α + βx is not expected to hold exactly for every woman. For example, not all women with a given estriol level have babies with identical birthweights. Thus an error term e, which represents the variability of birthweight among all babies of women with a given estriol level x, is introduced into the model. Let's assume that e follows a normal distribution with mean 0 and variance σ². The full linear regression model takes the form

Y = α + βX + e

where Y is the dependent variable and X is the independent variable (or predictor).
General Concepts

The interpretation of the regression line is that for a woman with estriol level x, the corresponding birthweight will be normally distributed with mean α + βx and variance σ². If σ² = 0 then every point falls exactly on the regression line.

The interpretation of β is the following. If β > 0 then as X increases the expected value of Y given X will increase. If β < 0 then as X increases the expected value of Y given X will decrease. If β = 0 then there is no linear relationship between X and Y.
General Concepts

[Figure: two scatter plots: σ² = 0 gives a perfect fit (all points on the line); σ² > 0 gives an imperfect fit.]
General Concepts

[Figure: three regression lines illustrating β > 0, β < 0 and β = 0.]
Fitting Regression Lines - Least Squares Method

[Figure: scatter plot with the line y = α + βx; for each sample point (x_i, y_i) the vertical distance d_i to the corresponding point (x_i, ŷ_i) on the line is shown.]
Fitting Regression Lines - Least Squares Method

The Least Squares line is the line y = a + bx that minimizes the sum of squared distances of the sample points from the line, given by

$$ S = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (y_i - a - bx_i)^2. $$

This method of estimating the parameters of the regression line is known as the method of least squares. The resulting a and b are the estimates of α and β.
Fitting Regression Lines - Least Squares Method

The raw sum of squares for x is $\sum_{i=1}^{n} x_i^2$; the corrected sum of squares for x is

$$ L_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 \Big/ n. $$

Similarly, the raw sum of squares for y is $\sum_{i=1}^{n} y_i^2$; the corrected sum of squares for y is

$$ L_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2 \Big/ n. $$
Fitting Regression Lines - Least Squares Method

Notice that L_xx and L_yy are the numerators of the sample variances of x (i.e. S_X²) and of y (i.e. S_Y²).

The raw sum of cross products is $\sum_{i=1}^{n} x_i y_i$; the corrected sum of cross products is

$$ L_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right) \Big/ n. $$
Fitting Regression Lines - Least Squares Method

The coefficients of the least-squares line y = a + bx are given by:

$$ b = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{L_{xy}}{L_{xx}} \qquad \text{and} \qquad a = \bar{y} - b\bar{x}. $$

The predicted, or average, value of Y for a given value of x, as estimated from the fitted regression line, is denoted by

$$ \hat{Y} = a + bx. $$
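The two formulas can be sketched directly in code (my own illustration; the toy data are generated from a known line so the fit can be checked):

```python
def least_squares(x, y):
    """Least-squares estimates: b = Lxy/Lxx, a = ybar - b*xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    lxx = sum((v - xbar) ** 2 for v in x)
    lxy = sum((v - xbar) * (w - ybar) for v, w in zip(x, y))
    b = lxy / lxx
    return ybar - b * xbar, b

# Data generated from y = 2 + 3x is recovered exactly.
x = [0.0, 1.0, 2.0, 3.0]
a, b = least_squares(x, [2 + 3 * v for v in x])
print(a, b)  # 2.0 3.0
```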
Inference

[Figure: scatter plot with the regression line y = a + bx passing through the point (x̄, ȳ). For a sample point (x_i, y_i), the total deviation (y_i - ȳ) is split into the residual component (y_i - ŷ_i) and the regression component (ŷ_i - ȳ).]
Inference

First notice that the point (x̄, ȳ) always falls on the regression line. For any sample point (x_i, y_i) the residual (or residual component) is defined by (y_i - ŷ_i). For any sample point the regression component is defined by (ŷ_i - ȳ). Notice that the deviation satisfies

$$ (y_i - \bar{y}) = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}). $$

A good fitting regression line will have regression components large in absolute value relative to the residual components, while the opposite is true for poorly fitting regression lines.
Inference

Squaring the decomposition and summing over the sample gives

$$ \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2, $$

that is, Total SS = Res SS + Reg SS, where

$$ \text{Reg SS} = bL_{xy} = b^2 L_{xx} = L_{xy}^2/L_{xx}, \qquad \text{Res SS} = \text{Total SS} - \text{Reg SS} = L_{yy} - L_{xy}^2/L_{xx}. $$

Also

Reg MS = Reg SS / k,
Res MS = Res SS / (n - k - 1) ≡ s²_{y·x},

where k is the number of predictors (in the simple linear regression case k = 1) and s²_{y·x} is an estimate of σ², the variance of y given x.
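The decomposition Total SS = Reg SS + Res SS can be checked numerically (a sketch of my own; the small data set is hypothetical):

```python
def anova_ss(x, y):
    """Total SS, Reg SS and Res SS for the least-squares line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    lxx = sum((v - xbar) ** 2 for v in x)
    lyy = sum((w - ybar) ** 2 for w in y)
    lxy = sum((v - xbar) * (w - ybar) for v, w in zip(x, y))
    reg_ss = lxy ** 2 / lxx               # Reg SS = Lxy^2 / Lxx
    return lyy, reg_ss, lyy - reg_ss      # Total SS, Reg SS, Res SS

total, reg, res = anova_ss([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 9.0])
print(total, reg, res)  # Total SS equals Reg SS + Res SS by construction
```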
Inference - F test for Simple Linear Regression

H0: β = 0 vs H1: β ≠ 0

1. Compute F = Reg MS / Res MS, which follows the F_{1,n-2} distribution under H0.
2. If α is the significance level, reject H0 if F > F_{1,n-2,α}.
3. The p-value = P(F_{1,n-2} > F).

All these results are usually summarized in an ANOVA table.
Inference - Coefficient of Determination

A summary measure of goodness of fit frequently referred to in the literature is the coefficient of determination:

R² = Reg SS / Total SS

R² is the proportion of the total variation of the observed values of Y that is accounted for by the regression equation of the independent variables. Always 0 ≤ R² ≤ 1. If it equals 1 then all variation in Y can be explained by variation in X, and all data points fall on the regression line. In other words, once X is known, Y can be predicted exactly, with no error or variability in the prediction. If R² = 0 then X gives no information about Y, and the variance of Y is the same with or without knowing X.
Coefficient of Determination

[Figure: two scatter plots with fitted lines, one with R² = 0.69 and one with R² = 0.46 (fitted values plotted against the predictor, e.g. advertising expenditures in $ million).]

The value of R² is frequently used to measure the extent to which the regression model fits the data. This is WRONG! There are other ways to determine whether a linear regression is valid or not.
Coefficient of Determination and Sample Correlation

It can be proved that R² = [r(X,Y)]². Thus R² is the square of the sample correlation coefficient between the dependent variable Y and the independent variable X.

The estimate of the slope b in the simple linear regression model can be written as

$$ b = r(x,y) \times \frac{S_y}{S_x} $$

where S_X and S_Y are the sample standard deviations of X and Y respectively.
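This identity is easy to verify numerically (my own sketch; note that S_y/S_x = sqrt(L_yy/L_xx), since the (n-1) factors cancel):

```python
import math

def slope_two_ways(x, y):
    """Return b from least squares and b from r(x,y) * Sy/Sx; they coincide."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    lxx = sum((v - mx) ** 2 for v in x)
    lyy = sum((w - my) ** 2 for w in y)
    lxy = sum((v - mx) * (w - my) for v, w in zip(x, y))
    b_ls = lxy / lxx
    r = lxy / math.sqrt(lxx * lyy)
    b_via_r = r * math.sqrt(lyy / lxx)    # Sy/Sx = sqrt(Lyy/Lxx)
    return b_ls, b_via_r

print(slope_two_ways([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 9.0]))
```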
Inference - T test for Simple Linear Regression

H0: β = 0 vs H1: β ≠ 0

1. Compute t = b / se(b), where se(b) = s_{y·x} / √L_xx. This follows the t distribution with n-2 df under H0.
2. If α is the significance level, reject H0 if t > t_{n-2,α/2} or t < -t_{n-2,α/2}.
3. The p-value is given by
   p = 2 × (area to the left of t under the t distribution with n-2 df) if t < 0,
   p = 2 × (area to the right of t under the t distribution with n-2 df) if t ≥ 0.
Inference - Confidence Intervals

Two-sided 100% × (1-α) CIs for the parameters of the regression line, α and β, are

$$ b \pm t_{n-2,\alpha/2}\,se(b) \qquad \text{and} \qquad a \pm t_{n-2,\alpha/2}\,se(a), $$

where

$$ se(b) = s_{y\cdot x}/\sqrt{L_{xx}} \qquad \text{and} \qquad se(a) = s_{y\cdot x}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{L_{xx}}}. $$
Prediction

Suppose we wish to make predictions from a regression line for an individual observation with independent variable value x that was not used in constructing the regression line. The distribution of observed Y values for the subset of individuals with independent variable value x is normal with mean Ŷ = a + bx and standard error

$$ se(\hat{Y}) = s_{y\cdot x}\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{L_{xx}}}. $$

Furthermore, a two-sided 100% × (1-α) CI for the observed values (prediction interval for Y) is given by

$$ \hat{Y} \pm t_{n-2,\alpha/2}\,se(\hat{Y}). $$
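The prediction interval formula can be sketched as a small function (my own illustration; the toy data and the critical value t_{2,0.025} = 4.303, taken from t tables, are not from the slides):

```python
import math

def prediction_interval(x, y, x0, t_crit):
    """Prediction interval Yhat +/- t_crit * se(Yhat) for a new observation at x0."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    lxx = sum((v - xbar) ** 2 for v in x)
    lyy = sum((w - ybar) ** 2 for w in y)
    lxy = sum((v - xbar) * (w - ybar) for v, w in zip(x, y))
    b = lxy / lxx
    a = ybar - b * xbar
    s = math.sqrt((lyy - lxy ** 2 / lxx) / (n - 2))          # s_{y.x}
    yhat = a + b * x0
    se = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / lxx)   # se(Yhat)
    return yhat - t_crit * se, yhat + t_crit * se

# The interval is centred at the fitted value and is narrowest at x0 = xbar.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
lo, hi = prediction_interval(x, y, 2.5, 4.303)   # t_{2,0.025} = 4.303
print(lo, hi)
```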
Example

n = 15 observations
Y = First Year Sales ($ million)
X = Advertising Expenditures ($ million)
We try to fit a simple linear regression model: Y = α + βX.

Y (First Year Sales, $ million)   X (Advertising expenditures, $ million)
101.8                             1.3
 44.4                             0.7
108.3                             1.4
 85.1                             0.5
 77.1                             0.5
158.7                             1.9
180.4                             1.2
 64.2                             0.4
 74.6                             0.6
143.4                             1.3
120.6                             1.6
 69.7                             1.0
 67.8                             0.8
106.7                             0.6
119.6                             1.1
Example

$$ L_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 \Big/ n = 2.869333 $$

$$ L_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2 \Big/ n = 20405.1 $$

$$ L_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right) \Big/ n = 171.2393 $$
Example

$$ b = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{L_{xy}}{L_{xx}} = 59.67914 \qquad \text{and} \qquad a = \bar{y} - b\bar{x} = 42.21205 $$
Example

[Figure: scatter plot of First Year Sales ($ million, 40-140) against Advertising expenditures ($ million, 0.5-2.5) with the fitted regression line.]
Example

Total SS = L_yy = Σ(y_i - ȳ)² = 20405.01
Reg SS = bL_xy = b²L_xx = L_xy²/L_xx = 10219.42
Res SS = Total SS - Reg SS = L_yy - L_xy²/L_xx = 10185.5919
Reg MS = Reg SS / k = Reg SS = 10219.42
Res MS = Res SS / (n-k-1) = Res SS / (n-2) ≡ s²_{y·x} = 783.507067
R² = Reg SS / Total SS = 0.5008
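These hand computations can be reproduced numerically from the data in the table above (a sketch of my own, checking the slide's values):

```python
sales = [101.8, 44.4, 108.3, 85.1, 77.1, 158.7, 180.4, 64.2,
         74.6, 143.4, 120.6, 69.7, 67.8, 106.7, 119.6]
adv = [1.3, 0.7, 1.4, 0.5, 0.5, 1.9, 1.2, 0.4,
       0.6, 1.3, 1.6, 1.0, 0.8, 0.6, 1.1]

n = len(adv)
xbar, ybar = sum(adv) / n, sum(sales) / n
lxx = sum((x - xbar) ** 2 for x in adv)
lyy = sum((y - ybar) ** 2 for y in sales)
lxy = sum((x - xbar) * (y - ybar) for x, y in zip(adv, sales))

b = lxy / lxx                 # slope, approx. 59.679
a = ybar - b * xbar           # intercept, approx. 42.212
r2 = lxy ** 2 / (lxx * lyy)   # coefficient of determination, approx. 0.5008
print(round(b, 5), round(a, 5), round(r2, 4))
```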
Example

F test: F = Reg MS / Res MS = 13.04; p-value = P(F_{k,n-k-1} > F) = 0.0032.

t test:

$$ se(b) = s_{y\cdot x}/\sqrt{L_{xx}} = 16.5246, \qquad t = b/se(b) = 3.61. $$

p-value = 2 × (area to the right of t under the t distribution with 13 df) = 0.003.
Example

$$ b \pm t_{n-2,\alpha/2}\,se(b) = 59.67914 \pm 2.160369 \times 16.5246 = (23.97991,\ 95.37837) $$

$$ se(a) = s_{y\cdot x}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{L_{xx}}} = 17.93509 $$

$$ a \pm t_{n-2,\alpha/2}\,se(a) = 42.21205 \pm 2.160369 \times 17.93509 = (3.465644,\ 80.95847) $$
Example

Suppose that in the future we want to spend x = 0.9 (million $) on advertising and we wish to predict the first-year sales (in million $). According to the regression line we expect 59.67914 × 0.9 + 42.21205 = 95.92 million $. This estimate has standard error

$$ se(\hat{Y}) = s_{y\cdot x}\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{L_{xx}}} = 28.95. $$

A 95% CI for the first-year sales (in million $) when we spend 0.9 million $ on advertising is

$$ \hat{Y} \pm t_{n-2,\alpha/2}\,se(\hat{Y}) = 95.92 \pm 2.160369 \times 28.95 = (33.37,\ 158.46). $$
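This prediction can be reproduced from the raw data (a sketch of my own; the results match the slide's values up to rounding):

```python
import math

sales = [101.8, 44.4, 108.3, 85.1, 77.1, 158.7, 180.4, 64.2,
         74.6, 143.4, 120.6, 69.7, 67.8, 106.7, 119.6]
adv = [1.3, 0.7, 1.4, 0.5, 0.5, 1.9, 1.2, 0.4,
       0.6, 1.3, 1.6, 1.0, 0.8, 0.6, 1.1]

n = len(adv)
xbar, ybar = sum(adv) / n, sum(sales) / n
lxx = sum((x - xbar) ** 2 for x in adv)
lyy = sum((y - ybar) ** 2 for y in sales)
lxy = sum((x - xbar) * (y - ybar) for x, y in zip(adv, sales))
b = lxy / lxx
a = ybar - b * xbar
s = math.sqrt((lyy - lxy ** 2 / lxx) / (n - 2))          # s_{y.x}

x0 = 0.9                                                 # planned advertising spend
yhat = a + b * x0                                        # approx. 95.92
se = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / lxx)   # approx. 28.95
t_crit = 2.160369                                        # t_{13,0.025} from the slides
lo, hi = yhat - t_crit * se, yhat + t_crit * se          # approx. (33.38, 158.47)
print(round(yhat, 2), round(se, 2), round(lo, 2), round(hi, 2))
```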
Computer Output

      Source |       SS        df        MS             Number of obs =      15
-------------+-------------------------------           F(  1,    13) =   13.04
       Model |  10219.4158      1   10219.4158          Prob > F      =  0.0032
    Residual |  10185.5919     13   783.507067          R-squared     =  0.5008
-------------+-------------------------------           Adj R-squared =  0.4624
       Total |  20405.0077     14   1457.50055          Root MSE      =  27.991

       sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
advertisin~s |   59.67914    16.5246     3.61   0.003     23.97991    95.37837
       _cons |   42.21205    17.93509    2.35   0.035     3.465644    80.95847

Annotations on the output:
- F test: F(1, 13) = 13.04, Prob > F = 0.0032.
- Coefficient of determination: R-squared = 0.5008; Adj R-squared is an attempt to take into account the sampling variability.
- Root MSE = 27.991 is s_{y·x}.
- Regression coefficients with their t tests, p-values and confidence intervals for the coefficients; the fitted line is Y = 59.67914 x + 42.21205.
Multiple Regression

Simple linear regression is a model to predict the value of one variable from another. Multiple regression is a natural extension of this model: we use it to predict values of an outcome from several predictors.
Multiple Regression

Suppose we have k independent variables X_1, ..., X_k and a dependent variable Y. Then the multiple linear regression model is of the form

$$ Y = \alpha + \sum_{j=1}^{k} \beta_j X_j + e, \qquad e \sim N(0, \sigma^2). $$

We estimate α, β_1, ..., β_k by a, b_1, ..., b_k using the method of least squares, where we minimize the sum of

$$ \left[ Y - \left( \alpha + \sum_{j=1}^{k} \beta_j X_j \right) \right]^2. $$
Multiple Regression

In the multiple linear regression of the form Y = α + Σ_{j=1}^{k} β_j X_j + e, the β_j's are referred to as partial regression coefficients. β_j represents the average increase in Y per unit increase in X_j with all other variables held constant (or, stated another way, after adjusting for all other variables in the model), and is estimated by b_j.

Partial regression coefficients differ from simple linear regression coefficients. The latter represent the average increase in Y per unit increase in X, without considering any other independent variables. If there are strong relationships among the independent variables in a multiple regression model, then the partial regression coefficients may differ considerably from the simple linear regression coefficients obtained from considering each independent variable separately.

It is possible that an independent variable X_1 will seem to have an important effect on Y when considered by itself, but will not be significant after adjusting for another variable X_2. This usually occurs when X_1 and X_2 are strongly related to each other and X_2 is related to Y. We refer to X_2 as a confounder of the relationship between Y and X_1.
Inference - F test for Multiple Linear Regression

H0: β_1 = ... = β_k = 0 vs H1: at least one β_j ≠ 0

1. Compute F = Reg MS / Res MS, which follows the F_{k,n-k-1} distribution under H0.
2. If α is the significance level, reject H0 if F > F_{k,n-k-1,α}.
3. The p-value = P(F_{k,n-k-1} > F).

Here

$$ \text{Res SS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \text{Total SS} = \sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad \text{Reg SS} = \text{Total SS} - \text{Res SS}, $$

with

$$ \hat{y}_i = a + \sum_{j=1}^{k} b_j x_{ij}, $$

where x_{ij} is the jth independent variable for the ith subject, j = 1, ..., k, i = 1, ..., n.
Inference - T test for Multiple Linear Regression

H0: β_i = 0, all other β_j ≠ 0 vs H1: β_i ≠ 0, all other β_j ≠ 0

1. Compute t = b_i / se(b_i). This follows the t distribution with n-k-1 df under H0.
2. If α is the significance level, reject H0 if t > t_{n-k-1,α/2} or t < -t_{n-k-1,α/2}.
3. The p-value is given by
   p = 2 × (area to the left of t under the t distribution with n-k-1 df) if t < 0,
   p = 2 × (area to the right of t under the t distribution with n-k-1 df) if t ≥ 0.
Predicting Sales of a Product Based on Multiple Factors

Table: Sales of Nature-Bar, advertising expenditures, promotion expenditures, and competitors' sales, by region, for 1998.

Region         Sales ($M) Y_i   Advertising ($M) X_1i   Promotions ($M) X_2i   Competitors' Sales ($M) X_3i
Selkirk           101.8            1.3                    0.2                    20.40
Susquehanna        44.4            0.7                    0.2                    30.50
Kittery           108.3            1.4                    0.3                    24.60
Acton              85.1            0.5                    0.4                    19.60
Finger Lakes       77.1            0.5                    0.6                    25.50
Berkshire         158.7            1.9                    0.4                    21.70
Central           180.4            1.2                    1.0                     6.80
Providence         64.2            0.4                    0.4                    12.60
Nashua             74.6            0.6                    0.5                    31.30
Dunster           143.4            1.3                    0.6                    18.60
Endicott          120.6            1.6                    0.8                    19.90
Five-Towns         69.7            1.0                    0.3                    25.50
Waldeboro          67.8            0.8                    0.2                    27.40
Jackson           106.7            0.6                    0.5                    24.30
Stowe             119.6            1.1                    0.3                    13.70
Predicting Sales of a product based on Multiple Factors
Y: dependent variable – sales of Nature-Bar
k = 3 independent variables:
X1 = advertising expenditures
X2 = promotional expenditures
X3 = competitors' sales
n = 15 observations
Predicting Sales of a product based on Multiple Factors
With our data it comes out that:
Y = 65.705 + 48.979X1 + 59.654X2 − 1.838X3
Based on the above regression, suppose we want to predict sales of Nature-Bar for next year in the Nashua region, given that we plan to spend $0.7 million on advertising and $0.6 million on promotions, and we estimate that competitors' sales will remain flat at their current level of $31.30 million.
Y = 65.705 + 48.979 × 0.7 + 59.654 × 0.6 − 1.838 × 31.30 = $78.253 million
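The prediction above is just the fitted equation evaluated at the planned values; as a sketch:

```python
# Sketch: plugging the planned Nashua values into the fitted equation.
a, b1, b2, b3 = 65.705, 48.979, 59.654, -1.838   # fitted coefficients
x1, x2, x3 = 0.7, 0.6, 31.30   # advertising, promotions, competitors' sales

sales_pred = a + b1 * x1 + b2 * x2 + b3 * x3
print(round(sales_pred, 3))    # predicted sales in $million
```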
Computer output

      Source |       SS       df       MS           Number of obs =     15
-------------+------------------------------       F(  3,    11) =  18.29
       Model |  16997.5351     3  5665.84503       Prob > F      = 0.0001
    Residual |  3407.47258    11  309.770235       R-squared     = 0.8330
-------------+------------------------------       Adj R-squared = 0.7875
       Total |  20405.0077    14  1457.50055       Root MSE      =   17.6

------------------------------------------------------------------------------
       sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 advertising |   48.97876   10.65787    4.60    0.001     25.52096    72.43657
   promotion |   59.65425    23.6247    2.53    0.028     7.656646    111.6519
 competitors |  -1.837632   .8137517   -2.26    0.045    -3.628687   -.0465762
       _cons |   65.70461   27.73107    2.37    0.037     4.668938    126.7403
------------------------------------------------------------------------------
Validation
Linearity
Normality of the residuals
Heteroscedasticity
Autocorrelation
Linearity
The dependent variable Y depends linearly on the values of the independent variables.
When k = 1, check this with a scatter plot.
When k > 1, rely on common sense; check the value of R², but, as discussed before, with caution.
If there is a problem with linearity, you might need to add a quadratic term, for example, or transform both the dependent and independent variables.
Normality
The linear regression model Y = α + β1X1 + β2X2 + β3X3 + e assumes that e ~ N(0, σ²). To check this, plot a histogram of the regression residuals:
ei = yi − ŷi = yi − (a + b1x1i + b2x2i + b3x3i)
[Figure: histogram of the residuals (Frequency vs Residuals)]
If there is evidence of non-normality, you might need to transform your variables, usually the dependent.
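Each residual entering the histogram is ei = yi − ŷi. As a sketch, here is the residual for the Selkirk region under the fitted Nature-Bar equation:

```python
# Sketch: computing a single residual e_i = y_i - yhat_i for the Selkirk
# region, using the fitted Nature-Bar equation from earlier.
a, b1, b2, b3 = 65.705, 48.979, 59.654, -1.838
y_selkirk = 101.8                    # observed sales ($million)
x1, x2, x3 = 1.3, 0.2, 20.40         # Selkirk's predictor values

yhat = a + b1 * x1 + b2 * x2 + b3 * x3
residual = y_selkirk - yhat
print(round(residual, 2))
```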
Heteroscedasticity
The linear regression model Yi = α + β1x1i + β2x2i + β3x3i + ei assumes that the variance of Yi is constant (i.e., σ²). This property is called homoscedasticity.
Plot the residuals versus the independent variables, or versus the fitted values Ŷi, and check that there is no pattern.
If there is a pattern, you need to transform your dependent variable.
Heteroscedasticity
[Figure: residuals vs advertising_expenditures]
Autocorrelation
The linear regression model Yi = α + β1x1i + β2x2i + β3x3i + ei assumes that ei ~ N(0, σ²), with the ei independent. The phenomenon of autocorrelation can occur if the assumption of independence is violated.
Suppose that the regression model is specified with a time component (e.g., data for the last 14 weeks).
Plot the residuals in the time order of the observations and see if there is any kind of pattern.
If there is such a pattern, then incorporate time as one of the independent variables.
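One common numerical companion to the time-order plot (not covered in these slides) is the Durbin-Watson statistic, which is near 2 when there is no first-order autocorrelation and near 0 or 4 when there is positive or negative autocorrelation. A sketch on made-up residuals that drift smoothly over time:

```python
# Sketch: the Durbin-Watson statistic on hypothetical residuals listed
# in time order (14 "weekly" residuals, made up for illustration).
resid = [3.1, 2.4, 1.8, 0.9, -0.2, -1.1, -1.9, -2.5, -1.7, -0.6,
         0.5, 1.4, 2.2, 2.9]

dw = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid))) \
     / sum(e ** 2 for e in resid)
print(round(dw, 3))   # well below 2: a smooth, trending pattern
```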
Autocorrelation
[Figure: residuals vs observation_number]
Warnings and Issues
1. Overspecification, by the addition of too many independent variables.
Use only the independent variables that make sense. It is tempting to add more, since R² cannot decrease when variables are added, but the simpler your model the better.
A rule of thumb: n ≥ 5(k+2).
Use stepwise multiple regression: start from the null model and add the "best" variable at each step, until R² is quite large or its increase is too small.
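The forward-stepwise idea can be sketched as follows. The dataset, the 0.005 stopping threshold, and the helper function are all hypothetical choices for illustration; x3 is pure noise, so selection should stop after picking x1 and x2:

```python
# Sketch: forward stepwise selection by R-squared gain, on made-up data.
def ols_r2(Xcols, y):
    """R-squared of a least-squares fit with intercept, via the normal
    equations solved by Gaussian elimination."""
    n = len(y)
    X = [[1.0] + [col[i] for col in Xcols] for i in range(n)]
    p = len(X[0])
    A = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(p)] for r in range(p)]
    c = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for s in range(col, p):
                A[r][s] -= f * A[col][s]
            c[r] -= f * c[col]
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (c[r] - sum(A[r][s] * b[s] for s in range(r + 1, p))) / A[r][r]
    yhat = [sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
    ybar = sum(y) / n
    return 1 - sum((y[i] - yhat[i]) ** 2 for i in range(n)) / \
               sum((yi - ybar) ** 2 for yi in y)

x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x2 = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
x3 = [0.3, -0.1, 0.2, 0.0, -0.3, 0.1, -0.2, 0.3, 0.0, -0.1]  # noise
y  = [3 + 2 * u + v + e for u, v, e in zip(x1, x2, x3)]      # true model

allcols = {"x1": x1, "x2": x2, "x3": x3}
chosen, best_r2 = [], 0.0
remaining = list(allcols)
while remaining:
    scores = {name: ols_r2([allcols[c] for c in chosen + [name]], y)
              for name in remaining}
    name = max(scores, key=scores.get)
    if scores[name] - best_r2 < 0.005:   # stop when the gain is too small
        break
    chosen.append(name)
    remaining.remove(name)
    best_r2 = scores[name]

print(chosen, round(best_r2, 3))
```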
Warnings and Issues
2. Extrapolating beyond the Range of the Data.
Y = 65.705 + 48.979X1 + 59.654X2 − 1.838X3
Notice that all of the advertising expenditures (X1) for the regions in the table are between $0.4 and $1.9 million. The regression model is valid in this range. Thus it would be unwise to use the model to predict sales if we were to spend $10 million on advertising.
Warnings and Issues
3. Multicollinearity.
Two independent variables are highly correlated. Suspect it if R² is high but one or more of the variables does not pass the significance test. Check all correlations before running the regression. If multicollinearity occurs, drop one of the independent variables that is highly correlated with another one.
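A minimal sketch of such a pre-regression check, computing Pearson's r between two made-up candidate predictors where one is roughly a multiple of the other:

```python
# Sketch: checking pairwise correlation between two candidate predictors
# before fitting. The data are made up: x2 is roughly 2 * x1, so the two
# carry almost the same information.
from math import sqrt

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.1, 3.9, 6.2, 8.0, 10.1, 11.8, 14.2, 15.9]

def pearson(u, v):
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / sqrt(suu * svv)

r = pearson(x1, x2)
print(round(r, 3))   # close to 1: keep only one of the two predictors
```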
Multicollinearity
Table: Undergraduate grade point average (GPA), GMAT score, and graduate school GPA for 25 MBA students

Student  Undergraduate GPA  GMAT  Graduate School GPA
 1       3.9                640   4.0
 2       3.9                644   4.0
 3       3.1                557   3.1
 4       3.2                550   3.1
 5       3.0                547   3.0
 6       3.5                589   3.5
 7       3.0                533   3.1
 8       3.5                600   3.5
 9       3.2                630   3.1
10       3.2                548   3.2
11       3.2                600   3.8
12       3.7                633   4.1
13       3.9                546   2.9
14       3.0                602   3.7
15       3.7                614   3.8
16       3.8                644   3.9
17       3.9                634   3.6
18       3.7                572   3.1
19       3.0                570   3.3
20       3.2                656   4.0
21       3.9                574   3.1
22       3.1                636   3.7
23       3.7                635   3.7
24       4.0                654   3.9
25       3.8                633   3.8
Multicollinearity
Graduate GPA = 0.09540 + 1.13 (Under. GPA) − 0.0088 (GMAT),   R² = 0.960
(the GMAT coefficient is not significant)
Corr(Under. GPA, GMAT) = 0.895
Graduate GPA = −0.1287 + 1.0413 (Under. GPA),   R² = 0.958
(the Under. GPA coefficient is significant)
Outliers
Observations that lie outside the overall pattern of the other observations.
Observations with large residuals.
Observations falling far from the regression line while not following the pattern of the relationship apparent in the others.
Outliers
Outliers can distort the regression results, so many scientists remove them to obtain a better fit. But be CAREFUL! Remove an outlier only if you are sure that it is a bad data point. Transforming the data is one way to soften the impact of outliers, since the most commonly used transformations, square roots and logarithms, shrink larger values to a much greater extent than they shrink smaller values.
Outliers should be investigated carefully. Often they contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether similar values are likely to continue to appear. Of course, outliers are often bad data points.
Other Types of Regression
Non-linear (e.g., add a quadratic term).
Other Types of Regression
Logistic Regression. The dependent variable Y is binary (common in medical research).
Poisson Regression. The dependent variable Y is a count.
Dummy Variables
We would like to use linear regression to predict the effect that a particular phenomenon has on the value of the dependent variable, where the phenomenon in question either takes place or not.
Dummy Variables
Table: Annual repair costs for 19 vehicles at an automobile dealership

Vehicle  Age of Vehicle (Years)  Automatic Transmission (Yes=1, No=0)  Annual Repair Costs ($)
 1       3                       1                                      956
 2       4                       0                                      839
 3       6                       0                                     1257
 4       5                       1                                     1225
 5       4                       1                                     1288
 6       2                       1                                      728
 7       4                       0                                      961
 8       8                       1                                     1588
 9       7                       0                                     1524
10       4                       0                                      875
11       3                       1                                      999
12       5                       1                                     1295
13       3                       0                                      884
14       2                       1                                      789
15       4                       0                                      785
16       3                       1                                      923
17       4                       1                                     1223
18       9                       0                                     1770
19       2                       1                                      692
Dummy Variables
Repair Cost = α + β1X1 + β2X2 + e,   where e ~ N(0, σ²)
X1: age of the vehicle
X2: dummy variable (X2 = 1 or 0 depending on whether or not the vehicle has an automatic transmission)
R² = 0.913

             Coeff.    St. Err.
Intercept   288.133    72.332
Age         160.730    12.424
Automatic   176.964    48.335

Repair Cost = 288.133 + 160.730 X1 + 176.964 X2
The coefficient 176.964 is the estimate of the additional annual repair cost if the vehicle has an automatic transmission.
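A sketch of a dummy-variable fit on constructed data, where the outcome is built exactly as cost = 300 + 150·age + 180·automatic, so least squares must recover these coefficients (the numbers are hypothetical, not the dealership data):

```python
# Sketch: a regression with one dummy variable, on constructed data where
# the true relationship is exact: cost = 300 + 150*age + 180*automatic.
rows = [  # (age, automatic, cost)
    (2, 0, 600), (3, 0, 750), (4, 0, 900), (5, 0, 1050),
    (2, 1, 780), (3, 1, 930), (4, 1, 1080), (6, 1, 1380),
]
y = [row[2] for row in rows]
X = [[1.0, row[0], row[1]] for row in rows]   # intercept, age, dummy
n, p = len(y), 3

# normal equations, solved by Gaussian elimination with partial pivoting
A = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(p)] for r in range(p)]
c = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
for col in range(p):
    piv = max(range(col, p), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    c[col], c[piv] = c[piv], c[col]
    for r in range(col + 1, p):
        f = A[r][col] / A[col][col]
        for s in range(col, p):
            A[r][s] -= f * A[col][s]
        c[r] -= f * c[col]
b = [0.0] * p
for r in range(p - 1, -1, -1):
    b[r] = (c[r] - sum(A[r][s] * b[s] for s in range(r + 1, p))) / A[r][r]

print([round(v, 3) for v in b])   # intercept, age effect, automatic effect
```

The fitted dummy coefficient (180 here) is read exactly as in the slide: the additional cost associated with an automatic transmission, at any fixed age.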
Dummy Variables
Suppose we have a categorical variable C with k categories. To represent that variable in a multiple regression model we construct k−1 dummy variables of the form
X1 = 1 if the subject is in category 2, 0 otherwise
X2 = 1 if the subject is in category 3, 0 otherwise
…
Xk−1 = 1 if the subject is in category k, 0 otherwise
The category omitted (category 1) is referred to as the reference group. It is arbitrary which group is assigned to be the reference group.
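A minimal sketch of this coding scheme, with a hypothetical k = 4 category variable and category "A" as the reference group:

```python
# Sketch: building k-1 dummy variables for a categorical variable with
# k = 4 categories, using category "A" as the reference group.
categories = ["A", "B", "C", "D"]
reference = "A"
levels = [c for c in categories if c != reference]   # B, C, D -> X1, X2, X3

def encode(value):
    """Return the k-1 dummy values for one subject."""
    return [1 if value == lvl else 0 for lvl in levels]

subjects = ["A", "C", "B", "D", "A"]
dummies = [encode(s) for s in subjects]
print(dummies)   # reference-group subjects get all zeros
```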
Dummy Variables
To relate the categorical variable C to an outcome Y, we use the multiple regression model Y = α + β1X1 + … + βk−1Xk−1 + e.
How can we compare categories from this model? From the above equation, the average Y for subjects in category 2 is α + β1. Thus β1 represents the difference between the average value of Y for subjects in category 2 and the average value of Y for subjects in the reference category. Similarly, βj represents the difference between the average value of Y for subjects in category (j+1) and the average value of Y for subjects in the reference category.
A fixed-effects one-way ANOVA model can be represented by a multiple linear regression model based on a dummy-variable specification for the grouping variable.
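A sketch of this equivalence on made-up data: with dummy coding of the grouping variable, the least-squares solution is exactly the group means, so α is the reference-group mean and each βj is a difference of group means:

```python
# Sketch: one-way ANOVA as dummy-variable regression. For a saturated
# dummy coding, OLS fits each group's mean exactly, so the coefficients
# can be read off directly from the group means. Data are made up.
groups = {
    "g1": [10.0, 12.0, 11.0, 13.0],   # reference group
    "g2": [15.0, 17.0, 16.0],
    "g3": [20.0, 22.0, 21.0, 23.0],
}
means = {g: sum(v) / len(v) for g, v in groups.items()}

alpha = means["g1"]                            # intercept = reference mean
beta = {g: means[g] - alpha for g in ("g2", "g3")}   # mean differences
print(alpha, beta)
```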