CHAPTER 7: CORRELATION AND REGRESSION
Prof. Alan Wan


Table of contents

1. Introduction
   1.1 Linear Association

2. Covariance and Correlation
   2.1 Covariance
   2.2 Correlation

3. Regression
   3.1 Simple Linear Regression
   3.2 Multiple Linear Regression

Introduction

▶ In this chapter, we are concerned with two related problems: i) first, that of establishing whether or not two variables are related, and then, if they are, ii) using that relationship to forecast one variable for given values of the other.

▶ Such forecasts or predictions are useful because there are many situations where values of one variable are known or can easily be obtained while the other variable is unknown.

▶ For example, when the price of a product is changed, the new price is known but sales for future periods are not.

▶ Correlation coefficients are used to solve problem i), while regression can be used to cope with problem ii).

▶ The typical situation that we will look at relates to two variables measuring different characteristics of some population of individuals (although we will also touch on the case of three or more variables near the end of this chapter).

▶ For the two-variable case, the available information will usually consist of paired sample data, corresponding to pairs of observations on the two variables for n members of a sample taken from the population.

Linear association

▶ If two variables are related, then the nature of the relationship may be indicated by plotting paired samples of observations for both variables on a scatter diagram.

▶ For example, consider the following data for variables X = height (in cm) and Y = IQ from a sample of 10 students:

X   170  185  165  140  180  150  200  160  175  170
Y   120  130  140  135  100  115  130  125  145  110

▶ The scatter diagram for these data is as follows (where each dot represents a student):

[Scatter diagram: height (135–205) on the horizontal axis against IQ (90–150) on the vertical axis.]

▶ This diagram indicates no obvious relationship between X and Y, as you might well expect, since there is no known relationship between height and IQ.

▶ However, consider the scatter diagram between height X and weight Z (in kg) for the same ten students:

X   170  185  165  140  180  150  200  160  175  170
Z    75   80   75   50   70   60  100   65   80   70

▶ There is a clear tendency for small values of X to be associated with small values of Z, and large X with large Z.

▶ The scatter diagram is as follows:

[Scatter diagram: height (135–205) on the horizontal axis against weight (40–110) on the vertical axis; the dots rise from lower left to upper right.]

▶ The dots on the scatter diagram lie "close to" a straight line with a positive slope. We say that these two variables, height and weight, have a positive linear association.

▶ For a different kind of relationship, consider the data from 10 firms on the variables S = amount spent by the firm per employee on job safety education ($000) and T = number of days lost through industrial accidents over one year.

S   5  0  5  1  4  8  7  2  6  3
T   3  6  5  8  2  1  2  4  3  8

▶ The prevailing tendency appears to be for small values of S to be associated with large values of T, and vice versa. The dots lie "close to" a straight line with a negative slope. We say that the variables exhibit a negative linear association.

▶ The scatter diagram looks like this:

[Scatter diagram: amount spent (0–10) on the horizontal axis against days lost (0–10) on the vertical axis; the dots fall from upper left to lower right.]

▶ If two variables are linearly associated, then it is quite easy, given paired sample data, to fit a straight line to the scatter diagram and use this to forecast values of one variable given values of the other. This is what we do in linear regression.

Covariance

▶ Question: So how do we measure the degree of linear association between two variables X and Y?

▶ Answer: The covariance is the quantity that measures this linear association.

▶ Definition: The population covariance, denoted by σXY = ∑_{i=1}^{N} (Xi − µX)(Yi − µY)/N, measures the average value of (X − µX)(Y − µY) in the population.

▶ Definition: The sample covariance, denoted by sXY = ∑_{i=1}^{n} (Xi − X̄)(Yi − Ȳ)/(n − 1), is an estimator of σXY based on n pairs of sample values.
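As a minimal sketch of the two formulas (in Python, which these slides do not themselves use; the function names are ours, and the data are those of Example 1 below):

```python
import numpy as np

def population_covariance(x, y):
    # sigma_XY = sum((X_i - mu_X)(Y_i - mu_Y)) / N: divide by N
    x, y = np.asarray(x, float), np.asarray(y, float)
    return ((x - x.mean()) * (y - y.mean())).sum() / len(x)

def sample_covariance(x, y):
    # s_XY = sum((X_i - Xbar)(Y_i - Ybar)) / (n - 1): divide by n - 1
    x, y = np.asarray(x, float), np.asarray(y, float)
    return ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

X = [0, 2, 3, 4, 6]                # data of Example 1 below
Y = [1, 4, 8, 12, 20]
print(sample_covariance(X, Y))     # 16.25, matching Example 1
print(np.cov(X, Y, ddof=1)[0, 1])  # NumPy's ddof=1 covariance gives the same value
```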

▶ Clearly, the amounts (Xi − µX) and (Yi − µY) measure the amounts by which X and Y are above or below their averages. If Xi and Yi are both above or both below their respective averages, then the cross-product term (Xi − µX)(Yi − µY) will be positive; if this is generally true for the X and Y values in the population, then the covariance σXY is positive. In this case, we say that X and Y are positively correlated.

▶ Conversely, if X is above its average when Y is below its average, or vice versa, then the cross-product term (Xi − µX)(Yi − µY) will be negative; if this is generally true for the X and Y values in the population, then the covariance σXY is negative. In this case, we say that X and Y are negatively correlated.

▶ The same interpretations also apply to the sample covariance sXY.

▶ The cross-product terms will be positive in quadrants I and III, and negative in quadrants II and IV. With positive linear association there is a tendency for the dots to lie predominantly in quadrants I and III, as illustrated below:

[Quadrant diagram: the plane is divided at X̄ and Ȳ into quadrants I–IV; the dots lie mainly in quadrants I and III.]

▶ On the other hand, with negative linear association there is a tendency for the dots to lie predominantly in quadrants II and IV:

[Quadrant diagram: the dots lie mainly in quadrants II and IV.]

▶ If there is no, or only very weak, linear association then there is a tendency for the dots to scatter across all four quadrants:

[Quadrant diagram: the dots are spread across all four quadrants.]

▶ The covariance only measures linear association: a covariance of zero does not necessarily imply that X and Y have no association, because they may be related in a non-linear way; for example,

[Quadrant diagram: the dots follow a non-linear pattern that balances across the four quadrants.]

▶ Example 1: Calculate sXY for this sample data:

X   0  2  3  4  6
Y   1  4  8  12  20

 X   X − X̄    Y   Y − Ȳ   (X − X̄)(Y − Ȳ)
 0     −3     1    −8          24
 2     −1     4    −5           5
 3      0     8    −1           0
 4      1    12     3           3
 6      3    20    11          33
15      0    45     0          65

▶ X̄ = 15/5 = 3, Ȳ = 45/5 = 9, sXY = 65/4 = 16.25

▶ Example 2: Calculate sXY for this sample data:

X   0  2  3  4  6
Y   8  5  2  −2  −3

 X   X − X̄    Y   Y − Ȳ   (X − X̄)(Y − Ȳ)
 0     −3     8     6         −18
 2     −1     5     3          −3
 3      0     2     0           0
 4      1    −2    −4          −4
 6      3    −3    −5         −15
15      0    10     0         −40

▶ X̄ = 15/5 = 3, Ȳ = 10/5 = 2, sXY = −40/4 = −10

▶ Example 3: Suppose that the values of X and Y in Example 2 are measured in dollars. Now, let's change to a smaller currency unit so that every value is multiplied by 10:

X    0   20  30  40  60
Y   80   50  20  −20  −30

  X   X − X̄    Y   Y − Ȳ   (X − X̄)(Y − Ȳ)
  0    −30    80    60        −1800
 20    −10    50    30         −300
 30      0    20     0            0
 40     10   −20   −40         −400
 60     30   −30   −50        −1500
150      0   100     0        −4000

▶ X̄ = 150/5 = 30, Ȳ = 100/5 = 20, sXY = −4000/4 = −1000

▶ The sample covariance is inflated by a factor of 100, although the relationship between X and Y is really the same as in the last example.
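A quick Python check of this scale effect (variable names are ours; covariance is bilinear, so rescaling both variables by 10 inflates it by 10 × 10 = 100):

```python
import numpy as np

X = np.array([0, 2, 3, 4, 6], float)   # Example 2 data
Y = np.array([8, 5, 2, -2, -3], float)

print(np.cov(X, Y, ddof=1)[0, 1])            # -10.0, as in Example 2
print(np.cov(10 * X, 10 * Y, ddof=1)[0, 1])  # -1000.0, inflated by a factor of 100
```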

Correlation

▶ One problem with the covariance is that it depends on the units used to measure X and Y, and thus usually cannot be directly compared across different pairs of variables. This scaling problem affects the magnitude (but not the sign) of the linear association between X and Y as reported by the covariance.

▶ The correlation coefficient is a measure of the linear association between two variables that is not affected by the variables' units of measure. It adjusts the covariance by the standard deviations of X and Y so that the resulting measure is unit-free.

▶ The correlation coefficient is a "standardised score" of the covariance.

▶ Definition: The population correlation coefficient is defined as

ρXY = ∑_{i=1}^{N} (Xi − µX)(Yi − µY) / √(∑_{i=1}^{N} (Xi − µX)² · ∑_{i=1}^{N} (Yi − µY)²) = σXY/(σX σY)

▶ Definition: The sample correlation coefficient is defined as

rXY = ∑_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √(∑_{i=1}^{n} (Xi − X̄)² · ∑_{i=1}^{n} (Yi − Ȳ)²) = sXY/(sX sY)

▶ Notice that the denominator of ρXY is always non-negative, as it is the product of two standard deviation measures. Hence the sign of ρXY is the same as that of σXY. Similarly, the sign of rXY is the same as that of sXY.
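A minimal Python sketch of the sample formula, checked against NumPy's built-in (the helper name is ours; the data are those of Example 1, for which Example 4 below obtains rXY = 0.9799):

```python
import numpy as np

def sample_correlation(x, y):
    # r_XY = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)) = s_XY / (s_X * s_Y)
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

X = [0, 2, 3, 4, 6]
Y = [1, 4, 8, 12, 20]
print(sample_correlation(X, Y))   # 0.9799...
print(np.corrcoef(X, Y)[0, 1])    # same value from NumPy
```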

▶ It can be shown that it is always the case that −1 ≤ ρXY ≤ 1 and −1 ≤ rXY ≤ 1.

▶ Three special values of ρXY and rXY are of interest:

1. When ρXY = 0 (rXY = 0), X and Y are not linearly related, and we say that X and Y are uncorrelated in the population (sample);
2. When all population (sample) values of X and Y lie exactly on a straight line having a positive slope (e.g., Y = a + bX, with b > 0), then ρXY = 1 (rXY = 1);
3. When all population (sample) values of X and Y lie exactly on a straight line having a negative slope (e.g., Y = a + bX, with b < 0), then ρXY = −1 (rXY = −1).

▶ If the population (sample) values of X and Y lie close to a straight line, then ρXY (rXY) will be close to 1 or −1.

▶ The following interpretations may be applied to ρXY:

1. |ρXY| ≥ 0.9, i.e., ρXY ≤ −0.9 or ρXY ≥ 0.9: this indicates a very strong linear association between the two variables in the population.
2. |ρXY| ≤ 0.2, i.e., −0.2 ≤ ρXY ≤ 0.2: this indicates a weak linear association between the two variables in the population.
3. 0.2 ≤ |ρXY| ≤ 0.9, i.e., −0.9 ≤ ρXY ≤ −0.2 or 0.2 ≤ ρXY ≤ 0.9: there is a linear association between X and Y of moderate to high strength in the population.

▶ The same interpretations can be applied to rXY by replacing "ρXY" with "rXY" and "population" with "sample" in statements 1, 2 and 3 above.
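These verbal categories are mechanical enough to encode; a small Python sketch (the helper is ours, and since the slides' ranges overlap at 0.2 and 0.9 we assign those boundary values to the outer classes):

```python
def describe_strength(r):
    """Translate |r| into the chapter's verbal categories."""
    a = abs(r)
    if not 0.0 <= a <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if a >= 0.9:
        return "very strong linear association"
    if a <= 0.2:
        return "weak linear association"
    return "linear association of moderate to high strength"

print(describe_strength(0.9799))    # very strong (the r of Example 4 below)
print(describe_strength(-0.0825))   # weak (the r of Example 7 below)
```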

▶ Here are some diagrams illustrating different values of rXY:

[Six scatter diagrams, labelled r = 0, r = 0.5, r = 0.8, r = 1, r = −0.8 and r = 0.]

▶ Example 4: Consider the data observations in Example 1 again. Calculate rXY for this sample data:

X   0  2  3  4  6
Y   1  4  8  12  20

 X   X − X̄    Y   Y − Ȳ   (X − X̄)(Y − Ȳ)   (X − X̄)²   (Y − Ȳ)²
 0     −3     1    −8          24              9          64
 2     −1     4    −5           5              1          25
 3      0     8    −1           0              0           1
 4      1    12     3           3              1           9
 6      3    20    11          33              9         121
15      0    45     0          65             20         220

▶ X̄ = 15/5 = 3, Ȳ = 45/5 = 9, sXY = 65/4 = 16.25, rXY = 65/√(20 × 220) = 0.9799

▶ Example 5: Consider the data observations in Example 2 again. Calculate rXY for this sample data:

X   0  2  3  4  6
Y   8  5  2  −2  −3

 X   X − X̄    Y   Y − Ȳ   (X − X̄)(Y − Ȳ)   (X − X̄)²   (Y − Ȳ)²
 0     −3     8     6         −18              9          36
 2     −1     5     3          −3              1           9
 3      0     2     0           0              0           0
 4      1    −2    −4          −4              1          16
 6      3    −3    −5         −15              9          25
15      0    10     0         −40             20          86

▶ X̄ = 15/5 = 3, Ȳ = 10/5 = 2, sXY = −40/4 = −10, rXY = −40/√(20 × 86) = −0.9645

▶ Example 6: Consider the data observations in Example 3, which are the data in Example 2 (and Example 5) multiplied by a factor of 10:

X    0   20  30  40  60
Y   80   50  20  −20  −30

  X   X − X̄    Y   Y − Ȳ   (X − X̄)(Y − Ȳ)   (X − X̄)²   (Y − Ȳ)²
  0    −30    80    60        −1800           900        3600
 20    −10    50    30         −300           100         900
 30      0    20     0            0             0           0
 40     10   −20   −40         −400           100        1600
 60     30   −30   −50        −1500           900        2500
150      0   100     0        −4000          2000        8600

▶ X̄ = 150/5 = 30, Ȳ = 100/5 = 20, sXY = −4000/4 = −1000, rXY = −4000/√(2000 × 8600) = −0.9645 (i.e., rXY remains unchanged although sXY has been inflated by a factor of 100).

▶ Example 7: Calculate rXY for this sample data:

X      0    20   30    40   60
Y   1000   −90  230  −200  900

  X   X − X̄     Y    Y − Ȳ   (X − X̄)(Y − Ȳ)   (X − X̄)²   (Y − Ȳ)²
  0    −30   1000    632       −18960           900      399424
 20    −10    −90   −458         4580           100      209764
 30      0    230   −138            0             0       19044
 40     10   −200   −568        −5680           100      322624
 60     30    900    532        15960           900      283024
150      0   1840      0        −4100          2000     1233880

▶ X̄ = 150/5 = 30, Ȳ = 1840/5 = 368, sXY = −4100/4 = −1025, rXY = −4100/√(2000 × 1233880) = −0.0825.

▶ Example 8: Calculate rXY for this sample data:

X   0  2  3  4  6
Y   5  9  11  13  17

 X   X − X̄    Y   Y − Ȳ   (X − X̄)(Y − Ȳ)   (X − X̄)²   (Y − Ȳ)²
 0     −3     5    −6          18              9          36
 2     −1     9    −2           2              1           4
 3      0    11     0           0              0           0
 4      1    13     2           2              1           4
 6      3    17     6          18              9          36
15      0    55     0          40             20          80

▶ X̄ = 15/5 = 3, Ȳ = 55/5 = 11, sXY = 40/4 = 10, rXY = 40/√(20 × 80) = 1 (i.e., perfect linear association)

▶ Note that Y = 5 + 2X.

▶ Example 9: Calculate rXY for this sample data:

X   0  2  3  4  6
Y   0  −2  −3  −4  −6

 X   X − X̄    Y   Y − Ȳ   (X − X̄)(Y − Ȳ)   (X − X̄)²   (Y − Ȳ)²
 0     −3     0     3          −9              9           9
 2     −1    −2     1          −1              1           1
 3      0    −3     0           0              0           0
 4      1    −4    −1          −1              1           1
 6      3    −6    −3          −9              9           9
15      0   −15     0         −20             20          20

▶ X̄ = 15/5 = 3, Ȳ = −15/5 = −3, sXY = −20/4 = −5, rXY = −20/√(20 × 20) = −1 (i.e., perfect linear association)

▶ Note that Y = −X.

▶ The sample correlation coefficient rXY is just an estimator of ρXY; rXY is meaningful only for what it tells you about the population correlation coefficient, because it is ρXY that measures the true relationship between X and Y, whereas rXY relates only to the particular sample values observed.

▶ More precisely, the value of rXY may be used to test hypotheses about the population correlation coefficient ρXY. If X and Y have a joint normal distribution, then the following statistic t may be used to test the hypothesis

H0 : ρXY = 0 vs. H1 : ρXY ≠ 0

t = rXY √(n − 2) / √(1 − r²XY) ∼ t_{n−2}

▶ Example 10: Suppose that based on n = 20 pairs of observations on X and Y we obtain rXY = 0.6 and want to test

H0 : ρXY = 0 vs. H1 : ρXY ≠ 0 at α = 0.05.

Now, t = 0.6 √(20 − 2) / √(1 − 0.6²) = 3.18, and t(0.05/2, 18) = 2.101.

Hence we reject H0 and conclude that there is a significant linear association between X and Y.
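A short Python sketch of this test (the helper name is ours; scipy.stats.t supplies the critical value used in Example 10):

```python
import numpy as np
from scipy import stats

def correlation_t_test(r, n, alpha=0.05):
    # Two-sided test of H0: rho_XY = 0 against H1: rho_XY != 0
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)   # t = r*sqrt(n-2)/sqrt(1-r^2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # upper alpha/2 critical value of t_{n-2}
    return t, t_crit, abs(t) > t_crit

t, t_crit, reject = correlation_t_test(r=0.6, n=20)
print(round(t, 2), round(t_crit, 3), reject)   # 3.18 2.101 True -> reject H0
```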

Simple Linear Regression

▶ Suppose that a scatter diagram indicates a linear association between two variables, i.e., the points corresponding to paired observations lie close to a straight line. Then it would make sense to use the line to predict one variable from given values of the other. But how do we know which line to choose?

▶ It is possible to fit a line by "eye", but many possible lines will look equally good.

▶ To single out one line we need to specify a criterion for best fit. Suppose that Y, the variable that we want to predict, is plotted on the vertical axis. The rationale that has been adopted is to choose the line that minimises the sum of squared differences between the observed Y values and the values on the fitted line, i.e., the sum of the squared vertical deviations between that line and the sample points (the dotted lengths in the diagram).

[Diagram: a fitted line through a scatter of points, with dotted vertical segments from each point to the line.]

▶ Suppose that the true relationship between Y and X is

Y = β0 + β1X + u,

where u is an unobserved disturbance term and β0 and β1 are unknown coefficients to be estimated.

▶ Y is called a response or dependent variable, and X is an explanatory variable, as it explains the behaviour of Y.

▶ This model is known as a simple linear regression, as it postulates, first, a linear statistical relationship between Y and X, and second, that X is the only explanatory variable in the study; the model becomes a multiple linear regression (to be discussed ahead) if there are two or more explanatory variables.

▶ It is possible to show using calculus that the values of b0 and b1 minimising the sum of squared vertical deviations for the sample pairs (X1, Y1), ..., (Xn, Yn) are given by

b0 = Ȳ − b1X̄ and b1 = ∑_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / ∑_{i=1}^{n} (Xi − X̄)²,

respectively.

▶ This is known as the method of least squares.
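A minimal Python sketch of these formulas, checked against NumPy's least-squares fit (the function name is ours; the data are those of Example 12 below, for which the slides report b1 = −1.09 and b0 = 12.45 after rounding):

```python
import numpy as np

def least_squares(x, y):
    # b1 = sum(dx * dy) / sum(dx^2);  b0 = Ybar - b1 * Xbar
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx = x - x.mean()
    b1 = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

X = [2, 5, 7, 3, 8, 3, 7]    # Example 12 data (below)
Y = [8, 7, 5, 12, 3, 9, 5]
b0, b1 = least_squares(X, Y)
print(round(b0, 2), round(b1, 2))  # 12.44 -1.09 (the slides get 12.45 by rounding b1 first)
print(np.polyfit(X, Y, deg=1))     # [b1, b0] from NumPy: the same fitted line
```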

▶ The use of b0 and b1 rather than β0 and β1 indicates that b0 and b1 are sample estimates of the true values: a different sample would give different estimates.

▶ The slope coefficient estimate b1 is related to the sample correlation coefficient rXY as follows:

b1 = rXY (sY/sX) = rXY √(∑_{i=1}^{n} (Yi − Ȳ)²) / √(∑_{i=1}^{n} (Xi − X̄)²)

▶ Because sX and sY are non-negative, b1 will have the same sign as rXY: the regression line will slope upwards if the observed linear association is positive and downwards if it is negative.

Page 78: CHAPTER 7: CORRELATION AND REGRESSIONpersonal.cb.cityu.edu.hk/msawan/teaching/CB2200/CB2200ch... · 2014-12-29 · 1. Introduction 2. Covariance and Correlation 3. Regression 1.1

1. Introduction2. Covariance and Correlation

3. Regression

3.1 Simple Linear Regression3.2 Multiple Linear Regression

Simple Linear Regression

I Example 12: The following table gives data collected last year for seven employees of a company, where X = number of years of service and Y = number of days taken off work.

X   2   5   7   3   8   3   7
Y   8   7   5  12   3   9   5

I Hence we have X̄ = 35/7 = 5, Ȳ = 49/7 = 7, and

X − X̄            -3    0    2   -2    3   -2    2     Σ = 0
(X − X̄)²          9    0    4    4    9    4    4     Σ = 34
Y − Ȳ             1    0   -2    5   -4    2   -2     Σ = 0
(Y − Ȳ)²          1    0    4   25   16    4    4     Σ = 54
(X − X̄)(Y − Ȳ)   -3    0   -4  -10  -12   -4   -4     Σ = −37


I Therefore,

r_{XY} = \frac{-37}{\sqrt{(34)(54)}} = -0.864,

b_1 = \frac{-37}{34} = -1.09 \quad \left(\text{or } b_1 = -0.864 \times \frac{\sqrt{54}}{\sqrt{34}} = -1.09\right),

b_0 = 7 - (-1.09)(5) = 12.45, and

I the estimated regression line is Ŷ = 12.45 − 1.09X.

I The "hat" on Y represents prediction; Ŷ is the predicted or forecasted value of Y for a given value of X. Clearly, Ŷ ≠ Y in general; Ŷ = Y for all sample values if and only if |rXY| = 1.
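As a quick check (not part of the original slides), the least_squares sketch above reproduces these figures for the Example 12 data:

```python
x = [2, 5, 7, 3, 8, 3, 7]   # years of service
y = [8, 7, 5, 12, 3, 9, 5]  # days taken off work

b0, b1 = least_squares(x, y)
print(b0, b1)       # 12.441... and -1.088..., i.e. 12.45 and -1.09 rounded
print(b0 + b1 * 4)  # predicted days off for 4 years of service: about 8.09
```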


[Scatter diagram: Days Off (Y, 0 to 14) plotted against Years of Service (X, 0 to 10) for the seven employees.]


I The analysis can easily be performed in EXCEL, with output as follows (to be discussed in tutorials):

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.863506052
R Square            0.745642702
Adjusted R Square   0.694771242
Standard Error      1.65742536
Observations        7

               Coefficients    Standard Error   t Stat    P-value    Lower 95%     Upper 95%
Intercept      12.44117647     1.553168751      8.01019   0.00049    8.44862909    16.43372385
X Variable 1   -1.088235294    0.284246104     -3.8285    0.012266   -1.81891317   -0.357557422


I Suppose we want to predict the number of days off work this year for employees with 5, 6, 8, 0 and 14 years of service. All we have to do is substitute the given X values into the estimated regression equation:

Ŷ = 12.45 − 1.09(5) = 7 days off work
Ŷ = 12.45 − 1.09(6) = 5.91 (i.e., about 6) days off work
Ŷ = 12.45 − 1.09(8) = 3.73 (i.e., about 4) days off work
Ŷ = 12.45 − 1.09(0) = 12.45 (i.e., about 12) days off work
Ŷ = 12.45 − 1.09(14) = −2.81 days off work (What???)


I Interpreting b0: note that from the prediction of Y for X = 0, we see that b0 = 12.45 is the predicted number of days off for an employee with 0 years of service (i.e., a new employee).

I Interpreting b1: subtracting the prediction for X = 5 (i.e., Ŷ = 7) from the prediction for X = 6 (i.e., Ŷ = 5.91) gives b1 = −1.09; thus b1 is the change in the estimated number of days off for an additional year's service.

I In general, b0 is the predicted value of Y for X = 0, while b1 is the marginal change in Ŷ for a one-unit increase in X.


I Finally, note that the prediction of −2.81 days off work for X = 14 years of service does not make sense: what meaning can we attach to a negative number of days off work?

I The scatter diagram may suggest the reason for this peculiar value: the relationship between X and Y is approximately linear over the range covered by the sample, but the regression line cannot be extended indefinitely without cutting the X-axis. Once we go beyond the sample range, the relationship may cease to be approximately linear.


I Example 13: The file HKpopulation.xlsx contains yearly data on Hong Kong's population from 1960 to 2013, totalling 54 observations. Let Y denote the population figure and X = 1, 2, 3, ... denote the sequence of time, with X = 1 representing the year 1960, X = 2 representing 1961, etc.

I An excerpt of the data is as follows:

X      1         2         10        39        54
Year   1960      1961      1969      1998      2013
Y      2619983   2735407   3402250   6510390   7259603


I A plot of Y versus X reveals the following:

[Time-series plot: Population (Y) against Time (X), rising from about 2.6 million to about 7.3 million over 1960–2013.]

I The association between X and Y appears to be approximately linear. It therefore makes sense to write Y = β0 + β1X + u.


I Using the aforementioned least squares method, we can compute the following estimates b0 and b1 (the computation can be done automatically in EXCEL):

b0 = 2597679.706 and b1 = 93778.51912

SUMMARY OUTPUT

Regression Statistics
Multiple R       0.992888808
R Square         0.985828185
Standard Error   178582.6876
Observations     54

               Coefficients   Standard Error   t Stat     P-value
Intercept      2597679.706    49287.04567      52.70512   8.24E-47
X Variable 1   93778.51912    1559.243055      60.14362   9.59E-50
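Outside EXCEL, the same fit can be obtained with numpy. This is a sketch under stated assumptions: the full series from HKpopulation.xlsx is not reproduced on these slides, so the code generates a synthetic stand-in with the same trend purely to show the call.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 55)  # time index: 1 = 1960, ..., 54 = 2013
# Synthetic stand-in for the 54 population figures; in practice, load the
# actual column from HKpopulation.xlsx instead.
y = 2597679.706 + 93778.51912 * x + rng.normal(0, 178582.7, size=x.size)

b1, b0 = np.polyfit(x, y, deg=1)  # np.polyfit returns the slope first
print(b0, b1)  # close to 2597679.706 and 93778.51912 on the real data
```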


I So, b0 = 2597679.706 is the predicted Hong Kong population for the year 1959 (X = 0) and b1 = 93778.51912 is the predicted average annual increment in population.

I Now, the in-sample predictions for the years 1998 (X = 39) and 2013 (X = 54) are

Ŷ39 = 2597679.706 + 93778.51912(39) = 6255042, and
Ŷ54 = 2597679.706 + 93778.51912(54) = 7661720,

respectively.


I But Y39 = 6510390 and Y54 = 7259603, so the forecast errors (denoted by e = Y − Ŷ) are:

e39 = 6510390 − 6255042 = 255348 and
e54 = 7259603 − 7661720 = −402117

I So, our model under-estimates the population in 1998 but over-estimates the population in 2013. This does not mean our model is bad, as a regression can never make a precise prediction without errors unless the linear association is perfect.

I In fact, the R² from the regression output has a value of 0.9858, indicating that the estimated regression captures 98.58% of the variation in Y in the sample.


I R² is known as the coefficient of determination. It measures the goodness of fit of the regression model in terms of the proportion of variation in Y explained by the regression, and is defined as:

R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}

I It can be shown that 0 ≤ R² ≤ 1. Values of R² close to 1 indicate that the model fits the data well; conversely, values of R² close to 0 indicate poor fit.
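In code, R² is one minus the residual sum of squares over the total sum of squares. A minimal sketch, reusing the least_squares helper from earlier (our own construction, not from the slides):

```python
def r_squared(x, y, b0, b1):
    """R^2 = 1 - SSE / SST for a fitted simple linear regression."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual variation
    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total variation
    return 1 - sse / sst

# For the Example 12 data this gives roughly 0.7456, matching the EXCEL output.
```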


I Thus, in our population forecast example, the regression model accounts for 98.58% of the variation in Y; the remaining 1.42% of the variation that cannot be captured by the regression is reflected in the error term e.

I It can be shown that the residuals always sum to zero (Σ ei = 0), so the total amounts of over- and under-estimation across the sample values always offset each other.

I In a regression model containing only one X variable, R² = rXY². Hence in our population forecast example, the sample correlation coefficient between X and Y is rXY = +√0.9858 = 0.9929. We know rXY has a positive sign because b1 is positive; rXY would have a negative sign if b1 were negative.


I Suppose we want to forecast the future Hong Kong population for 2014–2020. Note that 2014 corresponds to X = 55, and so on. The forecasts are therefore:

2014: Ŷ55 = 2597679.706 + 93778.51912(55) = 7755498
2015: Ŷ56 = 2597679.706 + 93778.51912(56) = 7849277
2016: Ŷ57 = 2597679.706 + 93778.51912(57) = 7943055
2017: Ŷ58 = 2597679.706 + 93778.51912(58) = 8036834
2018: Ŷ59 = 2597679.706 + 93778.51912(59) = 8130612
2019: Ŷ60 = 2597679.706 + 93778.51912(60) = 8224391
2020: Ŷ61 = 2597679.706 + 93778.51912(61) = 8318169
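The same arithmetic as a short loop (a convenience sketch, using the estimates above):

```python
b0, b1 = 2597679.706, 93778.51912
for year in range(2014, 2021):
    x = year - 1959  # 2014 corresponds to X = 55
    print(year, round(b0 + b1 * x))
```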


I Of course, the accuracy of these forecasts depends, among other things, on the legitimacy of extending the linear relationship established from the sample values beyond the estimation period.

I These forecasts are called "ex-ante" forecasts, since the actual values of the variable being predicted are unknown at the time of prediction.


I In many studies, the primary objective of regression is to estimate the slope coefficient β1, which indicates the change in Y when X changes by one unit. In our first example, β1 is the change in the number of days off for an additional year's service. Note that β1 is the actual change, which can never be known. It is important to distinguish between β1 and b1: b1 is merely an estimate of β1 and not the same as β1.

I Note that if β1 = 0, then there is really no relationship between Y and X; that is, a change in X does not induce any change in Y. That said, even when β1 = 0, b1 is almost certainly non-zero, as it is merely an estimate of β1.


I A test of whether a relationship between X and Y exists is therefore appropriate. The test takes the form:

H0: β1 = 0 vs. H1: β1 ≠ 0

I The test statistic is

t = \frac{b_1 - 0}{S_{b_1}} \sim t_{n-2},

where Sb1 is the standard error of b1, which measures the variation of b1.

I At α = 0.05, we reject H0 if the p-value of the test is smaller than 0.05, or if |t| > t(α/2, n−2).


I Rejecting H0 is the more desirable outcome, as it suggests β1 ≠ 0, meaning that X does have an impact on the behaviour of Y, and we have chosen a correct explanatory variable.

I If H0 is rejected, we say that β1 differs significantly from zero, or that X is significant.

I Alternatively, if H0 cannot be rejected, we say that β1 does not differ significantly from zero, or that X is insignificant.


I Example 14: Consider the data and the regression of Example 12. Suppose we want to test if years of service is linearly associated with the number of days off work.

I H0: β1 = 0 vs. H1: β1 ≠ 0

I α = 0.05, n = 7, df = 5 and t(0.05/2, 5) = 2.5706 (from table).

I t = (b1 − 0)/Sb1 = (−1.09 − 0)/0.2842 = −3.835

I As |−3.835| > 2.5706, or p-value < 0.05, we REJECT H0 at α = 0.05 and conclude that years of service is linearly related to the number of days off work.
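The test is easy to reproduce in code; a sketch using scipy for the critical value and p-value (the unrounded estimates from the EXCEL output are used):

```python
from scipy import stats

b1, se_b1, n = -1.088235294, 0.284246104, 7
t = (b1 - 0) / se_b1                        # -3.8285
crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)  # 2.5706
p_value = 2 * stats.t.sf(abs(t), df=n - 2)  # about 0.0123
print(abs(t) > crit, p_value < 0.05)        # True True: reject H0
```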


I This test outcome is unsurprising, as R² = 0.7456 is high.

I Testing H0: β1 = 0 at α = 0.05 is equivalent to asking if the 95% confidence interval of β1 contains 0.

I The 95% confidence interval of β1 can be found by

b1 ± t(0.05/2, n−2) Sb1 = −1.09 ± 2.5706 × 0.2842 = [−1.821, −0.359]

I This CI is consistent with the test outcome: given we reject H0: β1 = 0, we do not expect the 95% confidence interval of β1 to contain 0.
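Continuing the snippet above, the interval takes two more lines (again a sketch, with the same unrounded values):

```python
ci_low = b1 - crit * se_b1     # about -1.819
ci_high = b1 + crit * se_b1    # about -0.358
print(ci_low <= 0 <= ci_high)  # False: 0 lies outside the CI, consistent with rejecting H0
```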


I One can conduct hypothesis tests and construct CIs in the same manner for β0, but usually inferences on the intercept are of less interest.

I Also, R² loses its goodness-of-fit meaning if we remove the intercept from the model.

I The usual practice is to always include the intercept, even if it is insignificant.


Multiple Linear Regression

I In many situations, two or more explanatory variables may be included in a regression model to provide an adequate description of the process under study or to yield sufficiently precise inferences.

I For example, a regression model for predicting the demand for a firm's product in different countries may use socioeconomic variables (mean household income, average years of schooling of the head of household), demographic variables (average family size, percentage of retired population), environmental variables (mean daily temperature, pollution index), etc.

I Linear regression models containing two or more explanatory variables are called multiple linear regression models.


I The simple linear regression model discussed earlier can be extended to include more than one explanatory variable:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + u,

where u is an unobserved error term and β0, β1, ..., βk are unknown coefficients to be estimated.

I There are k explanatory variables, namely X1, X2, ..., Xk.

I When k = 1, the multiple linear regression model reduces to the simple linear regression model.


I As in the case of simple linear regression, we use the method of least squares to estimate β0, β1, ..., βk. The estimates are labelled b0, b1, ..., bk.

I The formulae for b0, b1, ..., bk are complicated, and their complexity increases as k increases. Inferences under multiple linear regression are also more complex.

I We will not be concerned with the manual computation of b0, b1, ..., bk here; we will simply use EXCEL to compute these estimates.

I Our discussion will only "touch on" multiple regression through a case study. The technical details of multiple regression will be covered in more advanced courses.


Example 15: A case study on multiple regression

The Grade Point Average (GPA) is commonly adopted by universities as an indication of students' academic performance. There is a common consensus that the Intelligence Quotient (IQ) is a significant factor affecting the GPA of a student. On the other hand, it has been suggested that academic results are also related to the student's lifestyle. In 2012, the Student Development Services at CityU conducted a statistical study to examine the relationship between GPA and a number of potential factors.


I One hundred and twenty first-year Business students took part in this study. These participants were selected randomly. Each took an IQ test and answered questions about his/her study habits and lifestyle. The data are contained in the EXCEL file Students.xlsx and are described briefly as follows:

Category               Explanatory Variable (X)   Description                                                Range
Intelligence           IQ                         Result in the Intelligence Quotient test                   110 – 138
Study load             NoCrs                      Number of credits taken in semester A                      12, 15, 18
Lifestyle              TravelMin                  Average minutes required to travel between home and        0, 20, 40, 60, 80, 100, 120
                                                  CityU ("0" means the student lived in a student residence)
Lifestyle              SleepHr                    Average hours spent on sleeping per day                    6, 8, 10, 12
Lifestyle              StudyHr                    Average hours spent on studying per week                   0, 5, 10, 15, 20, 25, 30, 35, 40
Lifestyle              WorkHr                     Average hours spent on working per week                    0, 5, 10, 15, 20, 25, 30, 35, 40

Category               Dependent Variable (Y)     Description                                                Range
Academic achievement   SGPA                       Grade Point Average in semester A                          1.29 – 4.30
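
Readers who wish to explore the data outside EXCEL can load the workbook with Python's pandas library. The sketch below assumes the file Students.xlsx holds one column per variable, named exactly as in the table above; this is an assumption about the workbook's layout.

```python
import pandas as pd

# Load the study data (assumed layout: one named column per variable).
df = pd.read_excel("Students.xlsx")

# Sanity-check a few variables against the ranges listed in the table.
print(df[["IQ", "NoCrs", "SleepHr", "StudyHr", "WorkHr"]].describe())
print("SGPA range:", df["SGPA"].min(), "to", df["SGPA"].max())
```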

I The multiple regression model is

SGPA = β0 + β1 IQ + β2 NoCrs + β3 TravelMin + β4 SleepHr + β5 StudyHr + β6 WorkHr + u.

I Note that not all six explanatory variables are necessarily significant, meaning that the performance of the regression may be "better off" with some of the explanatory variables on the right-hand side of the equation removed.

I We can test the significance of each explanatory variable using a t-test after estimating the above full model. The results are shown as follows.

Regression Statistics
Multiple R           0.90331975
R Square             0.815986571
Adjusted R Square    0.806215946
Standard Error       0.297349247
Observations         120

            Coefficients    Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   0.913462442     0.680455298      1.34242829     0.182147891   -0.434642229   2.261567113
IQ          0.022195038     0.004903885      4.526010953    1.49509E-05   0.012479557    0.031910519
NoCrs       -0.048683053    0.024812642      -1.962026154   0.052218261   -0.097841372   0.000475266
TravelMin   0.000848208     0.000796475      1.064952445    0.289167731   -0.000729753   0.002426169
SleepHr     0.012080628     0.023950257      0.504404941    0.614958993   -0.035369151   0.059530407
StudyHr     0.031439423     0.004013515      7.833388222    2.82763E-12   0.023487926    0.039390921
WorkHr      -0.02982413     0.003813774      -7.82011011    3.02793E-12   -0.037379903   -0.022268357
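
The output above comes from EXCEL's Regression tool. As a cross-check, the same full model can be fitted with the statsmodels library; under the assumed Students.xlsx layout described earlier, the summary should reproduce the figures above.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("Students.xlsx")  # assumed layout, as above

# Full model: regress SGPA on all six explanatory variables.
X = sm.add_constant(df[["IQ", "NoCrs", "TravelMin",
                        "SleepHr", "StudyHr", "WorkHr"]])
fit = sm.OLS(df["SGPA"], X).fit()

# summary() reports R Square, coefficients, standard errors, t statistics,
# p-values and 95% confidence intervals, mirroring EXCEL's output.
print(fit.summary())
```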

I R2 is 0.81599, meaning that the model accounts for nearly 82% of the variation of the dependent variable, SGPA, in the sample.

I A test of whether a relationship exists between the dependent variable SGPA and a given explanatory variable takes the form: H0 : βi = 0 vs. H1 : βi ≠ 0, where the index i represents one of 1, 2, ..., k.

I The test statistic is t = (bi − 0)/Sbi, which follows a t distribution with n − (k + 1) degrees of freedom; Sbi is the standard error of bi.

I At α = 0.05, we reject H0 if the p-value of the test is smaller than 0.05, or equivalently if |t| > t(α/2, n−(k+1)).

I If H0 cannot be rejected, there is a good chance that βi is zero, and the regression should be re-estimated with the explanatory variable associated with βi removed.
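
The equivalence of the two rejection rules can be checked numerically. The sketch below uses illustrative (hypothetical) values of bi and Sbi together with this study's n = 120 and k = 6.

```python
from scipy import stats

n, k = 120, 6              # sample size and number of explanatory variables
df_resid = n - (k + 1)     # degrees of freedom of the t distribution

# Illustrative values for one coefficient (hypothetical, not from the study).
b_i, s_bi = 0.05, 0.02
t_stat = (b_i - 0) / s_bi

p_value = 2 * stats.t.sf(abs(t_stat), df_resid)  # two-sided p-value
t_crit = stats.t.ppf(1 - 0.05 / 2, df_resid)     # critical value at alpha = 0.05

# The rules agree: p_value < 0.05 exactly when |t_stat| > t_crit.
print(f"t = {t_stat:.4f}, p = {p_value:.4f}, reject H0: {abs(t_stat) > t_crit}")
```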

I One can, in addition to R2, evaluate the model's goodness of fit by the sum of the absolute values of the errors across the sample of observations.

I Recall that the error for observation i is ei = Yi − Ŷi, the difference between the actual value and the predicted value Ŷi; in the notation of the current example, ei is the difference between the actual and predicted values of SGPAi. It measures the extent to which the predicted value of the dependent variable differs from its true value for an observation.

I The sum of the absolute values of errors (SAE) is ∑|ei|, taken over i = 1, ..., n. We consider the sum of |ei| rather than the sum of ei to avoid negative ei's cancelling out with positive ei's.

I Clearly, the smaller the SAE, the better the fit of the model.
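
A minimal sketch of the SAE computation, using a few hypothetical actual and predicted values rather than the study data:

```python
import numpy as np

# Hypothetical actual and predicted values for four observations.
y_actual = np.array([3.98, 2.50, 3.10, 1.80])
y_hat = np.array([3.59, 2.72, 3.05, 2.01])

e = y_actual - y_hat        # errors e_i = actual minus predicted
sae = np.abs(e).sum()       # sum of |e_i|, so signs cannot cancel
print("SAE =", round(float(sae), 4))
```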

I For example, using the model and the data in Students.xlsx, we obtain

Ŷ1 = 0.913462442 + 0.022195038(116) − 0.048683053(15) + 0.000848208(60) + 0.012080628(12) + 0.031439423(25) − 0.02982413(5) = 3.5905660,

which is the predicted value of SGPA1.

I From the data provided, the true value of SGPA1 is 3.98. Hence e1 = 3.98 − 3.5905660 = 0.3894340 and |e1| = 0.3894340.

I We can similarly work out the |ei|'s for the remaining 119 observations in the sample, and verify that ∑|ei| = 28.557784 over the 120 observations, as shown in the EXCEL output.
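
The prediction for observation 1 can be reproduced directly from the coefficients reported in the EXCEL output; the X-values below are those quoted above.

```python
# Coefficients copied from the full-model EXCEL output.
b = [0.913462442, 0.022195038, -0.048683053,
     0.000848208, 0.012080628, 0.031439423, -0.02982413]

# Observation 1: IQ = 116, NoCrs = 15, TravelMin = 60,
# SleepHr = 12, StudyHr = 25, WorkHr = 5 (the leading 1 is for the intercept).
x1 = [1, 116, 15, 60, 12, 25, 5]

sgpa_hat = sum(coef * val for coef, val in zip(b, x1))
e1 = 3.98 - sgpa_hat        # error: actual minus predicted
print(f"predicted SGPA_1 = {sgpa_hat:.7f}, e_1 = {e1:.7f}")
# Prints 3.5905660 and 0.3894340, matching the text.
```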

I Now, let us test the significance of each explanatory variable.

I For example, for the variable IQ: H0 : β1 = 0 vs. H1 : β1 ≠ 0.

I The test statistic, as shown in the output, is t = (b1 − 0)/Sb1 = (0.022195 − 0)/0.004904 = 4.5260.

I The p-value, as reported in the output, is 1.49509 × 10⁻⁵, which is clearly smaller than α = 0.05; equivalently, the test statistic lies in the rejection region. Hence we reject H0 and conclude that IQ is a significant variable.
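
As a quick check, the t statistic and p-value for IQ can be recomputed from the reported coefficient and standard error, using n − (k + 1) = 120 − 7 = 113 degrees of freedom.

```python
from scipy import stats

# b1 and its standard error, taken from the full-model output.
t_stat = (0.022195038 - 0) / 0.004903885
df_resid = 120 - (6 + 1)   # n - (k + 1) = 113

p_value = 2 * stats.t.sf(abs(t_stat), df_resid)
print(f"t = {t_stat:.4f}, p = {p_value:.5e}")  # about 4.5260 and 1.495e-05
```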

I Looking across the p-values of the t-tests for the six explanatory variables, we find that IQ, StudyHr and WorkHr have p-values well below 0.05, and NoCrs has a p-value of 0.0522, only marginally above it; we treat these four variables as significant.

I On the other hand, the t-tests for TravelMin and SleepHr have p-values of 0.289167731 and 0.614958993 respectively. These variables are therefore insignificant, with SleepHr, which has the higher p-value, being the more insignificant of the two.

I We then remove TravelMin and SleepHr from the model. The new (reduced) model is

SGPA = β0 + β1 IQ + β2 NoCrs + β5 StudyHr + β6 WorkHr + u.

I The results, obtained using EXCEL, are as follows.

Regression Statistics
Multiple R           0.901948488
R Square             0.813511074
Adjusted R Square    0.807024503
Standard Error       0.296728259
Observations         120

            Coefficients    Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   1.145038169     0.589973932      1.94082841     0.05472641    -0.023586645   2.313662984
IQ          0.022513062     0.004863368      4.629109561    9.71844E-06   0.012879666    0.032146457
NoCrs       -0.055937296    0.02135401       -2.619521881   0.00999271    -0.098235479   -0.013639112
StudyHr     0.031518035     0.003999754      7.879992844    2.04156E-12   0.023595291    0.039440778
WorkHr      -0.030595661    0.00370361       -8.261038145   2.78543E-13   -0.037931799   -0.023259523
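
The reduced model can be refitted outside EXCEL in the same way as the full model, again under the assumed Students.xlsx layout; dropping TravelMin and SleepHr should reproduce the output above.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("Students.xlsx")  # assumed layout, as before

# Reduced model: TravelMin and SleepHr removed.
X = sm.add_constant(df[["IQ", "NoCrs", "StudyHr", "WorkHr"]])
reduced = sm.OLS(df["SGPA"], X).fit()
print(reduced.summary())
```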

I All explanatory variables are now significant.

I The coefficient estimates of the explanatory variables that remain in the model have changed (e.g., under the reduced model b1 = 0.02251, whereas under the previous full model b1 = 0.02219). This is because least squares estimates take into account the correlations among the explanatory variables, and these change as variables are added or dropped.

I R2 has dropped from 0.81599 to 0.81351. R2 always decreases as variables are removed, but this does not always mean the reduced model is inferior to the full model.

I It can be verified that ∑|ei| = 28.44398759 over the 120 observations, as shown in the EXCEL output. This sum of absolute errors is slightly smaller than that under the full model.
