Chapter 12. Simple Linear Regression and Correlation
Chapter 12. Simple Linear Regression and Correlation
12.1 The Simple Linear Regression Model
12.2 Fitting the Regression Line
12.3 Inferences on the Slope Parameter β1
12.4 Inferences on the Regression Line
12.5 Prediction Intervals for Future Response Values
12.6 The Analysis of Variance Table
12.7 Residual Analysis
12.8 Variable Transformations
12.9 Correlation Analysis
12.10 Supplementary Problems
12.1 The Simple Linear Regression Model
12.1.1 Model Definition and Assumptions(1/5)
• With the simple linear regression model
yi = β0 + β1xi + εi
the observed value of the dependent variable yi is composed of a linear function β0 + β1xi of the explanatory variable xi, together with an error term εi. The error terms ε1, …, εn are generally taken to be independent observations from a N(0, σ²) distribution, for some error variance σ². This implies that the values y1, …, yn are observations from the independent random variables
Yi ~ N(β0 + β1xi, σ²)
as illustrated in Figure 12.1.
12.1.1 Model Definition and Assumptions(2/5)
12.1.1 Model Definition and Assumptions(3/5)
• The parameter β0 is known as the intercept parameter, and the parameter β1 is known as the slope parameter. A third unknown parameter, the error variance σ², can also be estimated from the data set. As illustrated in Figure 12.2, the data values (xi, yi) lie closer to the line
y = β0 + β1x
as the error variance σ² decreases.
12.1.1 Model Definition and Assumptions(4/5)
• The slope parameter β1 is of particular interest since it indicates how the expected value of the dependent variable depends upon the explanatory variable x, as shown in Figure 12.3.
• The data set shown in Figure 12.4 exhibits a quadratic (or at least nonlinear) relationship between the two variables, and it would make no sense to fit a straight line to the data set.
12.1.1 Model Definition and Assumptions(5/5)
• Simple Linear Regression Model
The simple linear regression model
yi = β0 + β1xi + εi
fits a straight line through a set of paired data observations (x1, y1), …, (xn, yn). The error terms ε1, …, εn are taken to be independent observations from a N(0, σ²) distribution. The three unknown parameters, the intercept parameter β0, the slope parameter β1, and the error variance σ², are estimated from the data set.
12.1.2 Examples(1/2)
• Example 3 : Car Plant Electricity Usage
The manager of a car plant wishes to investigate how the plant’s electricity usage depends upon the plant’s production. The linear model
y = β0 + β1x
will allow a month’s electrical usage to be estimated as a function of the month’s production.
12.1.2 Examples(2/2)
12.2 Fitting the Regression Line
12.2.1 Parameter Estimation(1/4)
The regression line is fitted to the data points (x1, y1), …, (xn, yn) by finding the line that is "closest" to the data points in some sense. As Figure 12.14 illustrates, the fitted line is chosen to be the line that minimizes the sum of the squares of the vertical deviations of the data points from the line,
Q = Σ (yi - (β0 + β1xi))²
where the sum runs over i = 1, …, n, and this is referred to as the least squares fit.
12.2.1 Parameter Estimation(2/4)
With normally distributed error terms, β̂0 and β̂1 are maximum likelihood estimates. The joint density of the error terms ε1, …, εn is
(1/(2πσ²))^(n/2) exp( -(1/(2σ²)) Σ εi² )
This likelihood is maximized by minimizing
Q = Σ (yi - (β0 + β1xi))².
Setting the partial derivatives to zero,
∂Q/∂β0 = -2 Σ (yi - (β0 + β1xi)) = 0
∂Q/∂β1 = -2 Σ xi (yi - (β0 + β1xi)) = 0
gives the normal equations
Σ yi = n β̂0 + β̂1 Σ xi
Σ xi yi = β̂0 Σ xi + β̂1 Σ xi²
12.2.1 Parameter Estimation(3/4)
The normal equations are solved by
β̂1 = ( n Σ xi yi - (Σ xi)(Σ yi) ) / ( n Σ xi² - (Σ xi)² ) = SXY / SXX
and then
β̂0 = ȳ - β̂1 x̄
where
x̄ = (Σ xi)/n and ȳ = (Σ yi)/n
SXX = Σ (xi - x̄)² = Σ xi² - n x̄²
SXY = Σ (xi - x̄)(yi - ȳ) = Σ xi yi - n x̄ ȳ = Σ xi yi - (Σ xi)(Σ yi)/n
For a specific value x* of the explanatory variable, this equation provides a fitted value
ŷ | x* = β̂0 + β̂1 x*
for the dependent variable y, as illustrated in Figure 12.15.
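These estimators can be computed directly from a set of paired observations. The following is a minimal sketch in Python; the x and y arrays are hypothetical illustration data (not from the textbook) and the variable names are ours.

```python
# Minimal sketch of the least squares formulas above.
# The x and y arrays are hypothetical illustration data, not textbook data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = x.mean(), y.mean()

S_XX = np.sum((x - x_bar) ** 2)            # SXX = sum of (xi - x_bar)^2
S_XY = np.sum((x - x_bar) * (y - y_bar))   # SXY = sum of (xi - x_bar)(yi - y_bar)

beta1_hat = S_XY / S_XX                    # slope estimate
beta0_hat = y_bar - beta1_hat * x_bar      # intercept estimate

x_star = 3.5
y_fitted = beta0_hat + beta1_hat * x_star  # fitted value at x*
print(beta0_hat, beta1_hat, y_fitted)
```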
12.2.1 Parameter Estimation(4/4)
The error variance σ² can be estimated by considering the deviations between the observed data values yi and their fitted values ŷi = β̂0 + β̂1 xi. Specifically, the sum of squares for error SSE is defined to be the sum of the squares of these deviations,
SSE = Σ (yi - ŷi)² = Σ (yi - (β̂0 + β̂1 xi))² = Σ yi² - β̂0 Σ yi - β̂1 Σ xi yi
and the estimate of the error variance is
σ̂² = SSE / (n - 2)
12.2.2 Examples(1/5)
• Example 3 : Car Plant Electricity Usage
For this example n = 12 and
Σ xi = 4.51 + … + 4.20 = 58.62
Σ yi = 2.48 + … + 2.53 = 34.15
Σ xi² = 4.51² + … + 4.20² = 291.2310
Σ yi² = 2.48² + … + 2.53² = 98.6967
Σ xi yi = (4.51 × 2.48) + … + (4.20 × 2.53) = 169.2532
12.2.2 Examples(2/5)
12.2.2 Examples(3/5)
The estimates of the slope parameter and the intercept parameter are
β̂1 = ( n Σ xi yi - (Σ xi)(Σ yi) ) / ( n Σ xi² - (Σ xi)² )
   = ( (12 × 169.2532) - (58.62 × 34.15) ) / ( (12 × 291.2310) - 58.62² ) = 0.49883
β̂0 = ȳ - β̂1 x̄ = 34.15/12 - (0.49883 × 58.62/12) = 0.4090
The fitted regression line is
y = 0.409 + 0.499 x
At x = 5.5, the fitted value is
ŷ | 5.5 = 0.409 + (0.499 × 5.5) = 3.1535
12.2.2 Examples(4/5)
Using the model for production values x outside this range is known as extrapolation and may give inaccurate results.
12.2.2 Examples(5/5)
σ̂² = ( Σ yi² - β̂0 Σ yi - β̂1 Σ xi yi ) / (n - 2)
   = ( 98.6967 - (0.4090 × 34.15) - (0.49883 × 169.2532) ) / 10 = 0.0299
σ̂ = √0.0299 = 0.1729
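These worked calculations can be reproduced from the quoted summary statistics alone. A small sketch (the raw monthly data are not listed on these slides, so only the summary totals above are used):

```python
# Reproduces the Car Plant estimates from the summary statistics on these slides.
n = 12
sum_x, sum_y = 58.62, 34.15
sum_x2, sum_y2 = 291.2310, 98.6967
sum_xy = 169.2532

beta1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # about 0.49883
beta0 = sum_y / n - beta1 * sum_x / n                              # about 0.4090
sse = sum_y2 - beta0 * sum_y - beta1 * sum_xy                      # sum of squares for error
sigma2_hat = sse / (n - 2)                                         # about 0.0299
sigma_hat = sigma2_hat ** 0.5                                      # about 0.1729

print(beta1, beta0, sigma2_hat, sigma_hat)
```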
12.3 Inferences on the Slope Parameter β1
12.3.1 Inference Procedures(1/4)
Inferences on the Slope Parameter β1
The slope estimate has the distribution
β̂1 ~ N( β1, σ²/SXX ).
A two-sided confidence interval with a confidence level 1 - α for the slope parameter β1 in a simple linear regression model is
β1 ∈ ( β̂1 - t_{α/2,n-2} s.e.(β̂1), β̂1 + t_{α/2,n-2} s.e.(β̂1) )
which is
β1 ∈ ( β̂1 - t_{α/2,n-2} σ̂/√SXX, β̂1 + t_{α/2,n-2} σ̂/√SXX ).
One-sided 1 - α confidence level confidence intervals are
β1 ∈ ( -∞, β̂1 + t_{α,n-2} σ̂/√SXX ) and β1 ∈ ( β̂1 - t_{α,n-2} σ̂/√SXX, ∞ ).
12.3.1 Inference Procedures(2/4)
The two-sided hypotheses
H0: β1 = b1 versus HA: β1 ≠ b1
for a fixed value b1 of interest are tested with the t-statistic
t = ( β̂1 - b1 ) / s.e.(β̂1) = ( β̂1 - b1 ) √SXX / σ̂
The p-value is
p-value = 2 × P( X > |t| )
where the random variable X has a t-distribution with n - 2 degrees of freedom. A size α test rejects the null hypothesis if |t| > t_{α/2,n-2}.
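A sketch of this slope inference using scipy; the function and argument names are ours, and the inputs (β̂1, SXX, σ̂, n) are assumed to have been computed already.

```python
# Sketch of the slope inference above (assumed precomputed inputs; names are ours).
import math
from scipy import stats

def slope_inference(beta1_hat, S_XX, sigma_hat, n, b1=0.0, alpha=0.05):
    se = sigma_hat / math.sqrt(S_XX)                  # s.e.(beta1_hat)
    t_stat = (beta1_hat - b1) / se                    # t-statistic for H0: beta1 = b1
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    ci = (beta1_hat - t_crit * se, beta1_hat + t_crit * se)
    return t_stat, p_value, ci

# Car Plant values from this chapter; alpha=0.01 reproduces the 99% interval (0.251, 0.747)
print(slope_inference(0.49883, 4.8723, 0.1729, 12, alpha=0.01))
```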
12.3.1 Inference Procedures(3/4)
The one-sided hypotheses
H0: β1 ≤ b1 versus HA: β1 > b1
have a p-value
p-value = P( X > t )
and a size α test rejects the null hypothesis if t > t_{α,n-2}.
The one-sided hypotheses
H0: β1 ≥ b1 versus HA: β1 < b1
have a p-value
p-value = P( X < t )
and a size α test rejects the null hypothesis if t < -t_{α,n-2}.
12.3.1 Inference Procedures(4/4)
• An interesting point to notice is that for a fixed value of the error variance σ², the variance of the slope parameter estimate β̂1 decreases as the value of SXX increases. This happens as the values of the explanatory variable xi become more spread out, as illustrated in Figure 12.30. This result is intuitively reasonable, since a greater spread in the values xi provides a greater “leverage” for fitting the regression line, and therefore the slope parameter estimate should be more accurate.
12.3.2 Examples(1/2)
• Example 3 : Car Plant Electricity Usage
For this example
SXX = Σ xi² - (Σ xi)²/n = 291.2310 - 58.62²/12 = 4.8723
s.e.(β̂1) = σ̂/√SXX = 0.1729/√4.8723 = 0.0783
The t-statistic for testing H0: β1 = 0 is
t = β̂1 / s.e.(β̂1) = 0.49883 / 0.0783 = 6.37
The two-sided p-value is
p-value = 2 × P( X > 6.37 ) ≈ 0
12.3.2 Examples(2/2)
With t_{0.005,10} = 3.169, a 99% two-sided confidence interval for the slope parameter is
β1 ∈ ( β̂1 - 3.169 × s.e.(β̂1), β̂1 + 3.169 × s.e.(β̂1) )
   = ( 0.49883 - (3.169 × 0.0783), 0.49883 + (3.169 × 0.0783) )
   = ( 0.251, 0.747 )
12.4 Inferences on the Regression Line
12.4.1 Inference Procedures(1/2)
Inferences on the Expected Value of the Dependent Variable
A 1 - α confidence level two-sided confidence interval for β0 + β1 x*, the expected value of the dependent variable for a particular value x* of the explanatory variable, is
β0 + β1 x* ∈ ( β̂0 + β̂1 x* - t_{α/2,n-2} s.e.(β̂0 + β̂1 x*), β̂0 + β̂1 x* + t_{α/2,n-2} s.e.(β̂0 + β̂1 x*) )
where
s.e.(β̂0 + β̂1 x*) = σ̂ √( 1/n + (x* - x̄)²/SXX )
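A sketch of this confidence interval for the expected response at x*; the inputs are assumed to be precomputed and the function and argument names are ours.

```python
# Sketch of the confidence interval above for the expected response at x*.
import math
from scipy import stats

def mean_response_ci(x_star, beta0, beta1, sigma_hat, x_bar, S_XX, n, alpha=0.05):
    fit = beta0 + beta1 * x_star                                        # point estimate
    se = sigma_hat * math.sqrt(1.0 / n + (x_star - x_bar) ** 2 / S_XX)  # s.e. of the fit
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return fit - t_crit * se, fit + t_crit * se

# Car Plant example at x* = 5 gives roughly (2.79, 3.02)
print(mean_response_ci(5.0, 0.409, 0.499, 0.1729, 4.885, 4.8723, 12))
```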
12.4.1 Inference Procedures(2/2)
One-sided confidence intervals are
β0 + β1 x* ∈ ( -∞, β̂0 + β̂1 x* + t_{α,n-2} s.e.(β̂0 + β̂1 x*) )
and
β0 + β1 x* ∈ ( β̂0 + β̂1 x* - t_{α,n-2} s.e.(β̂0 + β̂1 x*), ∞ ).
Hypothesis tests on β0 + β1 x* can be performed by comparing the t-statistic
t = ( (β̂0 + β̂1 x*) - (β0 + β1 x*) ) / s.e.(β̂0 + β̂1 x*)
with a t-distribution with n - 2 degrees of freedom.
12.4.2 Examples(1/2)
• Example 3 : Car Plant Electricity Usage
s.e.(β̂0 + β̂1 x*) = σ̂ √( 1/n + (x* - x̄)²/SXX ) = 0.1729 √( 1/12 + (x* - 4.885)²/4.8723 )
With t_{0.025,10} = 2.228, a 95% confidence interval for β0 + β1 x* is
( 0.409 + 0.499 x* - 2.228 × 0.1729 √( 1/12 + (x* - 4.885)²/4.8723 ),
  0.409 + 0.499 x* + 2.228 × 0.1729 √( 1/12 + (x* - 4.885)²/4.8723 ) ).
At x* = 5:
β0 + 5β1 ∈ ( 0.409 + (0.499 × 5) - 0.113, 0.409 + (0.499 × 5) + 0.113 ) = ( 2.79, 3.02 )
12.4.2 Examples(2/2)
12.5 Prediction Intervals for Future Response Values
12.5.1 Inference Procedures(1/2)
• Prediction Intervals for Future Response Values
A 1 - α confidence level two-sided prediction interval for y | x*, a future value of the dependent variable for a particular value x* of the explanatory variable, is
y | x* ∈ ( β̂0 + β̂1 x* - t_{α/2,n-2} σ̂ √( 1 + 1/n + (x* - x̄)²/SXX ),
           β̂0 + β̂1 x* + t_{α/2,n-2} σ̂ √( 1 + 1/n + (x* - x̄)²/SXX ) )
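A sketch of this prediction interval; the same assumed precomputed inputs as before, with the extra "1 +" term reflecting the variability of a new observation.

```python
# Sketch of the prediction interval above (assumed precomputed inputs; names are ours).
import math
from scipy import stats

def prediction_interval(x_star, beta0, beta1, sigma_hat, x_bar, S_XX, n, alpha=0.05):
    fit = beta0 + beta1 * x_star
    se = sigma_hat * math.sqrt(1.0 + 1.0 / n + (x_star - x_bar) ** 2 / S_XX)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return fit - t_crit * se, fit + t_crit * se

# Car Plant example at x* = 5 gives roughly (2.50, 3.30)
print(prediction_interval(5.0, 0.409, 0.499, 0.1729, 4.885, 4.8723, 12))
```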
12.5.1 Inference Procedures(2/2)
One-sided prediction intervals are
y | x* ∈ ( -∞, β̂0 + β̂1 x* + t_{α,n-2} σ̂ √( 1 + 1/n + (x* - x̄)²/SXX ) )
and
y | x* ∈ ( β̂0 + β̂1 x* - t_{α,n-2} σ̂ √( 1 + 1/n + (x* - x̄)²/SXX ), ∞ )
12.5.2 Examples(1/2)
• Example 3 : Car Plant Electricity Usage
With t_{0.025,10} = 2.228, a 95% prediction interval for y | x* is
( 0.409 + 0.499 x* - 2.228 × 0.1729 √( 13/12 + (x* - 4.885)²/4.8723 ),
  0.409 + 0.499 x* + 2.228 × 0.1729 √( 13/12 + (x* - 4.885)²/4.8723 ) ).
At x* = 5:
y | 5 ∈ ( 0.409 + (0.499 × 5) - 0.401, 0.409 + (0.499 × 5) + 0.401 ) = ( 2.50, 3.30 )
12.5.2 Examples(2/2)
12.6 The Analysis of Variance Table
12.6.1 Sum of Squares Decomposition(1/5)
12.6.1 Sum of Squares Decomposition(2/5)
12.6.1 Sum of Squares Decomposition(3/5)
Source       Degrees of freedom   Sum of squares   Mean squares        F-statistic    p-value
Regression   1                    SSR              MSR = SSR/1         F = MSR/MSE    P( F1,n-2 > F )
Error        n - 2                SSE              MSE = SSE/(n - 2)
Total        n - 1                SST

Figure 12.41 Analysis of variance table for simple linear regression analysis
12.6.1 Sum of Squares Decomposition(4/5)
12.6.1 Sum of Squares Decomposition(5/5)
Coefficient of Determination R²
The total variability in the dependent variable, the total sum of squares
SST = Σ (yi - ȳ)²
can be partitioned into the variability explained by the regression line, the regression sum of squares
SSR = Σ (ŷi - ȳ)²
and the variability about the regression line, the error sum of squares
SSE = Σ (yi - ŷi)².
The proportion of the total variability accounted for by the regression line is the coefficient of determination
R² = SSR/SST = 1 - SSE/SST
which takes a value between zero and one.
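A sketch of this decomposition and of the analysis of variance table entries, computed from fitted values on hypothetical data; scipy is used only for the F-distribution tail probability.

```python
# Sketch of the ANOVA decomposition and R^2 above (hypothetical data).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
n = len(x)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
SSE = np.sum((y - y_hat) ** 2)         # error sum of squares

MSR, MSE = SSR / 1, SSE / (n - 2)
F = MSR / MSE
p_value = stats.f.sf(F, 1, n - 2)      # P( F_{1,n-2} > F )
R2 = SSR / SST                         # coefficient of determination

print(F, p_value, R2)
```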
12.6.2 Examples(1/1)
• Example 3 : Car Plant Electricity Usage
F = MSR/MSE = 1.2124/0.0299 = 40.53
R² = SSR/SST = 1.2124/1.5115 = 0.802
12.7 Residual Analysis
12.7.1 Residual Analysis Methods(1/7)
• The residuals are defined to be
ei = yi - ŷi,  i = 1, …, n
so that they are the differences between the observed values of the dependent variable yi and the corresponding fitted values ŷi (a short computational sketch follows this list).
• A property of the residuals: Σ ei = 0.
• Residual analysis can be used to
– identify data points that are outliers,
– check whether the fitted model is appropriate,
– check whether the error variance is constant, and
– check whether the error terms are normally distributed.
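A minimal sketch of these residual computations on a small hypothetical data set; the coefficients below are the least squares values for that data set, so the residuals sum to (about) zero.

```python
# Sketch of the residual definitions above, on a small hypothetical data set.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.1, 3.9])

# least squares fit for this small data set: beta0_hat = 1.1, beta1_hat = 0.95
y_hat = 1.1 + 0.95 * x        # fitted values
e = y - y_hat                 # residuals e_i = y_i - y_hat_i

print(e)                      # [-0.05, 0.1, -0.05]
print(e.sum())                # sums to (about) zero for a least squares fit
```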
12.7.1 Residual Analysis Methods(2/7)
• A nice random scatter of points, such as the residual plot in Figure 12.45, indicates that there are no problems with the regression analysis.
• Any patterns in the residual plot, or any residuals with a large absolute value, alert the experimenter to possible problems with the fitted regression model.
12.7.1 Residual Analysis Methods(3/7)
• A data point (xi, yi) can be considered to be an outlier if it does not appear to be predicted well by the fitted model.
• Residuals of outliers have a large absolute value, as indicated in Figure 12.46. Note in the figure that ei/σ̂ is used instead of ei.
• [For your interest only]
Var(ei) = σ² ( 1 - 1/n - (xi - x̄)²/SXX )
12.7.1 Residual Analysis Methods(4/7)
• If the residual plot shows positive and negative residuals grouped together as in Figure 12.47, then a linear model is not appropriate. As Figure 12.47 indicates, a nonlinear model is needed for such a data set.
12.7.1 Residual Analysis Methods(5/7)
• If the residual plot shows a “funnel shape” as in Figure 12.48, so that the size of the residuals depends upon the value of the explanatory variable x, then the assumption of a constant error variance σ2 is not valid.
12.7.1 Residual Analysis Methods(6/7)
• A normal probability plot (a normal score plot) of the residuals can be used to check whether the error terms εi appear to be normally distributed.
• The normal score of the i-th smallest residual is Φ⁻¹( (i - 3/8) / (n + 1/4) ) (see the sketch after this list).
• If the main body of the points in the normal probability plot lies approximately on a straight line, as in Figure 12.49, then the normality assumption is reasonable.
• A shape such as the one in Figure 12.50 indicates that the distribution is not normal.
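A sketch of the normal scores calculation; the plotting position (i - 3/8)/(n + 1/4) is a common convention and is our reading of the formula on the slide, so treat it as an assumption.

```python
# Sketch of a normal probability (normal score) plot of the residuals.
# The plotting position (i - 3/8)/(n + 1/4) is an assumed convention.
import numpy as np
from scipy import stats

def normal_scores(residuals):
    e = np.sort(np.asarray(residuals))                 # ordered residuals
    n = len(e)
    i = np.arange(1, n + 1)
    scores = stats.norm.ppf((i - 3.0 / 8.0) / (n + 1.0 / 4.0))
    return scores, e   # plot e against scores; a straight line supports normality

scores, ordered = normal_scores([-0.05, 0.10, -0.05, 0.02, -0.02])
print(np.round(scores, 3))
print(ordered)
```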
12.7.1 Residual Analysis Methods(7/7)
12.7.2 Examples(1/2)
• Example : Nile River Flowrate
12.7.2 Examples(2/2)
For the data point with xi = 3.88 and yi = 4.01:
ŷi = -0.470 + (0.836 × 3.88) = 2.77
ei = yi - ŷi = 4.01 - 2.77 = 1.24
ei/σ̂ = 1.24/√0.1092 = 3.75
For the data point with xi = 6.13 and yi = 5.67:
ei = yi - ŷi = 5.67 - ( -0.470 + (0.836 × 6.13) ) = 1.02
ei/σ̂ = 1.02/√0.1092 = 3.07
12.8 Variable Transformations
12.8.1 Intrinsically Linear Models(1/4)
12.8.1 Intrinsically Linear Models(2/4)
12.8.1 Intrinsically Linear Models(3/4)
12.8.1 Intrinsically Linear Models(4/4)
12.8.2 Examples(1/5)
• Example : Roadway Base Aggregates
12.8.2 Examples(2/5)
12.8.2 Examples(3/5)
12.8.2 Examples(4/5)
12.8.2 Examples(5/5)
12.9 Correlation Analysis
12.9.1 The Sample Correlation Coefficient
Sample Correlation Coefficient
The sample correlation coefficient r for a set of paired data observations (xi, yi) is
r = SXY / √( SXX SYY )
  = Σ (xi - x̄)(yi - ȳ) / √( Σ (xi - x̄)² Σ (yi - ȳ)² )
  = ( Σ xi yi - n x̄ ȳ ) / √( (Σ xi² - n x̄²)(Σ yi² - n ȳ²) )
It measures the strength of the linear association between two variables and can be thought of as an estimate of the correlation ρ between the two associated random variables X and Y.
Under the assumption that the X and Y random variables have a bivariate normal distribution, a test of the null hypothesis
H0: ρ = 0
can be performed by comparing the t-statistic
t = r √(n - 2) / √(1 - r²)
with a t-distribution with n - 2 degrees of freedom. In a regression framework, this test is equivalent to testing H0: β1 = 0.
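A sketch of the sample correlation coefficient and the test of H0: ρ = 0, on hypothetical data; the variable names are ours.

```python
# Sketch of the sample correlation coefficient and the t-test above (hypothetical data).
import math
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
n = len(x)

S_XY = np.sum((x - x.mean()) * (y - y.mean()))
S_XX = np.sum((x - x.mean()) ** 2)
S_YY = np.sum((y - y.mean()) ** 2)

r = S_XY / math.sqrt(S_XX * S_YY)                        # sample correlation coefficient
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)    # t-statistic for H0: rho = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)          # two-sided p-value

print(r, t_stat, p_value)
```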
12.9.2 Examples(1/1)
• Example : Nile River Flowrate
r = √R² = √0.871 = 0.933