Chapter 12. Simple Linear Regression and Correlation
Chapter 12. Simple Linear Regression and Correlation
12.1 The Simple Linear Regression Model
12.2 Fitting the Regression Line
12.3 Inferences on the Slope Parameter β1
12.4 Inferences on the Regression Line
12.5 Prediction Intervals for Future Response Values
12.6 The Analysis of Variance Table
12.7 Residual Analysis
12.8 Variable Transformations
12.9 Correlation Analysis
12.10 Supplementary Problems
12.1 The Simple Linear Regression Model
12.1.1 Model Definition and Assumptions(1/5)
• With the simple linear regression model
yi = β0 + β1xi + εi
the observed value of the dependent variable yi is composed of a linear function β0 + β1xi of the explanatory variable xi, together with an error term εi. The error terms ε1, …, εn are generally taken to be independent observations from a N(0, σ²) distribution, for some error variance σ². This implies that the values y1, …, yn are observations from the independent random variables
Yi ~ N(β0 + β1xi, σ²)
as illustrated in Figure 12.1.
12.1.1 Model Definition and Assumptions(2/5)
12.1.1 Model Definition and Assumptions(3/5)
• The parameter β0 is known as the intercept parameter, and the parameter β1 is known as the slope parameter. A third unknown parameter, the error variance σ², can also be estimated from the data set. As illustrated in Figure 12.2, the data values (xi, yi) lie closer to the line
y = β0 + β1x
as the error variance σ² decreases.
12.1.1 Model Definition and Assumptions(4/5)
• The slope parameter β1 is of particular interest since it indicates how the expected value of the dependent variable depends upon the explanatory variable x, as shown in Figure 12.3.
• The data set shown in Figure 12.4 exhibits a quadratic (or at least nonlinear) relationship between the two variables, and it would make no sense to fit a straight line to the data set.
12.1.1 Model Definition and Assumptions(5/5)
• Simple Linear Regression Model
The simple linear regression model
yi = β0 + β1xi + εi
fits a straight line through a set of paired data observations (x1, y1), …, (xn, yn). The error terms ε1, …, εn are taken to be independent observations from a N(0, σ²) distribution. The three unknown parameters, the intercept parameter β0, the slope parameter β1, and the error variance σ², are estimated from the data set.
12.1.2 Examples(1/2)
• Example 3 : Car Plant Electricity Usage
The manager of a car plant wishes to investigate how the plant’s electricity usage depends upon the plant’s production. The linear model
y = β0 + β1x
will allow a month’s electrical usage to be estimated as a function of the month’s production.
12.1.2 Examples(2/2)
12.2 Fitting the Regression Line
12.2.1 Parameter Estimation(1/4)
The regression line is fitted to the data points (x1, y1), …, (xn, yn) by finding the line that is "closest" to the data points in some sense. As Figure 12.14 illustrates, the fitted line is chosen to be the line that minimizes the sum of the squares of the vertical deviations of the data points from the line,
Q = Σ (yi - (β0 + β1xi))²
where the sum runs over i = 1, …, n, and this is referred to as the least squares fit.
12.2.1 Parameter Estimation(2/4)
With normally distributed error terms, β̂0 and β̂1 are maximum likelihood estimates. The joint density of the error terms ε1, …, εn is
(1/(2πσ²))^(n/2) exp( -(1/(2σ²)) Σ εi² )
This likelihood is maximized by minimizing
Q = Σ (yi - (β0 + β1xi))².
Setting the partial derivatives to zero,
∂Q/∂β0 = -2 Σ (yi - (β0 + β1xi)) = 0
∂Q/∂β1 = -2 Σ xi (yi - (β0 + β1xi)) = 0
gives the normal equations
Σ yi = n β̂0 + β̂1 Σ xi
Σ xi yi = β̂0 Σ xi + β̂1 Σ xi²
12.2.1 Parameter Estimation(3/4)
The normal equations are solved by
β̂1 = ( n Σ xi yi - (Σ xi)(Σ yi) ) / ( n Σ xi² - (Σ xi)² ) = SXY / SXX
and then
β̂0 = ȳ - β̂1 x̄
where
x̄ = (Σ xi)/n and ȳ = (Σ yi)/n
SXX = Σ (xi - x̄)² = Σ xi² - n x̄²
SXY = Σ (xi - x̄)(yi - ȳ) = Σ xi yi - n x̄ ȳ = Σ xi yi - (Σ xi)(Σ yi)/n
For a specific value x* of the explanatory variable, this equation provides a fitted value
ŷ | x* = β̂0 + β̂1 x*
for the dependent variable y, as illustrated in Figure 12.15.
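These estimators can be computed directly from a set of paired observations. The following is a minimal sketch in Python; the x and y arrays are hypothetical illustration data (not from the textbook) and the variable names are ours.

```python
# Minimal sketch of the least squares formulas above.
# The x and y arrays are hypothetical illustration data, not textbook data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = x.mean(), y.mean()

S_XX = np.sum((x - x_bar) ** 2)            # SXX = sum of (xi - x_bar)^2
S_XY = np.sum((x - x_bar) * (y - y_bar))   # SXY = sum of (xi - x_bar)(yi - y_bar)

beta1_hat = S_XY / S_XX                    # slope estimate
beta0_hat = y_bar - beta1_hat * x_bar      # intercept estimate

x_star = 3.5
y_fitted = beta0_hat + beta1_hat * x_star  # fitted value at x*
print(beta0_hat, beta1_hat, y_fitted)
```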
12.2.1 Parameter Estimation(4/4)
The error variance σ² can be estimated by considering the deviations between the observed data values yi and their fitted values ŷi = β̂0 + β̂1 xi. Specifically, the sum of squares for error SSE is defined to be the sum of the squares of these deviations,
SSE = Σ (yi - ŷi)² = Σ (yi - (β̂0 + β̂1 xi))² = Σ yi² - β̂0 Σ yi - β̂1 Σ xi yi
and the estimate of the error variance is
σ̂² = SSE / (n - 2)
12.2.2 Examples(1/5)
• Example 3 : Car Plant Electricity Usage
For this example n = 12 and
Σ xi = 4.51 + … + 4.20 = 58.62
Σ yi = 2.48 + … + 2.53 = 34.15
Σ xi² = 4.51² + … + 4.20² = 291.2310
Σ yi² = 2.48² + … + 2.53² = 98.6967
Σ xi yi = (4.51 × 2.48) + … + (4.20 × 2.53) = 169.2532
12.2.2 Examples(2/5)
12.2.2 Examples(3/5)
The estimates of the slope parameter and the intercept parameter are
β̂1 = ( n Σ xi yi - (Σ xi)(Σ yi) ) / ( n Σ xi² - (Σ xi)² )
   = ( (12 × 169.2532) - (58.62 × 34.15) ) / ( (12 × 291.2310) - 58.62² ) = 0.49883
β̂0 = ȳ - β̂1 x̄ = 34.15/12 - (0.49883 × 58.62/12) = 0.4090
The fitted regression line is
y = 0.409 + 0.499 x
At x = 5.5, the fitted value is
ŷ | 5.5 = 0.409 + (0.499 × 5.5) = 3.1535
12.2.2 Examples(4/5)
Using the model for production values x outside this range is known as extrapolation and may give inaccurate results.
12.2.2 Examples(5/5)
σ̂² = ( Σ yi² - β̂0 Σ yi - β̂1 Σ xi yi ) / (n - 2)
   = ( 98.6967 - (0.4090 × 34.15) - (0.49883 × 169.2532) ) / 10 = 0.0299
σ̂ = √0.0299 = 0.1729
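These worked calculations can be reproduced from the quoted summary statistics alone. A small sketch (the raw monthly data are not listed on these slides, so only the summary totals above are used):

```python
# Reproduces the Car Plant estimates from the summary statistics on these slides.
n = 12
sum_x, sum_y = 58.62, 34.15
sum_x2, sum_y2 = 291.2310, 98.6967
sum_xy = 169.2532

beta1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # about 0.49883
beta0 = sum_y / n - beta1 * sum_x / n                              # about 0.4090
sse = sum_y2 - beta0 * sum_y - beta1 * sum_xy                      # sum of squares for error
sigma2_hat = sse / (n - 2)                                         # about 0.0299
sigma_hat = sigma2_hat ** 0.5                                      # about 0.1729

print(beta1, beta0, sigma2_hat, sigma_hat)
```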
12.3 Inferences on the Slope Parameter β1
12.3.1 Inference Procedures(1/4)
Inferences on the Slope Parameter β1
The slope estimate has the distribution
β̂1 ~ N( β1, σ²/SXX ).
A two-sided confidence interval with a confidence level 1 - α for the slope parameter β1 in a simple linear regression model is
β1 ∈ ( β̂1 - t_{α/2,n-2} s.e.(β̂1), β̂1 + t_{α/2,n-2} s.e.(β̂1) )
which is
β1 ∈ ( β̂1 - t_{α/2,n-2} σ̂/√SXX, β̂1 + t_{α/2,n-2} σ̂/√SXX ).
One-sided 1 - α confidence level confidence intervals are
β1 ∈ ( -∞, β̂1 + t_{α,n-2} σ̂/√SXX ) and β1 ∈ ( β̂1 - t_{α,n-2} σ̂/√SXX, ∞ ).
12.3.1 Inference Procedures(2/4)
The two-sided hypotheses
H0: β1 = b1 versus HA: β1 ≠ b1
for a fixed value b1 of interest are tested with the t-statistic
t = ( β̂1 - b1 ) / s.e.(β̂1) = ( β̂1 - b1 ) √SXX / σ̂
The p-value is
p-value = 2 × P( X > |t| )
where the random variable X has a t-distribution with n - 2 degrees of freedom. A size α test rejects the null hypothesis if |t| > t_{α/2,n-2}.
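A sketch of this slope inference using scipy; the function and argument names are ours, and the inputs (β̂1, SXX, σ̂, n) are assumed to have been computed already.

```python
# Sketch of the slope inference above (assumed precomputed inputs; names are ours).
import math
from scipy import stats

def slope_inference(beta1_hat, S_XX, sigma_hat, n, b1=0.0, alpha=0.05):
    se = sigma_hat / math.sqrt(S_XX)                  # s.e.(beta1_hat)
    t_stat = (beta1_hat - b1) / se                    # t-statistic for H0: beta1 = b1
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    ci = (beta1_hat - t_crit * se, beta1_hat + t_crit * se)
    return t_stat, p_value, ci

# Car Plant values from this chapter; alpha=0.01 reproduces the 99% interval (0.251, 0.747)
print(slope_inference(0.49883, 4.8723, 0.1729, 12, alpha=0.01))
```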
12.3.1 Inference Procedures(3/4)
The one-sided hypotheses
H0: β1 ≤ b1 versus HA: β1 > b1
have a p-value
p-value = P( X > t )
and a size α test rejects the null hypothesis if t > t_{α,n-2}.
The one-sided hypotheses
H0: β1 ≥ b1 versus HA: β1 < b1
have a p-value
p-value = P( X < t )
and a size α test rejects the null hypothesis if t < -t_{α,n-2}.
12.3.1 Inference Procedures(4/4)
• An interesting point to notice is that for a fixed value of the error variance σ², the variance of the slope parameter estimate β̂1 decreases as the value of SXX increases. This happens as the values of the explanatory variable xi become more spread out, as illustrated in Figure 12.30. This result is intuitively reasonable, since a greater spread in the values xi provides a greater “leverage” for fitting the regression line, and therefore the slope parameter estimate should be more accurate.
12.3.2 Examples(1/2)
• Example 3 : Car Plant Electricity Usage
For this example
SXX = Σ xi² - (Σ xi)²/n = 291.2310 - 58.62²/12 = 4.8723
s.e.(β̂1) = σ̂/√SXX = 0.1729/√4.8723 = 0.0783
The t-statistic for testing H0: β1 = 0 is
t = β̂1 / s.e.(β̂1) = 0.49883 / 0.0783 = 6.37
The two-sided p-value is
p-value = 2 × P( X > 6.37 ) ≈ 0
12.3.2 Examples(2/2)
With t_{0.005,10} = 3.169, a 99% two-sided confidence interval for the slope parameter is
β1 ∈ ( β̂1 - 3.169 × s.e.(β̂1), β̂1 + 3.169 × s.e.(β̂1) )
   = ( 0.49883 - (3.169 × 0.0783), 0.49883 + (3.169 × 0.0783) )
   = ( 0.251, 0.747 )
12.4 Inferences on the Regression Line
12.4.1 Inference Procedures(1/2)
Inferences on the Expected Value of the Dependent Variable
A 1 - α confidence level two-sided confidence interval for β0 + β1 x*, the expected value of the dependent variable for a particular value x* of the explanatory variable, is
β0 + β1 x* ∈ ( β̂0 + β̂1 x* - t_{α/2,n-2} s.e.(β̂0 + β̂1 x*), β̂0 + β̂1 x* + t_{α/2,n-2} s.e.(β̂0 + β̂1 x*) )
where
s.e.(β̂0 + β̂1 x*) = σ̂ √( 1/n + (x* - x̄)²/SXX )
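A sketch of this confidence interval for the expected response at x*; the inputs are assumed to be precomputed and the function and argument names are ours.

```python
# Sketch of the confidence interval above for the expected response at x*.
import math
from scipy import stats

def mean_response_ci(x_star, beta0, beta1, sigma_hat, x_bar, S_XX, n, alpha=0.05):
    fit = beta0 + beta1 * x_star                                        # point estimate
    se = sigma_hat * math.sqrt(1.0 / n + (x_star - x_bar) ** 2 / S_XX)  # s.e. of the fit
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return fit - t_crit * se, fit + t_crit * se

# Car Plant example at x* = 5 gives roughly (2.79, 3.02)
print(mean_response_ci(5.0, 0.409, 0.499, 0.1729, 4.885, 4.8723, 12))
```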
12.4.1 Inference Procedures(2/2)
One-sided confidence intervals are
β0 + β1 x* ∈ ( -∞, β̂0 + β̂1 x* + t_{α,n-2} s.e.(β̂0 + β̂1 x*) )
and
β0 + β1 x* ∈ ( β̂0 + β̂1 x* - t_{α,n-2} s.e.(β̂0 + β̂1 x*), ∞ ).
Hypothesis tests on β0 + β1 x* can be performed by comparing the t-statistic
t = ( (β̂0 + β̂1 x*) - (β0 + β1 x*) ) / s.e.(β̂0 + β̂1 x*)
with a t-distribution with n - 2 degrees of freedom.
12.4.2 Examples(1/2)
• Example 3 : Car Plant Electricity Usage
s.e.(β̂0 + β̂1 x*) = σ̂ √( 1/n + (x* - x̄)²/SXX ) = 0.1729 √( 1/12 + (x* - 4.885)²/4.8723 )
With t_{0.025,10} = 2.228, a 95% confidence interval for β0 + β1 x* is
( 0.409 + 0.499 x* - 2.228 × 0.1729 √( 1/12 + (x* - 4.885)²/4.8723 ),
  0.409 + 0.499 x* + 2.228 × 0.1729 √( 1/12 + (x* - 4.885)²/4.8723 ) ).
At x* = 5:
β0 + 5β1 ∈ ( 0.409 + (0.499 × 5) - 0.113, 0.409 + (0.499 × 5) + 0.113 ) = ( 2.79, 3.02 )
12.4.2 Examples(2/2)
12.5 Prediction Intervals for Future Response Values
12.5.1 Inference Procedures(1/2)
• Prediction Intervals for Future Response Values
A 1 - α confidence level two-sided prediction interval for y | x*, a future value of the dependent variable for a particular value x* of the explanatory variable, is
y | x* ∈ ( β̂0 + β̂1 x* - t_{α/2,n-2} σ̂ √( 1 + 1/n + (x* - x̄)²/SXX ),
           β̂0 + β̂1 x* + t_{α/2,n-2} σ̂ √( 1 + 1/n + (x* - x̄)²/SXX ) )
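A sketch of this prediction interval; the same assumed precomputed inputs as before, with the extra "1 +" term reflecting the variability of a new observation.

```python
# Sketch of the prediction interval above (assumed precomputed inputs; names are ours).
import math
from scipy import stats

def prediction_interval(x_star, beta0, beta1, sigma_hat, x_bar, S_XX, n, alpha=0.05):
    fit = beta0 + beta1 * x_star
    se = sigma_hat * math.sqrt(1.0 + 1.0 / n + (x_star - x_bar) ** 2 / S_XX)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return fit - t_crit * se, fit + t_crit * se

# Car Plant example at x* = 5 gives roughly (2.50, 3.30)
print(prediction_interval(5.0, 0.409, 0.499, 0.1729, 4.885, 4.8723, 12))
```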
12.5.1 Inference Procedures(2/2)
One-sided prediction intervals are
y | x* ∈ ( -∞, β̂0 + β̂1 x* + t_{α,n-2} σ̂ √( 1 + 1/n + (x* - x̄)²/SXX ) )
and
y | x* ∈ ( β̂0 + β̂1 x* - t_{α,n-2} σ̂ √( 1 + 1/n + (x* - x̄)²/SXX ), ∞ )
12.5.2 Examples(1/2)
• Example 3 : Car Plant Electricity Usage
With t_{0.025,10} = 2.228, a 95% prediction interval for y | x* is
( 0.409 + 0.499 x* - 2.228 × 0.1729 √( 13/12 + (x* - 4.885)²/4.8723 ),
  0.409 + 0.499 x* + 2.228 × 0.1729 √( 13/12 + (x* - 4.885)²/4.8723 ) ).
At x* = 5:
y | 5 ∈ ( 0.409 + (0.499 × 5) - 0.401, 0.409 + (0.499 × 5) + 0.401 ) = ( 2.50, 3.30 )
12.5.2 Examples(2/2)
12.6 The Analysis of Variance Table
12.6.1 Sum of Squares Decomposition(1/5)
12.6.1 Sum of Squares Decomposition(2/5)
12.6.1 Sum of Squares Decomposition(3/5)
Source       Degrees of freedom   Sum of squares   Mean squares        F-statistic    p-value
Regression   1                    SSR              MSR = SSR/1         F = MSR/MSE    P( F1,n-2 > F )
Error        n - 2                SSE              MSE = SSE/(n - 2)
Total        n - 1                SST

Figure 12.41 Analysis of variance table for simple linear regression analysis
12.6.1 Sum of Squares Decomposition(4/5)
12.6.1 Sum of Squares Decomposition(5/5)
Coefficient of Determination R²
The total variability in the dependent variable, the total sum of squares
SST = Σ (yi - ȳ)²
can be partitioned into the variability explained by the regression line, the regression sum of squares
SSR = Σ (ŷi - ȳ)²
and the variability about the regression line, the error sum of squares
SSE = Σ (yi - ŷi)².
The proportion of the total variability accounted for by the regression line is the coefficient of determination
R² = SSR/SST = 1 - SSE/SST
which takes a value between zero and one.
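A sketch of this decomposition and of the analysis of variance table entries, computed from fitted values on hypothetical data; scipy is used only for the F-distribution tail probability.

```python
# Sketch of the ANOVA decomposition and R^2 above (hypothetical data).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
n = len(x)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
SSE = np.sum((y - y_hat) ** 2)         # error sum of squares

MSR, MSE = SSR / 1, SSE / (n - 2)
F = MSR / MSE
p_value = stats.f.sf(F, 1, n - 2)      # P( F_{1,n-2} > F )
R2 = SSR / SST                         # coefficient of determination

print(F, p_value, R2)
```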
12.6.2 Examples(1/1)
• Example 3 : Car Plant Electricity Usage
F = MSR/MSE = 1.2124/0.0299 = 40.53
R² = SSR/SST = 1.2124/1.5115 = 0.802
12.7 Residual Analysis
12.7.1 Residual Analysis Methods(1/7)
• The residuals are defined to be
ei = yi - ŷi,  i = 1, …, n
so that they are the differences between the observed values of the dependent variable yi and the corresponding fitted values ŷi (a short computational sketch follows this list).
• A property of the residuals: Σ ei = 0.
• Residual analysis can be used to
– identify data points that are outliers,
– check whether the fitted model is appropriate,
– check whether the error variance is constant, and
– check whether the error terms are normally distributed.
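A minimal sketch of these residual computations on a small hypothetical data set; the coefficients below are the least squares values for that data set, so the residuals sum to (about) zero.

```python
# Sketch of the residual definitions above, on a small hypothetical data set.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.1, 3.9])

# least squares fit for this small data set: beta0_hat = 1.1, beta1_hat = 0.95
y_hat = 1.1 + 0.95 * x        # fitted values
e = y - y_hat                 # residuals e_i = y_i - y_hat_i

print(e)                      # [-0.05, 0.1, -0.05]
print(e.sum())                # sums to (about) zero for a least squares fit
```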
12.7.1 Residual Analysis Methods(2/7)
• A nice random scatter of points, such as the residual plot in Figure 12.45, indicates that there are no problems with the regression analysis.
• Any patterns in the residual plot, or any residuals with a large absolute value, alert the experimenter to possible problems with the fitted regression model.
12.7.1 Residual Analysis Methods(3/7)
• A data point (xi, yi) can be considered to be an outlier if it does not appear to be predicted well by the fitted model.
• Residuals of outliers have a large absolute value, as indicated in Figure 12.46. Note in the figure that ei/σ̂ is used instead of ei.
• [For your interest only]
Var(ei) = σ² ( 1 - 1/n - (xi - x̄)²/SXX )
12.7.1 Residual Analysis Methods(4/7)
• If the residual plot shows positive and negative residuals grouped together as in Figure 12.47, then a linear model is not appropriate. As Figure 12.47 indicates, a nonlinear model is needed for such a data set.
12.7.1 Residual Analysis Methods(5/7)
• If the residual plot shows a “funnel shape” as in Figure 12.48, so that the size of the residuals depends upon the value of the explanatory variable x, then the assumption of a constant error variance σ2 is not valid.
12.7.1 Residual Analysis Methods(6/7)
• A normal probability plot (a normal score plot) of the residuals can be used to check whether the error terms εi appear to be normally distributed.
• The normal score of the i-th smallest residual is Φ⁻¹( (i - 3/8) / (n + 1/4) ) (see the sketch after this list).
• If the main body of the points in the normal probability plot lies approximately on a straight line, as in Figure 12.49, then the normality assumption is reasonable.
• A shape such as the one in Figure 12.50 indicates that the distribution is not normal.
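A sketch of the normal scores calculation; the plotting position (i - 3/8)/(n + 1/4) is a common convention and is our reading of the formula on the slide, so treat it as an assumption.

```python
# Sketch of a normal probability (normal score) plot of the residuals.
# The plotting position (i - 3/8)/(n + 1/4) is an assumed convention.
import numpy as np
from scipy import stats

def normal_scores(residuals):
    e = np.sort(np.asarray(residuals))                 # ordered residuals
    n = len(e)
    i = np.arange(1, n + 1)
    scores = stats.norm.ppf((i - 3.0 / 8.0) / (n + 1.0 / 4.0))
    return scores, e   # plot e against scores; a straight line supports normality

scores, ordered = normal_scores([-0.05, 0.10, -0.05, 0.02, -0.02])
print(np.round(scores, 3))
print(ordered)
```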
12.7.1 Residual Analysis Methods(7/7)
12.7.2 Examples(1/2)
• Example : Nile River Flowrate
12.7.2 Examples(2/2)
For the data point with xi = 3.88 and yi = 4.01:
ŷi = -0.470 + (0.836 × 3.88) = 2.77
ei = yi - ŷi = 4.01 - 2.77 = 1.24
ei/σ̂ = 1.24/√0.1092 = 3.75
For the data point with xi = 6.13 and yi = 5.67:
ei = yi - ŷi = 5.67 - ( -0.470 + (0.836 × 6.13) ) = 1.02
ei/σ̂ = 1.02/√0.1092 = 3.07
12.8 Variable Transformations
12.8.1 Intrinsically Linear Models(1/4)
12.8.1 Intrinsically Linear Models(2/4)
12.8.1 Intrinsically Linear Models(3/4)
12.8.1 Intrinsically Linear Models(4/4)
12.8.2 Examples(1/5)
• Example : Roadway Base Aggregates
12.8.2 Examples(2/5)
12.8.2 Examples(3/5)
12.8.2 Examples(4/5)
12.8.2 Examples(5/5)
12.9 Correlation Analysis
12.9.1 The Sample Correlation Coefficient
Sample Correlation Coefficient
The sample correlation coefficient r for a set of paired data observations (xi, yi) is
r = SXY / √( SXX SYY )
  = Σ (xi - x̄)(yi - ȳ) / √( Σ (xi - x̄)² Σ (yi - ȳ)² )
  = ( Σ xi yi - n x̄ ȳ ) / √( (Σ xi² - n x̄²)(Σ yi² - n ȳ²) )
It measures the strength of the linear association between two variables and can be thought of as an estimate of the correlation ρ between the two associated random variables X and Y.
Under the assumption that the X and Y random variables have a bivariate normal distribution, a test of the null hypothesis
H0: ρ = 0
can be performed by comparing the t-statistic
t = r √(n - 2) / √(1 - r²)
with a t-distribution with n - 2 degrees of freedom. In a regression framework, this test is equivalent to testing H0: β1 = 0.
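A sketch of the sample correlation coefficient and the test of H0: ρ = 0, on hypothetical data; the variable names are ours.

```python
# Sketch of the sample correlation coefficient and the t-test above (hypothetical data).
import math
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
n = len(x)

S_XY = np.sum((x - x.mean()) * (y - y.mean()))
S_XX = np.sum((x - x.mean()) ** 2)
S_YY = np.sum((y - y.mean()) ** 2)

r = S_XY / math.sqrt(S_XX * S_YY)                        # sample correlation coefficient
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)    # t-statistic for H0: rho = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)          # two-sided p-value

print(r, t_stat, p_value)
```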
12.9.2 Examples(1/1)
• Example : Nile River Flowrate
r = √R² = √0.871 = 0.933