Posted on 21-Dec-2015
Lecture 20
• Simple linear regression (18.6, 18.9)
• Homework 5 is posted and is due next Tuesday at 3 p.m. (Note correction on question 4(e)).
• Regular office hours: Tuesday, 9-10, 12-1.
• Extra office hours: Today (after class), Monday, 10-11.
• Midterm 2: Wednesday, April 2nd, 6-8 p.m.
Point Prediction
• Example 18.7
– Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 18.2).
– A point prediction: $\hat{y} = 17{,}067 - 0.0623x = 17{,}067 - 0.0623(40{,}000) = 14{,}575$
– It is predicted that a 40,000-mile car would sell for $14,575.
– How close is this prediction to the real price?
Interval Estimates
• Two intervals can be used to discover how closely the predicted value will match the true value of y.
– Prediction interval – predicts y for a given value of x.
– Confidence interval – estimates the average y for a given x.
• The confidence interval:
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$
• The prediction interval:
$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$
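The two interval formulas can be wrapped in a small helper. This is a sketch in Python, not from the lecture; the function name and argument order are my own.

```python
import math

def interval(y_hat, t_crit, s_eps, n, x_g, x_bar, s_x2, prediction=False):
    """Confidence interval for the mean response at x_g, or (with
    prediction=True) the prediction interval for a single new y.

    s_x2 is the sample variance of x; the prediction interval adds the
    extra "1 +" term for the variability of an individual observation.
    """
    extra = 1.0 if prediction else 0.0
    half = t_crit * s_eps * math.sqrt(
        extra + 1.0 / n + (x_g - x_bar) ** 2 / ((n - 1) * s_x2))
    return y_hat - half, y_hat + half
```

Plugging in the Example 18.7 summary statistics reproduces the ±605 and ±70 half-widths worked out on the following slides.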
Interval Estimates, Example
• Example 18.7 – continued
– Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.
– Two types of predictions are required:
• A prediction for a specific car
• An estimate for the average price per car
Interval Estimates, Example
• Solution
– A prediction interval provides the price estimate for a single car:
$[17{,}067 - 0.0623(40{,}000)] \pm 1.984(303.1)\sqrt{1 + \dfrac{1}{100} + \dfrac{(40{,}000 - 36{,}009)^2}{(100-1)(43{,}528{,}690)}} = 14{,}575 \pm 605$
(where $t_{.025,98} \approx 1.984$)
Interval Estimates, Example
• Solution – continued
– A confidence interval provides the estimate of the mean price per car for a Ford Taurus with 40,000 miles on the odometer.
• The confidence interval (95%):
$[17{,}067 - 0.0623(40{,}000)] \pm 1.984(303.1)\sqrt{\dfrac{1}{100} + \dfrac{(40{,}000 - 36{,}009)^2}{(100-1)(43{,}528{,}690)}} = 14{,}575 \pm 70$
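The arithmetic on these two slides can be checked directly. A sketch using the summary statistics reported in Example 18.7:

```python
import math

# Summary statistics reported on the slides for Example 18.7
n, x_bar, s_x2 = 100, 36_009, 43_528_690
s_eps, t_crit, x_g = 303.1, 1.984, 40_000   # t_crit is t_.025,98 (approx.)

y_hat = 17_067 - 0.0623 * x_g               # point prediction
common = 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * s_x2)
pi_half = t_crit * s_eps * math.sqrt(1 + common)   # prediction-interval half-width
ci_half = t_crit * s_eps * math.sqrt(common)       # confidence-interval half-width
print(round(y_hat), round(pi_half), round(ci_half))  # 14575 605 70
```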
The effect of the given $x_g$ on the length of the interval
– As $x_g$ moves away from $\bar{x}$ the interval becomes longer. That is, the shortest interval is found at $x_g = \bar{x}$.
– At distance 1 from $\bar{x}$ the interval is $\hat{y}(\bar{x} \pm 1) \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{1^2}{(n-1)s_x^2}}$.
– At distance 2 from $\bar{x}$ it is wider still: $\hat{y}(\bar{x} \pm 2) \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\dfrac{1}{n} + \dfrac{2^2}{(n-1)s_x^2}}$.
[Figure: the regression line $\hat{y} = b_0 + b_1 x_g$ with interval bands that are narrowest at $\bar{x}$ and widen on either side.]
Caveat about Prediction
• Remember that predicting y from x with a regression is reliable only if x falls inside the range of the observed data.
• Extrapolation is dangerous.
Predicting Height Based on Age
[Figure: scatterplot of Height (centimeters) against Age (months), ages 17.5 to 30, with a linear fit.]
Linear Fit: Height = 64.928322 + 0.634965 Age
Summary of Fit: RSquare 0.988764; Root Mean Square Error 0.255964
Parameter Estimates:
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  64.928322   0.50841     127.71    <.0001
Age        0.634965    0.021405    29.66     <.0001
18.9 Regression Diagnostics - I
• The four conditions required for the validity of the simple linear regression analysis are:
– the mean of the error variable conditional on x is zero for each x.
– the error variable is normally distributed.
– the error variance is constant for all values of x.
– the errors are independent of each other.
• How can we diagnose violations of these conditions?
Residual Analysis
• Examining the residuals helps detect violation of the required conditions
• A residual plot is a scatterplot of the regression residuals against another variable, usually the independent variable or time.
• If the simple linear regression model holds, there should be no pattern in the residual plots.
• Don’t read too much into these plots. You’re looking for gross departures from a random scatter.
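As a sketch of what residual analysis computes (pure Python with made-up data; the helper name is my own):

```python
def least_squares(x, y):
    """Return (b0, b1), the intercept and slope minimizing squared error."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) \
        / sum((a - x_bar) ** 2 for a in x)
    return y_bar - b1 * x_bar, b1

# Made-up data for illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares(x, y)
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
# Plot residuals against x (or time); look only for gross departures
# from a patternless scatter around zero.
```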
Residual plot for utopia.jmp
• Utopia.jmp is a simulation from a simple linear regression model (all assumptions hold).
[Figure: plot of Residual (−2 to 2) against X (−10 to 10).]
Residual Plot for Example 18.2
[Figure: plot of Residual (−800 to 800) against Odometer (15,000 to 45,000).]
Detecting Curvature
• If the residual plot has a curved pattern, this indicates that the regression function is not a straight line.
• Transformations to deal with the problem of a curved rather than a straight-line regression function are covered later in the lecture.
Heteroscedasticity
• When the requirement of a constant variance is violated we have a condition of heteroscedasticity.
• Diagnose heteroscedasticity by plotting the residuals against the predicted y.
[Figure: residual plot against $\hat{y}$ in which the spread of the residuals increases with $\hat{y}$.]
Homoscedasticity
• When the requirement of a constant variance is not violated we have a condition of homoscedasticity.
• Example 18.2 – continued
[Figure: plot of Residuals (−1000 to 1000) against Predicted Price (13,500 to 16,000).]
Residual plot for cleaning.jmp
[Figure: plot of Residual (−20 to 20) against NumberOfCrews (0 to 15).]
Non Independence of Error Variables
– If the data were collected over time, they constitute a time series.
– When the residuals are examined over time, no pattern should be observed if the errors are independent.
– When a pattern is detected, the errors are said to be serially correlated (or autocorrelated).
– Serial correlation can be detected by graphing the residuals against time.
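A simple numerical companion to the time plot is the lag-1 autocorrelation of the residuals. This is a sketch; the function name and the toy residual sequences are my own:

```python
def lag1_autocorr(resid):
    """Sample lag-1 autocorrelation of a residual sequence.

    Values well above 0 suggest positive serial correlation (runs of
    same-signed residuals); values well below 0 suggest negative
    serial correlation (oscillation around zero).
    """
    n = len(resid)
    mean = sum(resid) / n
    num = sum((resid[t] - mean) * (resid[t - 1] - mean) for t in range(1, n))
    den = sum((r - mean) ** 2 for r in resid)
    return num / den

# Oscillating residuals give a strongly negative lag-1 autocorrelation
print(lag1_autocorr([1, -1, 1, -1, 1, -1, 1, -1]))  # -0.875
```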
Non Independence of Error Variables
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
[Figure, left: residuals against time showing runs of positive residuals followed by runs of negative residuals – positive serial correlation.]
[Figure, right: residuals against time oscillating around zero – negative serial correlation.]
Checking normality
• To check the normality of the error variable, draw a histogram of the residuals.
• Violation of normality only has a serious effect on confidence intervals and tests if the sample size is small (less than 30) and there is either strong skewness or outliers.
[Figure: histogram of the residuals, ranging from about −800 to 600.]
Outliers
• An outlier is an observation that is unusually small or large.
• Three types of outliers in scatterplots:
– Outlier in the x direction
– Outlier in the y direction
– Outlier relative to the overall pattern of the scatterplot (residual has large magnitude)
• Several possibilities need to be investigated when an outlier is observed:
– There was an error in recording the value.
– The point does not belong in the sample.
– The observation is valid.
• Identify outliers from the scatterplot.
Leverage and Influential Points
• An observation has high leverage if it is an outlier in the x direction.
• An observation is influential if removing it would markedly change the least squares line.
• Observations that have high leverage are influential if they do not fall very close to the least squares line for the other points.
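Leverage can be computed directly from the x values. A sketch using the standard simple-regression formula $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$; the data are made up:

```python
def leverage(x):
    """Leverage h_i = 1/n + (x_i - mean)^2 / sum of squared deviations."""
    n = len(x)
    x_bar = sum(x) / n
    ssx = sum((v - x_bar) ** 2 for v in x)
    return [1 / n + (v - x_bar) ** 2 / ssx for v in x]

x = [1, 2, 3, 4, 20]       # the last value is an outlier in the x direction
h = leverage(x)
# The x-outlier carries by far the most leverage; whether it is also
# influential depends on how close its y value falls to the line
# through the other points.
```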
[Figure, left – an outlier: an outlier in the y direction causes a shift in the regression line.]
[Figure, right – an influential observation: a high-leverage point pulls the least squares line toward itself; some outliers may be very influential.]
Regression of Brain Weight on Body Weight for 96 Mammals
Bivariate Fit of Brain weight By Body Weight
[Figure: scatterplot of Brain weight (0 to 4000) against Body Weight (0 to 2500).]
Transformations
• Suppose that the residual plot indicates curvature in the regression function. What do we do?
• One possibility: Transform x or transform y.
• Tukey’s Bulging Rule (see Handout).
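To see how a transformation can straighten a curved relationship, here is a sketch with made-up data whose true relationship is $y = \sqrt{x}$; the helper `r_squared` is my own:

```python
import math

def r_squared(x, y):
    """R-squared of the least squares fit of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) \
        / sum((a - x_bar) ** 2 for a in x)
    b0 = y_bar - b1 * x_bar
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - y_bar) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Made-up curved data: y grows like the square root of x
x = [1, 2, 4, 8, 16, 32, 64]
y = [math.sqrt(v) for v in x]
# Transforming x straightens the relationship and improves the fit
print(r_squared([math.sqrt(v) for v in x], y) > r_squared(x, y))  # True
```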
Transformation for display.jmp
• Y=Sales, X=Display Feet
• Y=Sales, X=Square Root of Display Feet/Log of Display Feet
[Figure, top: Residual (−100 to 100) against DisplayFeet (0 to 8).]
[Figure, bottom left: Residual (−100 to 100) against Square Root DisplayFeet (1 to 2.5).]
[Figure, bottom right: Residual (−50 to 100) against Log DisplayFeet (0 to 2).]
Predictions with Transformations
• Linear Fit
• Sales = -46.28718 + 154.90188 Square Root DisplayFeet
• For 5 display feet, the predicted average sales are $-46.28718 + 154.90188\sqrt{5} \approx 300.08$.
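The arithmetic can be verified directly:

```python
import math

# Fitted model from display.jmp: Sales on the square root of display feet
sales = -46.28718 + 154.90188 * math.sqrt(5)
print(round(sales, 2))  # 300.08
```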
18.6 Finance Application: Market Model
• One of the most important applications of linear regression is the market model.
• It is assumed that the rate of return on a stock (R) is linearly related to the rate of return on the overall market:
$R = \beta_0 + \beta_1 R_m + \varepsilon$
where R is the rate of return on a particular stock and $R_m$ is the rate of return on some major stock index.
• The beta coefficient ($\beta_1$) measures how sensitive the stock's rate of return is to changes in the level of the overall market.
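Beta is just the least squares slope of the stock's returns on the market's returns. A sketch with made-up return series; the function name is my own:

```python
def beta(stock, market):
    """Least squares slope of stock returns on market returns (market-model beta)."""
    n = len(stock)
    ms, mm = sum(stock) / n, sum(market) / n
    cov = sum((s - ms) * (m - mm) for s, m in zip(stock, market))
    var = sum((m - mm) ** 2 for m in market)
    return cov / var

# Made-up monthly returns: this stock moves twice as much as the market
market = [0.01, -0.02, 0.03, 0.00, -0.01]
stock = [2 * m for m in market]
print(beta(stock, market))  # 2.0
```

A beta near 0.89, as estimated for Nortel on the next slide, means the stock moves slightly less than one-for-one with the index.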
Example 18.6
Bivariate Fit of Nortel By TSE
[Figure: scatterplot of Nortel returns (−0.2 to 0.2) against TSE returns (−0.25 to 0.10) with a linear fit.]
Linear Fit: Nortel = 0.0128181 + 0.8876912 TSE
Summary of Fit: RSquare 0.313688; RSquare Adj 0.301855; Root Mean Square Error 0.063123
Parameter Estimates:
Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  0.0128181   0.008223    1.56      0.1245
TSE        0.8876912   0.172409    5.15      <.0001