Regression
• Correlation measures the strength of the linear relationship
• Great! But what is that relationship? How do we describe it?
– regression, regression line, regression equation
• Regression line is used for prediction
Predicting weights from heights
• Independent variable: height
• Dependent variable: weight
• How can we predict one from the other?
• Regression is to a scatter plot as the mean is to a histogram.
Weights vs. Heights
[Figure: Salary by years employed. Scatter plot of SALARY (20000 to 70000) against YRS EM (-5 to 30).]
Regression by local averages
Approximation of local averages by regression line
Inappropriate use of regression line (use other methods)
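Regression by local averages can be sketched in a few lines. The example is in Python rather than S so that it is self-contained. The years and salaries are hypothetical (not the data behind the slides), and the function name `local_averages` and the 5-year bin width are illustrative choices only.

```python
from collections import defaultdict


def local_averages(xs, ys, bin_width):
    """Average y within each x-bin of the given width (the 'graph of averages')."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[int(x // bin_width)].append(y)
    # Key each bin by its left edge, value is the average y in that bin.
    return {b * bin_width: sum(v) / len(v) for b, v in sorted(groups.items())}


# Hypothetical data: years employed and salary.
years = [1, 2, 3, 6, 7, 8, 11, 12, 14]
salary = [30000, 32000, 31000, 36000, 38000, 37000, 42000, 44000, 43000]

averages = local_averages(years, salary, bin_width=5)
for left_edge, avg in averages.items():
    print(f"{left_edge} to {left_edge + 5} years: average salary {avg:.0f}")
```

When the local averages fall roughly on a straight line, as they do in this made-up example, a regression line is a reasonable smooth summary of them. When they follow a curve, it is not.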
The equation of a line
• a represents the y-intercept
– when x equals zero, y equals a
– Is this always meaningful in the context of a problem?
– Is it always useful in defining a line?
• b represents the slope of the line (rise/run)
– for every unit change in x, y changes by b.
– Does this mean that if we physically change x by one unit, y will change by b units? Say we gain another year of experience. Will our salary go up by 1107?
$y = a + bx$
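A small sketch of how the intercept and slope are read, in Python rather than S. The slope 1107 is the figure quoted in the question above. The intercept 28394 is an assumption, taken from what appears to be the fitted salary line elsewhere in the slides. The function name `predict` is hypothetical.

```python
def predict(a, b, x):
    """Value on the line y = a + b*x."""
    return a + b * x


A = 28394.0  # assumed intercept: predicted salary at 0 years employed
B = 1107.0   # slope: difference in predicted salary per extra year

for years in (0, 1, 10):
    print(f"{years} years -> predicted salary {predict(A, B, years):.0f}")

# Two groups whose experience differs by one year differ by B in predicted salary.
difference = predict(A, B, 11) - predict(A, B, 10)
print(difference)
```

The slope compares the predicted averages for groups of people whose experience differs by one year. On its own it does not say what will happen to one person's salary after another year of work.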
Regression equation
• What is the predicted weight of somebody whose height is h cm?
• w = intercept + slope × h
• This is known as the regression equation.
• How do we get this formula?
• We have a statistical model
[Figure: Salary by years employed (SALARY against YRS EM), with the fitted line $y = 28394 + 1107x$ and one residual marked.]
Regression line by minimising residual errors
$y_i = a + b x_i + \epsilon_i$
• $\epsilon_i$ = error of i-th obs. from regression line
• The best candidate line will minimise these errors
• No line can make all errors vanish (some +ve, some -ve)
• Minimise the sum of squared errors, $\sum_i \epsilon_i^2$
• Minimising gives the regression line
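A minimal sketch of the least-squares idea, in Python rather than S. It uses the standard closed-form solution, b = Sxy / Sxx and a = mean(y) - b × mean(x), and then checks that nudging the line away from the fit increases the sum of squared errors. The five data points and the function names `fit_line` and `sse` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sse(a, b, xs, ys):
    """Sum of squared errors of the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, roughly linear.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"SSE at the fit:        {sse(a, b, xs, ys):.4f}")
print(f"SSE, intercept + 0.1:  {sse(a + 0.1, b, xs, ys):.4f}")
print(f"SSE, slope + 0.1:      {sse(a, b + 0.1, xs, ys):.4f}")
print(f"sum of residuals:      {sum(residuals):.1e}")
```

The residuals are a mix of positive and negative values that sum to zero (up to rounding), which is why the errors are squared before being added up.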
Regression and correlation
• Want to predict weight for those people who are 1 SD more than avg. height.
• SD line says:
– pred. wt. = overall avg. wt. + SD of wt.
• Regression line says:
– Predicted wt. = overall avg. wt. + r × SD of wt.
• For people who are k SDs away from avg. height:
– Predicted wt. = overall avg. wt. + r × k × SD of wt.
• Clearly valid for r = 0 or r = 1
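The two prediction rules can be compared side by side, in Python rather than S. All summary statistics below (average height 170 cm with SD 10, average weight 70 kg with SD 12, r = 0.5) are hypothetical, as are the function names.

```python
def regression_prediction(avg_x, sd_x, avg_y, sd_y, r, x):
    """Predicted y for a given x: move r * k SDs of y, where k is x in SD units."""
    k = (x - avg_x) / sd_x
    return avg_y + r * k * sd_y


def sd_line_prediction(avg_x, sd_x, avg_y, sd_y, x):
    """SD line: move k SDs of y for k SDs of x (ignores r)."""
    k = (x - avg_x) / sd_x
    return avg_y + k * sd_y


# Hypothetical summary statistics.
AVG_HT, SD_HT = 170.0, 10.0   # cm
AVG_WT, SD_WT = 70.0, 12.0    # kg
R = 0.5

height = 180.0  # 1 SD above average height
print(sd_line_prediction(AVG_HT, SD_HT, AVG_WT, SD_WT, height))
print(regression_prediction(AVG_HT, SD_HT, AVG_WT, SD_WT, R, height))

# Limiting cases: r = 0 predicts the overall average, r = 1 matches the SD line.
print(regression_prediction(AVG_HT, SD_HT, AVG_WT, SD_WT, 0.0, height))
print(regression_prediction(AVG_HT, SD_HT, AVG_WT, SD_WT, 1.0, height))
```

Written in terms of a line, the same rule gives slope = r × (SD of wt.) / (SD of ht.).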
RMS error of regression
• RMS error = $\sqrt{1 - r^2}$ × SD of y
• RMS error is inversely related to correlation: the closer |r| is to 1, the smaller the RMS error
• RMS error is to regression what SD is to average
Residuals
residual = observed - predicted
Example: ozone vs. temperature
> air[,c(1,3)]
  ozone temperature
   3.45          67
   3.30          72
   2.29          74
   2.62          62
   2.84          65
  . . .
> cor(ozone,temperature)
[1] 0.7531038
Fitting a regression model in S
> ozone.lm <- lm(ozone ~ temperature, data = air)
Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)   -2.23       0.46   -4.82   0.0000
temperature    0.07       0.01   11.95   0.0000
Multiple R-Squared: 0.5672
> var(ozone)
[1] 0.7928069
> var(resid(ozone.lm))
[1] 0.3431544
> cor(ozone,temperature)
[1] 0.7531038
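The three numbers printed above fit together: the residual variance should equal (1 - r²) times the variance of ozone, and r² should match the Multiple R-Squared. A quick arithmetic check, in Python rather than S, using only the values shown in the output (the variable names are illustrative):

```python
import math

# Values copied from the S output above.
var_ozone = 0.7928069
var_resid = 0.3431544
r = 0.7531038

r_squared = r * r
implied_var_resid = (1 - r_squared) * var_ozone

print(f"r^2                    = {r_squared:.4f}")
print(f"(1 - r^2) * var(ozone) = {implied_var_resid:.7f}")
print(f"var(resid)             = {var_resid:.7f}")

# RMS error of the regression, compared with the SD of ozone.
print(f"SD of ozone            = {math.sqrt(var_ozone):.3f}")
print(f"RMS error              = {math.sqrt(var_resid):.3f}")
```

The squared correlation agrees with the reported R-squared of 0.5672, and the implied residual variance agrees with `var(resid(ozone.lm))` to about six decimal places.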
Checking model appropriateness
• What assumptions have we made in the regression model?
Checking model assumptions in S-plus
> par(mfrow=c(2,3))
> plot(ozone.lm)
[Figure: Residual diagnostics for ozone data. Six panels: Residuals vs. Fitted : temperature; sqrt(abs(Residuals)) vs. fits; ozone vs. Fitted : temperature; Residuals vs. Quantiles of Standard Normal; Fitted Values and Residuals vs. f-value; Cook's Distance vs. Index. Observations 23, 45 and 77 are labelled in the residual panels, and 17, 20 and 77 in the Cook's Distance panel.]
Pizza party at the Frat.
• How many laps would you predict a pledge could run if he ate 6 slices of pizza?
• How many laps if he ate 9 slices of pizza?
• A pledge shows off and eats 35 slices of pizza. How many laps would you predict he would run?
[Figure: DISTANCE (2 to 20) against SLICES (0 to 12), with fitted line $y = 20 - 1.5x$, $r = -0.965$.]
Beware of extrapolation
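A sketch of why the 35-slice question is a trap, in Python rather than S. It assumes the fitted line is y = 20 - 1.5x and that the observed slices run from 0 to 12. Both are read off the figure annotation and axis, so treat them as assumptions. The function names are hypothetical.

```python
INTERCEPT = 20.0          # assumed: predicted laps at 0 slices
SLOPE = -1.5              # assumed: change in predicted laps per slice
X_MIN, X_MAX = 0.0, 12.0  # assumed range of slices in the data


def predict_laps(slices):
    """Value on the assumed regression line."""
    return INTERCEPT + SLOPE * slices


def within_observed_range(slices):
    """True if the prediction is an interpolation, not an extrapolation."""
    return X_MIN <= slices <= X_MAX


for slices in (6, 9, 35):
    note = "" if within_observed_range(slices) else "  (extrapolation: do not trust)"
    print(f"{slices} slices -> {predict_laps(slices)} laps{note}")
```

Under these assumptions, 6 and 9 slices give 11 and 6.5 laps, while 35 slices gives -32.5 laps. A negative number of laps is impossible, which shows that the line only describes the range of the data it was fitted to.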