Correlation and Simple Linear Regression
Introduction to Biostatistics, Harvard Extension School, Spring 2007
© Scott Evans, Ph.D. and Lynne Peeples, M.S.
Correlation
Used to assess the relationship between two continuous variables. Example: weight and number of hours of exercise per week.
Correlation
Scatterplots are useful graphical tools to visualize the relationship between two continuous variables (i.e., y vs. x). They are informative and easy to understand.
Correlation: Exercise and Weight example…
[Scatterplot of weight (y) vs. exercise (hrs/wk, x).] How do we determine whether this represents correlation?
Correlation
Range: –1 (perfect negative correlation) to +1 (perfect positive correlation)
Correlation
Correlation of –1 means that all of the data lie in a straight line with some negative slope.
Correlation of 1 means that all of the data lie in a straight line with some positive slope.
Correlation of 0 means that the variables are not correlated. (A plot of the data would reveal a cloud with no discernible pattern.)
Correlation
One way to visualize correlation… [Scatterplot of y vs. x divided into four quadrants, labeled I, II, III, and IV.]
Correlation
[Two scatterplots, Example 1 and Example 2, each divided into quadrants I–IV.] Which is more correlated?
Correlation
Exercise and Weight example: [scatterplot of weight vs. exercise (hrs/wk), divided into quadrants I–IV]. We can see more data in quadrants II and IV, indicating negative correlation.
Correlation Coefficient
Measures the strength of the linear relationship
ρ = the population correlation coefficient (parameter)
r = the sample correlation coefficient (statistic)
Correlation Coefficient
Can conduct hypothesis tests for ρ using the t distribution; H0: ρ = 0 is often tested. Confidence intervals for ρ may also be obtained.
Correlation Coefficient
Two major correlation coefficients:
1. Pearson correlation: parametric; sensitive to extreme observations.
2. Spearman correlation: nonparametric; based on ranks; robust to extreme observations and thus recommended particularly when “outliers” are present.
Pearson Correlation
Population parameter:
$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\,\sigma_Y}$$
Sample statistic:
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\!\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
(the average of the products of the standardized x and y deviations).
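A minimal sketch of the sample formula in Python (the exercise/weight numbers below are made up for illustration, not course data):

import numpy as np

exercise = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)                  # hypothetical hrs/week
weight = np.array([200, 195, 190, 185, 182, 178, 174, 170], dtype=float)    # hypothetical lbs

x_dev = exercise - exercise.mean()
y_dev = weight - weight.mean()
r = (x_dev * y_dev).sum() / np.sqrt((x_dev**2).sum() * (y_dev**2).sum())

print(round(r, 3))                                     # manual Pearson r
print(round(np.corrcoef(exercise, weight)[0, 1], 3))   # same value via numpy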
Spearman Correlation
Nonparametric version: plug in ranks rather than raw values into the sample statistic:
$$r_s = \frac{\sum_{i=1}^{n}(x_{r_i}-\bar{x}_r)(y_{r_i}-\bar{y}_r)}{\sqrt{\sum_{i=1}^{n}(x_{r_i}-\bar{x}_r)^2\,\sum_{i=1}^{n}(y_{r_i}-\bar{y}_r)^2}}$$
where $x_{r_i}$ and $y_{r_i}$ denote the ranks of $x_i$ and $y_i$, and $\bar{x}_r$, $\bar{y}_r$ are the mean ranks.
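A minimal sketch showing that the Spearman coefficient is just the Pearson formula applied to ranks (hypothetical values; scipy is used only as a cross-check):

import numpy as np
from scipy import stats

x = np.array([77, 92, 55, 86, 63, 99, 71, 48], dtype=float)      # hypothetical values
y = np.array([118, 40, 150, 65, 130, 25, 95, 160], dtype=float)

rx = stats.rankdata(x)                 # ranks of x (average ranks for ties)
ry = stats.rankdata(y)
r_s = np.corrcoef(rx, ry)[0, 1]        # Pearson r computed on the ranks

print(round(r_s, 3))
print(round(stats.spearmanr(x, y)[0], 3))   # should agree with the manual result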
Correlation
Two different questions:
1. Is the correlation strong, moderate, or weak?
2. Is the correlation statistically significant? (P-value)
Rough guide for the magnitude of r (the same labels apply on the negative side): |r| = 1 is perfect, 0.8 to 1.0 strong, 0.5 to 0.8 moderate, 0.2 to 0.5 weak, and near 0 essentially none.
Correlation
Also note:
- Never extrapolate correlation beyond the observed range of values.
- Correlation does NOT necessarily imply causation.
- Correlation does NOT measure the magnitude of the regression slope.
- No correlation does NOT mean no relationship (there could be a non-linear relationship between x and y, as in the figure below).
[Figure: scatterplot of y vs. x following a curved, non-linear pattern.]
Correlation: DPT Example
Example: Diphtheria, pertussis, and tetanus (DPT) immunization rates and mortality. Is there any association between the proportion of newborns immunized and the level of mortality of children less than 5 years of age?
Let:
X = percent of infants immunized against DPT
Y = under-5 mortality (number of children under 5 dying per 1,000 live births)
Correlation: DPT Example
[Scatterplot: under-5 mortality (per 1,000 live births) on the vertical axis vs. percent immunized against DPT on the horizontal axis, n = 20 countries.] Is there a pattern?
Example point: Bolivia (77, 118), i.e., immunization = 77% and mortality rate = 118 per 1,000 live births.
Correlation: DPT Example
X = percent of infants immunized against DPT; Y = under-5 mortality (per 1,000 live births). Calculate the pieces we need for the sample correlation coefficient:
$$\bar{x} = \frac{1}{20}\sum_{i=1}^{20}x_i = 77.4\%, \qquad \bar{y} = \frac{1}{20}\sum_{i=1}^{20}y_i = 59.0 \text{ per 1,000 live births}$$
$$\sum_{i=1}^{20}(x_i-\bar{x})(y_i-\bar{y}) = -22{,}706$$
$$\sum_{i=1}^{20}(x_i-\bar{x})^2 = 10{,}630.8, \qquad \sum_{i=1}^{20}(y_i-\bar{y})^2 = 77{,}498$$
Correlation: DPT Example
And the correlation coefficient is:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}} = \frac{-22{,}706}{\sqrt{(10{,}630.8)(77{,}498)}} = -0.79$$
Thus there appears to be a negative association between immunization levels for DPT and infant mortality.
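Plugging the summary quantities from this slide into the formula, for example in Python:

import math

sum_xy = -22706.0      # sum of (xi - xbar)(yi - ybar)
sum_xx = 10630.8       # sum of (xi - xbar)^2
sum_yy = 77498.0       # sum of (yi - ybar)^2

r = sum_xy / math.sqrt(sum_xx * sum_yy)
print(round(r, 2))     # -0.79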
Correlation: DPT Example
Hypothesis test: testing for zero correlation is based on
$$\mathrm{se}(r) = \sqrt{\frac{1-r^2}{n-2}}$$
and the test statistic
$$T = \frac{r}{\mathrm{se}(r)} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \;\sim\; t_{n-2}$$
Correlation: DPT Example
The test of the null hypothesis of zero correlation is constructed as follows:
1. H0: ρ = 0
2. a. Ha: ρ > 0   b. Ha: ρ < 0   c. Ha: ρ ≠ 0
3. The test is carried out at the α level of significance.
4. Rejection rule:
a. Reject H0 if t > t_{n-2; 1-α}
b. Reject H0 if t < t_{n-2; α}
c. Reject H0 if t > t_{n-2; 1-α/2} or t < t_{n-2; α/2}
Correlation: DPT Example
Setting α at 5%, we have:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.79\,\sqrt{20-2}}{\sqrt{1-(-0.79)^2}} = -5.47$$
Since −5.47 < t_{18; 0.025} ≈ −2.101, we reject the null hypothesis at the 5% significance level. There seems to be a significant negative correlation between immunization levels and infant mortality. This points to an inverse relationship (as immunization levels rise, infant mortality decreases). Note that we cannot estimate from the correlation alone how much infant mortality would decrease if a country were to increase its immunization levels by 10%.
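A minimal sketch of this calculation, including the two-sided p-value (scipy supplies the t distribution):

import math
from scipy import stats

r, n = -0.79, 20
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p = 2 * stats.t.cdf(-abs(t), df=n - 2)          # two-sided p-value

print(round(t, 2))                              # about -5.47
print(round(stats.t.ppf(0.025, df=n - 2), 3))   # critical value t_{18; 0.025}, about -2.101
print(p)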
Correlation: DPT Example
STATA output (pairwise correlations, with the p-value for the test of zero correlation):

             |  immunize    under5
    ---------+--------------------
    immunize |    1.0000
      under5 |   -0.7911    1.0000
             |    0.0000              <- p-value for test
Correlation: DPT Example
Confidence intervals: The (1 − α)% confidence interval for ρ is based on
$$W = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right), \qquad W \;\approx\; N\!\left(\frac{1}{2}\ln\!\left(\frac{1+\rho}{1-\rho}\right),\ \frac{1}{n-3}\right).$$
The confidence interval for ρ is computed in two steps:
STEP 1. Compute a confidence interval (L, U) for the transformed parameter:
$$(L,\ U) = \left(W - \frac{z_{1-\alpha/2}}{\sqrt{n-3}},\ \ W + \frac{z_{1-\alpha/2}}{\sqrt{n-3}}\right)$$
STEP 2. Solve for ρ at each endpoint, transforming back to the correlation scale:
$$\rho_L = \frac{e^{2L}-1}{e^{2L}+1}, \qquad \rho_U = \frac{e^{2U}-1}{e^{2U}+1}$$
Thus, a (1 − α)% confidence interval for ρ is:
$$\left(\frac{e^{2L}-1}{e^{2L}+1},\ \ \frac{e^{2U}-1}{e^{2U}+1}\right)$$
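A minimal sketch of these two steps in Python, using r = −0.79 and n = 20 from this example; it gives an interval close to the one quoted on the next slide:

import math
from scipy import stats

r, n, alpha = -0.79, 20, 0.05
w = 0.5 * math.log((1 + r) / (1 - r))        # Step 1: Fisher transform of r
z = stats.norm.ppf(1 - alpha / 2)
L = w - z / math.sqrt(n - 3)
U = w + z / math.sqrt(n - 3)

lo = (math.exp(2 * L) - 1) / (math.exp(2 * L) + 1)   # Step 2: back-transform
hi = (math.exp(2 * U) - 1) / (math.exp(2 * U) + 1)
print(round(lo, 3), round(hi, 3))            # roughly (-0.913, -0.534)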
Correlation: DPT Example
The interval (-0.913, -0.534) does not contain zero (the hypothesized value under the null hypothesis in the previous test).
Therefore, we reject the hypothesis of zero correlation between immunization levels and infant mortality.
Note that this is not always the case as the hypothesis test and the confidence interval for correlation are based on different statistics.
Correlation: DPT Example
What if we had used the Spearman rank correlation instead?
- Rank the percentages and mortalities across all 20 countries.
- If all ranks match up perfectly (i.e., a country's immunization % and mortality rate have the same ranking), then r_s = 1.
- If all ranks are perfectly inversely related, then r_s = −1.
- Here r_s = −0.54 (smaller in magnitude, likely due to some non-normality/outliers in the data).
- t_s = −2.72; we still reject at the 5% significance level.
Correlation: Assay Comparison
Two different assays to measure LDL cholesterol
Correlated? Agreement?
One step beyond correlation…
Simple Linear Regression
Correlation gives us a measure of the association between two variables, but NOT the magnitude.
With simple linear regression, we can now put a value on this magnitude, e.g., what is the average decrease in weight for every additional workout hour per week?
Hypothesis tests and CIs may be obtained for the magnitude of association (i.e., the slope of the line).
One may make predictions of the effect of changes in x on y.
Simple Linear Regression
Again, we plot the data first (scatterplot): this enables us to see the relationship between the two variables.
We designate a “Response” variable and a “Predictor” (or “Explanatory”) variable, unlike correlation, where there is no distinction between the two variables.
If the relationship is non-linear (as evidenced by the plot), then a simple linear regression model is the wrong approach.
Simple Linear Regression
[Scatterplot of weight (y) vs. exercise (hrs/wk, x).] Already determined to be correlated; looks like a linear trend.
Simple Linear Regression
[Same scatterplot of weight vs. exercise (hrs/wk) with a line drawn through the points.] Slope = rise/run = Δy/Δx ≈ 10 lbs per 1 hour (here, weight falls by roughly 10 lbs for each additional hour of exercise per week, given the negative trend).
Simple Linear Regression
Used to describe and estimate the relationship between two continuous variables.
We can attempt to characterize this relationship with two parameters:
1. an intercept, and
2. a slope
Several assumptions are made.
Simple Linear Regression
The model: yi = β0 + β1xi
- yi is the dependent variable (e.g., weight)
- xi is the independent variable (the predictor, e.g., exercise)
- β0 is the y-intercept (also written as α)
- β1 is the slope of the line (or just β). It describes how steep the line is and which way it leans. It is also the “effect” (e.g., a treatment effect) on y of a 1-unit change in x. Note that if there were no association between x and y, then β1 would be close to 0 (i.e., x has no effect on y).
Simple Linear Regression
[Four panels of example lines, labeled I–IV:]
I. Both lines have the same intercept but different positive slopes.
II. Both lines have the same positive slope (they are parallel) but different intercepts.
III. Both lines have the same intercept but different negative slopes.
IV. Both lines have the same negative slope (parallel) but different intercepts.
Simple Linear Regression
Constant slope: for a fixed increase in x, there will be a fixed change in y regardless of the value of x (a fixed increase if the slope is positive, a fixed decrease if the slope is negative). This is in contrast to a non-linear relationship, such as a quadratic or polynomial, where for some values of x, y will be increasing, and for some other values y will be decreasing (or vice versa).
[Figure: a straight line with equal Δy for each Δx.]
Simple Linear Regression
For example, even though it seems that y may be increasing for increasing x, the relationship is not a perfect line (many possible “best” lines).
[Scatterplot of y vs. x with no single obvious line through the points.]
Method of Least Squares
Use the method of least squares to find the best-fitting line (minimizing the distances from the data points).
The best guess without knowing the x's is the mean of y, ȳ.
[Scatterplot of y vs. x with the horizontal line y = ȳ.]
Least Squares Line: Finding “Y-hat”
Does knowing the x's allow us a better line? Try one:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
[Scatterplot with a candidate fitted line.]
Least Squares Line: Finding “Y-hat”
For each data point $(x_i, y_i)$, the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ gives a fitted value $\hat{y}_i$ at $x_i$.
[Figure: a single data point $(x_i, y_i)$ shown together with the fitted line.]
Least Squares Line: Finding “Y-hat”
Find the distance to the new line, $\hat{y}$, and to the ‘naïve’ guess, $\bar{y}$:
$$y_i - \hat{y}_i = e_i, \qquad \hat{y}_i - \bar{y}, \qquad y_i - \bar{y}$$
[Figure: these vertical distances drawn for the point $(x_i, y_i)$.]
Least Squares Line: Finding “Y-hat”
The regression line (whichever it is) will not pass through all of the data points $y_i$. Thus, in most cases, for each point $x_i$ the line will produce an estimated point $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, and most probably $\hat{Y}_i \neq Y_i$. In fact, as we see in the previous figure, $Y_i - \hat{Y}_i = e_i$. For each choice of $\hat{\beta}_0$ and $\hat{\beta}_1$ (note that each pair completely defines the line) we get a new line, and a whole new set of deviation terms $e_i$.
Least Squares Line: Finding “Y-hat”
The “best-fitting line” according to the least-squares method is the one that minimizes the sum of squared deviations:
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2$$
We desire to capture all of the structure in the data with the model (finding the systematic component), leaving only random errors. Thus, if we plotted the errors, we would hope that no discernible pattern exists. If a pattern exists, then we have not captured all of the systematic structure in the data, and we should look for a better model. More on this later!
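A minimal sketch of the standard closed-form solution to this minimization, applied to hypothetical exercise/weight data (not course data):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)                     # hypothetical predictor
y = np.array([200, 195, 190, 185, 182, 178, 174, 170], dtype=float)     # hypothetical response

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()   # slope estimate
b0 = y.mean() - b1 * x.mean()                                               # intercept estimate
residuals = y - (b0 + b1 * x)

print(round(b0, 2), round(b1, 2))
print(round((residuals**2).sum(), 2))   # the minimized sum of squared deviations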
Least Squares Line: Finding “Y-hat”
For each point, the total deviation from the mean splits into two pieces:
$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i), \qquad y_i - \hat{y}_i = e_i$$
Summing the squares over all points:
$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \underbrace{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}_{\text{variability due to regression}} + \underbrace{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}_{\text{unexplained variability}}$$
[Figure: these deviations drawn for a single point $(x_i, y_i)$.]
Variability & “Sum of Squares”
$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \underbrace{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}_{\text{variability due to regression}} + \underbrace{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}_{\text{unexplained variability}}$$
$$SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$$ is the part of the total variability that can be explained by the regression.
$$SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$ is the part left unexplained, because the model cannot account for differences between the estimated points and the data (this is called the error sum of squares).
$$SSY = SSR + SSE$$
Degrees of Freedom & “Mean Squares”
SSY is made up of n terms $(Y_i - \bar{Y})^2$. Once the mean has been estimated, only n − 1 terms are needed to compute SSY (i.e., the degrees of freedom of SSY are n − 1).
SSR has only one degree of freedom, since it is computed from a single estimated function of the data, $\hat{\beta}_1$:
$$SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 = \hat{\beta}_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2$$
SSE has the remaining n − 2 degrees of freedom.
Degrees of Freedom & “Mean Squares”
1. $$S_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \frac{SSY}{n-1}$$
is the variability of Y without consideration of X.
2. $$S_{Y|X}^2 = \frac{1}{n-2}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \frac{SSE}{n-2} = MSE$$
is the residual variability after the relationship of Y and X has been accounted for. Unless ρ = 0, this will always be less than $S_Y^2$.
Analysis of Variance Table

Source of variability | Sums of squares (SS)  | Df    | Mean squares (MS) | F           | Prob > F
----------------------+-----------------------+-------+-------------------+-------------+----------------------
Model                 | SSR = Σ (Ŷi − Ȳ)²     | 1     | MSR = SSR / 1     | F = MSR/MSE | P = P(F > F_{1, n-2})
Residual (error)      | SSE = Σ (Yi − Ŷi)²    | n − 2 | MSE = SSE/(n − 2) |             |
Corrected Total       | SSY = Σ (Yi − Ȳ)²     | n − 1 |                   |             |

Note that when the numerator d.f. = 1, F is simply the square of the t statistic (F = t²).
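A sketch of how the pieces of this table can be assembled by hand for a fitted simple linear regression (hypothetical data; scipy is used only for the F tail probability):

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([200, 195, 190, 185, 182, 178, 174, 170], dtype=float)
n = len(y)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SSR = ((y_hat - y.mean())**2).sum()    # model sum of squares, df = 1
SSE = ((y - y_hat)**2).sum()           # error sum of squares, df = n - 2
MSR, MSE = SSR / 1, SSE / (n - 2)
F = MSR / MSE
p = 1 - stats.f.cdf(F, 1, n - 2)       # Prob > F

print(round(SSR, 2), round(SSE, 2), round(F, 2), p)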
Assumptions
Assumptions of the linear regression model:
1. For each value of x, the y values are distributed according to a normal distribution with mean $\mu_{y|x}$ and variance $\sigma_{y|x}^2$ that is unknown.
2. The relationship between X and Y is given by the formula $\mu_{y|x} = \beta_0 + \beta_1 x$.
3. The y values are independent.
4. For every value x, the standard deviation of the outcomes y is constant and equal to $\sigma_{y|x}$. This concept is called homoscedasticity.
Hypothesis Testing
In these models, y is our target (or dependent variable, the outcome that we cannot control but want to explain) and x is the explanatory (or independent variable).
Within each regression, the primary interest is the assessment of the existence of the linear relationship between x and y. If such an association exists, then x provides information about y.
Hypothesis Testing
Inference on the existence of the linear association is accomplished via tests of hypotheses and confidence intervals. Both of these center around the estimate of the slope, β1, since it is clear that if the slope is zero, then changing x will have no impact on y (thus there is no association between x and y).
Hypothesis Testing
The test of the hypothesis of no linear association is defined as follows:
1. H0: No linear association exists between x and y.
2. Ha: A linear association exists between x and y.
3. The test is carried out at the α level of significance.
4. The test statistic is F = MSR/MSE.
5. Rejection rule: Reject H0 if F > F_{1, n-2; α}. This will happen if F is far from one. (Note that when MSR = MSE, F = 1 and we gain no information from the regression line beyond just the mean of y.)
Hypothesis Testing
The F test of linear association tests whether a line, other than the horizontal one going through the sample mean of the Y's, is useful in explaining some of the variability in the data.
This is based on the fact that if the population regression slope β1 = 0, that is, if the regression does not add anything new to our understanding of the data (does not explain a substantial part of the variability), then the two mean squares MSR and MSE estimate a common quantity (the population variance σ²).
Thus the ratio should be close to 1 if there is no linear association between X and Y. On the other hand, if a linear relationship exists (β1 is far from zero), then MSR will tend to be much larger than MSE and the ratio will deviate significantly from 1.
Hypothesis Testing
The test of the hypothesis of zero slope is defined as follows (for the two-sided alternative it is equivalent to the F-test of linear association):
1. H0: No linear association between x and y: β1 = 0.
2. Ha: A linear association exists between x and y:
a. β1 ≠ 0 (two-sided test)
b. β1 > 0 (one-sided test)
c. β1 < 0 (one-sided test)
3. The test is carried out at the α level of significance.
Hypothesis Testing
4. The test statistic is
$$T = \frac{\hat{\beta}_1}{\mathrm{s.e.}(\hat{\beta}_1)}$$
which is distributed as a t distribution with n − 2 d.f.
5. Rejection rule: Reject H0, in favor of the three alternatives respectively, if
a. t < t_{n-2; α/2} or t > t_{n-2; 1-α/2}
b. t > t_{n-2; 1-α}
c. t < t_{n-2; α}
Confidence Intervals for β1
Constructed as usual, based on $\hat{\beta}_1$, the standard error of $\hat{\beta}_1$, and the t statistic.
A (1 − α)% confidence interval is as follows:
$$\left(\hat{\beta}_1 - t_{n-2;\,1-\alpha/2}\,\mathrm{s.e.}(\hat{\beta}_1),\ \ \hat{\beta}_1 + t_{n-2;\,1-\alpha/2}\,\mathrm{s.e.}(\hat{\beta}_1)\right)$$
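A sketch of the slope t test and confidence interval, using the standard formula s.e.(β̂1) = sqrt(MSE / Σ(xi − x̄)²) (a formula not shown on the slides but standard for simple linear regression), again on hypothetical data:

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([200, 195, 190, 185, 182, 178, 174, 170], dtype=float)
n, alpha = len(y), 0.05

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

mse = (resid**2).sum() / (n - 2)
se_b1 = np.sqrt(mse / ((x - x.mean())**2).sum())     # standard error of the slope
t = b1 / se_b1
p = 2 * stats.t.cdf(-abs(t), df=n - 2)               # two-sided p-value
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)        # 95% CI for the slope

print(round(t, 2), p, tuple(round(v, 3) for v in ci))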
Simple Linear Regression: Newborn Infant Length Example
Consider the problem described in Section 18.4 of the textbook. Is there a relationship between the length and the gestational age of newborn infants?
[Scatterplot of length (centimeters) vs. gestational age (weeks), roughly 22 to 36 weeks.]
Simple Linear Regression: Newborn Infant Length Example
Source | SS df MS Number of obs = 100
---------+------------------------------ F( 1, 98) = 82.13
Model | 575.73916 1 575.73916 Prob > F = 0.0000
Residual | 687.02084 98 7.01041674 R-squared = 0.4559
---------+------------------------------ Adj R-squared = 0.4504
Total | 1262.76 99 12.7551515 Root MSE = 2.6477
------------------------------------------------------------------------------
length | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
gestage | .9516035 .1050062 9.062 0.000 .7432221 1.159985
_cons | 9.3281740 3.0451630 3.063 0.003 3.285149 15.3712
STATA output for the regression of length on gestational age. Key pieces of the output:
Degrees of freedom. There is one degree of freedom associated with the model, and n-2=98 degrees of freedom comprising the residual.
F test. This is the overall test of linear association. Note that the numerator degrees of freedom is the model degrees of freedom, while the denominator degrees of freedom is the residual (error) degrees of freedom. This test measures deviations from one.
Rejection rule of the F test. Since the p-value is 0.000<0.05=α, we reject the null hypothesis. There is evidence to suggest a linear association between gestational age and length of the newborn.
Root MSE. This is the square root of the mean square error (MSE) and, as before, can be used as an estimate of the population standard deviation of the outcomes about the regression line.
gestage is the estimate of the slope, β̂1 = 0.9516. This means that the fetus grows by about 0.95 centimeters for every additional week of gestation.
p-value of the t-test. Since 0.000<0.05, we reject the null hypothesis. There is a strong positive linear relationship between gestational age and length of the newborn.
The 95% confidence interval for β1 is [0.743, 1.160]. Since it excludes 0, we reject the null hypothesis and conclude that there is a strong positive linear relationship between gestational age and length of newborn.
The estimate of the intercept, _cons = 9.328, implies that the fetus is 9.3 centimeters long at conception (week zero), which is clearly false. Note, however, that our model is only valid over the range in which our data have been collected (here, roughly 22 to 36 weeks of gestation).
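For readers working in Python rather than Stata, a minimal sketch of fitting the same kind of model with statsmodels (the gestage/length numbers below are hypothetical, not the textbook's 100 observations):

import numpy as np
import statsmodels.api as sm

gestage = np.array([25, 27, 28, 29, 31, 32, 34, 36], dtype=float)   # weeks, hypothetical
length = np.array([33, 35, 36, 37, 39, 40, 42, 44], dtype=float)    # centimeters, hypothetical

X = sm.add_constant(gestage)       # adds the intercept column
fit = sm.OLS(length, X).fit()      # ordinary least squares
print(fit.summary())               # F test, coefficients, standard errors, t, p, 95% CIs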
Correlation & Regression: Final Notes
1. Correlation does not measure the magnitude of the slope. Even though
$$\hat{\beta}_1 = r\,\frac{S_Y}{S_X} \qquad \text{(equivalently, } r = \hat{\beta}_1\,\frac{S_X}{S_Y}\text{),}$$
the size of the slope also depends on the variability in the X and Y.
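A quick numerical check of this identity on hypothetical data:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([200, 195, 190, 185, 182, 178, 174, 170], dtype=float)

r = np.corrcoef(x, y)[0, 1]
b1_from_r = r * y.std(ddof=1) / x.std(ddof=1)            # beta1-hat = r * S_Y / S_X
b1_direct = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
print(round(b1_from_r, 4), round(b1_direct, 4))          # identical up to rounding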
Correlation & Regression: Final Notes
2. Correlation is not a measure of the appropriateness of the straight-line model.
Consider the case of a quadratic model. The correlation coefficient may be large, but a straight-line model may still be inappropriate (figure).
[Figure: scatterplot of Y vs. x following a curved, quadratic pattern.]
Correlation & Regression Final Notes
Beyond the limitations of the correlation coefficient just mentioned, there are additional advantages in using a regression model. Consider again the DPT immunization example:
The estimates of the slope and intercept are β̂1 = −2.136 and β̂0 = 224.32, respectively.

Source | SS df MS Number of obs = 20
---------+------------------------------ F( 1, 18) = 30.10
Model | 48497.0497 1 48497.0497 Prob > F = 0.0000
Residual | 29000.9503 18 1611.16391 R-squared = 0.6258
---------+------------------------------ Adj R-squared = 0.6050
Total | 77498.00 19 4078.84211 Root MSE = 40.139
------------------------------------------------------------------------------
under5 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
immunize | -2.135869 .3893022 -5.486 0.000 -2.953763 -1.317976
_cons | 224.3163 31.44034 7.135 0.000 158.2626 290.37
------------------------------------------------------------------------------
Correlation & Regression Final Notes
If an official wanted to estimate the decline in infant mortality for a 10% increase in immunization from current levels, they would only have to multiply 10 (= 10%) by the estimate of the slope, −2.136 (taking advantage of the fact that for a constant increase in x, y changes by a constant amount).
If DPT immunization rates were increased by 10%, the infant mortality would drop by (10)(-2.136)=-21.36 or about 21 infants per 1,000 live births. This would account for one-third the mortality rate of Egypt, or two-thirds the infant mortality rate in Russia. Taking advantage of the properties of a straight line and the results of the regression model, a concrete public health policy could be constructed.
Next Week… More Regression
Predicted values
Evaluation of the linear regression model (e.g., homoscedasticity, R²)
Extension to Multiple Linear Regression
Parallels with Analysis of Variance (ANOVA)