Correlation and Simple Linear Regression
Introduction to Biostatistics, Harvard Extension School, Spring 2007
© Scott Evans, Ph.D. and Lynne Peeples, M.S.
Correlation
Used to assess the relationship between two continuous variables. Example: weight and number of hours of exercise per week.
Correlation
Scatterplots are useful graphical tools to visualize the relationship between two continuous variables (i.e., y vs. x). They are informative and easy to understand.
Correlation: Exercise and Weight example…
[Scatterplot of weight (y) vs. exercise (hrs/wk, x).] How do we determine whether this represents correlation?
Correlation
Range: –1 (perfect negative correlation) to +1 (perfect positive correlation)
Correlation
Correlation of –1 means that all of the data lie in a straight line with some negative slope.
Correlation of 1 means that all of the data lie in a straight line with some positive slope.
Correlation of 0 means that the variables are not correlated. (A plot of the data would reveal a cloud with no discernible pattern.)
Correlation
One way to visualize correlation… [Scatterplot of y vs. x divided into four quadrants, labeled I, II, III, and IV.]
Correlation
[Two scatterplots, Example 1 and Example 2, each divided into quadrants I–IV.] Which is more correlated?
Correlation
Exercise and Weight example: [scatterplot of weight vs. exercise (hrs/wk), divided into quadrants I–IV]. We can see more data in quadrants II and IV, indicating negative correlation.
Correlation Coefficient
Measures the strength of the linear relationship
ρ = the population correlation coefficient (parameter)
r = the sample correlation coefficient (statistic)
Correlation Coefficient
Can conduct hypothesis tests for ρ using the t distribution; H0: ρ = 0 is often tested. Confidence intervals for ρ may also be obtained.
Correlation Coefficient
Two major correlation coefficients:
1. Pearson correlation: parametric; sensitive to extreme observations.
2. Spearman correlation: nonparametric; based on ranks; robust to extreme observations and thus recommended particularly when “outliers” are present.
Pearson Correlation
Population parameter:
$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\,\sigma_Y}$$
Sample statistic:
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\!\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
(the average of the products of the standardized x and y deviations).
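A minimal sketch of the sample formula in Python (the exercise/weight numbers below are made up for illustration, not course data):

import numpy as np

exercise = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)                  # hypothetical hrs/week
weight = np.array([200, 195, 190, 185, 182, 178, 174, 170], dtype=float)    # hypothetical lbs

x_dev = exercise - exercise.mean()
y_dev = weight - weight.mean()
r = (x_dev * y_dev).sum() / np.sqrt((x_dev**2).sum() * (y_dev**2).sum())

print(round(r, 3))                                     # manual Pearson r
print(round(np.corrcoef(exercise, weight)[0, 1], 3))   # same value via numpy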
Spearman Correlation
Nonparametric version: plug in ranks rather than raw values into the sample statistic:
$$r_s = \frac{\sum_{i=1}^{n}(x_{r_i}-\bar{x}_r)(y_{r_i}-\bar{y}_r)}{\sqrt{\sum_{i=1}^{n}(x_{r_i}-\bar{x}_r)^2\,\sum_{i=1}^{n}(y_{r_i}-\bar{y}_r)^2}}$$
where $x_{r_i}$ and $y_{r_i}$ denote the ranks of $x_i$ and $y_i$, and $\bar{x}_r$, $\bar{y}_r$ are the mean ranks.
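A minimal sketch showing that the Spearman coefficient is just the Pearson formula applied to ranks (hypothetical values; scipy is used only as a cross-check):

import numpy as np
from scipy import stats

x = np.array([77, 92, 55, 86, 63, 99, 71, 48], dtype=float)      # hypothetical values
y = np.array([118, 40, 150, 65, 130, 25, 95, 160], dtype=float)

rx = stats.rankdata(x)                 # ranks of x (average ranks for ties)
ry = stats.rankdata(y)
r_s = np.corrcoef(rx, ry)[0, 1]        # Pearson r computed on the ranks

print(round(r_s, 3))
print(round(stats.spearmanr(x, y)[0], 3))   # should agree with the manual result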
Correlation
Two different questions:
1. Is the correlation strong, moderate, or weak?
2. Is the correlation statistically significant? (P-value)
Rough guide for the magnitude of r (the same labels apply on the negative side): |r| = 1 is perfect, 0.8 to 1.0 strong, 0.5 to 0.8 moderate, 0.2 to 0.5 weak, and near 0 essentially none.
Correlation
Also note:
- Never extrapolate correlation beyond the observed range of values.
- Correlation does NOT necessarily imply causation.
- Correlation does NOT measure the magnitude of the regression slope.
- No correlation does NOT mean no relationship (there could be a non-linear relationship between x and y, as in the figure below).
[Figure: scatterplot of y vs. x following a curved, non-linear pattern.]
Correlation: DPT Example
Example: Diphtheria, pertussis, and tetanus (DPT) immunization rates and mortality. Is there any association between the proportion of newborns immunized and the level of mortality of children less than 5 years of age?
Let:
X = percent of infants immunized against DPT
Y = under-5 mortality (number of children under 5 dying per 1,000 live births)
Correlation: DPT Example
[Scatterplot: under-5 mortality (per 1,000 live births) on the vertical axis vs. percent immunized against DPT on the horizontal axis, n = 20 countries.] Is there a pattern?
Example point: Bolivia (77, 118), i.e., immunization = 77% and mortality rate = 118 per 1,000 live births.
Correlation: DPT Example
X = percent of infants immunized against DPT; Y = under-5 mortality (per 1,000 live births). Calculate the pieces we need for the sample correlation coefficient:
$$\bar{x} = \frac{1}{20}\sum_{i=1}^{20}x_i = 77.4\%, \qquad \bar{y} = \frac{1}{20}\sum_{i=1}^{20}y_i = 59.0 \text{ per 1,000 live births}$$
$$\sum_{i=1}^{20}(x_i-\bar{x})(y_i-\bar{y}) = -22{,}706$$
$$\sum_{i=1}^{20}(x_i-\bar{x})^2 = 10{,}630.8, \qquad \sum_{i=1}^{20}(y_i-\bar{y})^2 = 77{,}498$$
Correlation: DPT Example
And the correlation coefficient is:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}} = \frac{-22{,}706}{\sqrt{(10{,}630.8)(77{,}498)}} = -0.79$$
Thus there appears to be a negative association between immunization levels for DPT and infant mortality.
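Plugging the summary quantities from this slide into the formula, for example in Python:

import math

sum_xy = -22706.0      # sum of (xi - xbar)(yi - ybar)
sum_xx = 10630.8       # sum of (xi - xbar)^2
sum_yy = 77498.0       # sum of (yi - ybar)^2

r = sum_xy / math.sqrt(sum_xx * sum_yy)
print(round(r, 2))     # -0.79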
Correlation: DPT Example
Hypothesis test: testing for zero correlation is based on
$$\mathrm{se}(r) = \sqrt{\frac{1-r^2}{n-2}}$$
and the test statistic
$$T = \frac{r}{\mathrm{se}(r)} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \;\sim\; t_{n-2}$$
Correlation: DPT Example
The test of the null hypothesis of zero correlation is constructed as follows:
1. H0: ρ = 0
2. a. Ha: ρ > 0   b. Ha: ρ < 0   c. Ha: ρ ≠ 0
3. The test is carried out at the α level of significance.
4. Rejection rule:
a. Reject H0 if t > t_{n-2; 1-α}
b. Reject H0 if t < t_{n-2; α}
c. Reject H0 if t > t_{n-2; 1-α/2} or t < t_{n-2; α/2}
Correlation: DPT Example
Setting α at 5%, we have:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.79\,\sqrt{20-2}}{\sqrt{1-(-0.79)^2}} = -5.47$$
Since −5.47 < t_{18; 0.025} ≈ −2.101, we reject the null hypothesis at the 5% significance level. There seems to be a significant negative correlation between immunization levels and infant mortality. This points to an inverse relationship (as immunization levels rise, infant mortality decreases). Note that we cannot estimate from the correlation alone how much infant mortality would decrease if a country were to increase its immunization levels by 10%.
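A minimal sketch of this calculation, including the two-sided p-value (scipy supplies the t distribution):

import math
from scipy import stats

r, n = -0.79, 20
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p = 2 * stats.t.cdf(-abs(t), df=n - 2)          # two-sided p-value

print(round(t, 2))                              # about -5.47
print(round(stats.t.ppf(0.025, df=n - 2), 3))   # critical value t_{18; 0.025}, about -2.101
print(p)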
Correlation: DPT Example
STATA output (pairwise correlations, with the p-value for the test of zero correlation):

             |  immunize    under5
    ---------+--------------------
    immunize |    1.0000
      under5 |   -0.7911    1.0000
             |    0.0000              <- p-value for test
Correlation: DPT Example
Confidence intervals: The (1 − α)% confidence interval for ρ is based on
$$W = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right), \qquad W \;\approx\; N\!\left(\frac{1}{2}\ln\!\left(\frac{1+\rho}{1-\rho}\right),\ \frac{1}{n-3}\right).$$
The confidence interval for ρ is computed in two steps:
STEP 1. Compute a confidence interval (L, U) for the transformed parameter:
$$(L,\ U) = \left(W - \frac{z_{1-\alpha/2}}{\sqrt{n-3}},\ \ W + \frac{z_{1-\alpha/2}}{\sqrt{n-3}}\right)$$
STEP 2. Solve for ρ at each endpoint, transforming back to the correlation scale:
$$\rho_L = \frac{e^{2L}-1}{e^{2L}+1}, \qquad \rho_U = \frac{e^{2U}-1}{e^{2U}+1}$$
Thus, a (1 − α)% confidence interval for ρ is:
$$\left(\frac{e^{2L}-1}{e^{2L}+1},\ \ \frac{e^{2U}-1}{e^{2U}+1}\right)$$
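A minimal sketch of these two steps in Python, using r = −0.79 and n = 20 from this example; it gives an interval close to the one quoted on the next slide:

import math
from scipy import stats

r, n, alpha = -0.79, 20, 0.05
w = 0.5 * math.log((1 + r) / (1 - r))        # Step 1: Fisher transform of r
z = stats.norm.ppf(1 - alpha / 2)
L = w - z / math.sqrt(n - 3)
U = w + z / math.sqrt(n - 3)

lo = (math.exp(2 * L) - 1) / (math.exp(2 * L) + 1)   # Step 2: back-transform
hi = (math.exp(2 * U) - 1) / (math.exp(2 * U) + 1)
print(round(lo, 3), round(hi, 3))            # roughly (-0.913, -0.534)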
Correlation: DPT Example
The interval (-0.913, -0.534) does not contain zero (the hypothesized value under the null hypothesis in the previous test).
Therefore, we reject the hypothesis of zero correlation between immunization levels and infant mortality.
Note that this is not always the case as the hypothesis test and the confidence interval for correlation are based on different statistics.
Correlation: DPT Example
What if we had used the Spearman rank correlation instead?
- Rank the percentages and mortalities across all 20 countries.
- If all ranks match up perfectly (i.e., a country's immunization % and mortality rate have the same ranking), then r_s = 1.
- If all ranks are perfectly inversely related, then r_s = −1.
- Here r_s = −0.54 (smaller in magnitude, likely due to some non-normality/outliers in the data).
- t_s = −2.72; we still reject at the 5% significance level.
Correlation: Assay Comparison
Two different assays to measure LDL cholesterol
Correlated? Agreement?
One step beyond correlation…
Simple Linear Regression
Correlation gives us a measure of the association between two variables, but NOT the magnitude.
With simple linear regression, we can now put a value on this magnitude, e.g., what is the average decrease in weight for every additional workout hour per week?
Hypothesis tests and CIs may be obtained for the magnitude of association (i.e., the slope of the line).
One may make predictions of the effect of changes in x on y.
Simple Linear Regression
Again, we plot the data first (scatterplot): this enables us to see the relationship between the two variables.
We designate a “Response” variable and a “Predictor” (or “Explanatory”) variable, unlike correlation, where there is no distinction between the two variables.
If the relationship is non-linear (as evidenced by the plot), then a simple linear regression model is the wrong approach.
Simple Linear Regression
[Scatterplot of weight (y) vs. exercise (hrs/wk, x).] Already determined to be correlated; looks like a linear trend.
Simple Linear Regression
[Same scatterplot of weight vs. exercise (hrs/wk) with a line drawn through the points.] Slope = rise/run = Δy/Δx ≈ 10 lbs per 1 hour (here, weight falls by roughly 10 lbs for each additional hour of exercise per week, given the negative trend).
Simple Linear Regression
Used to describe and estimate the relationship between two continuous variables.
We can attempt to characterize this relationship with two parameters:
1. an intercept, and
2. a slope
Several assumptions are made.
Simple Linear Regression
The model: yi = β0 + β1xi
- yi is the dependent variable (e.g., weight)
- xi is the independent variable (the predictor, e.g., exercise)
- β0 is the y-intercept (also written as α)
- β1 is the slope of the line (or just β). It describes how steep the line is and which way it leans. It is also the “effect” (e.g., a treatment effect) on y of a 1-unit change in x. Note that if there were no association between x and y, then β1 would be close to 0 (i.e., x has no effect on y).
Simple Linear Regression
[Four panels of example lines, labeled I–IV:]
I. Both lines have the same intercept but different positive slopes.
II. Both lines have the same positive slope (they are parallel) but different intercepts.
III. Both lines have the same intercept but different negative slopes.
IV. Both lines have the same negative slope (parallel) but different intercepts.
Simple Linear Regression
Constant slope: for a fixed increase in x, there will be a fixed change in y regardless of the value of x (a fixed increase if the slope is positive, a fixed decrease if the slope is negative). This is in contrast to a non-linear relationship, such as a quadratic or polynomial, where for some values of x, y will be increasing, and for some other values y will be decreasing (or vice versa).
[Figure: a straight line with equal Δy for each Δx.]
Simple Linear Regression
For example, even though it seems that y may be increasing for increasing x, the relationship is not a perfect line (many possible “best” lines).
[Scatterplot of y vs. x with no single obvious line through the points.]
Method of Least Squares
Use the method of least squares to find the best-fitting line (minimizing the distances from the data points).
The best guess without knowing the x's is the mean of y, ȳ.
[Scatterplot of y vs. x with the horizontal line y = ȳ.]
Least Squares Line: Finding “Y-hat”
Does knowing the x's allow us a better line? Try one:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
[Scatterplot with a candidate fitted line.]
Least Squares Line: Finding “Y-hat”
For each data point $(x_i, y_i)$, the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ gives a fitted value $\hat{y}_i$ at $x_i$.
[Figure: a single data point $(x_i, y_i)$ shown together with the fitted line.]
Least Squares Line: Finding “Y-hat”
Find the distance to the new line, $\hat{y}$, and to the ‘naïve’ guess, $\bar{y}$:
$$y_i - \hat{y}_i = e_i, \qquad \hat{y}_i - \bar{y}, \qquad y_i - \bar{y}$$
[Figure: these vertical distances drawn for the point $(x_i, y_i)$.]
Least Squares Line: Finding “Y-hat”
The regression line (whichever it is) will not pass through all of the data points $y_i$. Thus, in most cases, for each point $x_i$ the line will produce an estimated point $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, and most probably $\hat{Y}_i \neq Y_i$. In fact, as we see in the previous figure, $Y_i - \hat{Y}_i = e_i$. For each choice of $\hat{\beta}_0$ and $\hat{\beta}_1$ (note that each pair completely defines the line) we get a new line, and a whole new set of deviation terms $e_i$.
Least Squares Line: Finding “Y-hat”
The “best-fitting line” according to the least-squares method is the one that minimizes the sum of squared deviations:
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2$$
We desire to capture all of the structure in the data with the model (finding the systematic component), leaving only random errors. Thus, if we plotted the errors, we would hope that no discernible pattern exists. If a pattern exists, then we have not captured all of the systematic structure in the data, and we should look for a better model. More on this later!
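A minimal sketch of the standard closed-form solution to this minimization, applied to hypothetical exercise/weight data (not course data):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)                     # hypothetical predictor
y = np.array([200, 195, 190, 185, 182, 178, 174, 170], dtype=float)     # hypothetical response

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()   # slope estimate
b0 = y.mean() - b1 * x.mean()                                               # intercept estimate
residuals = y - (b0 + b1 * x)

print(round(b0, 2), round(b1, 2))
print(round((residuals**2).sum(), 2))   # the minimized sum of squared deviations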
Least Squares Line: Finding “Y-hat”
For each point, the total deviation from the mean splits into two pieces:
$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i), \qquad y_i - \hat{y}_i = e_i$$
Summing the squares over all points:
$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \underbrace{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}_{\text{variability due to regression}} + \underbrace{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}_{\text{unexplained variability}}$$
[Figure: these deviations drawn for a single point $(x_i, y_i)$.]
Variability & “Sum of Squares”
$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \underbrace{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}_{\text{variability due to regression}} + \underbrace{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}_{\text{unexplained variability}}$$
$$SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$$ is the part of the total variability that can be explained by the regression.
$$SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$ is the part left unexplained, because the model cannot account for differences between the estimated points and the data (this is called the error sum of squares).
$$SSY = SSR + SSE$$
Degrees of Freedom & “Mean Squares”
SSY is made up of n terms $(Y_i - \bar{Y})^2$. Once the mean has been estimated, only n − 1 terms are needed to compute SSY (i.e., the degrees of freedom of SSY are n − 1).
SSR has only one degree of freedom, since it is computed from a single estimated function of the data, $\hat{\beta}_1$:
$$SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 = \hat{\beta}_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2$$
SSE has the remaining n − 2 degrees of freedom.
Degrees of Freedom & “Mean Squares”
1. $$S_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \frac{SSY}{n-1}$$
is the variability of Y without consideration of X.
2. $$S_{Y|X}^2 = \frac{1}{n-2}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \frac{SSE}{n-2} = MSE$$
is the residual variability after the relationship of Y and X has been accounted for. Unless ρ = 0, this will always be less than $S_Y^2$.
Analysis of Variance Table

Source of variability | Sums of squares (SS)  | Df    | Mean squares (MS) | F           | Prob > F
----------------------+-----------------------+-------+-------------------+-------------+----------------------
Model                 | SSR = Σ (Ŷi − Ȳ)²     | 1     | MSR = SSR / 1     | F = MSR/MSE | P = P(F > F_{1, n-2})
Residual (error)      | SSE = Σ (Yi − Ŷi)²    | n − 2 | MSE = SSE/(n − 2) |             |
Corrected Total       | SSY = Σ (Yi − Ȳ)²     | n − 1 |                   |             |

Note that when the numerator d.f. = 1, F is simply the square of the t statistic (F = t²).
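A sketch of how the pieces of this table can be assembled by hand for a fitted simple linear regression (hypothetical data; scipy is used only for the F tail probability):

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([200, 195, 190, 185, 182, 178, 174, 170], dtype=float)
n = len(y)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SSR = ((y_hat - y.mean())**2).sum()    # model sum of squares, df = 1
SSE = ((y - y_hat)**2).sum()           # error sum of squares, df = n - 2
MSR, MSE = SSR / 1, SSE / (n - 2)
F = MSR / MSE
p = 1 - stats.f.cdf(F, 1, n - 2)       # Prob > F

print(round(SSR, 2), round(SSE, 2), round(F, 2), p)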
Assumptions
Assumptions of the linear regression model:
1. For each value of x, the y values are distributed according to a normal distribution with mean $\mu_{y|x}$ and variance $\sigma_{y|x}^2$ that is unknown.
2. The relationship between X and Y is given by the formula $\mu_{y|x} = \beta_0 + \beta_1 x$.
3. The y values are independent.
4. For every value x, the standard deviation of the outcomes y is constant and equal to $\sigma_{y|x}$. This concept is called homoscedasticity.
Hypothesis Testing
In these models, y is our target (or dependent variable, the outcome that we cannot control but want to explain) and x is the explanatory (or independent variable).
Within each regression, the primary interest is the assessment of the existence of the linear relationship between x and y. If such an association exists, then x provides information about y.
Hypothesis Testing
Inference on the existence of the linear association is accomplished via tests of hypotheses and confidence intervals. Both of these center around the estimate of the slope, β1, since it is clear that if the slope is zero, then changing x will have no impact on y (thus there is no association between x and y).
Hypothesis Testing
The test of the hypothesis of no linear association is defined as follows:
1. H0: No linear association exists between x and y.
2. Ha: A linear association exists between x and y.
3. The test is carried out at the α level of significance.
4. The test statistic is F = MSR/MSE.
5. Rejection rule: Reject H0 if F > F_{1, n-2; α}. This will happen if F is far from one. (Note that when MSR = MSE, F = 1 and we gain no information from the regression line beyond just the mean of y.)
Hypothesis Testing
The F test of linear association tests whether a line, other than the horizontal one going through the sample mean of the Y's, is useful in explaining some of the variability in the data.
This is based on the fact that if the population regression slope β1 = 0, that is, if the regression does not add anything new to our understanding of the data (does not explain a substantial part of the variability), then the two mean squares MSR and MSE estimate a common quantity (the population variance σ²).
Thus the ratio should be close to 1 if there is no linear association between X and Y. On the other hand, if a linear relationship exists (β1 is far from zero), then MSR will tend to be much larger than MSE and the ratio will deviate significantly from 1.
Hypothesis Testing
The test of the hypothesis of zero slope is defined as follows (for the two-sided alternative it is equivalent to the F-test of linear association):
1. H0: No linear association between x and y: β1 = 0.
2. Ha: A linear association exists between x and y:
a. β1 ≠ 0 (two-sided test)
b. β1 > 0 (one-sided test)
c. β1 < 0 (one-sided test)
3. The test is carried out at the α level of significance.
Hypothesis Testing
4. The test statistic is
$$T = \frac{\hat{\beta}_1}{\mathrm{s.e.}(\hat{\beta}_1)}$$
which is distributed as a t distribution with n − 2 d.f.
5. Rejection rule: Reject H0, in favor of the three alternatives respectively, if
a. t < t_{n-2; α/2} or t > t_{n-2; 1-α/2}
b. t > t_{n-2; 1-α}
c. t < t_{n-2; α}
Confidence Intervals for β1
Constructed as usual, based on $\hat{\beta}_1$, the standard error of $\hat{\beta}_1$, and the t statistic.
A (1 − α)% confidence interval is as follows:
$$\left(\hat{\beta}_1 - t_{n-2;\,1-\alpha/2}\,\mathrm{s.e.}(\hat{\beta}_1),\ \ \hat{\beta}_1 + t_{n-2;\,1-\alpha/2}\,\mathrm{s.e.}(\hat{\beta}_1)\right)$$
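A sketch of the slope t test and confidence interval, using the standard formula s.e.(β̂1) = sqrt(MSE / Σ(xi − x̄)²) (a formula not shown on the slides but standard for simple linear regression), again on hypothetical data:

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([200, 195, 190, 185, 182, 178, 174, 170], dtype=float)
n, alpha = len(y), 0.05

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

mse = (resid**2).sum() / (n - 2)
se_b1 = np.sqrt(mse / ((x - x.mean())**2).sum())     # standard error of the slope
t = b1 / se_b1
p = 2 * stats.t.cdf(-abs(t), df=n - 2)               # two-sided p-value
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)        # 95% CI for the slope

print(round(t, 2), p, tuple(round(v, 3) for v in ci))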
Simple Linear Regression: Newborn Infant Length Example
Consider the problem described in Section 18.4 of the textbook. Is there a relationship between the length and the gestational age of newborn infants?
[Scatterplot of length (centimeters) vs. gestational age (weeks), roughly 22 to 36 weeks.]
Simple Linear Regression: Newborn Infant Length Example
Source | SS df MS Number of obs = 100
---------+------------------------------ F( 1, 98) = 82.13
Model | 575.73916 1 575.73916 Prob > F = 0.0000
Residual | 687.02084 98 7.01041674 R-squared = 0.4559
---------+------------------------------ Adj R-squared = 0.4504
Total | 1262.76 99 12.7551515 Root MSE = 2.6477
------------------------------------------------------------------------------
length | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
gestage | .9516035 .1050062 9.062 0.000 .7432221 1.159985
_cons | 9.3281740 3.0451630 3.063 0.003 3.285149 15.3712
STATA output for the regression of length on gestational age. Key pieces of the output:
Degrees of freedom. There is one degree of freedom associated with the model, and n-2=98 degrees of freedom comprising the residual.
F test. This is the overall test of linear association. Note that the numerator degrees of freedom is the model degrees of freedom, while the denominator degrees of freedom is the residual (error) degrees of freedom. This test measures deviations from one.
Rejection rule of the F test. Since the p-value is 0.000<0.05=α, we reject the null hypothesis. There is evidence to suggest a linear association between gestational age and length of the newborn.
Root MSE. This is the square root of the mean square error (MSE) and, as before, can be used as an estimate of the population standard deviation of the outcomes about the regression line.
gestage is the estimate of the slope, β̂1 = 0.9516. This means that the fetus grows by about 0.95 centimeters for every additional week of gestation.
p-value of the t-test. Since 0.000<0.05, we reject the null hypothesis. There is a strong positive linear relationship between gestational age and length of the newborn.
The 95% confidence interval for β1 is [0.743, 1.160]. Since it excludes 0, we reject the null hypothesis and conclude that there is a strong positive linear relationship between gestational age and length of newborn.
The estimate of the intercept, _cons = 9.328, implies that the fetus is 9.3 centimeters long at conception (week zero), which is clearly false. Note, however, that our model is only valid over the range in which our data have been collected (here, roughly 22 to 36 weeks of gestation).
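For readers working in Python rather than Stata, a minimal sketch of fitting the same kind of model with statsmodels (the gestage/length numbers below are hypothetical, not the textbook's 100 observations):

import numpy as np
import statsmodels.api as sm

gestage = np.array([25, 27, 28, 29, 31, 32, 34, 36], dtype=float)   # weeks, hypothetical
length = np.array([33, 35, 36, 37, 39, 40, 42, 44], dtype=float)    # centimeters, hypothetical

X = sm.add_constant(gestage)       # adds the intercept column
fit = sm.OLS(length, X).fit()      # ordinary least squares
print(fit.summary())               # F test, coefficients, standard errors, t, p, 95% CIs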
Correlation & Regression: Final Notes
1. Correlation does not measure the magnitude of the slope. Even though
$$\hat{\beta}_1 = r\,\frac{S_Y}{S_X} \qquad \text{(equivalently, } r = \hat{\beta}_1\,\frac{S_X}{S_Y}\text{),}$$
the size of the slope also depends on the variability in the X and Y.
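A quick numerical check of this identity on hypothetical data:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([200, 195, 190, 185, 182, 178, 174, 170], dtype=float)

r = np.corrcoef(x, y)[0, 1]
b1_from_r = r * y.std(ddof=1) / x.std(ddof=1)            # beta1-hat = r * S_Y / S_X
b1_direct = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
print(round(b1_from_r, 4), round(b1_direct, 4))          # identical up to rounding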
Correlation & Regression: Final Notes
2. Correlation is not a measure of the appropriateness of the straight-line model.
Consider the case of a quadratic model. The correlation coefficient may be large, but a straight-line model may still be inappropriate (figure).
[Figure: scatterplot of Y vs. x following a curved, quadratic pattern.]
Correlation & Regression Final Notes
Beyond the limitations of the correlation coefficient just mentioned, there are additional advantages in using a regression model. Consider again the DPT immunization example:
The estimates of the slope and intercept are β̂1 = −2.136 and β̂0 = 224.32, respectively.

Source | SS df MS Number of obs = 20
---------+------------------------------ F( 1, 18) = 30.10
Model | 48497.0497 1 48497.0497 Prob > F = 0.0000
Residual | 29000.9503 18 1611.16391 R-squared = 0.6258
---------+------------------------------ Adj R-squared = 0.6050
Total | 77498.00 19 4078.84211 Root MSE = 40.139
------------------------------------------------------------------------------
under5 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
immunize | -2.135869 .3893022 -5.486 0.000 -2.953763 -1.317976
_cons | 224.3163 31.44034 7.135 0.000 158.2626 290.37
------------------------------------------------------------------------------
Correlation & Regression Final Notes
If an official wanted to estimate the decline in infant mortality for a 10% increase in immunization from current levels, they would only have to multiply 10 (= 10%) by the estimate of the slope, −2.136 (taking advantage of the fact that for a constant increase in x, y changes by a constant amount).
If DPT immunization rates were increased by 10%, the infant mortality would drop by (10)(-2.136)=-21.36 or about 21 infants per 1,000 live births. This would account for one-third the mortality rate of Egypt, or two-thirds the infant mortality rate in Russia. Taking advantage of the properties of a straight line and the results of the regression model, a concrete public health policy could be constructed.
Next Week… More Regression
Predicted values
Evaluation of the linear regression model (e.g., homoscedasticity, R²)
Extension to Multiple Linear Regression
Parallels with Analysis of Variance (ANOVA)