Introduction to Biostatistics, Harvard Extension School, Spring 2007
© Scott Evans, Ph.D. and Lynne Peeples, M.S.

Correlation and Simple Linear Regression


Page 1: Correlation and Simple Linear Regression

Page 2: Correlation

Used to assess the relationship between two continuous variables. Example: weight and number of hours of exercise per week.

Page 3: Correlation

Scatterplots (i.e., plots of y vs. x) are useful graphical tools for visualizing the relationship between two continuous variables: they are informative and easy to understand.
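As a quick illustration, here is a minimal Python sketch of such a scatterplot; the weight and exercise values are made up purely for demonstration and are not the course data:

```python
import matplotlib.pyplot as plt

# Hypothetical data: hours of exercise per week (x) and body weight in lbs (y)
exercise_hrs = [1, 2, 3, 4, 5, 6, 7, 8]
weight_lbs = [210, 205, 195, 190, 185, 178, 172, 165]

plt.scatter(exercise_hrs, weight_lbs)
plt.xlabel("Exercise (hrs/wk)")
plt.ylabel("Weight (lbs)")
plt.title("Weight vs. exercise: scatterplot")
plt.show()
```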

Page 4: Correlation: Exercise and Weight example

How do we determine if this represents correlation?

[Scatterplot of Weight (y) vs. Exercise in hrs/wk (x)]

Page 5: Correlation

Range: –1 (perfect negative correlation) to +1 (perfect positive correlation)


Page 6: Correlation

Correlation of –1 means that all of the data lie in a straight line with some negative slope.

Correlation of 1 means that all of the data lie in a straight line with some positive slope.

Correlation of 0 means that the variables are not linearly correlated. (A plot of the data would reveal a cloud with no discernible pattern.)

Page 7: Correlation

One way to visualize: divide the scatterplot into four quadrants (I, II, III, IV) at the means of x and y.

Page 8: Correlation

Which is more correlated?

[Two example scatterplots (Example 1 and Example 2), each divided into quadrants I-IV at the means of x and y]


Page 10: Correlation

Exercise and Weight example: more of the data fall in quadrants II and IV, indicating negative correlation.

[Scatterplot of Weight (y) vs. Exercise in hrs/wk (x), divided into quadrants I-IV]

Page 11: Correlation Coefficient

Measures the strength of the linear relationship

ρ = the population correlation coefficient (parameter)

r = the sample correlation coefficient (statistic)

Page 12: Correlation Coefficient

Hypothesis tests for ρ can be conducted using the t distribution; H0: ρ = 0 is the hypothesis most often tested. Confidence intervals for ρ may also be obtained.

Page 13: Correlation Coefficient

Two major correlation coefficients:

1. Pearson correlation: parametric; sensitive to extreme observations.

2. Spearman correlation: nonparametric; based on ranks; robust to extreme observations and thus recommended particularly when "outliers" are present.

Page 14: Pearson Correlation

Population parameter:

$$\rho = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\,\sigma_Y}$$

Sample statistic:

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
  = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
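A minimal Python sketch of the sample calculation; the data below are hypothetical and only illustrate that the definitional formula and the library routine agree:

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([210.0, 204.0, 197.0, 193.0, 186.0, 180.0])

# Pearson r from the definition: sum of cross-products over the root of the sums of squares
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
r_manual = num / den

# Same quantity from scipy (also returns a p-value for H0: rho = 0)
r_scipy, p_value = stats.pearsonr(x, y)

print(r_manual, r_scipy, p_value)
```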

Page 15: Spearman Correlation

Nonparametric version: plug ranks, rather than raw values, into the sample statistic:

$$r_s = \frac{\sum_{i=1}^{n}(x_{ri}-\bar{x}_r)(y_{ri}-\bar{y}_r)}{\sqrt{\sum_{i=1}^{n}(x_{ri}-\bar{x}_r)^2\,\sum_{i=1}^{n}(y_{ri}-\bar{y}_r)^2}}$$

where $x_{ri}$ and $y_{ri}$ are the ranks of $x_i$ and $y_i$, and $\bar{x}_r$, $\bar{y}_r$ are the mean ranks.
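A short Python sketch showing that the Spearman coefficient is simply the Pearson formula applied to the ranks (again with hypothetical data):

```python
import numpy as np
from scipy import stats

x = np.array([77.0, 85.0, 40.0, 99.0, 60.0, 25.0])
y = np.array([118.0, 30.0, 160.0, 12.0, 65.0, 180.0])

# Rank the raw values, then apply the Pearson formula to the ranks
x_ranks = stats.rankdata(x)
y_ranks = stats.rankdata(y)
r_s_manual, _ = stats.pearsonr(x_ranks, y_ranks)

# scipy computes the same quantity directly
r_s, p_value = stats.spearmanr(x, y)

print(r_s_manual, r_s, p_value)
```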

Page 16: Correlation

Two different questions:

1. Is the correlation strong, moderate or weak?

2. Is the correlation statistically significant? (P-value)

[Scale of the correlation coefficient, from -1 to +1: |r| = 1 perfect; 0.8 to 1 strong; 0.5 to 0.8 moderate; 0.2 to 0.5 weak; near 0 none. The same labels apply on the negative side.]

Page 17: Correlation

Also note:

Never extrapolate correlation beyond the observed range of values.

Correlation does NOT necessarily imply causation.

Correlation does NOT measure the magnitude of the regression slope.

No correlation does NOT mean no relationship (there could be a non-linear relationship between x and y, as in the plot below).

[Scatterplot of a non-linear, curved relationship between x and y]

Page 18: Correlation: DPT Example

Example: diphtheria, pertussis, and tetanus (DPT) immunization rates and mortality. Is there any association between the proportion of newborns immunized and the level of mortality among children less than 5 years of age?

Let:
X = percent of infants immunized against DPT
Y = infant mortality (number of children under 5 dying per 1,000 live births)

Page 19: Correlation: DPT Example

[Scatterplot of under-5 mortality (per 1,000 live births) vs. percent immunized against DPT, for n = 20 countries]

Is there a pattern?

Example point: Bolivia (77, 118), i.e., immunization = 77% and mortality rate = 118 per 1,000 live births.

Page 20: Correlation: DPT Example

X = percent of infants immunized against DPT; Y = infant mortality (per 1,000 live births). Calculate the pieces we need for the sample correlation coefficient:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = 77.4\%, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = 59.0 \text{ per 1,000 live births}$$

$$\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = -22706$$

$$\sum_{i=1}^{n}(x_i-\bar{x})^2 = 10630.8, \qquad \sum_{i=1}^{n}(y_i-\bar{y})^2 = 77498$$

Page 21: Correlation: DPT Example

And the correlation coefficient is:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}
  = \frac{-22706}{\sqrt{(10630.8)(77498)}} = -0.79$$

Thus there appears to be a negative association between immunization levels for DPT and infant mortality.
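A quick Python check of the arithmetic using only the summary quantities above (the raw country-level data are not reproduced in these notes):

```python
import math

sum_xy = -22706.0      # sum of (x_i - x_bar)(y_i - y_bar)
sum_xx = 10630.8       # sum of (x_i - x_bar)^2
sum_yy = 77498.0       # sum of (y_i - y_bar)^2

r = sum_xy / math.sqrt(sum_xx * sum_yy)
print(round(r, 2))     # -0.79
```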

Page 22: Correlation: DPT Example

Hypothesis test: testing for zero correlation is based on

$$\widehat{se}(r) = \sqrt{\frac{1-r^2}{n-2}}$$

and the test statistic

$$T = \frac{r}{\widehat{se}(r)} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}$$

Page 23: Correlation: DPT Example

The test of the null hypothesis of zero correlation is constructed as follows:

1. H0: ρ = 0

2. Ha: (a) ρ > 0; (b) ρ < 0; (c) ρ ≠ 0

3. The test is carried out at the α level of significance.

4. Rejection rule: (a) reject H0 if t > t_{n-2; 1-α}; (b) reject H0 if t < t_{n-2; α}; (c) reject H0 if t > t_{n-2; 1-α/2} or t < t_{n-2; α/2}.

Page 24: Correlation: DPT Example

Setting α at 5%, we have:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.79\sqrt{20-2}}{\sqrt{1-(-0.79)^2}} = -5.47$$

Since -5.47 < -t_{18; 0.025} = -2.10, we reject the null hypothesis at the 5% significance level. There is a significant negative correlation between immunization levels and infant mortality: as immunization levels rise, infant mortality tends to decrease. Note, however, that from the correlation alone we cannot estimate how much infant mortality would decrease if a country were to increase its immunization levels by 10%.
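A small Python check of this test statistic, its two-sided p-value, and the t critical value (assuming scipy is available):

```python
import math
from scipy import stats

r, n = -0.79, 20

t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # about -5.47
p_two_sided = 2 * stats.t.cdf(t, df=n - 2)        # well below 0.05
t_crit = stats.t.ppf(0.975, df=n - 2)             # about 2.10

print(round(t, 2), p_two_sided, round(t_crit, 2))
```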

Page 25: Correlation: DPT Example

STATA output (correlation of immunize and under5; the value beneath the coefficient is the p-value for the test of zero correlation):

         | immunize   under5
---------+------------------
immunize |   1.0000
         |
  under5 |  -0.7911   1.0000
         |   0.0000

Page 26: Correlation: DPT Example

Confidence intervals: the (1-α)% confidence interval for ρ is based on Fisher's transformation

$$W = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right), \qquad
W \;\approx\; N\!\left(\frac{1}{2}\ln\!\left(\frac{1+\rho}{1-\rho}\right),\; \frac{1}{n-3}\right).$$

The confidence interval for ρ is computed in two steps.

STEP 1. Compute a confidence interval for the parameter $\zeta = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right)$:

$$(z_L,\; z_U) = \left(W - \frac{z_{1-\alpha/2}}{\sqrt{n-3}},\;\; W + \frac{z_{1-\alpha/2}}{\sqrt{n-3}}\right)$$

STEP 2. Solve $\zeta = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right)$ for ρ at each endpoint:

$$\rho_L = \frac{e^{2z_L}-1}{e^{2z_L}+1}, \qquad \rho_U = \frac{e^{2z_U}-1}{e^{2z_U}+1}$$

Thus, a (1-α)% confidence interval for ρ is

$$\left(\frac{e^{2z_L}-1}{e^{2z_L}+1},\; \frac{e^{2z_U}-1}{e^{2z_U}+1}\right).$$
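A minimal Python sketch of this two-step calculation for the DPT example (r = -0.79, n = 20); up to rounding it reproduces the interval quoted two pages ahead:

```python
import math
from scipy import stats

r, n, alpha = -0.79, 20, 0.05

# Step 1: Fisher transform and a normal-theory interval on that scale
w = 0.5 * math.log((1 + r) / (1 - r))
z_crit = stats.norm.ppf(1 - alpha / 2)            # about 1.96
half_width = z_crit / math.sqrt(n - 3)
z_lo, z_hi = w - half_width, w + half_width

# Step 2: transform the endpoints back to the correlation scale
back = lambda z: (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)
print(round(back(z_lo), 3), round(back(z_hi), 3))  # about (-0.913, -0.534)
```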

Page 27: Correlation: DPT Example

Page 28: Correlation: DPT Example

The interval (-0.913, -0.534) does not contain zero (the hypothesized value under the null hypothesis in the previous test).

Therefore, we reject the hypothesis of zero correlation between immunization levels and infant mortality.

Note that agreement between the two is not guaranteed, as the hypothesis test and the confidence interval for the correlation are based on different statistics.

Page 29: Correlation: DPT Example

What if we had used the Spearman rank correlation instead? Rank the immunization percentages and the mortality rates across all 20 countries.

If all ranks match up perfectly (i.e., a country's immunization % and mortality rate have the same ranking), then r_s = 1.

If all ranks are perfectly inversely related, then r_s = -1.

Here r_s = -0.54 (smaller in magnitude than the Pearson r, likely due to some non-normality/outliers in the data), and t_s = -2.72, so we still reject at the 5% significance level.
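A commonly used approximation applies the same t formula to the rank-based coefficient; a quick Python check of the quoted statistic (the ranks themselves are not reproduced here):

```python
import math

r_s, n = -0.54, 20
t_s = r_s * math.sqrt(n - 2) / math.sqrt(1 - r_s**2)
print(round(t_s, 2))   # about -2.72
```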

Page 30: Correlation: Assay Comparison

Two different assays to measure LDL cholesterol

Correlated? Agreement?

One step beyond correlation…

Page 31: Simple Linear Regression

Correlation gives us a measure of the association between two variables, but NOT its magnitude.

With simple linear regression, we can now put a value on this magnitude, e.g., what is the average decrease in weight for every additional workout hour per week?

Hypothesis tests and confidence intervals may be obtained for the magnitude of the association (i.e., the slope of the line).

One may make predictions of the effect of changes in x on y.

Page 32: Simple Linear Regression

Again, we plot the data first (scatterplot): this lets us see the relationship between the two variables.

We designate a "response" variable and a "predictor" (or "explanatory") variable, unlike correlation, where there is no distinction between the two variables.

If the relationship is non-linear (as evidenced by the plot), then a simple linear regression model is the wrong approach.

Page 33: Simple Linear Regression

[Scatterplot of Weight (y) vs. Exercise in hrs/wk (x)]

We have already determined that the variables are correlated, and the plot looks like a linear trend.

Page 34: Simple Linear Regression

[Scatterplot of Weight vs. Exercise (hrs/wk) with a fitted line: a rise of about 10 lbs over a run of about 1 hour]

Slope = rise/run = Δy/Δx ≈ 10 lbs per 1 hour.

Page 35: Simple Linear Regression

Used to describe and estimate the relationship between two continuous variables.

We can attempt to characterize this relationship with two parameters:

1. an intercept, and

2. a slope

Several assumptions are made.

Page 36: Simple Linear Regression

The model: y_i = β0 + β1·x_i

y_i is the dependent variable (e.g., weight); x_i is the independent variable (the predictor, e.g., exercise).

β0 is the y-intercept (also written as α); β1 is the slope of the line (also written simply as β).

The slope describes how steep the line is and which way it leans. It is also the "effect" (e.g., a treatment effect) on y of a one-unit change in x. Note that if there were no association between x and y, then β1 would be close to 0 (i.e., x has no effect on y).

Page 37: Simple Linear Regression

[Four panels of example regression lines, labeled I-IV]

I. Both lines have the same intercept but different positive slopes.
II. Both lines have the same positive slope (they are parallel) but different intercepts.
III. Both lines have the same intercept but different negative slopes.
IV. Both lines have the same negative slope (parallel) but different intercepts.

Page 38: Simple Linear Regression

Constant slope: for a fixed increase in x, there is a fixed change in y (Δy for a given Δx), regardless of the value of x; a fixed increase if the slope is positive and a fixed decrease if the slope is negative.

This is in contrast to a non-linear relationship, such as a quadratic or higher-order polynomial, where y will be increasing over some values of x and decreasing over others (or vice versa).

Page 39: Simple Linear Regression

For example, even though it seems that y may be increasing with increasing x, the relationship is not a perfect line (there are many possible "best" lines).

[Scatterplot of y vs. x with scatter around an increasing trend]

Page 40: Method of Least Squares

Use the method of least squares to find the best-fitting line (minimize the distances from the data points to the line).

Our best guess for y without knowing the x's is the mean of y, ȳ.

[Scatterplot with a horizontal line at ȳ]

Page 41: Least Squares Line: Finding "Y-hat"

Does knowing the x's allow us to draw a better line? Try one:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

[Scatterplot with a candidate fitted line and the horizontal line at ȳ]

Page 42: Least Squares Line: Finding "Y-hat"

For each data point (x_i, y_i):

[Scatterplot showing the point (x_i, y_i), the candidate line ŷ = β̂0 + β̂1·x, and the horizontal line at ȳ]

Page 43: Least Squares Line: Finding "Y-hat"

Find the distance from each point to the new line, ŷ, and to the "naïve" guess, ȳ:

$$y_i - \hat{y}_i = e_i, \qquad \hat{y}_i - \bar{y}, \qquad y_i - \bar{y}$$

[Scatterplot showing, for the point (x_i, y_i), the vertical distances y_i - ŷ_i, ŷ_i - ȳ, and y_i - ȳ]

Page 44: Least Squares Line: Finding "Y-hat"

The regression line (whichever it is) will not pass through all of the data points y_i. Thus, in most cases, for each point x_i the line will produce an estimated point ŷ_i = β̂0 + β̂1·x_i, and most probably ŷ_i ≠ y_i.

In fact, as we saw in the previous figure, y_i - ŷ_i = e_i. For each choice of β̂0 and β̂1 (note that each pair completely defines a line) we get a new line, and a whole new set of deviation terms e_i.

Page 45: Least Squares Line: Finding "Y-hat"

The "best-fitting line" according to the least-squares method is the one that minimizes the sum of squared deviations:

$$\sum_{i=1}^{n} e_i^2 \;=\; \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \;=\; \sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$$

We want the model to capture all of the structure in the data (the systematic component), leaving only random errors. Thus if we plotted the errors, we would hope that no discernible pattern exists. If a pattern exists, then we have not captured all of the systematic structure in the data, and we should look for a better model. More on this later!
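Minimizing this sum of squares has the standard closed-form solution β̂1 = Σ(x_i - x̄)(y_i - ȳ)/Σ(x_i - x̄)² and β̂0 = ȳ - β̂1·x̄ (not derived in these slides, but stated here for reference). A small Python sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data: exercise hours (x) and weight in lbs (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([210.0, 202.0, 196.0, 189.0, 184.0, 175.0])

# Closed-form least-squares estimates (standard result)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

# np.polyfit minimizes the same sum of squared deviations
b1_np, b0_np = np.polyfit(x, y, deg=1)

print(b0, b1)        # intercept and slope from the formulas
print(b0_np, b1_np)  # should match
```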

Page 46: Least Squares Line: Finding "Y-hat"

For each point, the total deviation from the mean splits into a part explained by the regression and an unexplained part:

$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i), \qquad y_i - \hat{y}_i = e_i$$

Summing the squared deviations over all points:

$$\sum_{i=1}^{n}(y_i-\bar{y})^2 \;=\; \underbrace{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}_{\text{variability due to regression}} \;+\; \underbrace{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}_{\text{unexplained variability}}$$

[Scatterplot illustrating y_i - ȳ, ŷ_i - ȳ, and y_i - ŷ_i for one point]

Page 47: Variability & "Sum of Squares"

$$SSR = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2$$

-- the part of the total variability that can be explained by the regression.

$$SSE = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

-- the part left unexplained, because the model cannot account for the differences between the estimated points and the data (this is called the error sum of squares).

$$SSY = \sum_{i=1}^{n}(y_i-\bar{y})^2 = SSR + SSE$$

Page 48: Degrees of Freedom & "Mean Squares"

SSY is made up of n terms (y_i - ȳ)². Once the mean ȳ has been estimated, only n-1 of these terms are free to vary (i.e., the degrees of freedom of SSY are n-1).

SSR has only one degree of freedom, since it is computed from a single estimated quantity, the slope:

$$SSR = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 = \hat{\beta}_1^2 \sum_{i=1}^{n}(x_i-\bar{x})^2$$

SSE has the remaining n-2 degrees of freedom.

Page 49: Degrees of Freedom & "Mean Squares"

1. The sample variance of Y,

$$S_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2 = \frac{SSY}{n-1},$$

is the variability of Y without consideration of X.

2. The residual mean square,

$$S_{Y|X}^2 = \frac{1}{n-2}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 = \frac{SSE}{n-2} = MSE,$$

is the residual variability after the relationship of Y with X has been accounted for. Unless ρ = 0, this will always be less than S_Y².

Page 50: Analysis of Variance Table

Source of variability | Sums of squares (SS)  | df  | Mean squares (MS) | F           | Prob > F
----------------------+-----------------------+-----+-------------------+-------------+----------------------
Model                 | SSR = Σ(ŷ_i - ȳ)²     | 1   | MSR = SSR/1       | F = MSR/MSE | P = P(F_{1, n-2} > F)
Residual (error)      | SSE = Σ(y_i - ŷ_i)²   | n-2 | MSE = SSE/(n-2)   |             |
Corrected total       | SSY = Σ(y_i - ȳ)²     | n-1 |                   |             |

Note that when the numerator degrees of freedom equal 1, F = t²; the F test is then equivalent to the two-sided t test of the slope.

Page 51: Assumptions

Assumptions of the linear regression model:

1. For each value x, the y values are distributed according to a normal distribution with mean μ_{y|x} and variance σ²_{y|x}, which is unknown.

2. The relationship between X and Y is given by the formula μ_{y|x} = β0 + β1·x.

3. The y values are independent.

4. For every value x, the standard deviation of the outcomes y is constant and equal to σ_{y|x}. This property is called homoscedasticity.

Page 52: Hypothesis Testing

In these models, y is our target (or dependent variable, the outcome that we cannot control but want to explain) and x is the explanatory (or independent variable).

Within each regression, the primary interest is the assessment of the existence of the linear relationship between x and y. If such an association exists, then x provides information about y.

Page 53: Hypothesis Testing

Inference on the existence of a linear association is accomplished via tests of hypotheses and confidence intervals. Both of these center on the estimate of the slope, β̂1, since it is clear that if the slope is zero, then changing x has no impact on y (and thus there is no linear association between x and y).

Page 54: Hypothesis Testing

The test of the hypothesis of no linear association is defined as follows:

1. H0: no linear association exists between x and y.
2. Ha: a linear association exists between x and y.
3. The test is carried out at the α level of significance.
4. The test statistic is F = MSR/MSE.
5. Rejection rule: reject H0 if F > F_{1, n-2; α}. This will happen if F is far from 1. (Note that when MSR = MSE, F = 1 and we gain no information from the regression line beyond just the mean of y.)

Page 55: Hypothesis Testing

The F test of linear association tests whether a line other than the horizontal one passing through the sample mean of the Y's is useful in explaining some of the variability in the data.

This is based on the fact that if the population regression slope β1 = 0, that is, if the regression does not add anything new to our understanding of the data (does not explain a substantial part of the variability), then the two mean squares MSR and MSE estimate a common quantity (the population variance σ²).

Thus the ratio should be close to 1 if the hypothesis of no linear association between X and Y is true. On the other hand, if a linear relationship exists (β1 is far from zero), then SSR will tend to exceed SSE and the ratio will deviate markedly from 1.

Page 56: Hypothesis Testing

The test of the hypothesis of zero slope is defined as follows (compare with the F test for linear association):

1. H0: no linear association between x and y: β1 = 0.
2. Ha: a linear association exists between x and y:
   a. β1 ≠ 0 (two-sided test)
   b. β1 > 0 (one-sided test)
   c. β1 < 0 (one-sided test)
3. The test is carried out at the α level of significance.

Page 57: Hypothesis Testing

4. The test statistic is

$$T = \frac{\hat{\beta}_1}{\widehat{se}(\hat{\beta}_1)},$$

which follows a t distribution with n-2 degrees of freedom.

5. Rejection rule: reject H0, in favor of the three alternatives respectively, if
   a. t < t_{n-2; α/2} or t > t_{n-2; 1-α/2}
   b. t > t_{n-2; 1-α}
   c. t < t_{n-2; α}

Page 58: Confidence Intervals for β1

Constructed as usual, based on β̂1, the standard error of β̂1, and the t statistic.

A (1-α)% confidence interval is as follows:

$$\left(\hat{\beta}_1 - t_{n-2;\,1-\alpha/2}\;\widehat{se}(\hat{\beta}_1),\;\; \hat{\beta}_1 + t_{n-2;\,1-\alpha/2}\;\widehat{se}(\hat{\beta}_1)\right)$$
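As a worked check, the interval can be computed from the slope estimate and standard error reported in the STATA output a few pages ahead (gestage: β̂1 ≈ 0.9516, se ≈ 0.1050, n = 100). A minimal Python sketch:

```python
from scipy import stats

b1_hat, se_b1, n, alpha = 0.9516035, 0.1050062, 100, 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # about 1.984 for 98 d.f.
lower = b1_hat - t_crit * se_b1
upper = b1_hat + t_crit * se_b1
print(round(lower, 4), round(upper, 4))          # about (0.7432, 1.1600)
```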

Page 59: Simple Linear Regression: Newborn Infant Length Example

Consider the problem described in section 18.4 of the textbook. Is there a relationship between the length and the gestational age of newborn infants?

[Scatterplot of length (centimeters) vs. gestational age (weeks)]

Page 60: Simple Linear Regression: Newborn Infant Length Example

STATA output:

Source | SS df MS Number of obs = 100

---------+------------------------------ F( 1, 98) = 82.13

Model | 575.73916 1 575.73916 Prob > F = 0.0000

Residual | 687.02084 98 7.01041674 R-squared = 0.4559

---------+------------------------------ Adj R-squared = 0.4504

Total | 1262.76 99 12.7551515 Root MSE = 2.6477

------------------------------------------------------------------------------

length | Coef. Std. Err. t P>|t| [95% Conf. Interval]

---------+--------------------------------------------------------------------

gestage | .9516035 .1050062 9.062 0.000 .7432221 1.159985

_cons | 9.3281740 3.0451630 3.063 0.003 3.285149 15.3712

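For readers working in Python rather than STATA, a hedged sketch of an equivalent fit using statsmodels; the file name is a placeholder (no dataset is distributed with these notes), while the variable names length and gestage are taken from the output above:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file containing the n = 100 newborn records with
# columns 'length' (cm) and 'gestage' (weeks)
df = pd.read_csv("newborn_length.csv")

# Ordinary least squares: length regressed on gestational age
model = smf.ols("length ~ gestage", data=df).fit()

# The summary reports the F statistic, R-squared, coefficient estimates,
# standard errors, t tests, and 95% confidence intervals, mirroring the
# STATA output shown above
print(model.summary())
```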

Page 61: Simple Linear Regression: Newborn Infant Length Example

Degrees of freedom. There is one degree of freedom associated with the model, and n-2=98 degrees of freedom comprising the residual.

Page 62: Simple Linear Regression: Newborn Infant Length Example

F test. This is the overall test of linear association. Note that the numerator degrees of freedom are the model degrees of freedom, while the denominator degrees of freedom are the residual (error) degrees of freedom. The test measures how far the F ratio deviates from 1.

Page 63: Simple Linear Regression: Newborn Infant Length Example

Rejection rule of the F test. Since the p-value is 0.000<0.05=α, we reject the null hypothesis. There is evidence to suggest a linear association between gestational age and length of the newborn.

Page 64: Simple Linear Regression: Newborn Infant Length Example

Root MSE. This is the square root of the mean square error and, as before, can be used as an estimate of the population standard deviation of y about the regression line (σ_{y|x}).

Page 65: Simple Linear Regression: Newborn Infant Length Example

gestage is the estimate of the slope, β̂1 = 0.9516035. This means that length increases by about 0.95 centimeters, on average, for every additional week of gestation.

Page 66: Simple Linear Regression: Newborn Infant Length Example

p-value of the t-test. Since 0.000<0.05, we reject the null hypothesis. There is a strong positive linear relationship between gestational age and length of the newborn.

Page 67: Simple Linear Regression: Newborn Infant Length Example

The 95% confidence interval for β1 is [0.743, 1.160]. Since it excludes 0, we reject the null hypothesis and conclude that there is a strong positive linear relationship between gestational age and length of newborn.

Page 68: Simple Linear Regression: Newborn Infant Length Example

The estimate of the intercept, _cons = 9.328, implies that a fetus is 9.3 centimeters long at conception (week zero), which is clearly false. Note, however, that our model is only valid over the range of gestational ages in which our data were collected.

Page 69: Correlation & Regression: Final Notes

1. Correlation does not measure the magnitude of the slope. Even though

$$\hat{\beta}_1 = r\,\frac{S_Y}{S_X} \quad\Longleftrightarrow\quad r = \hat{\beta}_1\,\frac{S_X}{S_Y},$$

the size of the slope also depends on the variability of X and Y.
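A quick numerical check of this identity using the DPT example quantities reported earlier in these notes (r ≈ -0.7911, Σ(x-x̄)² = 10630.8, Σ(y-ȳ)² = 77498, n = 20):

```python
import math

r = -0.7911
sum_xx, sum_yy, n = 10630.8, 77498.0, 20

s_x = math.sqrt(sum_xx / (n - 1))   # sample SD of immunization percentage
s_y = math.sqrt(sum_yy / (n - 1))   # sample SD of under-5 mortality

b1 = r * s_y / s_x
print(round(b1, 3))                  # about -2.136, matching the regression slope
```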

Page 70: Correlation & Regression: Final Notes

2. Correlation is not a measure of the appropriateness of the straight-line model.

Consider the case of a quadratic relationship. The correlation coefficient may be large, yet a straight-line model may still be inappropriate (see figure).

[Scatterplot of a quadratic relationship between x and y]

Page 71: Correlation & Regression: Final Notes

Beyond the limitations of the correlation coefficient just mentioned, there are additional advantages to using a regression model. Consider again the DPT immunization example. The estimates of the slope and intercept are β̂1 = -2.136 and β̂0 = 224.32, respectively.

Source | SS df MS Number of obs = 20
---------+------------------------------ F( 1, 18) = 30.10
Model | 48497.0497 1 48497.0497 Prob > F = 0.0000
Residual | 29000.9503 18 1611.16391 R-squared = 0.6258
---------+------------------------------ Adj R-squared = 0.6050
Total | 77498.00 19 4078.84211 Root MSE = 40.139
------------------------------------------------------------------------------
under5 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
immunize | -2.135869 .3893022 -5.486 0.000 -2.953763 -1.317976
_cons | 224.3163 31.44034 7.135 0.000 158.2626 290.37
------------------------------------------------------------------------------

Page 72: Correlation & Regression: Final Notes

If an official wanted to estimate the decline in mortality for a 10-percentage-point increase in immunization from current levels, they would only have to multiply 10 by the estimate of the slope, -2.136 (taking advantage of the fact that for a constant increase in x, y changes by a constant amount).

If DPT immunization rates were increased by 10 percentage points, under-5 mortality would be expected to drop by (10)(-2.136) = -21.36, or about 21 deaths per 1,000 live births. This would correspond to about one-third of the mortality rate of Egypt, or two-thirds of the rate in Russia. Taking advantage of the properties of a straight line and the results of the regression model, a concrete public health policy could be constructed.
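A minimal sketch of this prediction using the fitted DPT line (under-5 mortality ≈ 224.32 - 2.136 × immunization %):

```python
b0, b1 = 224.3163, -2.135869   # intercept and slope from the STATA output

def predicted_mortality(immunize_pct):
    """Fitted under-5 mortality (per 1,000 live births) at a given immunization %."""
    return b0 + b1 * immunize_pct

current = 77.4                              # sample mean immunization level
change = predicted_mortality(current + 10) - predicted_mortality(current)
print(round(change, 2))                     # about -21.36
```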

Page 73: Next Week… More Regression

Predicted values

Evaluation of the linear regression model (e.g., homoscedasticity, R²)

Extension to multiple linear regression

Parallels with analysis of variance (ANOVA)