
Page 1

1/43

Basic medical statistics for clinical and experimental research

Correlation and simple linear regression S5

Sander Roberti, [email protected]

December 4, 2019

Page 2

2/43

Introduction

Example: Brain size and body weight

Subject id   Brain size (MRI total pixel count per 10,000)   Weight (pounds)
1            81.69                                           118
2            96.54                                           172
3            92.88                                           146
4            90.49                                           134
5            95.55                                           172
6            83.39                                           118
7            92.41                                           155
...          ...                                             ...
24           79.06                                           122
25           98.00                                           190

Page 3

3/43

Introduction

Example: Brain size and body weight

[Scatter plot of brain size vs. body weight, with subjects 4 and 24 highlighted]

Page 4

4/43

Relationship between two numerical variables

If a linear relationship between x and y appears to be reasonable from the scatter plot, we can take the next step and

1. Calculate Pearson's correlation coefficient between x and y
   - Measures how closely the data points on the scatter plot resemble a straight line
2. Perform a simple linear regression analysis
   - Finds the equation of the line that best describes the relationship between the variables seen in the scatter plot

Page 5

5/43

Correlation

The sample Pearson correlation coefficient (or product-moment correlation coefficient) between variables x and y is calculated as

r(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

where:
- {(x_i, y_i) : i = 1, ..., n} is a random sample of n observations on x and y,
- \bar{x} and \bar{y} are the sample means of x and y, respectively,
- s_x and s_y are the sample standard deviations of x and y, respectively.
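As a numeric check on the formula above, a minimal Python sketch; the data here are made up purely for illustration (not the lecture's brain-size sample):

```python
# Sketch: Pearson's r computed directly from the formula above.
# Illustrative, made-up data.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
# Sample standard deviations (divisor n - 1)
s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

# r(x, y) = 1/(n-1) * sum over i of the products of z-scores
r = sum(((xi - mean_x) / s_x) * ((yi - mean_y) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)
print(round(r, 4))  # close to 1: strong positive linear association
```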

Page 6

6/43

Correlation

Properties of r:

- r estimates the true population correlation coefficient ρ
- r takes on any value between −1 and 1
- The magnitude of r indicates the strength of a linear relationship between x and y:
  - r = −1 or 1 means perfect linear association
  - The closer r is to −1 or 1, the stronger the linear association (e.g. r = −0.1 (weak association) vs. r = 0.85 (strong association))
  - r = 0 indicates no linear association (though there can be e.g. a non-linear one)
- The sign of r indicates the direction of association:
  - r > 0 implies a positive relationship, i.e. the two variables tend to move in the same direction
  - r < 0 implies a negative relationship, i.e. the two variables tend to move in opposite directions

Page 7

7/43

Correlation

Properties of r (cont’d):

- r(a · x + b, c · y + d) = r(x, y), where a > 0, c > 0, and b and d are constants
- r(x, y) = r(y, x)
- r ≠ 0 does not imply causation! Just because two variables are correlated does not necessarily mean that one causes the other!
- r² is called the coefficient of determination
  - 0 ≤ r² ≤ 1
  - Represents the proportion of total variation in one variable that is explained by the other
  - For example: a coefficient of determination of 0.60 between body weight and age means that 60% of the total variation in body weight is explained by age alone, and the remaining 40% by other factors

Page 8

8/43

Correlation

Example: Correlation does not imply causation

[Chart from tylervigen.com: the divorce rate in Maine (per 1,000) correlates with per capita consumption of margarine (lbs), 2000–2009]

Page 9

9/43

Correlation

[Example scatter plots: r = −1, r = 1, r = 0.8, r = −0.8; r = 0, r = 0, 0 < r < 1, −1 < r < 0]

Don't interpret r without looking at the scatter plot!

Page 10

10/43

Correlation

Hypothesis test for the population correlation coefficient ρ:

H0: ρ = 0 (there is no linear relationship between y and x)
H1: ρ ≠ 0 (there is a linear relationship between y and x)

Under H0, the test statistic

T = r \sqrt{\frac{n-2}{1-r^2}}

follows a t-distribution with n − 2 degrees of freedom.

- This test assumes that the variables x and y are normally distributed.
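Plugging in the r = 0.826 and n = 25 that appear in the brain-size example later in this lecture, the test statistic can be computed directly (a sketch; the inputs are rounded values taken from the SPSS output, so the result matches the reported t only approximately):

```python
# Sketch: test statistic for H0: rho = 0, using the (rounded) r and n
# from the brain-size example elsewhere in this lecture.
import math

r, n = 0.826, 25
t = r * math.sqrt((n - 2) / (1 - r ** 2))
df = n - 2
print(round(t, 2), df)  # t is about 7.03 on 23 degrees of freedom
```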

Page 11

11/43

Correlation

Example: Brain size and body weight

[Scatter plot of brain size vs. body weight]

What are the magnitude and sign of the correlation coefficient between brain size and weight?

Page 12

12/43

Correlation

Example: Brain size and body weight

Correlations (SPSS output)

                              Weight   Brain
Weight  Pearson Correlation   1        .826**
        Sig. (2-tailed)                .000
        N                     25       25
Brain   Pearson Correlation   .826**   1
        Sig. (2-tailed)       .000
        N                     25       25

**. Correlation is significant at the 0.01 level (2-tailed).

Page 13

13/43

Simple linear regression

- Pearson's correlation coefficient measures the strength and direction of a linear association between x and y
- Simple linear regression finds an equation (mathematical model) that describes the relationship between the two variables → we can predict values of one variable using values of the other variable
- Unlike correlation, regression requires:
  - a dependent variable y (outcome/response variable): the variable being predicted (always on the vertical or y-axis)
  - an independent variable x (explanatory/predictor variable): the variable used for prediction (always on the horizontal or x-axis)

Page 14

14/43

Simple linear regression

Simple linear regression postulates that in the population

y = (α + β · x) + ε,

where:
- y is the dependent variable
- x is the independent variable
- α and β are parameters called the population regression coefficients
  - α is called the intercept or constant term
  - β is called the slope
- ε is the random error term

Page 15

15/43

Simple linear regression

[Scatter plot: observations of y at x₁, x₂, x₃, x₄, x₅]

Page 16

16/43

Simple linear regression

[Plot: the line E(y|x) = α + β·x through the mean of y at each of x₁, ..., x₅]

E(y|x_i) is the mean value of y when x = x_i

E(y|x) = α + β · x is the population regression function

Page 17

17/43

Simple linear regression

[Plot: the line E(y|x) = α + β·x, showing the intercept α at x = 0 and the slope β]

- α is the y-intercept of the population regression function, i.e. the mean value of y when x equals 0
- β is the slope of the population regression function, i.e. the mean (or expected) change in y associated with a 1-unit increase in the value of x
- c · β is the mean change in y for a c-unit increase in the value of x
- α and β are estimated from the sample data, usually using the least squares method

Page 18

18/43

Simple linear regression

[Plot: fitted line ŷ = a + b·x, with residual e_i = y_i − ŷ_i for each observation]

The least squares method chooses a and b (estimates for α and β) to minimize the sum of the squares of the residuals:

\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (a + b \cdot x_i)]^2

Page 19

19/43

Simple linear regression

The least squares estimates for β and α are:

b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

and

a = \bar{y} - b \cdot \bar{x},

where \bar{x} and \bar{y} are the respective sample means of x and y.

Note that:

b = r(x, y) \cdot \frac{s_y}{s_x},

where r(x, y) is the sample correlation between x and y, and s_x and s_y are the sample standard deviations of x and y.
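A small numeric sketch of these formulas on made-up illustrative data, including a check of the identity b = r(x, y) · s_y / s_x:

```python
# Sketch: least-squares slope and intercept from the formulas above,
# plus the identity b = r * s_y / s_x. Illustrative, made-up data.
import math

x = [2.0, 4.0, 6.0, 8.0]
y = [3.0, 7.0, 8.0, 14.0]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
b = sxy / sxx          # least-squares slope
a = my - b * mx        # least-squares intercept

s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
r = sxy / ((n - 1) * s_x * s_y)
print(b, a, r * s_y / s_x)  # the last value equals b
```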

Page 20

20/43

Simple linear regression

Test of H0: β = 0 versus H1: β ≠ 0

1. t-test:
   - Test statistic: T = b / SE(b), where SE(b) is the standard error of b calculated from the data
   - Under H0, T follows a t-distribution with n − 2 degrees of freedom
2. F-test:
   - Test statistic: F = (b / SE(b))² = T², where SE(b) and T are as above
   - Under H0, F follows an F-distribution with 1 and n − 2 degrees of freedom
- The t-test and the F-test lead to the same outcome

The test of zero intercept α is of less interest, unless x = 0 is meaningful.
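The equivalence F = T² is easy to verify numerically; here a sketch using, as assumed inputs, the rounded B and Std. Error for Weight from the brain-size SPSS output later in the lecture (because the inputs are rounded, the results differ slightly from the t and F that SPSS reports):

```python
# Sketch: t-test vs F-test for the slope. Inputs are the rounded
# B = .176 and Std. Error = .025 for Weight from the SPSS output.
b, se_b = 0.176, 0.025
T = b / se_b   # t statistic
F = T ** 2     # F statistic with 1 and n-2 degrees of freedom
print(round(T, 2), round(F, 2))
```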

Page 21

21/43

Simple linear regression

Example: Brain size (MRI total pixel count per 10,000) and body weight (pounds)

Coefficients (SPSS output)

                Unstandardized Coefficients   Standardized Coefficients
Model           B        Std. Error           Beta      t        Sig.
1  (Constant)   62.334   3.845                          16.212   .000
   Weight       .176     .025                 .826      7.030    .000
a. Dependent Variable: Brain

ANOVA (SPSS output)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  507.387           1   507.387       49.416   .000b
   Residual    236.157          23    10.268
   Total       743.544          24
a. Dependent Variable: Brain
b. Predictors: (Constant), Weight

Estimated regression equation: Brain = 62.33 + 0.18 · Weight

Page 22

22/43

Simple linear regression

Example: Brain size (MRI total pixel count per 10,000) and body weight (pounds)

[Scatter plot of brain size vs. weight with the fitted line y = 62.33 + 0.18·x]

Page 23

23/43

Simple linear regression

Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension¹

[Scatter plot of BP vs. weight]

¹ Daniel, W.W. and Cross, C.L. (2013). Biostatistics: a foundation for analysis in the health sciences, 10th edition.

Page 24

24/43

Simple linear regression

Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension

Correlations (SPSS output)

                              BP       Weight
BP      Pearson Correlation   1        .950**
        Sig. (2-tailed)                .000
        N                     20       20
Weight  Pearson Correlation   .950**   1
        Sig. (2-tailed)       .000
        N                     20       20

**. Correlation is significant at the 0.01 level (2-tailed).

Page 25

25/43

Simple linear regression

Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension

Coefficients (SPSS output)

                Unstandardized Coefficients   Standardized Coefficients
Model           B       Std. Error            Beta      t        Sig.
1  (Constant)   2.205   8.663                           .255     .802
   Weight       1.201   .093                  .950      12.917   .000
a. Dependent Variable: BP

ANOVA (SPSS output)

Model          Sum of Squares   df   Mean Square   F         Sig.
1  Regression  505.472           1   505.472       166.859   .000b
   Residual     54.528          18     3.029
   Total       560.000          19
a. Dependent Variable: BP
b. Predictors: (Constant), Weight

Estimated regression equation: BP = 2.21 + 1.20 · Weight


Page 27

26/43

Simple linear regression

Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension

[Scatter plot of BP vs. weight with the fitted line y = 2.21 + 1.2·x]

Page 28

27/43

Simple linear regression

Standardized coefficients

- Obtained by standardizing both y and x (i.e. converting them into z-scores) and re-running the regression
- The standardized intercept equals zero and the standardized slope for x equals the sample correlation coefficient

[SPSS output for the brain-size example, repeated from earlier slides: the Weight–Brain Pearson correlation is .826 (p < .001, n = 25), which equals the standardized coefficient (Beta) for Weight in the regression of Brain on Weight]

Page 29

28/43

Simple linear regression

Standardized coefficients

- Obtained by standardizing both y and x (i.e. converting them into z-scores) and re-running the regression
- The standardized intercept equals zero and the standardized slope for x equals the sample correlation coefficient
- Of greater importance in multiple linear regression, where the predictors are expressed in different units
  - Standardization removes the dependence of the regression coefficients on the units of measurement of y and the x's, so they can be meaningfully compared
  - The larger the standardized coefficient (in absolute value), the greater the contribution of the respective variable to the prediction of y
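The claim that standardizing both variables gives intercept 0 and slope equal to r can be sketched in plain Python (made-up data for illustration):

```python
# Sketch: regressing z-scores of y on z-scores of x gives intercept 0
# and slope equal to the sample correlation r. Illustrative data.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]
n = len(x)

def zscores(v):
    m = sum(v) / n
    s = math.sqrt(sum((vi - m) ** 2 for vi in v) / (n - 1))
    return [(vi - m) / s for vi in v]

zx, zy = zscores(x), zscores(y)
slope = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
intercept = sum(zy) / n - slope * sum(zx) / n  # z-score means are 0, so this is 0
print(round(slope, 4), round(r, 4))  # the two values agree
```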

Page 30

29/43

Simple linear regression

Linear regression is only appropriate when the following assumptions are satisfied:

1. Independence: the observations are independent, i.e. there is only one pair (x, y) per subject
2. Linearity: the relationship between x and y is linear
3. Constant variance: the variance of y is constant for all values of x
4. Normality: given x, y has a normal distribution (or equivalently, the residuals have a normal distribution)

Page 31

30/43

Simple linear regression

Checking the linearity assumption:

1. Make a scatter plot of y versus x
   - the points should generally form a straight line
2. Plot the residuals against the explanatory variable x
   - the points should present a random scatter around zero; there should be no systematic pattern

[Residual-vs-x plots: "Linearity" (random scatter around zero) vs. "Lack of linearity" (systematic pattern)]

Page 32

31/43

Simple linear regression

Checking the constant variance assumption:

- Make a residual plot, i.e. plot the residuals against the fitted values of y (ŷ_i = a + b · x_i)
  - the points should present a random scatter

[Residual-vs-fitted plots: "Constant variance" (even random scatter) vs. "Non-constant variance" (spread changing with the fitted values)]

Page 33

32/43

Simple linear regression

Example: Blood pressure and body weight

[Residual plots for the blood pressure regression]

Page 34

33/43

Simple linear regression

Checking the normality assumption:

1. Draw a histogram of the residuals and "eyeball" the result
2. Make a normal probability plot (P–P plot) of the residuals, i.e. plot the expected cumulative probability of a normal distribution versus the observed cumulative probability at each value of the residual
   - the points should form a straight diagonal line
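The residuals and fitted values that all of these diagnostic plots are built from can be computed directly (a sketch on made-up data; any plotting library could then draw the histogram, P–P plot, and residual plots):

```python
# Sketch: fitted values and residuals of a least-squares line,
# the raw ingredients of the diagnostic plots. Illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 3.9, 6.1, 8.0, 9.7]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))
a = my - b * mx

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
print([round(e, 3) for e in residuals])  # should scatter around zero
# With an intercept in the model, least squares forces the residuals to sum to 0.
print(round(sum(residuals), 10))
```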

Page 35

34/43

Simple linear regression

Example: Blood pressure and body weight

[Histogram and normal P–P plot of the residuals for the blood pressure regression]

Page 36

35/43

Simple linear regression

Assessing goodness of fit

- The estimated regression line is the "best" one available (in the least-squares sense)
- Yet, it can still be a very poor fit to the observed data

[Scatter plots with fitted lines: "Good fit" (points close to the line) vs. "Bad fit" (points widely scattered around the line)]

Page 37

36/43

Simple linear regression

To assess the goodness of fit of a regression line (i.e. how well the line fits the data) we can:

1. Calculate the correlation coefficient R between the predicted and observed values of y
   - A higher value of R indicates a better fit (predicted and observed values of y are closer to each other)
2. Calculate R² (R Square in SPSS)
   - 0 ≤ R² ≤ 1
   - A higher value of R² indicates a better fit
   - R² = 1 indicates a perfect fit (i.e. ŷ_i = y_i for each i)
   - R² = 0 indicates a very poor fit

Page 38

37/43

Simple linear regression

Alternatively, R² can be calculated as

R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{\text{variation in } y \text{ explained by } x}{\text{total variation in } y}

- R² is interpreted as the proportion of total variability in y explained by the explanatory variable x
  - R² = 1: x explains all variability in y
  - R² = 0: x does not explain any variability in y
- R² is usually expressed as a percentage; e.g., R² = 0.93 indicates that 93% of the total variation in y can be explained by x
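A sketch computing R² as explained-over-total variation on made-up data, using the least-squares formulas from earlier in the lecture to get the fitted line:

```python
# Sketch: R^2 as the ratio of explained to total variation in y.
# Illustrative, made-up data.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.5, 3.6, 4.9, 7.2]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))
a = my - b * mx
yhat = [a + b * xi for xi in x]  # predicted values

explained = sum((fi - my) ** 2 for fi in yhat)  # variation explained by x
total = sum((yi - my) ** 2 for yi in y)         # total variation in y
r2 = explained / total
print(round(r2, 4))  # close to 1: x explains most of the variation in y
```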

Page 39

38/43

Simple linear regression

Example: Blood pressure and body weight

Model Summary (SPSS output)

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .950a   .903       .897                1.74050
a. Predictors: (Constant), Weight

Page 40

39/43

Simple linear regression

Prediction: interpolation versus extrapolation

[Plot: fitted line over the range of the actual data (interpolation), with several possible diverging patterns of additional data outside that range (extrapolation)]

Extrapolation beyond the range of the data is risky!!

Page 41

40/43

Categorical explanatory variable

- So far we have assumed that the predictor variable is numerical
- We can also study an association between y and a categorical x, e.g. between blood pressure and gender, or between brain size and ethnicity
  - Categorical variables can be incorporated through dummy variables that take on the values 0 and 1; to include a categorical variable with p categories, p − 1 dummy variables are required

Page 42

41/43

Categorical explanatory variable

Example: IQ and blood group (4 categories: A, B, AB, 0)

1. Dummy variables for all blood groups:

   x_A = 1 if blood group is A, 0 otherwise
   x_B = 1 if blood group is B, 0 otherwise
   x_AB = 1 if blood group is AB, 0 otherwise
   x_0 = 1 if blood group is 0, 0 otherwise

2. One category is chosen as the reference category
   - a category that results in useful comparisons (e.g. exposed versus non-exposed, experimental versus standard treatment) or a category with a large number of subjects
3. In the model we include all dummies except the one corresponding to the reference category

Page 43

42/43

Categorical explanatory variable

Model with blood group 0 as the reference category:

y = α + β_A · x_A + β_B · x_B + β_AB · x_AB + ε

and its estimated counterpart is

ŷ = a + b_A · x_A + b_B · x_B + b_AB · x_AB

- Estimation of the model parameters requires running a multiple linear regression, unless the explanatory variable has only two categories
- Given that y represents the IQ score, the estimated coefficients are interpreted as follows:
  - a is the mean IQ for subjects with blood group 0, i.e. the reference category
  - Each b represents the mean difference in IQ between subjects with the blood group represented by the respective dummy variable and subjects with blood group 0 (the reference category)

Page 44

43/43

Categorical explanatory variable

Specifically:

- b_A is the mean difference in IQ between subjects with blood group A and subjects with blood group 0
- b_B is the mean difference in IQ between subjects with blood group B and subjects with blood group 0
- b_AB is the mean difference in IQ between subjects with blood group AB and subjects with blood group 0

A test for the significance of a categorical explanatory variable with p levels involves the hypothesis that the coefficients of all p − 1 dummy variables are zero. For that purpose, we need to use an overall F-test (next lecture) and not a t-test. The t-test can be used only when the variable is binary.
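The dummy-variable interpretation can be sketched with hypothetical IQ data. The sketch exploits the fact that, when the only predictors are the dummies for one categorical variable, the least-squares fit reproduces the group means, so the coefficients can be read off directly (the data and group sizes are invented for illustration):

```python
# Hypothetical IQ scores by blood group (invented data).
groups = {"A": [101, 99, 104], "B": [97, 95], "AB": [106, 108], "0": [100, 98, 102]}

ref = "0"  # reference category: blood group 0
# Intercept a = mean IQ in the reference group.
a = sum(groups[ref]) / len(groups[ref])
# Each coefficient b_g = mean difference between group g and the reference group.
b = {g: sum(v) / len(v) - a for g, v in groups.items() if g != ref}
print(a, b)  # e.g. b["B"] is the mean IQ difference between groups B and 0
```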