outline statistics in medicine - yale...


Page 1: Outline Statistics in medicine - Yale Universitytbl.med.yale.edu/files/correlation_mutlivariable_regression/slides.pdf · Statistics in medicine Lecture 4: Correlation and multivariable

11/7/2016


S L I D E 0

Statistics in medicine

Lecture 4: Correlation and multivariable regression

Fatma Shebl, MD, MS, MPH, PhD

Assistant Professor

Chronic Disease Epidemiology Department

Yale School of Public Health

[email protected]

S L I D E 1

Outline

• Correlation

• Regression

–Linear

–Logistic

–Cox’s proportional hazard model

S L I D E 2

Correlation and prediction methods

Correlation and prediction

Regression

Linear Logistic Other

Correlation

Simple

Pearson Spearman Other

Partial

S L I D E 3

Correlation

–Simple correlation

• Is a measure of the strength and direction of the association between two variables measured on numerical scale.

–Measured by correlation coefficient r


S L I D E 4

Correlation

–Simple correlation

–The sign of r indicates the direction of the association

• + sign: positive correlation

• High values of one variable are associated with high values of the second variable

• - sign: negative correlation

• High values of one variable are associated with low values of the second variable

S L I D E 5

Correlation

–Simple correlation

–r ranges between -1 and +1

• +1: perfect positive correlation

• -1: perfect negative correlation

• 0: no linear correlation

S L I D E 6

Correlation

–Simple correlation

–r is unchanged when x and y are interchanged

–r is unchanged by a linear transformation of either variable, such as a change of units (multiplying by a negative constant reverses the sign of r)

–r close to 0 does not mean there is no relationship, i.e. a strong non-linear relationship might exist

–Correlation does NOT indicate causation
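The properties on this slide can be checked numerically. The following is a minimal Python sketch on made-up data (the values and the helper name `pearson_r` are illustrative assumptions, not part of the lecture): it compares r after swapping x and y, after rescaling x, and for a perfect U-shaped relationship.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                             # roles of x and y swapped
r_scaled = pearson_r([2 * a + 10 for a in x], y)   # x rescaled and shifted
r_negated = pearson_r([-a for a in x], y)          # x multiplied by -1

# A perfect but non-linear relationship: v = u^2 on a symmetric range
u = [-2, -1, 0, 1, 2]
v = [a ** 2 for a in u]
r_nonlinear = pearson_r(u, v)

print(r_xy, r_yx, r_scaled, r_negated, r_nonlinear)
```

In this sketch the first three values agree, the fourth has the opposite sign, and the last is zero even though v is completely determined by u.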

S L I D E 7

Correlation: Visualization

–Scatterplot

–A two-dimensional graph displaying the relationship between two numerical characteristics.

–Visualization of correlation

•Also called “joint distribution graph”

•Plot (x,y) to assess the pattern of the relationship


S L I D E 8

Correlation: Scatterplot

[Figure: Scatterplots and correlations. A: r = +1.0; B: r = 0.7; C: r = –0.9; D: r = –0.4; E: r = 0.0; F: r = 0.0. From: Chapter 8, Research Questions About Relationships among Variables, Basic & Clinical Biostatistics, 4e, 2004. Copyright © 2016 McGraw-Hill Education. All rights reserved.]

• Patterns

A. Perfect positive

B. Positive

C. Negative

D. Weak negative

E. Nonexistent

F. Nonlinear

S L I D E 9

Correlation

–Simple correlation

–Types

•Pearson product-moment: interval or ratio scale

•Spearman rank-order: ordinal scale

•Other

S L I D E 10

Pearson product-moment

Used for two numerical normally distributed variables

Test of significance

• 1. Calculate r (correlation coefficient):

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$

• 2. Calculate the degrees of freedom: df = n - 2

• 3. Calculate the test statistic t:

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$

• 4. Find the critical value of t for the chosen significance level

• 5. Draw a conclusion
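The five steps can be traced in a few lines of Python. This is a sketch under stated assumptions: the data are made up, the function name `pearson_test` is hypothetical, and the critical value is copied from a standard t table (two-sided, alpha = 0.05, df = 3) rather than computed.

```python
import math

def pearson_test(x, y):
    """Return (r, df, t) for the Pearson correlation of x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)            # step 1
    df = n - 2                                # step 2
    t = r * math.sqrt(df / (1 - r ** 2))      # step 3 (undefined when |r| = 1)
    return r, df, t

# Made-up illustrative data (n = 5, far below the >30 suggested in the lecture)
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

r, df, t = pearson_test(x, y)

T_CRITICAL = 3.182                            # step 4: t table, df = 3, two-sided 0.05
significant = abs(t) > T_CRITICAL             # step 5

print(f"r = {r:.2f}, df = {df}, t = {t:.3f}, significant: {significant}")
```

With so few observations even a fairly large r does not reach significance here, which is one reason the sample-size assumption matters.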

S L I D E 11

Correlation

–Simple correlation

–Assumptions

• Linear relationship

• Normal distribution

• No outliers

• Large sample size (>30)


S L I D E 12

Spearman’s Rho

Can be used for variables that are NOT normally distributed, as well as for normally distributed variables

Based on ranks

Test of significance

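• 1. Calculate rs (correlation coefficient), where R_X and R_Y are the ranks of X and Y:

$$r_s = \frac{\sum (R_X - \bar{R}_X)(R_Y - \bar{R}_Y)}{\sqrt{\sum (R_X - \bar{R}_X)^2 \sum (R_Y - \bar{R}_Y)^2}}$$

• 2. Calculate the degrees of freedom: df = n - 2

• 3. Calculate the test statistic t:

$$t = r_s\sqrt{\frac{n - 2}{1 - r_s^2}}$$

• 4. Find the critical value of t for the chosen significance level

• 5. Draw a conclusion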
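Because Spearman's coefficient is the Pearson formula applied to ranks, it can be sketched by ranking first and then reusing the same computation. The data and the helper names `average_ranks` and `spearman_rs` below are illustrative assumptions; ties receive the average of their ranks, a common convention that the slide itself does not spell out.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman's rho: the Pearson formula applied to the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_rx, mean_ry = sum(rx) / n, sum(ry) / n
    num = sum((a - mean_rx) * (b - mean_ry) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mean_rx) ** 2 for a in rx) *
                    sum((b - mean_ry) ** 2 for b in ry))
    return num / den

# Made-up data with one extreme x value; only its rank enters the calculation
x = [10, 20, 30, 40, 500]
y = [1, 3, 2, 5, 4]

rs = spearman_rs(x, y)
print(f"rs = {rs:.2f}")
```

The extreme value 500 has the same effect as any other largest value, which is why the rank-based coefficient is less sensitive to outliers.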

S L I D E 13

Correlation

–Simple correlation

• Interpretation of the size of r

–Rule of thumb (absolute value of r)

• 0 to .10: no to trivial correlation

• .10 to .30: very low correlation

• .30 to .50: low correlation

• .50 to .70: moderate correlation

• .70 to .90: high correlation

• .90 to 1: very high correlation

• 1: perfect correlation
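The rule of thumb amounts to a lookup on the absolute value of r. The sketch below is one possible encoding; the function name `describe_r` is hypothetical, and because the slide does not say which band a boundary value such as .30 belongs to, the code assumes it goes to the higher band.

```python
def describe_r(r):
    """Label the size of r using the rule-of-thumb bands.

    Assumption: a boundary value (e.g. 0.30) is assigned to the higher band.
    """
    size = abs(r)
    if size == 1:
        return "perfect"
    bands = [
        (0.90, "very high"),
        (0.70, "high"),
        (0.50, "moderate"),
        (0.30, "low"),
        (0.10, "very low"),
    ]
    for lower_bound, label in bands:
        if size >= lower_bound:
            return label
    return "no to trivial"

for value in (0.05, 0.20, -0.45, 0.65, -0.95, 1.0):
    print(value, describe_r(value))
```

The sign of r is ignored on purpose: it describes direction, while the bands describe strength.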

S L I D E 14

Correlation

–Simple correlation

• Interpretation of the size of r

–The statistical significance of r is affected by sample size

• With a large sample size, even a small r can be statistically significant

–Better interpretation using r2 (known as the coefficient of determination)

• Is the proportion of the variance in one variable that is accounted for by the other variable

• Is a measure of the strength of the relationship

S L I D E 15

Correlation: Example

– A study was conducted to examine whether serum calcium and serum triglycerides are correlated. If the correlation coefficient is 0.20 (20%), interpret r, give the coefficient of determination and its interpretation, and state whether causation can be inferred.

Answer

– There is a weak positive correlation between serum calcium and serum triglycerides (r = .20 falls in the "very low" band of the rule of thumb)

– r2 = .2 x .2 = .04 = 4%

– Interpretation of r2: 4% of the variation in serum calcium is accounted for by serum triglycerides (and vice versa)

– Causation: no, correlation does not indicate causation
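The arithmetic in this answer is small enough to verify directly. A minimal sketch, using only the r = 0.20 given in the example (the variable names are illustrative):

```python
r = 0.20                                  # correlation coefficient from the example

r_squared = r ** 2                        # coefficient of determination
explained_pct = 100 * r_squared           # share of variance accounted for
unexplained_pct = 100 - explained_pct     # share left unaccounted for

print(f"r^2 = {r_squared:.2f}")
print(f"{explained_pct:.0f}% accounted for, {unexplained_pct:.0f}% not accounted for")
```

Seen this way, an r of 0.20 leaves most of the variation unaccounted for, which is why r2 is often easier to interpret than r.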


S L I D E 16

Correlation

–Partial correlation

–Is a measure of the strength and direction of the association between two variables, controlling for one or more other variables

–r ranges between +1 and -1

–Assumptions

• Linear relationship between all pairs of variables

• Normal distribution

• No outliers
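The slide gives no formula. For the simplest case of a single control variable z, the standard first-order partial correlation can be written from the three pairwise coefficients. The sketch below uses that textbook formula; the function name `partial_r` and the input values are hypothetical.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical pairwise correlations
r_xy, r_xz, r_yz = 0.5, 0.6, 0.7

r_xy_given_z = partial_r(r_xy, r_xz, r_yz)
print(f"simple r = {r_xy:.2f}, partial r = {r_xy_given_z:.2f}")
```

In this made-up case most of the simple correlation between x and y disappears once z is controlled for. When z is unrelated to both variables, the partial and simple coefficients coincide.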

S L I D E 17

Multivariable methods

Definition:

–Statistical models that have one dependent (outcome) variable, but include more than one independent variable

S L I D E 18

Multivariable methods

Rationale of the regression equations

–Example: if we hypothesize that cholesterol level is predicted by age, gender, and diabetic status, and we would like to find the line (as in the scatter diagram) that best fits this relationship, we can start from the straight-line equation: Y = a + bX

–Rationale of the straight-line equation in regression

–Cholesterol = age + gender + diabetes

• But these predictors are not all equally important, so we give each predictor a weight (coefficient) relative to its importance

–Cholesterol = (W1)age + (W2)gender + (W3)diabetes

S L I D E 19

Multivariable methods

Rationale of the regression equations

–However, we need a starting point for the calculation, so we add it to the equation

• Cholesterol = starting point + (W1)age + (W2)gender + (W3)diabetes

–Because the prediction of the outcome is usually not perfect, we add an error term

• Cholesterol = starting point + (W1)age + (W2)gender + (W3)diabetes + error term


S L I D E 20

Multivariable methods

Rationale of the regression equations

–The final formula could be expressed as

–y= a+b1x1+b2x2+b3x3+ e

• Also written as

–y= β0+β1x1+β2x2+β3x3 + ε

S L I D E 21

Multivariable methods

Rationale of the regression equations

y= a+b1x1+b2x2+b3x3+ e

• Also written as

y= β0+β1x1+β2x2+β3x3 + ε

– Interpretation of the symbols

• a (β0): intercept, i.e. where the line crosses the y-axis

• b (β1…k): regression coefficients (slopes), i.e. the amount y changes each time x changes by 1 unit

• e (ε): error term (residual), i.e. the distance by which the actual value of y departs from the regression line

This equation is commonly referred to as the general linear model

S L I D E 22

Multivariable methods

Rationale of the regression equations

– Estimation of the best estimates (least-squares method)

• Observed y and x are known, therefore e has to be calculated

• Use different values of a and b to calculate the predicted y (y hat)

ŷ= a+b1x1+b2x2+b3x3

• e is then calculated as: y-ŷ

• The best estimate is the one with the least error, i.e. the one that minimizes Σe2 = Σ(y-ŷ)2, "the sum of the squared error terms"
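The least-squares idea can be sketched for the simplest case of one predictor, where the minimizing a and b have a closed form. The slide's equation has three predictors; a single x is used here only to keep the sketch short. The data and function names are illustrative assumptions.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def sum_squared_error(a, b, x, y):
    """Sum of squared residuals e = y - y-hat for a candidate line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up illustrative data
x = [1, 2, 3, 4]
y = [3, 5, 7, 10]

a, b = fit_line(x, y)
best = sum_squared_error(a, b, x, y)

# Any other candidate line gives a larger sum of squared errors
worse_slope = sum_squared_error(a, b + 0.1, x, y)
worse_intercept = sum_squared_error(a + 0.5, b, x, y)

print(f"a = {a:.2f}, b = {b:.2f}, SSE = {best:.2f}")
print(f"SSE with other candidates: {worse_slope:.2f}, {worse_intercept:.2f}")
```

Statistical software performs the same minimization for any number of predictors, so trying different values of a and b by hand is never needed in practice.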

S L I D E 23

Multivariable methods

[Figure: Least squares regression line. Geometric interpretation of a regression line. From: Chapter 8, Research Questions About Relationships among Variables, Basic & Clinical Biostatistics, 4e, 2004. Copyright © 2016 McGraw-Hill Education. All rights reserved.]


S L I D E 24

Multivariable methods

Applications

– Test for interaction

– Adjust for confounding

– Predict future values of y given x

S L I D E 25

Multivariable methods

–The types of models described by the previous equation are referred to as “general linear models”

• General because they can accommodate different types of y and/or x

• Linear because the model is a linear combination of the x terms

–Commonly used methods

• Linear regression

• Logistic regression

• Survival

S L I D E 26

Readings and resources

• Chapter 8, p190-220: Dawson, B. and Trapp, R. G. (2004). Basic and Clinical Biostatistics (4th edition). New York: McGraw-Hill

• Chapter 9, p221-244: Dawson, B. and Trapp, R. G. (2004). Basic and Clinical Biostatistics (4th edition). New York: McGraw-Hill.

• Chapter 10, p245-263: Dawson, B. and Trapp, R. G. (2004). Basic and Clinical Biostatistics (4th edition). New York: McGraw-Hill.

• Chapter 11, p147-151: Jekel's epidemiology, biostatistics, preventive medicine, and public health by David L. Katz et al (4th edition).

• Chapter 13, p163-170: Jekel's epidemiology, biostatistics, preventive medicine, and public health by David L. Katz et al (4th edition).

S L I D E 27

Statistics in medicine

Lecture 4 part 2: Correlation and multiple regression

Fatma Shebl, MD, MS, MPH, PhD

Assistant Professor

Chronic Disease Epidemiology Department

Yale School of Public Health

[email protected]


S L I D E 28

Outline

• Regression

–Linear

–Logistic

–Cox’s proportional hazard model

S L I D E 29

Correlation and prediction methods

Correlation and prediction

Regression

Linear Logistic Other

Correlation

Simple

Pearson Spearman Other

Partial

S L I D E 30

Linear regression

– General linear model in which the dependent variable is a continuous variable

– Types

• Single continuous predictor: simple linear regression

• Multiple continuous predictors: multiple linear regression

• Single categorical predictor: one-way ANOVA

• Multiple categorical predictors: N-way ANOVA

• Some categorical and some continuous predictors: analysis of covariance (ANCOVA)

S L I D E 31

Linear regression

– Assumptions

• Linearity: the relation is linear between each independent variable and the dependent variable

• Independence: The values of Y are independent

• Homogeneity of variance: the variance of Y is equal across the range of X


S L I D E 32

Linear regression

–Steps:

• Build up the model

• Assess model fit

• Interpret the regression coefficient

S L I D E 33

Linear regression

–Steps:

• Build up the model

• The most common method is stepwise (it is automated in most programs)

• Start with one variable in the model (the main predictor, if one is hypothesized)

• Add another variable

• Keep adding variables to the list of variables already in the model

• Use a stopping criterion such as:

• The increase in R2 is <.01

S L I D E 34

Linear regression

–Steps:

• Build up the model

• R2

• Is a measure of how much of the variation of the outcome is accounted for by the explanatory variables

• Range 0-1

• 0: no variance accounted for

• 1: all the variance (100%) accounted for

S L I D E 35

Linear regression

–Steps:

• Assess model fit

• Residuals, “the part of Y that is not explained by X”, can be used to assess the model fit

• Plot the residuals (on the Y axis) versus X

• The mean of the residuals is zero; if the model fits the data, the residuals show no systematic pattern (trend or curve) when plotted against X


S L I D E 36

Linear regression

[Figure: Illustration of analysis of residuals. A: Linear relationship between X and Y. B: Residuals versus values of X for relation in part A. C: Curvilinear relationship between X and Y. D: Residuals versus values of X for relation in part C. From: Chapter 8, Research Questions About Relationships among Variables, Basic & Clinical Biostatistics, 4e, 2004. Copyright © 2016 McGraw-Hill Education. All rights reserved.]

• Assess model fit

– Good fit: the residuals form a random scatter around the zero line
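What a poor fit looks like can be sketched numerically by fitting a straight line to curvilinear data and listing the residuals in order of X. The data (y = x squared) and the helper name `line_residuals` are illustrative assumptions.

```python
def line_residuals(x, y):
    """Fit y-hat = a + b*x by least squares and return the residuals y - y-hat."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Curvilinear relationship (made-up): y = x^2
x = [1, 2, 3, 4, 5]
y = [xi ** 2 for xi in x]

residuals = line_residuals(x, y)
mean_residual = sum(residuals) / len(residuals)

print("residuals by increasing x:", residuals)
print("mean residual:", mean_residual)
```

The mean residual is zero here even though the line is the wrong model, so the mean alone says nothing about fit. The signs of the residuals run positive, negative, positive across X, which is the curved pattern that a residual plot makes visible.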

S L I D E 37

Linear regression

–Steps:

• Interpret the regression coefficient

–The intercept: it is the expected value of Y when all Xs = zero

• If X cannot be zero, the intercept is not meaningful

• If the interest is in the relationship between X and Y, the intercept is not of interest and will not affect the conclusion

• If the interest is in the prediction of Y from X, then X has to be re-scaled (centered) so that the intercept is the expected value of Y at the chosen X value

S L I D E 38

Linear regression

–Steps:

• Interpret the regression coefficient

–The slope:

• If X is continuous: it is the change in Y for a one-unit increase in X, holding other variables (other Xs) constant

• If X is categorical: it is the mean difference in Y between one category and the reference category of X, holding other variables (other Xs) constant

S L I D E 39

Linear regression

–Steps:

• Interpret the regression coefficient

–Compare the test statistic (t) with the critical value of significance for the relevant df

–The p value:

• Null hypothesis: each coefficient (intercept or slope) = zero

• If p < the predetermined significance level, reject the null


S L I D E 40

Linear regression

An example and interpretation

–A study was conducted to examine the association between insulin sensitivity scores (outcome) and BMI. The resultant regression equation was Y’=1.5817 – 0.0433X. Interpret the terms in the equation.

–Y’: predicted insulin sensitivity

–1.5817: the intercept, i.e. the predicted insulin sensitivity for a patient with zero BMI (unrealistic) is 1.5817

From: Basic & Clinical Biostatistics, 4e, 2004

S L I D E 41

Linear regression

An example and interpretation

–A study was conducted to examine the association between insulin sensitivity scores (outcome) and BMI. The resultant regression equation was Y’=1.5817 – 0.0433X. Interpret the terms in the equation.

o X: observed BMI value

o -0.0433: the slope, i.e. when BMI increases by 1 unit, predicted insulin sensitivity decreases by 0.0433

From: Basic & Clinical Biostatistics, 4e, 2004
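The interpretation of the slope can be confirmed by plugging two BMI values one unit apart into the reported equation. The intercept and slope come from the example; the function name and the BMI values 25 and 26 are illustrative assumptions.

```python
INTERCEPT = 1.5817   # from the reported equation Y' = 1.5817 - 0.0433X
SLOPE = -0.0433

def predict_insulin_sensitivity(bmi):
    """Predicted insulin sensitivity for a given BMI."""
    return INTERCEPT + SLOPE * bmi

at_25 = predict_insulin_sensitivity(25)
at_26 = predict_insulin_sensitivity(26)
change_per_unit = at_26 - at_25

print(f"BMI 25: {at_25:.4f}")
print(f"BMI 26: {at_26:.4f}")
print(f"change for a one-unit increase in BMI: {change_per_unit:.4f}")
```

The difference between the two predictions equals the slope whichever pair of BMI values one unit apart is chosen.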

S L I D E 42

Logistic regression

– General linear model in which the dependent variable is a nominal (categorical) variable

– Types

• Commonly used for dichotomous dependent variable

• Could be used for multinomial dependent variable

S L I D E 43

Logistic regression

– Uses a mathematical function (the logit) to transform the regression so that the predicted probability of y is limited to (0,1)

– Logit(p(y=1|xs)) = log(p/(1-p)) = β0+β1x1+β2x2+…+βkxk

• Translated into the probability of the dependent variable as an exponential function of the independent variables

$$p(y=1) = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k)]}$$
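The two forms of the model are inverses of each other, which a short sketch can confirm. The coefficients and x values below are hypothetical and chosen only for illustration, as are the function names.

```python
import math

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def predicted_probability(intercept, coefficients, xs):
    """p(y=1) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, xs))
    return 1 / (1 + math.exp(-linear_predictor))

# Hypothetical fitted coefficients and one subject's predictor values
b0 = -2.0
b = [0.5, 1.2]
xs = [2, 1]

p = predicted_probability(b0, b, xs)   # linear predictor is -2.0 + 1.0 + 1.2 = 0.2
log_odds = logit(p)                    # recovers the linear predictor

print(f"p(y=1) = {p:.4f}, logit(p) = {log_odds:.4f}")
```

Whatever values the linear predictor takes, the resulting probability stays strictly between 0 and 1, which is the reason for using the logit transformation.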


S L I D E 44

Logistic regression

– Types

• Commonly used for dichotomous dependent variable

• Could be used for multinomial dependent variable

S L I D E 45

Logistic regression

–Steps:

• Build up the model

–Similar to linear regression

• Assess model fit

–Hosmer and Lemeshow’s goodness-of-fit test: a p value >.05 indicates an acceptable fit

• Interpret the regression coefficient

S L I D E 46

Logistic regression

– Steps:

• Interpret the regression coefficient

–exp(β0): the odds that y=1, given all xs=0

–exp(β1):

• If x is categorical: is the odds ratio of y=1 in one category of x compared to the reference category of x, holding other variables (other xs) constant

• If x is continuous: is the odds ratio of y=1 for a one-unit increase in x (the factor by which the odds are multiplied), holding other variables (other xs) constant

S L I D E 47

Logistic regression

– Practical coding issues:

• Interpretation of the results depends on how you code your data, therefore

–It is important to check how the outcome is coded, i.e. which level is coded as 1 and which level is coded as 2 (or 0)

–It is important to check how the predictors are coded

• Common practice is to code binary predictors as 0,1


S L I D E 48

Logistic regression

An example and interpretation

– Blood alcohol concentration (BAC) >50mg/dL was examined among men with unintentional injury who were admitted to the emergency room. Predictors of BAC were daytime, weekday, being Caucasian, and age of 40. The reported OR (95% CI) for being Caucasian and for age of 40 were 1.32 (1.06-1.65) and 0.89 (0.72-1.10), respectively. Interpret the results.

From: Basic & Clinical Biostatistics, 4e, 2004

S L I D E 49

Logistic regression

An example and interpretation

–Answer

• Caucasians were significantly more likely to have elevated BAC than other races (the 95% CI, 1.06-1.65, excludes 1).

• Age did not significantly predict elevated BAC (the 95% CI, 0.72-1.10, includes 1).

From: Basic & Clinical Biostatistics, 4e, 2004
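The reasoning behind this answer is a check of whether each 95% confidence interval contains 1, the value of an odds ratio under no association. A minimal sketch using the two reported results (the function name and the wording of the labels are illustrative):

```python
import math

def interpret_odds_ratio(lower, upper):
    """Classify an odds ratio by whether its 95% CI excludes 1 (5% level)."""
    if lower > 1:
        return "significantly higher odds"
    if upper < 1:
        return "significantly lower odds"
    return "not significant"

# (OR, lower 95% limit, upper 95% limit) as reported in the example
reported = {
    "Caucasian": (1.32, 1.06, 1.65),
    "Age": (0.89, 0.72, 1.10),
}

results = {}
for predictor, (odds_ratio, lower, upper) in reported.items():
    coefficient = math.log(odds_ratio)      # the regression coefficient is ln(OR)
    results[predictor] = interpret_odds_ratio(lower, upper)
    print(f"{predictor}: OR = {odds_ratio}, b = {coefficient:.3f}, {results[predictor]}")
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient, so the sign of b and the direction of the odds ratio always agree.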

S L I D E 50

Survival analysis

–Definition: The statistical methods for analyzing survival data when there are censored observations

–Censored observation: an observation whose value is unknown, generally because the subject has not been in the study long enough for the outcome of interest, such as death, to occur

S L I D E 51

Survival analysis

–Common methods of summarization and presentation

• Person-time

• Life-tables


S L I D E 52

Survival analysis

–Common methods of summarization and presentation

• Person-time

–Person-time is the total length of follow-up summed over subjects

• Ex. If two subjects were followed, one for 2 years and the second for one year, then the total person-time is three person-years

–Could be used to calculate incidence density

• Incidence density is the number of events divided by the total person-time

• Useful method if the event could be recurrent
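The person-time example translates directly into code. The follow-up times (2 years and 1 year) come from the slide; the single event is a hypothetical count added only to illustrate the incidence density calculation, and the function name is illustrative.

```python
def incidence_density(events, follow_up_years):
    """Number of events divided by the total person-time of follow-up."""
    total_person_years = sum(follow_up_years)
    return events / total_person_years

follow_up_years = [2, 1]      # two subjects, as in the slide's example
events = 1                    # hypothetical: one event observed

total_person_years = sum(follow_up_years)
rate = incidence_density(events, follow_up_years)

print(f"total person-time: {total_person_years} person-years")
print(f"incidence density: {rate:.3f} events per person-year")
```

Because the denominator is time rather than people, a subject with recurrent events can contribute more than one event to the numerator.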

S L I D E 53

Survival analysis

–Common methods of summarization and presentation

• Life-tables (covered in the epi course)

–Two methods

• Kaplan-Meier method

• Actuarial method

–Requirements

• Date of entry

• Date of withdrawal

• Cause of withdrawal: death or loss to follow-up

S L I D E 54

Survival analysis

–Common methods to test significance in survival analysis

• Logrank test (bivariate)

• Mantel-Haenszel chi-square test (bivariate)

• Cox’s proportional hazard model

S L I D E 55

Cox proportional hazard model

– Regression model when there is censored outcome data

• Data are said to be censored in case of

–Loss to follow-up

–End of the study

– The dependent variable is the survival time (time to event)

$$h(t, X_1, X_2, \dots, X_k) = h_0(t)\, e^{b_1 x_1 + b_2 x_2 + \dots + b_k x_k}$$
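One consequence of this formula is that the baseline hazard h0(t) cancels when two subjects are compared, leaving a ratio that does not depend on time. The sketch below illustrates this with hypothetical coefficients for an exposure indicator and age; the function names and all values are assumptions made for illustration.

```python
import math

def relative_hazard(coefficients, xs):
    """exp(b1*x1 + ... + bk*xk): a subject's hazard relative to the baseline h0(t)."""
    return math.exp(sum(b * x for b, x in zip(coefficients, xs)))

def hazard_ratio(coefficients, xs_a, xs_b):
    """Hazard of subject A divided by hazard of subject B; h0(t) cancels."""
    return relative_hazard(coefficients, xs_a) / relative_hazard(coefficients, xs_b)

# Hypothetical coefficients: exposure (0/1) and age in years
b = [0.7, 0.03]

exposed_50 = [1, 50]
unexposed_50 = [0, 50]

hr_exposure = hazard_ratio(b, exposed_50, unexposed_50)
print(f"hazard ratio, exposed vs unexposed at the same age: {hr_exposure:.3f}")
print(f"exp(b1) = {math.exp(b[0]):.3f}")
```

Holding the other variable constant, the hazard ratio for exposure reduces to exp(b1), which parallels the interpretation of exp(β) as an odds ratio in logistic regression.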


S L I D E 56

Cox proportional hazard model

–Answers the question “what is the likelihood of the event (e.g. dying) in the next interval, given survival up to this time and given a set of independent variables?”

S L I D E 57

Cox proportional hazard model

–Allows estimating the relative risk (also called the hazard ratio)

• The hazard answers the question: what is the risk of an event (such as death) at a given time, given that it has not occurred until that time?

• The hazard ratio is the ratio of the risk of the event at a given time in the exposed to that risk in the unexposed

–Assessing the assumption of proportional hazards is beyond the scope of this class

S L I D E 58

Readings and resources

• Chapter 8, p190-220: Dawson, B. and Trapp, R. G. (2004). Basic and Clinical Biostatistics (4th edition). New York: McGraw-Hill

• Chapter 9, p221-244: Dawson, B. and Trapp, R. G. (2004). Basic and Clinical Biostatistics (4th edition). New York: McGraw-Hill.

• Chapter 10, p245-263: Dawson, B. and Trapp, R. G. (2004). Basic and Clinical Biostatistics (4th edition). New York: McGraw-Hill.

• Chapter 11, p147-151: Jekel's epidemiology, biostatistics, preventive medicine, and public health by David L. Katz et al (4th edition).

• Chapter 13, p163-170: Jekel's epidemiology, biostatistics, preventive medicine, and public health by David L. Katz et al (4th edition).