1/43
Basic medical statistics for clinical and experimental research
Correlation and simple linear regression (S5)
Sander [email protected]
December 4, 2019
2/43
Introduction
Example: Brain size and body weight
Subject id   Brain size (MRI total pixel count per 10,000)   Weight (pounds)
 1           81.69                                           118
 2           96.54                                           172
 3           92.88                                           146
 4           90.49                                           134
 5           95.55                                           172
 6           83.39                                           118
 7           92.41                                           155
 …           …                                               …
24           79.06                                           122
25           98.00                                           190
3/43
Introduction
Example: Brain size and body weight
[Scatter plot of brain size versus body weight, with Subjects 4 and 24 labelled.]
4/43
Relationship between two numerical variables
If a linear relationship between x and y appears to be reasonable from the scatter plot, we can take the next step and

1. Calculate Pearson’s correlation coefficient between x and y
   - Measures how closely the data points on the scatter plot resemble a straight line
2. Perform a simple linear regression analysis
   - Finds the equation of the line that best describes the relationship between the variables seen in the scatter plot
5/43
Correlation
The sample Pearson correlation coefficient (or product moment correlation coefficient) between variables x and y is calculated as

r(x, y) = 1/(n − 1) · Σ_{i=1}^{n} [(x_i − x̄)/s_x] · [(y_i − ȳ)/s_y]

where:
- {(x_i, y_i): i = 1, …, n} is a random sample of n observations on x and y,
- x̄ and ȳ are the sample means of x and y, respectively,
- s_x and s_y are the sample standard deviations of x and y, respectively.
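As a sanity check, the formula can be computed directly. The sketch below (plain Python, standard library only) uses the first seven (weight, brain size) pairs from the example table, treating weight as x and brain size as y.

```python
from statistics import mean, stdev

# First seven (weight, brain size) pairs from the example table
weight = [118, 172, 146, 134, 172, 118, 155]               # pounds
brain = [81.69, 96.54, 92.88, 90.49, 95.55, 83.39, 92.41]  # MRI pixel count / 10,000

def pearson_r(x, y):
    """Sample Pearson correlation: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)   # sample standard deviations (n - 1 divisor)
    return sum((xi - xbar) / sx * (yi - ybar) / sy
               for xi, yi in zip(x, y)) / (n - 1)

r = pearson_r(weight, brain)   # strongly positive on this subset (about 0.95)
```

On these seven subjects r is already close to the value SPSS reports for the full sample of 25 on a later slide.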
6/43
Correlation
Properties of r:
- r estimates the true population correlation coefficient ρ
- r takes on any value between −1 and 1
- The magnitude of r indicates the strength of a linear relationship between x and y:
  - r = −1 or 1 means a perfect linear association
  - The closer r is to −1 or 1, the stronger the linear association (e.g. r = −0.1 (weak association) vs r = 0.85 (strong association))
  - r = 0 indicates no linear association (but there can be e.g. a non-linear one)
- The sign of r indicates the direction of association:
  - r > 0 implies a positive relationship, i.e. the two variables tend to move in the same direction
  - r < 0 implies a negative relationship, i.e. the two variables tend to move in opposite directions
7/43
Correlation
Properties of r (cont’d):
- r(a · x + b, c · y + d) = r(x, y), where a > 0, c > 0, and b and d are constants
- r(x, y) = r(y, x)
- r ≠ 0 does not imply causation! Just because two variables are correlated does not necessarily mean that one causes the other!
- r² is called the coefficient of determination
  - 0 ≤ r² ≤ 1
  - Represents the proportion of the total variation in one variable that is explained by the other
  - For example: a coefficient of determination between body weight and age of 0.60 means that 60% of the total variation in body weight is explained by age alone and the remaining 40% by other factors
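The invariance and symmetry properties above are easy to verify numerically. The data in this sketch are made-up illustration values, not taken from the slides.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((xi - xbar) / sx * (yi - ybar) / sy
               for xi, yi in zip(x, y)) / (n - 1)

# Made-up illustration data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r = pearson_r(x, y)
r2 = r ** 2   # coefficient of determination, 0 <= r2 <= 1

# r(a*x + b, c*y + d) = r(x, y) for a > 0, c > 0:
# e.g. a change of units (pounds -> kg) plus arbitrary offsets
x_scaled = [0.4536 * xi + 2.0 for xi in x]
y_scaled = [3.0 * yi + 7.0 for yi in y]
assert abs(pearson_r(x_scaled, y_scaled) - r) < 1e-9

# Symmetry: r(x, y) = r(y, x)
assert abs(pearson_r(y, x) - r) < 1e-9
```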
8/43
Correlation
Example: Correlation does not imply causation
[Chart from tylervigen.com: the divorce rate in Maine (3.96–4.95 per 1,000) tracks per-capita margarine consumption (2–8 lbs) over the years 2000–2009.]
9/43
Correlation

[Gallery of scatter plots: r = −1, r = 1, r = 0.8, r = −0.8 (top row); two patterns with r = 0, one with 0 < r < 1, and one with −1 < r < 0 (bottom row).]

Don’t interpret r without looking at the scatter plot!
10/43
Correlation
Hypothesis test for the population correlation coefficient ρ:

H0: ρ = 0 (there is no linear relationship between y and x)
H1: ρ ≠ 0 (there is a linear relationship between y and x)

Under H0, the test statistic

T = r · √((n − 2) / (1 − r²))

follows a t-distribution with n − 2 degrees of freedom.

- This test assumes that the variables x and y are normally distributed
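Plugging the numbers from the brain-size example (r = .826, n = 25, shown on the following slides) into the statistic reproduces the t value reported by SPSS. A two-sided p-value would additionally require a t-distribution CDF (e.g. from scipy.stats), which this stdlib-only sketch omits.

```python
from math import sqrt

# r and n from the brain-size example: r = .826, n = 25 subjects
r, n = 0.826, 25
df = n - 2                              # degrees of freedom of the t-distribution
T = r * sqrt((n - 2) / (1 - r ** 2))    # test statistic under H0: rho = 0
# T comes out near 7.03, matching the SPSS "t" for Weight on the regression slide
```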
11/43
Correlation
Example: Brain size and body weight
What is the magnitude and sign of the correlation coefficient between brain size and weight?
12/43
Correlation
Example: Brain size and body weight
Correlations (SPSS output)

                               Weight    Brain
Weight   Pearson Correlation   1         .826**
         Sig. (2-tailed)                 .000
         N                     25        25
Brain    Pearson Correlation   .826**    1
         Sig. (2-tailed)       .000
         N                     25        25

**. Correlation is significant at the 0.01 level (2-tailed).
13/43
Simple linear regression
- Pearson’s correlation coefficient measures the strength and direction of a linear association between x and y
- Simple linear regression finds an equation (mathematical model) that describes the relationship between the two variables → we can predict values of one variable using values of the other variable
- Unlike correlation, regression requires
  - a dependent variable y (outcome/response variable): the variable being predicted (always on the vertical or y-axis)
  - an independent variable x (explanatory/predictor variable): the variable used for prediction (always on the horizontal or x-axis)
14/43
Simple linear regression
Simple linear regression postulates that in the population

y = (α + β · x) + ε,

where:
- y is the dependent variable
- x is the independent variable
- α and β are parameters called the population regression coefficients
  - α is called the intercept or constant term
  - β is called the slope
- ε is the random error term
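The population model can be made concrete by simulating from it. The parameter values below (α = 60, β = 0.2, error SD 3) are arbitrary illustration choices, not values from the slides.

```python
import random

random.seed(1)  # reproducible illustration

# Population model: y = alpha + beta*x + eps, with eps ~ N(0, sigma^2)
alpha, beta, sigma = 60.0, 0.2, 3.0   # illustrative parameter values
x = [random.uniform(110.0, 190.0) for _ in range(200)]
y = [alpha + beta * xi + random.gauss(0.0, sigma) for xi in x]

# With enough data, least squares recovers the population coefficients closely
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar   # the estimates (a, b) should land near (alpha, beta)
```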
15/43
Simple linear regression
[Scatter plot: several observed values of y at each of x1, x2, x3, x4, x5.]
16/43
Simple linear regression
[The same scatter plot with the population regression line E(y|x) = α + β·x passing through the mean of y at each of x1, …, x5.]

E(y|xi) is the mean value of y when x = xi
E(y|x) = α + β · x is the population regression function
17/43
Simple linear regression
[Graph of E(y|x) = α + β·x: the line crosses the y-axis at α; a 1-unit increase in x raises the mean of y by β, and a 3-unit increase raises it by 3β.]

- α is the y-intercept of the population regression function, i.e. the mean value of y when x equals 0
- β is the slope of the population regression function, i.e. the mean (or expected) change in y associated with a 1-unit increase in the value of x
- c · β is the mean change in y for a c-unit increase in the value of x
- α and β are estimated from the sample data, usually using the least squares method
18/43
Simple linear regression
[Scatter plot with the fitted line ŷ = a + b·x; for each observation, the residual e_i = y_i − ŷ_i is the vertical distance from the point to the line.]

The least squares method chooses a and b (the estimates for α and β) to minimize the sum of the squared residuals:

Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} [y_i − (a + b · x_i)]²
19/43
Simple linear regression
The least squares estimates of β and α are:

b = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²

and

a = ȳ − b · x̄,

where x̄ and ȳ are the respective sample means of x and y.

Note that:

b = r(x, y) · s_y / s_x,

where r(x, y) is the sample correlation between x and y, and s_x and s_y are the sample standard deviations of x and y.
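A minimal sketch of these estimates, using made-up weight/blood-pressure numbers, also checks the b = r(x, y) · s_y/s_x identity stated above.

```python
from statistics import mean, stdev

def least_squares(x, y):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    xbar, ybar = mean(x), mean(y)
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar
    return a, b

def pearson_r(x, y):
    xbar, ybar = mean(x), mean(y)
    return (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / ((len(x) - 1) * stdev(x) * stdev(y)))

# Made-up weights (kg) and blood pressures (mmHg), for illustration only
x = [85.0, 90.0, 92.0, 95.0, 100.0, 105.0]
y = [105.0, 110.0, 112.0, 116.0, 120.0, 126.0]

a, b = least_squares(x, y)

# The identity from this slide: b = r(x, y) * s_y / s_x
assert abs(b - pearson_r(x, y) * stdev(y) / stdev(x)) < 1e-9
```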
20/43
Simple linear regression
Test of H0: β = 0 versus H1: β ≠ 0

1. t-test:
   - Test statistic: T = b / SE(b), where SE(b) is the standard error of b calculated from the data
   - Under H0, T follows a t-distribution with n − 2 degrees of freedom
2. F-test:
   - Test statistic: F = (b / SE(b))² = T², where SE(b) and T are as above
   - Under H0, F follows an F-distribution with 1 and n − 2 degrees of freedom
   - The t-test and the F-test lead to the same outcome

The test of zero intercept α is of less interest, unless x = 0 is meaningful.
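The agreement F = T² can be verified numerically. SE(b) is computed here with the usual textbook formula SE(b) = √(s² / Σ(x_i − x̄)²), where s² is the residual mean square; the data are made up for illustration.

```python
from math import sqrt
from statistics import mean

# Made-up data, for illustration only
x = [85.0, 90.0, 92.0, 95.0, 100.0, 105.0]
y = [106.0, 109.0, 113.0, 115.0, 121.0, 125.0]
n = len(x)

xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in residuals) / (n - 2)   # residual mean square
se_b = sqrt(s2 / sxx)                            # standard error of b
T = b / se_b                                     # t statistic, df = n - 2

# F from the ANOVA decomposition: regression mean square / residual mean square
ms_reg = sum((a + b * xi - ybar) ** 2 for xi in x) / 1
F = ms_reg / s2
assert abs(F - T ** 2) < 1e-6                    # the two tests agree: F = T^2
```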
21/43
Simple linear regression
Example: Brain size (MRI total pixel count per 10,000) and body weight (pounds)
Coefficients (SPSS output; dependent variable: Brain)

                 Unstandardized Coefficients   Standardized Coefficients
Model            B         Std. Error          Beta        t        Sig.
1  (Constant)    62.334    3.845                           16.212   .000
   Weight        .176      .025                .826        7.030    .000

ANOVA (SPSS output; dependent variable: Brain; predictors: (Constant), Weight)

Model           Sum of Squares    df    Mean Square    F        Sig.
1  Regression   507.387           1     507.387        49.416   .000
   Residual     236.157           23    10.268
   Total        743.544           24
Brain = 62.33 + 0.18 · Weight
22/43
Simple linear regression

Example: Brain size (MRI total pixel count per 10,000) and body weight (pounds)

[Scatter plot of Brain (75–100) versus Weight (100–200) with the fitted line y = 62.33 + 0.18·x.]
23/43
Simple linear regression
Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension¹

[Scatter plot of BP (105–125 mmHg) versus Weight (85–105 kg).]

¹ Daniel, W.W. and Cross, C.L. (2013). Biostatistics: a foundation for analysis in the health sciences, 10th edition.
24/43
Simple linear regression
Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension
Correlations (SPSS output)

                               BP        Weight
BP       Pearson Correlation   1         .950**
         Sig. (2-tailed)                 .000
         N                     20        20
Weight   Pearson Correlation   .950**    1
         Sig. (2-tailed)       .000
         N                     20        20

**. Correlation is significant at the 0.01 level (2-tailed).
25/43
Simple linear regression
Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension
Coefficients (SPSS output; dependent variable: BP)

                 Unstandardized Coefficients
Model            B        Std. Error          Beta    t        Sig.
1  (Constant)    2.205    8.663                       .255     .802
   Weight        1.201    .093                .950    12.917   .000

ANOVA (SPSS output; dependent variable: BP; predictors: (Constant), Weight)

Model           Sum of Squares    df    Mean Square    F         Sig.
1  Regression   505.472           1     505.472        166.859   .000
   Residual     54.528            18    3.029
   Total        560.000           19
BP = 2.21 + 1.20 · Weight
26/43
Simple linear regression

Example: Blood pressure (mmHg) and body weight (kg) in 20 patients with hypertension

[Scatter plot of BP versus Weight with the fitted line y = 2.21 + 1.2·x.]
27/43
Simple linear regression
Standardized coefficients
- Obtained by standardizing both y and x (i.e. converting them into z-scores) and re-running the regression
- The standardized intercept equals zero, and the standardized slope for x equals the sample correlation coefficient

[SPSS output for the brain-size example: the Pearson correlation between Weight and Brain is .826, and the standardized coefficient (Beta) for Weight in the regression of Brain on Weight is likewise .826.]
28/43
Simple linear regression
Standardized coefficients
- Obtained by standardizing both y and x (i.e. converting them into z-scores) and re-running the regression
- The standardized intercept equals zero, and the standardized slope for x equals the sample correlation coefficient
- Of greater concern in multiple linear regression, where the predictors are expressed in different units
  - Standardization removes the dependence of the regression coefficients on the units of measurement of y and the x’s, so they can be meaningfully compared
  - The larger the standardized coefficient (in absolute value), the greater the contribution of the respective variable to the prediction of y
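The claim that the standardized slope equals r can be checked directly. The sketch below reuses the first seven subjects of the brain-size table and runs plain least squares on the z-scores.

```python
from statistics import mean, stdev

# Weight and brain size for the first seven subjects of the example table
weight = [118, 172, 146, 134, 172, 118, 155]
brain = [81.69, 96.54, 92.88, 90.49, 95.55, 83.39, 92.41]

def zscores(v):
    """Standardize a variable: subtract the mean, divide by the sample SD."""
    m, s = mean(v), stdev(v)
    return [(vi - m) / s for vi in v]

zx, zy = zscores(weight), zscores(brain)

# Least-squares fit on the standardized variables
zxbar, zybar = mean(zx), mean(zy)   # both are (numerically) zero
b_std = (sum((u - zxbar) * (v - zybar) for u, v in zip(zx, zy))
         / sum((u - zxbar) ** 2 for u in zx))
a_std = zybar - b_std * zxbar

# Pearson correlation of the raw variables
n = len(weight)
r = sum(u * v for u, v in zip(zx, zy)) / (n - 1)

assert abs(a_std) < 1e-9        # standardized intercept is zero
assert abs(b_std - r) < 1e-9    # standardized slope equals r
```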
29/43
Simple linear regression
Linear regression is only appropriate when the following assumptions are satisfied:

1. Independence: the observations are independent, i.e. there is only one pair (x, y) per subject
2. Linearity: the relationship between x and y is linear
3. Constant variance: the variance of y is constant for all values of x
4. Normality: given x, y has a normal distribution (or equivalently, the residuals have a normal distribution)
30/43
Simple linear regression
Checking the linearity assumption:

1. Make a scatter plot of y versus x
   - the points should generally form a straight line
2. Plot the residuals against the explanatory variable x
   - the points should present a random scatter around zero, with no systematic pattern

[Two residual plots (e versus x): “Linearity” shows a random scatter around zero; “Lack of linearity” shows a systematic, curved pattern.]
31/43
Simple linear regression
Checking the constant variance assumption:

- Make a residual plot, i.e. plot the residuals against the fitted values of y (ŷ_i = a + b · x_i)
  - the points should present a random scatter

[Two residual plots (e versus the fitted values): “Constant variance” shows an even band of points around zero; “Non-constant variance” shows the spread of the residuals changing across the fitted values.]
32/43
Simple linear regression
Example: Blood pressure and body weight
33/43
Simple linear regression
Checking the normality assumption:

1. Draw a histogram of the residuals and “eyeball” the result
2. Make a normal probability plot (P–P plot) of the residuals, i.e. plot the expected cumulative probability of a normal distribution versus the observed cumulative probability at each value of the residual
   - the points should form a straight diagonal line
34/43
Simple linear regression
Example: Blood pressure and body weight
35/43
Simple linear regression
Assessing goodness of fit

- The estimated regression line is the “best” one available (in the least-squares sense)
- Yet, it can still be a very poor fit to the observed data

[Two scatter plots with fitted lines: “Good fit” shows the points tightly clustered around the line; “Bad fit” shows the points scattered widely around it.]
36/43
Simple linear regression
To assess the goodness of fit of a regression line (i.e. how well the line fits the data) we can:

1. Calculate the correlation coefficient R between the predicted and observed values of y
   - A higher value of R indicates a better fit (the predicted and observed values of y are closer to each other)
2. Calculate R² (R Square in SPSS)
   - 0 ≤ R² ≤ 1
   - A higher value of R² indicates a better fit
   - R² = 1 indicates a perfect fit (i.e. ŷ_i = y_i for each i)
   - R² = 0 indicates a very poor fit
37/43
Simple linear regression
Alternatively, R² can be calculated as

R² = Σ_{i=1}^{n} (ŷ_i − ȳ)² / Σ_{i=1}^{n} (y_i − ȳ)² = (variation in y explained by x) / (total variation in y)

- R² is interpreted as the proportion of the total variability in y explained by the explanatory variable x
  - R² = 1: x explains all variability in y
  - R² = 0: x does not explain any variability in y
- R² is usually expressed as a percentage; e.g., R² = 0.93 indicates that 93% of the total variation in y can be explained by x
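Both routes to R² (the squared correlation between observed and predicted y, and explained over total variation) can be compared numerically. The data below are made up for illustration.

```python
from statistics import mean, stdev

# Made-up data, for illustration only
x = [85.0, 90.0, 92.0, 95.0, 100.0, 105.0]
y = [106.0, 109.0, 113.0, 115.0, 121.0, 125.0]

xbar, ybar = mean(x), mean(y)
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]   # predicted values

# R^2 as explained variation over total variation
r2_anova = (sum((f - ybar) ** 2 for f in yhat)
            / sum((yi - ybar) ** 2 for yi in y))

# R^2 as the squared correlation between observed and predicted y
def pearson_r(u, v):
    ub, vb = mean(u), mean(v)
    return (sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v))
            / ((len(u) - 1) * stdev(u) * stdev(v)))

r2_corr = pearson_r(y, yhat) ** 2
assert abs(r2_anova - r2_corr) < 1e-9   # the two definitions coincide
```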
38/43
Simple linear regression
Example: Blood pressure and body weight
Model Summary (SPSS output; predictors: (Constant), Weight)

Model    R       R Square    Adjusted R Square    Std. Error of the Estimate
1        .950    .903        .897                 1.74050
39/43
Simple linear regression

Prediction: interpolation versus extrapolation

[Diagram: within the range of the actual data, predictions from the fitted line are interpolation; beyond that range they are extrapolation, where additional data could follow several quite different patterns.]

Extrapolation beyond the range of the data is risky!
40/43
Categorical explanatory variable
- So far we assumed that the predictor variable is numerical
- We can also study an association between y and a categorical x, e.g. between blood pressure and gender, or between brain size and ethnicity
  - Categorical variables can be incorporated through dummy variables that take on the values 0 and 1; to include a categorical variable with p categories, p − 1 dummy variables are required
41/43
Categorical explanatory variable
Example: IQ and blood group (4 categories: A, B, AB, O)

1. Dummy variables for all blood groups:

   xA = 1 if blood group is A, 0 otherwise
   xB = 1 if blood group is B, 0 otherwise
   xAB = 1 if blood group is AB, 0 otherwise
   xO = 1 if blood group is O, 0 otherwise

2. One category serves as the reference category
   - a category that results in useful comparisons (e.g. exposed versus non-exposed, experimental versus standard treatment) or a category with a large number of subjects
3. In the model we include all dummies except the one corresponding to the reference category
42/43
Categorical explanatory variable
Model with blood group O as the reference category:

y = α + βA · xA + βB · xB + βAB · xAB + ε

and its estimated counterpart is

ŷ = a + bA · xA + bB · xB + bAB · xAB

- Estimation of the model parameters requires running a multiple linear regression, unless the explanatory variable has only two categories
- Given that y represents the IQ score, the estimated coefficients are interpreted as follows:
  - a is the mean IQ for subjects with blood group O, i.e. the reference category
  - Each b represents the mean difference in IQ between subjects with the blood group represented by the respective dummy variable and subjects with blood group O (the reference category)
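For a single categorical predictor, the dummy-coded least-squares fit reproduces the group means, so the interpretation above can be illustrated without a full regression routine. The IQ numbers below are hypothetical, chosen only for this sketch.

```python
from statistics import mean

# Hypothetical IQ scores per blood group -- illustrative numbers, not real data
iq = {
    "A":  [98, 102, 101, 99],
    "B":  [105, 103, 107],
    "AB": [100, 104],
    "O":  [96, 100, 98, 102],
}

# With dummies xA, xB, xAB and group O as reference, the least-squares fit of
#   y = a + bA*xA + bB*xB + bAB*xAB
# has fitted values equal to the group means, so the coefficients can be read off:
a = mean(iq["O"])                                    # mean IQ in the reference group
b = {g: mean(iq[g]) - a for g in ("A", "B", "AB")}   # mean differences versus group O
```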
43/43
Categorical explanatory variable
Specifically:
- bA is the mean difference in IQ between subjects with blood group A and subjects with blood group O
- bB is the mean difference in IQ between subjects with blood group B and subjects with blood group O
- bAB is the mean difference in IQ between subjects with blood group AB and subjects with blood group O

A test for the significance of a categorical explanatory variable with p levels involves the hypothesis that the coefficients of all p − 1 dummy variables are zero. For that purpose, we need to use an overall F-test (next lecture) and not a t-test. The t-test can be used only when the variable is binary.