Regression analysis
Linear regression · Logistic regression
Relationship and association
Straight line
[Scatter plot: Hip (cm) on the x-axis against BMI on the y-axis, with a fitted straight line; along the line, a 1 cm increase in hip corresponds to a change of −0.0008 in BMI.]

The straight line:

$Y = b_0 + b_1 X$

$b_1 = \dfrac{Y_2 - Y_1}{X_2 - X_1}$ (the slope)

$b_0$ = the intercept, where the line crosses the y-axis

$\mathrm{BMI} = b_0 + b_1 \cdot \mathrm{HIP}$
Best straight line?
Best straight line!
[Scatter plot with a fitted line; the vertical distance from each data point, e.g. $(X_1, Y_1)$, to the line is its residual.]

$e_1 = Y_1 - \hat{Y}_1$

The best line minimizes the sum of the squared residuals:

$\sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$
Least squares estimation
Simple linear regression
1. Is the association linear?
[Scatter plot illustrating a linear association.]
Simple linear regression
1. Is the association linear?
2. Describe the association: what are b0 and b1?

BMI = −12.6 kg/m² + 0.35 kg/(m²·cm) · Hip
$b_1 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$

$b_0 = \bar{Y} - b_1 \bar{X}$
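The slope and intercept formulas can be sketched directly in Python. The data below are made-up illustrative numbers, not the lecture's hip/BMI dataset:

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum((Xi - X_bar)(Yi - Y_bar)) / sum((Xi - X_bar)^2)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
    # b0 = Y_bar - b1 * X_bar
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = fit_line([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear data: y = 2x
```

On perfectly linear data the formulas recover the exact line, here b0 = 0 and b1 = 2.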
Simple linear regression
1. Is the association linear?
2. Describe the association.
3. Is the slope significantly different from 0? Help, SPSS!!!
Coefficients(a)

                    Unstandardized        Standardized
Model               B        Std. Error   Beta          t        Sig.
1  (Constant)      -12.581   2.331                      -5.396   .000
   Hip                .345    .023        .565          15.266   .000

a. Dependent Variable: BMI
Simple linear regression
1. Is the association linear?
2. Describe the association.
3. Is the slope significantly different from 0?
4. How good is the fit?

How far are the data points from the line on average?
$r = \dfrac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}}$
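The correlation formula is straightforward to sketch in Python (illustrative data, not the lecture's dataset):

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y over the product of their spreads."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x)
                    * sum((yi - y_bar) ** 2 for yi in y))
    return num / den

r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # points on a line: r = 1
```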
The Correlation Coefficient, r
[Four scatter plots illustrating r = 0, r = 1, r = 0.7 and r = −0.5.]
r² – Goodness of fit
How much of the variation can be explained by the model?
[Four scatter plots illustrating r² = 0, r² = 1, r² = 0.5 and r² = 0.2.]
Multiple linear regression
Could waist measure describe some of the variation in BMI?

BMI = 1.3 kg/m² + 0.42 kg/(m²·cm) · Waist

Or even better:

$\mathrm{BMI} = b_0 + b_1 \cdot \mathrm{HIP} + b_2 \cdot \mathrm{WST}$

$\mathrm{BMI} = -12.2 + 0.25 \cdot \mathrm{HIP} + 0.17 \cdot \mathrm{WST}$
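A pure-Python sketch of fitting BMI = b0 + b1·HIP + b2·WST by solving the normal equations (XᵀX)b = Xᵀy. The data are synthetic, generated exactly from the plane above, so the fit should recover those coefficients:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                 # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit(rows, y):
    """Least squares via the normal equations; each row is [1, hip, wst]."""
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

hips = [90, 95, 100, 105, 92, 98, 103, 96]
wsts = [70, 80, 90, 100, 75, 85, 95, 88]
bmis = [-12.2 + 0.25 * h + 0.17 * w for h, w in zip(hips, wsts)]
coeffs = fit([[1.0, h, w] for h, w in zip(hips, wsts)], bmis)
```

Because the synthetic data lie exactly on the plane, `coeffs` comes back as (−12.2, 0.25, 0.17) up to rounding error.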
Multiple linear regression
Adding age: adj R2 = 0.352
Adding thigh: adj R2 = 0.352?
Coefficients(a)

                    Unstandardized       Standardized                   95.0% Confidence Interval for B
Model               B       Std. Error   Beta           t       Sig.    Lower Bound    Upper Bound
1  (Constant)      -9.001   2.449                      -3.676   .000    -13.813        -4.190
   Waist             .168    .043        .201           3.923   .000       .084          .252
   Hip               .252    .031        .411           8.012   .000       .190          .313
   Age              -.064    .018       -.126          -3.492   .001      -.101         -.028

a. Dependent Variable: BMI
Coefficients(a)

                    Unstandardized       Standardized                   95.0% Confidence Interval for B
Model               B       Std. Error   Beta           t       Sig.    Lower Bound    Upper Bound
1  (Constant)       3.581   1.784                       2.007   .045       .075         7.086
   Waist             .168    .043        .201           3.923   .000       .084          .252
   Age              -.064    .018       -.126          -3.492   .001      -.101         -.028
   Thigh             .252    .031        .411           8.012   .000       .190          .313

a. Dependent Variable: BMI
Assumptions
1. The dependent variable must be metric continuous.
2. The independent variables must be continuous or ordinal.
3. Linear relationship between the dependent and all independent variables.
4. Residuals must have a constant spread.
5. Residuals are normally distributed.
6. Independent variables are not perfectly correlated with each other.
Non-parametric correlation
Ranked Correlation
Kendall’s τ and Spearman’s r_s

The correlation coefficient lies between −1 and 1, where −1 is perfect inverse correlation, 0 means no correlation, and 1 means perfect correlation.

Pearson is the correlation method for normally distributed data. Remember the assumptions:
1. The dependent variable must be metric continuous.
2. The independent variables must be continuous or ordinal.
3. Linear relationship between the dependent and all independent variables.
4. Residuals must have a constant spread.
5. Residuals are normally distributed.
Kendall’s τ – An example
Kendall’s τ – An example

$S = P - Q$

$\tau = \dfrac{S}{\tfrac{1}{2}\,n(n-1)}$

where P is the number of concordant pairs and Q the number of discordant pairs.
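Counting concordant and discordant pairs can be sketched in Python as:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau from concordant (P) and discordant (Q) pairs."""
    P = Q = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            P += 1          # pair ordered the same way in x and y
        elif s < 0:
            Q += 1          # pair ordered oppositely
    n = len(x)
    return (P - Q) / (n * (n - 1) / 2)
```

With perfectly agreeing ranks every pair is concordant and τ = 1; with fully reversed ranks τ = −1.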
Spearman – the same example
d²: 1 4 9 1 1 1 9 9 1 16

$r_s = 1 - \dfrac{6\sum d^2}{n^3 - n} = 1 - \dfrac{6 \cdot 52}{10^3 - 10} = 0.6848$
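The slide's Spearman calculation can be checked directly from the rank differences:

```python
# d^2 values from the example; their sum is 52
d_squared = [1, 4, 9, 1, 1, 1, 9, 9, 1, 16]
n = len(d_squared)                           # n = 10
r_s = 1 - 6 * sum(d_squared) / (n ** 3 - n)  # 1 - 6*52/990 ≈ 0.6848
```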
Correlation in SPSS
Correlation in SPSS
Correlations

                              a        b
a   Pearson Correlation       1        .685*
    Sig. (2-tailed)                    .029
    N                         10       10
b   Pearson Correlation       .685*    1
    Sig. (2-tailed)           .029
    N                         10       10

*. Correlation is significant at the 0.05 level (2-tailed).
Correlations

                                                a        b
Kendall's tau_b   a   Correlation Coefficient   1.000    .511*
                      Sig. (2-tailed)           .        .040
                      N                         10       10
                  b   Correlation Coefficient   .511*    1.000
                      Sig. (2-tailed)           .040     .
                      N                         10       10
Spearman's rho    a   Correlation Coefficient   1.000    .685*
                      Sig. (2-tailed)           .        .029
                      N                         10       10
                  b   Correlation Coefficient   .685*    1.000
                      Sig. (2-tailed)           .029     .
                      N                         10       10

*. Correlation is significant at the 0.05 level (2-tailed).
Logistic regression
Logistic Regression
• What if the dependent variable is categorical, and especially binary?
• Use some interpolation method?
• Linear regression cannot help us.
The sigmoidal curve
$p = \dfrac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n$

[Plot of the sigmoidal curve p(x) for β0 = 0, β1 = 1.]
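The curve is easy to evaluate in Python:

```python
import math

def sigmoid_p(x, b0, b1):
    """p = 1 / (1 + e^{-z}) with z = b0 + b1*x."""
    z = b0 + b1 * x
    return 1 / (1 + math.exp(-z))

# With b0 = 0 and b1 = 1 the curve passes through p = 0.5 at x = 0
# and flattens towards 0 and 1 at the extremes.
```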
The sigmoidal curve
• The intercept basically just shifts the curve along the x-axis
$p = \dfrac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n$

[Plot of the sigmoidal curve for β0 = 0, 2 and −2, all with β1 = 1: changing β0 shifts the curve along the x-axis.]
The sigmoidal curve
$p = \dfrac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n$

[Plot of the sigmoidal curve for β1 = 1, 2 and 0.5, all with β0 = 0: a larger coefficient makes the curve steeper.]
• The intercept basically just shifts the curve along the x-axis
• Large regression coefficient → risk factor strongly influences the probability
The sigmoidal curve
$p = \dfrac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n$

[Plot of the sigmoidal curve for β1 = 1 and β1 = −1, both with β0 = 0: the sign of the coefficient sets the direction of the curve.]
• The intercept basically just shifts the curve along the x-axis
• Large regression coefficient → risk factor strongly influences the probability
• Positive regression coefficient → risk factor increases the probability
• Logistic regression uses maximum likelihood estimation, not least squares estimation
Does age influence the diagnosis? Continuous independent variable
Variables in the Equation

                        B        S.E.    Wald      df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                        Lower    Upper
Step 1a   Age            .109    .010   108.745    1    .000   1.115    1.092    1.138
          Constant     -4.213    .423    99.097    1    .000    .015

a. Variable(s) entered on step 1: Age.
$z = B_0 + B_1 \cdot \mathrm{age}, \qquad p = \dfrac{1}{1 + e^{-z}}$
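Plugging the fitted coefficients from the table (B0 = −4.213, B1 = 0.109) into the model gives the probability of malignancy as a function of age:

```python
import math

def p_malignant(age, b0=-4.213, b1=0.109):
    """Probability from the fitted logistic model for a given age."""
    z = b0 + b1 * age
    return 1 / (1 + math.exp(-z))

# The positive age coefficient means the probability rises with age.
```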
Does previous intake of OCP influence the diagnosis? Categorical independent variable
Variables in the Equation

                        B        S.E.    Wald      df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                        Lower    Upper
Step 1a   OCP(1)        -.311    .180     2.979    1    .084    .733     .515    1.043
          Constant       .233    .123     3.583    1    .058   1.263

a. Variable(s) entered on step 1: OCP.
$z = B_0 + B_1 \cdot \mathrm{OCP}, \qquad p = \dfrac{1}{1 + e^{-z}}$

If OCP = 1: $p(Y = 1) = \dfrac{1}{1 + e^{-(B_0 + B_1)}} = \dfrac{1}{1 + e^{-(0.233 - 0.311)}} = 0.4805$

If OCP = 0: $p(Y = 1) = \dfrac{1}{1 + e^{-B_0}} = \dfrac{1}{1 + e^{-0.233}} = 0.5580$
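The two probabilities can be reproduced from the table's coefficients (constant B0 = 0.233, OCP coefficient B1 = −0.311):

```python
import math

def p_diagnosis(ocp, b0=0.233, b1=-0.311):
    """Probability from the fitted logistic model; ocp is 0 or 1."""
    z = b0 + b1 * ocp
    return 1 / (1 + math.exp(-z))

p_with_ocp = p_diagnosis(1)     # ≈ 0.4805
p_without_ocp = p_diagnosis(0)  # ≈ 0.5580
```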
Odds ratio
$o = \dfrac{p}{1 - p} = e^{z}$

$\text{odds ratio} = \dfrac{e^{B_0 + B_1}}{e^{B_0}} = e^{B_1} = e^{-0.311} = 0.7327$
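Because the B0 terms cancel in the ratio, the odds ratio for OCP is simply exp(B1):

```python
import math

# Odds ratio for OCP use, from the fitted coefficient B1 = -0.311
odds_ratio = math.exp(-0.311)   # ≈ 0.7327
```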
Multiple logistic regression
Variables in the Equation

                        B        S.E.    Wald      df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                        Lower    Upper
Step 1a   Age            .123    .011   115.343    1    .000   1.131    1.106    1.157
          BMI            .083    .019    18.732    1    .000   1.087    1.046    1.128
          OCP            .528    .219     5.808    1    .016   1.695    1.104    2.603
          Constant     -6.974    .762    83.777    1    .000    .001

a. Variable(s) entered on step 1: Age, BMI, OCP.
$z = B_0 + B_1 \cdot \mathrm{OCP} + B_2 \cdot \mathrm{age} + B_3 \cdot \mathrm{BMI}, \qquad p = \dfrac{1}{1 + e^{-z}}$
Predicting the diagnosis by logistic regression
What is the probability that the tumour of a 50-year-old woman who has been using OCP and has a BMI of 26 is malignant?

z = −6.974 + 0.123·50 + 0.083·26 + 0.528·1 = 1.862
p = 1/(1 + e^{−1.862}) = 0.8655
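The prediction follows directly from the table's coefficients (note the OCP coefficient in the table is 0.528):

```python
import math

# z = B0 + B_age*age + B_BMI*BMI + B_OCP*OCP for a 50-year-old OCP user with BMI 26
z = -6.974 + 0.123 * 50 + 0.083 * 26 + 0.528 * 1
p = 1 / (1 + math.exp(-z))   # probability that the tumour is malignant
```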