logistic regression & prediction score 16-17-12-09 · nizam a. allied regression analysis and...

Post on 27-May-2020


15/12/52

Logistic regression analysis & developing a clinical prediction score

Ammarin Thakkinstian, Ph.D. Section for Clinical Epidemiology

and Biostatistics (SCEB)

• Part I– Logistic regression analysis

• Part II– Developing a clinical prediction score


Objective

• Construct the logit equation
• Estimate the probability of an event, the adjusted odds ratio, and its 95% confidence interval
• Interpret the results of logistic regression analysis
• Assess goodness of fit of the logit model and its diagnostic measures
• Develop a prediction score model using the logit equation and ROC curve analysis
• Calibrate the cut-off or threshold
• Validate a prediction score model

Reference

• Pagano M, Gauvreau K. Principles of Biostatistics. California: Duxbury Press, 1993; 379-424.
• Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied regression analysis and other multivariable methods, 3rd edition. Washington: Duxbury Press, 1998; 39-212.
• Hosmer DW, Lemeshow S. Applied logistic regression, 2nd edition. New York: John Wiley & Sons, Inc., 2000.

Outline of talk

• Construct logistic equation
  – Simple logistic model
  – Multiple logistic model
• Model selection
  – Assessing goodness of fit of the model
  – Diagnostic measures
• Creating a clinical prediction score
  – Derivation phase
  – Validation phase

When will we apply the logistic equation?

• Assessing association between factors and an outcome, in which
  – Outcome: dichotomous only
    • DM/non-DM, HT/non-HT, CKD/non-CKD, retinopathy/non-retinopathy
  – Factors
    • Can be either continuous or categorical variables

Example I. Factors associated with acute stroke

• Design: Case-control study
• Outcome variable: Case vs Control
  – A case is a patient diagnosed with haemorrhagic or ischemic stroke
  – A control is a subject with no history of stroke
• Variables of interest
  – Age, gender, BMI, waist-hip ratio
  – Smoking, alcohol consumption
  – Physical activity
  – History of disease
    • DM
    • HT
    • High cholesterol, LDL, HDL, Trig
  – Genetic factors
    • Tissue-type plasminogen activator (t-PA)
    • R353Q polymorphism of the Factor VII gene
    • Platelet glycoprotein (GP 1bα) gene
      – Thr/Met & Kozak polymorphisms

Example II. Factors associated with retinopathy in type 2 diabetic patients

• Design
  – Cross-sectional study
• Outcome
  – Retinopathy vs non-retinopathy
• Variables
  – Demographic data
    • Age, gender, BMI/waist-hip ratio, smoking, alcohol
  – History of disease
    • HT
    • Abnormal lipid profile
  – Clinical data
    • SBP/DBP
    • Kidney function (GFR or Cr)
    • HbA1c
    • Medication
      – ACE-I, ARB

Example III. Risk factors of chronic kidney disease (CKD)

• Design
  – Cross-sectional study
• Outcome
  – CKD versus non-CKD
• Variables
  – Age, gender, BMI/waist-hip ratio
  – Alcohol consumption
  – Smoking
  – Exercise & physical activity
  – History of illness
    • DM, HT, abnormal lipid profile, kidney stone
  – Medication used
    • NSAID, cyclo-oxygenase type 2 inhibitor (Cox-2), traditional medicine

Example IV. A clinical decision rule to prioritize polysomnography (PSG) in patients with suspected sleep apnea

• Design
  – Prospective data collection on consecutive patients referred to a sleep centre.
  – All consecutive new patients from February 2001 to April 2003 were included in the study. Data from February 2001 to December 2002 were used to derive the decision rule, whereas data collected from January 2003 to April 2003 were used for validation of the rule.
• Setting
  – The Newcastle Sleep Disorders Centre, University of Newcastle, NSW, Australia.
• Patients
  – Consecutive adult patients who had been scheduled for initial diagnostic PSG.
• Study objectives
  – To derive and validate a clinical decision rule that can help to prioritize patients who are on waiting lists for PSG.
• Variables

Association between age & sleep apnea

[Figure: scatter plot of age and SA; x-axis: age (20 to 80); y-axis: SA; plot annotations "mean=531" and "mean=430".]

Group   Age     SA    Non-SA   N     Mean (P)
1       < 30    22    53       75    0.29
2       30-44   146   99       245   0.60
3       45-60   225   79       304   0.74
4       60+     176   37       213   0.83
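As a quick check on the table, the mean of a 0/1 outcome within a group is the proportion with the outcome, so each entry in the last column should equal SA / N. A minimal Python sketch using the counts from the table (the variable and function names are illustrative and not part of the slides):

```python
# Counts copied from the age-group table: (age group, SA, non-SA).
AGE_GROUPS = [
    ("< 30", 22, 53),
    ("30-44", 146, 99),
    ("45-60", 225, 79),
    ("60+", 176, 37),
]


def proportion_with_sa(sa: int, non_sa: int) -> float:
    """Mean of the 0/1 outcome in a group, i.e. the estimate of E(Y|X)."""
    return sa / (sa + non_sa)


# Map each age group to its observed proportion of SA.
proportions = {
    label: proportion_with_sa(sa, non_sa) for label, sa, non_sa in AGE_GROUPS
}

for label, p in proportions.items():
    print(f"{label:>6}: P(SA) = {p:.2f}")
```

The proportions rise with age but must stay between 0 and 1, which is why a straight line in age is a poor model for them.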

Probability of having SA according to age group

[Figure: probability of SA (y-axis, 0 to 1) plotted against age group (x-axis: <30, 37.5, 47.5, >60).]

• Mean value of SA given age group
• E(Y|X)
• Expected value (mean) of SA given X

0 ≤ E(Y|X) ≤ 1

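Logit equation:

\[
p = P(Y=1) = \frac{1}{1+\exp\left[-\left(\beta_0+\sum_{j=1}^{k}\beta_j X_j\right)\right]}
  = \frac{e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}}{1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}}
\]

\[
1-p = 1-\frac{e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}}{1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}}
    = \frac{1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}-e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}}{1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}}
    = \frac{1}{1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}}
\]

\[
\therefore \frac{p}{1-p}
  = \frac{e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}\big/\left(1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}\right)}{1\big/\left(1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}\right)}
  = e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}
\]

\[
\ln\frac{p}{1-p} = \ln e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j} = \beta_0+\sum_{j=1}^{k}\beta_j X_j
\]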

Simple logistic regression

• Fit equation

\[
\ln\left[\frac{P}{1-P}\right] = b_0 + b_1\,\text{snore}
\]

Performing analysis in STATA

xi: logit SA i.snore, nolog
i.snore           _Isnore_1-2         (naturally coded; _Isnore_2 omitted)

Logistic regression                               Number of obs   =        837
                                                  LR chi2(1)      =      86.63
                                                  Prob > chi2     =     0.0000
Log likelihood = -481.49775                       Pseudo R2       =     0.0825

------------------------------------------------------------------------------
          SA |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Isnore_1 |   1.571043   .1717837     9.15   0.000     1.234354    1.907733
       _cons |  -.3846743   .1440453    -2.67   0.008    -.6669979   -.1023508
------------------------------------------------------------------------------

Interpretation

• The logit (log odds) of sleep apnea is 1.57 higher in patients with a history of snoring than in patients without a history of snoring.
• The logit of sleep apnea for patients with and without a history of snoring is therefore

\[
\ln[\text{odds}(SA)_{\text{snore}+}] = -0.38 + 1.57
\]

\[
\ln[\text{odds}(SA)_{\text{snore}-}] = -0.38
\]

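\[
\ln[\text{odds}(SA)_{\text{snore}+}] - \ln[\text{odds}(SA)_{\text{snore}-}] = -0.38 + 1.57 + 0.38 = 1.57
\]

\[
\ln\left[\frac{\text{odds}(SA)_{\text{snore}+}}{\text{odds}(SA)_{\text{snore}-}}\right] = 1.57
\]

\[
\frac{\text{odds}(SA)_{\text{snore}+}}{\text{odds}(SA)_{\text{snore}-}} = \exp(1.57)
\]

\[
OR = 4.81, \quad \text{where } OR = \frac{\text{odds}(SA)_{\text{snore}+}}{\text{odds}(SA)_{\text{snore}-}}
\]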
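The same exponentiation applies to the confidence limits of the coefficient, which gives the 95% confidence interval of the odds ratio. A minimal Python sketch, assuming the coefficient and standard error reported for _Isnore_1 in the Stata output (the function name and the normal quantile constant are illustrative):

```python
import math

# Values copied from the Stata output for _Isnore_1.
COEF_SNORE = 1.571043
SE_SNORE = 0.1717837
Z_975 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_with_ci(coef: float, se: float, z: float = Z_975) -> tuple[float, float, float]:
    """Exponentiate the coefficient and its Wald limits to get the OR and its CI."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)


or_hat, ci_low, ci_high = odds_ratio_with_ci(COEF_SNORE, SE_SNORE)
print(f"OR = {or_hat:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

The interval is computed on the logit scale and then exponentiated, so it is not symmetric around the odds ratio.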

Testing association

H0: β1 = 0 or OR = 1

• Wald test

\[
Z = \frac{\hat\beta_1}{se(\hat\beta_1)} = \frac{2.62}{0.61} = 4.30
\]

• Likelihood ratio test

\[
G = -2\,[LL_0 - LL_1] = -2(-524.8 + 481.5) = 86.6
\]

\[
G \sim \chi^2 \text{ with } 2-1 \text{ df}
\]
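Both tests can be reproduced by hand from the Stata output of the snoring model. The sketch below is a minimal Python illustration, assuming the coefficient, standard error and log likelihood from that output and the null log likelihood of -524.8 quoted for the likelihood ratio test. The helper names are illustrative. The chi-square P value uses the identity that a chi-square variable with 1 df is the square of a standard normal variable.

```python
import math

# Values copied from the Stata output and the likelihood ratio test.
COEF_SNORE = 1.571043
SE_SNORE = 0.1717837
LL_NULL = -524.8        # log likelihood of the model without snore
LL_MODEL = -481.49775   # log likelihood of the model with snore


def wald_z(coef: float, se: float) -> float:
    """Wald statistic: coefficient divided by its standard error."""
    return coef / se


def two_sided_normal_p(z: float) -> float:
    """Two-sided P value for a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def likelihood_ratio_g(ll_null: float, ll_model: float) -> float:
    """G = -2 [LL0 - LL1]."""
    return -2.0 * (ll_null - ll_model)


def chi2_1df_p(g: float) -> float:
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(g / 2.0))


z = wald_z(COEF_SNORE, SE_SNORE)
g = likelihood_ratio_g(LL_NULL, LL_MODEL)
print(f"Wald z = {z:.2f}, P = {two_sided_normal_p(z):.3g}")
print(f"LR G = {g:.1f}, P = {chi2_1df_p(g):.3g}")
```

With a single added parameter the two tests usually agree closely in their conclusion, as they do here.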

Estimate probability of having event

\[
\ln\frac{p}{1-p} = -0.38 + 1.57 \times 1 = 1.19
\]

Taking the anti-logarithm of both sides,

\[
\frac{p}{1-p} = e^{1.19} = 3.29
\]

Solve for p:

\[
p = 3.29 - 3.29\,p
\]

\[
4.29\,p = 3.29
\]

\[
\hat{p} = \frac{3.29}{4.29} = 0.77
\]
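The solve-for-p steps collapse into the inverse logit, p = 1 / (1 + exp(-(b0 + b1·snore))). A minimal Python sketch, assuming the unrounded coefficients from the Stata output of the simple model (the function name is illustrative):

```python
import math

# Coefficients copied from the Stata output of the simple model.
B_CONS = -0.3846743
B_SNORE = 1.571043


def predicted_probability(snore: int) -> float:
    """Invert the logit: p = 1 / (1 + exp(-(b0 + b1 * snore)))."""
    linear_predictor = B_CONS + B_SNORE * snore
    return 1.0 / (1.0 + math.exp(-linear_predictor))


p_snorer = predicted_probability(1)
p_non_snorer = predicted_probability(0)
print(f"P(SA | snore)    = {p_snorer:.2f}")
print(f"P(SA | no snore) = {p_non_snorer:.2f}")
```

Setting snore to 0 leaves only the constant, which gives the estimated probability for patients without a history of snoring.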


Multiple logistic regression

• Multiple factors are associated with the outcome of interest
• Osteoporotic hip fracture
  – Age, BMI, use of corticosteroid, alcohol consumption, calcium intake, etc.
• CKD
  – Age, gender, BMI, use of NSAID, diabetes, HT, Chol
• SA
  – Age, gender, BMI, snore, stop breathing, etc.
• Considers > 1 factor simultaneously
• Cumulative factors can predict the event better than one factor
• Controls confounding effects, i.e., assesses the effect of each factor while controlling for the other factors

\[
\operatorname{logit}\left[\frac{D+}{D-}\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \dots + \beta_p x_p
\]

Steps of analysis

• Model selection
  – Keep only variables that explain the event of interest well
    • Clinical significance
    • Statistical significance
  – Not too many (but not too few) variables

Model selection

• i) Univariate analysis
  • age_gr, sex, BMI_gr, snore, stop_bre, choking, awake_re, kick_leg, accident, smoker, alcohol, ht, dm, allergie

TABLE 1. Patients’ characteristics between SA and non-SA groups

Factors                        SA n = (%)    Non-SA n = (%)    P value
Age, mean (SD)
  < 30
  30 - 44
  45 - 59
  > 60
Gender
  Male
  Female
BMI, mean (SD)
  < 25
  25 - 29.9
  30 - 39.9
  > 40
Snoring
  Yes
  No
Stopping breathing
  Yes
  No
Choking
  Yes
  No
Waking up refreshed
  Yes
  Sometime
  No
Leg kicking
  Yes
  Sometime
  No
Accident due to sleepiness
  Yes
  No
ESS score, median (range)
Smoking
  Yes
  Ex-smoke
  No
Alcohol consumption
  Yes
  No
Hypertension
  Yes
  No
Diabetes mellitus
  Yes
  No
Allergy
  Yes
  No

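Model selection

• ii) Multivariate analysis, simultaneously entering the variables with p < 0.15 into the model

\[
\operatorname{logit}\left[\frac{SA+}{SA-}\right] = \beta_0 + \beta_1\,\text{Stop\_bre} + \beta_2\,\text{Agegr}_2 + \beta_3\,\text{Agegr}_3 + \beta_4\,\text{Agegr}_4 + \beta_5\,\text{Sex} + \beta_6\,\text{BMIgr}_2 + \beta_7\,\text{BMIgr}_3 + \beta_8\,\text{BMIgr}_4 + \beta_9\,\text{Snore} + \dots + \beta_p x_p
\]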

Confounder versus Interaction

• Confounders
• Crude OR versus Adjusted OR

Effect modifier

Model selection

– Backward
– Forward

Performance of the model

• Goodness of fit (Calibration)
• How similar are the predicted and observed outcomes?

Model classification

• How well does the model discriminate SA from non-SA subjects?
• Assign the cut-off/threshold
• Construct 2x2 or kx2 tables
• Estimate predictive values
  – Sens
  – Spec
  – PPV, NPV
  – Accuracy
  – Area under ROC
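Once a cut-off is assigned, each of these measures is a ratio of cells in the 2x2 table of score result against SA status. A minimal Python sketch follows. The class name and the counts are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from cross-classifying score >= cut-off against SA status."""

    tp: int  # score positive, SA
    fp: int  # score positive, non-SA
    fn: int  # score negative, SA
    tn: int  # score negative, non-SA

    def sensitivity(self) -> float:
        # Proportion of SA subjects with a positive score.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of non-SA subjects with a negative score.
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        # Proportion of positive scores that are truly SA.
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        # Proportion of negative scores that are truly non-SA.
        return self.tn / (self.tn + self.fn)

    def accuracy(self) -> float:
        # Proportion of all subjects correctly classified.
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)


# Hypothetical counts for illustration only.
table = TwoByTwo(tp=80, fp=30, fn=20, tn=70)
print(f"Sens = {table.sensitivity():.2f}, Spec = {table.specificity():.2f}")
print(f"PPV = {table.ppv():.2f}, NPV = {table.npv():.2f}, Accuracy = {table.accuracy():.2f}")
```

Sensitivity and specificity condition on true status, whereas PPV and NPV condition on the score result and therefore change with the prevalence of SA in the sample.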


Model classification

• Area under the ROC
  – A summary statistic that tells us whether the logit model can discriminate disease from non-disease subjects.
  – The ROC curve plots sensitivity versus 1-specificity (false positive) for the whole range of estimated probabilities.

[Figure: ROC curve; x-axis: 1 - Specificity (0.00 to 1.00); y-axis: Sensitivity (0.00 to 1.00). Area under ROC curve = 0.8101]
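One way to read the area under the ROC curve is as the probability that a randomly chosen diseased subject has a higher score than a randomly chosen non-diseased subject. The Python sketch below computes the area that way by comparing all pairs. It is a minimal illustration: the function name and the scores are hypothetical, not the study data.

```python
def auc_by_pairs(scores_pos: list[float], scores_neg: list[float]) -> float:
    """Proportion of (SA, non-SA) pairs in which the SA subject scores higher.

    Ties count as one half. This equals the area under the empirical ROC curve.
    """
    wins = 0.0
    for score_pos in scores_pos:
        for score_neg in scores_neg:
            if score_pos > score_neg:
                wins += 1.0
            elif score_pos == score_neg:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical scores for illustration only.
sa_scores = [5.6, 4.9, 6.1, 4.1, 5.4]
non_sa_scores = [3.9, 4.1, 5.0, 3.5]

auc = auc_by_pairs(sa_scores, non_sa_scores)
print(f"AUC = {auc:.3f}")
```

An area of 0.5 means the score ranks subjects no better than chance, and 1.0 means every SA subject scores above every non-SA subject.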


Interpretation of ROC

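Diagnostic measures

• Outliers
  - Pearson’s chi-square residual

\[
r(y_j,\hat\pi_j) = \frac{y_j - m_j\hat\pi_j}{\sqrt{m_j\hat\pi_j(1-\hat\pi_j)}}
\]

  Residual sum square

\[
r(y_j,\hat\pi_j)^2 = \frac{(y_j - m_j\hat\pi_j)^2}{m_j\hat\pi_j(1-\hat\pi_j)}
\]

  - Deviance residual

\[
d(y_j,\hat\pi_j) = \pm\left\{2\left[y_j\ln\left(\frac{y_j}{m_j\hat\pi_j}\right) + (m_j-y_j)\ln\left(\frac{m_j-y_j}{m_j(1-\hat\pi_j)}\right)\right]\right\}^{1/2}
\]

where y_j is the number of events, m_j the number of subjects and \hat\pi_j the fitted probability for covariate pattern j.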
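Both residuals can be computed directly from the observed events, the number of subjects and the fitted probability of a covariate pattern. A minimal Python sketch follows. The function names and the example pattern are hypothetical, and the deviance residual takes the sign of (y - m·π̂), which is the usual convention.

```python
import math


def pearson_residual(y: int, m: int, pi_hat: float) -> float:
    """r = (y - m*pi) / sqrt(m*pi*(1 - pi)) for one covariate pattern."""
    return (y - m * pi_hat) / math.sqrt(m * pi_hat * (1.0 - pi_hat))


def deviance_residual(y: int, m: int, pi_hat: float) -> float:
    """Signed square root of the pattern's contribution to the deviance.

    Terms with a zero count are taken as zero (the limit of x*ln(x) as x -> 0).
    """
    expected_events = m * pi_hat
    expected_non_events = m * (1.0 - pi_hat)
    term_events = y * math.log(y / expected_events) if y > 0 else 0.0
    term_non_events = (
        (m - y) * math.log((m - y) / expected_non_events) if m > y else 0.0
    )
    sign = 1.0 if y >= expected_events else -1.0
    # max() guards against a tiny negative value caused by rounding error.
    return sign * math.sqrt(max(0.0, 2.0 * (term_events + term_non_events)))


# Hypothetical covariate pattern: 10 subjects, 8 events, fitted probability 0.6.
r = pearson_residual(y=8, m=10, pi_hat=0.6)
d = deviance_residual(y=8, m=10, pi_hat=0.6)
print(f"Pearson residual = {r:.3f}, deviance residual = {d:.3f}")
```

Large absolute values of either residual flag covariate patterns that the model fits poorly.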

Outliers

• Leverage h_jj values
• Reflects the distance of X_j from the centre (mean)
• The higher the h_jj, the farther X_j is from the mean

\[
H = V^{1/2} X (X'VX)^{-1} X' V^{1/2}
\]

where V is a J×J diagonal matrix and

\[
v_j = m_j\,\hat\pi(x_j)\,[1-\hat\pi(x_j)]
\]

Influence of outliers

• Influence on predicted value of Y
• Including/excluding the pattern(s) that are outliers would change the Y values
• Pearson residual change

\[
\Delta\chi_j^2 = \frac{r_j^2}{1-h_j}
\]

• Deviance residual change

\[
\Delta D_j = \frac{d_j^2}{1-h_j}
\]

Influence on estimated coefficients

\[
\Delta\hat\beta_j = \left(\hat\beta - \hat\beta_{(-j)}\right)'(X'VX)\left(\hat\beta - \hat\beta_{(-j)}\right) = \frac{r_j^2\,h_j}{(1-h_j)^2}
\]
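Given a pattern's Pearson residual r, deviance residual d and leverage h, the three influence measures are simple arithmetic. A minimal Python sketch of the formulas (the function names and input values are hypothetical and chosen only to show the calculation):

```python
def delta_chi2(r: float, h: float) -> float:
    """Change in the Pearson chi-square when the pattern is deleted: r^2 / (1 - h)."""
    return r ** 2 / (1.0 - h)


def delta_deviance(d: float, h: float) -> float:
    """Change in the deviance when the pattern is deleted: d^2 / (1 - h)."""
    return d ** 2 / (1.0 - h)


def delta_beta(r: float, h: float) -> float:
    """Standardized change in the coefficient vector: r^2 * h / (1 - h)^2."""
    return r ** 2 * h / (1.0 - h) ** 2


# Hypothetical pattern: Pearson residual 2.0, deviance residual 1.8, leverage 0.2.
d_chi2 = delta_chi2(r=2.0, h=0.2)
d_dev = delta_deviance(d=1.8, h=0.2)
d_beta = delta_beta(r=2.0, h=0.2)
print(f"dX^2 = {d_chi2:.2f}, dD = {d_dev:.2f}, dbeta = {d_beta:.2f}")
```

Because of the (1 - h) terms, a pattern with high leverage can be influential even when its residual is moderate.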

[Figure: Delta Pearson chi-square versus predicted probability; y-axis: H-L dX^2 (0 to 15); x-axis: Pr(SA) (0 to 1).]

[Figure: Delta D versus Probability; y-axis: H-L dD (0 to 10); x-axis: Pr(SA) (0 to 1).]

[Figure: Delta B versus Probability; y-axis: Pregibon's dbeta (0 to 1.5); x-axis: Pr(SA) (0 to 1).]

[Figure: H-L dX^2 (0 to 15) versus Pr(SA) (0 to 1), with a point labelled 66.]

Create scoring scheme using coefficients of each variable

Factors                 Coefficients    Score for individual
Stopping breathing                      ……………………..
  Yes                   0.9
  No                    0
Age                                     ……………………..
  > 60                  2.2
  45 - 59               1.5
  30 - 44               1.0
  < 30                  0
BMI                                     ……………………..
  > 40                  2.3
  30 - 39.9             1.5
  25 - 29.9             1.1
  < 25                  0
Snoring                                 ……………………..
  Yes                   0.9
  No                    0
Gender                                  ……………………..
  Male                  1.1
  Female                0
Total score                             ……………………..

Calculate score

* Linear predictor of the fitted logit model (what -predict, xb- returns when these are all the model terms)
gen score_full = _b[_cons] + ///
    _b[_Istop_bre_1]*_Istop_bre_1 + ///
    _b[_Iage_gr_2]*_Iage_gr_2 + ///
    _b[_Iage_gr_3]*_Iage_gr_3 + ///
    _b[_Iage_gr_4]*_Iage_gr_4 + ///
    _b[_IBMI_gr_29]*_IBMI_gr_29 + ///
    _b[_IBMI_gr_39]*_IBMI_gr_39 + ///
    _b[_IBMI_gr_40]*_IBMI_gr_40 + ///
    _b[_Isex_2]*_Isex_2 + ///
    _b[_Isnore_1]*_Isnore_1
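For a single patient, the same sum can be done by hand from the coefficients in the scoring scheme. The sketch below is a minimal Python illustration, assuming the rounded coefficients from the scoring table. The model constant is not shown in that table, so it is left out here. The dictionary keys, the function name and the example patient are all hypothetical.

```python
# Coefficients copied from the scoring scheme; reference categories score 0.
# Assumption: the model constant is not shown in the scheme and is omitted here.
SCORE_WEIGHTS = {
    "stop_breathing": {"yes": 0.9, "no": 0.0},
    "age": {"> 60": 2.2, "45 - 59": 1.5, "30 - 44": 1.0, "< 30": 0.0},
    "bmi": {"> 40": 2.3, "30 - 39.9": 1.5, "25 - 29.9": 1.1, "< 25": 0.0},
    "snoring": {"yes": 0.9, "no": 0.0},
    "gender": {"male": 1.1, "female": 0.0},
}


def total_score(patient: dict[str, str]) -> float:
    """Sum the coefficient of the patient's category for every factor.

    Raises KeyError if a factor is missing or a category is unknown.
    """
    return sum(
        SCORE_WEIGHTS[factor][patient[factor]] for factor in SCORE_WEIGHTS
    )


# Hypothetical patient for illustration only.
example_patient = {
    "stop_breathing": "yes",
    "age": "45 - 59",
    "bmi": "30 - 39.9",
    "snoring": "yes",
    "gender": "male",
}

score = total_score(example_patient)
print(f"Total score = {score:.1f}")
```

Leaving out the constant shifts every patient's score by the same amount, so the ranking of patients and the ROC curve are unchanged, but cut-off values are shifted too.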

Discrimination performance

roctab SA score, detail

------------------------------------------------------------------------------
                                              Correctly
Cutpoint          Sensitivity   Specificity   Classified       LR+       LR-
------------------------------------------------------------------------------
( >= 3.890326 )      91.92%        50.00%        78.49%      1.8383    0.1617
( >= 3.895265 )      91.74%        51.12%        78.73%      1.8768    0.1616
( >= 3.896797 )      89.28%        54.48%        78.14%      1.9612    0.1968
( >= 3.940307 )      89.28%        55.22%        78.38%      1.9939    0.1941
( >= 3.990148 )      88.93%        55.22%        78.14%      1.9861    0.2005
( >= 4.049621 )      88.58%        55.22%        77.90%      1.9782    0.2069
( >= 4.051153 )      87.70%        57.09%        77.90%      2.0437    0.2155
( >= 4.090991 )      87.35%        57.46%        77.78%      2.0534    0.2202

( >= 5.355929 )      55.54%        85.07%        64.99%      3.7209    0.5226
( >= 5.440022 )      48.51%        88.43%        61.29%      4.1934    0.5823
( >= 5.441554 )      48.33%        89.93%        61.65%      4.7972    0.5746
( >= 5.455751 )      48.15%        89.93%        61.53%      4.7798    0.5765
( >= 5.474413 )      47.28%        90.30%        61.05%      4.8731    0.5839
( >= 5.635747 )      40.95%        91.42%        57.11%      4.7715    0.6459
( >= 5.649945 )      40.77%        91.42%        56.99%      4.7510    0.6479
( >= 5.651477 )      38.66%        92.91%        56.03%      5.4537    0.6602
( >= 5.67371 )       37.79%        92.91%        55.44%      5.3298    0.6696
( >= 5.867904 )      36.73%        93.66%        54.96%      5.7905    0.6755
( >= 5.883634 )      36.03%        93.66%        54.48%      5.6797    0.6830
( >= 6.137812 )      22.85%        95.90%        46.24%      5.5664    0.8046
( >= 6.237287 )      18.45%        96.64%        43.49%      5.4950    0.8438
------------------------------------------------------------------------------

                  ROC                    -Asymptotic Normal--
     Obs         Area     Std. Err.      [95% Conf. Interval]
     --------------------------------------------------------
     837       0.8101       0.0165        0.77763     0.84249

Model selection based on model classification

• ROC curve analysis
• Comparing area under ROC curves

Calibrate cutoff

• Score’s distribution
  – Tertile, quantile
• Youden index
  – Sen + spec − 1
• LR+
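The Youden index and LR+ can be computed for every candidate cut-off, and the cut-off with the largest index taken as a starting point. A minimal Python sketch follows, using a few rows copied from the roctab output. The slides reproduce only part of that output, so the cut-off selected here is the best among these rows only and is not claimed as the study's final threshold. The function and variable names are illustrative.

```python
# (cutpoint, sensitivity, specificity) for a few rows of the roctab output.
# Only part of the output is reproduced in the slides, so this list is incomplete.
ROC_ROWS = [
    (3.890326, 0.9192, 0.5000),
    (3.940307, 0.8928, 0.5522),
    (4.051153, 0.8770, 0.5709),
    (4.090991, 0.8735, 0.5746),
    (5.355929, 0.5554, 0.8507),
    (5.441554, 0.4833, 0.8993),
    (6.237287, 0.1845, 0.9664),
]


def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1.0 - specificity)


# Pick the listed cut-off with the largest Youden index.
best_cutpoint, best_sens, best_spec = max(
    ROC_ROWS, key=lambda row: youden_index(row[1], row[2])
)
best_j = youden_index(best_sens, best_spec)
best_lr_pos = positive_likelihood_ratio(best_sens, best_spec)
print(f"Cut-off >= {best_cutpoint}: J = {best_j:.4f}, LR+ = {best_lr_pos:.4f}")
```

The Youden index weights sensitivity and specificity equally. A rule meant to prioritize patients on a waiting list may prefer a higher cut-off with a larger LR+ instead.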

Validation

• Internal validation
  – Data are from the same setting
    • Split data
    • Bootstrap
    • Period
• External validation
  – Generalization
  – Data are from a different setting
