What you've always wanted to know about logistic regression analysis, but were afraid to ask...
February 1, 2010
Gerrit Rooks, Sociology of Innovation, Innovation Sciences & Industrial Engineering
Phone: 5509, email: [email protected]


Page 1

What you've always wanted to know about logistic regression analysis, but were afraid to ask...

February 1, 2010

Gerrit Rooks
Sociology of Innovation
Innovation Sciences & Industrial Engineering
Phone: 5509
email: [email protected]

Page 2

This Lecture

• Why logistic regression analysis?
• The logistic regression model
• Estimation
• Goodness of fit
• An example

Page 3

What's the difference between 'normal' regression and logistic regression?

Regression analysis:
– Relate one or more independent (predictor) variables to a dependent (outcome) variable

Page 4

What's the difference between 'normal' regression and logistic regression?

• Often you will be confronted with outcome variables that are dichotomous:
– success vs failure
– employed vs unemployed
– promoted or not
– sick or healthy
– pass or fail an exam

Page 5

Example
Relationship between hours studied for exam and success

Hours   # Failed exam   # Passed exam   Total # students   Prob. pass exam
28      4               2               6                  .33
29      3               2               5                  .40
30      2               7               9                  .78
31      2               7               9                  .78
32      4               16              20                 .80
33      1               14              15                 .93
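The last column of the table is simply passed / total. A minimal sketch of that arithmetic, using the counts from the table (the names `counts` and `pass_probability` are illustrative, not from the slides):

```python
# Counts copied from the table above: hours -> (failed, passed).
counts = {28: (4, 2), 29: (3, 2), 30: (2, 7), 31: (2, 7), 32: (4, 16), 33: (1, 14)}

def pass_probability(failed, passed):
    """Observed share of students who passed."""
    return passed / (failed + passed)

probs = {hours: pass_probability(f, p) for hours, (f, p) in counts.items()}

for hours, prob in probs.items():
    print(hours, round(prob, 2))
```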

Page 6

Linear regression analysis
Why is this wrong?
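One way to see the problem is to fit a straight line to the pass proportions of the exam example and extrapolate. The sketch below assumes an unweighted least-squares fit to the six group proportions; the 20 and 40 study hours are hypothetical values outside the observed range, chosen only for illustration.

```python
# Data from the exam example: hours studied, students passed, total students.
hours = [28, 29, 30, 31, 32, 33]
passed = [2, 2, 7, 7, 16, 14]
total = [6, 5, 9, 9, 20, 15]
prob = [p / t for p, t in zip(passed, total)]

# Ordinary least squares for one predictor (unweighted, an assumption).
mean_x = sum(hours) / len(hours)
mean_y = sum(prob) / len(prob)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, prob))
         / sum((x - mean_x) ** 2 for x in hours))
intercept = mean_y - slope * mean_x

def linear_prediction(x):
    """Predicted 'probability' from the straight line."""
    return intercept + slope * x

# Hypothetical study hours outside the observed range.
print(linear_prediction(20), linear_prediction(40))
```

The line is unbounded, so the prediction at 20 hours falls below 0 and the one at 40 hours exceeds 1, neither of which is a valid probability.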

Page 7

Logistic Regression
The better alternative


Page 9

The logistic regression equation
predicting probabilities

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

P(Y): predicted probability (always between 0 and 1)
b_0 + b_1 X_1: similar to regression analysis
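A minimal sketch of this equation in Python. The function name `predicted_probability` is illustrative, and the coefficients b0 = 0, b1 = 0.05 are just one of the candidate pairs used later in the lecture.

```python
import math

def predicted_probability(x1, b0, b1):
    """P(Y) = 1 / (1 + exp(-(b0 + b1 * x1)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x1)))

# The linear part b0 + b1 * x1 can take any value,
# but the result always stays between 0 and 1.
for x1 in (-500, 0, 17, 500):
    print(x1, predicted_probability(x1, b0=0.0, b1=0.05))
```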

Page 10

The Logistic function
Sometimes authors rearrange the model

$$P(Y) = \frac{e^{b_0 + b_1 X_1}}{1 + e^{b_0 + b_1 X_1}} = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

or also

$$\ln\frac{p(y=1)}{1 - p(y=1)} = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_n x_n$$
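The rearrangements describe the same model. A small numerical check, as a sketch (function names are illustrative; the coefficients -6.44 and 0.39 are taken from the example later in the lecture):

```python
import math

def probability(b0, b1, x1):
    """Form 1: 1 / (1 + e^-(b0 + b1*x1))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x1)))

def probability_alt(b0, b1, x1):
    """Form 2: e^(b0 + b1*x1) / (1 + e^(b0 + b1*x1))."""
    z = b0 + b1 * x1
    return math.exp(z) / (1.0 + math.exp(z))

def log_odds(p):
    """ln(p / (1 - p)): recovers the linear part of the model."""
    return math.log(p / (1.0 - p))

b0, b1, x1 = -6.44, 0.39, 17
p = probability(b0, b1, x1)
print(p, probability_alt(b0, b1, x1))   # same value twice
print(log_odds(p), b0 + b1 * x1)        # log odds equal the linear predictor
```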

Page 11

How do we estimate coefficients?
Maximum-likelihood estimation

• Parameters are estimated by 'fitting' models, based on the available predictors, to the observed data
• The chosen model fits the data best, i.e. is closest to the data
• Fit is determined by the so-called log likelihood statistic

Page 12

Maximum likelihood estimation
The log-likelihood statistic

$$LL = \sum_{i=1}^{N} \left\{ Y_i \ln\big(P(Y_i)\big) + (1 - Y_i) \ln\big[1 - P(Y_i)\big] \right\}$$

LL is never positive. Large negative values of LL (far from 0) indicate poor fit of the model

HOWEVER, THIS STATISTIC CANNOT BE USED TO EVALUATE THE FIT OF A SINGLE MODEL
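The statistic can be sketched in a few lines. The three observations and both sets of predicted probabilities below are hypothetical, chosen only to contrast a model that is mostly right with one that always predicts .5.

```python
import math

def log_likelihood(outcomes, probabilities):
    """LL = sum of Y*ln(P) + (1 - Y)*ln(1 - P) over all observations."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(outcomes, probabilities)
    )

outcomes = [0, 1, 1]                                   # hypothetical data
ll_good = log_likelihood(outcomes, [0.1, 0.9, 0.8])    # hypothetical good model
ll_poor = log_likelihood(outcomes, [0.5, 0.5, 0.5])    # hypothetical coin flip

print(ll_good, ll_poor)   # both negative; the better model is closer to 0
```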

Page 13

An example to illustrate maximum likelihood and the log likelihood statistic

Suppose we know hours spent studying and the outcome of an exam

Quantity of Study Hours   Outcome
3                         0
34                        1
17                        0
6                         0
12                        0
15                        1
26                        1
29                        1

Page 14

In ML different values for the parameters are 'tried'

Let's look at two possibilities: 1. b0 = 0 & b1 = 0.05; 2. b0 = -6.44 & b1 = 0.39

$$P(Y) = \frac{1}{1 + e^{-(0 + 0.05 X_1)}}$$

$$P(Y) = \frac{1}{1 + e^{-(-6.44 + 0.39 X_1)}}$$

Quantity of Study Hours   Outcome   Predicted probability (b0=0; b1=0.05)   Predicted probability (b0=-6.44; b1=0.39)
3                         0         .53                                     .01
34                        1         .85                                     .99
17                        0         .71                                     .53
6                         0         .57                                     .02
12                        0         .65                                     .14
15                        1         .68                                     .34
26                        1         .79                                     .97
29                        1         .81                                     .99
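The two probability columns can be recomputed from the logistic equation. This is a sketch; because the coefficients on the slide are rounded, a few recomputed values differ from the table in the second decimal.

```python
import math

hours = [3, 34, 17, 6, 12, 15, 26, 29]

def predicted_probability(x, b0, b1):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Candidate 1: b0 = 0, b1 = 0.05; candidate 2: b0 = -6.44, b1 = 0.39.
model_1 = [predicted_probability(x, 0.0, 0.05) for x in hours]
model_2 = [predicted_probability(x, -6.44, 0.39) for x in hours]

for x, p1, p2 in zip(hours, model_1, model_2):
    print(x, round(p1, 2), round(p2, 2))
```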

Page 15

We are now able to calculate the log likelihood statistic

$$LL = \sum_{i=1}^{N} \left\{ Y_i \ln\big(P(Y_i)\big) + (1 - Y_i) \ln\big[1 - P(Y_i)\big] \right\}$$

Quantity of Study Hours   Outcome   Predicted probability (b0=0; b1=0.05)   LL (b0=0; b1=0.05)
3                         0         .53                                     -.75
34                        1         .85                                     -.16
17                        0         .71                                     -1.24
6                         0         .57                                     -.84
12                        0         .65                                     -1.05
15                        1         .68                                     -.39
26                        1         .79                                     -.24
29                        1         .81                                     -.21
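Each entry in the LL column is ln(P) when the outcome is 1 and ln(1 - P) when it is 0. A sketch that recomputes the column from the rounded probabilities in the table:

```python
import math

outcomes = [0, 1, 0, 0, 0, 1, 1, 1]
# Predicted probabilities for b0 = 0, b1 = 0.05, as printed in the table.
probabilities = [.53, .85, .71, .57, .65, .68, .79, .81]

contributions = [
    math.log(p) if y == 1 else math.log(1 - p)
    for y, p in zip(outcomes, probabilities)
]
total_ll = sum(contributions)

for y, p, c in zip(outcomes, probabilities, contributions):
    print(y, p, round(c, 2))
print("LL =", round(total_ll, 2))
```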

Page 16

Two models and their log likelihood statistic

Outcome   Pr (b0=0; b1=0.05)   LL (b0=0; b1=0.05)   Pr (b0=-6.44; b1=0.39)   LL (b0=-6.44; b1=0.39)
0         .53                  -.75                 .01                      -.01
1         .85                  -.16                 .99                      -.01
0         .71                  -1.24                .53                      -.75
0         .57                  -.84                 .02                      -.02
0         .65                  -1.05                .14                      -.15
1         .68                  -.39                 .34                      -1.08
1         .79                  -.24                 .97                      -.03
1         .81                  -.21                 .99                      -.01
∑                              -4.88                                         -2.07

Based on a clever algorithm the model with the best fit (LL closest to 0) is chosen
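The slides do not say which algorithm is used. One common choice is Newton-Raphson, sketched below for the eight observations of the example, with step halving added as a safeguard. This is an assumption for illustration, not necessarily what SPSS does, and all function names are hypothetical.

```python
import math

hours = [3, 34, 17, 6, 12, 15, 26, 29]
passed = [0, 1, 0, 0, 0, 1, 1, 1]

def prob(b0, b1, x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(b0, b1):
    return sum(
        y * math.log(prob(b0, b1, x)) + (1 - y) * math.log(1 - prob(b0, b1, x))
        for x, y in zip(hours, passed)
    )

def fit_newton(iterations=25):
    """Maximise LL over (b0, b1) with Newton-Raphson steps."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(hours, passed):
            p = prob(b0, b1, x)
            w = p * (1 - p)
            g0 += y - p            # gradient w.r.t. b0
            g1 += (y - p) * x      # gradient w.r.t. b1
            h00 += w               # entries of the negated Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system for the Newton step (Cramer's rule).
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        # Halve the step while it would lower LL.
        scale, current = 1.0, log_likelihood(b0, b1)
        while scale > 1e-8 and log_likelihood(b0 + scale * step0, b1 + scale * step1) < current:
            scale /= 2
        b0, b1 = b0 + scale * step0, b1 + scale * step1
        if abs(step0) < 1e-10 and abs(step1) < 1e-10:
            break
    return b0, b1

b0_hat, b1_hat = fit_newton()
print(b0_hat, b1_hat, log_likelihood(b0_hat, b1_hat))
```

The fitted pair should have an LL at least as close to 0 as either candidate from the table; it need not coincide exactly with the rounded values shown on the slide.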

Page 17

After estimation
How do I determine significance?

• Obviously SPSS does all the work for you
• How to interpret output of SPSS
• Two major issues
1. Overall model fit
– Between model comparisons
– Pseudo R-square
– Predictive accuracy / classification test
2. Coefficients
– Wald test
– Likelihood ratio test
– Odds ratios

$$P(Y) = \frac{1}{1 + e^{-(-6.44 + 0.39 \cdot \text{studyhours})}}$$

Page 18

Model fit: Between model comparison

$$\chi^2 = 2\,[LL(\text{New}) - LL(\text{baseline})]$$

LL(New): model fit of the full model
LL(baseline): model fit of the reduced model

The log-likelihood ratio test statistic can be used to test the fit of a model

The test statistic has a chi-square distribution
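As a sketch, the test can be applied to the study-hours example. It assumes the baseline is a constant-only model (4 of 8 students passed, so every predicted probability is .5) and takes LL(New) = -2.07 from the lecture's table for b0 = -6.44, b1 = 0.39. For one degree of freedom the chi-square tail probability can be written with the complementary error function, so no statistics library is needed.

```python
import math

n = 8
ll_baseline = n * math.log(0.5)   # constant-only model: P = .5 for everyone
ll_new = -2.07                    # sum from the lecture's table (rounded)

chi_square = 2 * (ll_new - ll_baseline)
df = 1                            # one extra parameter (b1)

# Upper tail of a chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(chi_square, 2), df, p_value)
```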

Page 19

Model fit

$$\chi^2 = 2\,[LL(\text{New}) - LL(\text{baseline})]$$

The log-likelihood ratio test statistic can be used to test the fit of a model

Model fit full model:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

Model fit reduced model:

$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$

Page 20

Between model comparison

• Estimate a null model
– Baseline model
• Estimate an improved model
– This model contains more variables
• Assess the difference in -2LL between the models
– This difference follows a chi-square distribution
– degrees of freedom = # estimated parameters in proposed model – # estimated parameters in null model

$$\chi^2 = 2\,[LL(\text{New}) - LL(\text{baseline})]$$

Model fit full model:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2)}}$$

Model fit reduced model:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

Page 21

Overall model fit
R and R²

$$R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$

Numerator: SS due to regression
Denominator: Total SS

R² in multiple regression is a measure of the variance explained by the model

Page 22

Overall model fit
pseudo R²

Just like in multiple regression, logit R² ranges 0.0 to 1.0

– Cox and Snell
• cannot theoretically reach 1
– Nagelkerke
• adjusted so that it can reach 1

$$R^2_{\text{LOGIT}} = \frac{-2LL(\text{Model})}{-2LL(\text{Original})}$$

LL(Original): log-likelihood of model before any predictors were entered
LL(Model): log-likelihood of the model that you want to test

NOTE: R² in logistic regression tends to be (even) smaller than in multiple regression

Page 23

What is a small or large R and R²?
Strength of correlation

Small    0.10 to 0.29
Medium   0.30 to 0.49
Large    0.50 to 1.00

Page 24

Overall model fit
Classification table

How well does the model predict outcomes?

SPSS output:

Classification Table (a)
                                               Predicted
                                               Result of Penalty Kick            Percentage
Observed                                       Missed Penalty   Scored Penalty   Correct
Step 1   Result of Penalty Kick   Missed Penalty   30               5                85.7
                                  Scored Penalty   7                33               82.5
         Overall Percentage                                                          84.0
a. The cut value is .500

This means that we assume that if our model predicts that a player will score with a probability of .51 (above .5) the prediction will be a score (lower than .50 is a miss).
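The percentages in the table follow directly from the four cell counts. A minimal sketch (the dictionary layout is illustrative):

```python
# Cell counts from the classification table: (observed, predicted) -> count.
cells = {
    ("missed", "missed"): 30,
    ("missed", "scored"): 5,
    ("scored", "missed"): 7,
    ("scored", "scored"): 33,
}

def percent_correct(observed):
    """Share of an observed category that the model predicted correctly."""
    row_total = sum(n for (obs, _), n in cells.items() if obs == observed)
    return 100 * cells[(observed, observed)] / row_total

correct_missed = percent_correct("missed")
correct_scored = percent_correct("scored")
overall = 100 * (cells[("missed", "missed")] + cells[("scored", "scored")]) / sum(cells.values())

print(round(correct_missed, 1), round(correct_scored, 1), round(overall, 1))
```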

Page 25

Testing significance of coefficients
The Wald statistic: not really good

• In linear regression analysis the ratio of an estimate to its standard error (the t-statistic) is used to test significance
• In logistic regression something similar exists: the Wald statistic

$$\text{Wald} = \frac{b}{SE_b}$$

b: estimate
SE_b: standard error of estimate
t-distribution

• However, when b is large, the standard error tends to become inflated, so the Wald statistic is underestimated (Type II errors are more likely)
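As a sketch, the ratio can be computed for the constant reported in the SPSS example of this lecture (B = 1.280, S.E. = 1.670). The Wald column in that output appears to be the square of the ratio, which is then compared with a chi-square distribution with 1 degree of freedom; that reading is inferred from the reported numbers, not stated on the slides.

```python
import math

b = 1.280       # estimate of the constant in the SPSS example
se_b = 1.670    # its standard error

wald_ratio = b / se_b
wald_squared = wald_ratio ** 2          # close to the .588 in the Wald column

# Upper tail of a chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(wald_squared / 2))

print(round(wald_ratio, 3), round(wald_squared, 3), round(p_value, 3))
```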

Page 26

Likelihood ratio test
an alternative way to test significance of a coefficient

$$\chi^2 = 2\,[LL(\text{With}) - LL(\text{Without})]$$

Model with variable:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

Model without variable:

$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$

To avoid Type II errors, for some variables it is best to use the likelihood ratio test

Page 27

Before we go to the example
A recap

• Logistic regression
– dichotomous outcome
– logistic function
– log-likelihood / maximum likelihood
• Model fit
– likelihood ratio test (compare LL of models)
– Pseudo R-square
– Classification table
– Wald test

Page 28

Illustration with SPSS

• Penalty kicks data, variables:
– Scored: outcome variable
• 0 = penalty missed, and 1 = penalty scored
– Pswq: degree to which a player worries
– Previous: percentage of penalties scored by a particular player in their career

Page 29

SPSS OUTPUT Logistic Regression

Tells you something about the number of observations and missing values

Case Processing Summary
Unweighted Cases (a)                       N    Percent
Selected Cases     Included in Analysis    75   100.0
                   Missing Cases           0    .0
                   Total                   75   100.0
Unselected Cases                           0    .0
Total                                      75   100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
Missed Penalty    0
Scored Penalty    1

Page 30

Block 0: Beginning Block

This table is based on the empty model, i.e. only the constant in the model

$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$

Classification Table (a, b)
                                               Predicted
                                               Result of Penalty Kick            Percentage
Observed                                       Missed Penalty   Scored Penalty   Correct
Step 0   Result of Penalty Kick   Missed Penalty   0                35               .0
                                  Scored Penalty   0                40               100.0
         Overall Percentage                                                          53.3
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                    B      S.E.   Wald   df   Sig.   Exp(B)
Step 0   Constant   .134   .231   .333   1    .564   1.143

Variables not in the Equation
                                  Score    df   Sig.
Step 0   Variables   previous     34.109   1    .000
                     pswq         34.193   1    .000
         Overall Statistics       41.558   2    .000

These variables will be entered in the model later on
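With only a constant in the model, the Step 0 numbers can be recovered from the observed counts (35 missed, 40 scored). A minimal sketch:

```python
import math

n_missed, n_scored = 35, 40
n = n_missed + n_scored

odds = n_scored / n_missed       # Exp(B) of the constant
b0 = math.log(odds)              # B of the constant
p_scored = n_scored / n          # predicted probability for every player

# With cut value .5 and p_scored > .5, everyone is predicted to score,
# so the overall percentage correct equals the share of scored penalties.
overall_correct = 100 * p_scored

print(round(b0, 3), round(odds, 3), round(overall_correct, 1))
```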

Page 31

Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     54.977       2    .000
         Block    54.977       2    .000
         Model    54.977       2    .000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      48.662 (a)          .520                   .694
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

$$\chi^2 = 2\,[LL(\text{New}) - LL(\text{baseline})]$$

– Chi-square: this is the test statistic
– -2 Log likelihood: New model; gives LL(New) after dividing by -2
– Block is useful to check significance of individual coefficients, see Field
– Note: Nagelkerke is larger than Cox & Snell
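The chi-square and both pseudo R² values can be reproduced from the reported -2LL of the new model and the observed counts (35 missed, 40 scored). The sketch below uses the standard Cox & Snell and Nagelkerke definitions, which the slides name but do not spell out, and the fact that for 2 degrees of freedom the chi-square tail probability is exp(-x/2).

```python
import math

n_missed, n_scored = 35, 40
n = n_missed + n_scored

# Baseline (constant-only) log-likelihood from the observed proportions.
ll_baseline = n_missed * math.log(n_missed / n) + n_scored * math.log(n_scored / n)
ll_new = 48.662 / -2              # reported -2LL divided by -2

chi_square = 2 * (ll_new - ll_baseline)
p_value = math.exp(-chi_square / 2)      # chi-square tail, df = 2

cox_snell = 1 - math.exp(-chi_square / n)
nagelkerke = cox_snell / (1 - math.exp(2 * ll_baseline / n))

print(round(chi_square, 3), p_value)
print(round(cox_snell, 3), round(nagelkerke, 3))
```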

Page 32

Block 1: Method = Enter (Continued)

Variables in the Equation
                        B       S.E.    Wald    df   Sig.   Exp(B)
Step 1 (a)   previous   .065    .022    8.609   1    .003   1.067
             pswq       -.230   .080    8.309   1    .004   .794
             Constant   1.280   1.670   .588    1    .443   3.598
a. Variable(s) entered on step 1: previous, pswq.

– B: estimates
– S.E.: standard error estimates
– Sig.: significance based on Wald statistic
– Exp(B): change in odds

Classification Table (a)
                                               Predicted
                                               Result of Penalty Kick            Percentage
Observed                                       Missed Penalty   Scored Penalty   Correct
Step 1   Result of Penalty Kick   Missed Penalty   30               5                85.7
                                  Scored Penalty   7                33               82.5
         Overall Percentage                                                          84.0
a. The cut value is .500

Predictive accuracy has improved (was 53%)
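Exp(B) is the exponential of the coefficient: the factor by which the odds of scoring change when the predictor goes up by one unit. A sketch using the rounded B values from the table, so the last digit can differ slightly from the SPSS column:

```python
import math

coefficients = {"previous": 0.065, "pswq": -0.230, "Constant": 1.280}

odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))
```

A ratio above 1 (previous) means the odds of scoring rise with the predictor; a ratio below 1 (pswq) means they fall.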

Page 33

How is the classification table constructed?

$$\text{Pred. } P(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \text{previous} - 0.230 \cdot \text{pswq})}}$$

The slide repeats the Variables in the Equation table and the Classification Table from Block 1, with two annotations on the classification table:

– oops wrong prediction
– oops wrong prediction

Page 34

How is the classification table constructed?

$$\text{Pred. } P(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \text{previous} - 0.230 \cdot \text{pswq})}}$$

pswq   previous   scored   Predict. prob.
18     56         1        .68
17     35         1        .41
20     45         0        .40
10     42         0        .85
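The predicted probabilities in the table can be recomputed from the fitted equation. A sketch using the rounded coefficients from the slide, so values can differ from the table in the second decimal:

```python
import math

def predicted_probability(pswq, previous):
    """Fitted model from the SPSS output (rounded coefficients)."""
    z = 1.28 + 0.065 * previous - 0.230 * pswq
    return 1.0 / (1.0 + math.exp(-z))

players = [(18, 56), (17, 35), (20, 45), (10, 42)]   # (pswq, previous)
probabilities = [predicted_probability(pswq, previous) for pswq, previous in players]

for (pswq, previous), p in zip(players, probabilities):
    print(pswq, previous, round(p, 2))
```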

Page 35

How is the classification table constructed?

pswq   previous   scored   Predict. prob.   predicted
18     56         1        .68              1
17     35         1        .41              0
20     45         0        .40              0
10     42         0        .85              1

Each player is placed in the Classification Table (cut value .500) according to the observed outcome (scored) and the predicted outcome.
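Applying the cut value is a single comparison per player. A sketch using the four rows above (probabilities as printed on the slide):

```python
CUT_VALUE = 0.5

# (pswq, previous, scored, predicted probability) from the table above.
rows = [
    (18, 56, 1, 0.68),
    (17, 35, 1, 0.41),
    (20, 45, 0, 0.40),
    (10, 42, 0, 0.85),
]

predicted = [1 if prob > CUT_VALUE else 0 for (_, _, _, prob) in rows]
correct = [pred == scored for pred, (_, _, scored, _) in zip(predicted, rows)]

for row, pred, ok in zip(rows, predicted, correct):
    print(row, "predicted:", pred, "correct" if ok else "wrong prediction")
```

The second and fourth players end up in the off-diagonal cells of the classification table: one scored although a miss was predicted, the other missed although a score was predicted.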