© Department of Statistics 2012 STATS 330 Lecture 25: Slide 1 Stats 330: Lecture 25

Upload: dominick-harris

Post on 28-Dec-2015


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 1

Stats 330: Lecture 25


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 2

Plan of the day

In today’s lecture we discuss prediction and present a logistic regression case study. Topics covered will be

• Prediction in logistic regression
• In-sample and out-of-sample error rates
• Cross-validation and bootstrap estimates of error rates
• Sensitivity and specificity
• ROC curves

Then, a case study


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 3

Housekeeping

• Error in slide 34 in lecture 23:

Function is now influenceplots

• Bug in ROC.curve – download replacement from web page


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 4

Prediction

• Suppose we have fitted a logistic model and we want to use the model to predict new cases. If a new case presents with explanatory variables x, how do we predict the y-value, 0 or 1?

• Work out the estimated log-odds for the case:

$\text{log-odds} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k$

• Work out the probability: prob = exp(log-odds)/(1 + exp(log-odds))

• Predict
– Y=1 if prob >= 0.5 (equivalently log-odds >= 0)
– Y=0 if prob < 0.5 (equivalently log-odds < 0)
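The prediction rule on this slide can be sketched in a few lines of R. The coefficient values and the helper name `predict_class` are hypothetical, chosen only for illustration; they are not taken from the lecture.

```r
# Hypothetical fitted coefficients (intercept, x1, x2); not from the lecture.
beta_hat <- c(-1.0, 0.8, -0.5)

# Predict the 0/1 response for new cases (one case per row of x_new).
predict_class <- function(beta_hat, x_new) {
  log_odds <- as.vector(cbind(1, x_new) %*% beta_hat)  # estimated log-odds
  prob <- exp(log_odds) / (1 + exp(log_odds))          # same as plogis(log_odds)
  data.frame(log_odds = log_odds, prob = prob,
             y_hat = as.integer(prob >= 0.5))          # predict 1 if prob >= 0.5
}

x_new <- rbind(c(2, 1), c(0, 1))
pred <- predict_class(beta_hat, x_new)
print(pred)
```

For a model fitted with `glm`, `predict(fit, newdata, type = "response")` returns the same probabilities directly.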


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 5

Estimating the prediction error

• Prediction error is the probability of a wrong classification (0’s predicted as 1’s, 1’s predicted as 0’s)

• As in linear regression, using the training data to estimate these proportions tends to give an optimistic estimate

• We can use cross-validation or the bootstrap to improve the estimate; see the case study


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 6

Sensitivity and specificity

• Sensitivity: probability of predicting a 1 when the case is truly a 1: the “true positive rate”

• Specificity: probability of predicting a 0 when the case is truly a 0: the “true negative rate” (1-specificity is called the “false positive rate”)

• Ideally, we want both to be close to 1

• We would like to know what these would be for new data: use cross-validation and the bootstrap, as in linear regression


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 7

Calculating sensitivity and specificity

                         Model predicts
                         Failure (0)   Success (1)
Actual   Failure (0)         100           200
         Success (1)         250           600

Specificity = 100/(100+200) = 33%

Sensitivity = 600/(600+250) = 71%

In-sample error rate = (200+250)/1150 = 39%
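The arithmetic on this slide can be checked with a short R sketch. The object names are illustrative; the counts are the ones in the table.

```r
# Confusion matrix from the slide: rows = actual, columns = predicted.
tab <- matrix(c(100, 200,
                250, 600),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

specificity <- tab["0", "0"] / sum(tab["0", ])   # true 0's predicted as 0
sensitivity <- tab["1", "1"] / sum(tab["1", ])   # true 1's predicted as 1
error_rate  <- (tab["0", "1"] + tab["1", "0"]) / sum(tab)

round(c(specificity = specificity, sensitivity = sensitivity,
        error_rate = error_rate), 3)
```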


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 8

ROC curves

• We have predicted a “success” (Y=1) if log-odds >= 0.
• We can generalize this to predict a success if log-odds >= c for some constant c
• If c is large and negative, almost every case will be predicted as a success (1)
– Sensitivity close to 1, specificity close to 0
• If c is large and positive, almost every case will be predicted as a failure (0)
– Sensitivity close to 0, specificity close to 1
• This allows a trade-off between sensitivity and specificity
• As c varies, the sensitivity and specificity change.
• The ROC curve is a plot of the points (1-specificity, sensitivity) as c changes.
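A minimal sketch of how the points of a ROC curve arise as the cutoff c varies. The data are simulated purely for illustration, and the helper `sens_spec` is a hypothetical name, not a function from the course's R330 package.

```r
set.seed(330)
# Simulated data (illustrative only): one predictor, true log-odds = -0.5 + x.
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)
log_odds <- predict(fit)   # default type = "link" gives the log-odds

# Sensitivity and specificity when we predict 1 if log-odds >= cutoff.
sens_spec <- function(cutoff, log_odds, y) {
  y_hat <- as.integer(log_odds >= cutoff)
  c(sensitivity = mean(y_hat[y == 1] == 1),
    specificity = mean(y_hat[y == 0] == 0))
}

cutoffs <- seq(-4, 4, by = 0.25)
roc <- t(sapply(cutoffs, sens_spec, log_odds = log_odds, y = y))
roc_points <- data.frame(cutoff = cutoffs,
                         fpr = 1 - roc[, "specificity"],
                         tpr = roc[, "sensitivity"])
head(roc_points)
# plot(roc_points$fpr, roc_points$tpr, type = "l",
#      xlab = "False positive rate", ylab = "True positive rate")
```

Raising the cutoff can only turn predicted 1's into predicted 0's, so both the true positive rate and the false positive rate fall as c increases.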


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 9

[Figure: ROC curve, true positive rate plotted against false positive rate (both axes from 0 to 1). AUC = 0.6567]


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 10

ROC curves - cont

[Figure: three ROC curves (true positive rate against false positive rate), labelled “Perfect prediction”, “Worst case prediction” and “Predictor no help”]


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 11

Area under the curve

• For a perfect predictor, the area under the ROC curve (AUC) is 1.

• If the predictor is independent of the response, the ROC curve lies along the diagonal (sensitivity = 1 - specificity for every c), so the AUC is 0.5.

• The AUC serves as a measure of how good the model is at predicting.
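One way to make the AUC concrete: it equals the proportion of (1, 0) pairs of cases in which the case that is truly a 1 receives the higher score, with ties counted as one half. This interpretation is a standard fact rather than something stated on the slide. The sketch below uses simulated scores, and `auc_pairs` is a hypothetical helper name.

```r
set.seed(25)
# Simulated scores (illustrative only): cases with y = 1 tend to score higher.
y <- rbinom(400, 1, 0.3)
score <- rnorm(400, mean = 0.8 * y)

# AUC as the proportion of (1, 0) pairs in which the 1 has the higher score
# (ties count one half).
auc_pairs <- function(score, y) {
  s1 <- score[y == 1]
  s0 <- score[y == 0]
  mean(outer(s1, s0, ">") + 0.5 * outer(s1, s0, "=="))
}

auc_informative <- auc_pairs(score, y)          # informative score: above 0.5
auc_noise       <- auc_pairs(rnorm(400), y)     # independent of y: near 0.5
auc_perfect     <- auc_pairs(as.numeric(y), y)  # perfect predictor: 1
c(auc_informative, auc_noise, auc_perfect)
```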


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 12

Case study

The data comes from the University of Massachusetts AIDS Research Unit IMPACT study, a medical study performed in the US in the early 90’s. The study aimed to evaluate two different treatments for drug addiction.

Reference: Hosmer and Lemeshow, Applied Logistic Regression (2nd Ed), p28


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 13

List of variables

Variable description               Codes/Values                          Name
Identification Code                1-575                                 ID
Age at Enrollment                  Years                                 AGE
Beck Depression Score              0-54                                  BECK
IV Drug Use History at Admission   1 = Never, 2 = Previous, 3 = Recent   IVHX
No of prior Treatments             0-40                                  NDRUGTX
Subject's Race                     0 = White, 1 = Other                  RACE
Treatment Duration                 0 = Short, 1 = Long                   TREAT
Treatment Site                     0 = A, 1 = B                          SITE
Remained Drug Free                 1 = Yes, 0 = No                       DFREE


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 14

The variables

• The response DFREE is binary: records if subject is drug-free after conclusion of treatment.

• There is a mix of categorical and continuous explanatory variables

• Categorical: IVHX, RACE, TREAT, SITE

• Continuous: AGE, BECK, NDRUGTX.


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 15

Questions

• Is the longer treatment more effective?

• Did Site A deliver the program more effectively than site B?

• What other variables have an effect on successful rehabilitation of addicts?

• Can we predict who is likely to be drug-free in 12 months?


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 16

Analysis strategy

• Preliminary plots, tables

• Variable selection

• Model fitting

• Interpretation of coefficients

• Evaluation as a predictor of recovery from addiction.


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 17

Preliminary plots

[Figure: two scatterplots against Age (horizontal axis from 20 to 55), one of Beck score and one of number of prior treatments. Red: Drug Free; Blue: Relapse]


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 18

Preliminary plots (2)

[Figure: three panels showing the estimated probability of being drug free by IVHX (Never, Previous, Recent), by TREAT (Short, Long) and by SITE (A, B)]


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 19

Preliminary plots (3)

• It seems that the number of previous drug treatments has an effect

• It seems that the factors IVHX (Recent IV drug use), SITE (Site A or Site B) and TREAT (short or long treatment) have an effect


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 20

Preliminary fits (1)

Call:

glm(formula = DFREE ~ . - IVHX - ID + factor(IVHX), family = binomial,

data = drug.df)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.3465 -0.8091 -0.6326 1.1834 2.4231

Don’t include ID!


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 21

Preliminary fits (2)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)   -2.4111283  0.5983427  -4.030 5.59e-05 ***
AGE            0.0504143  0.0174057   2.896  0.00377 **
BECK           0.0002759  0.0107982   0.026  0.97961
NDRUGTX       -0.0615329  0.0256441  -2.399  0.01642 *
RACE           0.2260262  0.2233685   1.012  0.31159
TREAT          0.4424802  0.1992922   2.220  0.02640 *
SITE           0.1489209  0.2176062   0.684  0.49375
factor(IVHX)2 -0.6036962  0.2875974  -2.099  0.03581 *
factor(IVHX)3 -0.7336591  0.2549893  -2.877  0.00401 **

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 653.73 on 574 degrees of freedom
Residual deviance: 619.25 on 566 degrees of freedom
AIC: 637.25

Number of Fisher Scoring iterations: 4


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 22

Preliminary conclusions

• Important variables seem to be AGE, NDRUGTX, TREAT, IVHX

• Data are ungrouped, so we can’t assess goodness of fit with the residual deviance

• No extremely large residuals


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 23

Hosmer-Lemeshow test

> HLstat(drug.glm)

Value of HL statistic = 5.05

P-value = 0.752

No evidence of a bad fit using this test
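HLstat comes from the course's R330 package, and the slides do not show how it forms its groups. As a rough sketch of the usual Hosmer-Lemeshow construction (ten groups of roughly equal size by fitted probability, compared with a chi-squared distribution on 8 degrees of freedom), the following uses simulated data and a hypothetical helper `hl_sketch`. It should not be read as the package's implementation.

```r
set.seed(23)
# Simulated data (illustrative only); the model is correctly specified here.
n <- 575
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.7 * x))
fit <- glm(y ~ x, family = binomial)

# Hosmer-Lemeshow-style statistic: group cases by fitted probability and
# compare observed and expected numbers of 1's in each group.
hl_sketch <- function(y, p, g = 10) {
  breaks <- quantile(p, probs = seq(0, 1, length.out = g + 1))
  grp <- cut(p, breaks = breaks, include.lowest = TRUE)
  n_g  <- tapply(y, grp, length)   # group sizes
  obs  <- tapply(y, grp, sum)      # observed number of 1's
  expd <- tapply(p, grp, sum)      # expected number of 1's
  # expd * (1 - expd / n_g) equals n_g * pbar * (1 - pbar) for each group
  stat <- sum((obs - expd)^2 / (expd * (1 - expd / n_g)))
  df <- g - 2
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

hl <- hl_sketch(y, fitted(fit))
str(hl)
```

A large p-value, as on the slide, means the observed and expected counts agree about as well as chance alone would allow.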


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 24

Variable selection (1)

> anova(drug.glm, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit
Response: DFREE
Terms added sequentially (first to last)

             Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                           574     653.73
AGE           1     1.40       573     652.33 0.24
BECK          1     0.57       572     651.76 0.45
NDRUGTX       1    14.07       571     637.69 0.0001760
RACE          1     3.06       570     634.63 0.08
TREAT         1     4.96       569     629.67 0.03
SITE          1     1.07       568     628.60 0.30
factor(IVHX)  2     9.35       566     619.25 0.01
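Each p-value in the analysis of deviance table is the upper tail of a chi-squared distribution evaluated at the drop in deviance. A short sketch reproducing them from the numbers in the table (the vector names are illustrative):

```r
# Deviance drops and degrees of freedom copied from the table.
dev_drop <- c(AGE = 1.40, BECK = 0.57, NDRUGTX = 14.07, RACE = 3.06,
              TREAT = 4.96, SITE = 1.07, IVHX = 9.35)
df <- c(AGE = 1, BECK = 1, NDRUGTX = 1, RACE = 1, TREAT = 1, SITE = 1, IVHX = 2)

# Upper-tail chi-squared probability of each drop in deviance.
p_value <- pchisq(dev_drop, df = df, lower.tail = FALSE)
round(p_value, 4)

# The residual deviances are the null deviance minus the cumulative drops.
resid_dev <- 653.73 - cumsum(dev_drop)
round(resid_dev, 2)
```

Because the terms are added sequentially, each test is conditional on the terms listed above it, so the p-values depend on the order of the variables.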


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 25

Variable selection (2)

Step: AIC= 632.59
DFREE ~ NDRUGTX + IVHX + AGE + TREAT

Call: glm(formula = DFREE ~ NDRUGTX + IVHX + AGE + TREAT,
    family = binomial, data = drug.df)

Degrees of Freedom: 574 Total (i.e. Null); 569 Residual
Null Deviance: 653.7    Residual Deviance: 620.6    AIC: 632.6


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 26

Sub-model

> sub.glm<-glm(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT,
    family=binomial, data=drug.df)
> summary(sub.glm)

Call:
glm(formula = DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT,
    family = binomial, data = drug.df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.2598  -0.8051  -0.6284   1.1401   2.4574


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 27

Sub-model (ii)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -2.33276    0.54838  -4.254  2.1e-05 ***
NDRUGTX       -0.06376    0.02563  -2.488 0.012844 *
factor(IVHX)2 -0.62366    0.28470  -2.191 0.028484 *
factor(IVHX)3 -0.80561    0.24453  -3.294 0.000986 ***
AGE            0.05259    0.01721   3.056 0.002244 **
TREAT          0.45134    0.19860   2.273 0.023048 *

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 653.73 on 574 degrees of freedom
Residual deviance: 620.59 on 569 degrees of freedom
AIC: 632.59

Number of Fisher Scoring iterations: 4

All variables significant, but use caution


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 28

Do we need interaction terms?

> sub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT,
    family=binomial, data=drug.df)

Model with interactions:

> sub2.glm<-glm(DFREE ~ NDRUGTX*IVHX + AGE*IVHX + AGE*TREAT + NDRUGTX*TREAT,
    family=binomial, data=drug.df)

> anova(sub.glm, sub2.glm, test="Chisq")
Analysis of Deviance Table

Model 1: DFREE ~ NDRUGTX + IVHX + AGE + TREAT
Model 2: DFREE ~ NDRUGTX * IVHX + AGE * IVHX + AGE * TREAT + NDRUGTX * TREAT
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1       569     620.59
2       563     616.16  6     4.42      0.62

Big p-value, so interactions not required


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 29

Do we need to transform?

library(mgcv)  # assumption: gam() and s() come from the mgcv package
par(mfrow=c(1,2))
sub.gam<-gam(DFREE ~ s(NDRUGTX) + factor(IVHX) + s(AGE) + TREAT,
    family=binomial, data=drug.df)
plot(sub.gam)

[Figure: fitted smooth terms, s(NDRUGTX, 1.52) against NDRUGTX (0 to 40) and s(AGE, 1) against AGE (20 to 55)]


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 30

Transforming

• Suggests a possible quadratic in NDRUGTX:

> subq.glm<-glm(DFREE ~ poly(NDRUGTX,2) + IVHX + AGE + TREAT,
    family=binomial, data=drug.df)
> summary(subq.glm)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       -2.70763    0.56715  -4.774 1.80e-06 ***
poly(NDRUGTX, 2)1 -7.24501    2.93274  -2.470  0.01350 *
poly(NDRUGTX, 2)2  4.21319    2.69624   1.563  0.11814
IVHX2             -0.59018    0.28608  -2.063  0.03912 *
IVHX3             -0.76015    0.24725  -3.074  0.00211 **
AGE                0.05458    0.01730   3.154  0.00161 **
TREAT              0.44379    0.19904   2.230  0.02577 *

But the quadratic term is not significant, so we stick with no transformation


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 31

Diagnostics

[Figure: diagnostic plots for the sub-model. Panels: Residuals vs Fitted, Normal Q-Q plot, Scale-Location plot and Cook's distance plot (points 7, 350 and 471 labelled); Index plot of deviance residuals, Leverage plot (points 85, 384, 551 and 571 labelled), Cook's Distance Plot and Deviance Changes Plot. Annotated points: 85, 7 and 471]


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 32

Influence of 7, 471, 85

Effect on coefficients of removing cases:

               None       7      85     471     All
(Intercept)  -2.333  -2.295  -2.222  -2.447  -2.293
NDRUGTX      -0.064  -0.084  -0.065  -0.075  -0.100
IVHX2        -0.624  -0.595  -0.635  -0.680  -0.662
IVHX3        -0.806  -0.783  -0.785  -0.795  -0.747
AGE           0.053   0.053   0.049   0.057   0.054
TREAT         0.451   0.434   0.441   0.479   0.450

None seem particularly influential! We will not delete them


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 33

Over-dispersion

> qsub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT,
    family=quasibinomial, data=drug.df)
> summary(qsub.glm)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.33276    0.55435  -4.208 2.99e-05 ***
NDRUGTX     -0.06376    0.02591  -2.461  0.01414 *
IVHX2       -0.62366    0.28780  -2.167  0.03065 *
IVHX3       -0.80561    0.24720  -3.259  0.00118 **
AGE          0.05259    0.01740   3.023  0.00262 **
TREAT        0.45134    0.20076   2.248  0.02495 *
---
(Dispersion parameter for quasibinomial family taken to be 1.021892)

Very close to 1, so no overdispersion
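The quasibinomial fit leaves the estimates unchanged and multiplies each binomial standard error by the square root of the estimated dispersion. A small check using the numbers reported in the two summaries (vector names are illustrative):

```r
# Standard errors from the binomial fit (sub-model summary in the lecture).
se_binomial <- c("(Intercept)" = 0.54838, NDRUGTX = 0.02563, IVHX2 = 0.28470,
                 IVHX3 = 0.24453, AGE = 0.01721, TREAT = 0.19860)
dispersion <- 1.021892   # estimate reported by the quasibinomial fit

# Quasi-likelihood inflates each standard error by sqrt(dispersion).
se_quasi <- se_binomial * sqrt(dispersion)
round(se_quasi, 5)
```

With a dispersion this close to 1 the inflation is about 1%, which is why the two sets of standard errors and p-values are nearly identical.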


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 34

Interpretation

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.33276    0.54838  -4.254  2.1e-05 ***
NDRUGTX     -0.06376    0.02563  -2.488 0.012844 *
IVHX2       -0.62366    0.28470  -2.191 0.028484 *
IVHX3       -0.80561    0.24453  -3.294 0.000986 ***
AGE          0.05259    0.01721   3.056 0.002244 **
TREAT        0.45134    0.19860   2.273 0.023048 *

• As the number of prior treatments goes up, the probability of a drug-free recovery goes down
• The probability of a drug-free recovery for persons with no IV drug use is higher than for persons with previous IV drug use
• The probability of a drug-free recovery for persons with previous IV drug use is higher than for persons with recent IV drug use
• The probability of a drug-free recovery goes up with age
• The probability of a drug-free recovery is higher for the long treatment
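The directions described above can be put on the odds scale by exponentiating the coefficients. The odds-ratio reading is the standard interpretation of logistic regression coefficients rather than something shown on the slide. A short sketch:

```r
# Coefficients of the chosen sub-model, copied from the summary.
beta_hat <- c(NDRUGTX = -0.06376, IVHX2 = -0.62366, IVHX3 = -0.80561,
              AGE = 0.05259, TREAT = 0.45134)

# exp(coefficient) is the multiplicative change in the odds of being drug free
# for a one-unit increase (or relative to the baseline level for a factor),
# holding the other variables fixed.
odds_ratio <- exp(beta_hat)
round(odds_ratio, 3)

# Effect of ten extra years of age on the odds.
age_10yr <- exp(10 * beta_hat["AGE"])
age_10yr
```

An odds ratio below 1 (NDRUGTX, IVHX2, IVHX3) corresponds to a lower probability of recovery; one above 1 (AGE, TREAT) corresponds to a higher probability.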


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 35

Interpreting p-values after model selection

• We have seen that this is not valid, as model selection changes the distribution of the estimated coefficients.

• We can use the bootstrap to examine the revised distribution

• Leave TREAT in the model, use forward selection to select the other variables.


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 36

Procedure

• Draw a bootstrap sample

• Do forward selection, record the value of the regression coefficient for TREAT (forced to be in every model)

• Repeat 200 times, draw a histogram of the results


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 37

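R code

# bootstrap sample
n = dim(drug.df)[1]
B = 200
beta.boot = numeric(B)
for(b in 1:B){
  # resample n cases with replacement: ni[i] = number of times case i is drawn
  ni = rmultinom(1, n, prob=rep(1/n,n))
  newdata = drug.df[rep(1:n,ni),]
  # the full model defines the upper end of the scope
  drug.boot.glm = glm(DFREE ~ NDRUGTX + factor(IVHX) + AGE + BECK + TREAT + RACE + SITE,
                      family=binomial, data=newdata)
  # forward selection must start from the smallest model (TREAT only);
  # started from the full model, step() would have nothing left to add
  start.glm = glm(DFREE ~ TREAT, family=binomial, data=newdata)
  chosen = step(start.glm, scope = list(lower = DFREE ~ TREAT, upper = formula(drug.boot.glm)),
                direction = "forward", trace=0)
  # TREAT is forced into every model, so its coefficient is always present
  k = match("TREAT",names(coef(chosen)))
  beta.boot[b] = summary(chosen)$coefficients[k,1]
}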


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 38

Histogram

[Figure: histogram of beta.boot; horizontal axis from -0.2 to 1.0, vertical axis Frequency]

> mean(beta.boot)
[1] 0.4540209
> sd(beta.boot)
[1] 0.2019222
> z.val = mean(beta.boot)/ sd(beta.boot)
> 2*(1-pnorm(z.val))
[1] 0.02454468

Compare with the fitted sub-model: Beta = 0.45134, SE = 0.19860, P-value = 0.023048


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 39

Prediction

• Sensitivity: chance the model predicts a successful recovery (drug-free at end of program), when one will actually occur

• Specificity: chance the model predicts a failure (return to drug use before end of program), when one actually will occur


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 40

R code

> sub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT,
    family=binomial, data=drug.df)
> pred = predict(sub.glm, type="response")
> predcode = ifelse(pred<0.5, 0,1)
> table(drug.df$DFREE,predcode)

          predicted
Actual      0     1
     0    426     2
     1    144     3

Sensitivity = 3/147 = 0.02040816
Specificity = 426/428 = 0.9953271
Error rate = 146/575 = 0.2539130
Proportion correctly classified = 429/575 = 0.746087


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 41

ROC curve

ROC.curve(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data= drug.df)   # in the R330 package


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 42

Prediction (2)

• Use 10-fold cross-validation
– Split data into 10 parts
– Calculate sensitivity and specificity for each part, using model fitted to the remaining parts
– Average results
– Repeat for different splits, average repeats

• E.g. for one part
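The cross-validation steps listed above can be sketched in base R. This uses simulated data and a hypothetical helper `cv_once`; it illustrates the procedure and is not necessarily what `cross.val` in the R330 package does internally.

```r
set.seed(2012)
# Simulated stand-in for the case-study data (illustrative only).
n <- 600
sim.df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim.df$y <- rbinom(n, 1, plogis(-0.3 + 1.2 * sim.df$x1 - 0.8 * sim.df$x2))

# One pass of k-fold cross-validation for a logistic regression.
cv_once <- function(formula, data, k = 10) {
  fold <- sample(rep(1:k, length.out = nrow(data)))  # random split into k parts
  y_name <- all.vars(formula)[1]
  res <- sapply(1:k, function(j) {
    # fit to the other k - 1 parts, predict the held-out part
    fit <- glm(formula, family = binomial, data = data[fold != j, ])
    prob <- predict(fit, newdata = data[fold == j, ], type = "response")
    y_hat <- as.integer(prob >= 0.5)
    y <- data[fold == j, y_name]
    c(sensitivity = mean(y_hat[y == 1] == 1),
      specificity = mean(y_hat[y == 0] == 0),
      correct     = mean(y_hat == y))
  })
  rowMeans(res)                                      # average over the k parts
}

# Repeat for different random splits and average the repeats.
cv_repeats <- replicate(5, cv_once(y ~ x1 + x2, sim.df))
cv_result <- rowMeans(cv_repeats)
round(cv_result, 3)
```

Each case is predicted by a model that never saw it, which removes the optimism of the training-set estimates.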


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 43

CV and bootstrap Results

> cross.val(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, drug.df)
Mean Specificity = 0.9908424
Mean Sensitivity = 0.02854005
Mean Correctly classified = 0.7446491

> err.boot(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data= drug.df)
$err
[1] 0.2539130    (training set estimate)
$Err
[1] 0.2552974    (bootstrap estimate)

A poor classifier, but this doesn’t mean that the model fits poorly. There are very few cases with fitted probabilities over 0.5, and many with fitted probabilities between 0.2 and 0.5. We expect a moderate number of these to be misclassified, as some events (being drug free) with probabilities between 0.2 and 0.5 have occurred.


© Department of Statistics 2012 STATS 330 Lecture 25: Slide 44

Overall conclusions

• Model seems to fit well

• Strong evidence that longer treatments are better

• No apparent difference between sites

• Age and prior IV drug use affect recovery

• Model predicts poorly for the covariates in the data set – effectively always predicts patients will not be drug free