
© Department of Statistics 2012 STATS 330 Lecture 19: Slide 1

Stats 330: Lecture 19


Plan of the day

In today’s lecture, we look at some general strategies for choosing models that have many continuous and categorical explanatory variables, and discuss an example.


General Principle

• For a problem with both continuous and categorical explanatory variables, the most general model is to fit separate regressions for each possible combination of the factor levels.

• That is, we allow the categorical variables to interact with each other and the continuous variables.


Illustration

• Two factors A and B, two continuous explanatory variables X and Z

• General model is y ~ A*B*X + A*B*Z

• Suppose A has a levels and B has b levels, so there are a x b factor level combinations

• Each combination has a separate regression with 3 parameters
  – Constant term
  – Coefficient of X
  – Coefficient of Z
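The parameter count can be checked with a small simulation. This is only a sketch: the data frame `sim.df`, the choice a = 3, b = 2 and the 20 observations per combination are made up for illustration and are not part of the lecture data.

```r
# Simulated data (an assumption, not the lecture data):
# factor A with a = 3 levels, factor B with b = 2 levels,
# 20 observations in every A x B combination.
set.seed(330)
a <- 3
b <- 2
sim.df <- expand.grid(A = factor(seq_len(a)),
                      B = factor(seq_len(b)),
                      rep = 1:20)
sim.df$X <- rnorm(nrow(sim.df))
sim.df$Z <- rnorm(nrow(sim.df))
sim.df$y <- rnorm(nrow(sim.df))

# General model: a separate plane for each factor level combination
full.lm <- lm(y ~ A*B*X + A*B*Z, data = sim.df)

# One constant, one X coefficient and one Z coefficient per combination
n.coef <- length(coef(full.lm))
n.coef == 3 * a * b
```

The coefficients are reported in the main effect and interaction parameterisation, but their number equals that of a x b separate regressions with 3 parameters each.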


Illustration (Cont)

• There are a x b constant terms; we can arrange them in a table

• Can split the table up into main effects and interactions as in 2 way anova

• Listed in output as Intercept, A, B and A:B


Illustration (Cont)

• There are a x b X-coefficients; we can also arrange them in a table

• Again, we can split the table up into main effects and interactions as in 2 way anova

• Listed in output as X, A:X, B:X and A:B:X

• Ditto for Z

• If all the A:X, B:X, A:B:X interactions are zero, the coefficient of X is the same for all the a x b regressions


Model selection

• In these situations, the number of possible models is large

• Need variable selection techniques
  – Anova
  – stepwise

• Don’t include high order interactions unless you include lower order interactions
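In R, the `*` operator in a formula already respects this rule: it expands to the highest order interaction together with every lower order term. A minimal sketch, using the placeholder names A, B and X from the illustration:

```r
# Expand a formula to see which terms A*B*X stands for.
# A, B and X are just symbols here; no data are needed.
f <- y ~ A*B*X
term.labels <- attr(terms(f), "term.labels")
term.labels
```

The expansion lists the main effects, the two-way interactions and the three-way interaction, which is how the terms appear in the anova output.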


Caution

• Sometimes we don’t have enough data to fit a separate regression to each factor level combination (need at least one more data point than number of continuous variables per combination)

• In this case we drop out the higher level interactions, forcing coefficients to have common values.


Example: Risk factors for low birthweight

These data were collected at Baystate Medical Center, Springfield, Mass. during 1986, as part of a study to identify risk factors for low-birthweight babies.

The response variable was birthweight, and data were collected on a variety of continuous and categorical explanatory variables.


Variables

age: mother's age in years, continuous

lwt: mother's weight in pounds, continuous

race: mother's race (1 = white, 2 = black, 3 = other), factor

smoke: smoking during pregnancy (1 = smoked, 0 = didn’t smoke), factor

ht: history of hypertension (0 = No, 1 = Yes), factor

ui: presence of uterine irritability (0 = No, 1 = Yes), factor

bwt: birth weight in grams, continuous, response

Must be a factor!!
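A minimal sketch of the conversion to factors. It assumes that the `birthwt` data frame in the MASS package (also from the 1986 Baystate study, 189 rows) can stand in for the lecture's `births.df`; the lecture's own data file may be coded differently.

```r
library(MASS)  # assumption: MASS::birthwt stands in for births.df

# Keep the variables described above
births.df <- birthwt[, c("age", "lwt", "race", "smoke", "ht", "ui", "bwt")]

# The categorical variables are stored as numeric codes,
# so convert them to factors before fitting
factor.vars <- c("race", "smoke", "ht", "ui")
births.df[factor.vars] <- lapply(births.df[factor.vars], factor)

str(births.df)
```

Without the conversion, a variable such as race (coded 1, 2, 3) would be treated as continuous and given a single slope rather than one effect per level.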


Preliminary plots

[Figure: five plots of baby's weight (gm) against mother's weight (lbs), UI, smoking, hypertension and race.]


Plotting conclusions

Some relationships between bwt and the covariates:

– Slight relationship with lwt

– Small effects due to the categorical variables

On to fitting models……


Factor level combinations

• There are 2 continuous explanatory variables and 4 categorical explanatory variables: race (3 levels), smoke (2 levels), ht (2 levels) and ui (2 levels). There are 3x2x2x2=24 factor level combinations.

• 24 regressions in all !!
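The number of observations in each combination can be tabulated directly. This sketch again assumes that `MASS::birthwt` stands in for `births.df`; the threshold of 3 points is the minimum needed to fit a plane in two continuous variables.

```r
library(MASS)  # assumption: MASS::birthwt stands in for births.df

births.df <- birthwt

# Cross-tabulate the four categorical variables: one cell per combination
cell.counts <- with(births.df, table(race, smoke, ht, ui))
length(cell.counts)       # number of factor level combinations

# How many combinations have fewer than 3 observations?
sum(cell.counts < 3)

# Long format, one row per combination
as.data.frame(cell.counts)
```

Combinations with too few observations are the reason for dropping the higher order interactions.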


Models

• The most general model would fit separate regression surfaces to each of the 24 combinations

• Assuming planes are appropriate, this means 24 x 3 = 72 parameters. There are 189 observations, so this is rather a lot of parameters (usually we want at least 5 observations per parameter). In fact, not all factor level combinations have enough data to fit a plane (need at least 3 points)

• The model fitting separate planes to each combination is

bwt ~ age*race*smoke*ht*ui + lwt*race*smoke*ht*ui


Fitting

• Can fit the model and use the anova function to reduce number of variables

> births.lm <- lm(bwt ~ age*race*smoke*ui*ht +
      lwt*race*smoke*ui*ht, data=births.df)
> anova(births.lm)

• Also use the stepwise function with the forward option

> null.lm <- lm(bwt ~ 1, data=births.df)
> step(null.lm, formula(births.lm), direction="forward")


Results: anova

Analysis of Variance Table
            Df   Sum Sq  Mean Sq F value    Pr(>F)
age          1   806927   806927  2.0610  0.153251
race         2  4456772  2228386  5.6916  0.004167 **
smoke        1  7098861  7098861 18.1314 3.674e-05 ***
ui           1  6513795  6513795 16.6370 7.414e-05 ***
ht           1  2458238  2458238  6.2786  0.013317 *
lwt          1  2779537  2779537  7.0993  0.008579 **
age:race     2   368694   184347  0.4708  0.625420
age:smoke    1  2220991  2220991  5.6727  0.018520 *
race:smoke   2  1085210   542605  1.3859  0.253374
age:ui       1   187617   187617  0.4792  0.489886
race:ui      2   774013   387006  0.9885  0.374625
smoke:ui     1    43060    43060  0.1100  0.740641
age:ht       1  1573461  1573461  4.0188  0.046844 *
race:ht      2   318415   159207  0.4066  0.666639
smoke:ht     1   115215   115215  0.2943  0.588322
race:lwt     2  1008962   504481  1.2885  0.278798
smoke:lwt    1    86923    86923  0.2220  0.638215


Results: anova (cont)

Analysis of Variance Table
                   Df   Sum Sq  Mean Sq F value   Pr(>F)
ui:lwt              1   196810   196810  0.5027 0.479457
ht:lwt              1  1145508  1145508  2.9258 0.089300 .
age:race:smoke      2  1063946   531973  1.3587 0.260218
age:race:ui         2   108742    54371  0.1389 0.870455
age:smoke:ui        1      533      533  0.0014 0.970632
race:smoke:ui       1   617235   617235  1.5765 0.211272
age:race:ht         2  1220320   610160  1.5584 0.213948
age:smoke:ht        1   406773   406773  1.0389 0.309752
race:smoke:lwt      2  1052747   526373  1.3444 0.263898
race:ui:lwt         2   786735   393367  1.0047 0.368668
smoke:ui:lwt        1  1128102  1128102  2.8813 0.091744 .
race:ht:lwt         1   435519   435519  1.1124 0.293310
age:race:smoke:ui   1  2544108  2544108  6.4980 0.011832 *
race:smoke:ui:lwt   1   150811   150811  0.3852 0.535806
Residuals         146 57162471   391524


Results: stepwise (forward/both)

Step: AIC= 2451.34
bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke

             Df Sum of Sq      RSS  AIC
<none>                    73000256 2451
- race:smoke  2   1657370 74657625 2452
+ ui:lwt      1    304152 72696104 2453
+ smoke:ht    1    168685 72831571 2453
- ht:lwt      1   1397486 74397742 2453
+ age         1    149901 72850355 2453
+ smoke:lwt   1     11843 72988412 2453
+ race:ht     2    497275 72502981 2454
+ race:lwt    2    441336 72558920 2454
- ui          1   6968046 79968302 2467
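The AIC that `step` reports for a linear model is the value returned by `extractAIC`, which for `lm` fits is n log(RSS/n) + 2 x (number of coefficients). The sketch below recomputes it from the RSS on the `<none>` line. The count of 10 coefficients (9 terms plus the intercept) is an assumption read off the model formula.

```r
# Recompute the stepwise AIC by hand from the reported RSS
n   <- 189        # observations in the data set
rss <- 73000256   # RSS on the <none> line of the step output
edf <- 10         # assumption: intercept plus 9 coefficients

aic <- n * log(rss / n) + 2 * edf
round(aic, 2)
```

The result should agree with the value on the Step line up to rounding. Note that this scale differs from that of `AIC()`, which includes additional constants, so the two should not be mixed when comparing models.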


Comparisons

• 3 models to compare
  – Full model
  – Model indicated by anova (model 2)
    bwt ~ age + ui + race + smoke + ht + lwt + age:ht + age:smoke
  – Model chosen by stepwise (model 3)
    bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke


Model           Adj R2   R2      Parameters  AIC
Full            0.2633   0.4279  42          2471
Model 2         0.2393   0.2757  9           2449
Model 3         0.2327   0.2694  9           2451
Additive model  0.1957   0.2213  7           2457

extractAIC(model3.lm)
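A comparison table of this kind can be assembled along the following lines. This is a sketch under assumptions: `MASS::birthwt` stands in for `births.df`, the additive model is taken to contain all six main effects (the slides do not give its formula), and `summarise.fit` is a helper name introduced here. The numbers it produces need not match the table exactly if the lecture data differ.

```r
library(MASS)  # assumption: MASS::birthwt stands in for births.df

births.df <- birthwt
factor.vars <- c("race", "smoke", "ht", "ui")
births.df[factor.vars] <- lapply(births.df[factor.vars], factor)

fits <- list(
  "Model 2"  = lm(bwt ~ age + ui + race + smoke + ht + lwt + age:ht + age:smoke,
                  data = births.df),
  "Model 3"  = lm(bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke,
                  data = births.df),
  # assumption: additive model = all six main effects
  "Additive" = lm(bwt ~ age + ui + race + smoke + ht + lwt, data = births.df)
)

# Hypothetical helper: one row of the comparison table per fitted model
summarise.fit <- function(fit) {
  s <- summary(fit)
  c(adj.R2     = s$adj.r.squared,
    R2         = s$r.squared,
    parameters = length(coef(fit)) - 1,   # excluding the intercept
    AIC        = extractAIC(fit)[2])
}

comparison <- t(sapply(fits, summarise.fit))
round(comparison, 4)
```

Counting parameters without the intercept reproduces the 9, 9 and 7 shown in the table.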


Deleting?

• Point 133 seems influential – big Cov ratio, HMD (hat matrix diagonal)

• Refitting without 133 now makes model 3 the best – will go with model 3

• Could also just use a purely additive model (i.e. parallel planes), but adjusted R2 and AIC are slightly worse.
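The influence measures mentioned here are available in base R. The sketch below assumes `MASS::birthwt` stands in for `births.df`, uses the cutoffs applied by `influence.measures` as a rule of thumb, and treats "133" as a row position, which may not correspond to the lecture's point 133.

```r
library(MASS)  # assumption: MASS::birthwt stands in for births.df

births.df <- birthwt
factor.vars <- c("race", "smoke", "ht", "ui")
births.df[factor.vars] <- lapply(births.df[factor.vars], factor)

model3.lm <- lm(bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke,
                data = births.df)

# Hat matrix diagonals, covariance ratios and Cook's distances
diag.df <- data.frame(hat       = hatvalues(model3.lm),
                      cov.ratio = covratio(model3.lm),
                      cooks     = cooks.distance(model3.lm))

# Rule-of-thumb cutoffs (the ones used by influence.measures)
p <- length(coef(model3.lm))
n <- nrow(births.df)
flagged <- diag.df[diag.df$hat > 3 * p / n |
                   abs(1 - diag.df$cov.ratio) > 3 * p / (n - p), ]
flagged

# Refit without one observation; row position 133 is an assumption
refit.lm <- lm(formula(model3.lm), data = births.df[-133, ])
```

Comparing `coef(model3.lm)` with `coef(refit.lm)` shows how much a single point moves the estimates.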


Summary Model 3

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   3158.801    267.867  11.792  < 2e-16 ***
ui1           -548.459    133.567  -4.106 6.12e-05 ***
race2         -561.784    187.680  -2.993 0.003152 **
race3         -500.440    133.004  -3.763 0.000228 ***
smoke1        -529.973    133.865  -3.959 0.000109 ***
ht1          -1978.134    711.642  -2.780 0.006026 **
lwt              2.426      1.788   1.357 0.176520
ht1:lwt         10.236      4.535   2.257 0.025217 *
race2:smoke1   255.066    300.258   0.849 0.396750
race3:smoke1   510.755    244.031   2.093 0.037768 *


Interpretation (cont)

Other things being equal:

• Uterine irritability associated with lower birthweight

• Smoking associated with lower birthweight, but differently for different races

• Hypertension associated with lower birthweight

• Race associated with lower birthweight
  – Black lower than white
  – “Other” lower than white

• Higher mother’s weight associated with higher birthweight, for hypertension group

• Smoking lowers birthweight more for race 1 (white).

• These effects significant but small compared to variability.


Interpretation of interactions

            White  Black  Other
Smoke No        0   -561   -500
Smoke Yes    -530   -836   -520

-836 = -530 - 561 + 255

-520 ≈ -530 - 500 + 511
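Each cell of this table is a sum of model 3 coefficients: the race effect, plus the smoking effect for smokers, plus the race:smoke interaction. A minimal sketch that rebuilds the table from the estimates in the model 3 summary (the object names are introduced here for illustration):

```r
# Estimates copied from the model 3 summary; white non-smokers are the baseline
race.eff  <- c(White = 0, Black = -561.784, Other = -500.440)
smoke.eff <- -529.973
inter.eff <- c(White = 0, Black = 255.066, Other = 510.755)

effects <- rbind("Smoke No"  = race.eff,
                 "Smoke Yes" = race.eff + smoke.eff + inter.eff)
round(effects)
```

The entries can be compared with the table above, allowing for rounding. The smoking effect within each race is the difference between the two rows, which is largest for the white group.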


Diagnostics for model 2

Check for high-influence etc

[Figure: four diagnostic plots: Residuals vs Fitted, Normal Q-Q plot, Scale-Location plot (all with points 132, 136 and 138 labelled) and Cook's distance plot (points 130, 132 and 106 labelled).]

Point 133 !!
