12/22/2015330 lecture 171 stats 330: lecture 17. 12/22/2015330 lecture 172 factors in the models...

04/21/23 330 lecture 17 1

STATS 330: Lecture 17

Factors

In the models discussed so far, all explanatory variables have been numeric.

Now we want to incorporate categorical variables into our models.

In R, categorical variables are called factors.

Example

Consider an experiment to measure the rate of metal removal in a machining process on a lathe.

The rate depends on the speed setting of the lathe (fast, medium or slow, a categorical measurement) and the hardness of the material being machined (a continuous measurement).

Data

   hardness setting rate
1       120    slow   68
2       140    slow   90
3       150    slow   98
4       125    slow   77
5       136    slow   88
6       165  medium  122
7       140  medium  104
8       120  medium   75
9       125  medium   84
10      133  medium   95
11      175    fast  138
12      132    fast  102
13      124    fast   93
14      141    fast  112
15      130    fast  100
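The listing above can be entered directly in R. This is a minimal sketch: the name metal.df matches the data frame used in the fitting code later in the lecture, but building it by hand (rather than reading it from a file) and the helper name group.means are assumptions made for illustration.

```r
# Lathe data from the listing above, typed in by hand (assumption: the
# lecture's metal.df holds these same 15 rows in this order).
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136,    # slow
               165, 140, 120, 125, 133,    # medium
               175, 132, 124, 141, 130),   # fast
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c( 68,  90,  98,  77,  88,
               122, 104,  75,  84,  95,
               138, 102,  93, 112, 100)
)

# Five observations per setting.
print(table(metal.df$setting))

# Mean hardness and mean rate within each setting.
group.means <- aggregate(cbind(hardness, rate) ~ setting,
                         data = metal.df, FUN = mean)
print(group.means)
```

Storing setting as a factor here means the later model formulas treat it as categorical without any further conversion.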

[Figure: Plot of rate versus hardness for different lathe speeds. x-axis: hardness of metal; y-axis: rate of metal removal. Points are labelled s, m and f for the slow, medium and fast settings.]

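Model

A model consisting of 3 parallel lines seems appropriate:

$$\text{rate} = \alpha_S + \beta \times \text{hardness} \quad \text{(slow)}$$

$$\text{rate} = \alpha_M + \beta \times \text{hardness} \quad \text{(medium)}$$

$$\text{rate} = \alpha_F + \beta \times \text{hardness} \quad \text{(fast)}$$

Note the same slope $\beta$ in each equation, i.e. parallel lines.

Different intercepts $\alpha_S$, $\alpha_M$, $\alpha_F$.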

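Baseline version

We can regard the fast setting as a baseline and express the other settings as “baseline plus offsets”. Writing $\alpha_S$, $\alpha_M$ and $\alpha_F$ for the intercepts of the slow, medium and fast lines:

$$\alpha_S = \alpha + \delta_S$$

$$\alpha_M = \alpha + \delta_M$$

$$\alpha_F = \alpha$$

Here $\alpha$ is the baseline, $\delta_M$ is the offset for the medium line and $\delta_S$ is the offset for the slow line.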

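Baseline version (2)

We can then write the model as

For the slow setting: $$\text{rate} = \alpha + \delta_S + \beta \times \text{hardness} + e$$

For the medium setting: $$\text{rate} = \alpha + \delta_M + \beta \times \text{hardness} + e$$

For the fast setting: $$\text{rate} = \alpha + \beta \times \text{hardness} + e$$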

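“Deviation from mean” version

Now let $\alpha$ be the mean of the intercepts $\alpha_F$, $\alpha_M$ and $\alpha_S$. Define

$$\delta_S = \alpha_S - \alpha$$

$$\delta_M = \alpha_M - \alpha$$

$$\delta_F = \alpha_F - \alpha$$

For example, in the last equation $\alpha_F$ is the “fast” line intercept and $\alpha$ is the mean of the intercepts.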

“Deviation from mean” version (2)

Then

$$\alpha_S = \alpha + \delta_S$$

$$\alpha_M = \alpha + \delta_M$$

$$\alpha_F = \alpha + \delta_F$$

Thus, $\alpha$ is now the “average” intercept, and there are 3 offsets, one for each line. The 3 offsets add to zero. This is the form used in the Stage 2 course.
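In R, the “deviation from mean” form corresponds to sum-to-zero contrasts. The sketch below is illustrative only: it uses lm (introduced later in the lecture), re-enters the lathe data so that it runs on its own, and the choice of contr.sum and the names fit.sum and offsets are assumptions rather than part of the original slides.

```r
# Lathe data (same 15 rows as the data slide), re-entered so this runs alone.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

# Sum-to-zero contrasts: the intercept becomes the mean of the three line
# intercepts, and each level gets an offset from that mean.
fit.sum <- lm(rate ~ setting + hardness, data = metal.df,
              contrasts = list(setting = "contr.sum"))
cf <- coef(fit.sum)

# R reports offsets for the first two levels (fast, medium); the offset for
# the last level (slow) is minus their sum, so the three add to zero.
offsets <- c(fast   = cf[["setting1"]],
             medium = cf[["setting2"]],
             slow   = -(cf[["setting1"]] + cf[["setting2"]]))

print(cf[["(Intercept)"]])  # average intercept
print(offsets)
print(sum(offsets))         # zero, up to rounding error
```

Adding each offset to the average intercept recovers the three line intercepts, so this is the same fitted model as the baseline form, only parameterised differently.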

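Dummy variables

Back to baseline form: we can combine the 3 “baseline” equations into one by using “dummy variables”. Define

med = 1 if setting = “medium” and 0 otherwise

slow = 1 if setting = “slow” and 0 otherwise

Then we can write the model as

$$\text{rate} = \alpha + \delta_M \times \text{med} + \delta_S \times \text{slow} + \beta \times \text{hardness}$$

where $\alpha$ is the baseline (fast) intercept, $\delta_M$ and $\delta_S$ are the offsets for the medium and slow lines, and $\beta$ is the common slope.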

Fitting

The model can be fitted as usual using lm:

> med <- ifelse(metal.df$setting == "medium", 1, 0)
> slow <- ifelse(metal.df$setting == "slow", 1, 0)
> summary(lm(rate ~ med + slow + hardness, data = metal.df))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.17042    7.15425  -3.099 0.010124 *
med          -9.44980    1.87275  -5.046 0.000374 ***
slow        -19.00757    1.88875 -10.064 6.94e-07 ***
hardness      0.93426    0.05008  18.654 1.13e-09 ***


Fitting (2)

Thus, the baseline has intercept -22.17042

The “medium” line has intercept -22.17042 -9.44980 = -31.62022

The “slow” line has intercept -22.17042 -19.00757 = -41.17799
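The intercept arithmetic above can be reproduced from the fitted coefficients. A minimal sketch, assuming the lathe data are entered by hand as metal.df; the names fit.dummy and intercepts are introduced here for illustration.

```r
# Lathe data (same 15 rows as the data slide), re-entered so this runs alone.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

# Hand-made dummy variables, as on the fitting slide.
med  <- ifelse(metal.df$setting == "medium", 1, 0)
slow <- ifelse(metal.df$setting == "slow", 1, 0)

fit.dummy <- lm(rate ~ med + slow + hardness, data = metal.df)
cf <- coef(fit.dummy)

# Baseline (fast) intercept, then baseline plus each offset.
intercepts <- c(fast   = cf[["(Intercept)"]],
                medium = cf[["(Intercept)"]] + cf[["med"]],
                slow   = cf[["(Intercept)"]] + cf[["slow"]])
print(round(intercepts, 5))
```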

[Figure: Plot of rate versus hardness for different lathe speeds, with a legend for slow, medium and fast and annotations “baseline”, “Offset m” and “Offset s”. x-axis: hardness of metal; y-axis: rate of metal removal. Points are labelled s, m and f.]

Fitting (3)

Making dummy variables is a pain. Fortunately R allows us to write

> summary(lm(rate ~ setting + hardness))

               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -22.17042    7.15425  -3.099 0.010124 *
settingmedium  -9.44980    1.87275  -5.046 0.000374 ***
settingslow   -19.00757    1.88875 -10.064 6.94e-07 ***
hardness        0.93426    0.05008  18.654 1.13e-09 ***

and get the same result, provided the variable setting is a factor.
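To see why the two fits agree, one can inspect the design matrix that lm builds from the factor. This is a sketch under the assumption that the lathe data are entered by hand as metal.df; fit.factor and X are names introduced here.

```r
# Lathe data (same 15 rows as the data slide), re-entered so this runs alone.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

fit.factor <- lm(rate ~ setting + hardness, data = metal.df)

# The design matrix holds the dummy variables R created from the factor.
X <- model.matrix(fit.factor)
print(colnames(X))

# Rows 1, 6 and 11 are one slow, one medium and one fast observation.
print(X[c(1, 6, 11), ])

# The generated columns match the hand-made dummy variables.
med  <- ifelse(metal.df$setting == "medium", 1, 0)
slow <- ifelse(metal.df$setting == "slow", 1, 0)
print(all(X[, "settingmedium"] == med) && all(X[, "settingslow"] == slow))
```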

Factors

Since the data for setting in the input data was character data, the variable setting was automatically recognized as a factor. (From R 4.0.0 onwards, data.frame and read.table keep character columns as character unless stringsAsFactors = TRUE is given; lm still treats a character variable as categorical.)

In fact the 3 settings were 1000, 1200, 1400 rpm. What would happen if the input data had used these (numerical) values?

Answer: the lm function would have assumed that setting was a continuous variable and fitted a plane, not 3 parallel lines.

Factors (2)

> rpm = rep(c(1000,1200,1400), c(5,5,5))
> summary(lm(rate ~ rpm + hardness, data = metal.df))

Call:
lm(formula = rate ~ rpm + hardness, data = metal.df)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -88.674624   7.837602  -11.31 9.29e-08 ***
rpm           0.047519   0.004521   10.51 2.09e-07 ***
hardness      0.934226   0.047944   19.49 1.89e-10 ***

When rpm = 1000, the relationship is

rate = -88.674624 + 0.047519 * 1000 + 0.934226 * hardness

i.e. rate = -41.15562 + 0.934226 * hardness

Factors (3)

             Intercept                  Slope
          factor  non-factor      factor  non-factor
Fast   -22.17042   -22.14802     0.93426     0.93423
Medium -31.62022   -31.65182     0.93426     0.93423
Slow   -41.17799   -41.15562     0.93426     0.93423

The non-factor model constrains the 3 intercepts to be equally spaced: each 200 rpm step changes the intercept by 0.047519 * 200 = 9.5038. This is OK for this data set, but not in general.
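The spacing constraint can be checked numerically by computing the three intercepts under both models. A minimal sketch, assuming the lathe data are entered by hand as metal.df with rows ordered slow, medium, fast; the names fit.num, fit.fac, int.num and int.fac are illustrative.

```r
# Lathe data (same 15 rows as the data slide), re-entered so this runs alone.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)
rpm <- rep(c(1000, 1200, 1400), each = 5)  # slow, medium, fast

fit.num <- lm(rate ~ rpm + hardness, data = metal.df)          # rpm numeric
fit.fac <- lm(rate ~ factor(rpm) + hardness, data = metal.df)  # rpm as factor

# Intercept of each line under the numeric model: a + b * rpm.
speeds  <- c(slow = 1000, medium = 1200, fast = 1400)
int.num <- coef(fit.num)[["(Intercept)"]] + coef(fit.num)[["rpm"]] * speeds

# Intercept of each line under the factor model: baseline plus offset.
cf <- coef(fit.fac)
int.fac <- c(slow   = cf[["(Intercept)"]],
             medium = cf[["(Intercept)"]] + cf[["factor(rpm)1200"]],
             fast   = cf[["(Intercept)"]] + cf[["factor(rpm)1400"]])

# Gaps between successive intercepts: forced equal for the numeric model,
# free to differ for the factor model.
print(diff(int.num))
print(diff(int.fac))
```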

Factors (4)

To avoid this, we could

• recode the variable as character, or (easier)
• use the factor function to coerce the numerical data into a factor:

rpm.as.factor = factor(rpm)

Factors (5)

We can fit the “factor” model using the R code

> rpm.as.factor = factor(rpm)
> summary(lm(rate ~ rpm.as.factor + hardness))

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       -41.17799    6.84927  -6.012 8.77e-05 ***
rpm.as.factor1200   9.55777    1.86692   5.120 0.000334 ***
rpm.as.factor1400  19.00757    1.88875  10.064 6.94e-07 ***
hardness            0.93426    0.05008  18.654 1.13e-09 ***

These estimates are different!! What’s going on??

Levels

The different values of a factor are called “levels”.

The levels of the factor setting are fast, medium, slow:

> levels(setting)
[1] "fast"   "medium" "slow"

The levels of the factor rpm.as.factor are 1000, 1200, 1400:

> levels(rpm.as.factor)
[1] "1000" "1200" "1400"

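Levels (2)

By default, the levels are listed in sorted order: alphabetical for character data, increasing for numeric data.

The first level is selected as the baseline.

Thus, using setting, the baseline is “fast”.

Using rpm.as.factor, the baseline is “1000” (the slow setting), which is why the estimates differ: the offsets are now measured from the slow line rather than the fast line.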

Levels (3)

We can change the order of the levels, and hence the baseline, using the factor function:

> rpm.newbaseline <- factor(rpm, levels = c("1400", "1200", "1000"))
> summary(lm(rate ~ rpm.newbaseline + hardness, data = metal.df))

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         -22.17042    7.15425  -3.099 0.010124 *
rpm.newbaseline1200  -9.44980    1.87275  -5.046 0.000374 ***
rpm.newbaseline1000 -19.00757    1.88875 -10.064 6.94e-07 ***
hardness              0.93426    0.05008  18.654 1.13e-09 ***
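An alternative way to choose the baseline, not used in the lecture, is the relevel function from the stats package, which moves one named level to the front and leaves the others in their existing order. The sketch below assumes the lathe data are entered by hand as metal.df; rpm.relevel and fit.relevel are illustrative names.

```r
# Lathe data (same 15 rows as the data slide), re-entered so this runs alone.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)
rpm.as.factor <- factor(rep(c(1000, 1200, 1400), each = 5))

# Make "1400" (the fast setting) the baseline level.
rpm.relevel <- relevel(rpm.as.factor, ref = "1400")
print(levels(rpm.relevel))  # "1400" "1000" "1200"

# Offsets are now measured from the fast line, as with the factor setting.
fit.relevel <- lm(rate ~ rpm.relevel + hardness, data = metal.df)
print(round(coef(fit.relevel), 5))
```

Only the order in which the offsets are reported changes compared with listing all three levels explicitly in factor; the fitted lines are the same.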

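Non-parallel lines

What if the lines aren’t parallel? Then the betas are different: the model becomes

$$\text{rate} = \alpha_S + \beta_S \times \text{hardness} \quad \text{(slow)}$$

$$\text{rate} = \alpha_M + \beta_M \times \text{hardness} \quad \text{(medium)}$$

$$\text{rate} = \alpha_F + \beta_F \times \text{hardness} \quad \text{(fast)}$$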

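Baseline version for the betas

As before, we can regard the fast setting as a baseline and express the other settings as “baseline plus offsets”. Writing $\beta_S$, $\beta_M$ and $\beta_F$ for the slopes of the slow, medium and fast lines:

$$\beta_S = \beta + \gamma_S$$

$$\beta_M = \beta + \gamma_M$$

$$\beta_F = \beta$$

Here $\beta$ is the baseline slope, $\gamma_M$ is the offset for the medium line slope and $\gamma_S$ is the offset for the slow line slope.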

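Baseline version for both parameters

We can then write the model as

For the slow setting: $$\text{rate} = \alpha + \delta_S + \beta \times \text{hardness} + \gamma_S \times \text{hardness} + e$$

For the medium setting: $$\text{rate} = \alpha + \delta_M + \beta \times \text{hardness} + \gamma_M \times \text{hardness} + e$$

For the fast setting: $$\text{rate} = \alpha + \beta \times \text{hardness} + e$$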

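Dummy variables for both parameters

As before, we can combine these 3 equations into one by using “dummy variables”. Define med and slow as before, and

h.med = hardness x med

h.slow = hardness x slow

Then we can write the model as

$$\text{rate} = \alpha + \delta_M \times \text{med} + \delta_S \times \text{slow} + \beta \times \text{hardness} + \gamma_M \times \text{h.med} + \gamma_S \times \text{h.slow}$$

where $\delta_M$, $\delta_S$ are the intercept offsets and $\gamma_M$, $\gamma_S$ are the slope offsets for the medium and slow lines.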

Fitting in R

The model formula for this non-parallel model is

rate ~ setting + hardness + setting:hardness

or, even more compactly,

rate ~ setting * hardness

> summary(lm(rate ~ setting*hardness))

                        Estimate Std. Error t value Pr(>|t|)
(Intercept)            -12.18162   10.32795  -1.179   0.2684
settingmedium          -30.15725   15.49375  -1.946   0.0834 .
settingslow            -33.60120   19.58902  -1.715   0.1204
hardness                 0.86312    0.07295  11.831 8.69e-07 ***
settingmedium:hardness   0.14961    0.11125   1.345   0.2116
settingslow:hardness     0.10546    0.14356   0.735   0.4813
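The intercept and slope of each line can be recovered from this output by adding the relevant offsets to the baseline values. The sketch below assumes the lathe data are entered by hand as metal.df; fit.int, lines.df and separate are illustrative names. It also checks the result against three separate straight-line fits, one per setting.

```r
# Lathe data (same 15 rows as the data slide), re-entered so this runs alone.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

fit.int <- lm(rate ~ setting * hardness, data = metal.df)
cf <- coef(fit.int)

# Baseline (fast) values plus the offsets for the medium and slow lines.
lines.df <- data.frame(
  setting   = c("fast", "medium", "slow"),
  intercept = cf[["(Intercept)"]] +
              c(0, cf[["settingmedium"]], cf[["settingslow"]]),
  slope     = cf[["hardness"]] +
              c(0, cf[["settingmedium:hardness"]], cf[["settingslow:hardness"]])
)
print(lines.df)

# The same lines come from fitting each setting on its own.
separate <- sapply(split(metal.df, metal.df$setting),
                   function(d) coef(lm(rate ~ hardness, data = d)))
print(separate)
```

The agreement reflects that the full interaction model places no constraint linking the three lines, although it still assumes a common error variance when computing standard errors.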

Is the non-parallel model necessary?

This amounts to testing if the slope offsets $\gamma_M$ and $\gamma_S$ (the coefficients of settingmedium:hardness and settingslow:hardness) are zero, or, equivalently, if the parallel model

rate ~ setting + hardness

is an adequate submodel of the non-parallel model

rate ~ setting * hardness

As in Lecture 6, we use the anova function to compare the two models:

> model1 <- lm(rate ~ setting + hardness)
> model2 <- lm(rate ~ setting * hardness)
> anova(model1, model2)
Analysis of Variance Table

Model 1: rate ~ setting + hardness
Model 2: rate ~ setting * hardness
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     11 95.451
2      9 78.807  2    16.644 0.9504 0.4222

Conclusion: since the F-value is small and the p-value 0.4222 is large, we conclude that the submodel (i.e. the parallel lines model) is adequate.
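The F statistic in this table can be reproduced by hand from the two residual sums of squares. A minimal sketch, assuming the lathe data are entered by hand as metal.df; the names rss1, rss2, F.stat and p.value are introduced here for illustration.

```r
# Lathe data (same 15 rows as the data slide), re-entered so this runs alone.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

model1 <- lm(rate ~ setting + hardness, data = metal.df)  # parallel lines
model2 <- lm(rate ~ setting * hardness, data = metal.df)  # non-parallel lines

# Residual sums of squares and residual degrees of freedom.
rss1 <- deviance(model1); df1 <- df.residual(model1)
rss2 <- deviance(model2); df2 <- df.residual(model2)

# F = (drop in RSS per extra parameter) / (residual mean square of full model).
F.stat  <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p.value <- pf(F.stat, df1 - df2, df2, lower.tail = FALSE)

print(c(F = F.stat, p = p.value))
print(anova(model1, model2))  # same F and p-value
```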
