
Upload: jesse-kennedy

Post on 18-Jan-2016


04/21/23 330 lecture 17 1

STATS 330: Lecture 17


Factors

• In the models discussed so far, all explanatory variables have been numeric.
• Now we want to incorporate categorical variables into our models.
• In R, categorical variables are called factors.


Example

• Consider an experiment to measure the rate of metal removal in a machining process on a lathe.
• The rate depends on the speed setting of the lathe (fast, medium or slow, a categorical measurement) and the hardness of the material being machined (a continuous measurement).


Data

   hardness setting rate
1       120    slow   68
2       140    slow   90
3       150    slow   98
4       125    slow   77
5       136    slow   88
6       165  medium  122
7       140  medium  104
8       120  medium   75
9       125  medium   84
10      133  medium   95
11      175    fast  138
12      132    fast  102
13      124    fast   93
14      141    fast  112
15      130    fast  100
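The later slides work with a data frame called metal.df. The slides do not show how it was loaded, so the following is only a minimal sketch that types the 15 observations in by hand. The call to factor() is an assumption added here: since R 4.0, data.frame() no longer converts character columns to factors by default.

```r
# Hand-built copy of the lathe data (the course presumably read it from a file).
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

str(metal.df)             # 15 observations of 3 variables
table(metal.df$setting)   # 5 observations per speed setting
```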


[Figure: Plot of rate versus hardness for different lathe speeds. x-axis: hardness of metal (120 to 170); y-axis: rate of metal removal (70 to 140). Points are plotted as s, m and f; legend: slow, medium, fast.]


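Model

A model consisting of 3 parallel lines seems appropriate:

$$\text{rate} = \alpha_S + \beta \times \text{hardness} \quad \text{(slow)}$$
$$\text{rate} = \alpha_M + \beta \times \text{hardness} \quad \text{(medium)}$$
$$\text{rate} = \alpha_F + \beta \times \text{hardness} \quad \text{(fast)}$$

• Note the same slope $\beta$, i.e. parallel lines.
• Different intercepts $\alpha_S$, $\alpha_M$, $\alpha_F$.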


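Baseline version

We can regard the fast setting as a baseline and express the other settings as “baseline plus offsets”:

$$\alpha_S = \alpha + \delta_S$$
$$\alpha_M = \alpha + \delta_M$$
$$\alpha_F = \alpha$$

Here $\alpha$ is the baseline intercept and $\delta_M$ is the offset for the medium line (likewise $\delta_S$ for the slow line).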


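Baseline version (2)

We can then write the model as

For the slow setting:
$$\text{rate} = \alpha + \delta_S + \beta \times \text{hardness} + e$$

For the medium setting:
$$\text{rate} = \alpha + \delta_M + \beta \times \text{hardness} + e$$

For the fast setting:
$$\text{rate} = \alpha + \beta \times \text{hardness} + e$$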


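“Deviation from mean” version

Now let $\alpha$ be the mean of $\alpha_F$, $\alpha_M$ and $\alpha_S$. Define

$$\delta_S = \alpha_S - \alpha$$
$$\delta_M = \alpha_M - \alpha$$
$$\delta_F = \alpha_F - \alpha$$

Here $\alpha_F$ is the “fast” line intercept and $\alpha$ is the mean of the intercepts.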


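“Deviation from mean” version (2)

Then

$$\alpha_S = \alpha + \delta_S$$
$$\alpha_M = \alpha + \delta_M$$
$$\alpha_F = \alpha + \delta_F$$

Thus, $\alpha$ is now the “average intercept”, and there are 3 offsets, one for each line. The 3 offsets add to zero. This is the form used in the Stage 2 course.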
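The slides do not show how to fit this form in R. One possible way, sketched here as an assumption rather than as the course's method, is to ask lm for sum-to-zero contrasts via contr.sum. The intercept then estimates the average intercept, and the two reported offsets belong to the first two levels (fast and medium); the third offset is minus their sum. The hand-typed metal.df and the names fit.dev, avg.intercept and offsets are illustrative.

```r
# Hand-typed copy of the lathe data from the Data slide.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

# Sum-to-zero ("deviation from mean") coding for the factor.
fit.dev <- lm(rate ~ setting + hardness, data = metal.df,
              contrasts = list(setting = "contr.sum"))
b <- coef(fit.dev)

avg.intercept <- b[["(Intercept)"]]             # mean of the 3 intercepts
offsets <- c(fast = b[["setting1"]], medium = b[["setting2"]])
offsets["slow"] <- -sum(offsets)                # offsets must add to zero

avg.intercept + offsets   # the three line intercepts
```

Adding the average intercept to each offset recovers the same three line intercepts as the baseline form; only the parameterisation differs.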


Dummy variables

Back to baseline form: we can combine the 3 “baseline” equations into one by using “dummy variables”. Define

med = 1 if setting = “medium” and 0 otherwise

slow = 1 if setting = “slow” and 0 otherwise

Then we can write the model as

$$\text{rate} = \alpha + \delta_M\,\text{med} + \delta_S\,\text{slow} + \beta \times \text{hardness}$$


Fitting

The model can be fitted as usual using lm:

> med <- ifelse(metal.df$setting == "medium", 1, 0)
> slow <- ifelse(metal.df$setting == "slow", 1, 0)
> summary(lm(rate ~ med + slow + hardness, data = metal.df))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.17042    7.15425  -3.099 0.010124 *
med          -9.44980    1.87275  -5.046 0.000374 ***
slow        -19.00757    1.88875 -10.064 6.94e-07 ***
hardness      0.93426    0.05008  18.654 1.13e-09 ***


Fitting (2)

Thus, the baseline (“fast”) line has intercept -22.17042.

The “medium” line has intercept -22.17042 - 9.44980 = -31.62022.

The “slow” line has intercept -22.17042 - 19.00757 = -41.17799.
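The same arithmetic can be done directly on the fitted coefficients. This is a minimal sketch that assumes the data are typed in by hand as below; the names fit.dummy and intercepts are illustrative.

```r
# Hand-typed copy of the lathe data from the Data slide.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

# Dummy variables as defined on the slides.
med  <- ifelse(metal.df$setting == "medium", 1, 0)
slow <- ifelse(metal.df$setting == "slow", 1, 0)
fit.dummy <- lm(rate ~ med + slow + hardness, data = metal.df)
b <- coef(fit.dummy)

# Baseline intercept plus the offset for each non-baseline line.
intercepts <- c(fast   = b[["(Intercept)"]],
                medium = b[["(Intercept)"]] + b[["med"]],
                slow   = b[["(Intercept)"]] + b[["slow"]])
round(intercepts, 5)   # compare with -22.17042, -31.62022, -41.17799 above
```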


[Figure: Plot of rate versus hardness for different lathe speeds. x-axis: hardness of metal (120 to 170); y-axis: rate of metal removal (70 to 140). Points are plotted as s, m and f (slow, medium, fast). Annotations mark the baseline and the offsets for the medium (“Offset m”) and slow (“Offset s”) lines.]


Fitting (3)

Making dummy variables is a pain. Fortunately R allows us to write

> summary(lm(rate ~ setting + hardness))

              Estimate Std. Error t value Pr(>|t|)
(Intercept)  -22.17042    7.15425  -3.099 0.010124 *
settingmedium -9.44980    1.87275  -5.046 0.000374 ***
settingslow  -19.00757    1.88875 -10.064 6.94e-07 ***
hardness       0.93426    0.05008  18.654 1.13e-09 ***

and get the same result, provided the variable setting is a factor.
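A quick way to confirm that the two fits agree is to compare the coefficient vectors and to look at the dummy columns R builds internally. This is a sketch under the assumption that the data are typed in by hand; fit.dummy and fit.factor are illustrative names.

```r
# Hand-typed copy of the lathe data from the Data slide.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

# Fit with hand-made dummy variables ...
med  <- ifelse(metal.df$setting == "medium", 1, 0)
slow <- ifelse(metal.df$setting == "slow", 1, 0)
fit.dummy <- lm(rate ~ med + slow + hardness, data = metal.df)

# ... and with the factor, letting R build the dummies itself.
fit.factor <- lm(rate ~ setting + hardness, data = metal.df)

# Same estimates; only the coefficient names differ.
all.equal(unname(coef(fit.dummy)), unname(coef(fit.factor)))

# The design matrix shows the dummy columns R created for the factor.
head(model.matrix(fit.factor))
```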


Factors

• Since the data for setting in the input data was character data, the variable setting was automatically recognized as a factor.
• In fact the 3 settings were 1000, 1200, 1400 rpm. What would happen if the input data had used these (numerical) values?
• Answer: the lm function would have assumed that setting was a continuous variable and fitted a plane, not 3 parallel lines.


Factors (2)

> rpm = rep(c(1000,1200,1400), c(5,5,5))
> summary(lm(rate ~ rpm + hardness, data=metal.df))

Call:
lm(formula = rate ~ rpm + hardness, data = metal.df)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -88.674624   7.837602  -11.31 9.29e-08 ***
rpm           0.047519   0.004521   10.51 2.09e-07 ***
hardness      0.934226   0.047944   19.49 1.89e-10 ***

When rpm = 1000, the relationship is

-88.674624 + 0.047519 * 1000 + 0.934226 * hardness

i.e. -41.15562 + 0.934226 * hardness


Factors (3)

              Intercept                Slope
          factor   non-factor    factor   non-factor
Fast    -22.17042   -22.14802   0.93426    0.93423
Medium  -31.62022   -31.65182   0.93426    0.93423
Slow    -41.17799   -41.15562   0.93426    0.93423

The non-factor model constrains the 3 intercepts to be equally spaced. OK for this data set, but not in general.
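The equal spacing can be checked numerically. In this sketch the rpm values are stored as a column of a hand-typed metal.df (a choice made here for self-containment; the slides keep rpm as a separate vector), and the names fit.num, int.num and int.fac are illustrative.

```r
# Hand-typed copy of the lathe data, with rpm added as a numeric column.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)
metal.df$rpm <- rep(c(1000, 1200, 1400), each = 5)

# Numeric rpm: a plane, so intercepts move linearly with rpm.
fit.num <- lm(rate ~ rpm + hardness, data = metal.df)
b <- coef(fit.num)
int.num <- b[["(Intercept)"]] +
  b[["rpm"]] * c(slow = 1000, medium = 1200, fast = 1400)

# Factor version: each line gets its own intercept.
fit.fac <- lm(rate ~ factor(rpm) + hardness, data = metal.df)
f <- coef(fit.fac)
int.fac <- f[["(Intercept)"]] +
  c(slow = 0, medium = f[["factor(rpm)1200"]], fast = f[["factor(rpm)1400"]])

diff(int.num)   # both gaps are exactly 200 * b[["rpm"]]
diff(int.fac)   # gaps are free to differ (here only slightly)
```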


Factors (4)

To avoid this, we could

• recode the variable as character, or (easier)
• use the factor function to coerce the numerical data into a factor:

rpm.as.factor = factor(rpm)


Factors (5)

We can fit the “factor” model using the R code

> rpm.as.factor = factor(rpm)
> summary(lm(rate~rpm.as.factor + hardness))

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       -41.17799    6.84927  -6.012 8.77e-05 ***
rpm.as.factor1200   9.55777    1.86692   5.120 0.000334 ***
rpm.as.factor1400  19.00757    1.88875  10.064 6.94e-07 ***
hardness            0.93426    0.05008  18.654 1.13e-09 ***

These estimates are different!! What’s going on??


Levels

• The different values of a factor are called “levels”.
• The levels of the factor setting are fast, medium, slow:

> levels(setting)
[1] "fast" "medium" "slow"

• The levels of the factor rpm.as.factor are 1000, 1200, 1400:

> levels(rpm.as.factor)
[1] "1000" "1200" "1400"


Levels (2)

• By default, the levels are listed in sorted order: alphabetical for character data, increasing for numeric data.
• The first level is selected as the baseline.

Thus, using setting, the baseline is “fast”.

Using rpm.as.factor, the baseline is “1000”.


Levels (3)

We can change the order of the levels using the factor function:

> rpm.newbaseline <- factor(rpm, levels=c("1400", "1200", "1000"))
> summary(lm(rate~rpm.newbaseline + hardness, data=metal.df))

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         -22.17042    7.15425  -3.099 0.010124 *
rpm.newbaseline1200  -9.44980    1.87275  -5.046 0.000374 ***
rpm.newbaseline1000 -19.00757    1.88875 -10.064 6.94e-07 ***
hardness              0.93426    0.05008  18.654 1.13e-09 ***
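When only the baseline needs to change, base R's relevel function is an alternative to spelling out every level. This is not used on the slides; the sketch below assumes hand-typed data, and rpm.relevel and fit.relevel are illustrative names. relevel moves the chosen level to the front and leaves the others in their existing order.

```r
# Hand-typed copy of the lathe data from the Data slide.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)
rpm <- rep(c(1000, 1200, 1400), c(5, 5, 5))

rpm.as.factor <- factor(rpm)                          # baseline "1000"
rpm.relevel   <- relevel(rpm.as.factor, ref = "1400") # baseline "1400"
levels(rpm.relevel)                                   # "1400" first

fit.relevel <- lm(rate ~ rpm.relevel + hardness, data = metal.df)
round(coef(fit.relevel), 5)   # offsets are now relative to 1400 rpm (fast)
```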


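Non-parallel lines

What if the lines aren’t parallel? Then the betas are different: the model becomes

$$\text{rate} = \alpha_S + \beta_S \times \text{hardness} \quad \text{(slow)}$$
$$\text{rate} = \alpha_M + \beta_M \times \text{hardness} \quad \text{(medium)}$$
$$\text{rate} = \alpha_F + \beta_F \times \text{hardness} \quad \text{(fast)}$$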


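Baseline version for the betas

As before, we can regard the fast setting as a baseline and express the other settings as “baseline plus offsets”:

$$\beta_S = \beta + \gamma_S$$
$$\beta_M = \beta + \gamma_M$$
$$\beta_F = \beta$$

Here $\beta$ is the baseline slope and $\gamma_M$ is the offset for the medium line's slope (likewise $\gamma_S$ for the slow line).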


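Baseline version for both parameters

We can then write the model as

For the slow setting:
$$\text{rate} = \alpha + \delta_S + \beta \times \text{hardness} + \gamma_S \times \text{hardness} + e$$

For the medium setting:
$$\text{rate} = \alpha + \delta_M + \beta \times \text{hardness} + \gamma_M \times \text{hardness} + e$$

For the fast setting:
$$\text{rate} = \alpha + \beta \times \text{hardness} + e$$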


Dummy variables for both parameters

As before, we can combine these 3 equations into one by using “dummy variables”. Define med and slow as before, and

h.med = hardness x med

h.slow = hardness x slow

Then we can write the model as

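$$\text{rate} = \alpha + \delta_M\,\text{med} + \delta_S\,\text{slow} + \beta \times \text{hardness} + \gamma_M\,\text{h.med} + \gamma_S\,\text{h.slow}$$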


Fitting in R

The model formula for this non-parallel model is

rate ~ setting + hardness + setting:hardness

or, even more compactly,

rate ~ setting * hardness

> summary(lm(rate ~ setting*hardness))

                        Estimate Std. Error t value Pr(>|t|)
(Intercept)            -12.18162   10.32795  -1.179   0.2684
settingmedium          -30.15725   15.49375  -1.946   0.0834 .
settingslow            -33.60120   19.58902  -1.715   0.1204
hardness                 0.86312    0.07295  11.831 8.69e-07 ***
settingmedium:hardness   0.14961    0.11125   1.345   0.2116
settingslow:hardness     0.10546    0.14356   0.735   0.4813
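To read the output as three separate lines, add each slope offset to the baseline slope. The sketch below (hand-typed data; fit.int, slopes and sep.slopes are illustrative names) also checks a general property of this model: with a separate intercept and slope for every setting, the fitted lines coincide with three simple regressions fitted one setting at a time.

```r
# Hand-typed copy of the lathe data from the Data slide.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

fit.int <- lm(rate ~ setting * hardness, data = metal.df)
b <- coef(fit.int)

# Baseline (fast) slope plus the offset for each other setting.
slopes <- c(fast   = b[["hardness"]],
            medium = b[["hardness"]] + b[["settingmedium:hardness"]],
            slow   = b[["hardness"]] + b[["settingslow:hardness"]])

# Slopes from separate simple regressions, one per setting.
sep.slopes <- sapply(split(metal.df, metal.df$setting),
                     function(d) coef(lm(rate ~ hardness, data = d))[["hardness"]])

round(slopes, 5)
all.equal(slopes, sep.slopes)
```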


Is the non-parallel model necessary?

This amounts to testing whether the slope offsets $\gamma_M$ and $\gamma_S$ are both zero, or, equivalently, whether the parallel model

rate ~ setting + hardness

is an adequate submodel of the non-parallel model

rate ~ setting * hardness

As in Lecture 6, we use the anova function to compare the two models:


> model1 <- lm(rate ~ setting + hardness)
> model2 <- lm(rate ~ setting * hardness)
> anova(model1, model2)
Analysis of Variance Table

Model 1: rate ~ setting + hardness
Model 2: rate ~ setting * hardness
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     11 95.451
2      9 78.807  2    16.644 0.9504 0.4222

Conclusion: since the F-value is small and the p-value (0.4222) is large, the submodel (i.e. the parallel lines model) is adequate.
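The F-value in the table is the drop in residual sum of squares per extra parameter, divided by the residual mean square of the larger model: (16.644 / 2) / (78.807 / 9) = 0.9504. A minimal sketch reproducing it by hand, assuming hand-typed data (the names rss1, rss2, F.stat and p.value are illustrative):

```r
# Hand-typed copy of the lathe data from the Data slide.
metal.df <- data.frame(
  hardness = c(120, 140, 150, 125, 136, 165, 140, 120, 125, 133,
               175, 132, 124, 141, 130),
  setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
  rate     = c(68, 90, 98, 77, 88, 122, 104, 75, 84, 95,
               138, 102, 93, 112, 100)
)

model1 <- lm(rate ~ setting + hardness, data = metal.df)  # parallel lines
model2 <- lm(rate ~ setting * hardness, data = metal.df)  # non-parallel lines

rss1 <- deviance(model1); df1 <- df.residual(model1)  # RSS and residual df
rss2 <- deviance(model2); df2 <- df.residual(model2)

# Extra sum of squares per extra parameter, over the full-model mean square.
F.stat  <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p.value <- pf(F.stat, df1 - df2, df2, lower.tail = FALSE)

c(F = F.stat, p = p.value)   # compare with the anova() table above
```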