Lecture 27 • Polynomial Terms for Curvature • Categorical Variables

Upload: keefe-bray

Post on 31-Dec-2015


Page 1: Lecture 27

Lecture 27

• Polynomial Terms for Curvature

• Categorical Variables

Page 2: Lecture 27

Polynomial Terms for Curvature

• To model a curved relationship between y and x, we can add squared (and cubic or higher order) terms as explanatory variables.

• Fit as a multiple regression with two explanatory variables, $X$ and $X^2$:

$$\mu\{Y \mid X\} = \beta_0 + \beta_1 X + \beta_2 X^2$$

• Coefficients are not directly interpretable. The change in the mean of Y that is associated with a one unit increase in X depends on X:

$$\mu\{Y \mid X = X_0 + 1\} - \mu\{Y \mid X = X_0\} = (\beta_1 + \beta_2) + (2\beta_2) X_0$$

• To test whether the multiple regression model with $X$ and $X^2$ as predictors provides better predictions than the multiple regression model with just $X$, use the p-value of the t-test on the $X^2$ coefficient (null hypothesis is that $X^2$ has a zero coefficient).

• Plot residuals vs. X to determine whether the quadratic model is appropriate. If there is still a pattern in the mean, can try a cubic model with $X$, $X^2$ and $X^3$.
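The quadratic fit and the t-ratio on the squared term can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data are synthetic (made-up coefficients plus a deterministic wiggle, not the fast food chain data), and the helper names `solve` and `fit_quadratic` are hypothetical. The sketch stops at the t-ratio; turning it into the p-value that JMP reports as Prob>|t| requires a t distribution, which the standard library does not provide.

```python
import math


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x^2. Returns (coefficients, t-ratios)."""
    rows = [[1.0, x, x * x] for x in xs]  # explanatory variables: 1, X, X^2
    p = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    residuals = [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(rows, ys)]
    sigma2 = sum(e * e for e in residuals) / (len(xs) - p)  # residual variance
    t_ratios = []
    for j in range(p):
        unit = [1.0 if i == j else 0.0 for i in range(p)]
        inv_jj = solve(xtx, unit)[j]  # j-th diagonal entry of (X'X)^-1
        t_ratios.append(beta[j] / math.sqrt(sigma2 * inv_jj))
    return beta, t_ratios


# Synthetic data (an assumption for illustration only).
xs = [float(x) for x in range(10, 41, 2)]
ys = [-1450.0 + 210.0 * x - 4.2 * x * x + 30.0 * math.sin(7.0 * i)
      for i, x in enumerate(xs)]

beta, t_ratios = fit_quadratic(xs, ys)
print("coefficients:", [round(b, 3) for b in beta])
print("t-ratio on the squared term:", round(t_ratios[2], 2))
```

A t-ratio on the squared term that is large in absolute value corresponds to a small p-value, which is the evidence that the quadratic model predicts better than the straight-line model.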

Page 3: Lecture 27

Response Revenue

Parameter Estimates
  Term         Estimate     Std Error   t Ratio   Prob>|t|
  Intercept    -1454.521    279.9928    -5.19     <.0001
  Income       209.81481    24.0841     8.71      <.0001
  Income sq    -4.170503    0.504009    -8.27     <.0001

Strong evidence that using income and income squared terms provides better predictions than using just the income term (p-value < .0001).

[Figure: Overlay Plot of Revenue and Predicted Revenue (Y) versus Income.]

[Figure: Bivariate Fit of Residual Revenue By Income.]
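To see why the coefficients are not directly interpretable, the fitted quadratic can be evaluated at a few income values. The sketch below copies the three estimates from the Parameter Estimates output for Response Revenue; the function names and the chosen income values (15, 25, 35) are illustrative assumptions.

```python
# Estimates copied from the Parameter Estimates output for Response Revenue.
B0, B1, B2 = -1454.521, 209.81481, -4.170503


def predicted_revenue(income):
    """Fitted mean revenue from the quadratic model."""
    return B0 + B1 * income + B2 * income ** 2


def unit_change(income):
    """Change in fitted mean revenue when income rises from `income` to `income + 1`."""
    return predicted_revenue(income + 1) - predicted_revenue(income)


for x0 in (15, 25, 35):
    closed_form = (B1 + B2) + (2 * B2) * x0  # same quantity from the algebra
    print(x0, round(unit_change(x0), 2), round(closed_form, 2))

# Income at which the fitted parabola peaks (B2 is negative, so this is a maximum).
turning_point = -B1 / (2 * B2)
print("fitted revenue peaks near income =", round(turning_point, 2))
```

The change associated with a one unit increase in income is positive at low incomes and negative at high incomes, so no single number summarizes the effect of income.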

Page 4: Lecture 27

Regression Model for Fast Food Chain Data

• Interactions and polynomial terms can be combined in a multiple regression model

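• For fast food chain data, we consider the model

$$\mu\{\text{revenue} \mid \text{age}, \text{income}\} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{income} + \beta_3\,\text{age}^2 + \beta_4\,\text{income}^2 + \beta_5\,(\text{age} \times \text{income})$$

• This is called a second-order model because it includes all squares and interactions of original explanatory variables.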

Page 5: Lecture 27

fastfoodchain.jmp results

• Strong evidence of a quadratic relationship between revenue and age, revenue and income. Moderate evidence of an interaction between age and income.

Parameter Estimates
  Term             Estimate     Std Error   t Ratio   Prob>|t|
  Intercept        -1133.981    320.0193    -3.54     0.0022
  Income           173.20317    28.20399    6.14      <.0001
  Age              23.549963    32.23447    0.73      0.4739
  Income sq        -3.726129    0.542156    -6.87     <.0001
  Age sq           -3.868707    1.179054    -3.28     0.0039
  (Income)(Age)    1.9672682    0.944082    2.08      0.0509
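In a second-order model the effect of one variable depends on both its own level and the level of the other variable. The sketch below uses the estimates from the fastfoodchain.jmp output; the function names and the evaluation points (income 20, ages 5 and 10) are arbitrary illustrations, not values taken from the data.

```python
# Estimates copied from the fastfoodchain.jmp Parameter Estimates output.
B0 = -1133.981
B_INCOME = 173.20317
B_AGE = 23.549963
B_INCOME_SQ = -3.726129
B_AGE_SQ = -3.868707
B_INCOME_AGE = 1.9672682


def mean_revenue(income, age):
    """Fitted mean revenue from the second-order model."""
    return (B0 + B_INCOME * income + B_AGE * age
            + B_INCOME_SQ * income ** 2 + B_AGE_SQ * age ** 2
            + B_INCOME_AGE * income * age)


def income_slope(income, age):
    """Instantaneous rate of change of fitted mean revenue with respect to income."""
    return B_INCOME + 2 * B_INCOME_SQ * income + B_INCOME_AGE * age


# The slope in income shifts with age because of the interaction term.
for age in (5, 10):
    print("age", age, "income slope at income 20:", round(income_slope(20, age), 3))
```

Raising age by 5 changes the income slope by 5 times the interaction estimate, whatever the income level.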

Page 6: Lecture 27

Categorical variables

• Categorical (nominal) variables: Variables that define group membership, e.g., sex (male/female), color (blue/green/red), county (Bucks County, Chester County, Delaware County, Philadelphia County).

• Categorical variables can be incorporated into regression through dummy variables.

• We will look at categorical variables that have two categories.

Page 7: Lecture 27

Sex discrimination revisited

• At the beginning of the class, in case study 1.2, we examined data from a sex discrimination case.

[Figure: Oneway Analysis of Salaries By Sex. Salaries (4000 to 8000) for Female and Male.]

t-Test
  Difference                t-Test    DF    Prob>|t|
  Estimate     -818.0       -6.293    91    <.0001
  Std Error    130.0
  Lower 95%    -1076.2
  Upper 95%    -559.8

Strong evidence that male clerks are paid more than female hires. But the bank’s defense lawyers say that this is because males have higher education and experience, i.e., there are omitted confounding variables.

Page 8: Lecture 27

Multiple regression model for sex discrimination

• Let’s look at controlling for education level first.

• To examine the bank’s claim, we want to look at

$$\mu\{\text{Salary} \mid \text{Education}, \text{Sex}\}$$

and compare

$$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Male}\}$$

to

$$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Female}\}$$

• How do we incorporate a categorical explanatory variable into multiple regression? Dummy variables.

Page 9: Lecture 27

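Dummy variables

• Define

$$\text{Sexdummy}_i = \begin{cases} 1 & \text{if person } i \text{ is male} \\ 0 & \text{if person } i \text{ is female} \end{cases}$$

• Multiple regression model:

$$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sexdummy} = x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

• $\beta_2$, the coefficient on the dummy variable for sex, is the difference in mean earnings between the populations of men and women with the same education levels:

$$\beta_2 = \mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Male}\} - \mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Female}\}$$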
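The dummy coding amounts to adding a 0/1 column to the explanatory variables. The sketch below is a minimal illustration: the helper names are hypothetical and the coefficients in `beta` are made up, chosen only to show that the male minus female gap at a fixed education level equals the dummy coefficient.

```python
def sex_dummy(sex):
    """Return 1 if the person is male, 0 if female (the coding defined on this slide)."""
    codes = {"male": 1, "female": 0}
    return codes[sex.strip().lower()]


def design_row(educ, sex):
    """Explanatory variables for one person: intercept, education, sex dummy."""
    return [1.0, float(educ), float(sex_dummy(sex))]


def mean_salary(beta, educ, sex):
    """Mean salary beta0 + beta1*educ + beta2*sexdummy."""
    return sum(b * v for b, v in zip(beta, design_row(educ, sex)))


# Hypothetical coefficients (beta0, beta1, beta2), for illustration only.
beta = [4000.0, 100.0, 500.0]

gap = mean_salary(beta, 12, "Male") - mean_salary(beta, 12, "Female")
print("male minus female mean salary at 12 years of education:", gap)
```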

Page 10: Lecture 27

Categorical variables in JMP

• To color and mark the points by a categorical variable such as Sex, click the red triangle to the left of the first column and select Color or Mark by Column. Select Set Marker by Value to use a different marker for each value of the column.

Page 11: Lecture 27

Response Salary

Summary of Fit
  RSquare                       0.363354
  RSquare Adj                   0.349206
  Root Mean Square Error        572.4368
  Mean of Response              5420.323
  Observations (or Sum Wgts)    93

Parameter Estimates
  Term         Estimate     Std Error   t Ratio   Prob>|t|   Lower 95%    Upper 95%
  Intercept    4173.1251    339.1811    12.30     <.0001     3499.2827    4846.9676
  EDUC         80.697765    27.67291    2.92      0.0045     25.720708    135.67482
  Sexdummy     691.80826    132.2319    5.23      <.0001     429.10655    954.50997

There is strong evidence that males and females of the same education level have different salaries (p-value < .0001). The 95% confidence interval for the difference in mean salaries between males and females of the same education level is ($429.11,$954.51).
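The reported interval can be checked against the estimates. The sketch below copies the numbers from the Parameter Estimates output; the variable names and the education levels (8, 12, 16) are illustrative assumptions. It shows that the fitted gap is the same at every education level, and that the interval is the estimate plus or minus a multiplier times the standard error, where the multiplier should be close to the 0.975 quantile of a t distribution with 93 - 3 = 90 degrees of freedom (roughly 1.99).

```python
# Estimates copied from the Parameter Estimates output (Response Salary).
B0, B_EDUC, B_SEX = 4173.1251, 80.697765, 691.80826
SE_SEX = 132.2319
LOWER, UPPER = 429.10655, 954.50997


def fitted_salary(educ, sexdummy):
    """Fitted mean salary; sexdummy is 1 for male, 0 for female."""
    return B0 + B_EDUC * educ + B_SEX * sexdummy


# The male minus female gap does not depend on the education level.
gaps = [fitted_salary(educ, 1) - fitted_salary(educ, 0) for educ in (8, 12, 16)]
print("gaps at 8, 12, 16 years:", [round(g, 2) for g in gaps])

# The interval is centered on the estimate; its half-width is multiplier * SE.
midpoint = (LOWER + UPPER) / 2
multiplier = (UPPER - LOWER) / 2 / SE_SEX
print("interval midpoint:", round(midpoint, 5))
print("implied t multiplier:", round(multiplier, 4))
```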

Page 12: Lecture 27

Parallel Regression Lines

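• The model

$$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sexdummy} = x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

implies that

$$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sex} = \text{Male}\} = (\beta_0 + \beta_2) + \beta_1 x_1$$

$$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sex} = \text{Female}\} = \beta_0 + \beta_1 x_1$$

$$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Male}\} - \mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Female}\} = \beta_2$$

• Regression lines for males and females as education varies are parallel.

• No interaction between sex and education.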

Page 13: Lecture 27

[Figure: Response Salary, Whole Model Regression Plot. Salary (4000 to 8000) versus EDUC (7 to 17), with lines for FEMALE and MALE.]

Plot produced by JMP version 5 in Fit Model output that shows the parallel regression lines and the actual observations.

Page 14: Lecture 27

Interactions with Dummy Variables

• The model

$$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sexdummy} = x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

assumes that the difference between men’s and women’s mean salaries for fixed levels of education is the same for all levels of education.

• There might be an interaction between sex and education. The difference between men and women might differ depending on level of education.

Page 15: Lecture 27

Interaction Model

• Multiple regression model that allows for interaction between sex and education:

$$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sexdummy} = x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 * x_2)$$

$$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Male}\} - \mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Female}\} = \beta_2 + \beta_3 x_1$$

The difference in mean salary between men and women of the same education level depends on the education level.

• To add the interaction in JMP, create a new column sexdummy*educ. Right click on the column, select formula and use the formula sexdummy*educ.

Page 16: Lecture 27

Response Salary

Summary of Fit
  RSquare                       0.37279
  RSquare Adj                   0.351648
  Root Mean Square Error        571.3617
  Mean of Response              5420.323
  Observations (or Sum Wgts)    93

Parameter Estimates
  Term             Estimate     Std Error   t Ratio   Prob>|t|
  Intercept        4395.3228    389.21      11.29     <.0001
  EDUC             62.13056     31.94336    1.95      0.0549
  Sexdummy         -274.8597    845.7489    -0.32     0.7460
  Sexdummy*EDUC    73.585794    63.59228    1.16      0.2503

There is no evidence of an interaction (p-value = .2503). Note, in the interaction model, the coefficient on Sexdummy is not easily interpretable. The best way to understand the estimates of an interaction model is to plot the two separate regression lines as shown on the next slide.
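The reason the Sexdummy coefficient is hard to interpret on its own can be seen by evaluating the two fitted lines numerically. The sketch below copies the estimates from the interaction model output; the function names and the education levels are illustrative assumptions. The Sexdummy estimate is the fitted male minus female gap at zero years of education, far outside the range of the EDUC axis (7 to 17) in the regression plots.

```python
# Estimates copied from the interaction model's Parameter Estimates output.
B0, B_EDUC, B_SEX, B_INTERACTION = 4395.3228, 62.13056, -274.8597, 73.585794


def fitted_salary(educ, sexdummy):
    """Fitted mean salary; sexdummy is 1 for male, 0 for female."""
    return B0 + B_EDUC * educ + B_SEX * sexdummy + B_INTERACTION * sexdummy * educ


def gap(educ):
    """Fitted male minus female mean salary at a given education level."""
    return fitted_salary(educ, 1) - fitted_salary(educ, 0)


print("female line: intercept", B0, "slope", B_EDUC)
print("male line:   intercept", round(B0 + B_SEX, 4), "slope", round(B_EDUC + B_INTERACTION, 6))

for educ in (0, 8, 12, 16):
    print("education", educ, "gap", round(gap(educ), 2))
```

Within the plotted education range the fitted gap is positive and grows with education, even though the Sexdummy estimate itself is negative.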

Page 17: Lecture 27

[Figure: Response Salary, Whole Model Regression Plot. Salary (4000 to 8000) versus EDUC (7 to 17), with lines for FEMALE and MALE.]

The model with one continuous explanatory variable, one categorical variable and an interaction is called the separate regression lines model because the regression lines of y on the continuous explanatory variable for the two levels of the dummy variable are “separate,” neither coincident nor parallel.

Page 18: Lecture 27

Multiple regression with education, experience and sex

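• We can easily control for both education and experience in the sex discrimination case by adding them both to the multiple regression. A model without interactions is:

$$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Exper} = x_2, \text{Sexdummy} = x_3\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$

• Note that

$$\beta_3 = \mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Exper} = x_2, \text{Sex} = \text{Male}\} - \mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Exper} = x_2, \text{Sex} = \text{Female}\}$$

• $\beta_3$ is the difference between mean salaries of males and females of the same education and experience level.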

Page 19: Lecture 27

Response Salary

Summary of Fit
  RSquare                       0.398068
  RSquare Adj                   0.377778
  Root Mean Square Error        559.7297
  Mean of Response              5420.323
  Observations (or Sum Wgts)    93

Parameter Estimates
  Term         Estimate     Std Error   t Ratio   Prob>|t|   Lower 95%    Upper 95%
  Intercept    3943.6599    346.7729    11.37     <.0001     3254.6295    4632.6903
  EDUC         87.667515    27.23294    3.22      0.0018     33.556244    141.77879
  EXPER        1.4632657    0.645874    2.27      0.0259     0.1799284    2.746603
  Sexdummy     676.17906    129.4805    5.22      <.0001     418.9041     933.45402

Strong evidence that males and females have different mean salaries for same level of education and experience. 95% confidence interval for difference between mean male and mean female salaries for same level of education and experience is ($418.90, $933.45).
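Each t Ratio in the output is the estimate divided by its standard error. The sketch below recomputes the ratios from the estimates and standard errors in the output for the model with education, experience and sex; the dictionary layout and names are illustrative assumptions.

```python
# (estimate, standard error, reported t ratio) copied from the Parameter Estimates output.
rows = {
    "Intercept": (3943.6599, 346.7729, 11.37),
    "EDUC": (87.667515, 27.23294, 3.22),
    "EXPER": (1.4632657, 0.645874, 2.27),
    "Sexdummy": (676.17906, 129.4805, 5.22),
}

t_ratios = {}
for term, (estimate, std_error, reported) in rows.items():
    t_ratios[term] = estimate / std_error  # t ratio for the null hypothesis of a zero coefficient
    print(f"{term:10s} computed {t_ratios[term]:6.2f}   reported {reported:6.2f}")
```

The computed ratios agree with the reported ones to two decimal places.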

Page 20: Lecture 27

Response Salary

Summary of Fit
  RSquare                       0.42282
  RSquare Adj                   0.389648
  Root Mean Square Error        554.3651
  Mean of Response              5420.323
  Observations (or Sum Wgts)    93

Parameter Estimates
  Term              Estimate     Std Error   t Ratio   Prob>|t|   Lower 95%    Upper 95%
  Intercept         4169.488     387.3417    10.76     <.0001     3399.6045    4939.3715
  EDUC              62.685629    30.99385    2.02      0.0462     1.0819919    124.28927
  EXPER             2.1959717    0.838037    2.62      0.0104     0.530283     3.8616604
  Sexdummy          -344.7752    900.2427    -0.38     0.7027     -2134.105    1444.5546
  Sexdummy*EDUC     88.42193     64.48341    1.37      0.1738     -39.74584    216.5897
  Sexdummy*EXPER    -1.346959    1.330667    -1.01     0.3142     -3.991804    1.2978853

There is not strong evidence of an interaction between sex and education (p-value=.1738) nor between sex and experience (p-value=.3142). We do not need to use interaction terms in the model.
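The RSquare Adj values in the four salary outputs can be reproduced from RSquare, the number of observations (93) and the number of explanatory variables, using the standard adjusted R-square formula. The sketch below is a side check and is not part of the lecture's argument; the function name and model labels are illustrative assumptions.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square for a model with k explanatory variables fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


N = 93  # observations in every salary model

# (label, RSquare, number of explanatory variables, reported RSquare Adj)
models = [
    ("EDUC + Sexdummy", 0.363354, 2, 0.349206),
    ("EDUC + Sexdummy + Sexdummy*EDUC", 0.37279, 3, 0.351648),
    ("EDUC + EXPER + Sexdummy", 0.398068, 3, 0.377778),
    ("EDUC + EXPER + Sexdummy + both interactions", 0.42282, 5, 0.389648),
]

for label, r2, k, reported in models:
    print(f"{label:45s} computed {adjusted_r2(r2, N, k):.6f}   reported {reported:.6f}")
```

Adding the two interaction terms raises RSquare from 0.398068 to 0.42282 but raises RSquare Adj only from 0.377778 to 0.389648, since the adjustment penalizes the extra terms.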