Lecture 27
• Polynomial Terms for Curvature
• Categorical Variables
Polynomial Terms for Curvature
• To model a curved relationship between y and x, we can add squared (and cubic or higher order) terms as explanatory variables.
• Fit as a multiple regression with two explanatory variables $X$ and $X^2$:
$\mu\{Y \mid X\} = \beta_0 + \beta_1 X + \beta_2 X^2$
• Coefficients are not directly interpretable. Change in the mean of Y that is associated with a one unit increase in X depends on X:
$\mu\{Y \mid X = X_0 + 1\} - \mu\{Y \mid X = X_0\} = (\beta_1 + \beta_2) + (2\beta_2) X_0$
• To test whether the multiple regression model with $X$ and $X^2$ as predictors provides better predictions than the multiple regression model with just $X$, use the p-value of the t-test on the $X^2$ coefficient (null hypothesis is that $X^2$ has a zero coefficient).
• Plot residuals vs. X to determine whether quadratic model is appropriate. If there is still a pattern in the mean, can try a cubic model with $X$, $X^2$ and $X^3$.
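The quadratic model above is an ordinary multiple regression on the two columns X and X squared. A minimal sketch of that idea follows, using NumPy least squares on synthetic data. The data, the seed and the true coefficients are assumptions made only for illustration and are not the lecture's data set.

```python
import numpy as np

# Synthetic data (an assumption for illustration; not the lecture's data).
rng = np.random.default_rng(0)
x = np.linspace(10, 40, 60)
y = -1400 + 200 * x - 4 * x**2 + rng.normal(0, 50, size=x.size)

# Design matrix with columns 1, X, X^2: a multiple regression on X and X^2.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors from sigma^2 * (X'X)^{-1}, using n - 3 degrees of freedom.
resid = y - X @ beta
df = x.size - X.shape[1]
sigma2 = resid @ resid / df
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t ratio = estimate / standard error; the third entry tests the X^2 term.
t_ratio = beta / se

print("estimates:", beta)
print("t ratio for the X^2 coefficient:", t_ratio[2])
```

A large absolute t ratio on the squared term corresponds to a small p-value in the t-test described above.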
Response Revenue
Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    -1454.521    279.9928    -5.19     <.0001
Income       209.81481    24.08418    8.71      <.0001
Income sq    -4.170503    0.504009    -8.27     <.0001

Strong evidence that using income and income squared term provides better predictions than just using income term (p-value < .0001).

[Figure: Overlay Plot of Revenue (Y) and Predicted Revenue against Income]
[Figure: Bivariate Fit of Residual Revenue By Income]
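To see concretely why the income coefficients are not directly interpretable, the short sketch below plugs the reported estimates for Revenue on Income and Income squared into the fitted quadratic. The function names are hypothetical, and the sketch assumes the squared term is the plain (uncentered) square of income.

```python
# Reported estimates from the Parameter Estimates table for Revenue.
b0, b1, b2 = -1454.521, 209.81481, -4.170503


def mean_revenue(income):
    """Estimated mean revenue at a given income (hypothetical helper)."""
    return b0 + b1 * income + b2 * income**2


def change_per_unit(income):
    """Change in estimated mean revenue when income rises by one unit."""
    return b1 + b2 * (2 * income + 1)


# The one-unit change depends on the starting income level.
for x0 in (15, 20, 25, 30):
    print(x0, round(change_per_unit(x0), 2))

# Income at which the fitted parabola stops rising and starts falling.
turning_point = -b1 / (2 * b2)
print("turning point:", round(turning_point, 2))
```

The one-unit change is positive at lower incomes and negative at higher incomes, which is what a negative coefficient on the squared term implies.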
Regression Model for Fast Food Chain Data
• Interactions and polynomial terms can be combined in a multiple regression model
• For fast food chain data, we consider the model
$\mu\{\text{revenue} \mid \text{age}, \text{income}\} = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{income} + \beta_3 \cdot \text{age} \cdot \text{income} + \beta_4 \cdot \text{age}^2 + \beta_5 \cdot \text{income}^2$
• This is called a second-order model because it includes all squares and interactions of original explanatory variables.
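A second-order model is still fit as a multiple regression; the squares and the interaction are simply extra columns. The sketch below builds those columns for two explanatory variables. The helper name, the synthetic age and income values and the "true" coefficients are all assumptions for illustration; the data are noise-free so the fit recovers the coefficients.

```python
import numpy as np


def second_order_design(age, income):
    """Columns: 1, age, income, age*income, age^2, income^2 (hypothetical helper)."""
    age = np.asarray(age, dtype=float)
    income = np.asarray(income, dtype=float)
    return np.column_stack(
        [np.ones_like(age), age, income, age * income, age**2, income**2]
    )


# Synthetic, noise-free data (an assumption; not fastfoodchain.jmp).
rng = np.random.default_rng(1)
age = rng.uniform(5, 15, size=40)
income = rng.uniform(15, 35, size=40)
true_beta = np.array([-1100.0, 20.0, 170.0, 2.0, -4.0, -3.5])

X = second_order_design(age, income)
revenue = X @ true_beta

# Least squares fit of the second-order model.
beta_hat, *_ = np.linalg.lstsq(X, revenue, rcond=None)
print(np.round(beta_hat, 3))
```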
fastfoodchain.jmp results
• Strong evidence of a quadratic relationship between revenue and age and between revenue and income. Moderate evidence of an interaction between age and income.
Parameter Estimates
Term             Estimate     Std Error   t Ratio   Prob>|t|
Intercept        -1133.981    320.0193    -3.54     0.0022
Income           173.20317    28.20399    6.14      <.0001
Age              23.549963    32.23447    0.73      0.4739
Income sq        -3.726129    0.542156    -6.87     <.0001
Age sq           -3.868707    1.179054    -3.28     0.0039
(Income)(Age)    1.9672682    0.944082    2.08      0.0509
Categorical variables
• Categorical (nominal) variables: Variables that define group membership, e.g., sex (male/female), color (blue/green/red), county (Bucks County, Chester County, Delaware County, Philadelphia County).
• Categorical variables can be incorporated into regression through dummy variables.
• We will look at categorical variables that have two categories.
Sex discrimination revisited
• At the beginning of the class, in case study 1.2, we examined data from a sex discrimination case.
[Figure: Oneway Analysis of Salaries By Sex; Salaries (axis from 4000 to 8000) plotted for Female and Male]
t-Test
             Difference   t-Test   DF   Prob > |t|
Estimate     -818.0       -6.293   91   <.0001
Std Error    130.0
Lower 95%    -1076.2
Upper 95%    -559.8
Strong evidence that male clerks are paid more than female clerks. But the bank’s defense lawyers say that this is because males have higher education and experience, i.e., there are omitted confounding variables.
Multiple regression model for sex discrimination
• Let’s look at controlling for education level first.
• To examine bank’s claim, we want to look at
$\mu\{\text{Salary} \mid \text{Education}, \text{Sex}\}$
and compare
$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Male}\}$
to
$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Female}\}$
• How do we incorporate a categorical explanatory variable into multiple regression? Dummy variables.
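Dummy variables
• Define
$\text{Sexdummy}_i = 1$ if person $i$ is male
$\text{Sexdummy}_i = 0$ if person $i$ is female
• Multiple regression model:
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sexdummy} = x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
• $\beta_2$, the coefficient on the dummy variable for sex, is the difference in mean earnings between the populations of men and women with the same education levels:
$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Male}\} - \mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Female}\} = \beta_2$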
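The dummy-variable coding above can be sketched in a few lines: recode the two-category variable as 0/1 and include it as an ordinary column in the regression. The eight observations below are made up and noise-free (an assumption for illustration), so the fit returns the coefficients used to generate them.

```python
import numpy as np

# Made-up observations (not the bank data).
sex = np.array(["Male", "Female", "Female", "Male",
                "Female", "Male", "Female", "Male"])
educ = np.array([12, 12, 8, 16, 15, 8, 10, 15], dtype=float)

# Sexdummy = 1 if the person is male, 0 if female.
sexdummy = (sex == "Male").astype(float)

# Noise-free salaries generated from beta0 = 4000, beta1 = 80, beta2 = 700.
salary = 4000 + 80 * educ + 700 * sexdummy

# Multiple regression of salary on education and the dummy variable.
X = np.column_stack([np.ones_like(educ), educ, sexdummy])
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)

# beta[2] is the male-minus-female difference at any fixed education level.
print(np.round(beta, 3))
```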
Categorical variables in JMP
• To color and mark the points by a categorical variable such as Sex, click red triangle to left on first column and select Color or Mark by Column. Select Set Marker by Value to use different marker by column.
Response Salary
Summary of Fit
RSquare                       0.363354
RSquare Adj                   0.349206
Root Mean Square Error        572.4368
Mean of Response              5420.323
Observations (or Sum Wgts)    93
Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|   Lower 95%    Upper 95%
Intercept    4173.1251    339.1811    12.30     <.0001     3499.2827    4846.9676
EDUC         80.697765    27.67291    2.92      0.0045     25.720708    135.67482
Sexdummy     691.80826    132.2319    5.23      <.0001     429.10655    954.50997
There is strong evidence that males and females of the same education level have different salaries (p-value < .0001). The 95% confidence interval for the difference in mean salaries between males and females of the same education level is ($429.11,$954.51).
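As a quick check on how to read these estimates, the sketch below plugs the reported coefficients into the fitted model and evaluates the estimated mean salary for females and males at a few education levels. The function name and the chosen education levels are assumptions for illustration.

```python
# Reported estimates: Intercept, EDUC, Sexdummy.
b0, b1, b2 = 4173.1251, 80.697765, 691.80826


def mean_salary(educ, sexdummy):
    """Estimated mean salary; sexdummy is 1 for male, 0 for female."""
    return b0 + b1 * educ + b2 * sexdummy


# The male-minus-female gap is the same at every education level.
for educ in (8, 12, 16):
    female = mean_salary(educ, 0)
    male = mean_salary(educ, 1)
    print(educ, round(female, 2), round(male, 2), round(male - female, 2))
```

The gap equals the Sexdummy coefficient at every education level, which is the sense in which the two fitted lines have the same slope and differ only in intercept.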
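Parallel Regression Lines
• The model
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sexdummy} = x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
implies that
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sex} = \text{Male}\} = (\beta_0 + \beta_2) + \beta_1 x_1$
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sex} = \text{Female}\} = \beta_0 + \beta_1 x_1$
• Regression lines for males and females as education varies are parallel.
• No interaction between sex and education.
$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Male}\} - \mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Female}\} = \beta_2$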
[Figure: Response Salary, Whole Model Regression Plot; Salary (4000 to 8000) against EDUC (7 to 17), with lines for FEMALE and MALE]
Plot produced by JMP version 5 in Fit Model output that shows the parallel regression lines and the actual observations.
Interactions with Dummy Variables
• The model
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sexdummy} = x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
assumes that the difference between men’s and women’s mean salaries for fixed levels of education is the same for all levels of education.
• There might be an interaction between sex and education. The difference between men and women might differ depending on level of education.
Interaction Model
• Multiple regression model that allows for interaction between sex and education:
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sexdummy} = x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \cdot x_2)$
• To add interaction in JMP, create a new column sexdummy*educ. Right click on column, select formula and use the formula sexdummy*educ.
$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Male}\} - \mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Female}\} = \beta_2 + \beta_3 x_1$
Difference in mean salary between men and women of same education level depends on the education level.
Response Salary
Summary of Fit
RSquare                       0.37279
RSquare Adj                   0.351648
Root Mean Square Error        571.3617
Mean of Response              5420.323
Observations (or Sum Wgts)    93
Parameter Estimates
Term             Estimate     Std Error   t Ratio   Prob>|t|
Intercept        4395.3228    389.21      11.29     <.0001
EDUC             62.13056     31.94336    1.95      0.0549
Sexdummy         -274.8597    845.7489    -0.32     0.7460
Sexdummy*EDUC    73.585794    63.59228    1.16      0.2503
There is no evidence of an interaction (p-value = .2503). Note, in the interaction model, the coefficient on Sexdummy is not easily interpretable. The best way to understand the estimates of an interaction model is to plot the two separate regression lines as shown on the next slide.
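One way to see why the Sexdummy coefficient is hard to interpret on its own is to evaluate the estimated male-minus-female difference at several education levels. The sketch below uses the reported estimates; the function names and the chosen education levels are assumptions for illustration, and since the interaction is not statistically significant these numbers only describe the fitted lines.

```python
# Reported estimates: Intercept, EDUC, Sexdummy, Sexdummy*EDUC.
b0, b1, b2, b3 = 4395.3228, 62.13056, -274.8597, 73.585794


def mean_salary(educ, sexdummy):
    """Estimated mean salary under the interaction model."""
    return b0 + b1 * educ + b2 * sexdummy + b3 * educ * sexdummy


def male_minus_female(educ):
    """Estimated difference in mean salary at a given education level."""
    return b2 + b3 * educ


# The estimated difference changes with education.
for educ in (8, 12, 16):
    print(educ, round(male_minus_female(educ), 2))
```

The Sexdummy coefficient by itself is the estimated difference at EDUC = 0, a value that does not correspond to any education level of interest, so its negative sign says little about the differences at realistic education levels.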
[Figure: Response Salary, Whole Model Regression Plot; Salary (4000 to 8000) against EDUC (7 to 17), with separate lines for FEMALE and MALE]
The model with one continuous explanatory variable, one categorical variable and an interaction is called the separate regression lines model because the regression lines of y on the continuous explanatory variable for the two levels of the dummy variable are “separate,” neither coincident nor parallel.
Multiple regression with education, experience and sex
• We can easily control for both education and experience in the sex discrimination case by adding them both to the multiple regression. A model without interactions is:
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Exper} = x_2, \text{Sexdummy} = x_3\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
• Note that
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Exper} = x_2, \text{Sex} = \text{Male}\} - \mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Exper} = x_2, \text{Sex} = \text{Female}\} = \beta_3$
• $\beta_3$ is difference between mean salaries of males and females of same education and experience level.
Response Salary
Summary of Fit
RSquare                       0.398068
RSquare Adj                   0.377778
Root Mean Square Error        559.7297
Mean of Response              5420.323
Observations (or Sum Wgts)    93
Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|   Lower 95%    Upper 95%
Intercept    3943.6599    346.7729    11.37     <.0001     3254.6295    4632.6903
EDUC         87.667515    27.23294    3.22      0.0018     33.556244    141.77879
EXPER        1.4632657    0.645874    2.27      0.0259     0.1799284    2.746603
Sexdummy     676.17906    129.4805    5.22      <.0001     418.9041     933.45402
Strong evidence that males and females have different mean salaries for same level of education and experience. 95% confidence interval for difference between mean male and mean female salaries for same level of education and experience is ($418.90, $933.45).
Response Salary
Summary of Fit
RSquare                       0.42282
RSquare Adj                   0.389648
Root Mean Square Error        554.3651
Mean of Response              5420.323
Observations (or Sum Wgts)    93
Parameter Estimates
Term              Estimate     Std Error   t Ratio   Prob>|t|   Lower 95%    Upper 95%
Intercept         4169.488     387.3417    10.76     <.0001     3399.6045    4939.3715
EDUC              62.685629    30.99385    2.02      0.0462     1.0819919    124.28927
EXPER             2.1959717    0.838037    2.62      0.0104     0.530283     3.8616604
Sexdummy          -344.7752    900.2427    -0.38     0.7027     -2134.105    1444.5546
Sexdummy*EDUC     88.42193     64.48341    1.37      0.1738     -39.74584    216.5897
Sexdummy*EXPER    -1.346959    1.330667    -1.01     0.3142     -3.991804    1.2978853
There is not strong evidence of an interaction between sex and education (p-value=.1738) nor between sex and experience (p-value=.3142). We do not need to use interaction terms in the model.