By: Shivani Choudhary & Emily Phillips
Predicting Housing Sales Price in the Year 2008
Objective
Accurately predict Sales Price in 2008 via House characteristics
Which of these characteristics are important in this prediction?
Dataset obtained from the United States Census Bureau website: http://www.census.gov/construction/nrc/index.html
Data are collected through the Survey of Construction
Funded by Department of Housing and Urban Development
Methodology
Split data into training (75%) and test (25%)
Complete Univariate Analysis of variables
Check for Heteroscedasticity, multicollinearity, etc.
Step-wise Model Selection
Test significance, residual analysis, etc.
Check model on test dataset
Re-run on full dataset
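The 75% / 25% split in the first step can be sketched as below. This is a minimal illustration, not the code used for the analysis: the function name `split_train_test`, the seed, and the use of row indices as stand-ins for the survey records are all assumptions.

```python
import random

def split_train_test(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)           # reproducible shuffle
    n_train = int(len(shuffled) * train_fraction)   # floor of 75% of the rows
    return shuffled[:n_train], shuffled[n_train:]

# Row indices stand in for the 7,042 survey records (hypothetical placeholder data).
train, test = split_train_test(range(7042))
print(len(train), len(test))  # 5281 1761
```

Taking the floor of 7,042 x 0.75 gives the 5,281 / 1,761 split reported in the Data Distribution slide.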
Data Distribution
7,042 observations in the whole dataset: 5,281 in training, 1,761 in test
1 continuous response, 1 continuous regressor, and 6 categorical regressors
Variable                         Type
Sales Price                      Continuous (Response)
Square Foot Area of the House    Continuous (Regressor)
Bedrooms                         Categorical (Regressor)
Full Bathrooms                   Categorical (Regressor)
Half Bathrooms                   Categorical (Regressor)
Stories                          Categorical (Regressor)
Parking Facility                 Categorical (Regressor)
Metropolitan Area                Categorical (Regressor)
Scatterplot
Checking Heteroscedasticity
Spread vs Level
Box-Cox Transformation
Reducing Heteroscedasticity
Creating Linear Relationships
Scatterplot of Re-expressed Values
Problem- Interpretation
Our final Box-Cox transformation gave a lambda of -0.333 (the reciprocal cube root)
This is hard to interpret, and thus not optimal
-0.333 is close to 0, and lambda = 0 corresponds to the log transformation
The log is easier to explain
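One way to see why the log can stand in for lambda = -0.333 is to apply both transforms to the same values and check that they move together. The sketch below uses the standard Box-Cox definition; the price list is hypothetical and is not taken from the survey data.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform; lam = 0 is defined as the natural log."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical sale prices from $50,000 upward, in 20% steps.
prices = [50_000 * 1.2 ** k for k in range(17)]

cube_root_scale = [box_cox(p, -1.0 / 3.0) for p in prices]
log_scale = [box_cox(p, 0) for p in prices]

r = pearson(cube_root_scale, log_scale)
print(r)  # close to 1 when the two re-expressions are nearly linearly related
```

Both transforms are increasing in price, so they preserve the ordering of the houses; a correlation near 1 indicates that little is lost by using the more interpretable log.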
Proof of Similarity of Transformation
Outliers: Hat Matrix
Cutoff: 2p/n ~ 0.003
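The 2p/n rule flags observations whose hat-matrix diagonal (leverage) exceeds twice the average leverage. A minimal sketch for the one-regressor case follows, where the leverage has the closed form h_i = 1/n + (x_i - mean)^2 / Sxx. The x values are hypothetical, and the value p = 8 in the last line is inferred here because it reproduces a cutoff near 0.003 with n = 5,281; it is not stated on the slide.

```python
def leverages_simple(x):
    """Hat-matrix diagonals for a regression with an intercept and one regressor."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]

def high_leverage(h, p):
    """Indices of observations whose leverage exceeds the 2p/n cutoff."""
    cutoff = 2.0 * p / len(h)
    return [i for i, hi in enumerate(h) if hi > cutoff]

# Hypothetical regressor values with one point far from the rest.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
h = leverages_simple(x)
flagged = high_leverage(h, p=2)   # p = 2: intercept plus one slope
print(flagged)  # [9]

# Assumed p = 8 with the 5,281 training rows gives a cutoff of about 0.003.
print(2 * 8 / 5281)
```

The leverages always sum to p, so 2p/n is simply twice the mean leverage.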
Final Model
All 3 selection methods (forward, backward, and stepwise), using log transforms, agreed on the final model
Metropolitan area was not selected
Test data confirmed this model as a good fit
R^2 = 0.5371031 for test
R^2 = 0.5228 for training
Refit this model on the entire dataset for more accuracy
R^2 = 0.5277
X1 = Log Square Foot Area of House
X2 = 1 if 2 full bathrooms
X3 = 1 if 3 full bathrooms
X4 = 1 if 4 or more full bathrooms
X5 = 1 if 1 half bathroom
X6 = 1 if 2 or more half bathrooms
X7 = 1 if 3 bedrooms
X8 = 1 if 4 bedrooms
X9 = 1 if 5 or more bedrooms
X10 = 1 if 2 car garage
X11 = 1 if 3 or more car garage
X12 = 1 if other parking
X13 = 1 if 2 or more stories
X14 = 1 if split-level
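Each categorical regressor enters the model as a set of 0/1 indicators, with one level left out as the baseline. A small sketch for full bathrooms is shown below; the function name is hypothetical, and treating "fewer than 2 full bathrooms" as the baseline is an assumption read off the list above.

```python
def encode_full_bathrooms(count):
    """Return the indicators (X2, X3, X4) for a full-bathroom count.

    Assumed baseline: fewer than 2 full bathrooms, coded as all zeros.
    """
    return (int(count == 2), int(count == 3), int(count >= 4))

for count in (1, 2, 3, 5):
    print(count, encode_full_bathrooms(count))
```

At most one indicator in each group equals 1, so each coefficient is read relative to the baseline level.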
Variable    Coefficient   Std. Error   t-statistic   p-value     Meaning
Intercept    7.195195     0.134570      53.468       < 2e-16
X1           0.673273     0.018480      36.432       < 2e-16     Log square foot area of house
X2           0.014853     0.031145       0.477       0.633       1 if 2 full bathrooms
X3           0.203439     0.033720       6.033       1.69e-09    1 if 3 full bathrooms
X4           0.421026     0.039484      10.663       < 2e-16     1 if 4 or more full bathrooms
X5           0.113380     0.011296      10.037       < 2e-16     1 if 1 half bathroom
X6           0.182157     0.031005       5.875       4.42e-09    1 if 2 or more half bathrooms
X7          -0.164756     0.015939     -10.337       < 2e-16     1 if 3 bedrooms
X8          -0.185899     0.018351     -10.130       < 2e-16     1 if 4 bedrooms
X9          -0.266615     0.025485     -10.462       < 2e-16     1 if 5 or more bedrooms
X10         -0.003572     0.018385      -0.194       0.846       1 if 2 car garage
X11          0.145093     0.021573       6.726       1.88e-11    1 if 3 or more car garage
X12         -0.075753     0.026314      -2.879       0.004       1 if other parking
X13          0.067372     0.011786       5.716       1.13e-08    1 if 2 or more stories
X14         -0.002316     0.061797      -0.037       0.970       1 if split-level house
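Because the model is fit on the log scale, the coefficients are easier to read after back-transforming. The sketch below assumes that both Sales Price and square footage use the natural log (the slides say only "Log"); under that assumption an indicator coefficient b multiplies the predicted price by exp(b), and the X1 coefficient acts as an elasticity. The function names and the example house are hypothetical.

```python
import math

# Coefficients copied from the table above (only those used in this sketch).
COEF = {
    "intercept": 7.195195,
    "X1": 0.673273,    # log square foot area
    "X2": 0.014853,    # 2 full bathrooms
    "X4": 0.421026,    # 4 or more full bathrooms
    "X7": -0.164756,   # 3 bedrooms
    "X10": -0.003572,  # 2 car garage
}

def percent_effect(coef):
    """Percent change in predicted price when an indicator switches from 0 to 1."""
    return (math.exp(coef) - 1.0) * 100.0

def area_effect(pct_increase):
    """Percent change in predicted price for a given percent increase in area."""
    return ((1.0 + pct_increase / 100.0) ** COEF["X1"] - 1.0) * 100.0

def predict_price(sqft, indicators):
    """Back-transformed prediction, assuming log(price) is the natural log."""
    log_price = COEF["intercept"] + COEF["X1"] * math.log(sqft)
    log_price += sum(COEF[name] for name in indicators)
    return math.exp(log_price)

print(percent_effect(COEF["X4"]))  # effect of 4 or more full bathrooms vs. baseline
print(area_effect(10))             # effect of 10% more floor area
# Hypothetical house: 2,000 sq ft, 2 full baths, 3 bedrooms, 2 car garage.
print(predict_price(2000, ["X2", "X7", "X10"]))
```

Under these assumptions, 4 or more full bathrooms corresponds to a predicted price a little over 50% higher than the baseline bathroom level, holding the other variables fixed.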
Testing a Subset of Regression Coefficients
Full Model: F-statistic= 560.9, p-value < 2.2e-16
Can conclude there is predictive value in the equation as a whole
Variable Removed               F-statistic   p-value
Square Foot Area of House      1327.3        < 2.2e-16
Full Bathrooms                 118.3         < 2.2e-16
Half Bathrooms                 55.972        < 2.2e-16
Bedrooms                       45.704        < 2.2e-16
Parking Facility               55.833        < 2.2e-16
Stories                        16.48         7.24e-08
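The F-statistics above follow two standard formulas, sketched below. The overall F can be recovered from R^2; using the reported R^2 = 0.5277 with n = 7,042 and k = 14 regressors (assuming the reported F refers to the full-dataset refit) gives a value close to the 560.9 on the slide. The partial F compares a reduced model, with one variable's indicators removed, against the full model; the sums of squared errors in that example are hypothetical.

```python
def overall_f(r2, n, k):
    """Overall F-statistic from R^2 for a model with k regressors and an intercept."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

def partial_f(sse_reduced, sse_full, q, df_resid_full):
    """Partial F for dropping q coefficients from the full model."""
    return ((sse_reduced - sse_full) / q) / (sse_full / df_resid_full)

# Values from the slides: R^2 = 0.5277, n = 7,042, k = 14 (X1 to X14).
f_all = overall_f(0.5277, 7042, 14)
print(round(f_all, 1))

# Hypothetical sums of squared errors, for illustration only.
f_drop = partial_f(sse_reduced=120.0, sse_full=100.0, q=2, df_resid_full=50)
print(f_drop)  # 5.0
```

The small gap between the recomputed overall F and the reported value is consistent with R^2 being rounded to four decimals.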
Example of Whole vs. Individual Significance
Variable            Level      t-statistic   p-value     Signif. code
Parking Facility    Level 2    -0.194        0.846
Parking Facility    Level 3     6.726        1.88e-11    ***
Parking Facility    Level 4    -2.879        0.004       **
Parking Facility as a whole: F-statistic = 55.833, p-value < 2.2e-16
Residuals vs Fitted
Normal Q-Q Plot of Residuals
Problems we faced
Necessary transformations for variables
Missing data (chose to exclude)
Low Level of Multicollinearity
Categorical Data
Outliers
Possible overfitting (huge dataset)?
Conclusion
We were able to develop a model that predicts the Sales Price for houses in 2008 moderately well
We found variables that appear to be important in this prediction