predictive model
Income Analysis
Ping Yin
11/10/2016
Contents
• Executive Summary
• Introduction
• Purpose
• Methodology
  • Data Selection
  • Exploration
  • Preparation & Transformation
  • Model Development & Assessment
  • Model Comparison
• Options and Recommendations
• Summary
• Appendix
Executive Summary
• After data preparation and partitioning, three models are built: one each in SAS Studio, Enterprise Miner (EM), and DataRobot
• The same test dataset is scored by all three models
• The model built in EM has the best performance
Introduction
• Can we predict Income level based on age, gender, education, etc.?
• What will my income level be after I graduate?
Purpose
• Find the best predictive model for the Income dataset
• Predict my own Income level
• Practice data preparation, model building, and model assessment
Data Selection
• The Income dataset was originally extracted from the 1994 Census Bureau database
• Downloaded from Kaggle.com
• Reasons for choosing it:
  • The target variable, Income, is categorical
  • Medium size: 10+ columns and 30K+ rows
  • Already used in the Macro and DataRobot projects
Exploration
• Using SAS Studio to explore the data
• 32,561 observations
• 15 variables: 6 Num, 9 Char
  • Num: Age, Capitalgain, Capitalloss, Weekhour, Edunum, Fnlwgt
  • Char: Income, Relationship, Education, Occupation, Sex, Marital, Workclass, Race, Nativecountry
• Target: Income (“>50K”, “<=50K”)
Exploration
Data issues:
• Missing values: Workclass, Occupation, Nativecountry
• Many levels: Education, Marital, Workclass, Nativecountry
• Numeric variables: Capitalgain, Capitalloss
• Variable to screen out: Fnlwgt
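The first two issues above can be checked with a short profile of each column. The sketch below is only an illustration in plain Python, not the SAS Studio code used in the project. It assumes missing values are encoded as "?" in the raw file, and the sample rows are made up.

```python
from collections import Counter

# Assumption: missing values are encoded as "?" in the raw file.
MISSING = "?"

def profile(rows, columns):
    """Return {column: (missing_count, number_of_distinct_levels)}."""
    report = {}
    for col in columns:
        counts = Counter(row[col] for row in rows)
        missing = counts.pop(MISSING, 0)  # remove the marker so it is not counted as a level
        report[col] = (missing, len(counts))
    return report

# Made-up sample rows for illustration, not real Census records.
sample = [
    {"Workclass": "Private", "Nativecountry": "United-States"},
    {"Workclass": "?", "Nativecountry": "Mexico"},
    {"Workclass": "State-gov", "Nativecountry": "?"},
    {"Workclass": "Private", "Nativecountry": "United-States"},
]
report = profile(sample, ["Workclass", "Nativecountry"])
print(report)
```

A column with a high missing count is a candidate for imputation, and one with many distinct levels is a candidate for binning.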
Preparation & Transformations
• Solutions:
  • Imputing missing values using subject-matter knowledge: missing Workclass and Occupation values are replaced with “Unemployed”
  • Imputing missing values using the mode: missing Nativecountry values are replaced with “United-States”
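The two imputation strategies can be sketched as follows. This is an illustrative plain-Python version, not the project's SAS code. The "?" missing marker and the helper names `impute_constant` and `impute_mode` are assumptions, and the sample rows are made up.

```python
from collections import Counter

MISSING = "?"  # assumption: how missing values appear in the raw file

def impute_constant(rows, column, value):
    """Replace missing entries in `column` with a fixed, analyst-chosen value."""
    return [{**r, column: value} if r[column] == MISSING else r for r in rows]

def impute_mode(rows, column):
    """Replace missing entries in `column` with the most frequent observed value.

    Assumes the column has at least one non-missing value.
    """
    observed = Counter(r[column] for r in rows if r[column] != MISSING)
    mode_value = observed.most_common(1)[0][0]
    return impute_constant(rows, column, mode_value)

# Made-up sample rows for illustration.
rows = [
    {"Workclass": "Private", "Nativecountry": "United-States"},
    {"Workclass": "?", "Nativecountry": "?"},
    {"Workclass": "Private", "Nativecountry": "United-States"},
    {"Workclass": "Local-gov", "Nativecountry": "Mexico"},
]
rows = impute_constant(rows, "Workclass", "Unemployed")
rows = impute_mode(rows, "Nativecountry")
print(rows[1])
```

Constant imputation encodes a judgment about why the value is missing, while mode imputation only reflects the observed distribution.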
Preparation & Transformations
• Solutions:
  • Converting Capitalgain and Capitalloss from Num to Char
  • Binning variables with many levels: Education, Marital, Workclass
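A minimal sketch of both transformations, in plain Python. The slide does not say how the numeric amounts were turned into categories or which levels were merged, so the zero/non-zero flag and the Education grouping below are assumptions made only for illustration.

```python
# Hypothetical grouping: the bins actually used in the project are not listed on the slide.
EDUCATION_BINS = {
    "HS-grad": "HS-grad",
    "Some-college": "Some-college",
    "Assoc-voc": "Associate",
    "Assoc-acdm": "Associate",
    "Bachelors": "Bachelors",
    "Masters": "Graduate",
    "Prof-school": "Graduate",
    "Doctorate": "Graduate",
}

def bin_education(level):
    """Map a raw Education level to a coarser bin."""
    # Levels not listed above (Preschool through 12th) fall into one bin.
    return EDUCATION_BINS.get(level, "Below-HS")

def to_char_flag(amount):
    """Assumed Num-to-Char rule: any positive amount becomes "Yes", zero becomes "No"."""
    return "Yes" if amount > 0 else "No"

print(bin_education("Masters"), bin_education("9th"))
print(to_char_flag(2174), to_char_flag(0))
```

Fewer levels usually mean faster training, at the cost of discarding distinctions between the merged levels.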
Preparation & Transformations
• Solutions:
  • Binning Nativecountry into a new variable: region
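Deriving region from Nativecountry amounts to a lookup table. The mapping below is hypothetical and covers only a few countries. The region names, the "Other" fallback, and the helper `add_region` are assumptions, since the slide does not list the grouping that was used.

```python
# Hypothetical country-to-region lookup; the project's real grouping is not given.
REGION_BY_COUNTRY = {
    "United-States": "North-America",
    "Canada": "North-America",
    "Mexico": "Latin-America",
    "Cuba": "Latin-America",
    "Germany": "Europe",
    "England": "Europe",
    "India": "Asia",
    "China": "Asia",
    "Philippines": "Asia",
}

def add_region(row):
    """Return a copy of `row` with a new `region` field derived from Nativecountry."""
    region = REGION_BY_COUNTRY.get(row["Nativecountry"], "Other")
    return {**row, "region": region}

print(add_region({"Nativecountry": "Cuba"}))
print(add_region({"Nativecountry": "Atlantis"}))  # unmapped values fall back to "Other"
```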
Preparation & Transformations
• Reasons for dropping variable Fnlwgt:
  • It is the weight from the Current Population Survey files, not original Census data
  • It showed near-zero importance in last week's DataRobot project
Preparation & Transformations
• Reasons for leaving variable Occupation unchanged:
  • 15 levels
  • No sound criterion for grouping them
• Reasons for leaving variables Race and Relationship unchanged:
  • 5-6 levels
  • Each level is meaningful
Preparation & Transformations
After preparation: (output shown on the slide is not included in this transcript)
Preparation & Transformations
• Data partition using the Strata method (stratified sampling)
• The result is a training dataset and a test dataset, both used in SAS Studio, Enterprise Miner, and DataRobot
Now it is ready to go!
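A stratified partition splits each level of the target separately, so both datasets keep the same Income mix. The sketch below is plain Python rather than the SAS procedure used in the project. The 70% training fraction, the fixed seed, and the helper name `stratified_split` are assumptions, and the rows are made up.

```python
import random
from collections import defaultdict

def stratified_split(rows, target, train_fraction, seed=0):
    """Split rows into (train, test), sampling within each level of `target`."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_level = defaultdict(list)
    for row in rows:
        by_level[row[target]].append(row)

    train, test = [], []
    for level_rows in by_level.values():
        shuffled = level_rows[:]  # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test

# Made-up rows: 20 with Income ">50K" and 80 with "<=50K".
rows = [{"id": i, "Income": ">50K" if i % 5 == 0 else "<=50K"} for i in range(100)]
train, test = stratified_split(rows, "Income", train_fraction=0.7)
print(len(train), len(test))
```

Because the cut is made per level, the share of ">50K" rows is the same in the training and test datasets, which a simple random split does not guarantee.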
Model Development & Assessment: SAS Studio
Model Development & Assessment: EM
Model Development & Assessment: DataRobot
Model Comparison
• The best model in this project (columns shown: EM, SAS Studio, DataRobot)
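Comparing models on the same test dataset comes down to computing one metric per model and ranking them. The slide does not name the metric, so the sketch below assumes misclassification rate. The model names and predictions are invented for illustration and are not results from this project.

```python
def misclassification_rate(actual, predicted):
    """Fraction of test rows whose predicted label differs from the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Invented labels and predictions, for illustration only.
actual = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
predictions = {
    "model_a": [">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
    "model_b": ["<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
}

rates = {name: misclassification_rate(actual, pred) for name, pred in predictions.items()}
best = min(rates, key=rates.get)  # lowest error rate wins
print(rates, best)
```

With an imbalanced target such as Income, a metric like ROC AUC may rank models differently from the misclassification rate, so the choice of metric is worth stating alongside the comparison.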
Model Comparison: Predict my Income level
• Ping Dataset, scored by EM, SAS Studio, and DataRobot
Options and Recommendations
• Using 60% of the data to build a model
• Using 70% of the data to build a model
Options and Recommendations
• Macro Project
• DataRobot Project
• The overall best model
Options and Recommendations
• Factors which may cause these differences:
  • Dropping variable Fnlwgt
  • Reducing levels
  • Variable transformation: Capitalgain, Capitalloss
• These preparation steps increase speed but decrease model performance
Options
• Use DataRobot to build models without handling the “data issues”
• Keep trying in SAS Studio
Summary
• We can predict Income level based on these characteristics
• For the Income dataset, DataRobot is the most robust tool for building models
• Be aware of unexpected outcomes from data preparation
• Go back and forth until the result is satisfactory
Appendix
Link to Data:
https://www.kaggle.com/uciml/adult-census-Income
Thanks!