predictive model

Income Analysis Ping Yin 11/10/2016

Upload: ping-yin

Post on 10-Jan-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Predictive model

Income Analysis

Ping Yin, 11/10/2016

Page 2: Predictive model

Contents
• Executive Summary: 3
• Introduction: 4
• Purpose: 5
• Methodology
    • Data Selection: 6
    • Exploration: 7-24
    • Preparation & Transformation: 25-34
    • Model Development & Assessment: 35-44
    • Model Comparison: 45-47
• Options and Recommendations: 48-52
• Summary: 53
• Appendix: 54

Page 3: Predictive model

Executive Summary
• After data preparation and partitioning, three models are built, one each in SAS Studio, Enterprise Miner (EM), and DataRobot

• The same test dataset is scored by these models

• The model built in EM has the best performance

Page 4: Predictive model

Introduction
• Can we predict Income level based on age, gender, education, etc.?

• What is my income level after I graduate?

Page 5: Predictive model

Purpose

• Identify the best predictive model for the Income dataset

• Predict my Income level

• Practice data preparation, model building, and model assessment

Page 6: Predictive model

Data Selection
• The Income dataset was originally extracted from the 1994 Census Bureau database

• Downloaded from Kaggle.com

• Reasons for choosing it:
    • The target variable, Income, is a categorical variable
    • Medium size: 10+ columns and 30K+ rows
    • Already used in the Macro and DataRobot projects

Page 7: Predictive model

Exploration
• Data explored in SAS Studio
• 32,561 observations
• 15 variables: 6 Num, 9 Char
• Num: Age, Capitalgain, Capitalloss, Weekhour, Edunum, Fnlwgt
• Char: Income, Relationship, Education, Occupation, Sex, Marital, Workclass, Race, Nativecountry
• Target: Income (“>50K”, “<=50K”)
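
The exploration in the deck was done in SAS Studio. The same kind of first look (number of observations, numeric versus character variables, target distribution) can be sketched in Python. This is only an illustration: the `summarize` helper and the four sample rows are made up here and are not real records from the Income data.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample shaped like the Income data (NOT real records).
SAMPLE_CSV = """Age,Workclass,Education,Weekhour,Income
39,State-gov,Bachelors,40,<=50K
50,Self-emp,Bachelors,13,<=50K
52,Private,HS-grad,45,>50K
31,Private,Masters,50,>50K
"""


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize(csv_text, target="Income"):
    """Count observations, split columns into Num/Char, and tally the target."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    # A column counts as numeric only if every value in it parses as a number.
    numeric = [c for c in columns if all(is_number(r[c]) for r in rows)]
    character = [c for c in columns if c not in numeric]
    return {
        "n_obs": len(rows),
        "numeric": numeric,
        "character": character,
        "target_counts": Counter(r[target] for r in rows),
    }


summary = summarize(SAMPLE_CSV)
print(summary)
```

On the full file the same function would be called with the contents of the downloaded CSV instead of `SAMPLE_CSV`.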

Pages 8-23: Predictive model

Exploration [figure slides; no text captured in transcript]

Page 24: Predictive model

Exploration
Data issues:

• Missing values: Workclass, Occupation, Nativecountry
• Multiple levels: Education, Marital, Workclass, Nativecountry
• Numeric variables: Capitalgain, Capitalloss
• Screen variable: Fnlwgt
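
The first two issues in the list above (missing values and variables with many levels) can be checked with a small profile per column. The sketch below assumes missing values are coded as "?" in the raw file, as in the Kaggle copy of this dataset. The `profile` helper and the sample rows are illustrative only.

```python
MISSING = "?"  # assumption: marker used for missing values in the raw file

# Made-up rows for illustration (NOT real records).
rows = [
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "United-States"},
    {"Workclass": "?", "Occupation": "?", "Nativecountry": "United-States"},
    {"Workclass": "State-gov", "Occupation": "Adm-clerical", "Nativecountry": "?"},
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "Mexico"},
]


def profile(rows):
    """For each column, count missing values and distinct non-missing levels."""
    report = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        present = [v for v in values if v != MISSING]
        report[col] = {
            "missing": len(values) - len(present),
            "levels": len(set(present)),
        }
    return report


report = profile(rows)
for col, stats in report.items():
    print(col, stats)
```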

Page 25: Predictive model

Preparation & Transformations
• Solutions:

• Imputing missing values using subject matter knowledge: impute missing values for Workclass and Occupation with “Unemployed”

• Imputing missing values using the mode: impute missing values for Nativecountry with “United-States”
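
The two imputation rules above can be sketched as follows. This is a minimal illustration, not the SAS code used in the project: it assumes missing values are coded as "?", and the helper names `impute_constant` and `impute_mode` and the sample rows are made up.

```python
from collections import Counter

MISSING = "?"  # assumption: marker used for missing values in the raw file


def impute_constant(rows, column, value):
    """Replace missing entries in one column with a fixed value (in place)."""
    for row in rows:
        if row[column] == MISSING:
            row[column] = value


def impute_mode(rows, column):
    """Replace missing entries with the most frequent observed value."""
    observed = [r[column] for r in rows if r[column] != MISSING]
    mode_value = Counter(observed).most_common(1)[0][0]
    impute_constant(rows, column, mode_value)
    return mode_value


# Made-up rows for illustration (NOT real records).
rows = [
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "United-States"},
    {"Workclass": "?", "Occupation": "?", "Nativecountry": "United-States"},
    {"Workclass": "State-gov", "Occupation": "Adm-clerical", "Nativecountry": "?"},
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "Mexico"},
]

# Subject-matter rule: a missing Workclass/Occupation is treated as unemployed.
impute_constant(rows, "Workclass", "Unemployed")
impute_constant(rows, "Occupation", "Unemployed")
# Mode rule: a missing Nativecountry gets the most common country.
mode_country = impute_mode(rows, "Nativecountry")
print(mode_country)
```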

Page 26: Predictive model

Preparation & Transformations
• Solutions:

• Converting Capitalgain and Capitalloss from Num to Char
• Binning multiple-level variables: Education, Marital, Workclass
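
The slides do not say which levels were used when Capitalgain and Capitalloss were converted to character variables. The sketch below assumes a simple two-level scheme ("Zero" versus "Positive") purely for illustration. The function name `amount_to_level` and the sample rows are hypothetical.

```python
def amount_to_level(amount):
    """Map a numeric amount to a character level.

    Hypothetical two-level scheme; the cut points used in the project
    are not stated in the slides.
    """
    return "Zero" if float(amount) == 0 else "Positive"


# Made-up rows for illustration (NOT real records).
rows = [
    {"Capitalgain": "0", "Capitalloss": "0"},
    {"Capitalgain": "2174", "Capitalloss": "0"},
    {"Capitalgain": "0", "Capitalloss": "1902"},
]

# Convert both numeric columns to character levels in place.
for row in rows:
    for col in ("Capitalgain", "Capitalloss"):
        row[col] = amount_to_level(row[col])

print(rows)
```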

Page 27: Predictive model

Preparation & Transformations
• Solutions:

• Binning Nativecountry and creating a new variable: Region
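
Binning a many-level variable into a new one is essentially a lookup. The sketch below shows the idea for Nativecountry. The country-to-region grouping in `REGION_BY_COUNTRY` is hypothetical, since the slides do not list the actual bins, and unmapped countries fall back to "Other".

```python
# Hypothetical grouping; the bins used in the project are not listed in the slides.
REGION_BY_COUNTRY = {
    "United-States": "North-America",
    "Canada": "North-America",
    "Mexico": "Latin-America",
    "Cuba": "Latin-America",
    "India": "Asia",
    "China": "Asia",
    "Germany": "Europe",
    "England": "Europe",
}


def add_region(rows, source="Nativecountry", target="Region", default="Other"):
    """Add a coarser Region column derived from the country column."""
    for row in rows:
        row[target] = REGION_BY_COUNTRY.get(row[source], default)
    return rows


# Made-up rows for illustration (NOT real records).
rows = add_region([
    {"Nativecountry": "United-States"},
    {"Nativecountry": "India"},
    {"Nativecountry": "Iran"},
])
print([r["Region"] for r in rows])
```

The same lookup pattern would apply to binning Education, Marital, and Workclass, each with its own mapping.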

Page 28: Predictive model

Preparation & Transformations
• Reasons for dropping the variable Fnlwgt:

• It is the weight from the Current Population Survey files, not original data from the Census
• It showed near-zero importance in last week's DataRobot project

Page 29: Predictive model

Preparation & Transformations
• Reasons for leaving the variable Occupation unchanged:

• 15 levels
• No sound criterion for grouping them

• Reasons for leaving the variables Race and Relationship unchanged:
• 5-6 levels
• Each level is meaningful

Page 30: Predictive model

Preparation & Transformations
After preparation:

Pages 31-32: Predictive model

Preparation & Transformations [figure slides; no text captured in transcript]

Page 33: Predictive model

Preparation & Transformations
• Data partitioned using stratified sampling (Strata method)
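
A stratified partition keeps the proportion of each Income level the same in the training and test sets. The sketch below is a standard-library illustration of that idea, not the SAS procedure used in the project. The function name `stratified_split`, the 70% training share, and the generated rows are assumptions for the example.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, train_percent, seed=0):
    """Split rows into train/test so each target level keeps the same share."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for row in rows:
        by_level[row[target]].append(row)

    train, test = [], []
    for level_rows in by_level.values():
        shuffled = level_rows[:]
        rng.shuffle(shuffled)
        # Integer arithmetic avoids floating-point rounding surprises.
        cut = len(shuffled) * train_percent // 100
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Made-up data: 20 rows with ">50K" and 80 rows with "<=50K".
rows = [{"id": i, "Income": ">50K" if i % 5 == 0 else "<=50K"} for i in range(100)]
train, test = stratified_split(rows, "Income", train_percent=70)

high_train = sum(r["Income"] == ">50K" for r in train)
high_test = sum(r["Income"] == ">50K" for r in test)
print(len(train), len(test), high_train, high_test)
```

Both partitions end up with the same 20% share of ">50K" rows as the full made-up dataset.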

Page 34: Predictive model

Now it is ready to go!

Training dataset

Test dataset

SAS Studio

Enterprise Miner

DataRobot

Pages 35-38: Predictive model

Model Development & Assessment: SAS Studio [figure slides; no text captured in transcript]

Pages 39-40: Predictive model

Model Development & Assessment: EM [figure slides; no text captured in transcript]

Pages 41-44: Predictive model

Model Development & Assessment: DataRobot [figure slides; no text captured in transcript]

Page 45: Predictive model

Model Comparison

Page 46: Predictive model

Model Comparison
• The best model in this project:

EM Studio DataRobot
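
Because all three models score the same test dataset, they can be compared on a common metric. The slides do not state which metric was used, so the sketch below assumes plain accuracy. The model names `model_a` and `model_b` and all labels are placeholders, not results from the project.

```python
def accuracy(actual, predicted):
    """Share of test rows where the predicted level matches the actual level."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Placeholder labels for a five-row test set (NOT project results).
actual = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
predictions = {
    "model_a": [">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
    "model_b": [">50K", ">50K", "<=50K", "<=50K", "<=50K"],
}

# Score every model on the same test labels and pick the highest accuracy.
scores = {name: accuracy(actual, preds) for name, preds in predictions.items()}
best = max(scores, key=scores.get)
print(scores, best)
```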

Page 47: Predictive model

Model Comparison: Predict my Income level
Ping Dataset

EM

Studio

DataRobot

Page 48: Predictive model

Options and Recommendations

Using 60% of the data to build a model

Using 70% of the data to build a model

Page 49: Predictive model

Options and Recommendations
Macro Project

DataRobot Project

The overall best model

Page 50: Predictive model

Options and Recommendations
• Factors that may cause these differences:

• Dropping the variable Fnlwgt

• Reducing levels

• Variable transformation: Capitalgain, Capitalloss

• These steps increase speed but decrease model performance

Page 51: Predictive model

Options
• Use DataRobot to build models without handling the “data issues”

• Keep trying in SAS Studio

Page 52: Predictive model

Summary
• We can predict Income level based on these characteristics
• For the Income dataset, DataRobot is the most robust tool for building models

• Be aware that data preparation can have unexpected outcomes

• Go back and forth until reaching a satisfactory result

Page 53: Predictive model

Appendix
Link to Data:

https://www.kaggle.com/uciml/adult-census-Income

Page 54: Predictive model

Thanks!