Kaggle "Give me some credit" challenge overview
DESCRIPTION
Full description of the work associated with this project can be found at: http://www.npcompleteheart.com/project/kaggle-give-me-some-credit/
Predicting delinquency on debt
What is the problem?
• X Store has a retail credit card available to customers
• There can be a number of sources of loss from this product, but one is customers defaulting on their debt
• This prevents the store from collecting payment for products and services rendered
Is this problem big enough to matter?
• Examining a slice of the customer database (150,000 customers), we find that 6.6% of customers were seriously delinquent in payment in the last two years
• If only 5% of their carried debt was on the store credit card, this is potentially:
• An average loss of $8.12 per customer
• A potential overall loss of $1.2 million
What can be done?
• There are numerous models that can be used to predict which customers will default
• This could be used to decrease credit limits or cancel credit lines for current risky customers to minimize potential loss
• Or better screen which customers are approved for the card
How will I do this?
• This is a basic classification problem with important business implications
• We’ll examine a few simplistic models to get an idea of performance
• Explore decision tree methods to achieve better performance
What will the models use to predict delinquency?
Each customer has a number of attributes:
• John Smith: Delinquent: Yes, Age: 23, Income: $1,600, Number of Lines: 4
• Mary Rasmussen: Delinquent: No, Age: 73, Income: $2,200, Number of Lines: 2
• ...
We will use the customer attributes to predict whether they were delinquent
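The attribute/label setup above can be sketched in plain Python; the field names here are illustrative, not the dataset's actual column names:

```python
# The two example customers as plain records (illustrative field names).
customers = [
    {"name": "John Smith", "delinquent": True, "age": 23,
     "income": 1600, "num_lines": 4},
    {"name": "Mary Rasmussen", "delinquent": False, "age": 73,
     "income": 2200, "num_lines": 2},
]

# The model's inputs are the attributes; the target is whether the
# customer was delinquent.
features = [(c["age"], c["income"], c["num_lines"]) for c in customers]
labels = [c["delinquent"] for c in customers]
```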
How do we make sure that our solution actually has predictive power?
We have two slices of the customer dataset:
• Train: 150,000 customers, delinquency in dataset
• Test: 101,000 customers, delinquency not in dataset
None of the customers in the test dataset are used to train the model
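The idea of holding out unseen customers can be sketched with scikit-learn (an assumption for illustration; in the challenge itself the train/test split is fixed by Kaggle and the test labels are hidden):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))      # stand-in customer attributes
y = rng.integers(0, 2, size=1000)   # stand-in delinquency labels

# Hold out a test set that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
```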
Internally we validate our model performance with cross-fold validation
Using only the train dataset we can get a sense of how well our model performs without externally validating it:
• Split the train dataset into folds (Train 1, Train 2, Train 3)
• Train the algorithm on Train 1 and Train 2
• Test the algorithm on the held-out Train 3
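The three-fold scheme above can be sketched with scikit-learn's `cross_val_score`; the synthetic data and the decision-tree model here are placeholders, not the actual challenge code:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# With cv=3, each fold is held out once while the other two folds
# train the model; we get one score per fold.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=3)
```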
What matters is how well we can predict the test dataset
We judge this using accuracy: the number of correct predictions out of the total number of predictions made
So with 100,000 customers and 80% accuracy, we will have correctly predicted for 80,000 customers whether they will default in the next two years
Putting accuracy in context
We could save $600,000 over two years if we correctly predicted 50% of the customers that would default and changed their accounts to prevent it
The potential loss shrinks by ~$8,000 for every 100,000 customers with each percentage-point increase in accuracy
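These figures follow from the slide's own numbers; a quick arithmetic check (rounding explains the small differences):

```python
customers = 150_000
avg_loss_per_customer = 8.12   # from the slides

# Potential overall loss across the 150,000-customer slice.
total_loss = customers * avg_loss_per_customer        # ~ $1.2 million

# Preventing half of the defaults saves about half of that.
savings_at_50pct = 0.5 * total_loss                   # ~ $600,000

# One percentage point of accuracy on 100,000 customers is 1,000
# customers, each worth the average loss.
per_point = 0.01 * 100_000 * avg_loss_per_customer    # ~ $8,000
```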
Looking at the actual data
(Figure: a look at the raw data, annotated "Assume $2,500" and "Assume 0" for fields with missing values)
There is a continuum of algorithmic choices to tackle the problem
From simpler and quicker to more complex and slower:
• Random chance: 50%
• Simple classification
![Page 47: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/47.jpg)
For simple classification we pick a single attribute and find the best split in the customers
![Page 48: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/48.jpg)
For simple classification we pick a single attribute and find the best split in the customers
![Page 49: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/49.jpg)
For simple classification we pick a single attribute and find the best split in the customers
Num
ber
of C
usto
mer
s
Times Past Due
![Page 50: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/50.jpg)
For simple classification we pick a single attribute and find the best split in the customers
Num
ber
of C
usto
mer
s
Times Past Due
True PositiveTrue NegativeFalse PositiveFalse Negative
1
![Page 51: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/51.jpg)
For simple classification we pick a single attribute and find the best split in the customers
Num
ber
of C
usto
mer
s
Times Past Due
True PositiveTrue NegativeFalse PositiveFalse Negative
1 2
![Page 52: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/52.jpg)
For simple classification we pick a single attribute and find the best split in the customers
Num
ber
of C
usto
mer
s
Times Past Due
True PositiveTrue NegativeFalse PositiveFalse Negative
1 2
![Page 53: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/53.jpg)
For simple classification we pick a single attribute and find the best split in the customers
Num
ber
of C
usto
mer
s
Times Past Due
True PositiveTrue NegativeFalse PositiveFalse Negative
1 2
![Page 54: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/54.jpg)
For simple classification we pick a single attribute and find the best split in the customers
(Figure: histogram of number of customers vs. times past due, with candidate splits at 1, 2, ...; each split divides customers into true positives, true negatives, false positives, and false negatives)
We evaluate possible splits using accuracy, precision, and sensitivity:
• Acc = (number correct) / (total number)
• Prec = (true positives) / (number of people predicted delinquent)
• Sens = (true positives) / (number of people actually delinquent)
(Figure: accuracy, precision, and sensitivity vs. the split threshold on number of times 30-59 days past due)
The best split achieves 0.61 KGI on the test set
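The single-attribute threshold scan and the three metrics can be sketched as follows; the synthetic `times_past_due` data and the accuracy-based selection are illustrative assumptions, not the actual analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
times_past_due = rng.poisson(1.0, size=10_000)
# Synthetic labels: delinquency more likely with more past-due events.
delinquent = rng.random(10_000) < np.clip(0.05 + 0.2 * times_past_due, 0, 1)

def metrics(threshold):
    predicted = times_past_due >= threshold      # predict "delinquent"
    tp = np.sum(predicted & delinquent)
    acc = np.mean(predicted == delinquent)
    prec = tp / max(predicted.sum(), 1)          # of those predicted delinquent
    sens = tp / max(delinquent.sum(), 1)         # of those actually delinquent
    return acc, prec, sens

# Scan candidate thresholds 1, 2, ... and keep the most accurate split.
best = max(range(1, 10), key=lambda t: metrics(t)[0])
```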
![Page 61: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/61.jpg)
However, not all fields are as informative
Using the number of times past due 60-89 dayswe achieve a KGI of 0.5
![Page 62: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/62.jpg)
However, not all fields are as informative
Using the number of times past due 60-89 days, we achieve a KGI of 0.5
The approach is naive and could be improved, but our time is better spent on different algorithms
![Page 63: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/63.jpg)
Exploring algorithmic choices further
Simpler,Quicker
Complex,Slower
RandomChance
0.50
SimpleClassification
0.50-0.61
![Page 64: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/64.jpg)
Exploring algorithmic choices further
From simpler and quicker to more complex and slower:
• Random chance: 0.50
• Simple classification: 0.50-0.61
• Random forests
![Page 65: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/65.jpg)
A random forest starts from a decision tree
Customer Data
![Page 66: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/66.jpg)
A random forest starts from a decision tree
Customer Data
Find the best split in a set ofrandomly chosen attributes
![Page 67: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/67.jpg)
A random forest starts from a decision tree
Customer Data
Find the best split in a set ofrandomly chosen attributes
Is age <30?
![Page 68: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/68.jpg)
A random forest starts from a decision tree
Customer Data
Find the best split in a set ofrandomly chosen attributes
Is age <30?
No
75,000 Customers>30
![Page 69: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/69.jpg)
A random forest starts from a decision tree
Customer Data
Find the best split in a set ofrandomly chosen attributes
Is age <30?
No
75,000 Customers>30
Yes
25,000 Customers <30
![Page 70: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/70.jpg)
A random forest starts from a decision tree
Customer data: find the best split in a set of randomly chosen attributes, e.g. "Is age < 30?"
• No: 75,000 customers > 30
• Yes: 25,000 customers < 30
Each resulting subset is split again, and so on
![Page 71: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/71.jpg)
A random forest is composed of many decision trees
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
![Page 72: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/72.jpg)
A random forest is composed of many decision trees
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1 ...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
![Page 73: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/73.jpg)
A random forest is composed of many decision trees
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1 ...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
Class assignment of a customer is based on how manyof the decision trees “vote” on how to split an attribute
![Page 74: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/74.jpg)
A random forest is composed of many decision trees
(Figure: many independent trees, each splitting the customer data at its own best split)
We use a large number of trees to avoid over-fitting to the training data
Class assignment of a customer is based on how many of the decision trees "vote" for each class
![Page 75: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/75.jpg)
The Random Forest algorithm are easily implemented
In Python or R for initial testing and validation
![Page 76: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/76.jpg)
The Random Forest algorithm are easily implemented
In Python or R for initial testing and validation
![Page 77: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/77.jpg)
The Random Forest algorithm is easily implemented
In Python or R for initial testing and validation
It can also be parallelized with Mahout and Hadoop, since there is no dependence from one tree to the next
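A minimal scikit-learn version, as one plausible Python implementation; the synthetic data stands in for the customer table:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))            # stand-in customer attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in delinquency label

# Each tree trains on a bootstrap sample and considers random feature
# subsets at each split; n_jobs=-1 trains trees in parallel, which is
# possible precisely because the trees are independent.
forest = RandomForestClassifier(n_estimators=150, n_jobs=-1, random_state=0)
forest.fit(X, y)
proba = forest.predict_proba(X)[:, 1]     # fraction of trees voting "delinquent"
```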
![Page 78: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/78.jpg)
A random forest performs well on the test set
Random Forest 10 trees: 0.779 KGI
![Page 79: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/79.jpg)
A random forest performs well on the test set
Random Forest 10 trees: 0.779 KGI150 trees: 0.843 KGI
![Page 80: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/80.jpg)
A random forest performs well on the test set
Random Forest 10 trees: 0.779 KGI150 trees: 0.843 KGI1000 trees: 0.850 KGI
![Page 81: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/81.jpg)
A random forest performs well on the test set
Random Forest 10 trees: 0.779 KGI150 trees: 0.843 KGI1000 trees: 0.850 KGI
![Page 82: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/82.jpg)
A random forest performs well on the test set
Random forest: 10 trees: 0.779 KGI; 150 trees: 0.843 KGI; 1,000 trees: 0.850 KGI
(Figure: accuracy, 0.4-0.9, for random chance, simple classification, and random forests)
![Page 83: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/83.jpg)
Exploring algorithmic choices further
Simpler,Quicker
Complex,Slower
RandomChance
0.50
SimpleClassification
0.50-0.61
RandomForests
0.78-0.85
![Page 84: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/84.jpg)
Exploring algorithmic choices further
From simpler and quicker to more complex and slower:
• Random chance: 0.50
• Simple classification: 0.50-0.61
• Random forests: 0.78-0.85
• Gradient tree boosting
![Page 85: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/85.jpg)
Boosting Trees is similar to a Random Forest
Customer Data
Find the best split in a set ofrandomly chosen attributes
Is age <30?
No
Customers >30 Data
Yes
Customers <30 Data
...
![Page 86: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/86.jpg)
Boosting trees is similar to a random forest
Customer data: do an exhaustive search for the best split (rather than searching a random subset of attributes), e.g. "Is age < 30?"
• No: customers > 30
• Yes: customers < 30
Each resulting subset is split again, and so on
![Page 87: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/87.jpg)
How Gradient Boosting Trees differs from Random Forest
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
The first tree is optimized to minimize a loss function describing the data
![Page 88: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/88.jpg)
How Gradient Boosting Trees differs from Random Forest
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
The first tree is optimized to minimize a loss function describing the data
The next tree is then optimized to fit whatever variability the first
tree didn’t fit
![Page 89: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/89.jpg)
How Gradient Boosting Trees differs from Random Forest
...
Customer Data
Best Split
No
Customers Data Set 2
Yes
Customers Data Set 1
The first tree is optimized to minimize a loss function describing the data
The next tree is then optimized to fit whatever variability the first
tree didn’t fit
This is a sequential process in comparison to the random forest
![Page 90: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/90.jpg)
How Gradient Boosting Trees differs from Random Forest
• The first tree is optimized to minimize a loss function describing the data
• The next tree is then optimized to fit whatever variability the first tree didn't fit
• This is a sequential process, in comparison to the random forest
• We also run the risk of over-fitting to the data, hence the learning rate
![Page 91: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/91.jpg)
Implementing Gradient Boosted Trees
In Python or R it is easy for initial testing and validation
![Page 92: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/92.jpg)
Implementing Gradient Boosted Trees
• In Python or R, it is easy to do initial testing and validation
• There are implementations that use Hadoop, but achieving the best performance with them is more complicated
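In Python, initial testing might look like the following, a sketch using scikit-learn's `GradientBoostingClassifier` with the 100 trees and 0.1 learning rate quoted on the next slide. The `make_classification` call is a stand-in for the competition's credit data (with a class imbalance roughly like the ~7% delinquency rate), and AUC is used as the evaluation score.

```python
# Minimal gradient-boosted-trees test harness in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data, heavily skewed toward non-defaulters.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.93], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)

# Score the predicted delinquency probabilities with AUC.
scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
```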
![Page 96: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/96.jpg)
Gradient Boosting Trees performs well on the dataset
• 100 trees, 0.1 learning rate: 0.865022 KGI
• 1000 trees, 0.1 learning rate: 0.865248 KGI
[Plot: KGI (roughly 0.75-0.85) as a function of learning rate (0-0.8)]
[Chart: accuracy (0.4-0.9) of random chance, simple classification, random forests, and boosting trees]
![Page 97: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/97.jpg)
Moving one step further in complexity

From simpler and quicker to more complex and slower:
• Random chance: 0.50
• Simple classification: 0.50-0.61
• Random forests: 0.78-0.85
• Gradient tree boosting: 0.71-0.8659
• Blended method
![Page 104: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/104.jpg)
Or more accurately, an ensemble of ensemble methods

Algorithm progression, each producing train data probabilities:
• Random Forest
• Extremely Random Forest
• Gradient Tree Boosting
[Table: per-model predicted probabilities on the training data, e.g. 0.1, 0.5, 0.01, 0.8, 0.7, … and 0.15, 0.6, 0.0, 0.75, 0.68, …]
![Page 107: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/107.jpg)
Combine all of the model information
[Table: per-model train data probabilities, e.g. 0.1, 0.5, 0.01, 0.8, 0.7, … and 0.15, 0.6, 0.0, 0.75, 0.68, …]
• Optimize a weighting of the train probabilities against the known delinquencies
• Apply the same weighting scheme to the set of test data probabilities
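One simple way to realize this blending step is to find a weight per model that makes the weighted average of train probabilities best match the known delinquency labels, then reuse those weights on the test probabilities. This is a sketch under that assumption (log loss as the objective, `scipy.optimize.minimize` as the solver); the probability rows reuse the example numbers from the slide, and the labels and test values are made up for illustration.

```python
# Blend model probabilities by optimizing per-model weights on the train set.
import numpy as np
from scipy.optimize import minimize

y_train = np.array([0, 1, 0, 1, 1])            # hypothetical known delinquencies
train_probs = np.array([                       # one row per model
    [0.1, 0.5, 0.01, 0.8, 0.7],                # e.g. random forest
    [0.15, 0.6, 0.0, 0.75, 0.68],              # e.g. gradient tree boosting
])

def log_loss(weights):
    p = np.clip(weights @ train_probs, 1e-9, 1 - 1e-9)
    return -np.mean(y_train * np.log(p) + (1 - y_train) * np.log(1 - p))

# Weights constrained to be non-negative and sum to one.
result = minimize(log_loss, x0=[0.5, 0.5], bounds=[(0, 1), (0, 1)],
                  constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
weights = result.x

# Apply the same weighting scheme to the test probabilities.
test_probs = np.array([[0.3, 0.9], [0.25, 0.85]])
blended = weights @ test_probs
```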
![Page 108: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/108.jpg)
Implementation can be done in a number of ways
• Testing in Python or R is slower, due to the sequential nature of applying the algorithms
• It could be made faster by parallelizing: running each algorithm separately and combining the results
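Since the three ensembles are independent of one another, they can be fit concurrently and blended afterwards. A sketch of that idea using the standard library's `concurrent.futures` (the models and data are stand-ins; `ExtraTreesClassifier` is scikit-learn's name for the extremely randomized forest):

```python
# Fit the independent ensemble models concurrently, then collect probabilities.
from concurrent.futures import ThreadPoolExecutor
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

def fit_and_score(model):
    # Train one model and return its train-set probabilities for blending.
    return model.fit(X, y).predict_proba(X)[:, 1]

models = [RandomForestClassifier(n_estimators=50, random_state=0),
          ExtraTreesClassifier(n_estimators=50, random_state=0),
          GradientBoostingClassifier(n_estimators=50)]

with ThreadPoolExecutor() as pool:
    probabilities = list(pool.map(fit_and_score, models))
# Each entry is one model's train probabilities, ready for the weighting step.
```

Process pools or a cluster scheduler would give truer parallelism for heavier workloads; the structure stays the same.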
![Page 111: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/111.jpg)
Assessing model performance
• Blending performance, 100 trees: 0.864394 KGI
• But this performance, and the possibility of additional gains, comes at a distinct time cost
[Chart: accuracy (0.4-0.9) of random chance, simple classification, random forests, boosting trees, and the blended method]
![Page 112: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/112.jpg)
Examining the continuum of choices

From simpler and quicker to more complex and slower:
• Random chance: 0.50
• Simple classification: 0.50-0.61
• Random forests: 0.78-0.85
• Gradient tree boosting: 0.71-0.8659
• Blended method: 0.864
![Page 117: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/117.jpg)
What would be best to implement?
• There is a large amount of optimization that could still be done on the blended method
• However, this algorithm takes the longest to run, and that constraint applies in testing and validation as well
• Random Forests returns a reasonably good result; it is quick and easily parallelized
• Gradient Tree Boosting returns the best result and runs reasonably fast, though it is not as easily parallelized
![Page 120: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/120.jpg)
Increases in predictive performance have real business value
• Using any of the more complex algorithms, we achieve an increase of 35% in comparison to random chance
• Potential decrease of ~$420k in losses by identifying customers likely to default, in the training set alone
![Page 121: Kaggle "Give me some credit" challenge overview](https://reader033.vdocuments.site/reader033/viewer/2022052905/558653ffd8b42a5c128b45a1/html5/thumbnails/121.jpg)
Thank you for your time