Anatomy of a Data Product and Lending Club Data
![Page 1: Anatomy of a Data Product and Lending Club Data](https://reader031.vdocuments.site/reader031/viewer/2022030306/586f792f1a28ab10258b6ed5/html5/thumbnails/1.jpg)
H2O.ai Machine Intelligence
ANATOMY OF A DATA PRODUCT
BUILDING MACHINE LEARNING MODELS FOR APPLICATIONS
AMY WANG
![Page 2: Anatomy of a Data Product and Lending Club Data](https://reader031.vdocuments.site/reader031/viewer/2022030306/586f792f1a28ab10258b6ed5/html5/thumbnails/2.jpg)
H2O.ai
H2O Company
• Team: ~35. Founded in 2012, Mountain View, CA
• Stanford Math & Systems Engineers
H2O Software
• Open Source Software (Apache 2.0 License)
• Ease of Use via Web Interface
• R, Python, Scala, Spark & Hadoop Interfaces
• Distributed Algorithms Scale to Big Data
![Page 3: Anatomy of a Data Product and Lending Club Data](https://reader031.vdocuments.site/reader031/viewer/2022030306/586f792f1a28ab10258b6ed5/html5/thumbnails/3.jpg)
Scientific Advisory Council
Dr. Trevor Hastie
• John A. Overdeck Professor of Mathematics, Stanford University
• PhD in Statistics, Stanford University
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Co-author with John Chambers, Statistical Models in S
• Co-author, Generalized Additive Models
• 108,404 citations (via Google Scholar)
Dr. Rob Tibshirani
• Professor of Statistics and Health Research and Policy, Stanford University
• PhD in Statistics, Stanford University
• COPSS Presidents’ Award recipient
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Author, Regression Shrinkage and Selection via the Lasso
• Co-author, An Introduction to the Bootstrap
Dr. Stephen Boyd
• Professor of Electrical Engineering and Computer Science, Stanford University
• PhD in Electrical Engineering and Computer Science, UC Berkeley
• Co-author, Convex Optimization
• Co-author, Linear Matrix Inequalities in System and Control Theory
• Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
![Page 4: Anatomy of a Data Product and Lending Club Data](https://reader031.vdocuments.site/reader031/viewer/2022030306/586f792f1a28ab10258b6ed5/html5/thumbnails/4.jpg)
What makes an effective data product?
• Solves a problem
• Summarizes and visualizes the problem with historical data
• Contains machine learning models that generate predictions and recommendations the user can use to make a decision informed by data
• Returns insights on predictions and not just the decisions themselves
• Is formulated so that newly collected input automatically updates the model and keeps the product from becoming outdated
![Page 5: Anatomy of a Data Product and Lending Club Data](https://reader031.vdocuments.site/reader031/viewer/2022030306/586f792f1a28ab10258b6ed5/html5/thumbnails/5.jpg)
Data Product Flow
Historical Data
Data Engineering
Machine Learning
Model
Input Data
Prediction
Actual Results Returned Back
![Page 6: Anatomy of a Data Product and Lending Club Data](https://reader031.vdocuments.site/reader031/viewer/2022030306/586f792f1a28ab10258b6ed5/html5/thumbnails/6.jpg)
SUMMARY OF APPROVED LOAN APPLICANTS
CODE SNIPPET FOR R:
CREDIT SCORE SUMMARIES
CREDIT SCORE DISTRIBUTION OF GOOD LOANS
CREDIT SCORE DISTRIBUTION OF BAD LOANS
The dataset has a total of half a million records from 2007 to 2015. The first step is to import the data and create a new column that categorizes each loan as either a good loan or a bad loan (one where the borrower has defaulted or the account has been charged off).
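The labeling step can be sketched in plain Python rather than the R used on the slide. This is a minimal illustration, not the original snippet: the status strings, the `loan_status` field name, and the sample records are assumptions and should be checked against the actual data.

```python
# Status strings are assumptions for illustration; check them against the
# actual values in the status column of the data being used.
BAD_STATUSES = {"Default", "Charged Off"}


def label_loan(loan_status: str) -> str:
    """Return "bad" for defaulted or charged-off loans, "good" otherwise."""
    return "bad" if loan_status.strip() in BAD_STATUSES else "good"


# Made-up records, not taken from the Lending Club data.
loans = [
    {"id": 1, "loan_status": "Fully Paid"},
    {"id": 2, "loan_status": "Charged Off"},
    {"id": 3, "loan_status": "Default"},
]

# Add the new good/bad column to every record.
for loan in loans:
    loan["bad_loan"] = label_loan(loan["loan_status"])
```

Treating every status outside the bad set as good is a simplification; loans that are still in progress may deserve separate handling.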
In H2O Flow, you can view the distribution of credit scores for good loans versus bad loans. It is easy to see that borrowers with bad loans typically have lower credit scores, so credit score will be the biggest driving force in predicting whether a loan is good or not. However, we want a model that also takes other features into account so that loans aren’t automatically cut off at a certain threshold.
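Outside of Flow, the same comparison can be approximated with per-group summary statistics. The sketch below uses only the Python standard library; the `(score, label)` pairs are made up for illustration and are not Lending Club values.

```python
from statistics import mean, median


def summarize_scores(rows):
    """Group credit scores by label and return min/median/mean/max per group."""
    groups = {}
    for score, label in rows:
        groups.setdefault(label, []).append(score)
    return {
        label: {
            "min": min(scores),
            "median": median(scores),
            "mean": mean(scores),
            "max": max(scores),
        }
        for label, scores in groups.items()
    }


# Made-up (credit_score, label) pairs for illustration only.
rows = [
    (720, "good"), (690, "good"), (705, "good"),
    (640, "bad"), (610, "bad"), (665, "bad"),
]
summary = summarize_scores(rows)
```

Comparing `summary["good"]` with `summary["bad"]` shows how far apart the two groups sit and how much their ranges overlap.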
![Page 7: Anatomy of a Data Product and Lending Club Data](https://reader031.vdocuments.site/reader031/viewer/2022030306/586f792f1a28ab10258b6ed5/html5/thumbnails/7.jpg)
MODELING
We are ready to start modeling the data. In this case we are going to stick with the six features that are also available in the rejectedStats dataset. The rejectedStats data is a compilation of all the applications that have been rejected, so a model built only on these shared features can also score the rejected applications and look for missed opportunities.
THIS CODE SNIPPET WILL BUILD A GBM MODEL WITH 200 TREES AND MAX DEPTH OF 6:
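A rough Python equivalent, using H2O's Python interface instead of R, might look like the sketch below. It is not the original snippet: the function name `train_gbm` and its arguments are hypothetical, the six feature names are left to the caller because they are not listed here, and it assumes the `h2o` package is installed and `h2o.init()` has already started or connected to a cluster.

```python
# Hyperparameters stated on the slide: 200 trees, maximum depth of 6.
GBM_PARAMS = {"ntrees": 200, "max_depth": 6}


def train_gbm(train, valid, features, target):
    """Train an H2O GBM on the H2OFrames `train` and `valid`.

    `features` is a list of predictor column names and `target` is the
    name of the good/bad label column; both are supplied by the caller.
    """
    # Imported lazily so this module loads without the h2o package installed.
    from h2o.estimators import H2OGradientBoostingEstimator

    model = H2OGradientBoostingEstimator(**GBM_PARAMS)
    model.train(
        x=features,
        y=target,
        training_frame=train,
        validation_frame=valid,
    )
    return model
```

Passing a validation frame is what makes validation metrics available alongside the training metrics once the model is built.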
You can check the scoring history on the training and validation frames in Flow, as well as the variable importance for the model. As expected, risk_score (the borrower’s credit score) plays a large role in predicting whether the loan will default.
SCORING HISTORY
In our case, the AUC value for both the training and validation sets is higher than 0.97, which is extremely good. It means we can find most of the bad loans without losing too many potential good loans to false positives.
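To make the AUC figure concrete: AUC can be read as the probability that a randomly chosen bad loan receives a higher predicted risk than a randomly chosen good loan. The sketch below computes it by direct pair counting on made-up scores; it illustrates the definition and is not how H2O computes the metric internally.

```python
def auc(scores, labels):
    """AUC as the fraction of (bad, good) pairs where the bad loan gets the
    higher score; ties count as half. `labels` uses 1 for bad, 0 for good."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one bad and one good loan")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up predicted default probabilities and true labels.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
example_auc = auc(scores, labels)  # 5 of the 6 (bad, good) pairs are ordered correctly
```

An AUC of 1.0 would mean every bad loan is ranked above every good loan, and 0.5 would be no better than random ordering.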
![Page 8: Anatomy of a Data Product and Lending Club Data](https://reader031.vdocuments.site/reader031/viewer/2022030306/586f792f1a28ab10258b6ed5/html5/thumbnails/8.jpg)
POST MODEL ANALYSIS
SUMMARY
CREDIT SCORE DISTRIBUTION OF GOOD LOANS
CREDIT SCORE DISTRIBUTION OF BAD LOANS
Looking at the distribution of credit scores for the bad loans, it does look like credit score is not the only factor in predicting a good loan. However, the model doesn’t identify any bad borrowers with a score higher than 630; one way to remedy this is to give more weight to bad borrowers with a high credit score by using row weights in GBM.
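One way to build such row weights is sketched below. The 630 threshold comes from the paragraph above; the function name `row_weight` and the `boost` value of 5.0 are arbitrary illustrative assumptions that would need tuning. In H2O, a column of weights like this can be passed to training through the `weights_column` argument.

```python
def row_weight(credit_score, is_bad, threshold=630, boost=5.0):
    """Return a larger weight for bad loans above the score threshold.

    `threshold` reflects the 630 cutoff mentioned above; `boost` is an
    arbitrary illustrative value, not a recommended setting.
    """
    return boost if is_bad and credit_score > threshold else 1.0


# Made-up (credit_score, is_bad) pairs for illustration only.
rows = [(700, True), (600, True), (700, False)]
weights = [row_weight(score, bad) for score, bad in rows]
```

Only the first row is up-weighted: it is the one bad loan with a high credit score, the case the model currently misses.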
This particular GBM model saves the user about $181 million in bad loans while losing out on $97 million in profit from missed good loans. So the GBM model, built in about 1.5 minutes, will save the lenders a net total of about $83 MILLION.
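The net figure is the avoided losses minus the forgone profit. A small sketch of that arithmetic, using the rounded amounts quoted above (in millions of dollars):

```python
def net_savings(avoided_losses, missed_profit):
    """Net benefit of the model: bad-loan losses avoided minus profit
    given up on good loans that the model wrongly rejects."""
    return avoided_losses - missed_profit


# Rounded figures from the text, in millions of dollars.
rounded_net = net_savings(181, 97)
```

With the rounded inputs this gives 84, within a million of the $83 million quoted; the difference presumably comes from rounding in the two quoted amounts, though the unrounded values are not given here.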
![Page 9: Anatomy of a Data Product and Lending Club Data](https://reader031.vdocuments.site/reader031/viewer/2022030306/586f792f1a28ab10258b6ed5/html5/thumbnails/9.jpg)
Customers • Community • Evangelists
November 9, 10, 11
Computer History Museum
H2OWORLD.H2O.AI
20% off registration using code: h2ocommunity