Anatomy of a Data Product and Lending Club Data

H2O.ai Machine Intelligence

ANATOMY OF A DATA PRODUCT

BUILDING MACHINE LEARNING MODELS FOR APPLICATIONS

AMY WANG

H2O.ai

H2O Company

• Team: ~35. Founded in 2012, Mountain View, CA
• Stanford Math & Systems Engineers

H2O Software

• Open Source Software (Apache 2.0 License)
• Ease of Use via Web Interface
• R, Python, Scala, Spark & Hadoop Interfaces
• Distributed Algorithms Scale to Big Data


Scientific Advisory Council

Dr. Trevor Hastie

• John A. Overdeck Professor of Mathematics, Stanford University
• PhD in Statistics, Stanford University
• Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining
• Co-author with John Chambers, Statistical Models in S
• Co-author, Generalized Additive Models
• 108,404 citations (via Google Scholar)

Dr. Rob Tibshirani

• Professor of Statistics and Health Research and Policy, Stanford University
• PhD in Statistics, Stanford University
• COPPS Presidents’ Award recipient
• Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining
• Author, Regression Shrinkage and Selection via the Lasso
• Co-author, An Introduction to the Bootstrap

Dr. Stephen Boyd

• Professor of Electrical Engineering and Computer Science, Stanford University
• PhD in Electrical Engineering and Computer Science, UC Berkeley
• Co-author, Convex Optimization
• Co-author, Linear Matrix Inequalities in System and Control Theory
• Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers


What makes an effective data product?

• Solves a problem
• Summarizes and visualizes the problem with historical data
• Contains machine learning models that generate predictions and recommendations the user can use to make a decision informed by data
• Returns insights on the predictions and not just the decisions themselves
• Is formulated so that newly collected input automatically updates the model and keeps the product from becoming outdated


Data Product Flow

[Flow diagram with stages: Historical Data, Data Engineering, Machine Learning, Model, Input Data, Prediction, Actual Results Returned Back]


SUMMARY OF APPROVED LOAN APPLICANTS

CODE SNIPPET FOR R:

CREDIT SCORE SUMMARIES

CREDIT SCORE DISTRIBUTION OF GOOD LOANS

CREDIT SCORE DISTRIBUTION OF BAD LOANS

The dataset has a total of half a million records from 2007 through 2015. The first step is to import the data and create a new column that categorizes each loan as either a good loan or a bad loan (a bad loan being one where the borrower has defaulted or the account has been charged off).
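The labeling step described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the column name `loan_status`, the new column name `bad_loan`, and the exact status strings `"Default"` and `"Charged Off"` are not given in the slides, and the sample rows are invented.

```python
# Statuses treated as "bad" follow the text: defaulted or charged off.
# The exact status strings and column names are assumptions.
BAD_STATUSES = {"Default", "Charged Off"}


def label_loan(loan_status):
    """Return "bad" for defaulted or charged-off loans, otherwise "good"."""
    return "bad" if loan_status.strip() in BAD_STATUSES else "good"


def add_loan_label(rows, status_column="loan_status", label_column="bad_loan"):
    """Return new row dicts with an extra column holding the good/bad label."""
    return [{**row, label_column: label_loan(row[status_column])} for row in rows]


# Tiny hypothetical sample, not taken from the real dataset.
sample = [
    {"loan_status": "Fully Paid", "risk_score": 720},
    {"loan_status": "Charged Off", "risk_score": 610},
    {"loan_status": "Default", "risk_score": 640},
]
labeled = add_loan_label(sample)
```

The input rows are left untouched, so the same raw data can be relabeled later if the definition of a bad loan changes.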

In H2O Flow, you can view the distribution of credit scores for good loans versus bad loans. It is easy to see that holders of bad loans typically have lower credit scores, so credit score will be the biggest driving force in predicting whether a loan is good or not. However, we want a model that also takes other features into account, so that loans aren’t automatically cut off at a certain credit score threshold.
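Outside of Flow, the same comparison of credit score distributions can be approximated with a simple binned count per loan class. The sketch below is a stand-in for the Flow charts, not a reproduction of them; the column names `risk_score` and `bad_loan`, the bin width, and the sample rows are all assumptions.

```python
from collections import Counter


def score_histogram(scores, bin_width=20):
    """Count scores per bin; each key is the lower edge of a bin."""
    return Counter((int(s) // bin_width) * bin_width for s in scores)


def histograms_by_label(rows, score_column="risk_score",
                        label_column="bad_loan", bin_width=20):
    """Group rows by label and build one histogram per group."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[label_column], []).append(row[score_column])
    return {label: score_histogram(vals, bin_width)
            for label, vals in grouped.items()}


# Hypothetical rows for illustration only.
rows = [
    {"risk_score": 705, "bad_loan": "good"},
    {"risk_score": 718, "bad_loan": "good"},
    {"risk_score": 742, "bad_loan": "good"},
    {"risk_score": 612, "bad_loan": "bad"},
    {"risk_score": 618, "bad_loan": "bad"},
    {"risk_score": 665, "bad_loan": "bad"},
]
hist = histograms_by_label(rows)
```

Comparing `hist["good"]` and `hist["bad"]` bin by bin shows where the two classes overlap, which is the region where features other than credit score have to do the work.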


MODELING

We are ready to start modeling the data. In this case we are going to stick with the six features that are also available in the rejectedStats dataset. The rejectedStats data is a compilation of all the applications that have been rejected, so even without any other information we can score those applications and look for missed opportunities.

THIS CODE SNIPPET WILL BUILD A GBM MODEL WITH 200 TREES AND MAX DEPTH OF 6:
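The original snippet (written for R) is not preserved in this text. As a hedged substitute, here is a minimal sketch using H2O's Python interface with the two settings named above, 200 trees and a maximum depth of 6. The function name `train_gbm` and its arguments are illustrative; the six feature names and the target column are not listed in the slides, so the caller has to supply them.

```python
# Hyperparameters named in the text: 200 trees, maximum depth 6.
GBM_PARAMS = {"ntrees": 200, "max_depth": 6}


def train_gbm(train, valid, features, target):
    """Train an H2O GBM on the H2OFrames `train` and `valid`.

    `features` is the list of predictor column names and `target` is the
    good/bad loan column. Neither is spelled out in the slides, so both are
    passed in. The target should be a categorical (factor) column.
    """
    # Imported lazily so this file can be loaded without H2O installed.
    from h2o.estimators import H2OGradientBoostingEstimator

    model = H2OGradientBoostingEstimator(**GBM_PARAMS)
    model.train(x=features, y=target,
                training_frame=train, validation_frame=valid)
    return model
```

Calling `train_gbm` requires a running H2O cluster (`h2o.init()`) and frames loaded with, for example, `h2o.import_file`. On the returned model, `model.auc(valid=True)` and `model.varimp()` give the validation AUC and the variable importances.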

You can check the scoring history on the training and validation frames in Flow, as well as the variable importance for the model. As expected, risk_score (the user’s credit score) plays a large role in whether the loan will default.

SCORING HISTORY

In our case, the AUC value for both the training and validation sets is higher than 0.97, which is extremely good. It means we can find most of the bad loans without losing too many potential good loans to false positives.
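To make the AUC figure concrete: AUC can be read as the probability that a randomly chosen bad loan receives a higher model score than a randomly chosen good loan. The sketch below computes it directly from that definition in plain Python. It is not how H2O computes the metric internally, and the labels and scores are invented.

```python
def auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    labels: 1 for a bad loan (positive class), 0 for a good loan.
    scores: model score, higher means more likely to be bad.
    Ties count as half a win. Cost is O(n_pos * n_neg), fine for a demo.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one bad loan (0.4) is ranked below one good loan (0.5).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
demo_auc = auc(labels, scores)
```

In this toy example 11 of the 12 bad/good pairs are ordered correctly. An AUC above 0.97 means nearly every such pair is ordered correctly, which is why few good loans are lost when bad loans are flagged.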


POST MODEL ANALYSIS

SUMMARY

CREDIT SCORE DISTRIBUTION OF GOOD LOANS

CREDIT SCORE DISTRIBUTION OF BAD LOANS

Looking at the distribution of credit scores for the bad loans, it does look like credit score is not the only factor in predicting a good loan. However, the model does not flag any bad borrowers with a score higher than 630; one way to remedy this is to give more weight to bad borrowers with a high credit score by using row weights in GBM.
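The row-weight idea can be sketched as a per-row weight column that is larger for bad loans with a credit score above 630. H2O's GBM accepts such a column through its `weights_column` parameter. Everything else here is an assumption for illustration: the boost value of 5.0, the column names, and the sample rows.

```python
def row_weight(is_bad, credit_score, score_threshold=630, boost=5.0):
    """Return the training weight for one loan.

    Bad loans with a credit score above `score_threshold` get `boost`;
    every other row keeps weight 1.0. The boost of 5.0 is illustrative only.
    """
    if is_bad and credit_score > score_threshold:
        return boost
    return 1.0


def add_weight_column(rows, weight_column="weight"):
    """Return new row dicts with the weight column added."""
    return [
        {**row,
         weight_column: row_weight(row["bad_loan"] == "bad", row["risk_score"])}
        for row in rows
    ]


# Hypothetical rows for illustration only.
loans = [
    {"risk_score": 610, "bad_loan": "bad"},
    {"risk_score": 700, "bad_loan": "bad"},
    {"risk_score": 700, "bad_loan": "good"},
    {"risk_score": 630, "bad_loan": "bad"},
]
weighted = add_weight_column(loans)
```

The right boost would need to be tuned against the validation set, since a weight that is too large trades missed bad loans for more rejected good ones.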

This particular GBM model saves the lender about $181 million in bad loans while losing out on $97 million in profit from missed good loans. So the GBM model, built in about 1.5 minutes, will save the lenders a net total of $83 MILLION.
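The net figure is the avoided losses minus the profit given up on good loans that the model rejects. A tiny sketch of that arithmetic follows; the function name is illustrative. With the rounded amounts quoted in the text the difference comes to 84, so the $83 million on the slide presumably reflects unrounded dollar amounts, which is an assumption since the exact figures are not given.

```python
def net_savings(avoided_losses, missed_profit):
    """Net benefit of the model: losses avoided minus profit forgone."""
    return avoided_losses - missed_profit


# Rounded figures from the text, in millions of dollars.
net_from_rounded = net_savings(avoided_losses=181, missed_profit=97)
```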

Customers • Community • Evangelists

November 9, 10, 11

Computer History Museum

H2OWORLD.H2O.AI

20% off registration using code:

h2ocommunity