
Page 1: Imposing Domain Knowledge on Algorithmic Learning

Imposing Domain Knowledge on Algorithmic Learning
An Effective Approach to Construct Deployable Predictive Models

Dr. Gerald Fahner, FICO, August 28, 2013

Page 2: Imposing Domain Knowledge on Algorithmic Learning

Objective

» What lenders seek from models:

1. Accurate predictions of customer behavior

2. Insights into fitted relationships

3. Ability to impose domain knowledge before deploying models

» Algorithmic learning procedures address 1. and 2., but pay scant attention to 3.

» As a consequence, algorithmic models may not be deployable for some credit scoring applications

We propose an effective approach to impose domain knowledge on algorithmic learning that paves the way for deployment

Page 3: Imposing Domain Knowledge on Algorithmic Learning

Agenda

» Algorithmic Learning Illustration

» Case Study - Modeling US Home Equity Data

» Diagnosing Complex Models

» Algorithmic Learning-Informed Construction of Palatable Scorecards

» Summary

Page 4: Imposing Domain Knowledge on Algorithmic Learning

Sketch of Tree Ensemble Models

[Diagram: data (x1, x2, y) feeds M trees (Tree 1, Tree 2, ..., Tree M); an aggregation mechanism combines them into the ensemble score, the prediction function E[y] over (x1, x2).]

Examples: Random Forest [1], Stochastic Gradient Boosting [2]

Page 5: Imposing Domain Knowledge on Algorithmic Learning

Demonstration Problem

Next, we fit Stochastic Gradient Boosting to approximate the data

[Figure: simulated noisy data points and the true prediction function over (x1, x2).]

We simulate noisy data from an underlying "true" prediction function to create a synthetic development data set
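To make the illustration concrete, here is a minimal sketch of such a simulation in Python using scikit-learn's GradientBoostingRegressor. The "true" function, noise level, and hyperparameters are illustrative assumptions, not the deck's actual setup; the loop reproduces the qualitative behavior of the following pages, where the residual error shrinks toward pure noise as trees are added.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def true_f(X):
    """Assumed 'true' prediction function over (x1, x2); purely illustrative."""
    return 5 * np.sin(2 * X[:, 0]) + 3 * X[:, 1]

N = 2000
X = rng.uniform(-2, 2, size=(N, 2))            # predictors x1, x2 on [-2, 2]
y = true_f(X) + rng.normal(scale=2.0, size=N)  # simulated noisy data points

for n_trees in (1, 5, 25, 100, 200):
    sgb = GradientBoostingRegressor(
        n_estimators=n_trees,   # number of trees in the ensemble
        learning_rate=0.1,
        max_depth=3,
        subsample=0.5,          # the "stochastic" part: each tree sees a random half of the data
        random_state=0,
    )
    sgb.fit(X, y)
    resid = y - sgb.predict(X)
    print(f"{n_trees:4d} trees: residual std = {resid.std():.2f}")
```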

Page 6: Imposing Domain Knowledge on Algorithmic Learning

Stochastic Gradient Boosting 1 Tree

[Figure: residual error and prediction function after 1 tree.]

Page 7: Imposing Domain Knowledge on Algorithmic Learning

Stochastic Gradient Boosting 5 Trees

[Figure: residual error and prediction function after 5 trees.]

Page 8: Imposing Domain Knowledge on Algorithmic Learning

Stochastic Gradient Boosting 25 Trees

[Figure: residual error and prediction function after 25 trees.]

Page 9: Imposing Domain Knowledge on Algorithmic Learning

Stochastic Gradient Boosting 100 Trees

[Figure: residual error and prediction function after 100 trees.]

Page 10: Imposing Domain Knowledge on Algorithmic Learning

Stochastic Gradient Boosting 200 Trees

[Figure: residual error (now pure noise) and prediction function after 200 trees.]

Page 11: Imposing Domain Knowledge on Algorithmic Learning

Agenda

» Algorithmic Learning Illustration

» Case Study - Modeling US Home Equity Data

» Diagnosing Complex Models

» Algorithmic Learning-Informed Construction of Palatable Scorecards

» Summary

Page 12: Imposing Domain Knowledge on Algorithmic Learning

Problem and Data

Predictor candidates and descriptions:

REASON: 'Home improvement' or 'debt consolidation'
JOB: Six occupational categories
LOAN: Amount of loan request
MORTDUE: Amount due on existing mortgage
VALUE: Value of current property
DEBTINC: Debt-to-income ratio
YOJ: Years at present job
DEROG: Number of major derogatory reports
CLNO: Number of trade lines
DELINQ: Number of delinquent trade lines
CLAGE: Age of oldest trade line in months
NINQ: Number of recent credit inquiries

Predict likelihood of default on US home equity loans. Binary Good/Bad target.

Sample size: N(total) = 5,960, N(bad) = 1,189. 12 predictor candidates

Data source as of 07/25/2013: http://old.cba.ua.edu/~mhardin/DATAMiningdatasets2/hmeq.xls

Page 13: Imposing Domain Knowledge on Algorithmic Learning

Stochastic Gradient Boosting (SGB) Experiments

» Use SGB to approximate log(Odds) of being “Good” [3]

» Will use Area Under Curve (AUC) as evaluation measure throughout

Two models compared:

1. Model that is additive in the predictors (trees have 2 leaves)

2. Model capable of capturing interactions between predictors (trees have 15 leaves)

[Chart: 5-fold cross-validated AUC for the additive model versus the model with interactions.]

Interactions appear to be substantial
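A hedged sketch of this comparison with scikit-learn (the deck does not specify its implementation): it assumes X and y hold the already-prepared HMEQ predictors (numeric, with missing values handled) and the binary Good/Bad target, and the hyperparameters are illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def compare_additive_vs_interactions(X, y):
    """X, y: prepared HMEQ predictors (numeric, missing values handled) and Good/Bad target."""
    for leaves, label in ((2, "additive (2 leaves)"), (15, "w/ interactions (15 leaves)")):
        sgb = GradientBoostingClassifier(
            n_estimators=300,       # illustrative settings, not the deck's
            learning_rate=0.05,
            max_leaf_nodes=leaves,  # 2-leaf trees (stumps) give a model additive in the predictors
            subsample=0.5,
            random_state=0,
        )
        auc = cross_val_score(sgb, X, y, cv=5, scoring="roc_auc")
        print(f"{label:28s} mean AUC = {auc.mean():.3f} (+/- {auc.std():.3f})")
```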

Page 14: Imposing Domain Knowledge on Algorithmic Learning

Eminently Useful Model Diagnostics (following [4])

Variable importance (relative to the most important variable):

DEBTINC 1.00
DELINQ 0.49
VALUE 0.45
CLAGE 0.40
DEROG 0.34
LOAN 0.25
CLNO 0.24
MORTDUE 0.23
NINQ 0.23
JOB 0.20
YOJ 0.20
REASON 0.07

Interaction test statistics (whether a variable interacts with any other variables):

DEBTINC 0.029
VALUE 0.022
CLNO 0.014
CLAGE 0.013
DELINQ 0.010
YOJ 0.008
MORTDUE 0.007
LOAN 0.007
DEROG 0.006
JOB 0.006
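As a sketch of the first diagnostic, relative variable importance can be read off a fitted scikit-learn model and rescaled to the most important variable; the impurity-based feature_importances_ used here is one common choice and may differ from the measure behind the numbers above. The interaction test statistic of [4] (Friedman and Popescu's H-statistic, which compares two-variable partial dependence against the sum of the one-variable partial dependences) is not implemented here.

```python
import pandas as pd

def relative_importance(sgb, feature_names):
    """Variable importance rescaled so the most important variable equals 1.00."""
    imp = pd.Series(sgb.feature_importances_, index=feature_names)
    return (imp / imp.max()).sort_values(ascending=False).round(2)

# Usage (hypothetical names): print(relative_importance(fitted_sgb, X.columns))
```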

Page 15: Imposing Domain Knowledge on Algorithmic Learning

Black Box Busters: Diagnosing Predictive Relationships Learned by Complex Models

Score out the complex model at regular grid points across the range of one predictor at a time (here x1), while keeping all other predictors fixed. Here we fix the other predictors to their values for observation 1: we sweep x1 through its range in the predictor vector (x11, x12, ..., x1P) and pass each grid point through the trained SGB model.

[Figure: Curve 1, the score (log(Odds)) as a function of x1 after centering, with the other predictors fixed to their values for observation 1.]

Page 16: Imposing Domain Knowledge on Algorithmic Learning

Diagnosing Predictive Relationships Learned by Complex Models (continued)

Now we repeat the sweep with the other predictors fixed to their values for observations 1, 2, and 3: the trained SGB model is scored along the predictor vectors (x11, x12, ..., x1P), (x21, x22, ..., x2P), and (x31, x32, ..., x3P) while x1 sweeps through its range.

[Figure: Curves 1, 2, 3, the score (log(Odds)) as a function of x1 after centering.]

Page 17: Imposing Domain Knowledge on Algorithmic Learning

Conditioning on Observations 1 to N

Repeating the sweep for every observation, with the other predictors fixed to their values for observations 1 to N, scores the trained SGB model along the predictor vectors (x11, x12, ..., x1P) through (xN1, xN2, ..., xNP), yielding Curves 1 to N.

[Figure: Curves 1 to N, the score (log(Odds)) as a function of x1 after centering.]

Page 18: Imposing Domain Knowledge on Algorithmic Learning

Partial Dependence Plot

Defined in [4] as sample average over the N curves

When x1 enters the model additively, then this plot summarizes the influence of x1 on the score well.

When x1 participates in significant interactions, then explore the relationship further by creating two-dimensional partial dependence plots.

[Figure: two one-dimensional partial dependence plots, score (log(Odds)) versus the range of x1.]
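A sketch of this diagnostic in Python (names are illustrative; the fitted model is assumed to expose the score on the log(Odds) scale via decision_function, as scikit-learn's gradient boosting classifier does): sweep one predictor over a grid while holding the others at their observed values for each observation, center each curve, and average the N curves to obtain the one-dimensional partial dependence.

```python
import numpy as np

def ice_and_partial_dependence(model, X, j, grid_size=50):
    """Return (grid, curves[N, grid], pdp[grid]) for predictor column j.

    curves holds one centered score curve per observation (Curves 1 to N);
    pdp is their sample average, the one-dimensional partial dependence.
    """
    X = np.asarray(X, dtype=float)
    grid = np.linspace(X[:, j].min(), X[:, j].max(), grid_size)
    curves = np.empty((X.shape[0], grid_size))
    for k, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, j] = v                                # fix x_j to the grid value for every row
        curves[:, k] = model.decision_function(X_mod)  # score on the log(Odds) scale
    curves -= curves.mean(axis=1, keepdims=True)       # center each curve
    return grid, curves, curves.mean(axis=0)
```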

Page 19: Imposing Domain Knowledge on Algorithmic Learning

One-dimensional Partial Dependence Plots: Examples from Case Study

[Figure: for Debt-to-income ratio and for Number of recent inquiries, bin counts, Weight of Evidence, and partial dependence (higher the better), each including a Missing bin. Both plots show ranges where the partial dependence increases, annotated "Concerned about increasing score?"]

Page 20: Imposing Domain Knowledge on Algorithmic Learning

Two-dimensional Partial Dependence Plot

[Figure: two-dimensional partial dependence surface over Debt-to-income ratio and Property value.]

Interaction between Debt-to-income ratio and Property value: Property value matters less for larger Debt-to-income ratios.
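A corresponding sketch for a two-dimensional partial dependence surface over a pair of predictors such as Debt-to-income ratio and Property value, under the same assumptions as the one-dimensional sketch above: fix both predictors to grid values and average the score over all observations.

```python
import numpy as np

def partial_dependence_2d(model, X, j, k, grid_size=20):
    """Average score over the sample on a (grid_size x grid_size) grid of (x_j, x_k)."""
    X = np.asarray(X, dtype=float)
    grid_j = np.linspace(X[:, j].min(), X[:, j].max(), grid_size)
    grid_k = np.linspace(X[:, k].min(), X[:, k].max(), grid_size)
    surface = np.empty((grid_size, grid_size))
    for a, vj in enumerate(grid_j):
        for b, vk in enumerate(grid_k):
            X_mod = X.copy()
            X_mod[:, j] = vj
            X_mod[:, k] = vk
            surface[a, b] = model.decision_function(X_mod).mean()
    return grid_j, grid_k, surface   # e.g. visualize with matplotlib's contourf
```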

Page 21: Imposing Domain Knowledge on Algorithmic Learning

Pros and Cons of Algorithmic Learning

Pros

Objective, no strong assumptions

Accurate fit to historic data

Close to push-button procedures

Informative diagnostics, considerable insights

Cons

Predictive relationships may not be palatable

No obvious way to impose domain expertise

May not be deployable

Page 22: Imposing Domain Knowledge on Algorithmic Learning

Agenda

» Algorithmic Learning Illustration

» Case Study - Modeling US Home Equity Data

» Diagnosing Complex Models

» Algorithmic Learning-Informed Construction of Palatable Scorecards

» Summary

Page 23: Imposing Domain Knowledge on Algorithmic Learning

Alternative Modeling Strategies to Fit a Scorecard

Structure of score formula

Score = Sum of staircase functions in 12 binned predictors

Categorical and missing values receive their own bins

Continuous predictors are binned into approximately equal-size bins

Comparing two fitting objectives

1. Approximate log(Odds) of being “Good”

• Penalized Bernoulli Likelihood

2. Approximate ensemble score from Stochastic Gradient Boosting

• Penalized Least Squares

Constraints on score formula

Apply linear inequality constraints between the score weights of neighboring bins covering these intervals:

Debt ratio in [5%, inf)

Number of inquiries in [0, inf)

to force monotonically decreasing patterns

Constraining staircase functions is an effective approach to avoid

counterintuitive score weight patterns

[Chart: 5-fold cross-validated AUC versus average number of bins per characteristic (4, 5, 8, 12, 25, 68, 117), comparing scorecards fit to approximate log(Odds) with scorecards fit to approximate the SGB score.]
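A minimal sketch of the second fitting objective under stated assumptions (this is not FICO's scorecard engine): approximate the SGB ensemble score by penalized least squares over bin-indicator columns, with monotonicity imposed through linear inequality constraints between the score weights of neighboring bins. X_bins, sgb_score, and mono_pairs are assumed inputs.

```python
import numpy as np
from scipy.optimize import minimize

def fit_constrained_scorecard(X_bins, sgb_score, mono_pairs, ridge=1.0):
    """Fit staircase score weights to the ensemble score.

    X_bins     : (N, K) 0/1 matrix of bin indicator columns
    sgb_score  : (N,) ensemble score to approximate
    mono_pairs : list of (hi, lo) column pairs requiring weight[hi] <= weight[lo]
                 (a monotonically decreasing pattern across neighboring bins)
    """
    N, K = X_bins.shape

    def objective(w):                      # penalized least squares
        resid = X_bins @ w - sgb_score
        return resid @ resid / N + ridge * (w @ w)

    constraints = [                        # w[lo] - w[hi] >= 0
        {"type": "ineq", "fun": lambda w, hi=hi, lo=lo: w[lo] - w[hi]}
        for hi, lo in mono_pairs
    ]
    res = minimize(objective, x0=np.zeros(K), method="SLSQP",
                   constraints=constraints)
    return res.x                           # score weight per bin
```

The deployed score for a record is then the sum of the weights of the bins it falls into, i.e. the staircase form described above.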

Page 24: Imposing Domain Knowledge on Algorithmic Learning

Conjecture About Statistical Benefit of Approximating Ensemble Score

» The ensemble score can be seen as a well-smoothed version of the original dependent variable: it carries less noise than the original target and is also approximately unbiased

We conjecture that the variance of the estimates is reduced by predicting the ensemble score instead of the binary target; as a result, score performance is improved

Note:

» Using penalized MLE to approximate the original target also reduces the variance of the estimates, but empirically we find this approach to be less effective

Page 25: Imposing Domain Knowledge on Algorithmic Learning

Constrained Scorecards Address Palatability Issues

[Figure: for Debt-to-income ratio and Number of recent inquiries, the unconstrained SGB partial dependence patterns (less palatable) next to the constrained scorecard score weight patterns (more palatable).]

Page 26: Imposing Domain Knowledge on Algorithmic Learning

Capturing Interactions With Segmented Scorecards: A 2-stage Approach for Growing Palatable Scorecard Trees

Stage I
» Use algorithmic learning to generate the ensemble score
» Generate model diagnostics

Stage II
» Recursively grow a segmented scorecard tree to approximate the ensemble score
» Guide split decisions by interaction diagnostics
» May impose domain knowledge via constraints on the score formula

[Diagram: Data feeds the ensemble model (Tree 1, Tree 2, ..., Tree M), which produces the ensemble score and model diagnostics. The ensemble score, the diagnostics, and domain knowledge drive a segmentation tree with example splits x6 > 630, x1 > 7, x2 > 3, x9 < 55, whose leaves are palatable scorecards s/c 1 through s/c 5; the segmented scorecard model is then deployed.]

Page 27: Imposing Domain Knowledge on Algorithmic Learning

CART-like Greedy Recursive Scorecard Segmentation Informed by Algorithmic Learning and Domain Knowledge

[Diagram: a parent population is split into Sub-pop 1 and Sub-pop 2, each receiving its own scorecard (Scorecard 1, Scorecard 2) with its own score weights.]

» The interaction test statistics ({DEBTINC, VALUE, CLNO, ...}) provide the list of segmentation candidates (domain knowledge can override)
» Fit (constrained) scorecards as palatable approximations of the ensemble score
» Develop candidate scorecards for tentative splits and pick the winner*, if any
» Apply recursively until no improvement, or until reaching a low count threshold

*Need an objective function to decide on splitting:

» Least Squares is consistent with the scorecard fit objective, but other metrics (e.g. AUC with respect to the original target) can also be used to make split decisions

» Best practice is to evaluate the splitting objective on a validation set (distinct from the test set)
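A schematic sketch of the recursion described on this page. The helper interfaces are hypothetical (fit_scorecard returns an object with a predict method, split predicates return boolean row masks), and the splitting objective here is validation-set least squares against the ensemble score, consistent with the scorecard fit objective.

```python
import numpy as np

def grow_tree(X, s, Xv, sv, splits, fit_scorecard, min_count=1000):
    """Recursively grow a segmented scorecard tree approximating the ensemble score s.

    X, s   : development predictors and ensemble score for this node
    Xv, sv : validation predictors and ensemble score (used for split decisions)
    splits : candidate split predicates, each a callable X -> boolean mask
    """
    node_model = fit_scorecard(X, s)                       # (constrained) scorecard for this node
    base_err = np.mean((node_model.predict(Xv) - sv) ** 2)
    best_gain, best_split = 0.0, None

    for split in splits:
        m, mv = split(X), split(Xv)
        if min(m.sum(), (~m).sum()) < min_count:
            continue                                       # low-count threshold
        left = fit_scorecard(X[m], s[m])
        right = fit_scorecard(X[~m], s[~m])
        pred = np.where(mv, left.predict(Xv), right.predict(Xv))
        gain = base_err - np.mean((pred - sv) ** 2)        # validation least-squares improvement
        if gain > best_gain:
            best_gain, best_split = gain, split

    if best_split is None:                                 # no winning split: this node is a leaf
        return {"scorecard": node_model}
    m, mv = best_split(X), best_split(Xv)
    return {
        "split": best_split,
        "left":  grow_tree(X[m],  s[m],  Xv[mv],  sv[mv],  splits, fit_scorecard, min_count),
        "right": grow_tree(X[~m], s[~m], Xv[~mv], sv[~mv], splits, fit_scorecard, min_count),
    }
```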

Page 28: Imposing Domain Knowledge on Algorithmic Learning

All Modeling Strategies Compared: Cross-validated Alternative Approaches on the Same Folds

[Chart: 5-fold cross-validated AUC for Stochastic Gradient Boosting (additive, 2 leaves; with interactions, 15 leaves) and for single and segmented scorecards fit with the 2-stage approach (approximating the SGB score) versus fit to approximate log(Odds).]

Page 29: Imposing Domain Knowledge on Algorithmic Learning

Results With Segmented Scorecards

Case study experiences

» Split decisions are noisy, but more stable when approximating SGB score instead of log(Odds)

» Various tree alternatives come up during cross-validation; finding "the" best tree is an illusion

» The 2-stage approach is effective at producing segmented scorecards with performance close to SGB, although a gap remains

[Figure: example result from the 2-stage approach.]

Our experiences with big data sets

» We applied the 2-stage approach to credit scoring and fraud problems with several million observations and 100+ predictors, and often find segmentations outperforming single scorecards by a few percent in AUC and KS

» Restricting segmentation candidates based on SGB interaction diagnostics reduces hypothesis space and accelerates tree search

» Automated segmentations tend to match intuition (clean/delinquent, young/old, …) and shave weeks off tedious manual segmentation analysis

» Segmented scorecards at times achieve close to SGB performance, at other times there remains a gap

Academic research on scorecard segmentation

» Segmentation sometimes but not always helps [5], [6]

Page 30: Imposing Domain Knowledge on Algorithmic Learning

Summary

» Construction of accurate, deployable credit scoring models faces a unique challenge: find a highly predictive model within a flexible yet palatable model family

» Constrained (segmented) scorecards can meet this challenge

» The proposed 2-stage approach combines algorithmic learning and domain knowledge to inform the search for highly predictive, deployable (segmented) scorecards

» We are testing it on projects across credit scoring, application fraud, and insurance claims fraud, and see good potential to increase the effectiveness of practical scoring models

Page 31: Imposing Domain Knowledge on Algorithmic Learning

THANK YOU

Page 32: Imposing Domain Knowledge on Algorithmic Learning

References

[1] Random Forests, by Leo Breiman, Machine Learning, Volume 45, Issue 1 (Oct., 2001), pp. 5-32.

[2] Stochastic Gradient Boosting, by Jerome Friedman, (March 1999).

[3] Greedy Function Approximation: A Gradient Boosting Machine, by Jerome Friedman, The Annals of Statistics, Volume 29, Number 5 (2001), pp. 1189-1232.

[4] Predictive Learning Via Rule Ensembles, by Jerome Friedman and Bogdan Popescu, Ann. Appl. Stat. Volume 2, Number 3 (2008), pp. 916-954.

[5] Does Segmentation Always Improve Model Performance in Credit Scoring?, by Katarzyna Bijak and Lyn Thomas, Expert Systems with Applications, Volume 39, Issue 3 (Feb. 2012), pp. 2433-2442.

[6] Optimal Bipartite Scorecards, by David Hand et. al., Expert Systems with Applications, Volume 29, Issue 3 (Oct. 2005), pp. 684-690.