
Page 1

Selection Bias in Credit Scorecard Evaluation

David J. Hand (1,2) and Niall M. Adams (1,3)

(1) Department of Mathematics, Imperial College London

(2) Winton Capital Management

(3) Heilbronn Institute for Mathematical Research, University of Bristol

September 2012

Page 2

Contents

- Selection bias problems
- Scorecard evaluation
- Demonstration and simulations
- Real data example
- A step toward a solution
- Conclusion

Page 3

Selection Bias

A perennial problem arising from making inferential statements about a population based on a non-random sample from that population.

Familiar example: reject inference → the problem of constructing a scorecard for the entire population, when the available data refer only to previously accepted customers.

The scorecard may not perform well on the whole population, due to these “holes” in the data space.

Reject inference methods attempt to infer the good/bad status of previously rejected applicants → that is, to fill in the data space.

Can’t get something for nothing, so RI methods attempt to find a source for this missing information. Sadly, the same will be true of this talk!

Page 4

Scorecard Evaluation

This talk is concerned with the impact of selection bias on comparative application scorecard evaluation.

Scorecard evaluation is essential to the industry. Performance deteriorates over time, owing to changes in

- the applicant population
- the economic climate
- the competitive environment

Scorecards are updated regularly. Such updating involves comparing the performance of a new scorecard against an existing scorecard, to determine whether the new one is superior.

Moreover, each stage of the scorecard construction process (such as determining whether a new characteristic improves performance) can be regarded as a comparative performance evaluation.

Page 5

Framework

Suppose we have an application scorecard (the old scorecard) and we wish to see how another (the new scorecard) performs relative to the old.

Apply both scorecards to the same sample of customers. These customers have been accepted by the old scorecard, and this produces an asymmetry, which in turn has an unfortunate effect on popular performance metrics.

Page 6

Punchline: selection bias in this framework can lead to incorrectly favouring a new scorecard over an existing one, raising the costs and risks of unnecessarily replacing the old scorecard with the new.

Page 7

Demonstration and Simulations

Notation:

- f(s_o, s_n): joint density of old and new scores for the bad class
- g(s_o, s_n): joint density of old and new scores for the good class
- f_o, f_n, g_o, g_n: the corresponding marginal densities
- F_o, F_n, G_o, G_n: the corresponding marginal cumulative distribution functions

Assume (stochastic dominance):

- F_o(s) > G_o(s)
- F_n(s) > G_n(s)

This is not restrictive; it implies that the ROC curve never crosses the chance diagonal.

Page 8

Application scorecards select customers by comparing the score s with a threshold t:

$$\text{decision} = \begin{cases} \text{accept} & s > t \\ \text{reject} & \text{otherwise} \end{cases}$$

The selection process is based on the old scorecard, so the test sample contains only applicants with score pairs

$$\{(s_o, s_n) : s_o > t\}$$

Broadly, the truncation of the old scorecard is absolute, while the truncation of the new scorecard depends on the correlation between the old and new scorecards.

Page 9

The precise impact of this truncation, and the size of the resulting bias, will depend on the shape of the various score distributions.

To illustrate the effect, let the joint distribution of old and new scores be bivariate normal in each class, with identical marginal score distributions for the old and new scorecards; the good and bad classes differ only in mean.

This means that the old and new scorecards produce identical performance measures when applied to the entire population: neither is better than the other.

For simplicity, assume the common mean of the new and old bad score distributions is zero, the common mean of the good score distributions is µ > 0, and the common standard deviation of all the distributions is σ = 1.
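Not part of the original slides: a minimal Python sketch of this setup, assuming illustrative values for µ, ρ, and the sample size. Later sketches build on it.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_scores(n, mean, rho):
    """Draw n correlated (old, new) score pairs with a common mean,
    unit variances, and correlation rho (the bivariate normal setup)."""
    cov = [[1.0, rho], [rho, 1.0]]
    return rng.multivariate_normal([mean, mean], cov, size=n)

mu, rho = 2.0, 0.5                      # assumed values for illustration
bads = draw_scores(10_000, 0.0, rho)    # bad class: common mean 0
goods = draw_scores(10_000, mu, rho)    # good class: common mean mu
s_old_bad, s_new_bad = bads.T
s_old_good, s_new_good = goods.T
```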

Page 10

Thus,

$$f(s_o, s_n) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)}\left(s_o^2 + s_n^2 - 2\rho s_o s_n\right) \right\}$$

where ρ is the correlation between the old and new scores, and

$$g(s_o, s_n) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)}\left[(s_o-\mu)^2 + (s_n-\mu)^2 - 2\rho(s_o-\mu)(s_n-\mu)\right] \right\}$$

[Figure: contour plots of the joint score densities for the bad and good classes, with the old score on the horizontal axis and the new score on the vertical axis.]

Page 11

Old Scorecard

For the old scorecard, when the distribution of scores is truncated at t, so that only those with s > t are retained, the marginal density for bad cases is

$$f_{ot}(s) = \frac{\phi(s)}{1-\Phi(t)} = \frac{\phi(s)}{\Phi(-t)}, \quad s > t$$

where

- φ is the standard normal PDF
- Φ is the standard normal CDF

Similarly, the marginal (truncated) score density for good cases is

$$g_{ot}(s) = \frac{\phi(s-\mu)}{1-\Phi(t-\mu)} = \frac{\phi(s-\mu)}{\Phi(\mu-t)}, \quad s > t$$

Page 12

New Scorecard

The marginal distribution of the new scorecard for the bads is

$$f_{nt}(s) = \frac{\phi(s)\left\{1-\Phi\left(\frac{t-\rho s}{\sqrt{1-\rho^2}}\right)\right\}}{1-\Phi(t)} = \frac{\phi(s)\,\Phi\left(\frac{\rho s-t}{\sqrt{1-\rho^2}}\right)}{\Phi(-t)}$$

and for the goods

$$g_{nt}(s) = \frac{\phi(s-\mu)\left\{1-\Phi\left(\frac{t-\mu-\rho(s-\mu)}{\sqrt{1-\rho^2}}\right)\right\}}{1-\Phi(t-\mu)} = \frac{\phi(s-\mu)\,\Phi\left(\frac{\mu+\rho(s-\mu)-t}{\sqrt{1-\rho^2}}\right)}{\Phi(\mu-t)}$$

These are particular cases of skew-normal distributions (e.g. Cohen, 1991).
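As a sanity check, again not in the original slides, the following sketch compares the analytic truncated density with the empirical distribution of simulated scores surviving the truncation; all parameter values are assumptions.

```python
import numpy as np
from scipy.stats import norm

mu, rho, t = 2.0, 0.5, 0.5  # assumed parameter values

def truncated_new_density(s, mean):
    """f_nt (mean=0, bads) or g_nt (mean=mu, goods) from the slide."""
    num = norm.pdf(s - mean) * norm.cdf((mean + rho * (s - mean) - t) / np.sqrt(1 - rho**2))
    return num / norm.cdf(mean - t)

# Empirical version: draw (old, new) good-class pairs, keep old score > t.
rng = np.random.default_rng(1)
cov = [[1, rho], [rho, 1]]
pairs = rng.multivariate_normal([mu, mu], cov, size=200_000)
kept_new = pairs[pairs[:, 0] > t, 1]

# Histogram estimate of the truncated new-score density vs the formula.
hist, edges = np.histogram(kept_new, bins=60, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - truncated_new_density(centres, mu))))  # should be small
```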

Page 13

Performance measures

The densities above have corresponding CDFs, denoted F_{ot}, F_{nt}, G_{ot}, G_{nt}.

This simple framework provides the chance to reason about the extent of overlap between the good and bad score distributions, noting that on the whole population the overlap is the same for both scorecards.

Three standard measures, given distribution functions F and G:

- KS statistic: $\max_s |F(s) - G(s)|$
- AUC: $\int F(s)\,dG(s)$
- H measure: a coherent performance measure, the same for all scorecards (see Hand (2009) for details)
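To make these concrete, a short sketch, my own addition, computing the empirical KS statistic and AUC from samples of bad and good scores; the H measure needs the cost-weighting machinery of Hand (2009) and is omitted here.

```python
import numpy as np
from scipy.stats import ks_2samp, rankdata

def ks_stat(bad, good):
    """Empirical KS distance max_s |F(s) - G(s)|."""
    return ks_2samp(bad, good).statistic

def auc(bad, good):
    """Empirical AUC = P(bad score < good score), computed from the
    Mann-Whitney rank statistic."""
    ranks = rankdata(np.concatenate([bad, good]))
    u = ranks[len(bad):].sum() - len(good) * (len(good) + 1) / 2
    return u / (len(bad) * len(good))

# Example with synthetic scores (means 0 and 2, as in the setup above):
rng = np.random.default_rng(0)
bad, good = rng.normal(0, 1, 10_000), rng.normal(2, 1, 10_000)
print(ks_stat(bad, good), auc(bad, good))
```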

Page 14

Since we are concerned with the difference between the score distributions, define

- $P_{KS} = \max_s |F_{ot}(s) - G_{ot}(s)| - \max_s |F_{nt}(s) - G_{nt}(s)|$
- $P_{AUC} = \int F_{ot}(s)\,dG_{ot}(s) - \int F_{nt}(s)\,dG_{nt}(s)$
- $P_H$: the corresponding difference in H measures

In all cases, negative values imply that the new scorecard is better than the old scorecard.
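The following sketch, an addition with assumed parameter values, shows the single-truncation effect directly: both scorecards are identical on the full population, yet P_AUC computed on the truncated sample comes out negative, and increasingly so as t grows.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(2)

def sample_class(n, mean, rho):
    """Draw n (old, new) score pairs with the given common mean."""
    return rng.multivariate_normal([mean, mean], [[1, rho], [rho, 1]], size=n)

def auc(bad, good):
    """Rank-based empirical AUC, as in the previous sketch."""
    ranks = rankdata(np.concatenate([bad, good]))
    u = ranks[len(bad):].sum() - len(good) * (len(good) + 1) / 2
    return u / (len(bad) * len(good))

mu, rho, n = 2.0, 0.2, 200_000                 # assumed values
bad, good = sample_class(n, 0.0, rho), sample_class(n, mu, rho)

for t in (0.1, 0.5, 0.9):
    # Keep only applicants accepted by the old scorecard: s_old > t.
    b, g = bad[bad[:, 0] > t], good[good[:, 0] > t]
    p_auc = auc(b[:, 0], g[:, 0]) - auc(b[:, 1], g[:, 1])
    print(t, round(p_auc, 3))  # negative: the new scorecard looks better
```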

Page 15

Simulation results I

Here, µ = 2 and t varies. Each plot displays a performance measure, with decreasing lines corresponding to increasing correlation (left: KS; right: AUC; bottom: H).

Page 16

In all cases, as t increases, the preference for the new scorecard increases. Recall that this is only an artifact of the biased sample: with the full population, all the distributions have the same overlap.

As expected, the effect becomes more marked with decreasing correlation between the old and new scores.

Page 17

Simulation results II

It is interesting to consider how performance varies with µ. The following tables refer to AUC, but the other measures exhibit similar behaviour.

Note that the choices of µ are selected to explore the range of performance. For example:

µ   | Population AUC
----|---------------
1.9 | 0.67
2.2 | 0.76
2.5 | 0.83
2.8 | 0.88

Perhaps µ = 2.2 is most representative of typical application scorecards?

Page 18

AUC results

ρ = 0.2:

t   | µ = 1.9 | µ = 2.2 | µ = 2.5 | µ = 2.8
----|---------|---------|---------|--------
0.1 | -0.112  | -0.073  | -0.033  | -0.021
0.5 | -0.185  | -0.129  | -0.062  | -0.039
0.9 | -0.280  | -0.209  | -0.110  | -0.073

ρ = 0.8:

t   | µ = 1.9 | µ = 2.2 | µ = 2.5 | µ = 2.8
----|---------|---------|---------|--------
0.1 | -0.003  | -0.016  | -0.005  | -0.003
0.5 | -0.005  | -0.034  | -0.014  | -0.008
0.9 | -0.008  | -0.064  | -0.032  | -0.020

Page 19

The effect is

- again stronger for low correlation than for high correlation;
- more profound the less separable the distributions are (this is expected: as the problem gets easier, the difference becomes smaller).

If we consider µ = 2.2, then the effect might be strong enough, even for high correlation, to encourage us to change scorecard?

In that case, we have been misled, simply by a selection effect.

Replacing the scorecard incurs costs for software and staff retraining. Perhaps better to err on the side of caution?

Page 20

Real Data Example

- UPL data from a major UK bank
- 66K cases on 30 variables
- data pre-scored and subject to previous variable selection
- clear evidence of population drift (examples in Adams et al. (2010))

Population drift is a potential confounder of this bias. To remove the effect of drift, the order of the data was randomised.

Note that this is simply meant to illustrate that the bias demonstrated above is present in real data.

Page 21

Old model (LR), built on one year's worth of data. 11 variables, including

- age indicators
- loan amount and indicators
- search indicator
- times: address, bank, employment

Now consider the second year of data.

Select the 80% of applicants with the best scores from the old model. This set of applicants constitutes the test set for comparing the new model with the old.
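The bank data are not public, so the following sketch only illustrates the selection step on synthetic stand-ins; the array names and the way the scores are generated are assumptions, not the slides' data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-ins for the year-2 scores (the real data are private).
old_scores = rng.normal(size=66_000)
new_scores = 0.37 * old_scores + rng.normal(size=66_000)  # roughly correlated

# Keep the 80% of applicants with the best old-model scores as the test set.
threshold = np.quantile(old_scores, 0.20)
selected = old_scores > threshold
test_old, test_new = old_scores[selected], new_scores[selected]
print(selected.mean())  # ~0.80
```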

Page 22

New model (LR), built on the same data. 15 variables, including

- age
- search indicators
- address indicators
- employment indicators
- card status indicators

Overall correlation of scores: 0.37

- selected sample: AUC(old) = 0.633, AUC(new) = 0.661
- full test data: AUC(old) = 0.681, AUC(new) = 0.676

Again, this is not meant to be a realistic scorecard construction exercise. The effect of the bias here is small, but still misleading.

Page 23

Symmetric fix?

Consider

$$\{(s_o, s_n) : s_o > t,\; s_n > t'\}$$

By this means, neither scorecard is evaluated on the entire population of scores (which is impossible!), but the truncation is applied in the same manner to both scorecards. This may reduce the relative selection distortion.

This correction is aimed at reducing the bias in favour of the new scorecard, to support a fairer comparison.
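A sketch of this double truncation, again an illustrative addition continuing the simulated setup, with t′ chosen so that both scorecards exclude the same proportion of applicants:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(4)

def auc(bad, good):
    """Rank-based empirical AUC, as in the earlier sketches."""
    ranks = rankdata(np.concatenate([bad, good]))
    u = ranks[len(bad):].sum() - len(good) * (len(good) + 1) / 2
    return u / (len(bad) * len(good))

mu, rho, n = 2.0, 0.4, 200_000               # assumed values
cov = [[1, rho], [rho, 1]]
bad = rng.multivariate_normal([0, 0], cov, size=n)
good = rng.multivariate_normal([mu, mu], cov, size=n)
scores = np.vstack([bad, good])

# Thresholds excluding the same proportion (20%) on each scorecard.
t = np.quantile(scores[:, 0], 0.20)
t_prime = np.quantile(scores[:, 1], 0.20)

keep = (scores[:, 0] > t) & (scores[:, 1] > t_prime)
kept_bad, kept_good = bad[keep[:n]], good[keep[n:]]

# Old - new AUC difference; near zero here, since by symmetry the
# double truncation treats the two (identical) scorecards alike.
print(auc(kept_bad[:, 0], kept_good[:, 0]) - auc(kept_bad[:, 1], kept_good[:, 1]))
```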

Page 24

Apply this idea to the real data. Let t and t′ be such that they exclude the same proportion (20%, as before) in each class. We get:

Method            | AUC difference (old − new) × 100
------------------|---------------------------------
Whole population  | 0.5
Single truncation | -2.8
Double truncation | 1.5

- Yes, the differences are small
- Single truncation suggests the wrong answer
- Double truncation gets the difference in the right direction, correctly showing that the old scorecard is superior

Page 25

Conclusion

- We have demonstrated another problem arising from selection bias: a spurious preference for the new scorecard in comparative evaluations.
- This is not a question of estimation uncertainty, but a fundamental bias induced by the selection effect.
- Like other selection bias problems, any solution would require extra information, and solutions are unlikely to be procedurally palatable.
- There is an interesting interaction between this problem and population drift. What other sources contribute to complicating these comparisons?

Page 26

References

Hand, D.J. (2009) Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77, 103-123.

Hand, D.J. and Adams, N.M. (2012) Selection bias in credit scorecard evaluation. Journal of the Operational Research Society, under revision.

Adams, N.M., Tasoulis, D.K., Anagnostopoulos, C. and Hand, D.J. (2010) Temporally-adaptive linear classification for handling population drift in credit scoring. In COMPSTAT 2010: Proceedings of the 19th International Conference on Computational Statistics, Lechevallier, Y. and Saporta, G. (Eds), Springer, 167-176.

Cohen, A.C. (1991) Truncated and Censored Samples: Theory and Applications. Marcel Dekker, New York.
