
Post on 30-Jun-2020


How to Combat Modeling Hurdles

1. Compensate for needles in the haystack
2. Pre-process your data carefully
3. Be aware of institutional challenges
4. Set your misclassification costs
5. Increase model acceptance with visualization
6. Insist on accurately labeled historical data
7. Look for collusion
8. Prepare for an ever-changing landscape

1. Compensate for Needles in the Haystack

- Up-sample / down-sample data to obtain balanced training data
  - Pitfall 1: up-sampling exaggerates the importance of particular instances of the rare event
  - Pitfall 2: not partitioning into training and testing FIRST
- Importance of baselining
  - Is a model that’s right 99.9% of the time a good one? That depends...
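Both pitfalls and the baselining question can be illustrated with a minimal sketch. It assumes records are plain dictionaries with a hypothetical `is_fraud` label, and the 1-in-1,000 fraud rate is invented for illustration. The data is partitioned first, only the training portion is up-sampled, and the accuracy of always predicting the majority class serves as the baseline.

```python
import random

def split_then_upsample(records, label_key="is_fraud", test_fraction=0.3, seed=0):
    """Partition FIRST, then up-sample the rare class in the training set only."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    rare = [r for r in train if r[label_key]]
    common = [r for r in train if not r[label_key]]
    if not rare:
        raise ValueError("no rare-event examples landed in the training set")
    # Draw rare cases with replacement until the two classes are the same size.
    # Each rare case is now repeated many times, which is exactly Pitfall 1.
    upsampled = [rng.choice(rare) for _ in range(len(common))]
    balanced_train = common + upsampled
    rng.shuffle(balanced_train)
    return balanced_train, test

def majority_baseline_accuracy(records, label_key="is_fraud"):
    """Accuracy of a 'model' that always predicts the most common class."""
    positives = sum(1 for r in records if r[label_key])
    return max(positives, len(records) - positives) / len(records)

# Hypothetical data: one fraud case per 1,000 records.
records = [{"id": i, "is_fraud": i % 1000 == 0} for i in range(10_000)]
train, test = split_then_upsample(records)
baseline = majority_baseline_accuracy(records)
print(f"baseline accuracy: {baseline:.3f}")
```

With this invented fraud rate, always answering "not fraud" is already right 99.9% of the time, so a model with that accuracy has not demonstrated anything yet. Because the test set is carved out before up-sampling, no duplicated rare case can appear on both sides of the split.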

2. Pre-Process Your Data Carefully

* First name has been changed

Fuzzy Matching is Crucial


- Borrow techniques from text mining
- Change GUI inputs to restrict entries
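As one possible illustration of fuzzy matching, the sketch below normalizes free-text names and compares them with `difflib.SequenceMatcher` from the Python standard library. The helper names, the sample names, and the 0.85 threshold are assumptions made for this example, not values from the talk; a real threshold would need tuning on the data at hand.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def likely_same(a, b, threshold=0.85):
    """Treat two entries as the same entity when they are similar enough."""
    return similarity(a, b) >= threshold

# Hypothetical entries typed into a free-text field.
print(likely_same("Jon Smith", "JOHN  SMITH."))  # near-duplicate spelling
print(likely_same("Jon Smith", "Mary Jones"))    # unrelated names
```

Restricting GUI inputs attacks the same problem at the source: fewer free-text fields means fewer variants to reconcile afterwards.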

3. Be Aware of Institutional Challenges


- PR considerations
  - Sensitivity to customer relations
- Understanding that the models are not deciding fraud, but flagging cases as suspect
  - Internal jobs, budget concerns
  - “Automated screening process”
- Convince data keepers of the importance of maintaining data for fraud detection purposes

4. Set Your Misclassification Costs

- Terminology:
  - specificity/sensitivity (biometrics, fraud)
  - type I/II (statistics)
  - false positive/negative (medicine)
  - precision/recall (information retrieval)
  - false alarm/false dismissal (fraud, others)

                 Prediction
Truth            Not Flagged       Flagged
Not Fraud        True negative     False positive
Fraud            False negative    True positive
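The vocabularies listed above all describe the same four cells. A minimal sketch of how they relate, assuming boolean lists of true labels and flags (the sample data is invented) and the usual convention that "not fraud" is the null hypothesis, so a false positive is a type I error and a false negative is a type II error:

```python
def confusion_counts(truth, flagged):
    """Count the four cells of the truth-versus-prediction table."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for is_fraud, was_flagged in zip(truth, flagged):
        if is_fraud:
            counts["tp" if was_flagged else "fn"] += 1
        else:
            counts["fp" if was_flagged else "tn"] += 1
    return counts

def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def rates(c):
    """Express the same counts in each field's terminology."""
    return {
        "sensitivity / recall": _ratio(c["tp"], c["tp"] + c["fn"]),
        "specificity": _ratio(c["tn"], c["tn"] + c["fp"]),
        "precision": _ratio(c["tp"], c["tp"] + c["fp"]),
        "false alarm rate (type I)": _ratio(c["fp"], c["fp"] + c["tn"]),
        "false dismissal rate (type II)": _ratio(c["fn"], c["fn"] + c["tp"]),
    }

# Hypothetical outcomes for six cases.
truth   = [False, False, False, False, True, True]
flagged = [False, False, True,  False, True, False]
counts = confusion_counts(truth, flagged)
summary = rates(counts)
```

Note that the false alarm rate is one minus specificity, and the false dismissal rate is one minus sensitivity, so the pairs carry the same information in different words.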


False Positives Can Range from Mildly Embarrassing…

…To Hugely Costly

Airport evacuation due to a bomb scare…

False Negatives: Not Always A Huge Deal…

Approving a $250 fraudulent insurance claim

But Sometimes Life or Death
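Because the two kinds of error carry such different costs, the flagging threshold can be chosen to minimize total cost rather than error count. The sketch below is one simple way to do that, assuming the model emits a numeric score per case; the scores, labels, and cost figures are invented for illustration.

```python
def total_cost(scores, truth, threshold, cost_fp, cost_fn):
    """Sum the cost of every false positive and false negative at a threshold."""
    cost = 0.0
    for score, is_fraud in zip(scores, truth):
        flagged = score >= threshold
        if flagged and not is_fraud:
            cost += cost_fp   # false alarm
        elif not flagged and is_fraud:
            cost += cost_fn   # false dismissal
    return cost

def best_threshold(scores, truth, cost_fp, cost_fn):
    """Try every observed score (plus 'flag nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(scores, truth, t, cost_fp, cost_fn))

# Hypothetical model scores and true labels.
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
truth  = [False, False, True, False, True, True]

# Missing fraud is ten times worse than a false alarm: flag aggressively.
aggressive = best_threshold(scores, truth, cost_fp=1, cost_fn=10)
# A false alarm is ten times worse than missed fraud: flag conservatively.
conservative = best_threshold(scores, truth, cost_fp=10, cost_fn=1)
```

The same model yields a lower threshold when false negatives dominate the cost and a higher one when false positives do, which is why the costs have to be set before the model is judged.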

5. Increase Model Acceptance with Visualization

- Present users with an intuitive and easy-to-use GUI to encourage use of model results
- Use interactive graphs and charts
- Explain how the model generated its results


Contracting with the USPS


USPS managed $33 Billion in contracts (FY2009)

RADR: Risk Assessment Data Repository

6. Insist on Accurately Labeled Historical Data


- Undetected fraud shows up in the training set as examples of non-fraud -> can confuse the model
- Identified fraud might not be in the training set
- Flagged cases that proved to be non-fraudulent are not recorded as such
- Institutional challenges with keepers of the data

7. Look for Collusion


Breakout Fraud:
- Collusion where every member flies “below the radar”.
- The group works in concert to commit one large act of fraud or several small ones.
- Perhaps 5 people pretending to be 100.
- Link analysis algorithms are very useful in detecting this type of fraud.
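Link analysis covers a broad family of techniques. One of the simplest forms, sketched here, groups accounts that share an identifying attribute and reports each connected group. The field names (`id`, `phone`, `address`) and the sample accounts are hypothetical, and this union-find approach is an illustrative stand-in rather than the method used in the talk.

```python
from collections import defaultdict

def linked_groups(accounts, link_keys=("phone", "address")):
    """Group accounts that are connected through any shared attribute value."""
    parent = {acct["id"]: acct["id"] for acct in accounts}

    def find(x):
        # Follow parent pointers to the group's representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    first_seen = {}  # (attribute name, value) -> first account id that used it
    for acct in accounts:
        for key in link_keys:
            value = acct.get(key)
            if value is None:
                continue
            owner = first_seen.setdefault((key, value), acct["id"])
            union(acct["id"], owner)

    groups = defaultdict(set)
    for acct_id in parent:
        groups[find(acct_id)].add(acct_id)
    # Only groups of two or more accounts are interesting as collusion candidates.
    return [members for members in groups.values() if len(members) > 1]

# Hypothetical accounts: A and B share a phone, B and C share an address.
accounts = [
    {"id": "A", "phone": "555-0100", "address": "1 Elm St"},
    {"id": "B", "phone": "555-0100", "address": "9 Oak Ave"},
    {"id": "C", "phone": "555-0199", "address": "9 Oak Ave"},
    {"id": "D", "phone": "555-0142", "address": "4 Pine Rd"},
]
groups = linked_groups(accounts)
```

Accounts A and C share nothing directly, yet they land in the same group through B. Indirect links of this kind are what make a few people posing as many visible, even when each individual account stays below the radar.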

8. Prepare for an Ever-changing Landscape


- Moving target: fraudsters are constantly refining and expanding their schemes
- Models must be very closely guarded
- Models must be updated often
- Subject matter expertise is crucial


Thank you. Questions? antonia@datamininglab.com

Antonia de Medinaceli

Antonia de Medinaceli is currently Director of Fraud Analytics at Elder Research, Inc., the nation’s largest independent data mining consultancy. Ms. de Medinaceli has applied Data Mining technologies to a range of projects, including direct marketing, web site personalization, pattern recognition in digital images, and financial analysis. In addition to her consulting experience, she has also co-taught Data Mining short courses with the Elder Research team. Her previous industry experience was largely focused on the design and implementation of algorithms for the optimization of large-scale systems. These projects included flight network optimization, data fusion, and efficiency improvements in manufacturing settings.
