active learning for fraud prevention

©2016 PayPal Inc. Confidential and proprietary.

Active Learning for Fraud Prevention

Venkatesh Ramanathan • May 21, 2016


Agenda

• Introduction
• Fraud Prevention
• Algorithm
• Experiments
• Conclusion


INTRODUCTION


About Me

• Software Engineer/Data Scientist/ML Researcher
• Ph.D. in Computer Science
• Research in Face Recognition, Phishing/Spam, Fraud Prevention

About PayPal

The power of our platform

• +2.5 million developers
• 4.9 billion payments/year
• ~300 payments/second at peak
• 184M active customer accounts
• 42 petabytes of data
• 4.5T database calls/quarter

PayPal operates one of the largest PRIVATE CLOUDS in the world

We have transformed core business processes into robust SERVICE-BASED PLATFORMS

Our technology transformation enables us to:
• Process payments at tremendous scale
• Accelerate the innovation of new products
• Engage world-class developers & technologists


FRAUD PREVENTION

Fraud Prevention @ PayPal

Robust feature engineering, machine learning and statistical models

Highly scalable and multi-layered infrastructure software

Superior team of data scientists, researchers, financial and intelligence analysts

Images source: images.google.com

Fraud Prevention @ PayPal

Transaction Level
• Employs advanced machine learning and statistical models to flag fraudulent behavior up-front
• Runs more sophisticated algorithms after the transaction is complete

Account Level
• Monitors account-level activity to identify abusive behavior
• Abusive patterns include frequent payments and suspicious profile changes

Network Level
• Monitors account-to-account interaction
• Example pattern: frequent transfers of money from several accounts to one central account
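The network-level pattern above (many accounts sending money to one central account) can be illustrated with a simple fan-in count. This is only a minimal sketch: the function name, the `min_senders` threshold and the sample transfers are hypothetical, and the slides do not describe how PayPal actually detects this pattern.

```python
from collections import defaultdict

def high_fan_in_accounts(transfers, min_senders=3):
    """Return receivers that got money from at least `min_senders` distinct senders.

    `transfers` is an iterable of (sender, receiver, amount) tuples.
    The threshold is a hypothetical illustration, not a production rule.
    """
    senders_by_receiver = defaultdict(set)
    for sender, receiver, _amount in transfers:
        senders_by_receiver[receiver].add(sender)
    return {
        receiver: len(senders)
        for receiver, senders in senders_by_receiver.items()
        if len(senders) >= min_senders
    }

# Made-up transfers: three distinct accounts pay into "hub".
transfers = [
    ("a1", "hub", 20.0),
    ("a2", "hub", 35.0),
    ("a3", "hub", 15.0),
    ("a1", "hub", 10.0),   # repeat sender, counted once
    ("a4", "b1", 50.0),
]
flagged = high_fan_in_accounts(transfers, min_senders=3)
```

In practice such a count would be one feature among many rather than a decision rule on its own.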

Fraud Prevention – What are we up against?

Fraudsters are becoming increasingly smart and adaptive

Need cost-effective solutions that can model complex attack patterns not previously observed

Need scalable and computationally efficient prediction models


Fraud Prevention – What are we up against?

• Much harder to get performance lift on our flagship models
• Need to re-look at all aspects of traditional model building
• Need out-of-the-box thinking


Area we are missing (AUC 0.96)


Fraud Prevention – What can we do to build better models?

[Diagram: training data matrix with rows d1, d2, …, dM and columns feature1 … featureN, Target (Label)]

• Better features
• Better labeling
• Advanced ML algorithms
• Bigger, better data


ALGORITHM – ACTIVE LEARNING


Active Learning – What is it?

• Supervised learning algorithms require labeled data
• Labelling is difficult, time-consuming and expensive: active learning to the rescue
• Idea – an ML algorithm can achieve better accuracy if it is allowed to “choose the data” from which it learns*
• Overcome the labelling bottleneck by asking a human to label selected queries (unlabeled instances)

[Diagram: active learning cycle with components Unlabeled Data, Labeled Data, Human Annotator and Machine Learning Model, and steps (Re)Build Model and Select Queries]

Source*: Burr Settles



• Scenarios
  • Membership Query Synthesis – request labels for ‘any’ unlabeled instance in input space
  • Stream-based Selective Sampling – unlabeled instances are drawn one at a time & the learner decides whether to discard or query each one
  • Pool-based Sampling – instances are queried from a pool according to an informativeness measure
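To make the stream-based scenario concrete, here is a minimal sketch in which instances arrive one at a time and the learner queries only those it is unsure about. The logistic scoring model, its weights and the `band` width around 0.5 are all hypothetical choices for illustration; the slides do not prescribe a particular decision rule.

```python
import math

def predict_proba(weights, bias, x):
    """Logistic model: estimated probability that instance x belongs to class 1."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def stream_selective_sampling(stream, weights, bias, band=0.2):
    """Query an instance if its score lies within `band` of 0.5, else discard it."""
    queried, discarded = [], []
    for x in stream:                     # instances are seen one at a time
        p = predict_proba(weights, bias, x)
        if abs(p - 0.5) <= band:         # model is unsure: ask for a label
            queried.append(x)
        else:                            # model is confident: skip labelling
            discarded.append(x)
    return queried, discarded

# Made-up stream of 2-feature instances and a made-up model.
stream = [(0.1, 0.0), (3.0, 2.0), (-2.5, -1.0), (0.0, -0.3)]
queried, discarded = stream_selective_sampling(stream, weights=(1.0, 1.0), bias=0.0)
```

Pool-based sampling differs only in that the learner scores the whole pool first and then picks the most informative instances, instead of deciding on each instance as it arrives.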




• Query Strategy Frameworks
  • Uncertainty Sampling
  • Query-By-Committee
  • Expected Model Change
  • Expected Error Reduction
  • Variance Reduction
  • Density Weighted Methods
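As an illustration of the first framework, uncertainty sampling, the sketch below implements three commonly used uncertainty scores (least confidence, margin and entropy) and picks the most uncertain instances from a pool. The function names and the sample probabilities are made up for this example; the slides do not say which score was used.

```python
import math

def least_confidence(probs):
    """1 minus the probability of the most likely class."""
    return 1.0 - max(probs)

def margin_uncertainty(probs):
    """1 minus the gap between the two most likely classes (higher = less certain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)

def entropy(probs):
    """Shannon entropy of the predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(pool_probs, k, score=entropy):
    """Indices of the k pool instances with the highest uncertainty score."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Made-up class-probability predictions for four unlabeled instances.
pool_probs = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30], [0.50, 0.50]]
picked = most_uncertain(pool_probs, k=2)
```

For binary problems the three scores rank instances identically; they only diverge when there are three or more classes.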



Active Learning – Toy Example

Toy data – 400 instances

• Model using random sampling – 70% accuracy
• Model using active learning (uncertainty sampling) – 90% accuracy
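The comparison above can be imitated with a much simpler, self-contained experiment. The sketch below is not the toy dataset from the slide and does not reproduce its 70% / 90% figures: it assumes 400 one-dimensional instances with a hypothetical hidden boundary, a threshold classifier, and a labelling budget of 10 queries, and compares pool-based uncertainty sampling with random sampling.

```python
import random

X = list(range(400))                       # 400 one-dimensional toy instances
y = [1 if x >= 149 else 0 for x in X]      # hypothetical hidden class boundary

def fit_threshold(labeled):
    """Put the threshold midway between the nearest labeled negative and positive."""
    hi_neg = max(X[i] for i in labeled if y[i] == 0)
    lo_pos = min(X[i] for i in labeled if y[i] == 1)
    return (hi_neg + lo_pos) / 2

def accuracy(threshold):
    """Fraction of all 400 instances classified correctly by `x > threshold`."""
    return sum((x > threshold) == bool(label) for x, label in zip(X, y)) / len(X)

def run(strategy, budget=10, seed=0):
    rng = random.Random(seed)
    labeled = [0, 399]                     # one seed label per class
    pool = [i for i in range(400) if i not in labeled]
    for _ in range(budget):
        t = fit_threshold(labeled)         # (re)build the model
        if strategy == "uncertainty":
            # query the unlabeled instance closest to the decision boundary
            pick = min(pool, key=lambda i: abs(X[i] - t))
        else:
            pick = rng.choice(pool)        # random sampling baseline
        pool.remove(pick)
        labeled.append(pick)               # the "annotator" reveals y[pick]
    return accuracy(fit_threshold(labeled))

active_acc = run("uncertainty")
random_acc = run("random")
```

In this one-dimensional setting uncertainty sampling behaves like a binary search for the boundary, which is why a small budget is enough; real fraud data is far less forgiving.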


Active Learning For Fraud Prevention – Why is it unique?

• Data is unbalanced
• Fraud labelling requires trained experts and can’t be outsourced
• Fraud labelling is time-consuming
• Fraud labelling requires more than just individual instances; it requires the transactions before & after
• Fraud labelling requires data from other entities (e.g., IP address)
• Fraud labelling requires aggregate data
• Fraud tags mature at different times (e.g., chargebacks) & are not instantaneous


Active Learning For Fraud Prevention – High Level Framework

Framework diagram components:

• Labeled Data
• Create Bags
• Deep Learning Model
• GBT Model
• (Re)Build Models
• Unlabeled Data
• Predict
• Query By Committee
• Human Expert
• Create Statistics
• Active Feature Engineering
• Simulate Features
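The Query By Committee step in the framework can be sketched as follows, assuming the committee consists of the two models named in the diagram (deep learning and GBT) and that disagreement is measured as the absolute difference between their fraud scores. The disagreement measure, the function names and the scores are assumptions for illustration; the slides do not specify them.

```python
def committee_disagreement(p_dl, p_gbt):
    """Absolute gap between the two committee members' fraud probabilities."""
    return abs(p_dl - p_gbt)

def select_queries(scores_dl, scores_gbt, k):
    """Indices of the k unlabeled instances on which the two models disagree most."""
    ranked = sorted(
        range(len(scores_dl)),
        key=lambda i: committee_disagreement(scores_dl[i], scores_gbt[i]),
        reverse=True,
    )
    return ranked[:k]

# Made-up predicted fraud probabilities for five unlabeled transactions.
scores_dl = [0.02, 0.90, 0.40, 0.75, 0.10]    # hypothetical deep learning scores
scores_gbt = [0.03, 0.20, 0.45, 0.70, 0.60]   # hypothetical GBT scores
to_label = select_queries(scores_dl, scores_gbt, k=2)   # sent to the human expert
```

Other disagreement measures, such as vote entropy over a larger committee, would slot into `committee_disagreement` without changing the rest of the loop.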


Modeling Algorithm – Deep Learning

[Diagram: Input Layer → Hidden Layers → Output Layer]

• If a network has many layers of non-linearity, it is “deep”
• Needs a scalable platform
• Needs lots of training data


Modeling Algorithm – Deep Learning

• Network topology – feed-forward
• Key parameters
  • # of hidden layers
  • # of neurons at each hidden layer
  • Regularization
  • Activation function
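To show how these parameters shape a feed-forward network, here is a minimal forward-pass sketch in plain Python. The layer sizes, random weights, ReLU hidden activation and sigmoid output are assumptions chosen for illustration, not the configuration used in the talk; regularization only affects training and is therefore omitted.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(n_inputs, hidden_sizes, seed=0):
    """Random weights for a feed-forward net with one sigmoid output unit.

    `hidden_sizes` sets both the number of hidden layers and the neurons in each.
    """
    rng = random.Random(seed)
    sizes = [n_inputs] + list(hidden_sizes) + [1]
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def forward(layers, x, activation=relu):
    """Propagate x layer by layer; hidden layers use `activation`, output uses sigmoid."""
    for idx, (weights, biases) in enumerate(layers):
        act = sigmoid if idx == len(layers) - 1 else activation
        x = [act(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x[0]

net = init_network(n_inputs=4, hidden_sizes=[8, 8])   # 2 hidden layers, 8 neurons each
score = forward(net, [0.2, -1.0, 0.5, 3.0])           # untrained output in (0, 1)
```

Swapping `activation=relu` for `math.tanh`, or changing `hidden_sizes`, changes the model capacity without touching the rest of the code.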


Modeling Algorithm – Gradient Boosting Trees

• GBT = Gradient Descent + Boosting
• Fit an additive (ensemble) model in a forward stage-wise manner
• In each stage, introduce a new model to compensate for the shortcomings of the existing models
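The stage-wise idea can be sketched in a few lines. The example below assumes squared-error loss, one-dimensional inputs and depth-1 trees (stumps), so each stage simply fits the current residuals, which are the negative gradient of the loss. These are simplifying assumptions; a fraud model would normally use a classification loss and deeper trees.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: best single split on 1-D inputs."""
    best = None
    for split in sorted(set(xs))[:-1]:          # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def fit_gbt(xs, ys, n_trees=50, learning_rate=0.1):
    """Forward stage-wise additive model: each stump fits the current residuals."""
    base = sum(ys) / len(ys)                    # stage 0: constant prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]   # negative gradient of 0.5*(y-p)^2
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict

# Made-up data: the target switches from 0 to 1 between x = 3 and x = 4.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_gbt(xs, ys)
```

The `n_trees` and `learning_rate` arguments trade off against each other: a smaller learning rate needs more stages to reach the same fit.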


Modeling Algorithm – Gradient Boosting Trees

• Strengths
  • No pre-processing required
  • Robust
  • Scalable
• Weaknesses
  • Overfits (need to find a proper stopping point)
  • Sensitive to noise
• Key parameters
  • # of trees
  • Max depth
  • Max observations
  • Learning rate


EXPERIMENTS


Datasets

• Training data
  • 1 year
  • 11 million transactions (1 million for active labelling)
• Test data
  • 4 months
  • 4 million transactions
• # of features
  • 500–600


Tools

• H2O
  • Open source
  • Scalable
  • Robust
  • Deep Learning & GBM implementations
• R
  • Open source
  • Active learning package

Early Results – Active Learning Shows Promise…

# of instances queried    AUC (*weighted)
0                         0.960
1,000                     0.961
10,000                    0.963
50,000                    0.971
100,000                   0.975
500,000                   0.977
1,000,000                 0.979
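One way to read the results table is in terms of the remaining gap to a perfect AUC of 1.0 (the "area we are missing" mentioned earlier). The short calculation below uses only the figures from the table; expressing the lift as a fraction of the baseline gap is an interpretation added here, not a metric reported in the slides.

```python
# (instances queried, weighted AUC) as read from the results table
results = [
    (0, 0.960),
    (1_000, 0.961),
    (10_000, 0.963),
    (50_000, 0.971),
    (100_000, 0.975),
    (500_000, 0.977),
    (1_000_000, 0.979),
]

baseline_auc = results[0][1]
baseline_gap = 1.0 - baseline_auc            # area missing before any queries

gap_closed = []
for queried, auc in results:
    fraction = (auc - baseline_auc) / baseline_gap
    gap_closed.append(fraction)
    print(f"{queried:>9,d}  AUC={auc:.3f}  share of baseline gap closed={fraction:.1%}")
```

With 1,000,000 queried instances the AUC moves from 0.960 to 0.979, which closes roughly 47.5% of the baseline gap of 0.040.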


CONCLUSIONS


Conclusions

• Deep learning & GBT have shown tremendous performance for fraud detection
• Active learning shows promise in improving the performance of these champion models
• Active learning also significantly reduces our labelling cost
