©2016 PayPal Inc. Confidential and proprietary.
Active Learning for Fraud Prevention
Venkatesh Ramanathan • May 21, 2016
Agenda
• Introduction
• Fraud Prevention
• Algorithm
• Experiments
• Conclusion
INTRODUCTION
About Me
• Software Engineer/Data Scientist/ML Researcher
• Ph.D. in Computer Science
• Research in Face Recognition, Phishing/Spam, Fraud Prevention
About PayPal
• +2.5 MILLION developers
• 4.9 BILLION payments/year
• ~300 payments/second at peak
• 184M active customer accounts
• 42 petabytes of data
• 4.5T database calls/quarter
PayPal operates one of the largest PRIVATE CLOUDS in the world
We have transformed core business processes into robust SERVICE-BASED PLATFORMS
The power of our platform
Our technology transformation enables us to:
• Process payments at tremendous scale
• Accelerate the innovation of new products
• Engage world-class developers & technologists
FRAUD PREVENTION
Fraud Prevention @ PayPal
Robust feature engineering, machine learning and statistical models
Highly scalable and multi-layered infrastructure software
Superior team of data scientists, researchers, financial and intelligence analysts
Fraud Prevention @ PayPal
Transaction Level
• Employs advanced machine learning and statistical models to flag fraudulent behavior up front
• Applies more sophisticated algorithms after the transaction is complete
Account Level
• Monitors account-level activity to identify abusive behavior
• Abusive patterns include frequent payments and suspicious profile changes
Network Level
• Monitors account-to-account interaction
• Example pattern: frequent transfers of money from several accounts to one central account
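The network-level pattern above (several accounts sending money to one central account) can be sketched as a simple fan-in count. This is only an illustration, not the production approach: the `fan_in_accounts` function, the `(sender, receiver)` tuple format and the `min_senders` threshold are all assumptions made for the example.

```python
from collections import defaultdict

def fan_in_accounts(transfers, min_senders=3):
    """Return receivers paid by at least `min_senders` distinct senders.

    `transfers` is an iterable of (sender, receiver) pairs (hypothetical format).
    The result maps each flagged receiver to its number of distinct senders.
    """
    senders_by_receiver = defaultdict(set)
    for sender, receiver in transfers:
        senders_by_receiver[receiver].add(sender)
    return {
        receiver: len(senders)
        for receiver, senders in senders_by_receiver.items()
        if len(senders) >= min_senders
    }

# Toy data: three distinct accounts pay "hub"; a repeated sender counts once.
transfers = [("a1", "hub"), ("a2", "hub"), ("a3", "hub"), ("a1", "hub"), ("a4", "b1")]
flagged = fan_in_accounts(transfers, min_senders=3)
```

A real system would also need time windows and amounts; the count of distinct senders is just the simplest signal that matches the pattern described.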
Fraud Prevention – What are we up against?
Fraudsters are becoming increasingly smart and adaptive
Need cost-effective solutions that can model complex attack patterns not previously observed
Need scalable and computationally efficient prediction models
Fraud Prevention – What are we up against?
• Much harder to get performance lift on our flagship models
• Need to re-look at all aspects of traditional model building
• Need out-of-the-box thinking
Area we are missing (AUC 0.96)
Fraud Prevention – What can we do to build better models?
[Diagram: data matrix with rows d1, d2, …, dM and columns feature1 … featureN plus Target (Label)]
• Better features
• Better labeling
• Advanced ML algorithms
• Bigger, better data
ALGORITHM – ACTIVE LEARNING
Active Learning – What is it?
• Supervised learning algorithms require data to be labeled
• Labelling is difficult, time-consuming and expensive: Active Learning to the rescue
• Idea – an ML algorithm can achieve better accuracy if it is allowed to “choose the data” from which it learns*
• Overcome the labelling bottleneck by asking queries (unlabeled data) to be labeled by a human
[Diagram: the active learning cycle, with boxes for Unlabeled Data, Labeled Data, Human Annotator and Machine Learning Model, linked by the steps (Re)Build Model and Select Queries]
Source*: Burr Settles
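The cycle described on this slide (rebuild the model, select a query, have a human label it, repeat) can be sketched with a toy 1-D problem. This is a minimal sketch under stated assumptions, not the method used in the talk: the threshold "model", the `oracle` function standing in for the human annotator, and the grid-shaped pool are all hypothetical, and the data is assumed to be separable with negatives below positives.

```python
def fit_threshold(labeled):
    """(Re)build the model: a 1-D threshold halfway between the two classes."""
    max_negative = max(x for x, y in labeled if y == 0)
    min_positive = min(x for x, y in labeled if y == 1)
    return (max_negative + min_positive) / 2.0

def active_learning_loop(pool, oracle, seed_labeled, n_queries):
    """Repeatedly query the unlabeled instance closest to the decision boundary."""
    labeled = list(seed_labeled)
    unlabeled = list(pool)
    for _ in range(n_queries):
        threshold = fit_threshold(labeled)                         # (re)build model
        query = min(unlabeled, key=lambda x: abs(x - threshold))   # select query
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))                     # human annotator
    return fit_threshold(labeled), labeled

# Hypothetical annotator: the true boundary sits at 0.70.
oracle = lambda x: int(x >= 0.70)
pool = [i / 100 for i in range(1, 100)]
seed = [(0.0, 0), (1.0, 1)]
threshold, labeled = active_learning_loop(pool, oracle, seed, n_queries=8)
```

With this toy model, always querying the point nearest the boundary behaves like a binary search, so eight labels are enough to bracket the true boundary closely, whereas random labels would typically need many more.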
Active Learning – What is it?
• Scenarios
  • Membership Query Synthesis – request labels for ‘any’ unlabeled instance in input space
  • Stream-based Selective Sampling – unlabeled instances are drawn one at a time and the learner decides whether to discard or query each one
  • Pool-based Sampling – instances are queried from a pool according to an informativeness measure
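To make the stream-based scenario concrete, here is a small sketch in which instances arrive one at a time and the learner queries only those it is unsure about. The `predict_proba` and `oracle` callables and the `band` parameter are hypothetical stand-ins chosen for the example, not part of the talk.

```python
def stream_selective_sampling(stream, predict_proba, oracle, band=0.2):
    """Query the oracle only when the predicted probability is within `band` of 0.5.

    Everything else is discarded without spending a label on it.
    """
    queried, discarded = [], []
    for x in stream:
        p = predict_proba(x)
        if abs(p - 0.5) <= band:
            queried.append((x, oracle(x)))   # uncertain: ask for a label
        else:
            discarded.append(x)              # confident: skip
    return queried, discarded

# Hypothetical stand-ins for a model score and a human label.
predict_proba = lambda x: min(max(x, 0.0), 1.0)
oracle = lambda x: int(x >= 0.6)

queried, discarded = stream_selective_sampling(
    [0.05, 0.45, 0.95, 0.65, 0.2], predict_proba, oracle
)
```

Pool-based sampling differs in that it ranks the whole pool before choosing, while this version has to decide on each instance as it arrives.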
Active Learning – What is it?
• Query Strategy Frameworks
  • Uncertainty Sampling
  • Query-By-Committee
  • Expected Model Change
  • Expected Error Reduction
  • Variance Reduction
  • Density Weighted Methods
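As an illustration of the first framework in the list, the sketch below scores instances with three commonly used uncertainty measures (least confidence, margin and entropy) and picks the most uncertain one. The function names and the example probabilities are assumptions for illustration; the slides do not say which measure was used.

```python
import math

def least_confidence(probs):
    """1 minus the probability of the most likely class (higher = more uncertain)."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely classes (smaller = more uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy of the predicted distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(pool_probs, score=entropy):
    """Index of the pool instance with the highest uncertainty score."""
    return max(range(len(pool_probs)), key=lambda i: score(pool_probs[i]))

# Predicted class probabilities for three unlabeled instances (made-up values).
pool_probs = [[0.95, 0.05], [0.55, 0.45], [0.80, 0.20]]
pick = most_uncertain(pool_probs)
```

For binary problems such as fraud versus non-fraud, the three measures produce the same ranking; they only diverge when there are three or more classes.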
Active Learning – Toy Example
• Toy data – 400 instances
• Model using random sampling – 70% accuracy
• Model using active learning (uncertainty sampling) – 90% accuracy
Active Learning For Fraud Prevention – Why is it unique?
• Data is unbalanced
• Fraud labelling requires trained experts and can’t be outsourced
• Fraud labelling is time-consuming
• Fraud labelling requires more than just individual instances; it requires before-and-after transactions
• Fraud labelling requires data from other entities (e.g., IP address)
• Fraud labelling requires aggregate data
• Fraud tags mature at different times (e.g., chargebacks) and are not instantaneous
Active Learning For Fraud Prevention – High Level Framework
[Diagram: framework with the components Labeled Data, Create Bags, Deep Learning Model, GBT Model, (Re)Build Models, Unlabeled Data, Predict, Query By Committee, Human Expert, Create Statistics, Active Feature Engineering and Simulate Features]
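The Query By Committee step in this framework can be sketched as follows: every committee member predicts on the unlabeled data, and the instances the members disagree on most are sent to the human expert. This is a sketch under assumptions: vote entropy is one common disagreement measure and the slides do not state which one was used, and the threshold lambdas below are hypothetical stand-ins for the deep learning and GBT models.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's votes; 0 means full agreement."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def query_by_committee(committee, unlabeled, k):
    """Return the k unlabeled instances with the highest committee disagreement."""
    scored = [(vote_entropy([model(x) for model in committee]), x) for x in unlabeled]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:k]]

# Hypothetical committee: three simple classifiers with different cut-offs.
committee = [
    lambda x: int(x > 0.4),
    lambda x: int(x > 0.5),
    lambda x: int(x > 0.6),
]
unlabeled = [0.1, 0.45, 0.9, 0.55, 0.3]
queries = query_by_committee(committee, unlabeled, k=2)
```

Only the two instances that fall between the members' cut-offs get selected; the instances on which all members agree are never sent for labelling.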
Modeling Algorithm – Deep Learning
[Diagram: neural network with Input Layer, Hidden Layers and Output Layer]
• If a network has many layers of non-linearity, it is “deep”
• Need scalable platform
• Need lots of training data
Modeling Algorithm – Deep Learning
• Network Topology – Feed forward
• Key Parameters
  • # of hidden layers
  • # of neurons @ each hidden layer
  • Regularization
  • Activation function
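A feed-forward pass can be written in a few lines, which makes the key parameters visible: the number of entries in `layers` sets the number of layers, the number of weight rows sets the neurons per layer, and `relu` is the activation function. This is an illustrative sketch only; the weights are made up, ReLU and a sigmoid output are assumptions, and regularization (which applies during training) is not shown.

```python
import math

def relu(values):
    """Activation function applied to each hidden neuron."""
    return [max(0.0, v) for v in values]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(weights, biases, inputs):
    """One fully connected layer; `weights` holds one row per neuron."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def forward(layers, x):
    """Feed x through the hidden layers (ReLU), then a single sigmoid output."""
    *hidden, (w_out, b_out) = layers
    for weights, biases in hidden:
        x = relu(dense(weights, biases, x))
    return sigmoid(dense(w_out, b_out, x)[0])

# Made-up weights: one hidden layer with 2 neurons, then 1 output neuron.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [-1.0]),
]
score = forward(layers, [2.0, 1.0])
```

The output is a value between 0 and 1 that can be read as a fraud score for the input.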
Modeling Algorithm – Gradient Boosting Trees
• GBT = Gradient Descent + Boosting
• Fit an additive (ensemble) model in a forward stage-wise manner
• In each stage, introduce a new model to compensate for the shortcomings of the existing models
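The stage-wise idea can be sketched with one-split regression trees (stumps) and squared loss, where the "shortcomings" of the current ensemble are its residuals, i.e. the negative gradient of the loss. This is a minimal teaching sketch, not the H2O implementation used in the experiments; the function names, the 1-D inputs and the parameter values are assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (least squared error)."""
    best = None
    for split in sorted(set(xs))[1:]:          # both sides of the split stay non-empty
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x < split else rmean

def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Additive model: mean of ys plus a shrunken sum of stumps fit to residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predict = lambda x: base + learning_rate * sum(s(x) for s in stumps)
    for _ in range(n_trees):
        # Residuals are the negative gradient of the squared loss.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
```

Each new stump corrects part of what the previous ones got wrong, and `learning_rate` controls how large each correction is.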
Modeling Algorithm – Gradient Boosting Trees
• Strengths
  • No pre-processing required
  • Robust
  • Scalable
• Weaknesses
  • Overfits (need to find proper stopping point)
  • Sensitive to noise
• Key Parameters
  • # of trees
  • Max depth
  • Max observations
  • Learning rate
EXPERIMENTS
Datasets
• Training Data
  • 1 year
  • 11 million transactions (1 million for active labelling)
• Test Data
  • 4 months
  • 4 million transactions
• # of features
  • 500 - 600
Tools
• H2O
  • Open source
  • Scalable
  • Robust
  • Deep Learning & GBM implementations
• R
  • Open source
  • Active learning package
Early Results – Active Learning Shows Promise…
# of instances queried    AUC (*weighted)
0                         0.960
1000                      0.961
10000                     0.963
50000                     0.971
100000                    0.975
500000                    0.977
1000000                   0.979
CONCLUSIONS
Conclusions
• Deep learning and GBT have shown tremendous performance for fraud detection
• Active learning shows promise in improving the performance of these champion models
• Active learning also significantly reduces our labelling cost