Deep Learning for Fraud Detection
TRANSCRIPT
© 2014 MapR Technologies
Deep Learning for Fraud Detection
Contact Information
Ted Dunning, Chief Applications Architect at MapR Technologies
Committer & PMC for Apache’s Drill, ZooKeeper & others; VP of Incubator at the Apache Software Foundation
Email: [email protected], [email protected]
Twitter @ted_dunning
Goals for Today
• Explore the state of the art for deep learning and fraud detection
• Separate at least some of the wheat from the chaff
• Provide some realistic guidance for getting results
• Play with cool stuff!
Agenda
• Motivation
• What are neural networks and deep learning?
• It can be simpler than you think
• But, no free lunch / you get what you pay for / other clever aphorism
• Some experiments
• Where to go from here
Motivation for Advanced Modeling in Fraud
• Neural networks have completely dominated credit card fraud detection since the late ’80s
– Random forests and tree ensembles are often used in other kinds of fraud and churn models
• The reason is that rule-based systems simply don’t work
– Well, they do work at first
– Fraudsters change tactics, you add rules, interaction mayhem ensues
• And learning algorithms really do work
– Fraudsters change tactics, you add features and retrain
So learning is good
But good learning is hard
And finding good features is really hard
Some Sample Features
• Charge size relative to previous averages for card
• Charge size relative to previous average for merchant
• Known merchant or not
• Doubled transaction
• Address Verification System (AVS) or Card Verification Value (CVV2) mismatch
• Unusual region for card
• Unusual time of day relative to history
• Magstripe use if chip available
• (hundreds more)
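As a sketch of how the first of these features might be computed — this is an illustration only, not the production pipeline, and `charge_ratio_features` is a hypothetical helper:

```python
import numpy as np

def charge_ratio_features(amounts):
    """For each charge, the ratio of its size to the running average of
    all earlier charges on the same card (a toy version of the
    'charge size relative to previous averages' feature)."""
    amounts = np.asarray(amounts, dtype=float)
    ratios = np.ones_like(amounts)          # first charge has no history
    for i in range(1, len(amounts)):
        ratios[i] = amounts[i] / amounts[:i].mean()
    return ratios

# A $900 charge on a card that usually sees ~$20 stands out sharply.
history = [22.0, 18.0, 20.0, 900.0]
print(charge_ratio_features(history))
```

The same ratio idea applies per merchant; production systems would also decay old history and guard against empty or zero-valued histories.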
Sequence-Based Features
• Plausible pattern matching (rent a car, pay for gas at airport)
• Probe transactions (gas in wrong place, pizza, big charge)
• Previous transaction at compromised merchant
• Card velocity
Key Problems
• Good guys need data … that means that fraudsters get first chance at bat
• Good guys are careful and test systems before releasing
• Bad guys have many low-risk transactions and can change methods quickly
• In some areas, fraudsters adapt techniques in hours
Making up features is easy
Finding features that add real lift is very hard
What are neural networks and deep learning?
• Start simple … imagine we have 20 features, each 0 or 1
– Let’s yell “Fraud!” if any of the features is a 1
– Houston, we have a model
• But this model isn’t any better than a rule
• It also doesn’t have any interesting Greek letters
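That “model” really does fit in a couple of lines — a sketch, just to make the point that it is nothing more than a rule:

```python
def rule_model(features):
    """The simplest possible 'model': yell fraud if any binary feature
    fires. No weights, no Greek letters -- exactly as good (or bad)
    as a hand-written rule."""
    return int(any(features))

print(rule_model([0] * 20))         # nothing fires -> 0
print(rule_model([0] * 19 + [1]))   # one feature fires -> 1
```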
Real-world Intrudes
• We assumed all features are equally good
– What if some are kind of poor or weak?
• Can we weight different features more or less?
– Can we learn these weights from data?
Learning Works
• Yes. We can learn these models
• How we measure error is important
• We must have good features
• Even good features may need transformation
– Take logs of times and monetary values
– Subtract means, scale, bin values
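A minimal sketch of those transformations (an assumed pipeline for illustration, using NumPy — real systems would also clip outliers and bin):

```python
import numpy as np

def transform(values):
    """Log, then center and scale -- the kind of transformation that
    even good monetary or time features usually need."""
    logged = np.log1p(values)           # take logs of monetary values
    centered = logged - logged.mean()   # subtract the mean
    return centered / logged.std()      # scale to unit variance

amounts = np.array([5.0, 20.0, 35.0, 10000.0])
print(transform(amounts))   # the huge charge is tamed but still extreme
```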
Not Good Enough
• We need combinations of models
• Simple linear combinations aren’t subtle enough
• Enter multi-level models
– Can we learn a model that uses combinations of inputs
– Where each of those combinations is a model that we learn?
Yes, Virginia, There IS a Santa Claus
Each circle is a sum and a (soft) threshold
Arrows are multiplication by a learned weight
Errors on Output Can Propagate
Each circle sends error back along each arrow
Arrows weight back-propagating errors
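The two slides above — circles as weighted sums with a soft threshold, errors propagated back along the same arrows — can be sketched as a tiny network trained from scratch. This is an illustrative toy (random binary data standing in for features, not fraud data):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 binary features; call it "fraud" if at least two fire.
X = rng.integers(0, 2, size=(200, 4)).astype(float)
y = (X.sum(axis=1) >= 2).astype(float)

W1 = rng.normal(0.0, 0.5, size=(4, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, size=(8, 1)); b2 = np.zeros(1)

lr = 1.0
for _ in range(1000):
    h = sigmoid(X @ W1 + b1)           # each circle: weighted sum + soft threshold
    p = sigmoid(h @ W2 + b2).ravel()   # output circle
    delta2 = (p - y)[:, None]          # error on the output
    delta1 = (delta2 @ W2.T) * h * (1 - h)   # error weighted back along the arrows
    W2 -= lr * h.T @ delta2 / len(X); b2 -= lr * delta2.mean(axis=0)
    W1 -= lr * X.T @ delta1 / len(X); b1 -= lr * delta1.mean(axis=0)

p = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel()
acc = ((p > 0.5) == (y > 0.5)).mean()
print(f"training accuracy: {acc:.2f}")
```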
Success! Triumph!
World domination!
World domination!
With some reservations, because features are hard
Turtles All the Way Down – We Wish
• This learning works well for just a few layers
• This is still a big deal …
– With cool features, we can build real systems
• With many layers, the learning no longer converges
• Well … until recently
Model Learning in an Ideal World
• If we could just learn the features
– Maybe unsupervised, maybe supervised
– And at the same time learn the model
• Presumably we could build models more quickly
• And more easily
• And we wouldn’t have to dirty our minds with pedestrian domain knowledge
Example 1 – (not very) Deep Auto-encoder
• Let’s take an example where we can learn features
• Data is EKG traces
• We want to find anomalies
– No supervised training
Spot the Anomaly
Anomaly?
Maybe not!
Where’s Waldo?
This is the real anomaly
Normal Isn’t Just Normal
• What we want is a model of what is normal
• What doesn’t fit the model is the anomaly
• For simple signals, the model can be simple …
• The real world is rarely so accommodating
We Do Windows
Windows on the World
• The set of windowed signals is a nice model of our original signal
• Clustering can find the prototypes
– Fancier techniques available using sparse coding
• The result is a dictionary of shapes
• New signals can be encoded by shifting, scaling and adding shapes from the dictionary
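A minimal sketch of the windows-plus-clustering idea, with a noisy sine wave standing in for the EKG (the talk’s real code lives in the k-means-auto-encoder repo; the window size, step, and cluster count here are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# A repetitive "heartbeat-like" signal: a sine wave plus a little noise.
t = np.arange(2000)
signal = np.sin(2 * np.pi * t / 50) + 0.05 * rng.normal(size=t.size)

# Chop the signal into overlapping windows.
W, step = 50, 10
windows = np.array([signal[i:i + W] for i in range(0, signal.size - W, step)])

def kmeans(data, k, iters=30):
    """Plain Lloyd's k-means: the clusters become the dictionary of shapes."""
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = data[labels == j].mean(axis=0)
    return centroids, labels

shapes, labels = kmeans(windows, k=20)
recon = shapes[labels]                    # each window -> its nearest shape
err = np.abs(windows - recon).mean()
print(f"mean reconstruction error: {err:.3f}")
```

A window that reconstructs badly — large residual against every dictionary shape — is exactly the anomaly signal described on the following slides.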
Most Common Shapes (for EKG)
[Figure: original signal vs. reconstructed signal; reconstruction error is < 1 bit / sample]
An Anomaly
The original technique for finding a 1-d anomaly works against the reconstruction error
Close-up of anomaly
Not what you want your heart to do.
And not what the model expects it to do.
A Different Kind of Anomaly
Some k-means Caveats
• But Eamonn Keogh says that k-means can’t work on time series
– http://www.cs.ucr.edu/~eamonn/meaningless.pdf
• That is silly … and kind of correct; k-means does have limits
– Other kinds of auto-encoders are much more powerful
• More fun and code demos at
– https://github.com/tdunning/k-means-auto-encoder
The Limits of Clustering as an Auto-encoder
• Clustering is like trying to tile your sample distribution
• Can be used to approximate a signal
• Filling a d-dimensional region with k clusters gives cells whose size shrinks only like k^(-1/d)
• If d is large, this is no good
Moral for Auto-encoders
• The simplest auto-encoders can be good models
• For more complex spaces/signals, more elaborate models may be required
– Winner-take-(absolutely)-all may be problematic
– In particular, models that allow sparse linear combination may be better
• Consider deep learning, recurrent networks, denoising
How Does Clustering Do Reconstruction?
For normalized cluster centroids, dot-product and distance are equivalent
Winner takes all with k-means
Dot-product scales centroid to reconstruct
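The distance/dot-product equivalence is easy to verify numerically. For unit-length centroids c, ||x − c||² = ||x||² − 2 x·c + 1, so the nearest centroid is exactly the one with the largest dot product — which is why a k-means encoder looks like a layer of a neural network (random data here, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Ten normalized centroids in five dimensions.
centroids = rng.normal(size=(10, 5))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

x = rng.normal(size=5)
nearest_by_distance = np.linalg.norm(centroids - x, axis=1).argmin()
nearest_by_dot = (centroids @ x).argmax()
print(nearest_by_distance == nearest_by_dot)   # True
```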
AKA - Neural Network
What If … We Had More Layers?
Other Thoughts
• What if we allow more than one cluster to be active?
– k-sparse learning!
• Well, almost
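A sketch of the k-sparse idea: instead of the single winner that k-means allows, keep the k strongest dictionary responses and reconstruct with a linear combination of just those atoms. The random dictionary below is hypothetical, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(7)

def k_sparse_reconstruct(x, dictionary, k):
    """Keep only the k most active dictionary atoms, then rebuild x as
    a least-squares linear combination of them."""
    scores = dictionary @ x
    top = np.argsort(np.abs(scores))[-k:]     # k strongest responses
    atoms = dictionary[top]                   # (k, d)
    coef, *_ = np.linalg.lstsq(atoms.T, x, rcond=None)
    return atoms.T @ coef

# Hypothetical dictionary: 20 unit-norm atoms in 8 dimensions.
D = rng.normal(size=(20, 8))
D /= np.linalg.norm(D, axis=1, keepdims=True)
x = rng.normal(size=8)

err1 = np.linalg.norm(x - k_sparse_reconstruct(x, D, 1))  # winner takes all
err3 = np.linalg.norm(x - k_sparse_reconstruct(x, D, 3))  # k-sparse, k = 3
print(err1, err3)
```

Since the top-1 atom is always among the top-3, letting more atoms be active can never reconstruct worse — that is the sense in which k-sparse coding strictly generalizes winner-take-all clustering.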
The Point of Deep Learning
• It isn’t just many hidden layers in a neural network
• The goal is to eliminate feature engineering by learning features as well as the classifier
Experiment 3 – Card Velocity
• Most features so far are inherent in the data
• Few are true sequence features
• Card velocity is a pure combination
– Starting point can be anywhere
– The issue is where the next point is relative to the starting point
Card Velocity
Non-fraud steps are reasonable in terms of distance and time
Fraudulent use of the card by multiple attackers results in big, fast jumps
Synthetic Data Example
• Generate a random point
• Take four small steps
• If fraud, the second step can be large
• Result is five positions, each in 3-d on the surface of a sphere
– Data shape is N x (5 x 3)
• Add secondary features containing step size … N x 4
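A sketch of the generator just described (the step scales `small` and `big` are assumptions for illustration; the slide does not specify them):

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

def make_track(fraud, small=0.05, big=1.5):
    """One synthetic track: a random start on the unit sphere, then
    four steps; for a fraud track, the second step is large."""
    points = [unit(rng.normal(size=3))]
    for i in range(4):
        scale = big if (fraud and i == 1) else small
        points.append(unit(points[-1] + scale * rng.normal(size=3)))
    return np.array(points)                  # shape (5, 3)

def step_sizes(track):
    """The secondary features: four distances between successive points."""
    return np.linalg.norm(np.diff(track, axis=0), axis=1)

legit_max = [step_sizes(make_track(False)).max() for _ in range(200)]
fraud_max = [step_sizes(make_track(True)).max() for _ in range(200)]
print(f"median max step: legit {np.median(legit_max):.2f}, "
      f"fraud {np.median(fraud_max):.2f}")
```

The step-size features separate the two classes almost perfectly, which is exactly the point of the next slide.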
The Truth is Out There
• With the right feature (step size), it is trivial to spot the fraud
• Here we show the step size between positions
• Fraud cases take a big jump that others don’t
• But they can be anywhere
But Dimensionality Bites Hard
• With the step-size feature, learning succeeds instantly with the simplest models and gets nearly perfect accuracy
• Without the step-size feature, learning with TensorFlow gets modest accuracy after substantial learning cost (work in progress; could do better with lots more tuning)
• The problem is that there are too many combinations of 15 variables; we need a very specific combination of three pair-wise diffs combined non-linearly into a distance
We have a bona fide revolution
But old tricks still pay
Greenfield Problem Landscape
Mature Problem Landscape
Summary
• There is too much to say in 40 minutes; let’s talk some more at the MapR booth
• Deep learning, especially with systems like TensorFlow, has huge promise
• Deep learning trades feature engineering for learning-architecture engineering
• There are powerful middle grounds
Short Books by Ted Dunning & Ellen Friedman
• Published by O’Reilly in 2014–2016
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR
– http://bit.ly/ebook-real-world-hadoop
– http://bit.ly/mapr-tsdb-ebook
– http://bit.ly/ebook-anomaly
– http://bit.ly/recommendation-ebook
Streaming Architecture
by Ted Dunning and Ellen Friedman © 2016 (published by O’Reilly)
Free copies at book signing today
http://bit.ly/mapr-ebook-streams
Thank You!
Q & A
Engage with us!
@mapr | maprtech | MapR | mapr-technologies