building intelligent data products
TRANSCRIPT
what actually is fraud
architecting flexible data ‘plumbing’
building solid data products on top of it
stephen whitworth
2 years at Hailo as data scientist/jack of some trades out of university
product and marketplace analytics, agent based modelling, data engineering, ‘ML’ services
data science/engineering at ravelin, specifically focused on our detection capabilities
what is ravelin?
online fraud detection and prevention platform
stream application/server data to our events API
we give fraud probability + beautiful data visualisation
backed by techstars/passion/playfair/amadeus/indeed.com founder/wonga founder amongst other great investors
fraud?
$14B: a dollar for every year the universe has existed
same-day delivery
on-demand services
‘victimless crime’
police ill-equipped to handle
low barrier to entry from dark net
3D secure - conversion killer
traditional: human generated rules, born of deep expertise
order-centric view of the world
hybrid: augment expertise by learning rules from data
cards don’t commit fraud, people do
building good plumbing
receive firehose through API
decode arbitrary data and store
extract hundreds of features
run through N models and rule engine to get probability
http/slack/whatever notification to customer
in 100-300ms (ish)
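The event path above can be sketched end to end. Everything here is illustrative, not Ravelin's actual API: the function names, the toy features, and the one model plus one rule are stand-ins; blending scores by taking the max is just one simple policy.

```python
import json
import time

def decode(raw: bytes) -> dict:
    """Decode an arbitrary JSON event payload from the firehose."""
    return json.loads(raw)

def extract_features(event: dict) -> dict:
    """Stand-in for the hundreds of real features."""
    return {
        "order_value": event.get("order", {}).get("value", 0.0),
        "n_cards": len(event.get("cards", [])),
    }

def score(features: dict, models, rules) -> float:
    """Combine model probabilities with a rule engine; max is one simple policy."""
    probs = [m(features) for m in models]
    probs += [r(features) for r in rules]
    return max(probs)

def handle(raw: bytes) -> float:
    start = time.monotonic()
    features = extract_features(decode(raw))
    p = score(features,
              models=[lambda f: min(1.0, f["order_value"] / 1000.0)],
              rules=[lambda f: 0.9 if f["n_cards"] > 3 else 0.0])
    elapsed_ms = (time.monotonic() - start) * 1000
    assert elapsed_ms < 300, "latency budget blown"  # the 100-300ms target
    return p

print(handle(b'{"order": {"value": 250.0}, "cards": [{}, {}, {}, {}]}'))
```

The caller only ever sees bytes in, probability out; the decode/feature/score stages can evolve independently behind that boundary.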
BUZZWORDS ABOUND
go
postgres
AWS
microservices
zookeeper
NSQ
python
event-driven
elasticsearch
bigquery
dynamodb
redis
instrumentation
different databases for different needs
kudos if you get The Office reference
postgres: solid, start here
dynamodb: very high throughput, low latency data
bigquery: to answer any question you could possibly have
elasticsearch: rich querying in a reasonable amount of time
graph db: haven’t decided, recommendations?
asynchronous systems/firehoses
nice deployment patterns
‘lambda architecture’ - the append only log
services store their own interpretation of events
services are almost entirely decoupled
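The append-only log pattern above can be shown with a minimal in-memory stand-in for a real queue like NSQ (all names hypothetical): producers append events, and each service replays the stream to build its own interpretation, with no knowledge of the other consumers.

```python
log = []  # append-only event log; nothing is ever mutated or deleted

def append(event: dict):
    log.append(event)

class OrderCounter:
    """One service's interpretation: orders seen per customer."""
    def __init__(self):
        self.counts = {}
    def consume(self, event):
        if event["type"] == "order":
            c = event["customer"]
            self.counts[c] = self.counts.get(c, 0) + 1

class CardIndex:
    """Another service's interpretation: cards seen per customer."""
    def __init__(self):
        self.cards = {}
    def consume(self, event):
        if event["type"] == "card":
            self.cards.setdefault(event["customer"], set()).add(event["card"])

append({"type": "order", "customer": "a"})
append({"type": "card", "customer": "a", "card": "4242"})
append({"type": "order", "customer": "a"})

orders, cards = OrderCounter(), CardIndex()
for e in log:  # each service replays the log independently
    orders.consume(e)
    cards.consume(e)

print(orders.counts)  # {'a': 2}
print(cards.cards)    # {'a': {'4242'}}
```

Because consumers only depend on the log, adding a new service is just another replay; this is also why it is hard to know who is consuming your data.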
asynchronous systems/firehoses
error propagation is challenging
no guarantees of SLA - at least as slow as your queue
hard to know who or what is consuming your data
building data products
‘a random forest is like a room full of experts who have seen different
cases of fraud from different perspectives’
precision: of all of my predictions, what % was I correct?
recall: out of all of the fraudsters, what % did I catch?
implicit tradeoff between conversion and fraud loss
‘accuracy’ a useless metric for fraud
99.8% ACCURATE
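The metrics above are easy to compute by hand, and toy numbers (illustrative, not Ravelin's) show why accuracy misleads: with ~0.2% of orders fraudulent, a model that never flags anyone is 99.8% "accurate" while catching nothing.

```python
def metrics(y_true, y_pred):
    """Accuracy, precision, and recall from boolean labels/predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of my fraud calls, % correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # of all fraudsters, % caught
    return accuracy, precision, recall

# 1000 orders, 2 fraudulent; the 'always legitimate' model:
y_true = [True] * 2 + [False] * 998
y_pred = [False] * 1000
print(metrics(y_true, y_pred))  # (0.998, 0.0, 0.0): 99.8% accurate, 0% recall
```

Precision maps to conversion (false positives block good customers) and recall to fraud loss, which is the tradeoff named above.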
keep model interfaces simple
hide arbitrarily complex transformations behind it
blend global and client specific models
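One way to sketch a simple model interface is below: the caller sees only predict(client_id, event) -> probability, while feature transformation and the global/client blend hide behind it. The class name, the 50/50 weighted blend, and the single toy feature are all assumptions for illustration.

```python
class FraudModel:
    def __init__(self, global_model, client_models, client_weight=0.5):
        self.global_model = global_model
        self.client_models = client_models  # per-client specialised models
        self.client_weight = client_weight

    def _transform(self, event: dict) -> dict:
        # arbitrarily complex feature engineering lives here, invisibly
        return {"value": event.get("value", 0.0)}

    def predict(self, client_id: str, event: dict) -> float:
        features = self._transform(event)
        p_global = self.global_model(features)
        client = self.client_models.get(client_id)
        if client is None:
            return p_global          # no client model yet: global only
        w = self.client_weight
        return w * client(features) + (1 - w) * p_global

model = FraudModel(
    global_model=lambda f: 0.2,
    client_models={"acme": lambda f: 0.8},
)
print(model.predict("acme", {"value": 10}))   # blends 0.8 and 0.2 -> 0.5
print(model.predict("other", {"value": 10}))  # falls back to global -> 0.2
```

Keeping the interface this narrow means transformations and blend strategies can change without touching any caller.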
building and training statistical models
currently batch
will combine with online
RANDOM FORESTS
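The "room full of experts" metaphor can be made concrete with a toy forest: each "expert" is a one-feature decision stump trained on its own bootstrap sample, and the forest averages their votes into a probability. This is only a sketch under invented data; real forests use deep trees, random feature subsets, and far more features.

```python
import random

def fit_stump(X, y):
    """Pick the (feature, threshold, sign) that best separates this sample."""
    best, best_err = (0, 0.0, 1), float("inf")
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [1 if sign * (row[f] - t) > 0 else 0 for row in X]
                err = sum(p != yy for p, yy in zip(pred, y))
                if err < best_err:
                    best_err, best = err, (f, t, sign)
    return best

def predict_stump(stump, row):
    f, t, sign = stump
    return 1 if sign * (row[f] - t) > 0 else 0

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # bootstrap: each expert sees a different resampled view of the data
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict_proba(forest, row):
    """Fraction of experts voting fraud."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# invented data: [order value, cards on account]; high value + many cards => fraud
X = [[10, 1], [20, 1], [15, 2], [900, 5], [800, 4], [950, 6]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(predict_proba(forest, [870, 5]))  # most experts vote fraud
print(predict_proba(forest, [12, 1]))   # most experts vote legitimate
```

Each stump is individually crude, but because each saw a different sample, the averaged vote is both more accurate and naturally a probability, which also makes individual trees easy to inspect when debugging.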
MONITORING
probabilistic, not deterministic
dogfood - use live robot customers
run models in ‘dark mode’ to determine performance
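Dark mode can be sketched as follows (all names and thresholds hypothetical): the candidate model scores the same live traffic as the incumbent, but only the incumbent's output is acted on; the candidate's scores are logged so its would-be decisions can be compared offline before promotion.

```python
def incumbent(event):
    """The live model: its output drives real decisions."""
    return 0.9 if event["value"] > 500 else 0.1

def candidate(event):
    """The dark-mode model: scored on live traffic, never acted upon."""
    return 0.9 if event["value"] > 400 else 0.1

dark_log = []

def handle(event):
    p_live = incumbent(event)
    p_dark = candidate(event)            # logged only
    dark_log.append((event, p_live, p_dark))
    return p_live > 0.5                  # the block decision uses the live model

events = [{"value": v} for v in (100, 450, 900)]
decisions = [handle(e) for e in events]
print(decisions)                          # [False, False, True]

# offline: where would the candidate have decided differently?
disagreements = [e for e, a, b in dark_log if (a > 0.5) != (b > 0.5)]
print(disagreements)                      # [{'value': 450}]
```

Reviewing the disagreements (here, the 450 order) against later fraud outcomes is how a probabilistic system's performance can be judged without risking customer-facing mistakes.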
why not deep learning? …yet
ability to debug random forests
had nice results with keras
serialisation and deployment: an unsolved problem
in beta and signing up clients
looking for on-demand services/marketplaces
talk to me afterwards
obligatory: we are hiring!
senior machine learning engineers/data scientists
[email protected] or talk to me after
@sjwhitworth - www.ravelin.com - @ravelinhq