building intelligent data products
TRANSCRIPT
what actually is fraud
architecting flexible data ‘plumbing’
building solid data products on top of it
stephen whitworth
2 years at Hailo as data scientist/jack of some trades out of university
product and marketplace analytics, agent based modelling, data engineering, ‘ML’ services
data science/engineering at ravelin, specifically focused on our detection capabilities
what is ravelin?
online fraud detection and prevention platform
stream application/server data to our events API
we give fraud probability + beautiful data visualisation
backed by techstars/passion/playfair/amadeus/indeed.com founder/wonga founder amongst other great investors
fraud?
$14B: a dollar for every year the universe has existed
same-day delivery
on-demand services
‘victimless crime’
police ill-equipped to handle
low barrier to entry from dark net
3D secure - conversion killer
traditional: human generated rules, born of deep expertise
order-centric view of the world
hybrid: augment expertise by learning rules from data
cards don’t commit fraud, people do
building good plumbing
receive firehose through API
decode arbitrary data and store
extract hundreds of features
run through N models and rule engine to get probability
http/slack/whatever notification to customer
in 100-300ms (ish)
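The event path above can be sketched end to end. Everything here is illustrative, not Ravelin's actual API: the function names, the toy features, and the one model plus one rule are stand-ins; blending scores by taking the max is just one simple policy.

```python
import json
import time

def decode(raw: bytes) -> dict:
    """Decode an arbitrary JSON event payload from the firehose."""
    return json.loads(raw)

def extract_features(event: dict) -> dict:
    """Stand-in for the hundreds of real features."""
    return {
        "order_value": event.get("order", {}).get("value", 0.0),
        "n_cards": len(event.get("cards", [])),
    }

def score(features: dict, models, rules) -> float:
    """Combine model probabilities with a rule engine; max is one simple policy."""
    probs = [m(features) for m in models]
    probs += [r(features) for r in rules]
    return max(probs)

def handle(raw: bytes) -> float:
    start = time.monotonic()
    features = extract_features(decode(raw))
    p = score(features,
              models=[lambda f: min(1.0, f["order_value"] / 1000.0)],
              rules=[lambda f: 0.9 if f["n_cards"] > 3 else 0.0])
    elapsed_ms = (time.monotonic() - start) * 1000
    assert elapsed_ms < 300, "latency budget blown"  # the 100-300ms target
    return p

print(handle(b'{"order": {"value": 250.0}, "cards": [{}, {}, {}, {}]}'))
```

The caller only ever sees bytes in, probability out; the decode/feature/score stages can evolve independently behind that boundary.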
BUZZWORDS ABOUND
go
postgres
AWS
microservices
zookeeper
NSQ
python
event-driven
elasticsearch
bigquery
dynamodb
redis
instrumentation
different databases for different needs
kudos if you get The Office reference
postgres: solid, start here
dynamodb: very high throughput, low latency data
bigquery: to answer any question you could possibly have
elasticsearch: rich querying in a reasonable amount of time
graph db: haven’t decided, recommendations?
asynchronous systems/firehoses
nice deployment patterns
‘lambda architecture’ - the append only log
services store their own interpretation of events
services are almost entirely decoupled
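The append-only log pattern above can be shown with a minimal in-memory stand-in for a real queue like NSQ (all names hypothetical): producers append events, and each service replays the stream to build its own interpretation, with no knowledge of the other consumers.

```python
log = []  # append-only event log; nothing is ever mutated or deleted

def append(event: dict):
    log.append(event)

class OrderCounter:
    """One service's interpretation: orders seen per customer."""
    def __init__(self):
        self.counts = {}
    def consume(self, event):
        if event["type"] == "order":
            c = event["customer"]
            self.counts[c] = self.counts.get(c, 0) + 1

class CardIndex:
    """Another service's interpretation: cards seen per customer."""
    def __init__(self):
        self.cards = {}
    def consume(self, event):
        if event["type"] == "card":
            self.cards.setdefault(event["customer"], set()).add(event["card"])

append({"type": "order", "customer": "a"})
append({"type": "card", "customer": "a", "card": "4242"})
append({"type": "order", "customer": "a"})

orders, cards = OrderCounter(), CardIndex()
for e in log:  # each service replays the log independently
    orders.consume(e)
    cards.consume(e)

print(orders.counts)  # {'a': 2}
print(cards.cards)    # {'a': {'4242'}}
```

Because consumers only depend on the log, adding a new service is just another replay; this is also why it is hard to know who is consuming your data.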
asynchronous systems/firehoses
error propagation is challenging
no guarantees of SLA - at least as slow as your queue
hard to know who or what is consuming your data
building data products
‘a random forest is like a room full of experts who have seen different
cases of fraud from different perspectives’
precision: of all of my predictions, what % was I correct?
recall: out of all of the fraudsters, what % did I catch?
implicit tradeoff between conversion and fraud loss
‘accuracy’ a useless metric for fraud
99.8% ACCURATE
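The metrics above are easy to compute by hand, and toy numbers (illustrative, not Ravelin's) show why accuracy misleads: with ~0.2% of orders fraudulent, a model that never flags anyone is 99.8% "accurate" while catching nothing.

```python
def metrics(y_true, y_pred):
    """Accuracy, precision, and recall from boolean labels/predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of my fraud calls, % correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # of all fraudsters, % caught
    return accuracy, precision, recall

# 1000 orders, 2 fraudulent; the 'always legitimate' model:
y_true = [True] * 2 + [False] * 998
y_pred = [False] * 1000
print(metrics(y_true, y_pred))  # (0.998, 0.0, 0.0): 99.8% accurate, 0% recall
```

Precision maps to conversion (false positives block good customers) and recall to fraud loss, which is the tradeoff named above.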
keep model interfaces simple
hide arbitrarily complex transformations behind it
blend global and client specific models
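One way to sketch a simple model interface is below: the caller sees only predict(client_id, event) -> probability, while feature transformation and the global/client blend hide behind it. The class name, the 50/50 weighted blend, and the single toy feature are all assumptions for illustration.

```python
class FraudModel:
    def __init__(self, global_model, client_models, client_weight=0.5):
        self.global_model = global_model
        self.client_models = client_models  # per-client specialised models
        self.client_weight = client_weight

    def _transform(self, event: dict) -> dict:
        # arbitrarily complex feature engineering lives here, invisibly
        return {"value": event.get("value", 0.0)}

    def predict(self, client_id: str, event: dict) -> float:
        features = self._transform(event)
        p_global = self.global_model(features)
        client = self.client_models.get(client_id)
        if client is None:
            return p_global          # no client model yet: global only
        w = self.client_weight
        return w * client(features) + (1 - w) * p_global

model = FraudModel(
    global_model=lambda f: 0.2,
    client_models={"acme": lambda f: 0.8},
)
print(model.predict("acme", {"value": 10}))   # blends 0.8 and 0.2 -> 0.5
print(model.predict("other", {"value": 10}))  # falls back to global -> 0.2
```

Keeping the interface this narrow means transformations and blend strategies can change without touching any caller.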
building and training statistical models
currently batch
will combine with online
RANDOM FORESTS
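The "room full of experts" metaphor can be made concrete with a toy forest: each "expert" is a one-feature decision stump trained on its own bootstrap sample, and the forest averages their votes into a probability. This is only a sketch under invented data; real forests use deep trees, random feature subsets, and far more features.

```python
import random

def fit_stump(X, y):
    """Pick the (feature, threshold, sign) that best separates this sample."""
    best, best_err = (0, 0.0, 1), float("inf")
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [1 if sign * (row[f] - t) > 0 else 0 for row in X]
                err = sum(p != yy for p, yy in zip(pred, y))
                if err < best_err:
                    best_err, best = err, (f, t, sign)
    return best

def predict_stump(stump, row):
    f, t, sign = stump
    return 1 if sign * (row[f] - t) > 0 else 0

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # bootstrap: each expert sees a different resampled view of the data
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict_proba(forest, row):
    """Fraction of experts voting fraud."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# invented data: [order value, cards on account]; high value + many cards => fraud
X = [[10, 1], [20, 1], [15, 2], [900, 5], [800, 4], [950, 6]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(predict_proba(forest, [870, 5]))  # most experts vote fraud
print(predict_proba(forest, [12, 1]))   # most experts vote legitimate
```

Each stump is individually crude, but because each saw a different sample, the averaged vote is both more accurate and naturally a probability, which also makes individual trees easy to inspect when debugging.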
MONITORING
probabilistic, not deterministic
dogfood - use live robot customers
run models in ‘dark mode’ to determine performance
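Dark mode can be sketched as follows (all names and thresholds hypothetical): the candidate model scores the same live traffic as the incumbent, but only the incumbent's output is acted on; the candidate's scores are logged so its would-be decisions can be compared offline before promotion.

```python
def incumbent(event):
    """The live model: its output drives real decisions."""
    return 0.9 if event["value"] > 500 else 0.1

def candidate(event):
    """The dark-mode model: scored on live traffic, never acted upon."""
    return 0.9 if event["value"] > 400 else 0.1

dark_log = []

def handle(event):
    p_live = incumbent(event)
    p_dark = candidate(event)            # logged only
    dark_log.append((event, p_live, p_dark))
    return p_live > 0.5                  # the block decision uses the live model

events = [{"value": v} for v in (100, 450, 900)]
decisions = [handle(e) for e in events]
print(decisions)                          # [False, False, True]

# offline: where would the candidate have decided differently?
disagreements = [e for e, a, b in dark_log if (a > 0.5) != (b > 0.5)]
print(disagreements)                      # [{'value': 450}]
```

Reviewing the disagreements (here, the 450 order) against later fraud outcomes is how a probabilistic system's performance can be judged without risking customer-facing mistakes.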
why not deep learning? …yet
ability to debug random forests
had nice results with keras
serialisation and deployment: an unsolved problem
in beta and signing up clients
looking for on-demand services/marketplaces
talk to me afterwards
obligatory: we are hiring!
senior machine learning engineers/data scientists
[email protected] or talk to me after
@sjwhitworth - www.ravelin.com - @ravelinhq