Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans


Upload: spark-summit

Posted on 21-Apr-2017


Category: Data & Analytics



TRANSCRIPT

Realtime Risk Management USING KAFKA, PYTHON, AND SPARK STREAMING

2

UNDERWRITING CREDIT CARD TRANSACTIONS IS RISKY

3

WE NEED TO BE QUICK AT UNDERWRITING

4

WE ALSO NEED TO AVOID LOSING MONEY

5

Some Numbers

$12Bn CUMULATIVE PROCESSED

200k+ MERCHANTS

14k EVENTS/SECOND

7 RISK ANALYSTS

6

Risk Analysts

7

WE NEEDED TO BUILD SOMETHING THAT STOPS THE HOPE FACTOR

8

9

SPARK STREAMING ALLOWS US TO DO REAL-TIME DATA PROCESSING.

WE CAN DECIDE WHICH EVENTS NEED A CLOSER LOOK

10

Intro to Kafka, Zookeeper,

and Spark Streaming

11

Apache Kafka

Image taken from Kafka docs

12

Apache Zookeeper

Image taken from Zookeeper docs

13

Apache Spark Streaming

Image taken from mapr.com

14

OLD WAY: RECEIVERS

15

[Diagram: events flow from Kafka into a Receiver, which builds an RDD with the events from each batch interval (t0 to t1, then t1 to t2, ...); the Spark Engine then processes each RDD.]

16

Problems w/ Receivers

• The only way to get at-least-once delivery makes it hard to deploy new code

• Zookeeper is updated with which offsets to start from when data is received, not when it is processed

• We’re actually duplicating Kafka
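For contrast with the direct approach that follows, a receiver-based consumer looks roughly like the sketch below, using the Spark 1.x `KafkaUtils.createStream` receiver API; the Zookeeper quorum, consumer group, and topic name are made-up placeholders, not details from the talk.

```python
# Receiver-based consumption (old way). The receiver pulls events from Kafka
# and buffers them inside Spark; the consumer group's offsets in Zookeeper
# advance as data is *received*, not after our job has processed it.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="risk-receiver-example")
ssc = StreamingContext(sc, 10)  # 10-second batches

stream = KafkaUtils.createStream(
    ssc,
    zkQuorum="zk1:2181,zk2:2181,zk3:2181",  # hypothetical Zookeeper quorum
    groupId="risk",                          # hypothetical consumer group
    topics={"checkout_events": 1})           # hypothetical topic -> receiver threads

stream.pprint()

ssc.start()
ssc.awaitTermination()
```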

17

NEW WAY: RECEIVERLESS

18

[Diagram: the Spark Engine reads directly from Kafka, with no intermediate receiver, processing an RDD with the events from each batch interval (t0 to t1, then t1 to t2, ...).]

19

General Structure

• Load Kafka offsets from Zookeeper

• Tell Spark Streaming to create a DStream that consumes from Kafka, starting at the specified offsets

• Define your processing step (e.g. filter out non-risky events)

• Define your output step (e.g. POST the data to the case management software)

• Save the Kafka offset of the most recently processed event to Zookeeper

• Start your streaming application, and grab some popcorn! (See the sketch below.)
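A minimal PySpark sketch of that structure, assuming the Spark 1.x direct Kafka API and the kazoo Zookeeper client; the topic name, broker and Zookeeper addresses, partition count, and znode paths are illustrative, not from the talk.

```python
# Receiverless general structure: load offsets from Zookeeper, consume directly
# from Kafka starting at those offsets, process, output, then save offsets.
import json

from kazoo.client import KazooClient
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

TOPIC = "checkout_events"              # hypothetical topic
PARTITIONS = range(4)                  # hypothetical partition count
OFFSET_PATH = "/risk/offsets/%d"       # hypothetical znode per partition

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

def load_offsets():
    """Load the Kafka offsets we last committed to Zookeeper."""
    offsets = {}
    for p in PARTITIONS:
        path = OFFSET_PATH % p
        zk.ensure_path(path)
        data, _ = zk.get(path)
        stored = (data or b"").decode("utf-8")
        offsets[TopicAndPartition(TOPIC, p)] = int(stored or "0")
    return offsets

def save_offsets(offset_ranges):
    """Save the offset just past the most recently processed event."""
    for o in offset_ranges:
        zk.set(OFFSET_PATH % o.partition, str(o.untilOffset).encode("utf-8"))

sc = SparkContext(appName="realtime-risk")
ssc = StreamingContext(sc, 10)  # 10-second batches

# DStream that consumes from Kafka, starting at the offsets stored in Zookeeper.
stream = KafkaUtils.createDirectStream(
    ssc, [TOPIC], {"metadata.broker.list": "kafka1:9092"},
    fromOffsets=load_offsets())

def handle_batch(rdd):
    # Grab the offset ranges before any other transformation on this KafkaRDD.
    offset_ranges = rdd.offsetRanges()
    events = rdd.map(lambda kv: json.loads(kv[1]))
    print("events in batch: %d" % events.count())
    # Processing + output steps go here (filter risky events and POST them;
    # see the filtering sketch further down), then commit the offsets.
    save_offsets(offset_ranges)

stream.foreachRDD(handle_batch)

ssc.start()          # grab some popcorn
ssc.awaitTermination()
```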

20

Example Filtering: Risky Products

hair extensions

pharmacy

vaporizer

gateway card

wifi pineapple

iPhone

gucci

cannabis

travel package

21

[Diagram: the RDD for time 0 contains the product titles hair extensions, gucci, cannabis, sweet bag, nice shoes, and taylor swift t-shirt. A Filter step keeps the risky titles (hair extensions, gucci, cannabis), a Map step turns each into a JSON payload ({"title": "hair exten…}, {"title": "gucci"…}, {"title": "cannab…}), and an HTTPS POST sends them to the case management software.]
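A sketch of that filter / map / POST step as it might look inside the streaming job; `RISKY_TERMS` and the case-management endpoint are illustrative stand-ins, and `stream` is the direct Kafka DStream from the earlier sketch.

```python
# Filter -> Map -> HTTPS POST, matching the pipeline in the diagram above.
import json
import requests

RISKY_TERMS = {"hair extensions", "gucci", "cannabis"}          # example terms
CASE_MGMT_URL = "https://case-management.example.com/cases"     # hypothetical endpoint

def is_risky(event):
    """Keep only events whose product title mentions a risky term."""
    title = event.get("title", "").lower()
    return any(term in title for term in RISKY_TERMS)

def post_partition(payloads):
    """POST each JSON payload to the case management software."""
    for payload in payloads:
        requests.post(CASE_MGMT_URL, json=payload)

def handle_risky_batch(rdd):
    rdd.map(lambda kv: json.loads(kv[1])) \
       .filter(is_risky) \
       .map(lambda event: {"title": event["title"]}) \
       .foreachPartition(post_partition)

stream.foreachRDD(handle_risky_batch)
```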

22

23

TIME-WINDOWED AGGREGATIONS

24

The Future

• Time-Windowed Functions – A necessity for most non-trivial jobs (see the sketch after this list)

• Performance Tweaks – We haven’t spent any time on this yet, so there’s lots of potential for gains

• Machine Learning – We could use the risk analysts’ decisions to build an ML model

• Improved Monitoring – We are only monitoring the basics right now

• Apache Cassandra – Others use it as a fast key/value store for their jobs

• Improved Receiverless API – An API to access Kafka and Zookeeper without the hard work
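As a rough illustration of what a time-windowed aggregation might look like with the DStream API: the `merchant_id` field, window sizes, and checkpoint path are assumptions, not details from the talk, and `stream` / `ssc` come from the earlier sketch.

```python
# Hypothetical windowed aggregation: events per merchant over the last
# 10 minutes, recomputed every 30 seconds.
import json

ssc.checkpoint("hdfs:///tmp/risk-checkpoints")   # required for windowed state

per_merchant = stream \
    .map(lambda kv: (json.loads(kv[1]).get("merchant_id"), 1)) \
    .reduceByKeyAndWindow(
        lambda a, b: a + b,    # counts entering the window
        lambda a, b: a - b,    # counts leaving the window
        windowDuration=600,    # 10-minute window
        slideDuration=30)      # slide every 30 seconds

per_merchant.pprint()
```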

25

Icon Credits

Credit Card by Rediffusion from the Noun Project

Money Bag by icon 54 from the Noun Project

26

WE’RE HIRING
SHOPIFY.COM/CAREERS

27

THANK YOU!

@n_e_evans