fraud detection architecture

Real Time Fraud Detection Patterns and reference architectures Ted Malaska // PSA Gwen Shapira // Software Engineer

Upload: gwen-chen-shapira

Post on 21-Apr-2017

Category: Data & Analytics


TRANSCRIPT

Page 1: Fraud Detection Architecture

Real Time Fraud Detection: Patterns and Reference Architectures

Ted Malaska // PSA Gwen Shapira // Software Engineer

Page 2: Fraud Detection Architecture

Overview

• Intro
• Review Problem
• Quick overview of key technology
• High level architecture
• Deep Dive into NRT Processing
• Completing the Puzzle – Micro-batch, Ingest and Batch

©2014 Cloudera, Inc. All rights reserved.

Page 3: Fraud Detection Architecture

Gwen Shapira

• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
  – Sqoop Committer
  – Kafka
  – Flume
• @gwenshap

Page 4: Fraud Detection Architecture

Hello

• Ted Malaska (PSA at Cloudera)
• Hadoop for ~5 years
• Contributed to
  – HDFS, MapReduce, YARN, HBase, Spark, Avro,
  – Kite, Pig, Navigator, Cloudera Manager, Flume, Kafka, Sqoop, Accumulo
  – And working on a Sentry Patch
• Co-Author of O’Reilly Hadoop Application Architectures
• Worked with about 70 companies in 8 countries
• Marvel Fan Boy
• Runner

Page 5: Fraud Detection Architecture

The Problem

Page 6: Fraud Detection Architecture

Credit Card Transaction Fraud

Page 7: Fraud Detection Architecture

IKEA Meatballs

Page 8: Fraud Detection Architecture

Coupon Fraud

Page 9: Fraud Detection Architecture

Video Game Strategy

Page 10: Fraud Detection Architecture

Health Insurance Fraud

Page 11: Fraud Detection Architecture

Review of the Problem

• Typical Atomic Card Fraud Detection
• IKEA Meatballs
• Multi-Coupon Combinations
• OP or Negative Video Game Strategies
• Ad Serving
• Health Insurance Fraud
• Kid Coming Home From School

Page 12: Fraud Detection Architecture

How Do We React

• Human Brain at Tennis
  – Muscle Memory
  – Reaction Thought
  – Reflective Meditation

Page 13: Fraud Detection Architecture

Overview of Key Technologies

Page 14: Fraud Detection Architecture

Kafka

Page 15: Fraud Detection Architecture

The Basics

• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster. Nodes are called brokers

Page 16: Fraud Detection Architecture

Topics, Partitions and Logs

Page 17: Fraud Detection Architecture

Each partition is a log

Page 18: Fraud Detection Architecture

Each Broker has many partitions

[Diagram: copies of Partition 0, Partition 1 and Partition 2 spread across the brokers]

Page 19: Fraud Detection Architecture

Producers load balance between partitions

[Diagram: a Client producing to Partition 0, Partition 1 and Partition 2, which are spread across the brokers]

Page 20: Fraud Detection Architecture

Producers load balance between partitions

[Diagram: the Client and Partitions 0, 1 and 2 across the brokers, as on the previous slide]
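The load-balancing idea on these two slides can be sketched in plain Scala. This is only an illustration under stated assumptions, not Kafka's producer code: `RoundRobinProducer`, `send` and the in-memory `logs` are hypothetical names, and a real producer may choose partitions differently (for example by hashing a message key).

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative only: an in-memory stand-in for a topic with N partitions.
// Each partition is an append-only log, as on the "Each partition is a log" slide.
final class RoundRobinProducer(numPartitions: Int) {
  require(numPartitions > 0, "a topic needs at least one partition")

  val logs: Vector[ArrayBuffer[String]] =
    Vector.fill(numPartitions)(ArrayBuffer.empty[String])

  private var next = 0

  // Appends the message to the next partition in turn and returns
  // (partition, offset), where offset is the position within that log.
  def send(message: String): (Int, Long) = {
    val partition = next
    next = (next + 1) % numPartitions
    logs(partition) += message
    (partition, (logs(partition).size - 1).toLong)
  }
}

object RoundRobinProducerDemo {
  def main(args: Array[String]): Unit = {
    val producer = new RoundRobinProducer(3)
    // Six messages end up spread evenly over the three partitions.
    (1 to 6).map(i => producer.send(s"event-$i")).foreach(println)
  }
}
```

Rotating over partitions spreads write load evenly, at the cost of giving up any ordering between messages that land in different partitions.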

Page 21: Fraud Detection Architecture

Consumers

[Diagram: a Kafka Cluster with one Topic split into Partition A (File), Partition B (File) and Partition C (File). The Consumers of Consumer Group X and Consumer Group Y read the partitions; each partition carries an Offset X and an Offset Y]

• Order retained within partition
• Order retained within partition but not over partitions
• Offsets are kept per consumer group
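A minimal sketch of "offsets are kept per consumer group", in plain Scala rather than the Kafka client API. `PartitionLog`, `poll` and `offsetOf` are hypothetical names chosen for the illustration; a single partition is modelled as an in-memory vector.

```scala
import scala.collection.mutable

// Illustrative only: one partition modelled as an immutable log, plus a
// table of committed offsets keyed by consumer group name.
final class PartitionLog(messages: Vector[String]) {
  private val committed = mutable.Map.empty[String, Int].withDefaultValue(0)

  // Returns up to `max` messages starting at the group's own offset and
  // advances that offset. Other groups are unaffected.
  def poll(group: String, max: Int): Vector[String] = {
    val from = committed(group)
    val batch = messages.slice(from, from + max)
    committed(group) = from + batch.size
    batch
  }

  def offsetOf(group: String): Int = committed(group)
}

object ConsumerGroupDemo {
  def main(args: Array[String]): Unit = {
    val log = new PartitionLog(Vector("m0", "m1", "m2", "m3"))
    println(log.poll("X", 2)) // group X reads from its own offset
    println(log.poll("Y", 1)) // group Y starts from the beginning
    println(log.poll("X", 5)) // group X continues where it left off
  }
}
```

Because each group tracks its own position, two groups can read the same partition at different speeds, and messages within the partition are always seen in log order.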

Page 22: Fraud Detection Architecture

Flume

Page 23: Fraud Detection Architecture

Short Intro to Flume

Flume Agent: Sources → Interceptors → Selectors → Channels → Sinks

• Sources: Twitter, logs, JMS, webserver, Kafka
• Interceptors: Mask, re-format, validate…
• Selectors: DR, critical
• Channels: Memory, file, Kafka
• Sinks: HDFS, HBase, Solr

Page 24: Fraud Detection Architecture

Flume and/or Kafka

[Diagram: a Flume pipeline, UpStream → Flume Source → Interceptor → Selector → Flume Channel → Flume Sink → Down Stream, with three of the stages labelled "Can Be Kafka"]

Page 25: Fraud Detection Architecture

Interceptors

• Mask fields
• Validate information against external source
• Extract fields
• Modify data format
• Filter or split events
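The interceptor roles listed above can be sketched as a chain of small functions. This is plain Scala and deliberately not the Flume `Interceptor` interface: `Event`, `maskCardNumber`, `dropEmpty`, `extractMerchant` and the comma-separated body format are all assumptions made for the example.

```scala
// Illustrative only: a tiny event type and interceptor chain in plain Scala.
final case class Event(headers: Map[String, String], body: String)

object InterceptorSketch {
  // An interceptor either returns a (possibly modified) event or drops it.
  type Interceptor = Event => Option[Event]

  // Mask fields: keep only the last four digits of any 16-digit number.
  val maskCardNumber: Interceptor = e =>
    Some(e.copy(body = e.body.replaceAll("\\b\\d{12}(\\d{4})\\b", "************$1")))

  // Filter events: drop events with an empty body.
  val dropEmpty: Interceptor = e => if (e.body.trim.isEmpty) None else Some(e)

  // Extract fields: copy the first comma-separated field into a header.
  val extractMerchant: Interceptor = e =>
    Some(e.copy(headers = e.headers + ("merchant" -> e.body.takeWhile(_ != ','))))

  // Runs the interceptors in order, stopping as soon as one drops the event.
  def run(chain: Seq[Interceptor], event: Event): Option[Event] =
    chain.foldLeft(Option(event))((acc, i) => acc.flatMap(i))

  def main(args: Array[String]): Unit = {
    val chain = Seq(dropEmpty, maskCardNumber, extractMerchant)
    println(run(chain, Event(Map.empty, "ikea,4111111111111111,19.99")))
    println(run(chain, Event(Map.empty, "   ")))
  }
}
```

Modelling each step as `Event => Option[Event]` keeps masking, extraction and filtering composable: a dropped event short-circuits the rest of the chain.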

Page 26: Fraud Detection Architecture

Spark Streaming

Page 27: Fraud Detection Architecture

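Spark Streaming Example

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// On Spark versions before 1.3, also import org.apache.spark.streaming.StreamingContext._

// Two local threads: one for the receiver, one for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))      // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // DStream of text lines
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)             // counts within each batch
wordCounts.print()
ssc.start()                                           // nothing runs until start()
ssc.awaitTermination()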

Page 28: Fraud Detection Architecture

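Spark Streaming Example (batch version for comparison)

import org.apache.spark.{SparkConf, SparkContext}
// On Spark versions before 1.3, also import org.apache.spark.SparkContext._

val path = "input.txt"                                // hypothetical input path
val conf = new SparkConf().setMaster("local[2]").setAppName("BatchWordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)                      // RDD with at least 2 partitions
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)                 // RDDs have no print(); collect first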

Page 29: Fraud Detection Architecture

Spark Streaming

[Diagram: a DStream shown as a sequence of batches (Pre-first Batch, First Batch, Second Batch). In each batch a Source feeds a Receiver, which produces an RDD; the RDDs of a batch go through a Single Pass of Filter, Count, Print]

Page 30: Fraud Detection Architecture

Spark Streaming

[Diagram: the same DStream batches (Pre-first Batch, First Batch, Second Batch), each with Source → Receiver → RDD and a Single Pass of Filter, Count, Print, now with state carried between batches as Stateful RDD 1 and Stateful RDD 2]
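The stateful picture can be approximated without a cluster. The sketch below is plain Scala, not the Spark Streaming API: each inner `Seq` stands in for one micro-batch, and `update` plays the role of merging a batch into the running state. All names (`StatefulCountSketch`, `countBatch`, `update`, `run`) are hypothetical.

```scala
object StatefulCountSketch {
  type State = Map[String, Int]

  // Per-batch work: count each word within a single micro-batch.
  def countBatch(batch: Seq[String]): State =
    batch.groupBy(identity).map { case (word, ws) => word -> ws.size }

  // Merge one batch's counts into the running state carried between batches.
  def update(state: State, batchCounts: State): State =
    batchCounts.foldLeft(state) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  // Returns the state after each batch, in batch order.
  def run(batches: Seq[Seq[String]]): Seq[State] =
    batches.scanLeft(Map.empty[String, Int])((s, b) => update(s, countBatch(b))).tail

  def main(args: Array[String]): Unit =
    run(Seq(Seq("a", "b", "a"), Seq("b", "c"))).foreach(println)
}
```

Each state depends only on the previous state and the current batch, which is what lets the running totals survive from one micro-batch to the next.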

Page 31: Fraud Detection Architecture

Spark Streaming and HBase

[Diagram: the Driver sends Configs to each Worker Node. On every Worker Node an Executor keeps the Configs and an HConnection in Static Space, shared by the Tasks running in that executor]
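One way to read the "static space" idea is a JVM-wide singleton: tasks in the same executor reuse a single connection rather than opening one each. The sketch below shows that pattern in plain Scala. `FakeConnection` is a hypothetical stand-in for an expensive client such as an HBase connection, and no Spark or HBase API is used.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for an expensive client connection.
final class FakeConnection(val id: Int) {
  def put(rowKey: String, value: String): String = s"conn-$id put $rowKey=$value"
}

// A Scala `object` is initialised once per JVM, so every task running in the
// same executor JVM shares this one connection instead of opening its own.
object ConnectionHolder {
  val created = new AtomicInteger(0)
  lazy val connection: FakeConnection = new FakeConnection(created.incrementAndGet())
}

object StaticSpaceDemo {
  // Simulates the body of one task: process every record of one partition.
  def runTask(records: Seq[(String, String)]): Seq[String] =
    records.map { case (k, v) => ConnectionHolder.connection.put(k, v) }

  def main(args: Array[String]): Unit = {
    runTask(Seq("card-1" -> "ok")).foreach(println)
    runTask(Seq("card-2" -> "review")).foreach(println)
    println(s"connections created: ${ConnectionHolder.created.get}")
  }
}
```

The `lazy val` defers creation until the first task needs it, so the connection is built on the worker rather than serialized from the driver.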

Page 32: Fraud Detection Architecture

High Level Architecture

Page 33: Fraud Detection Architecture

Real-Time Event Processing Approach

[Architecture diagram. Components: Clients ("Swipe here!"), Web App, Kafka, Flume Agents with a Local Cache, HBase / Memory, Spark Streaming, Hadoop Cluster I, and Hadoop Cluster II (Storage and Processing: HDFS, Solr, Hive/Impala, Map/Reduce, Spark, Search). Labelled flows: Fetching & Updating Profiles; Adjusting NRT Stats; HDFSEventSink; Solr Sink; Batch Time Adjustments; Automated & Manual Review of NRT Changes and Counters; Automated & Manual Analytical Adjustments and Pattern detection]

Page 34: Fraud Detection Architecture

NRT Processing

Page 35: Fraud Detection Architecture

Focus on NRT First

[Architecture diagram from the "Real-Time Event Processing Approach" slide, with the "NRT Event Processing with Context" portion highlighted]

Page 36: Fraud Detection Architecture

Streaming Architecture – NRT Event Processing

[Diagram: Flume Sources read the Initial Events Topic from Kafka through a Kafka Consumer. A Flume Interceptor runs the Event Processing Logic, using Local Memory and an HBase Client that talks to HBase. Results go through a Kafka Producer to the Answer Topic in Kafka]

• Able to respond within 10s of milliseconds
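A rough sketch of the processing step in this diagram: look the profile up in local memory first, fall back to the backing store on a miss, then produce an answer. Everything here is an assumption made for illustration, including `ProfileStore` (standing in for the HBase client), the `Profile` fields and the "amount above a multiple of the average" rule; none of it is taken from the deck.

```scala
import scala.collection.mutable

final case class Profile(cardId: String, avgAmount: Double)
final case class Swipe(cardId: String, amount: Double)

// Hypothetical stand-in for the HBase client: any lookup by card id.
trait ProfileStore { def fetch(cardId: String): Option[Profile] }

final class EventProcessor(store: ProfileStore, threshold: Double) {
  private val localCache = mutable.Map.empty[String, Profile]
  var storeLookups = 0 // counts round trips to the backing store

  // Local memory first; only a cache miss goes to the backing store.
  private def profileFor(cardId: String): Option[Profile] =
    localCache.get(cardId).orElse {
      storeLookups += 1
      val fetched = store.fetch(cardId)
      fetched.foreach(p => localCache.update(cardId, p))
      fetched
    }

  // Returns the answer that would be written to the answer topic.
  def process(swipe: Swipe): String = profileFor(swipe.cardId) match {
    case Some(p) if swipe.amount > p.avgAmount * threshold => "REVIEW"
    case Some(_)                                           => "APPROVE"
    case None                                              => "UNKNOWN_CARD"
  }
}

object EventProcessorDemo {
  def main(args: Array[String]): Unit = {
    val store = new ProfileStore {
      def fetch(cardId: String): Option[Profile] =
        if (cardId == "c1") Some(Profile("c1", 20.0)) else None
    }
    val processor = new EventProcessor(store, threshold = 5.0)
    println(processor.process(Swipe("c1", 30.0)))
    println(processor.process(Swipe("c1", 500.0)))
    println(s"store lookups: ${processor.storeLookups}")
  }
}
```

Serving repeat lookups from local memory is what keeps the common path free of network round trips, which matters for a response budget measured in tens of milliseconds.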

Page 37: Fraud Detection Architecture

Partitioned NRT Event Processing

[Diagram: the same pipeline as the previous slide (Kafka Initial Events Topic → Kafka Consumer → Flume Source → Flume Interceptor with Event Processing Logic, Local Memory and HBase Client → Kafka Producer → Kafka Answer Topic, backed by HBase). Upstream, three Producers each use a Partitioner to write to Partition A, Partition B and Partition C of the Topic]

• Custom Partitioner
• Better use of local memory
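A custom partitioner helps local memory because events for the same key always reach the same partition, and so the same processor and its cache. A minimal sketch in plain Scala follows; it does not implement Kafka's partitioner interface, and `CardPartitionerSketch`, `partitionFor` and the choice of card id as the key are assumptions for illustration.

```scala
object CardPartitionerSketch {
  // Maps a key to a partition in [0, numPartitions).
  def partitionFor(cardId: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "need at least one partition")
    // floorMod keeps the result non-negative even for negative hash codes.
    Math.floorMod(cardId.hashCode, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    val cards = Seq("card-1", "card-2", "card-1", "card-3", "card-2")
    // Repeated card ids always print the same partition number.
    cards.foreach(c => println(s"$c -> partition ${partitionFor(c, 3)}"))
  }
}
```

The trade-off is skew: a very active key concentrates load on one partition, which plain round-robin distribution would avoid.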

Page 38: Fraud Detection Architecture

Completing the Puzzle

Page 39: Fraud Detection Architecture

Micro Batching

[Architecture diagram from the "Real-Time Event Processing Approach" slide, with the "Micro Batching" portions highlighted]

Page 40: Fraud Detection Architecture

Complex Topologies

Spark 1.3: Kafka (Initial Events Topic) → Kafka Direct Connection → Spark Streaming → DAG Topologies

• Manages offsets
• Stores offsets in the RDD
• No longer needs HDFS for initial RDD checkpointing

Spark 1.2: Kafka (Initial Events Topic) → Kafka Receivers → Spark Streaming → DAG Topologies

• Lets Kafka manage offsets
• Uses HDFS for initial RDD recovery

Page 41: Fraud Detection Architecture

MicroBatch Bad-Input Handling

[Diagram: four Kafka topics, each drawn as a log with offsets 0 to 13, connected through DAG Topologies]

• Kafka – incoming events topic
• Kafka – bad events topic
• Kafka – resolved events topic
• Kafka – results topic
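The routing implied by these four topics can be sketched per micro-batch: events that parse continue through the topology, and events that do not are set aside for the bad events topic instead of failing the batch. The code is plain Scala with no Kafka or Spark calls, and the `Txn` type and the `cardId,amount` input format are assumptions made for the example.

```scala
import scala.util.{Failure, Success, Try}

final case class Txn(cardId: String, amount: Double)

object BadInputSketch {
  // Parses "cardId,amount"; fails on a wrong field count or a non-numeric amount.
  def parse(raw: String): Try[Txn] = Try {
    raw.split(",", -1) match {
      case Array(cardId, amount) => Txn(cardId.trim, amount.trim.toDouble)
      case _ => throw new IllegalArgumentException(s"expected 2 fields: $raw")
    }
  }

  // Splits one micro-batch into (good, bad): parsed events continue through the
  // topology, raw bad inputs would be published to the bad events topic.
  def route(batch: Seq[String]): (Seq[Txn], Seq[String]) = {
    val parsed = batch.map(raw => raw -> parse(raw))
    val good = parsed.collect { case (_, Success(t)) => t }
    val bad = parsed.collect { case (raw, Failure(_)) => raw }
    (good, bad)
  }

  def main(args: Array[String]): Unit = {
    val (good, bad) = route(Seq("c1,10.5", "garbage", "c2,abc", "c3, 7"))
    println(s"good: $good")
    println(s"bad:  $bad")
  }
}
```

Keeping the raw text of a bad event lets it be corrected and replayed later, which is presumably the role of the resolved events topic.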

Page 42: Fraud Detection Architecture

Ingestion

[Architecture diagram from the "Real-Time Event Processing Approach" slide, with the "Ingestion" portions highlighted]

Page 43: Fraud Detection Architecture

Ingestion

[Diagram: a Kafka Cluster Topic with Partition A, Partition B and Partition C feeds three groups of Flume sinks]

• Flume HDFS Sink (three sinks) → HDFS
• Flume Solr Sink (three sinks) → Solr
• Flume HBase Sink (three sinks) → HBase

Page 44: Fraud Detection Architecture

Reflective Thoughts

[Architecture diagram from the "Real-Time Event Processing Approach" slide, with the "Research and Searching" portion highlighted]

Page 45: Fraud Detection Architecture
