Building a Real-time Ingestion Platform using Kafka and Aerospike
Matteo Pelati, Executive Director, Technology, Head of Data Platform, DBS Bank


Post on 30-Dec-2020


TRANSCRIPT

Page 1: Building a Real-time Ingestion Platform using Kafka and Aerospike (pages.aerospike.com/rs/229-XUE-318/images/2020...)

| 1

Building a Real-time Ingestion Platform using Kafka and Aerospike

Matteo Pelati
EXECUTIVE DIRECTOR, TECHNOLOGY

HEAD OF DATA PLATFORM

DBS BANK

| 2

Batch-oriented Platform

[Diagram: the batch pipeline. At T0 a source file lands in S3; a Flux "Clean & Standardize" step with anchor points produces a Parquet/Carbon file; Sparkola/Flux Spark jobs transform it through further Parquet/Carbon stages at T1–T3; results are loaded into HBase, a SQL DB, and S3, and exposed through the serving layer via Presto and GRAPI.]

| 3

Batch Ingestion

The source system produces daily files and uploads them into an S3 bucket. The ingestion process standardizes the files and converts them to Parquet/Carbon format, which is then served through the API and Presto.


| 4

We need streaming support

| 5

Challenges

• Typical storage formats used in data platforms are not suitable for streaming
• Kafka's historical storage does not fit the purpose, as we do not simply need historical records: records need to be "projected" to represent the current state
• A distributed database for the entire data platform would be too costly, as we keep most of the historical records in S3

| 6

Partial Events

The source system generates events notifying any state change (CICS, CDC) and pushes them to Apache Kafka. How do we turn partial events like these into the current state?

{
  "fields": {
    "phone_number": "91147890"
  },
  "op": "UPDATE",
  "pk": 3434833838
}

{
  "fields": {
    "home address": "91 Kovan Road"
  },
  "op": "UPDATE",
  "pk": 3434833838
}
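A projector can fold such partial events into the current state by merging the fields of events that share a primary key. A minimal sketch in Python, using the event shape shown above (the function name is illustrative):

```python
def project(events):
    """Fold a stream of partial UPDATE events into current state, keyed by pk."""
    state = {}
    for event in events:
        record = state.setdefault(event["pk"], {})
        record.update(event["fields"])  # later events overwrite earlier fields
    return state

events = [
    {"fields": {"phone_number": "91147890"}, "op": "UPDATE", "pk": 3434833838},
    {"fields": {"home_address": "91 Kovan Road"}, "op": "UPDATE", "pk": 3434833838},
]
current = project(events)
# current[3434833838] == {"phone_number": "91147890", "home_address": "91 Kovan Road"}
```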

| 7

Eventing + shadow database

The source system generates events notifying any state change (CICS, CDC) and pushes them to Apache Kafka. A Projector "projects" the events onto the shadow database to replicate the data. The shadow database thus contains an eventually consistent replica of the source system, served through the API and Presto.
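A sketch of such a projector, with a plain dict standing in for the shadow database (Aerospike in this deck); the op names and event shape beyond the examples above are assumptions:

```python
class ShadowStore:
    """Dict-backed stand-in for the shadow database; holds an eventually
    consistent replica built from change events."""

    def __init__(self):
        self.records = {}

    def apply(self, event):
        pk, op = event["pk"], event["op"]
        if op == "DELETE":
            self.records.pop(pk, None)
        else:  # INSERT / UPDATE: merge the partial fields into the record
            self.records.setdefault(pk, {}).update(event.get("fields", {}))

store = ShadowStore()
store.apply({"pk": 1, "op": "INSERT", "fields": {"name": "Alice"}})
store.apply({"pk": 1, "op": "UPDATE", "fields": {"country": "SG"}})
store.apply({"pk": 1, "op": "DELETE"})
# store.records == {}
```

In a real deployment the loop body would be driven by a Kafka consumer and the writes would go to the database client instead of a dict.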

| 8

Database choice

Criteria                     Aerospike  MongoDB  Redis  FDB
Fast key/value lookup            ●         ●       ●     ●
Open Source                      ●         ●       ●     ●
Memory + flash storage           ●         ●       ●     ●
Shared-nothing architecture      ●         ●       ●     ●
Complex data structures          ●         ●       ●     ●
Secondary Indexes                ●         ●       ●     ●
Commercial Support               ●         ●       ●     ●

Based on the table above, we decided to adopt Aerospike as the database of choice for all services.

| 9

With Real Time ingestion

[Diagram: the batch pipeline extended with streaming. Alongside the T0–T3 batch flow (source files in S3 → Flux "Clean & Standardize" with anchor points → Parquet/Carbon files → Sparkola/Flux Spark jobs), each stage now also has a Kafka topic feeding a FluxStream "Clean & Standardize" step and Spark Streaming jobs, which maintain real-time snapshots next to the Parquet/Carbon files. The serving layer now spans S3, SQL DB, HBase, Druid, Elastic, and Aerospike, exposed through Presto and GRAPI.]

| 10

We needed a Global Model

The Cards source system and the CASA source system each feed their own Projector, producing a Cards customer model and a CASA customer model. The same business entity looks very different from source system to source system, so we need an efficient way to "globalize" the model stored in the database.

| 11

Introducing "the Globalizer"

Every source system model gets translated into a Global Customer Model by a Globalizer component sitting between the Projector and the database. The Globalizer is fully configurable using a custom DSL, so domain developers can easily plug in new source systems without writing any code.
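The deck does not show the DSL itself; the hypothetical declarative mapping below only illustrates the idea: per-source field mappings configured as data, so a new source system is added by editing configuration rather than code. All field and source names here are invented:

```python
# Hypothetical per-source-system field mappings (source field -> global field).
GLOBAL_MAPPINGS = {
    "cards": {"cust_no": "customer_id", "mobile": "phone_number"},
    "casa":  {"cif": "customer_id", "phone": "phone_number"},
}

def globalize(source, record):
    """Translate a source-specific record into the global customer model."""
    mapping = GLOBAL_MAPPINGS[source]
    return {global_name: record[src_name]
            for src_name, global_name in mapping.items()
            if src_name in record}

globalize("cards", {"cust_no": "C1", "mobile": "91147890"})
# -> {"customer_id": "C1", "phone_number": "91147890"}
```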

| 12

Data Format

• De facto standard for streaming: Avro is the format of choice for Apache Kafka
• RPC support: Avro can seamlessly be used for implementing RPC services
• Compact: it's binary, thus it's fast
• Evolutionary: it supports schema evolution very well
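Avro's schema evolution works through defaults during schema resolution: a reader using a newer schema can still decode records written with an older one, provided added fields carry a default. A hypothetical example (the `Segment` field is an invented addition, not from the deck):

```json
{
  "type": "record",
  "namespace": "Customers",
  "name": "Customer",
  "fields": [
    {"name": "CustomerId", "type": "string"},
    {"name": "Country", "type": "string"},
    {"name": "Segment", "type": ["null", "string"], "default": null}
  ]
}
```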

| 13

Storing the data

[Diagram: the source system feeds the Projector and Globalizer, which maintain the Global Customer Model; public events are published from the model, and periodical snapshots are taken of it.]

| 14

The Avro Record Format

How can we speed up the responses even further? Just dump the Avro payload into the database.

{
  "type": "record",
  "namespace": "Customers",
  "name": "Customer",
  "fields": [
    { "name": "CustomerId", "type": "string", "pk": true },
    { "name": "Country", "type": "string", "sk": true }
  ]
}

Stored record:
Payload: a002ef10cc76eb21964abbf3489
CustomerId: 366635326
Country: SG

The Payload field is used to return the raw data to the client during an API call, while the other fields are solely used for creating secondary indexes. This design was inspired by the Record Layer of FDB (FoundationDB) from Apple.
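The layout above can be sketched as follows: the raw Avro payload is stored opaquely, while a few extracted fields exist only to build secondary indexes. A dict plus an index map stands in for Aerospike here; function names and payload bytes are illustrative:

```python
records = {}          # primary key -> {"payload": raw bytes, index fields...}
country_index = {}    # secondary index: country -> set of primary keys

def put(customer_id, country, payload):
    """Store the opaque Avro payload plus the fields used only for indexing."""
    records[customer_id] = {"payload": payload, "Country": country}
    country_index.setdefault(country, set()).add(customer_id)

def get_payloads_by_country(country):
    """API path: return raw payloads; clients decode the Avro themselves."""
    return [records[pk]["payload"] for pk in country_index.get(country, ())]

put("366635326", "SG", b"\xa0\x02\xef\x10...")  # payload bytes are illustrative
get_payloads_by_country("SG")
```

Returning the stored bytes verbatim avoids a decode/re-encode round trip on every API call, which is the speed-up the slide describes.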

| 15

We still have APIs

Periodical snapshots of the Global Customer Model feed the serving layer, which exposes the data through a Presto adapter and a GraphQL data access layer (GRAPI).

| 16

We still have challenges

• Eventual consistency sometimes is a problem: applications are not designed around the eventual-consistency principle. They expect data to be updated immediately after an action, which is not the case in an eventually consistent system.
• The batch nature of core systems sometimes requires the cache to be completely refreshed: source systems cannot always generate events in real time. Sometimes, after a batch operation, the cache needs to be refreshed in bulk.
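One common way to handle the bulk-refresh case (an assumption on my part, not the mechanism described in the deck) is generation swapping: load the new snapshot under a fresh generation, then flip the active pointer so readers never observe a half-loaded cache.

```python
class GenerationalCache:
    """Sketch of generation-swapped bulk refresh; a dict stands in for the cache."""

    def __init__(self):
        self.generations = {0: {}}
        self.active = 0

    def bulk_refresh(self, snapshot):
        new_gen = self.active + 1
        self.generations[new_gen] = dict(snapshot)  # load the full snapshot first
        self.active = new_gen                       # then flip readers over
        self.generations.pop(new_gen - 1, None)     # drop the stale copy

    def get(self, key):
        return self.generations[self.active].get(key)

cache = GenerationalCache()
cache.bulk_refresh({"c1": {"country": "SG"}})
cache.get("c1")
# -> {"country": "SG"}
```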


| 17

Thank You