| 1
Building a Real-time Ingestion Platform using Kafka and Aerospike
Matteo Pelati
Executive Director, Technology
Head of Data Platform
DBS Bank
| 2
Batch-oriented Platform
[Diagram: batch pipeline across stages T0-T3. Source files land in S3, are cleaned and standardized by Flux into Parquet/Carbon files with anchor points, then transformed by Sparkola Spark jobs into further Parquet/Carbon files. The serving layer sits on HBase, a SQL DB and S3, exposed through Presto and a GraphQL API.]
| 3
Batch Ingestion
[Diagram: Source System → S3 bucket → Ingestion Process → API / Presto]
The source system produces daily files and uploads them into an S3 bucket. The ingestion process standardizes the files and converts them to Parquet/Carbon format.
| 4
We need streaming support
| 5
Challenges
• Typical storage formats used in data platforms are not suitable for streaming
• Kafka's historical storage does not fit the purpose: we do not simply need historical records, records need to be "projected" to represent the current state
• A distributed database for the entire data platform would be too costly, as we keep most of the historical records in S3
| 6
Partial Events
[Diagram: Source System → Apache Kafka]
The source system generates events notifying any state change (CICS, CDC) and pushes them to Kafka.
{
  "fields": {
    "phone_number": "91147890"
  },
  "op": "UPDATE",
  "pk": 3434833838
}
{
  "fields": {
    "home address": "91 Kovan Road"
  },
  "op": "UPDATE",
  "pk": 3434833838
}
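The projection idea behind these partial events can be sketched in a few lines. This is a minimal illustration, not the actual Projector: the in-memory dict stands in for the shadow database (Aerospike in the real platform), and events are assumed to have the shape shown above.

```python
# In-memory stand-in for the shadow database (Aerospike in the real platform).
shadow_db = {}

def project(event: dict) -> None:
    """Apply one partial CDC event to the current-state record for its pk."""
    pk = event["pk"]
    if event["op"] == "UPDATE":
        # Merge only the changed fields into the existing record.
        shadow_db.setdefault(pk, {}).update(event["fields"])
    elif event["op"] == "DELETE":
        shadow_db.pop(pk, None)

events = [
    {"fields": {"phone_number": "91147890"}, "op": "UPDATE", "pk": 3434833838},
    {"fields": {"home address": "91 Kovan Road"}, "op": "UPDATE", "pk": 3434833838},
]
for e in events:
    project(e)

print(shadow_db[3434833838])
# {'phone_number': '91147890', 'home address': '91 Kovan Road'}
```

After both events are applied, the record holds the merged current state rather than two separate historical entries, which is exactly why plain Kafka retention is not enough.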
| 7
Eventing + shadow database
[Diagram: Source System → Apache Kafka → Projector → shadow database → API / Presto]
The source system generates events notifying any state change (CICS, CDC) and pushes them to Kafka. Events are "projected" by the Projector onto the shadow database to replicate the data. The shadow database thus contains an eventually consistent replica of the source system.
| 8
Database choice
                             Aerospike   MongoDB   Redis   FDB
Fast key/value lookup            ●           ●        ●      ●
Open Source                      ●           ●        ●      ●
Memory + flash storage           ●           ●        ●      ●
Shared-nothing architecture      ●           ●        ●      ●
Complex data structures          ●           ●        ●      ●
Secondary indexes                ●           ●        ●      ●
Commercial support               ●           ●        ●      ●
Based on the table above, we decided to adopt Aerospike as the database of choice for all services.
| 9
With Real-Time Ingestion
[Diagram: the batch pipeline extended with streaming. At each stage T0-T3, a Kafka topic feeds a FluxStream Spark Streaming job that cleans, standardizes and anchors the data in parallel with the batch Flux/Sparkola Spark jobs; real-time snapshots sit alongside the Parquet/Carbon files, and the serving layer adds Druid, Elastic and Aerospike next to HBase, the SQL DB and S3, exposed through Presto and a GraphQL API.]
| 10
We needed a Global Model
[Diagram: Cards Source System → Projector → Cards Customer Model; CASA Source System → Projector → CASA Customer Model]
The same business entity looks very different from one source system to another. We need an efficient way to "globalize" the model stored in the database.
| 11
Introducing "the Globalizer"
[Diagram: Source System → Projector → Globalizer → Global Customer Model]
Every model gets translated into a global model by a Globalizer component, which is fully configurable using a custom DSL. This way, domain developers can easily plug in new source systems without having to write any code.
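The slides do not show the DSL itself, so the sketch below uses a plain dict-based mapping as a hypothetical stand-in: each source system declares how its field names map onto the global customer model, and a generic function applies the mapping. All field names here (`cust_no`, `cif`, `mobile`, `contact_phone`) are invented for illustration.

```python
# Hypothetical stand-in for the Globalizer's mapping DSL: per source system,
# a declarative map from source-specific field names to global-model names.
GLOBALIZER_CONFIG = {
    "cards": {"cust_no": "customer_id", "mobile": "phone_number"},
    "casa": {"cif": "customer_id", "contact_phone": "phone_number"},
}

def globalize(source_system: str, record: dict) -> dict:
    """Translate a source-specific record into the global customer model."""
    mapping = GLOBALIZER_CONFIG[source_system]
    return {
        global_name: record[src_name]
        for src_name, global_name in mapping.items()
        if src_name in record
    }

cards_record = {"cust_no": "366635326", "mobile": "91147890"}
print(globalize("cards", cards_record))
# {'customer_id': '366635326', 'phone_number': '91147890'}
```

Because the mapping is pure configuration, onboarding a new source system means adding an entry to the config rather than writing translation code, which matches the "no code" goal on the slide.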
| 12
Data Format
• De facto standard for streaming: Avro is the format of choice for Apache Kafka
• RPC support: Avro can seamlessly be used for implementing RPC services
• Compact: it's binary, thus it's fast
• Evolutionary: it supports schema evolution very well
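Avro's schema evolution works by resolving the writer's schema against the reader's, with defaults filling in missing fields. As a hypothetical sketch (the `Segment` field is invented for illustration), an evolved `Customer` schema could add an optional field without breaking existing readers:

```json
{
  "type": "record",
  "namespace": "Customers",
  "name": "Customer",
  "fields": [
    { "name": "CustomerId", "type": "string" },
    { "name": "Country", "type": "string" },
    { "name": "Segment", "type": ["null", "string"], "default": null }
  ]
}
```

Records written before `Segment` existed are read back with the default `null`, so producers and consumers can upgrade independently.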
| 13
Storing the data
[Diagram: Source System → Projector → Globalizer → Global Customer Model. The Global Customer Model publishes public events and periodical snapshots.]
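The point of the periodical snapshots can be shown with a small sketch: a new consumer bootstraps from the latest snapshot and replays only the public events published after it, instead of replaying the full history. This is an illustration of the pattern, not the platform's actual snapshot code; event shapes follow the partial-event JSON shown earlier.

```python
def apply_event(state: dict, event: dict) -> None:
    """Apply one partial event in place (same projection rule as the Projector)."""
    if event["op"] == "UPDATE":
        state.setdefault(event["pk"], {}).update(event["fields"])
    elif event["op"] == "DELETE":
        state.pop(event["pk"], None)

def bootstrap(snapshot: dict, events_since: list) -> dict:
    """Rebuild current state from the latest periodical snapshot plus the
    events published after it, rather than the entire event history."""
    state = {pk: dict(fields) for pk, fields in snapshot.items()}
    for e in events_since:
        apply_event(state, e)
    return state

snap = {1: {"country": "SG"}}
tail = [{"op": "UPDATE", "pk": 1, "fields": {"segment": "retail"}}]
print(bootstrap(snap, tail))
# {1: {'country': 'SG', 'segment': 'retail'}}
```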
| 14
The Avro Record Format
How can we speed up responses even further? Just dump the Avro payload into the database.
{
  "type": "record",
  "namespace": "Customers",
  "name": "Customer",
  "fields": [
    { "name": "CustomerId", "type": "string", "pk": true },
    { "name": "Country", "type": "string", "sk": true }
  ]
}
Payload: a002ef10cc76eb21964abbf3489
CustomerId: 366635326
Country: SG
The Payload field is used to return the raw data to the client during an API call, while the other fields are used solely for creating secondary indexes. This design was inspired by the Record Layer of FDB from Apple.
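The record layout above can be sketched as follows. This is a simplified illustration using in-memory dicts, not the Aerospike implementation: the raw Avro payload is stored as an opaque blob, and only the `pk`/`sk` fields are extracted to maintain an index, so the read path returns bytes without any deserialization.

```python
# Records keyed by primary key; the Avro payload stays an opaque blob.
records = {}
# Secondary index on Country: value -> set of primary keys.
country_index = {}

def put(customer_id: str, country: str, payload: bytes) -> None:
    """Write path: store the blob, extract only the indexed fields."""
    records[customer_id] = {"Payload": payload, "Country": country}
    country_index.setdefault(country, set()).add(customer_id)

def get_by_country(country: str) -> list:
    """Read path for an API call: look up the index and return the raw
    payloads directly, with no deserialization of the Avro body."""
    return [records[pk]["Payload"] for pk in country_index.get(country, ())]

put("366635326", "SG", b"\xa0\x02\xef\x10...")
print(get_by_country("SG"))
```

Because the serving layer never parses the payload, responses are essentially index lookup plus blob copy, which is what makes this layout fast.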
| 15
We still have APIs
[Diagram: the Global Customer Model and its periodical snapshots feed the serving layer, where a Presto adapter and a GraphQL data access layer expose the data through Presto and API calls.]
| 16
We still have challenges
• Eventual consistency is sometimes a problem: applications are not designed around eventual-consistency principles. They expect data to be updated immediately after an action, which is not the case in an eventually consistent system.
• The batch nature of core systems sometimes requires the cache to be completely refreshed: source systems cannot always generate events in real time. Sometimes, after a batch operation, the cache needs to be refreshed in bulk.
| 17
Thank You