| 1
Building a Real-time Ingestion Platform using Kafka and Aerospike
Matteo Pelati
Executive Director, Technology
Head of Data Platform
DBS Bank
| 2
Batch-oriented Platform
[Diagram: batch pipeline across stages T0-T3. Source files land in S3, are cleaned and standardized by Flux into Parquet/Carbon files with anchor points, then transformed by Sparkola Spark jobs into further Parquet/Carbon files. The serving layer sits on HBase, a SQL DB and S3, exposed through Presto and a GraphQL API.]
| 3
Batch Ingestion
[Diagram: Source System → S3 bucket → Ingestion Process → API / Presto]
The source system produces daily files and uploads them into an S3 bucket. The ingestion process standardizes the files and converts them to Parquet/Carbon format.
| 4
We need streaming support
| 5
Challenges
• Typical storage formats used in data platforms are not suitable for streaming
• Kafka's historical storage does not fit the purpose: we do not simply need historical records, records need to be "projected" to represent the current state
• A distributed database for the entire data platform would be too costly, as we keep most of the historical records in S3
| 6
Partial Events
[Diagram: Source System → Apache Kafka]
The source system generates events notifying any state change (CICS, CDC) and pushes them to Kafka.
{
  "fields": {
    "phone_number": "91147890"
  },
  "op": "UPDATE",
  "pk": 3434833838
}
{
  "fields": {
    "home address": "91 Kovan Road"
  },
  "op": "UPDATE",
  "pk": 3434833838
}
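The projection idea behind these partial events can be sketched in a few lines. This is a minimal illustration, not the actual Projector: the in-memory dict stands in for the shadow database (Aerospike in the real platform), and events are assumed to have the shape shown above.

```python
# In-memory stand-in for the shadow database (Aerospike in the real platform).
shadow_db = {}

def project(event: dict) -> None:
    """Apply one partial CDC event to the current-state record for its pk."""
    pk = event["pk"]
    if event["op"] == "UPDATE":
        # Merge only the changed fields into the existing record.
        shadow_db.setdefault(pk, {}).update(event["fields"])
    elif event["op"] == "DELETE":
        shadow_db.pop(pk, None)

events = [
    {"fields": {"phone_number": "91147890"}, "op": "UPDATE", "pk": 3434833838},
    {"fields": {"home address": "91 Kovan Road"}, "op": "UPDATE", "pk": 3434833838},
]
for e in events:
    project(e)

print(shadow_db[3434833838])
# {'phone_number': '91147890', 'home address': '91 Kovan Road'}
```

After both events are applied, the record holds the merged current state rather than two separate historical entries, which is exactly why plain Kafka retention is not enough.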
| 7
Eventing + shadow database
[Diagram: Source System → Apache Kafka → Projector → shadow database → API / Presto]
The source system generates events notifying any state change (CICS, CDC) and pushes them to Kafka. Events are "projected" by the Projector onto the shadow database to replicate the data. The shadow database thus contains an eventually consistent replica of the source system.
| 8
Database choice
                             Aerospike   MongoDB   Redis   FDB
Fast key/value lookup            ●           ●        ●      ●
Open Source                      ●           ●        ●      ●
Memory + flash storage           ●           ●        ●      ●
Shared-nothing architecture      ●           ●        ●      ●
Complex data structures          ●           ●        ●      ●
Secondary indexes                ●           ●        ●      ●
Commercial support               ●           ●        ●      ●
Based on the table above, we decided to adopt Aerospike as the database of choice for all services.
| 9
With Real-Time Ingestion
[Diagram: the batch pipeline extended with streaming. At each stage T0-T3, a Kafka topic feeds a FluxStream Spark Streaming job that cleans, standardizes and anchors the data in parallel with the batch Flux/Sparkola Spark jobs; real-time snapshots sit alongside the Parquet/Carbon files, and the serving layer adds Druid, Elastic and Aerospike next to HBase, the SQL DB and S3, exposed through Presto and a GraphQL API.]
| 10
We needed a Global Model
[Diagram: Cards Source System → Projector → Cards Customer Model; CASA Source System → Projector → CASA Customer Model]
The same business entity looks very different from one source system to another. We need an efficient way to "globalize" the model stored in the database.
| 11
Introducing "the Globalizer"
[Diagram: Source System → Projector → Globalizer → Global Customer Model]
Every model gets translated into a global model by a Globalizer component, which is fully configurable using a custom DSL. This way, domain developers can easily plug in new source systems without having to write any code.
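The slides do not show the DSL itself, so the sketch below uses a plain dict-based mapping as a hypothetical stand-in: each source system declares how its field names map onto the global customer model, and a generic function applies the mapping. All field names here (`cust_no`, `cif`, `mobile`, `contact_phone`) are invented for illustration.

```python
# Hypothetical stand-in for the Globalizer's mapping DSL: per source system,
# a declarative map from source-specific field names to global-model names.
GLOBALIZER_CONFIG = {
    "cards": {"cust_no": "customer_id", "mobile": "phone_number"},
    "casa": {"cif": "customer_id", "contact_phone": "phone_number"},
}

def globalize(source_system: str, record: dict) -> dict:
    """Translate a source-specific record into the global customer model."""
    mapping = GLOBALIZER_CONFIG[source_system]
    return {
        global_name: record[src_name]
        for src_name, global_name in mapping.items()
        if src_name in record
    }

cards_record = {"cust_no": "366635326", "mobile": "91147890"}
print(globalize("cards", cards_record))
# {'customer_id': '366635326', 'phone_number': '91147890'}
```

Because the mapping is pure configuration, onboarding a new source system means adding an entry to the config rather than writing translation code, which matches the "no code" goal on the slide.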
| 12
Data Format
• De facto standard for streaming: Avro is the format of choice for Apache Kafka
• RPC support: Avro can seamlessly be used for implementing RPC services
• Compact: it's binary, thus it's fast
• Evolutionary: it supports schema evolution very well
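Avro's schema evolution works by resolving the writer's schema against the reader's, with defaults filling in missing fields. As a hypothetical sketch (the `Segment` field is invented for illustration), an evolved `Customer` schema could add an optional field without breaking existing readers:

```json
{
  "type": "record",
  "namespace": "Customers",
  "name": "Customer",
  "fields": [
    { "name": "CustomerId", "type": "string" },
    { "name": "Country", "type": "string" },
    { "name": "Segment", "type": ["null", "string"], "default": null }
  ]
}
```

Records written before `Segment` existed are read back with the default `null`, so producers and consumers can upgrade independently.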
| 13
Storing the data
[Diagram: Source System → Projector → Globalizer → Global Customer Model. The Global Customer Model publishes public events and periodical snapshots.]
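The point of the periodical snapshots can be shown with a small sketch: a new consumer bootstraps from the latest snapshot and replays only the public events published after it, instead of replaying the full history. This is an illustration of the pattern, not the platform's actual snapshot code; event shapes follow the partial-event JSON shown earlier.

```python
def apply_event(state: dict, event: dict) -> None:
    """Apply one partial event in place (same projection rule as the Projector)."""
    if event["op"] == "UPDATE":
        state.setdefault(event["pk"], {}).update(event["fields"])
    elif event["op"] == "DELETE":
        state.pop(event["pk"], None)

def bootstrap(snapshot: dict, events_since: list) -> dict:
    """Rebuild current state from the latest periodical snapshot plus the
    events published after it, rather than the entire event history."""
    state = {pk: dict(fields) for pk, fields in snapshot.items()}
    for e in events_since:
        apply_event(state, e)
    return state

snap = {1: {"country": "SG"}}
tail = [{"op": "UPDATE", "pk": 1, "fields": {"segment": "retail"}}]
print(bootstrap(snap, tail))
# {1: {'country': 'SG', 'segment': 'retail'}}
```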
| 14
The Avro Record Format
How can we speed up responses even further? Just dump the Avro payload into the database.
{
  "type": "record",
  "namespace": "Customers",
  "name": "Customer",
  "fields": [
    { "name": "CustomerId", "type": "string", "pk": true },
    { "name": "Country", "type": "string", "sk": true }
  ]
}
Payload: a002ef10cc76eb21964abbf3489
CustomerId: 366635326
Country: SG
The Payload field is used to return the raw data to the client during an API call, while the other fields are used solely for creating secondary indexes. This design was inspired by the Record Layer of FDB from Apple.
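The record layout above can be sketched as follows. This is a simplified illustration using in-memory dicts, not the Aerospike implementation: the raw Avro payload is stored as an opaque blob, and only the `pk`/`sk` fields are extracted to maintain an index, so the read path returns bytes without any deserialization.

```python
# Records keyed by primary key; the Avro payload stays an opaque blob.
records = {}
# Secondary index on Country: value -> set of primary keys.
country_index = {}

def put(customer_id: str, country: str, payload: bytes) -> None:
    """Write path: store the blob, extract only the indexed fields."""
    records[customer_id] = {"Payload": payload, "Country": country}
    country_index.setdefault(country, set()).add(customer_id)

def get_by_country(country: str) -> list:
    """Read path for an API call: look up the index and return the raw
    payloads directly, with no deserialization of the Avro body."""
    return [records[pk]["Payload"] for pk in country_index.get(country, ())]

put("366635326", "SG", b"\xa0\x02\xef\x10...")
print(get_by_country("SG"))
```

Because the serving layer never parses the payload, responses are essentially index lookup plus blob copy, which is what makes this layout fast.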
| 15
We still have APIs
[Diagram: the Global Customer Model and its periodical snapshots feed the serving layer, where a Presto adapter and a GraphQL data access layer expose the data through Presto and API calls.]
| 16
We still have challenges
• Eventual consistency is sometimes a problem: applications are not designed around eventual-consistency principles. They expect data to be updated immediately after an action, which is not the case in an eventually consistent system.
• The batch nature of core systems sometimes requires the cache to be completely refreshed: source systems cannot always generate events in real time. Sometimes, after a batch operation, the cache needs to be refreshed in bulk.
| 17
Thank You