Real-Time ETL Processing Using Spark Streaming

Post on 16-Apr-2017

Real Time ETL processing

By Veeramani Moorthy

Agenda

Real time ETL Architecture

Why Reconciler?

Reconciler Data model

Q & A?

Requirements for Reconciler

Real-Time ETL Architecture

[Architecture diagram; only its labels survive in the transcript]

• [1.0] Data Extract (Source DB, GoldenGate)
• [1.1] Data Pump (Trail Files, Adapter Read)
• [1.2] Get/Create/Update Schema (Schema Registry)
• [1.2.1] JDBC Fetch Table Schema
• [2.1] CDC Events to broker (Companies Topic, Addresses Topic)
• Spark Reconciler: Streaming Reconciler job (shown twice), Get Table Schema, Read/Write for Reconcile Companies, Read/Write for Reconcile Addresses, Write output (Reconciled Companies Topic, Reconciled Addresses Topic)
• [3.1] CDC Events to broker
• Spark Joiner: Streaming Joiner/Transformer Job (shown twice), Get Table Schema, fn, Mapping service (Get Mapping)

Legend

• Schema Registry is a repository of all schemas, which are versioned.
• GoldenGate captures the table change events.
• Kafka: distributed messaging system.
• CDC: Change Data Capture.

Requirements for Reconciler

Support for Idempotency

Support for immutability

Support for Schema evolution

Support to handle out of order CDC events

Challenges in Spark streaming

Out of sequence

The UPDATE event arrives first; the INSERT event arrives later

Data model

Instead of

Company_id | Company_name | Company_addr
10201 | ABC Inc | EGL, BLR
….

Go with

Tuple Id | Source DB Timestamp | Attribute Name | Attribute value | isDelete?
10201 | 12345677 | company_id | 10201 | false
10201 | 12345677 | company_name | ABC Inc | false
10201 | 12345677 | company_addr | EGL, BLR | false
10201 | 22345677 | company_addr | Ecospace, BLR | false
….
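
The attribute-level layout shown above can be illustrated with a small sketch. It is not taken from the talk: the `AttributeTuple` type and the `explode_change` helper are hypothetical names, and the sketch assumes each CDC event carries the row key, the source DB timestamp, and only the columns that changed.

```python
from typing import Any, Dict, List, NamedTuple


class AttributeTuple(NamedTuple):
    """One row of the attribute-level model (hypothetical type name)."""
    tuple_id: int
    source_ts: int
    attribute_name: str
    attribute_value: Any
    is_delete: bool


def explode_change(tuple_id: int, source_ts: int,
                   changed: Dict[str, Any]) -> List[AttributeTuple]:
    """Turn one change event into one tuple per changed column."""
    return [
        AttributeTuple(tuple_id, source_ts, name, value, False)
        for name, value in changed.items()
    ]


# The INSERT carries every column; the later UPDATE carries only the new address.
tuples = explode_change(10201, 12345677, {
    "company_id": 10201,
    "company_name": "ABC Inc",
    "company_addr": "EGL, BLR",
})
tuples += explode_change(10201, 22345677, {"company_addr": "Ecospace, BLR"})

for t in tuples:
    print(t)
```

Each event only appends rows, so the old address stays in the log next to the new one.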

How does it solve?

Immutability?

Idempotency?

Out of sequence events?
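
The transcript does not record the answers to these questions, so the following is only one plausible reading, sketched under stated assumptions: tuples are appended and never updated (immutability), the store behaves like a set so a redelivered tuple changes nothing (idempotency), and the current value of an attribute is the one with the highest source DB timestamp, whatever the arrival order (out of sequence events). The `reconcile` function and the in-memory set are hypothetical stand-ins for the real reconciler job and its store.

```python
from typing import Any, Dict, Set, Tuple

# (tuple_id, source_ts, attribute_name, attribute_value, is_delete)
AttributeTuple = Tuple[int, int, str, Any, bool]


def reconcile(store: Set[AttributeTuple], tuple_id: int) -> Dict[str, Any]:
    """Build the current row: per attribute, the highest source timestamp wins."""
    latest: Dict[str, AttributeTuple] = {}
    for t in store:
        if t[0] != tuple_id:
            continue
        name, ts = t[2], t[1]
        if name not in latest or ts > latest[name][1]:
            latest[name] = t
    # Attributes whose latest tuple is marked deleted are left out.
    return {name: t[3] for name, t in latest.items() if not t[4]}


insert_event = [
    (10201, 12345677, "company_id", 10201, False),
    (10201, 12345677, "company_name", "ABC Inc", False),
    (10201, 12345677, "company_addr", "EGL, BLR", False),
]
update_event = [(10201, 22345677, "company_addr", "Ecospace, BLR", False)]

# Normal order: INSERT, then UPDATE.
in_order: Set[AttributeTuple] = set()
in_order.update(insert_event)
in_order.update(update_event)

# Out of sequence with a duplicate delivery: UPDATE, INSERT, UPDATE again.
out_of_order: Set[AttributeTuple] = set()
out_of_order.update(update_event)
out_of_order.update(insert_event)
out_of_order.update(update_event)

print(reconcile(in_order, 10201))
print(reconcile(out_of_order, 10201))
```

Both stores hold the same four tuples, so both calls return the same row with the later address. The sketch does not define what happens when two tuples for one attribute share a timestamp; a real job would need a tie-breaking rule.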

Schema Evolution

Tuple Id | Source DB Timestamp | Attribute Name | Attribute value | isDelete?
10201 | 12345677 | company_id | 10201 | false
10201 | 12345677 | company_name | ABC Inc | false
10201 | 12345677 | company_addr | EGL, BLR | false
10201 | 22345677 | company_addr | Ecospace, BLR | false
10201 | 22345900 | Registered_name | ABC India Pvt Ltd | false
….

Do I have to change the destination schema?

Schema Evolution

Addition of new column

Deletion of an existing column

Data Type change
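
A rough sketch of how the three cases might look in the attribute-level model follows. Only the `Registered_name` row comes from the slides. The other two events, their timestamps, and the use of the isDelete flag to mark a dropped column are assumptions made for illustration, and `current_row` is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple

# (tuple_id, source_ts, attribute_name, attribute_value, is_delete)
AttributeTuple = Tuple[int, int, str, Any, bool]


def current_row(tuples: List[AttributeTuple]) -> Dict[str, Any]:
    """Latest value per attribute; attributes last marked deleted are dropped."""
    latest: Dict[str, Tuple[int, Any, bool]] = {}
    for _, ts, name, value, is_delete in tuples:
        if name not in latest or ts > latest[name][0]:
            latest[name] = (ts, value, is_delete)
    return {n: v for n, (_, v, deleted) in latest.items() if not deleted}


log: List[AttributeTuple] = [
    (10201, 12345677, "company_id", 10201, False),
    (10201, 12345677, "company_name", "ABC Inc", False),
    (10201, 12345677, "company_addr", "EGL, BLR", False),
    # Addition of a new column: just a new attribute name (row from the slides).
    (10201, 22345900, "Registered_name", "ABC India Pvt Ltd", False),
    # Deletion of an existing column (assumed event and timestamp).
    (10201, 32345000, "company_addr", None, True),
    # Data type change (assumed event): company_id now arrives as a string.
    (10201, 32345001, "company_id", "10201", False),
]

print(current_row(log))
```

In this sketch the five-column destination layout never changes: a new column is a new attribute name, a dropped column is one more tuple, and a type change is a new value stored as-is.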
