Big Data Hadoop Streaming ETL template for Database to Database


Upload: datatorrent

Post on 21-Jan-2018


Category:

Technology



TRANSCRIPT

Page 1: Big Data Hadoop Streaming ETL template for Database to Database

Big Data Hadoop Streaming ETL template for Database to Database

Hitesh Kapoor, Mohit Jotwani

[email protected]@datatorrent.com

Page 2: Big Data Hadoop Streaming ETL template for Database to Database

Agenda

• DataTorrent - Vision
• About Apache Apex
• App templates
• Database to Database App Template
• Live demo
• Roadmap

Page 3: Big Data Hadoop Streaming ETL template for Database to Database

DataTorrent Vision - Productize Big Data

• Big Data is neither Productized nor Operationalized
• Total Cost of Ownership (TCO) includes
  • Time to Develop + Time to Launch + Cost of ongoing Operations
• Provide a Product to ...
  • Build Applications Rapidly with Simple Interfaces, Pre-Built Apps, Code Reuse & Debuggability
  • Support Dev, Test, Prod cycle to Launch Apps quickly
  • Manage and Visualize Applications for Operability

Page 4: Big Data Hadoop Streaming ETL template for Database to Database

Next Gen Big Data Applications

[Diagram labels: Browser, Web Server, Logs, Kafka, Kafka Input (logs), Decompress/Parse/Filter, Dimensions Aggregate, Kafka]

• Variety of sources - IoT, Kafka, files, social media etc.
• Variety of sinks - Kafka, files, databases etc.
• Supports low latency real time visualizations as well
• Unbounded and continuous data streams
• Batch support, batch processed as stream
• In-memory processing with temporal window boundaries
• Stateful operations: Aggregation, Rules etc. → Analytics
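
In-memory processing with temporal window boundaries and stateful aggregation can be illustrated with a minimal Python sketch. This is not Apache Apex code: the function name `tumbling_window_counts`, the `(timestamp, key)` event layout, and the 10-second window size are assumptions chosen for illustration, and a bounded list stands in for an unbounded stream.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=10):
    """Group (timestamp, key) events into fixed windows and count keys per window.

    Hypothetical helper for illustration; not part of Apache Apex.
    """
    # State kept in memory: window start time -> key -> count.
    state = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # The window boundary is the timestamp rounded down to a multiple
        # of the window size.
        window_start = (timestamp // window_seconds) * window_seconds
        state[window_start][key] += 1
    # Convert to plain dicts, ordered by window start.
    return {start: dict(counts) for start, counts in sorted(state.items())}

events = [(1, "login"), (4, "click"), (9, "click"), (12, "login"), (19, "click")]
result = tumbling_window_counts(events)
```

Here the events at seconds 1, 4 and 9 fall into the window starting at 0, and those at 12 and 19 into the window starting at 10. A streaming engine would emit each window as it closes rather than returning everything at the end.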

Page 5: Big Data Hadoop Streaming ETL template for Database to Database

Big Data Ecosystem: Where DataTorrent fits

[Diagram labels: Data Sources (Sensor Data, Social Media, Web Servers, App Servers, Click Streams); Oper1, Oper2, Oper3; Hadoop (YARN + HDFS); Real-time analytics & Visualizations; Real-time Data Visualization]

Page 6: Big Data Hadoop Streaming ETL template for Database to Database

DataTorrent Architecture

[Diagram labels, grouped by layer as far as the extracted text allows]

• Solutions for Business Problems: Ingestion & Data Prep, ETL Pipelines
• Ease of Use Tools: Real-Time Data Visualization, Management & Monitoring, GUI Application Assembly
• Application Templates (pairs shown: Kafka / HDFS, HDFS / HDFS, JDBC / HDFS, JDBC / Kafka)
• Apex-Malhar Operator Library
• High-level API, Transformation, ML & Score, SQL, Analytics, FileSync, Dev Framework, Batch Support
• Core: Apache Apex Core
• Big Data Infrastructure: Hadoop 2.x – YARN + HDFS – On Prem & Cloud

Page 7: Big Data Hadoop Streaming ETL template for Database to Database

App Templates – Recurring patterns

• Building apps, such as ingestion and transform apps, for patterns commonly seen in customer use cases

Use Case Pattern | Sources | Processors | Sinks
Data Synchronization, Staging Data for Analytics | HDFS, Kafka, JDBC, S3 | → | HDFS, S3
Enriching Data before Staging | HDFS, JDBC, Kafka | Parser → Deduper → Enricher → Formatter | HDFS, Cassandra
Merge & Transform Data Streams | Kafka, JDBC, File | Stream Merge → Transform → Filter → Enricher | HDFS
Machine Scoring | Kafka | H2O or Custom | HDFS
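
The "Parser → Deduper → Enricher → Formatter" pattern can be sketched as a chain of Python generators, each stage consuming the previous one. This is an illustrative sketch only, not the Apex-Malhar operators: the stage functions, the `id,name` record layout, and the `country` lookup table are all hypothetical.

```python
import json

def parse(lines):
    """Parser: turn 'id,name' lines into dicts, skipping malformed lines."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) == 2:
            yield {"id": parts[0], "name": parts[1]}

def dedupe(records):
    """Deduper: drop records whose id has been seen before."""
    seen = set()
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            yield record

def enrich(records, lookup):
    """Enricher: add a 'country' field from an in-memory lookup table."""
    for record in records:
        yield {**record, "country": lookup.get(record["id"], "unknown")}

def format_json(records):
    """Formatter: serialize each record as one JSON line."""
    for record in records:
        yield json.dumps(record, sort_keys=True)

lines = ["1,alice", "2,bob", "1,alice", "broken line"]
lookup = {"1": "US"}  # hypothetical enrichment data
output = list(format_json(enrich(dedupe(parse(lines)), lookup)))
```

Because every stage is a generator, records flow through one at a time, which loosely mirrors how tuples move between operators in a streaming pipeline.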

Page 8: Big Data Hadoop Streaming ETL template for Database to Database

AppHub – App Template Repository

• Central repository for big data application templates
• Tested and published by DataTorrent
• Accessible via dtManage on DataTorrent RTS and direct app download from website
• Provides version updates via dtManage

Page 9: Big Data Hadoop Streaming ETL template for Database to Database

App Templates advantages

Ease of use
✓ Quickly import and launch app template applications
✓ Easily add business logic by adding custom operators

Time to market and TCO
✓ Reduces time to production drastically
✓ Reduces cost of operations in production

Real-time Visualizations
✓ Real-time visualizations of operational metrics such as throughput, latency etc.
✓ Real-time visualizations of application data such as number of files processed, amount of data transferred etc.

Page 10: Big Data Hadoop Streaming ETL template for Database to Database

App Template Demo

• Look at: https://www.datatorrent.com/apphub/
• Ready to use, customizable applications for big data ingestion use-cases.
• Source: https://github.com/DataTorrent/app-templates (Apache 2.0)

Page 11: Big Data Hadoop Streaming ETL template for Database to Database


Database table to Database table app-template
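
The table-to-table movement this template performs can be pictured with a small, self-contained Python sketch. It is not the template's code: it uses in-memory SQLite databases instead of JDBC connections, and the table names `src_table` and `dst_table`, the `id`/`name` columns, and the batch size are assumptions made for illustration.

```python
import sqlite3

def copy_table(source, sink, batch_size=2):
    """Copy rows from src_table to dst_table in id-ordered batches.

    Hypothetical helper; returns the number of rows copied.
    """
    last_id = 0
    copied = 0
    while True:
        # Read the next batch of rows after the last id already copied.
        rows = source.execute(
            "SELECT id, name FROM src_table WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        with sink:  # commit each batch as one transaction
            sink.executemany(
                "INSERT INTO dst_table (id, name) VALUES (?, ?)", rows
            )
        last_id = rows[-1][0]
        copied += len(rows)
    return copied

source = sqlite3.connect(":memory:")
sink = sqlite3.connect(":memory:")
source.execute("CREATE TABLE src_table (id INTEGER PRIMARY KEY, name TEXT)")
sink.execute("CREATE TABLE dst_table (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany(
    "INSERT INTO src_table VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)
source.commit()

copied = copy_table(source, sink)
rows_in_sink = sink.execute("SELECT id, name FROM dst_table ORDER BY id").fetchall()
```

Tracking the last copied id lets the loop resume from a known position, which is one simple way to read a table incrementally; a production pipeline would also need to handle failures and duplicates.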

Page 12: Big Data Hadoop Streaming ETL template for Database to Database


Page 13: Big Data Hadoop Streaming ETL template for Database to Database

Roadmap

• Visualizations – widgets on app data
  • Metrics such as size of data moved, lines per file, number of errors etc.
  • Custom user defined metrics using Apex auto-metrics
• Schema enablement
• Cloud Integrations
  • Amazon EMR, Microsoft Azure
• Upcoming app templates
  • FTP → HDFS
  • SFTP → HDFS
  • Kinesis → S3
  • Kinesis → Redshift
  • Kafka → JSON parse → filter → transform → HDFS
  • Kafka → CSV parse → filter → transform → HDFS
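
As a rough picture of a "CSV parse → filter → transform" flow with simple app metrics such as number of errors, here is a minimal Python sketch. It is only an illustration under stated assumptions: plain lists stand in for Kafka and HDFS, and the function `run_pipeline`, the `name,amount` record layout, the threshold, and the metric names are hypothetical rather than taken from any planned template.

```python
import csv
import io

def run_pipeline(lines, min_amount=10.0):
    """CSV parse -> filter -> transform, counting simple metrics on the way."""
    metrics = {"lines_read": 0, "errors": 0, "records_written": 0}
    output = []
    for line in lines:
        metrics["lines_read"] += 1
        try:
            # Parse: expect exactly two CSV fields, name and amount.
            name, amount = next(csv.reader(io.StringIO(line)))
            amount = float(amount)
        except (ValueError, StopIteration):
            # Wrong field count, non-numeric amount, or empty line.
            metrics["errors"] += 1
            continue
        # Filter: keep only records at or above the threshold.
        if amount < min_amount:
            continue
        # Transform: normalize the name and round the amount.
        output.append({"name": name.upper(), "amount": round(amount, 2)})
        metrics["records_written"] += 1
    return output, metrics

lines = ["alice,12.5", "bob,3", "not-a-record", "carol,abc", "dave,40"]
output, metrics = run_pipeline(lines)
```

Keeping the counters next to the processing steps is one simple way to expose the kind of per-application metrics the roadmap mentions.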

Page 14: Big Data Hadoop Streaming ETL template for Database to Database

Questions

• Send feedback to: https://groups.google.com/forum/#!forum/dt-users
• Email to: [email protected]

Page 15: Big Data Hadoop Streaming ETL template for Database to Database


Resources

• Apache Apex - http://apex.apache.org/

• Subscribe to forums
  ᵒ Apex - http://apex.apache.org/community.html
  ᵒ DataTorrent - https://groups.google.com/forum/#!forum/dt-users

• Download - https://datatorrent.com/download/

• Twitter
  ᵒ @ApacheApex; Follow - https://twitter.com/apacheapex
  ᵒ @DataTorrent; Follow - https://twitter.com/datatorrent

• Meetups - http://meetup.com/topics/apache-apex

• Webinars - https://datatorrent.com/webinars/

• Videos - https://youtube.com/user/DataTorrent

• Slides - http://slideshare.net/DataTorrent/presentations

• Startup Accelerator – Free full featured enterprise product
  ᵒ https://datatorrent.com/product/startup-accelerator/

• Big Data Application Templates Hub – https://datatorrent.com/apphub

Page 16: Big Data Hadoop Streaming ETL template for Database to Database


We are hiring!

[email protected]

• Developers/Architects

• QA Automation Developers

• Information Developers

• Build and Release

• Community Leaders

Page 17: Big Data Hadoop Streaming ETL template for Database to Database
