big data hadoop streaming etl template for database to database
TRANSCRIPT
1
Big Data Hadoop Streaming ETL template for Database to Database
Hitesh Kapoor, Mohit Jotwani
[email protected]@datatorrent.com
2
Agenda
• DataTorrent - Vision
• About Apache Apex
• App templates
• Database to Database App Template
• Live demo
• Roadmap
3
DataTorrent Vision - Productize Big Data
• Big Data is neither Productized nor Operationalized
• Total Cost of Ownership (TCO) includes
  • Time to Develop + Time to Launch + Cost of ongoing Operations
• Provide a Product to ...
  • Build Applications Rapidly with Simple Interfaces, Pre-Built Apps, Code Reuse & Debuggability
  • Support Dev, Test, Prod cycle to Launch Apps quickly
  • Manage and Visualize Applications for Operability
4
Next Gen Big Data Applications
[Figure: example pipeline with the labels Browser, Web Server, Logs, Kafka, Kafka Input (logs), Decompress/Parse/Filter, Dimensions Aggregate, Kafka]
Variety of sources - IoT, Kafka, files, social media etc.
Variety of sinks – Kafka, files, databases etc.
* Supports low latency real time visualizations as well
Unbounded and continuous data streams
Batch support, batch processed as stream
In-memory processing with temporal window boundaries
Stateful operations: Aggregation, Rules etc. → Analytics
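The points above about temporal window boundaries and stateful aggregation can be illustrated with a small sketch. This is plain Python, not the Apache Apex API; the function name `windowed_counts`, the 10-second window size, and the sample events are all hypothetical, and the sketch assumes events arrive in timestamp order.

```python
def windowed_counts(events, window_seconds):
    """Yield (window_start, counts) each time a fixed-size time window closes.

    Assumes `events` is an iterable of (timestamp_seconds, key) tuples that
    arrives in non-decreasing timestamp order (a simplification).
    """
    current_start = None
    counts = {}
    for timestamp, key in events:
        start = (timestamp // window_seconds) * window_seconds
        # A new window boundary was crossed: emit the finished window's state.
        if current_start is not None and start != current_start:
            yield current_start, counts
            counts = {}
        current_start = start
        counts[key] = counts.get(key, 0) + 1
    # Flush the last, still-open window when the input ends (the batch case).
    if current_start is not None:
        yield current_start, counts


# Hypothetical sample stream of (timestamp, HTTP method) events.
events = [(0, "GET"), (3, "POST"), (9, "GET"), (10, "GET"), (19, "POST"), (21, "GET")]
results = list(windowed_counts(events, window_seconds=10))
```

The same function handles a bounded input by flushing the final window, which mirrors the idea of batch data being processed as a stream.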
5
Big Data Ecosystem: Where DataTorrent fits
[Figure: data sources (sensor data, social media, web servers, app servers, click streams) and operators Oper1, Oper2, Oper3 on Hadoop (YARN + HDFS), with real-time analytics & visualizations and real-time data visualization as outputs]
6
DataTorrent Architecture
[Figure: layered architecture diagram with the following labels]
Solutions for Business Problems: Ingestion & Data Prep, ETL Pipelines
Ease of Use Tools: Real-Time Data Visualization, Management & Monitoring, GUI Application Assembly
Application Templates
Apex-Malhar Operator Library
Core: High-level API, Transformation, ML & Score, SQL, Analytics, FileSync, Dev Framework, Batch Support
Apache Apex Core
Big Data Infrastructure: Hadoop 2.x – YARN + HDFS – On Prem & Cloud
Source and sink labels: Kafka, HDFS, JDBC
7
App Templates – Recurring patterns
• Building Apps such as Ingestion & Transform Apps for common patterns in customer use cases

Use Case Pattern | Sources | Processors | Sinks
Data Synchronization, Staging Data for Analytics | HDFS, Kafka, JDBC, S3 | → | HDFS, S3
Enriching Data before Staging | HDFS, JDBC, Kafka | Parser → Deduper → Enricher → Formatter | HDFS, Cassandra
Merge & Transform Data Streams | Kafka, JDBC, File | Stream Merge → Transform → Filter → Enricher | HDFS
Machine Scoring | Kafka | H2O or Custom | HDFS
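As a rough illustration of the "Parser → Deduper → Enricher → Formatter" pattern listed above, the stages can be sketched as chained Python generators. This is not how the app templates or the Apex-Malhar operators are implemented; the record layout (`user_id,country`), the `REGIONS` lookup table, and all function names are assumptions made for the example.

```python
import json


def parse(lines):
    """Parser: turn 'user_id,country' CSV lines into dict records."""
    for line in lines:
        user_id, country = line.strip().split(",")
        yield {"user_id": user_id, "country": country}


def dedupe(records, key="user_id"):
    """Deduper: drop records whose key was already seen (state kept in memory)."""
    seen = set()
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            yield record


def enrich(records, lookup):
    """Enricher: add a 'region' field from a lookup table, with a default."""
    for record in records:
        yield {**record, "region": lookup.get(record["country"], "UNKNOWN")}


def format_json(records):
    """Formatter: serialize each record as a JSON line for the sink."""
    for record in records:
        yield json.dumps(record, sort_keys=True)


# Hypothetical lookup data and input lines.
REGIONS = {"IN": "APAC", "US": "AMER"}
raw = ["1,IN", "2,US", "1,IN", "3,FR"]
output = list(format_json(enrich(dedupe(parse(raw)), REGIONS)))
```

Each stage consumes the previous stage's output one record at a time, so a custom step can be added by inserting another generator into the chain.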
8
AppHub – App Template Repository
• Central repository for big data application templates
• Tested and published by DataTorrent
• Accessible via dtManage on DataTorrent RTS and direct app download from website
• Provides version updates via dtManage
9
App Templates advantages
Ease of use
✓ Quickly import and launch app template applications
✓ Easily add business logic by adding custom operators
Time to market and TCO
✓ Reduces time to production drastically
✓ Reduces cost of operations in production
Real-time Visualizations
✓ Real-time visualizations of operational metrics such as throughput, latency etc.
✓ Real-time visualizations of application data such as number of files processed, amount of data transferred etc.
10
App Template Demo
• Look at: https://www.datatorrent.com/apphub/
• Ready to use, customizable applications for big data ingestion use-cases.
• Source: https://github.com/DataTorrent/app-templates (Apache 2.0)
11
Database table to Database table app-template
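The database-table-to-database-table pattern can be sketched in a few lines: read rows from a source table in batches, optionally transform them, and write each batch to a sink table in its own transaction. The sketch below uses Python's built-in `sqlite3` with in-memory databases as a stand-in for JDBC connections; the table names (`orders_in`, `orders_out`), columns, batch size, and upper-casing transform are hypothetical and are not taken from the actual app template.

```python
import sqlite3


def copy_table(source, sink, batch_size=2):
    """Copy rows from orders_in (source) to orders_out (sink) in batches.

    Returns the number of rows moved. Each batch is committed separately.
    """
    cursor = source.execute("SELECT id, name, amount FROM orders_in ORDER BY id")
    moved = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # Example transform step: normalize the name column to upper case.
        transformed = [(row_id, name.upper(), amount) for row_id, name, amount in rows]
        with sink:  # commits on success, rolls back on error
            sink.executemany(
                "INSERT INTO orders_out (id, name, amount) VALUES (?, ?, ?)",
                transformed,
            )
        moved += len(rows)
    return moved


# Hypothetical source and sink databases (in-memory stand-ins).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders_in (id INTEGER PRIMARY KEY, name TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders_in VALUES (?, ?, ?)",
    [(1, "alice", 10.0), (2, "bob", 20.5), (3, "carol", 7.25)],
)
source.commit()

sink = sqlite3.connect(":memory:")
sink.execute("CREATE TABLE orders_out (id INTEGER PRIMARY KEY, name TEXT, amount REAL)")

moved = copy_table(source, sink)
copied = sink.execute("SELECT id, name, amount FROM orders_out ORDER BY id").fetchall()
```

Committing per batch keeps the amount of uncommitted work small; a streaming implementation would additionally need to track how far it has read so it can resume after a failure, which this sketch does not attempt.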
12
13
Roadmap
• Visualizations – widgets on app data
  • Metrics such as size of data moved, lines per file, number of errors etc.
• Custom user defined metrics using Apex auto-metrics
• Schema enablement
• Cloud Integrations
  • Amazon EMR, Microsoft Azure
• Upcoming app templates
  • FTP → HDFS
  • SFTP → HDFS
  • Kinesis → S3
  • Kinesis → Redshift
  • Kafka → JSON parse → filter → transform → HDFS
  • Kafka → CSV parse → filter → transform → HDFS
14
Questions
• Send feedback to: https://groups.google.com/forum/#!forum/dt-users
• Email to: [email protected]
15
Resources
• Apache Apex - http://apex.apache.org/
• Subscribe to forums
  ◦ Apex - http://apex.apache.org/community.html
  ◦ DataTorrent - https://groups.google.com/forum/#!forum/dt-users
• Download - https://datatorrent.com/download/
• Twitter
  ◦ @ApacheApex; Follow - https://twitter.com/apacheapex
  ◦ @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://meetup.com/topics/apache-apex
• Webinars - https://datatorrent.com/webinars/
• Videos - https://youtube.com/user/DataTorrent
• Slides - http://slideshare.net/DataTorrent/presentations
• Startup Accelerator – Free full featured enterprise product
  ◦ https://datatorrent.com/product/startup-accelerator/
• Big Data Application Templates Hub – https://datatorrent.com/apphub
16
We are hiring!
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders
17