big data file ingestion using apex
TRANSCRIPT
Apache Apex Meetup
Contents
● What is Big Data Ingestion● Challenges in File copy @ scale● Ingestion using Apex
○ Input ○ Output○ Key features
● Demo● Summary
Apache Apex Meetup
Directed Acyclic Graph (DAG)
•A Stream is a sequence of data tuples•An Operator takes one or more input streams, performs computations & emits one or more output streams
• Each Operator is YOUR custom business logic in java, or built-in operator from our open source library• Operator has many instances that run in parallel and each instance in single-threaded
•Directed Acyclic Graph (DAG) is made up of operators and streams
Apex: Application Programming Model
Output StreamTuple Tuple
Filtered Stream
Enriched Stream
Enriched
Stream
er
Operator
er
Operator
er
Operator
er
Operator
Filtered
Stream
Apache Apex Meetup
What is Ingestion
Data ingestion● process of obtaining, importing, and processing data for later use or storage
in a database
Big Data Ingestion● discovering the data sources ● importing the data ● processing data to produce intermediate data ● Send data out to durable data stores
Apache Apex Meetup
Challenges in File copy @ scale
● Failure Recovery
● Copying big files in parallel
● Copying large number of small files
● Processing○ Encryption
○ Compression
○ Compaction
Apache Apex Meetup
DAG - Components
Read Data Write DataProcess
Apache Apex Meetup
DAG - Read Data : Requirements
● Independent of input file type ○ HDFS○ S3○ FTP○ NFS
● Scale to large data○ Large files○ Large number of small files
● Configurable Bandwidth usage
Apache Apex Meetup
DAG - Read Data
Break the whole task into smaller sub-tasks
Connect to input and scan for available data
Assign smaller tasks for downstream operators
Step
s
Purp
ose
Nam
e
Work on the sub-tasks given by Operator 1, one
at a time
Connect to source and read data as smaller
tasks one-by-one
Pass on the read data to downstream operator
Write File
Save the data read by Operator 2
File Splitter
Block Reader
FileWriter
Apache Apex Meetup
DAG - Simple Design
File Splitter
Block Reader
FileWriterBlockMetaData Data
Challenges● Reading files in parallel is not possible
○ Can have multiple Block Readers and File Writers reading multiple files in parallel but single file can’t be read by two Block Readers
● Failure recovery is hard
Apache Apex Meetup
DAG - Read Data
Break the whole task into smaller sub-tasks
Connect to input and scan for available data
Assign smaller tasks for downstream
operators
Step
s
Pur
pose
N
ame
Work on the sub-tasks given by Operator 1,
one at a time
Connect to source and read data as smaller
tasks one-by-one
Pass on the read data to downstream operator
Write File
Save the data read by Operator 2
File Splitter
Block Reader
FileWriter
Check for completeness
Make sure all smaller tasks for a file are completed by upstream operators & send file merger trigger
Synchronizer
Apache Apex Meetup
DAG - Input
File Splitter
Block Reader
BlockWriter
BlockMetaData
Data
Block Reader
BlockWriter
Synchronizer
BlockMetaData
FileMetaData
BlockMetaData
BlockMetaData
Data
Apache Apex Meetup
Input DAG - FileSplitter
Scan input files/ directoriesCreate smaller sub-tasks
FileMetaData
BlockMetaData
File Splitter
● Parameters○ input files/directories to copy data from○ recursive - Yes / No○ polling - Yes / No○ bandwidth - MB / sec
Apache Apex Meetup
Input DAG - FileSplitter
● For each file in the directory:■ [output] FileMetaData - file information
● Name● Size ● Relative path● Block IDs into which the file is virtually split
■ [output] BlockMetaData - block information
● BlockID ● Start position● End position ● File URL
InputFile.txt
1073741824 (1GB)
input/data/InputFile.txt
[0,1,2,3,4,5,6,7,8]
1
134217728
268435456 (128MB)
hdfs://node18:8020/user/sandeep/input
Apache Apex Meetup
Input DAG - BlockReader
Block Reader
Read block from remote location and emit Data
Data
BlockMetaData
● Parameters
Input URL: E.g.: hdfs://node18:8020/user/hduser/input
BlockMetaData
Apache Apex Meetup
Input DAG - BlockWriter
Block Writer
Write block data on local HDFS
BlockMetaData
BlockMetaData
Data
Saves data in apps directory
Apache Apex Meetup
Input DAG - Synchronizer
Track blocks for each file and send trigger once all the block for that file
are available
FileMetaDataSynchronizer
FileMetaData
BlockMetaData
Apache Apex Meetup
DAG - Input
File Splitter
Block Reader
Block Writer
Synchronizer
BlockMetaData
Data
FileMetaData
BlockMetaData
BlockMetaData
FileMetaData
Apache Apex Meetup
Output DAG - FileMerger
Merge blocks to recreate original file
FileMerger
● Parameters○ Output directory to copy data to○ Overwrite - Yes/No
FileMetaData
Apache Apex Meetup
Output DAG - FileMerger - FastMerge Magic
Different Blocks:
File :
B1
Dat
aNod
e 1
Dat
aNod
e 2
Dat
aNod
e 3
Dat
aNod
e 4
B2
B1
B1
B2
B2
Bn
Bn
Bn
BnB1 B2
1
2
1
1
2
2
n
n
n
1 2 n
Apache Apex Meetup
● Same replication factor
● On same HDFS cluster
● Same block size for all files
● Size of all files (except last) : multiple of block size
Output DAG - FileMerger - FastMerge Magic
Apache Apex Meetup
DAG - Complete
File Splitter
Block Reader
Block Writer
Synchronizer
BlockMetaData BlockMetaData
Data
BlockMetaData
FileMetaData
FileMergerFileMetaData
Apache Apex Meetup
Other features: Optional processing
● Compression○ Gzip and lzo
● Encryption○ PKI & AES
● Compaction○ Size based
● Dedup● Dimension Computation & Aggregation
Apache Apex Meetup
Apache Apex Meetup
Summary
● Easy to use○ Configure and run
● Unified for batch and continuous ingestion● Handles
○ Large files○ Large number of small files
25
Resources
Apache Apex Meetup
• Apache Apex - http://apex.apache.org/• Subscribe - http://apex.apache.org/community.html• Download - https://www.datatorrent.com/download/• Twitter
o @ApacheApex; Follow - https://twitter.com/apacheapexo @DataTorrent; Follow – https://twitter.com/datatorrent
• Meetups - http://www.meetup.com/topics/apache-apex• Webinars - https://www.datatorrent.com/webinars/• Videos - https://www.youtube.com/user/DataTorrent• Slides - http://www.slideshare.net/DataTorrent/presentations • Startup Accelerator Program - Full featured enterprise product
o https://www.datatorrent.com/product/startup-accelerator/
© 2016 DataTorrent
We Are Hiring
2
• [email protected]• Developers/Architects• QA Automation Developers• Information Developers• Build and Release• Community Leaders