
Page 1: Fault Tolerant File Input & Output

Fault-tolerant File Input & Output

Chandni Singh - Committer, Apache Apex
May 4, 2016

Page 2: Fault Tolerant File Input & Output

Background
- Windows in Apex
  - Window: a finite piece of a data set along temporal boundaries.
  - Apex assigns an id to each window, which helps with fault tolerance.
  - An operator is provided hooks to know which window id it is on.

Page 4: Fault Tolerant File Input & Output

AbstractFileInputOperator
- Scans a folder periodically for new files.

- Parses the file for records.

- Fault-tolerant and scalable.

Page 5: Fault Tolerant File Input & Output

AbstractFileInputOperator : Fault tolerance
- A record is not lost.

- A record is associated with only one window id irrespective of failures.

- If a window is replayed then all the records associated with it will be replayed.

Page 7: Fault Tolerant File Input & Output

AbstractFileInputOperator : Fault tolerance cont’d
Fault tolerance is achieved by

- Support from platform
  - Automatic checkpointing of the state of every operator in the DAG.
  - Automatic restoration of a failed operator in another container.

- WindowDataManager
  - Saves incremental state every window.
  - Helps with replaying windows that were completed by this operator.
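
The idea of saving incremental state every window, so that a completed window replays with exactly the same records, can be sketched in plain Java. This is not the Malhar WindowDataManager API: `ReplayingReader`, `processWindow`, and `restoreTo` are hypothetical names, and the in-memory map stands in for durable storage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: remembers which records each window emitted so a replay is identical. */
public class ReplayingReader {
  // windowId -> [startIndex, endIndex) into the source; stands in for durable per-window state.
  private final Map<Long, int[]> windowLog = new TreeMap<>();
  private final List<String> source;
  private int position = 0;

  public ReplayingReader(List<String> source) {
    this.source = source;
  }

  /** Emits up to maxRecords for a new window, or replays the saved range for a completed one. */
  public List<String> processWindow(long windowId, int maxRecords) {
    int[] saved = windowLog.get(windowId);
    if (saved != null) {
      // Replay: same records for the same window id, regardless of how much data is available now.
      position = Math.max(position, saved[1]);
      return new ArrayList<>(source.subList(saved[0], saved[1]));
    }
    int start = position;
    int end = Math.min(source.size(), start + maxRecords);
    position = end;
    // Saved when the window completes; a real operator would persist this durably.
    windowLog.put(windowId, new int[] {start, end});
    return new ArrayList<>(source.subList(start, end));
  }

  /** Simulates restoring from a checkpoint taken at an earlier read position. */
  public void restoreTo(int checkpointedPosition) {
    position = checkpointedPosition;
  }
}
```

After a simulated failure (`restoreTo(0)`), calling `processWindow` again with the ids of completed windows returns the same records as before, and the first new window continues where the last completed one ended.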

Page 8: Fault Tolerant File Input & Output

AbstractFileInputOperator : Scalability
- Operator partitions read different subsets of files.

- Files are distributed between partitions based on their hash.

- Number of partitions can be changed at run time by changing a property.

- For advanced use cases, subclasses can override the directory scanner to customize behavior such as having each partition scan a different directory.

- Auto-scalability supported as well in AbstractThroughputFileInputOperator.
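
Distributing files between partitions by hash can be illustrated with a small, self-contained sketch. It assumes the hash is taken over the file path string and reduced modulo the partition count; the slides do not say exactly how the hash is computed, and `HashFileAssigner` with its methods is a hypothetical name, not the Malhar directory scanner.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of hash-based file assignment; not the Malhar directory scanner itself. */
public class HashFileAssigner {
  /** Returns the index of the partition that owns the given path. */
  public static int partitionFor(String path, int partitionCount) {
    if (partitionCount <= 0) {
      throw new IllegalArgumentException("partitionCount must be positive");
    }
    // floorMod keeps the result non-negative even when hashCode() is negative.
    return Math.floorMod(path.hashCode(), partitionCount);
  }

  /** Filters the scanned paths down to those owned by one partition. */
  public static List<String> filesFor(int partitionIndex, int partitionCount, List<String> scanned) {
    List<String> mine = new ArrayList<>();
    for (String path : scanned) {
      if (partitionFor(path, partitionCount) == partitionIndex) {
        mine.add(path);
      }
    }
    return mine;
  }
}
```

Because every partition applies the same function to the same scan result, each file is claimed by exactly one partition without any coordination between them.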

Page 9: Fault Tolerant File Input & Output

AbstractFileInputOperator : Implementations
- LineByLineFileInputOperator in Malhar library
- Custom implementation

public class CustomFileInputOperator<RECORD> extends AbstractFileInputOperator<RECORD>
{
  public final transient DefaultOutputPort<RECORD> output = new DefaultOutputPort<RECORD>();

  @Override
  protected RECORD readEntity() throws IOException
  {
    //read record from input stream
    RECORD record = inputStream.read(...);
    return record;
  }

  @Override
  protected void emit(RECORD tuple)
  {
    output.emit(tuple);
  }
}

Page 10: Fault Tolerant File Input & Output

FileSplitterInput & AbstractFSBlockReader
- The tasks of discovering files and reading them are separated into different logical operators.
- File splitter discovers files asynchronously and creates task descriptions: FileBlockMetadata.
- Block readers use FileBlockMetadata to read a particular block of a file.
- Fault-tolerant, parallelizes reading of a single file, and is auto-scalable.
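
How a file can be turned into independent block tasks is sketched below. The `Block` class is a hypothetical stand-in for FileBlockMetadata (its fields are assumptions, not the Malhar class), and the block size is a parameter chosen by the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of splitting one file into block-sized task descriptions. */
public class BlockSplitter {
  /** One contiguous byte range of one file; a stand-in for FileBlockMetadata. */
  public static final class Block {
    public final String filePath;
    public final long offset;
    public final long length;
    public final boolean last;

    private Block(String filePath, long offset, long length, boolean last) {
      this.filePath = filePath;
      this.offset = offset;
      this.length = length;
      this.last = last;
    }
  }

  /** Returns the blocks covering [0, fileLength); an empty file yields no blocks. */
  public static List<Block> split(String filePath, long fileLength, long blockSize) {
    if (blockSize <= 0) {
      throw new IllegalArgumentException("blockSize must be positive");
    }
    List<Block> blocks = new ArrayList<>();
    for (long offset = 0; offset < fileLength; offset += blockSize) {
      // The final block may be shorter than blockSize.
      long length = Math.min(blockSize, fileLength - offset);
      blocks.add(new Block(filePath, offset, length, offset + length == fileLength));
    }
    return blocks;
  }
}
```

Since each block carries its own offset and length, different readers can work on different blocks of the same file at the same time.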

Page 11: Fault Tolerant File Input & Output

FileSplitterInput & AbstractFSBlockReader: Fault tolerance
- Platform supports checkpointing state and re-deployment automatically.

- FileSplitterInput uses WindowDataManager to replay tuples of completed windows.

- AbstractFSBlockReader relies on the upstream buffer-server to replay tuples from a given window.

- Buffer-server is a buffer associated with each output port of an operator which holds tuples emitted by that port.

Page 12: Fault Tolerant File Input & Output

FileSplitterInput & AbstractFSBlockReader: Fault tolerance cont’d

Page 13: Fault Tolerant File Input & Output

FileSplitterInput & AbstractFSBlockReader: Scalability
- FileSplitterInput is a simple operator which does not take many resources.
- Block reader does the actual work of reading files and is auto-scalable (in beta).
  - Min and max partitions are configurable.
  - Frequency of re-partition is controlled by a time interval property.
  - Scales up/down based on the pending FileBlockMetadata in the input port queue.
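
One way a partition count could be derived from the backlog is sketched below. The slides only state that scaling depends on the pending FileBlockMetadata and is bounded by a configurable min and max; the ceiling-division rule and the `blocksPerPartition` parameter are assumptions made for illustration, and `BacklogScaler` is a hypothetical name.

```java
/** Hypothetical sketch: choose a partition count from the number of pending blocks. */
public class BacklogScaler {
  /**
   * Returns the number of partitions needed to drain pendingBlocks, assuming each
   * partition can handle blocksPerPartition blocks per re-partition interval,
   * clamped to the range [min, max].
   */
  public static int targetPartitions(long pendingBlocks, long blocksPerPartition, int min, int max) {
    if (pendingBlocks < 0 || blocksPerPartition <= 0 || min < 1 || max < min) {
      throw new IllegalArgumentException("invalid scaling parameters");
    }
    // Ceiling division: round up so a partial batch still gets a partition.
    long needed = (pendingBlocks + blocksPerPartition - 1) / blocksPerPartition;
    return (int) Math.max(min, Math.min(max, needed));
  }
}
```

With `min = 1` and `max = 8`, an empty queue keeps a single reader, 25 pending blocks at 10 blocks per partition asks for 3 readers, and a very large backlog is capped at 8.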

Page 14: Fault Tolerant File Input & Output

FileSplitterInput & AbstractFSBlockReader: Implementations
- FileSplitterInput is concrete. Default behavior can be overridden.
- FS Block Readers
  - FSSliceReader: record is a slice
  - AbstractFSLineReader and AbstractFSReadAheadLineReader: record is a line
  - Custom FS Block Reader

public class CustomFSBlockReader<RECORD> extends AbstractFSBlockReader<RECORD>
{
  public CustomFSBlockReader()
  {
    //initialize reader context
    this.readerContext = new RecordReaderContext();
  }

  @Override
  protected RECORD convertToRecord(byte[] bytes)
  {
    //convert bytes to RECORD
    return RECORD.from(bytes);
  }
}

Page 15: Fault Tolerant File Input & Output

AbstractFileOutputOperator
- Persists data to a single file or multiple files.

- Automatic rotation of files (optional) based on
  - file size
  - window count

- Optional compression and encryption of data.

- Fault-tolerant

- Scalable as long as different partitions write to different files. Subclasses can achieve this by appending the operator id to the file name.

Page 16: Fault Tolerant File Input & Output

AbstractFileOutputOperator : Fault tolerance

Record is persisted exactly once.

- A record is never missed.

- A record is not duplicated.

Example application that persists data exactly once: AtomicFileOutputApp

Page 18: Fault Tolerant File Input & Output

AbstractFileOutputOperator : Fault tolerance cont’d

To write exactly once

- Assumes idempotent processing

- The checkpoint contains the size of each file the operator has written so far.

- On recovery, files are truncated to the sizes saved in the checkpoint being restored.
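
The truncate-on-recovery step can be sketched against a local file system with `java.nio`. This is an illustration of the idea only, not the operator's code: the operator works through the Hadoop file system API, and `TruncateOnRecovery` and the shape of the checkpoint map are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;

/** Hypothetical sketch: discard bytes written after the last checkpoint. */
public class TruncateOnRecovery {
  /** Cuts each file back to the length recorded in the checkpoint. */
  public static void restore(Map<Path, Long> checkpointedSizes) throws IOException {
    for (Map.Entry<Path, Long> entry : checkpointedSizes.entrySet()) {
      Path file = entry.getKey();
      long size = entry.getValue();
      if (!Files.exists(file)) {
        continue; // nothing was written to this file yet
      }
      try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
        // Bytes beyond the checkpointed size belong to windows that will be replayed.
        if (channel.size() > size) {
          channel.truncate(size);
        }
      }
    }
  }
}
```

Combined with idempotent upstream processing, replayed windows then rewrite exactly the bytes that were cut off, so no record ends up in the file twice.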

Page 19: Fault Tolerant File Input & Output

AbstractFileOutputOperator : Fault tolerance cont’d

To avoid dangling lease issues in HDFS
- Data is always written to temporary files.

- Temporary files are renamed to the actual files when a file is finalized, that is, closed for writing.

- User can choose when the files get finalized. Rotation handles finalization automatically.
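
The write-to-temporary-then-rename pattern looks roughly like this on a local file system. It is a sketch under assumptions: `TempFileWriter`, the `.tmp` suffix, and the method names are hypothetical, and the real operator performs the rename through the Hadoop file system API rather than `java.nio`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: data becomes visible under its final name only when finalized. */
public class TempFileWriter {
  private final Path target;
  private final Path temp;

  public TempFileWriter(Path target) {
    this.target = target;
    // The temporary file lives next to the target so the rename stays on one file system.
    this.temp = target.resolveSibling(target.getFileName() + ".tmp");
  }

  /** Appends bytes to the temporary file, creating it on first use. */
  public void write(byte[] bytes) throws IOException {
    Files.write(temp, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  /** Closes the file for writing by renaming the temporary file to the actual file. */
  public void finalizeFile() throws IOException {
    Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
  }
}
```

Readers of the output directory never see a partially written file under its final name; until `finalizeFile()` runs, only the `.tmp` file exists.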

Page 20: Fault Tolerant File Input & Output

AbstractFileOutputOperator : Custom Implementation

public class CustomFileOutputOperator<RECORD> extends AbstractFileOutputOperator<RECORD>
{
  public CustomFileOutputOperator()
  {
    setMaxLength(1024 * 1024);
    setRotationWindows(600);
  }

  @Override
  protected String getFileName(RECORD tuple)
  {
    //file name
    return tuple.getFileName();
  }

  @Override
  protected byte[] getBytesForTuple(RECORD tuple)
  {
    //bytes from record
    return tuple.toBytes();
  }
}

Page 21: Fault Tolerant File Input & Output

Acknowledgements
- Apex dev team
  Munagala Ramanath, Pramod Immaneni, Sasha Parfenov, Thomas Weise, Timothy Farkas

- Meetup organizers
  Amol Kekre, Qybare Pula, Ian Gomez

- Apache Apex Community

Page 22: Fault Tolerant File Input & Output

© 2016 DataTorrent

Resources


• Apache Apex - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - https://www.datatorrent.com/download/
• Twitter
  ᵒ @ApacheApex; Follow - https://twitter.com/apacheapex
  ᵒ @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://www.meetup.com/topics/apache-apex
• Webinars - https://www.datatorrent.com/webinars/
• Videos - https://www.youtube.com/user/DataTorrent
• Slides - http://www.slideshare.net/DataTorrent/presentations
• Startup Accelerator Program - Full featured enterprise product
  ᵒ https://www.datatorrent.com/product/startup-accelerator/

Page 23: Fault Tolerant File Input & Output


We Are Hiring


[email protected]
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders