Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for YARN?
TRANSCRIPT
INTRODUCTION
• YARN opened Hadoop to many more developers
• API to integrate into a Hadoop cluster
• Flexibility
• Applications: MR, Tez, Flink, Spark, …
• Flink has made great use of this opportunity
  • Flexible program execution graph
  • Operators other than Map and Reduce
  • Clean and convenient API
  • Efficient with I/O
EXPECTATIONS FROM YARN
• New programming models in addition to MapReduce
• More alternatives to cover cases where the MapReduce paradigm does not fit well
• Flexibility in expressing operations on data
• Elasticity of a cluster
• Ability to write your own applications to distribute computations across the cluster
DISTRIBUTING COMPUTATIONAL TASKS
• Writing your own YARN application
  • Complicated
  • Tedious
  • Error-prone
• Somebody must have done something simpler
  • Apache Twill
  • Still was not simple enough
• Execute CLI tools remotely (if everything else fails)
• Flink?
FLINK AT RESEARCHGATE
Lots of benefits:
• Made MapReduce jobs more readable
  • More compact
  • Less boilerplate code
  • Easier to understand and maintain
• Got rid of ugly Hive queries and optimised the runtime
• Better and cleaner orchestration of workflow subtasks (before, we had to glue multiple MR jobs together)
• Iterative machine learning algorithms
• Distributing computational tasks across a cluster
REAL USE CASE: MONGODB TO AVRO BRIDGE
REAL USE CASE
• In essence:
  • Reads MongoDB documents
  • Converts them to Avro records (based on a provided Avro schema)
  • Persists them on HDFS
• Avrongo's evolution:
  • Single-threaded program
  • Multi-threaded program talking to different shards in parallel
  • Distributed across the cluster
• Reasons for distributing:
  • We were CPU-bound
  • HDFS load distribution
A MongoDB to Avro Bridge (aka Avrongo)
Used to dump live DB data to HDFS for further batch-processing and analytics
HOW DOES AVRONGO WORK?
Basic version:
• One thread
• Uses one MongoDB cursor to iterate over the whole collection
• Suitable for smaller collections
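The basic version is essentially one loop over one cursor. A minimal structural sketch in plain Java — the in-memory list stands in for a MongoDB cursor and the string conversion for the Avro encoding; both stand-ins are assumptions, not the real Avrongo code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class BasicAvrongo {
    // Stand-in for converting one MongoDB document to an Avro record.
    static String toAvroRecord(Map<String, ?> doc) {
        return "record:" + doc.get("_id");
    }

    // Single-threaded pass: one cursor over the whole collection.
    static List<String> export(Iterator<? extends Map<String, ?>> cursor) {
        List<String> sink = new ArrayList<>(); // stand-in for an Avro file on HDFS
        while (cursor.hasNext()) {
            sink.add(toAvroRecord(cursor.next()));
        }
        return sink;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> collection = List.of(
            Map.of("_id", 1), Map.of("_id", 2), Map.of("_id", 3));
        System.out.println(export(collection.iterator()));
        // prints: [record:1, record:2, record:3]
    }
}
```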
MONGODB SHARDS AND CHUNKS
• Controlling the load on the MongoDB cluster
• Deterministic way of splitting the collection for input
Utilizing MongoDB chunks
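A MongoDB chunk is a contiguous shard-key range owned by exactly one shard, so grouping the chunk metadata by owning shard gives a deterministic split of the collection. A rough sketch of that grouping — the `Chunk` type here is a hypothetical stand-in for an entry of MongoDB's `config.chunks` collection:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ChunkSplitter {
    // Minimal stand-in for chunk metadata: a shard-key range owned by one shard.
    record Chunk(String shard, int minKey, int maxKey) {}

    // Deterministic split: group the collection's chunks by the shard owning them.
    static Map<String, List<Chunk>> groupByShard(List<Chunk> chunks) {
        return chunks.stream().collect(Collectors.groupingBy(Chunk::shard));
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(
            new Chunk("shard-a", 0, 100),
            new Chunk("shard-b", 100, 200),
            new Chunk("shard-a", 200, 300));
        System.out.println(groupByShard(chunks).get("shard-a").size()
            + " chunks on shard-a");
        // prints: 2 chunks on shard-a
    }
}
```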
AVRONGO - SHARDED VERSION
• Collecting chunk information (sets of documents living on a particular shard)
• Processing the chunks of each shard in a separate group of threads
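The sharded version can be pictured as one small thread pool per shard, each draining that shard's chunk list, which keeps the load on any single shard bounded. A simplified stand-in using plain `java.util.concurrent` — chunk processing is reduced to a counter increment, and the chunk/shard names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ShardedAvrongo {
    // One group of threads per shard; each group drains its shard's chunks.
    static int processAll(Map<String, List<String>> chunksByShard,
                          int threadsPerShard) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        List<ExecutorService> pools = new ArrayList<>();
        for (List<String> chunks : chunksByShard.values()) {
            ExecutorService pool = Executors.newFixedThreadPool(threadsPerShard);
            pools.add(pool);
            for (String chunk : chunks) {
                // Stand-in for dumping and converting one chunk.
                pool.submit(() -> { processed.incrementAndGet(); });
            }
        }
        for (ExecutorService pool : pools) {
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        Map<String, List<String>> chunksByShard = Map.of(
            "shard-a", List.of("a1", "a2", "a3"),
            "shard-b", List.of("b1", "b2"));
        System.out.println("processed " + processAll(chunksByShard, 2) + " chunks");
        // prints: processed 5 chunks
    }
}
```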
AVRONGO - FLINK VERSION
• Custom InputFormat that distributes MongoDB chunks uniformly
• FlatMap operator
• Number of task nodes = (number of shards) × (parallelism per shard)
• Custom Generic AvroOutputFormat
• Slower shards receive a bit more attention
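The core of the custom InputFormat reduces to partitioning the chunk list into (shards × parallelism-per-shard) splits as evenly as possible; round-robin is one simple way to do that. An illustrative sketch of only that partitioning logic — not the actual Flink `InputFormat` API, in which this would live inside `createInputSplits()`:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkDistributor {
    // Round-robin assignment of chunks to input splits: split sizes
    // differ by at most one chunk, i.e. a uniform distribution.
    static List<List<String>> createSplits(List<String> chunks, int numSplits) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) splits.add(new ArrayList<>());
        for (int i = 0; i < chunks.size(); i++) {
            splits.get(i % numSplits).add(chunks.get(i));
        }
        return splits;
    }

    public static void main(String[] args) {
        int shards = 3, parallelismPerShard = 2;
        int numSplits = shards * parallelismPerShard; // number of parallel tasks
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < 14; i++) chunks.add("chunk-" + i);
        for (List<String> split : createSplits(chunks, numSplits)) {
            System.out.print(split.size() + " ");
        }
        // prints: 3 3 2 2 2 2
    }
}
```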
FLINK APPROACH
Outcome:
• No longer bound by CPU
• Imports to HDFS are faster
  • Some collections: from 6h to 2.5h, or from 3.5h to 2h
Benefits:
• Very few lines of code
• Same command-line interface (no effort to migrate to the Flink-based version)
• Reuses the same converter as the standalone versions
• All orchestration and parallelisation work is done automatically by Flink
ANOTHER USE CASE: DISTRIBUTED FILE COPYING
HADOOP DISTCP
• Generates a MapReduce job that copies a large amount of data
• Takes a list of files as input to a Map task
• Two types of input formats:
  • UniformSizeInputFormat
  • DynamicInputFormat
    • gives more load to faster mappers
    • complicated code
    • utilizes the FS to feed the mappers
https://hadoop.apache.org/docs/r1.2.1/distcp2.html
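The effect of DynamicInputFormat can be modeled as a shared work queue: each copier polls the next file as soon as it finishes the previous one, so faster workers naturally take on more load. A simplified stand-in in plain Java — the real distcp feeds its mappers through listing files on the FS, not an in-memory queue, and the copy itself is reduced to a counter here:

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DynamicCopy {
    // Workers pull from one shared queue; a faster worker simply polls
    // more often and therefore ends up copying more files.
    static ConcurrentMap<String, Integer> copyAll(Queue<String> files, int workers)
            throws InterruptedException {
        ConcurrentMap<String, Integer> copiedPerWorker = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            String name = "worker-" + w;
            pool.submit(() -> {
                String file;
                while ((file = files.poll()) != null) {
                    copiedPerWorker.merge(name, 1, Integer::sum); // stand-in for the copy
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return copiedPerWorker;
    }

    public static void main(String[] args) throws Exception {
        Queue<String> files = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 20; i++) files.add("file-" + i);
        int total = copyAll(files, 4).values().stream()
            .mapToInt(Integer::intValue).sum();
        System.out.println("copied " + total + " files");
        // prints: copied 20 files
    }
}
```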
FLINK DISTCP
• Implements the same logic as Hadoop distcp's DynamicInputFormat
• Much fewer lines of code
• Same runtime as Hadoop distcp
• Available in the Flink Java examples
• Not fault-tolerant (yet)
https://github.com/apache/flink/tree/master/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/distcp
CONCLUSIONS
• Flink: a thin layer for implementing your YARN application for parallelising independent tasks on the cluster
  • Thanks to custom input formats that are easy to implement
  • No boilerplate code
Would be nice to have:
• Elasticity
• Better progress tracking
• Fault tolerance
Custom input format + a Flink operator with business logic = Happiness