
Page 1: Overview of Cascading 3.0 on Apache Flink

Cascading on Flink

Fabian Hueske (@fhueske)

Page 2: Overview of Cascading 3.0 on Apache Flink

What is Cascading?

“Cascading is the proven application development platform for building data applications on Hadoop.” (www.cascading.org)

Java API for large-scale batch processing

Programs are specified as data flows
• pipes, taps, flows, cascades, …
• each, groupBy, every, coGroup, merge, …

Originally built for Hadoop MapReduce
• Compiled to workflows of Hadoop MapReduce jobs

Open Source (AL2)
• Developed by Concurrent
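To make that vocabulary concrete, here is a minimal word-count flow sketched against the Cascading 3.x Java API for the Hadoop platform. It is not from the talk; the class, field, and path choices are illustrative.

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCount {
      public static void main(String[] args) {
        // Taps: where data is read from and written to (here: text files on HDFS).
        Tap docTap = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap wcTap  = new Hfs(new TextDelimited(new Fields("token", "count"), "\t"), args[1]);

        // Pipes: the data flow, assembled from each / groupBy / every.
        Pipe pipe = new Pipe("wordcount");
        pipe = new Each(pipe, new Fields("line"),
            new RegexSplitGenerator(new Fields("token"), "\\s+"));    // split lines into tokens
        pipe = new GroupBy(pipe, new Fields("token"));                 // group by token
        pipe = new Every(pipe, Fields.ALL, new Count(new Fields("count")), Fields.ALL);

        // Flow: connect sources, sinks, and pipes, then run.
        FlowDef flowDef = FlowDef.flowDef()
            .addSource(pipe, docTap)
            .addTailSink(pipe, wcTap);
        Flow flow = new Hadoop2MR1FlowConnector().connect(flowDef);
        flow.complete();
      }
    }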


Page 3: Overview of Cascading 3.0 on Apache Flink

Why Cascading?

Vastly simplified API compared to the pure MapReduce API
• Reuse of code, connecting flows, …

Automatic translation to MapReduce jobs
• Minimizes the number of MapReduce jobs

Rock-solid execution due to Hadoop MapReduce

More APIs have been put on top
• Scalding (Scala) by Twitter
• Cascalog (Datalog)
• Lingual (SQL)
• Fluid (fluent Java API)

Runs in many production settings
• Twitter, SoundCloud, Etsy, Airbnb, …


Page 4: Overview of Cascading 3.0 on Apache Flink

Cascading Example


Compute TF-IDF scores for a set of documents
• TF-IDF: Term Frequency / Inverse Document Frequency
• Used for weighting the relevance of terms in search engines

Building this against the plain MapReduce API is painful

Example taken from docs.cascading.org/impatient
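For reference (the slide does not spell out the formula), one common formulation of the TF-IDF weight of a term t in document d, over a corpus of N documents of which df(t) contain t, is

    tf-idf(t, d) = tf(t, d) * log( N / df(t) )

so a term scores highly when it occurs often in a document but rarely across the corpus. The "Impatient" tutorial assembles these counts with groupBy and coGroup steps before joining them into the final score.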

Page 5: Overview of Cascading 3.0 on Apache Flink

Cascading 3.0

Released in June 2015

A new planner
• Execution backend can be changed

Apache Tez executor
• Cascading programs are compiled to Tez jobs
• No identity mappers
• No writing to HDFS between jobs


Page 6: Overview of Cascading 3.0 on Apache Flink

Why Cascading on Flink?

Flink’s unique batch processing runtime
• Pipelined data exchange
• Actively managed memory on- & off-heap
• Efficient in-memory & out-of-core operators
• Sorting and hashing on binary data
• No tuning required for robust operation (avoids OOME and GC pressure)

YARN integration


Page 7: Overview of Cascading 3.0 on Apache Flink

Cascading on Flink released

Available on GitHub
• Apache License V2

Depends on
• Cascading 3.1 WIP
• Flink 0.10-SNAPSHOT
• Will be pinned to the next releases of Cascading and Flink

Check GitHub for details: http://github.com/dataartisans/cascading-flink


Page 8: Overview of Cascading 3.0 on Apache Flink

Executing Cascading on Flink

Cascading programs are translated into Flink programs

Execution leverages all of Flink’s runtime features
• Memory-safe execution
• In-memory operators
• Pipelining
• Native serializers & binary comparators (if the program provides data types)

Use Flink’s regular execution clients

Page 9: Overview of Cascading 3.0 on Apache Flink

Current limitations

HashJoin is only supported as an InnerJoin
• HashJoin can be replaced by CoGroup (see the sketch below)

Support will be added once Flink supports hash-based outer joins
• This is work in progress
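A hedged sketch of that workaround, with illustrative pipe and field names (orders joined against users); only the join pipe changes, the surrounding assembly stays the same.

    import cascading.pipe.CoGroup;
    import cascading.pipe.HashJoin;
    import cascading.pipe.Pipe;
    import cascading.pipe.joiner.LeftJoin;
    import cascading.tuple.Fields;

    public class JoinWorkaround {

      // Map-side left outer join: fine on the MapReduce planner, but rejected by
      // cascading-flink while Flink lacks hash-based outer joins.
      static Pipe hashLeftJoin(Pipe orders, Pipe users) {
        return new HashJoin(orders, new Fields("user_id"),
                            users,  new Fields("id"),
                            new LeftJoin());
      }

      // The same logical join expressed as a CoGroup (reduce-side join), which the
      // Flink backend supports for all join types.
      static Pipe coGroupLeftJoin(Pipe orders, Pipe users) {
        return new CoGroup(orders, new Fields("user_id"),
                           users,  new Fields("id"),
                           new LeftJoin());
      }
    }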


Page 10: Overview of Cascading 3.0 on Apache Flink

How to run Cascading on Flink

No binaries available yet
• Clone the repository
• Build it (mvn -DskipTests clean install)

Add the cascading-flink Maven dependency to your Cascading project

Change just one line of code in your Cascading program (see the sketch below)
• Replace Hadoop2MR1FlowConnector with FlinkConnector
• Do not change any application logic (except replacing HashJoin for non-InnerJoins)

Execute the Cascading program as a regular Flink program

Detailed instructions on GitHub
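To illustrate the one-line change, a minimal sketch of the connector swap; the FlinkConnector package name below follows the cascading-flink repository's naming and may differ between versions, and the FlowDef is assumed to be assembled elsewhere in the unchanged application code.

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    // import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;   // before: MapReduce backend
    import com.dataartisans.flink.cascading.FlinkConnector;     // after: Flink backend (assumed package)

    public class RunOnFlink {
      // Connect and run an already-assembled FlowDef on the Flink backend.
      public static void run(FlowDef flowDef) {
        // Before: Flow flow = new Hadoop2MR1FlowConnector().connect(flowDef);
        Flow flow = new FlinkConnector().connect(flowDef);      // the only line that changes
        flow.complete();
      }
    }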

Page 11: Overview of Cascading 3.0 on Apache Flink

Example: TF-IDF

Taken from “Cascading for the Impatient”
• 2 CoGroup, 7 GroupBy, 1 HashJoin

http://docs.cascading.org/impatient

Page 12: Overview of Cascading 3.0 on Apache Flink

TF-IDF on MapReduce

Cascading on MapReduce translates the TF-IDF program to 9 MapReduce jobs

Each job
• Reads data from HDFS
• Applies a Map function
• Shuffles the data over the network
• Sorts the data
• Applies a Reduce function
• Writes the data to HDFS


Page 13: Overview of Cascading 3.0 on Apache Flink

TF-IDF on Flink

Cascading on Flink translates the TF-IDF job into one Flink job


Page 14: Overview of Cascading 3.0 on Apache Flink

TF-IDF on Flink

The shuffle is pipelined

Intermediate results are not written to or read from HDFS


Page 15: Overview of Cascading 3.0 on Apache Flink

TF-IDF: MR vs. Flink

8 worker nodes
• 8 CPUs, 30 GB RAM, 2 local SSDs each

Hadoop 2.7.1 (YARN, HDFS, MapReduce)

Flink 0.10-SNAPSHOT

80 GB of input data (intermediate data is larger)


Cascading on Flink -> 3:24h

Cascading on MapReduce -> 8:33h

Page 16: Overview of Cascading 3.0 on Apache Flink

Conclusion

Executing Cascading jobs on Apache Flink
• Improves runtime
• Reduces parameter tuning and avoids failures
• Virtually no code changes

Apache Flink’s runtime is very versatile
• Apache Hadoop MR
• Apache Storm
• Google Dataflow
• Apache SAMOA (incubating)
• + Flink’s own APIs and libraries …
