A Comparative Performance Evaluation of Flink – Dongwon Kim, POSTECH


TRANSCRIPT

Page 1: Dongwon Kim – A Comparative Performance Evaluation of Flink

A Comparative Performance Evaluation of Flink

Dongwon Kim

POSTECH

Page 2: Dongwon Kim – A Comparative Performance Evaluation of Flink

About Me

• Postdoctoral researcher @ POSTECH

• Research interest
  • Design and implementation of distributed systems
  • Performance optimization of big data processing engines

• Doctoral thesis
  • MR2: Fault Tolerant MapReduce with the Push Model

• Personal blog
  • http://eastcirclek.blogspot.kr

• Why I’m here

Page 3: Dongwon Kim – A Comparative Performance Evaluation of Flink

Outline

• TeraSort for various engines

• Experimental setup

• Results & analysis

• What else for better performance?

• Conclusion


Page 4: Dongwon Kim – A Comparative Performance Evaluation of Flink

TeraSort

• Hadoop MapReduce program for the annual terabyte sort competition

• TeraSort is essentially distributed sort (DS)

[Figure: typical DS phases on two nodes. Each node reads its input from disk, performs a local sort, shuffles records to the node owning their key range, locally sorts again, and writes the result to disk (read → local sort → shuffling → local sort → write), yielding a total order across the output partitions.]

Page 5: Dongwon Kim – A Comparative Performance Evaluation of Flink

TeraSort for MapReduce

• Included in Hadoop distributions
  • Along with TeraGen & TeraValidate

• Identity map & reduce functions

• Range partitioner built on sampling (see the sketch below)
  • Guarantees a total order & prevents partition skew
  • Sampling computes the boundary points within a few seconds

[Figure: the map task covers the read and first local-sort phases (read → map → sort), and the reduce task covers the shuffling, second local-sort, and write phases (shuffle → sort → reduce → write). The record range is split by the boundary points into Partition 1, Partition 2, …, Partition r.]
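Since this sampling-based range partitioner is reused for every engine later in the talk, here is a minimal Scala sketch of the idea. It is not the stock TeraSort code and not the code from the author's repository; the String keys, the in-memory sample, and the boundary-selection logic are illustrative assumptions only.

```scala
// Minimal sketch of a sampling-based range partitioner (illustrative only;
// not the stock TeraSort implementation). Assumes String keys and a
// non-empty in-memory sample; the real TeraSort partitioner samples input
// splits and distributes the boundary points to all tasks.
class SamplingRangePartitioner(sampleKeys: Seq[String], numPartitions: Int)
    extends Serializable {
  require(sampleKeys.nonEmpty && numPartitions > 0)

  // Choose (numPartitions - 1) boundary points from the sorted sample.
  private val boundaries: Array[String] = {
    val sorted = sampleKeys.sorted.toArray
    Array.tabulate(numPartitions - 1) { i =>
      sorted(((i + 1).toLong * sorted.length / numPartitions).toInt)
    }
  }

  // Every key falls into exactly one of the ranges delimited by the boundary
  // points; this is what guarantees a total order across partitions and,
  // given a representative sample, prevents partition skew.
  def getPartition(key: String): Int = {
    var p = 0
    while (p < boundaries.length && key >= boundaries(p)) p += 1
    p // number of boundary points <= key
  }
}
```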

Page 6: Dongwon Kim – A Comparative Performance Evaluation of Flink

TeraSort for Tez

• Tez can execute TeraSort for MapReduce without any modification (see the snippet below)
  • mapreduce.framework.name = yarn-tez

• Tez DAG plan of TeraSort for MapReduce

[Figure: an initialmap vertex (map task: read → map → sort) feeds a finalreduce vertex (reduce task: shuffle → sort → reduce → write), reading the input data and producing the output data.]
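As a concrete illustration of the single setting involved, here is a hedged sketch of submitting an unmodified MapReduce job on Tez by switching the framework name. The object name, job name, and the omitted TeraSort wiring are placeholders; in practice the property is usually just passed with -D on the command line or set in mapred-site.xml.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

object TeraSortOnTez {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // The only change needed to run the stock MapReduce TeraSort on Tez:
    conf.set("mapreduce.framework.name", "yarn-tez")

    val job = Job.getInstance(conf, "terasort-on-tez")
    // ... configure input/output paths, the identity mapper/reducer, and the
    // sampling-based range partitioner exactly as stock TeraSort does ...
    job.waitForCompletion(true)
  }
}
```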

Page 7: Dongwon Kim – A Comparative Performance Evaluation of Flink

TeraSort for Spark & Flink

• My source code on GitHub
  • https://github.com/eastcirclek/terasort

• Sampling-based range partitioner reused from TeraSort for MapReduce
  • Visit my personal blog for a detailed explanation
  • http://eastcirclek.blogspot.kr

Page 8: Dongwon Kim – A Comparative Performance Evaluation of Flink

TeraSort for Spark

• Code (see the sketch below)

• Two RDDs, one per stage

[Figure: Stage 0 runs a Shuffle-Map Task for newAPIHadoopFile, creating a new RDD that reads from HDFS with # partitions = # blocks; Stage 1 runs a Result Task for repartitionAndSortWithinPartitions, which repartitions the parent RDD with the user-specified partitioner, sorts within each partition, and writes the output to HDFS. Together the two stages cover the DS phases read, local sort, shuffling, local sort, write.]
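The sketch below shows this two-RDD structure in Scala. It is illustrative only, not the author's code from https://github.com/eastcirclek/terasort: keys and values are handled as plain Strings, the boundary points are assumed to be precomputed by sampling, and the final write goes through saveAsTextFile instead of the Tera output format.

```scala
import org.apache.hadoop.examples.terasort.TeraInputFormat
import org.apache.hadoop.io.Text
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

object SparkTeraSortSketch {

  // Stand-in for the sampling-based range partitioner described earlier.
  class BoundaryPartitioner(boundaries: Array[String]) extends Partitioner {
    override val numPartitions: Int = boundaries.length + 1
    override def getPartition(key: Any): Int = {
      val k = key.asInstanceOf[String]
      var p = 0
      while (p < boundaries.length && k >= boundaries(p)) p += 1
      p
    }
  }

  def main(args: Array[String]): Unit = {
    val Array(input, output) = args
    val sc = new SparkContext(new SparkConf().setAppName("terasort-sketch"))

    // RDD1 (Stage 0, Shuffle-Map Task): read 10-byte-key / 90-byte-value
    // records from HDFS, one partition per HDFS block.
    val rdd1 = sc.newAPIHadoopFile(
        input, classOf[TeraInputFormat], classOf[Text], classOf[Text])
      .map { case (k, v) => (k.toString, v.toString) }

    // RDD2 (Stage 1, Result Task): repartition by key range and sort within
    // each partition, so the concatenation of partitions is totally ordered.
    val boundaries: Array[String] = Array() // assume: computed by sampling
    val rdd2 = rdd1.repartitionAndSortWithinPartitions(
      new BoundaryPartitioner(boundaries))

    rdd2.saveAsTextFile(output) // the real job writes via the Tera output format
    sc.stop()
  }
}
```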

Page 9: Dongwon Kim – A Comparative Performance Evaluation of Flink

TeraSort for Flink

• Code (see the sketch below)

• A pipeline consisting of four operators: DataSource, Partition, SortPartition, DataSink

[Figure: DataSource creates a dataset that reads tuples from HDFS (read), Partition partitions the tuples (shuffling), SortPartition sorts the tuples of each partition (local sort), and DataSink writes the output to HDFS (write). There is no map-side sorting due to pipelined execution.]
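A minimal Scala sketch of this four-operator pipeline follows. It is illustrative only, not the author's code from https://github.com/eastcirclek/terasort: records are simplified to (key, value) String tuples (the parsing map is an extra step chained with the source), and the range boundaries are assumed to be precomputed by sampling.

```scala
import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala._

object FlinkTeraSortSketch {

  // Stand-in for the sampling-based range partitioner from the MapReduce job.
  class BoundaryPartitioner(boundaries: Array[String]) extends Partitioner[String] {
    override def partition(key: String, numPartitions: Int): Int = {
      var p = 0
      while (p < boundaries.length && key >= boundaries(p)) p += 1
      p
    }
  }

  def main(args: Array[String]): Unit = {
    val Array(input, output) = args
    val boundaries: Array[String] = Array() // assume: computed by sampling
    val env = ExecutionEnvironment.getExecutionEnvironment

    env
      .readTextFile(input)                           // 1 DataSource: read from HDFS
      .map(line => (line.take(10), line.drop(10)))   // split each record into (key, value)
      .partitionCustom(new BoundaryPartitioner(boundaries), 0) // 2 Partition: shuffle by key range
      .sortPartition(0, Order.ASCENDING)             // 3 SortPartition: local sort per partition
      .writeAsCsv(output)                            // 4 DataSink: write to HDFS

    env.execute("terasort-sketch")
  }
}
```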

Page 10: Dongwon Kim – A Comparative Performance Evaluation of Flink

Importance of TeraSort

• Suitable for measuring the pure performance of big data engines
  • No data transformation (like map, filter) with user-defined logic
  • Only the basic facilities of each engine are used

• “Winning the sort benchmark” is a great means of PR

Page 11: Dongwon Kim – A Comparative Performance Evaluation of Flink

Outline

• TeraSort for various engines

• Experimental setup
  • Machine specification
  • Node configuration

• Results & analysis

• What else for better performance?

• Conclusion

Page 12: Dongwon Kim – A Comparative Performance Evaluation of Flink

Machine specification (42 identical machines)

DELL PowerEdge R610
  • CPU : two X5650 processors (total 12 cores)
  • Memory : total 24 GB
  • Disk : 6 disks * 500 GB/disk
  • Network : 10 Gigabit Ethernet

My machine vs. the Spark team’s machines:
  • Processor : Intel Xeon X5650 (Q1 2010) vs. Intel Xeon E5-2670 (Q1 2012)
  • Cores : 6 * 2 processors vs. 8 * 4 processors
  • Memory : 24 GB vs. 244 GB
  • Disks : 6 HDDs vs. 8 SSDs

Results can be different on newer machines.

Page 13: Dongwon Kim – A Comparative Performance Evaluation of Flink

Node configuration (24 GB on each node)

[Node-layout diagrams, summarized:]
  • Daemons on every node: NodeManager (1 GB) + DataNode (1 GB), a total of 2 GB for daemons
  • MapReduce-2.7.1 & Tez-0.7.0: 12–13 GB of 1 GB MapTask/ReduceTask containers per node, with the ShuffleService running inside the NodeManager
  • Spark-1.5.1: one 12 GB Executor per node (internal memory layout, various managers, thread pool) with 12 task slots, plus a 1 GB Driver
  • Flink-0.9.1: one 12 GB TaskManager per node (internal memory layout, various managers, task threads) with 12 task slots, plus a 1 GB JobManager
  • Every engine runs at most 12 simultaneous tasks per node

Page 14: Dongwon Kim – A Comparative Performance Evaluation of Flink

Outline

• TeraSort for various engines

• Experimental setup

• Results & analysis
  • Flink is faster than other engines due to its pipelined execution

• What else for better performance?

• Conclusion

Page 15: Dongwon Kim – A Comparative Performance Evaluation of Flink

How to read a swimlane graph & throughput graphs

[Figure: an example swimlane graph (tasks vs. time since job start, in seconds; each line is the duration of one task, with different patterns for different stages) together with cluster network throughput (in/out) and cluster disk throughput (read/write) graphs. In the example there are 6 waves of 1st-stage tasks and 1 wave of 2nd-stage tasks; the two stages hardly overlap, and there is no network traffic during the 1st stage.]

Page 16: Dongwon Kim – A Comparative Performance Evaluation of Flink

Result of sorting 80GB/node (3.2TB)

• Flink is the fastest due to its pipelined execution

• Tez and Spark do not overlap the 1st and 2nd stages

• MapReduce is slow despite overlapping stages

Completion times (seconds)*:
  • MapReduce in Hadoop-2.7.1 : 2157
  • Tez-0.7.0 : 1887
  • Spark-1.5.1 : 2171
  • Flink-0.9.1 : 1480

[Figure: swimlane graphs for the four engines. MapReduce, Tez, and Spark each show a 1st stage followed by a 2nd stage, while Flink shows the four pipelined operators 1 DataSource, 2 Partition, 3 SortPartition, 4 DataSink; a bar chart summarizes the completion times above.]

* Map output compression turned on for Spark and Tez

Page 17: Dongwon Kim – A Comparative Performance Evaluation of Flink

Tez and Spark do not overlap 1st and 2nd stages

[Figure: cluster network and disk throughput graphs for Tez and Spark vs. Flink. For Tez and Spark: (1) the 2nd stage starts, (2) only then is the output of the 1st stage sent over the network, and (3) the disk write to HDFS occurs only after shuffling is done, leaving the network and disks idle in between. For Flink (operators 1 DataSource, 2 Partition, 3 SortPartition, 4 DataSink): (1) network traffic occurs from the start and (2) the write to HDFS occurs right after shuffling is done.]

Page 18: Dongwon Kim – A Comparative Performance Evaluation of Flink

Tez does not overlap 1st and 2nd stages

• Tez has parameters to control the degree of overlap (see the snippet below)
  • tez.shuffle-vertex-manager.min-src-fraction : 0.2
  • tez.shuffle-vertex-manager.max-src-fraction : 0.4

• However, the 2nd stage is scheduled early but launched late

[Figure: swimlane graph marking when 2nd-stage tasks are scheduled vs. when they are launched]
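For reference, a minimal sketch of setting these two knobs programmatically on the configuration used to submit the job; this is not taken from the talk's setup scripts, and the values can equally be placed in tez-site.xml.

```scala
import org.apache.hadoop.conf.Configuration

object TezOverlapSettings {
  def configure(conf: Configuration): Configuration = {
    // Fraction of source tasks that must finish before downstream tasks
    // start to be scheduled ...
    conf.setFloat("tez.shuffle-vertex-manager.min-src-fraction", 0.2f)
    // ... and the fraction at which all downstream tasks may be scheduled.
    conf.setFloat("tez.shuffle-vertex-manager.max-src-fraction", 0.4f)
    conf
  }
}
```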

Page 19: Dongwon Kim – A Comparative Performance Evaluation of Flink

Spark does not overlap 1st and 2nd stages

• Spark cannot execute multiple stages simultaneously
  • Also mentioned in the following VLDB paper (2015):

  “Spark doesn’t support the overlap between shuffle write and read stages. … Spark may want to support this overlap in the future to improve performance.”

• Experimental results of that paper:
  • Spark is faster than MapReduce for WordCount, K-means, and PageRank.
  • MapReduce is faster than Spark for Sort.

Page 20: Dongwon Kim – A Comparative Performance Evaluation of Flink

MapReduce is slow despite overlapping stages

• mapreduce.job.reduce.slowstart.completedMaps : [0.0, 1.0] (see the snippet below)
  • 0.05 (overlapping, default) : 2157 sec
  • 0.95 (no overlapping) : 2385 sec
  • Overlapping gives only a 10% improvement

• Wang’s attempt to overlap Spark stages
  • Wang proposes to overlap stages to achieve better utilization – 10%?

• Why do Spark & MapReduce improve by just 10%?

[Figure: MapReduce swimlane graphs for slowstart 0.05 vs. 0.95, each showing the 1st and 2nd stages]
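A minimal sketch of toggling this slow-start knob programmatically; it is not from the talk's scripts, and the value can equally be passed as -Dmapreduce.job.reduce.slowstart.completedMaps on the command line.

```scala
import org.apache.hadoop.conf.Configuration

object SlowStartSettings {
  def configure(conf: Configuration, overlap: Boolean): Configuration = {
    // 0.05 (default): reducers launch once 5% of the map tasks have finished,
    // so shuffling overlaps the map stage; 0.95: reducers launch only near
    // the end of the map stage, i.e. effectively no overlap.
    val fraction = if (overlap) 0.05f else 0.95f
    conf.setFloat("mapreduce.job.reduce.slowstart.completedMaps", fraction)
    conf
  }
}
```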

Page 21: Dongwon Kim – A Comparative Performance Evaluation of Flink

Data transfer between tasks of different stages

• Traditional pull model
  • Used in MapReduce, Spark, Tez
  • The producer task (1) writes its output file (partitions P1 … Pn) to disk; each consumer task then (2) requests its partition from a shuffle server, which (3) sends it
  • Extra disk access & simultaneous disk access
  • Shuffling affects the performance of producers
  • Leads to only a 10% improvement from overlapping stages

• Pipelined data transfer
  • Used in Flink
  • Data is transferred from memory to memory
  • Flink causes fewer disk accesses during shuffling

Page 22: Dongwon Kim – A Comparative Performance Evaluation of Flink

Flink causes fewer disk accesses during shuffling

                          MapReduce   Flink   diff.
  Total disk write (TB)      9.9       6.5     3.4
  Total disk read  (TB)      8.1       6.9     1.2

• The difference comes from shuffling

• Shuffled data are sometimes read from the page cache

[Figure: cluster disk throughput (read/write) for MapReduce vs. Flink; the total amount of disk read/write equals the area of the blue/green regions]

Page 23: Dongwon Kim – A Comparative Performance Evaluation of Flink

Result of TeraSort with various data sizes

Time (seconds)*:

  node data size (GB)   Flink   Spark   MapReduce    Tez
   10                     157     387        259      277
   20                     350     652        555      729
   40                     741    1135       1085     1709
   80                    1480    2171       2157     1887
  160                    3127    4927       4796     3950

The 80 GB/node row is “what we’ve seen” on the previous slides.

[Figure: the same results plotted on a log scale, time (seconds) vs. node data size (GB), for Flink, Spark, MapReduce, and Tez]

* Map output compression turned on for Spark and Tez

Page 24: Dongwon Kim – A Comparative Performance Evaluation of Flink

Result of HashJoin

• 10 slave nodes

• Datasets generated with org.apache.tez.examples.JoinDataGen
  • Small dataset : 256 MB
  • Large dataset : 240 GB (24 GB/node)

• Result (time in seconds)*: Flink is ~2x faster than Tez and ~4x faster than Spark (see the sketch below)
  • Tez-0.7.0 : 770
  • Spark-1.5.1 : 1538
  • Flink-0.9.1 : 378

• Visit my blog for details

* No map output compression for either Spark or Tez, unlike in TeraSort
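For context, here is a minimal Scala sketch of how such a hash join can be expressed in the Flink DataSet API. This is not the author's benchmark code: the (key, payload) String tuples, the CSV input, and the use of joinWithTiny to broadcast and hash the small side are illustrative assumptions.

```scala
import org.apache.flink.api.scala._

object FlinkHashJoinSketch {
  def main(args: Array[String]): Unit = {
    val Array(largePath, smallPath, output) = args
    val env = ExecutionEnvironment.getExecutionEnvironment

    val large = env.readCsvFile[(String, String)](largePath)
    val small = env.readCsvFile[(String, String)](smallPath)

    // joinWithTiny hints that the second input is small enough to broadcast,
    // so Flink builds a hash table from it and streams the large side through
    // without re-partitioning it.
    val joined = large
      .joinWithTiny(small)
      .where(0)
      .equalTo(0) { (l, r) => (l._1, l._2, r._2) }

    joined.writeAsCsv(output)
    env.execute("hashjoin-sketch")
  }
}
```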

Page 25: Dongwon Kim – A Comparative Performance Evaluation of Flink

Result of HashJoin with swimlane & throughput graphs

[Figure: swimlane and cluster network/disk throughput graphs for the HashJoin runs. The Tez and Spark swimlanes each contain an idle period between stages, whereas Flink's pipeline (1 DataSource, 2 DataSource, 3 Join, 4 DataSink) overlaps its 2nd and 3rd operators. The graphs are annotated with total data volumes of 0.24 TB, 0.41 TB, 0.60 TB, 0.68 TB, 0.74 TB, and 0.84 TB.]

Page 26: Dongwon Kim – A Comparative Performance Evaluation of Flink

Flink’s shortcomings

• No support for map output compression
  • Small data blocks are pipelined between operators

• Only job-level fault tolerance
  • Shuffle data are not materialized

• Low disk throughput during the post-shuffling phase

Page 27: Dongwon Kim – A Comparative Performance Evaluation of Flink

Low disk throughput during the post-shuffling phase

• Possible reason: sorting records from small files
  • Concurrent disk access to small files → too many disk seeks → low disk throughput
  • Other engines merge records from larger files than Flink

• “Eager pipelining moves some of the sorting work from the mapper to the reducer”
  • from MapReduce Online (NSDI 2010)

[Figure: post-shuffling disk throughput for Flink, Tez, and MapReduce]

Page 28: Dongwon Kim – A Comparative Performance Evaluation of Flink

Outline

• TeraSort for various engines

• Experimental setup

• Results & analysis

• What else for better performance?

• Conclusion


Page 29: Dongwon Kim – A Comparative Performance Evaluation of Flink

MR2 – another MapReduce engine

• PhD thesis
  • MR2: Fault Tolerant MapReduce with the Push Model
  • Developed for 3 years

• Provides the user interface of Hadoop MapReduce
  • No DAG support
  • No in-memory computation
  • No iterative computation

• Characteristics
  • Push model + fault tolerance
  • Techniques to boost HDD throughput
    • Prefetching for mappers
    • Preloading for reducers

Page 30: Dongwon Kim – A Comparative Performance Evaluation of Flink

MR2 pipeline

• 7 types of components with memory buffers
  1. Mappers & reducers : to apply user-defined functions
  2. Prefetcher & preloader : to eliminate concurrent disk access
  3. Sender & receiver & merger : to implement MR2’s push model

• Various buffers : to pass data between components w/o disk IOs

• Minimum disk access (2 disk reads & 2 disk writes)
  • +1 disk write for fault tolerance

[Figure: the MR2 pipeline, with its disk accesses labeled R1, W1, R2, W2 along the component chain, plus W3 for the fault-tolerance write]

Page 31: Dongwon Kim – A Comparative Performance Evaluation of Flink

Prefetcher & Mappers

• Prefetcher loads data for multiple mappers

• Mappers do not read input from disks

[Figure: disk throughput and CPU utilization over time with 2 mappers on a node. In Hadoop MapReduce, Mapper1 and Mapper2 each read their own blocks (Blk1, Blk2, …) and compete for the disk; in MR2, the prefetcher sequentially reads Blk1, Blk2, Blk3, Blk4, … and feeds them to the two mappers, which do not touch the disk.]

Page 32: Dongwon Kim – A Comparative Performance Evaluation of Flink

Push-model in MR2

• Node-to-node network connection for pushing data
  • To reduce the number of network connections

• Data transfer from memory buffers (similar to Flink’s pipelined execution)
  • Mappers store spills in the send buffer
  • Spills are pushed to the reducer sides by the sender
  • Unlike Flink, MR2 does local sorting before pushing data (similar to Spark)

• Fault tolerance (can be turned on/off)
  • Input ranges of each spill are known to the master for reproduction
  • Spills are stored on disk for fast recovery (extra disk write)

Page 33: Dongwon Kim – A Comparative Performance Evaluation of Flink

Receiver & merger & preloader & reducer

• Merger produces a file from different partition data
  • Sorts each partition’s data
  • And then does interleaving

• Preloader preloads each group into the reduce buffer

• Reducers do not read data directly from disks
  • MR2 can eliminate concurrent disk reads from reducers thanks to the preloader

[Figure: the receiver’s managed memory holds data for partitions P1–P4; the merger interleaves them into groups on disk, and the preloader loads each group with 1 disk access for 4 partitions]

Page 34: Dongwon Kim – A Comparative Performance Evaluation of Flink

Result of sorting 80GB/node (3.2TB) with MR2

                                   MapReduce in
                                   Hadoop-2.7.1   Tez-0.7.0   Spark-1.5.1   Flink-0.9.1    MR2
  Time (sec)                            2157         1887         2171          1480       890
  MR2 speedup over other engines        2.42         2.12         2.44          1.66         -

[Figure: bar chart of the completion times above]

Page 35: Dongwon Kim – A Comparative Performance Evaluation of Flink

Disk & network throughput

1. DataSource / Mapping
   • Prefetcher is effective
   • MR2 shows higher disk throughput

2. Partition / Shuffling
   • Records to shuffle are generated faster in MR2

3. DataSink / Reducing
   • Preloader is effective
   • Almost 2x throughput

[Figure: cluster disk throughput (read/write) and network throughput (in/out) for Flink vs. MR2, with the three phases above marked 1, 2, 3 on both graphs]

Page 36: Dongwon Kim – A Comparative Performance Evaluation of Flink

PUMA (PUrdue MApreduce benchmarks suite)

• Experimental results using 10 nodes

Page 37: Dongwon Kim – A Comparative Performance Evaluation of Flink

Outline

• TeraSort for various engines

• Experimental setup

• Experimental results & analysis

• What else for better performance?

• Conclusion


Page 38: Dongwon Kim – A Comparative Performance Evaluation of Flink

Conclusion

• Flink provides pipelined execution for both batch and streaming processing

• It even outperforms other batch processing engines for TeraSort & HashJoin

• Shortcomings due to pipelined execution
  • No fine-grained fault tolerance
  • No map output compression
  • Low disk throughput during the post-shuffling phase

Page 39: Dongwon Kim – A Comparative Performance Evaluation of Flink

Thank you!

Any questions?