Dongwon Kim – A Comparative Performance Evaluation of Flink


TRANSCRIPT

  • A Comparative Performance Evaluation of Flink

    Dongwon Kim

    POSTECH

  • About Me

    Postdoctoral researcher @ POSTECH

    Research interest: Design and implementation of distributed systems

    Performance optimization of big data processing engines

    Doctoral thesis: MR2: Fault Tolerant MapReduce with the Push Model

    Personal blog: http://eastcirclek.blogspot.kr

    Why I'm here

    2


  • Outline

    TeraSort for various engines

    Experimental setup

    Results & analysis

    What else for better performance?

    Conclusion

    3

  • TeraSort

    Hadoop MapReduce program for the annual terabyte sort competition

    TeraSort is essentially distributed sort (DS)

    [Figure: typical distributed sort on two nodes. Records a1-a4 and b1-b4 start scattered across Node 1 and Node 2; each node reads from disk, locally sorts, shuffles so that all a-records end up on Node 1 and all b-records on Node 2, locally sorts again, and writes to disk, producing a total order.]

    Typical DS phases: read → local sort → shuffling → local sort → write

    4

  • Included in Hadoop distributions with TeraGen & TeraValidate

    Identity map & reduce functions

    Range partitioner built on sampling, to guarantee a total order & to prevent partition skew

    Sampling to compute boundary points within a few seconds (see the sketch after this slide)

    TeraSort for MapReduce

    Map task: read → map → sort; Reduce task: shuffling → sort → reduce → write
    (DS phases: read → local sort → shuffling → local sort → write)

    The record range is split by the boundary points into Partition 1, Partition 2, …, Partition r.

    5
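    The following is a minimal sketch of the sampling idea (illustrative Scala, not the Hadoop code; the names RangePartitioningSketch, boundaries, and partition are made up here): take a sample of keys, derive r-1 sorted boundary points, and assign each record to the partition whose key range contains its key.

      object RangePartitioningSketch {

        // Pick (numPartitions - 1) boundary points from a sorted sample of keys,
        // spaced evenly so partitions receive roughly equal numbers of records.
        def boundaries(sample: Seq[String], numPartitions: Int): Array[String] = {
          val sorted = sample.sorted
          Array.tabulate(numPartitions - 1)(i => sorted((i + 1) * sorted.size / numPartitions))
        }

        // A key belongs to the first partition whose upper boundary is >= the key;
        // keys greater than every boundary fall into the last partition.
        // (A real implementation would use a binary search or a trie, not a linear scan.)
        def partition(key: String, bounds: Array[String]): Int = {
          val i = bounds.indexWhere(key <= _)
          if (i < 0) bounds.length else i
        }
      }

    Because every key sent to partition i is no greater than the boundary that precedes partition i+1, concatenating the sorted partitions yields a total order, while evenly spaced boundary points from a representative sample keep the partitions roughly equal in size.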

  • Tez can execute TeraSort for MapReduce w/o any modification: mapreduce.framework.name = yarn-tez

    Tez DAG plan of TeraSort for MapReduce

    TeraSort for Tez

    [Figure: the DAG consists of an initialmap vertex (Map task: read → map → sort) feeding a finalreduce vertex (Reduce task: shuffling → sort → reduce → write), reading the input data and writing the output data.]

    6
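    As a hedged illustration of "w/o any modification": the stock MapReduce TeraSort driver can be submitted with only the framework name switched, which is equivalent to passing -Dmapreduce.framework.name=yarn-tez on the command line. The object name TeraSortOnTez is made up; TeraSort and ToolRunner are the standard Hadoop classes.

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.examples.terasort.TeraSort
      import org.apache.hadoop.util.ToolRunner

      object TeraSortOnTez {
        def main(args: Array[String]): Unit = {
          val conf = new Configuration()
          // Route the unmodified MapReduce job to the Tez runtime.
          conf.set("mapreduce.framework.name", "yarn-tez")
          // args: <input dir> <output dir>, exactly as for TeraSort on MapReduce.
          sys.exit(ToolRunner.run(conf, new TeraSort(), args))
        }
      }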

  • TeraSort for Spark & Flink

    My source code in GitHub: https://github.com/eastcirclek/terasort

    Sampling-based range partitioner from TeraSort for MapReduce. Visit my personal blog for a detailed explanation:

    http://eastcirclek.blogspot.kr

    7


  • TeraSort for Spark

    Code: two RDDs, two stages (Stage 0 and Stage 1).

    Stage 0, Shuffle-Map Task (for newAPIHadoopFile): read → sort. Creates a new RDD to read from HDFS; # partitions = # blocks.

    Stage 1, Result Task (for repartitionAndSortWithinPartitions): shuffling → sort → write. Repartitions the parent RDD based on the user-specified partitioner and writes output to HDFS.

    DS phases: read → local sort → shuffling → local sort → write

    (A code sketch of this two-stage structure follows below.)

    8
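    A minimal sketch of this two-stage structure (illustrative, not the code from the repository above): it assumes the Hadoop TeraInputFormat/TeraOutputFormat for (Text key, Text value) records and uses Spark's built-in RangePartitioner as a stand-in for the sampling-based range partitioner; the string-based key ordering is a simplification of TeraSort's byte-order comparison.

      import org.apache.hadoop.examples.terasort.{TeraInputFormat, TeraOutputFormat}
      import org.apache.hadoop.io.Text
      import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

      object SparkTeraSortSketch {
        // Lexicographic key ordering so repartitionAndSortWithinPartitions can sort.
        implicit val textOrdering: Ordering[Text] = Ordering.by((t: Text) => t.toString)

        def main(args: Array[String]): Unit = {
          val Array(input, output) = args
          val sc = new SparkContext(new SparkConf().setAppName("TeraSort"))

          // Stage 0 (Shuffle-Map tasks): one task per HDFS block.
          // Hadoop reuses Writable objects, so keys/values are copied before shuffling.
          val records = sc.newAPIHadoopFile[Text, Text, TeraInputFormat](input)
            .map { case (k, v) => (new Text(k), new Text(v)) }

          // Stage 1 (Result tasks): range-partition, sort within each partition, write to HDFS.
          val partitioner = new RangePartitioner(records.partitions.length, records)
          records
            .repartitionAndSortWithinPartitions(partitioner)
            .saveAsNewAPIHadoopFile[TeraOutputFormat](output)
        }
      }

    Note that Spark's RangePartitioner samples the parent RDD to choose its boundaries, which costs an extra pass over the input; the talk's implementation instead reuses the boundary points computed by the sampling step of TeraSort for MapReduce.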

  • TeraSort for Flink

    Code: a pipeline consisting of four operators: DataSource → Partition → SortPartition → DataSink.

    1 DataSource: create a dataset to read tuples from HDFS
    2 Partition: partition tuples
    3 SortPartition: sort tuples of each partition (local sort)
    4 DataSink: write output to HDFS

    No map-side sorting due to pipelined execution.

    DS phases: read → shuffling → local sort → write

    (A code sketch of this pipeline follows below.)

    9
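    A minimal sketch of the four-operator pipeline (illustrative, not the code from the repository above): it reads plain text lines, uses the first 10 characters as the key, and assumes the range boundaries come from the same sampling step; KeyRangePartitioner and the hard-coded boundaries are placeholders.

      import org.apache.flink.api.common.functions.Partitioner
      import org.apache.flink.api.common.operators.Order
      import org.apache.flink.api.scala._

      object FlinkTeraSortSketch {

        // Range partitioner: a key goes to the first range whose upper boundary is >= the key.
        class KeyRangePartitioner(bounds: Array[String]) extends Partitioner[String] {
          override def partition(key: String, numPartitions: Int): Int = {
            val i = bounds.indexWhere(key <= _)
            if (i < 0) numPartitions - 1 else i
          }
        }

        def main(args: Array[String]): Unit = {
          val Array(input, output) = args
          val bounds = Array("h", "p")  // placeholder boundary points for 3 partitions
          val env = ExecutionEnvironment.getExecutionEnvironment

          env.readTextFile(input)                                 // 1 DataSource: read from HDFS
            .map(line => (line.take(10), line))                   //   key = first 10 characters
            .partitionCustom(new KeyRangePartitioner(bounds), 0)  // 2 Partition: range-partition by key
            .sortPartition(0, Order.ASCENDING)                    // 3 SortPartition: local sort per partition
            .writeAsCsv(output)                                   // 4 DataSink: write to HDFS
          env.execute("TeraSort sketch")
        }
      }

    As on the slide, nothing is sorted before partitionCustom; the only sort is the SortPartition step after the shuffle.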

  • Importance of TeraSort

    Suitable for measuring the pure performance of big data engines: no data transformation (like map, filter) with user-defined logic

    Basic facilities of each engine are used

    Winning the sort benchmark is a great means of PR

    10

  • Outline

    TeraSort for various engines

    Experimental setup

    Machine specification

    Node configuration

    Results & analysis

    What else for better performance?

    Conclusion

    11

  • Machine specification (42 identical machines)

    DELL PowerEdge R610

    CPU: two X5650 processors (total 12 cores)

    Memory: total 24 GB

    Disk: 6 disks * 500 GB/disk

    Network: 10 Gigabit Ethernet

                 My machine                      Spark team
    Processor    Intel Xeon X5650 (Q1, 2010)     Intel Xeon E5-2670 (Q1, 2012)
    Cores        6 * 2 processors                8 * 4 processors
    Memory       24 GB                           244 GB
    Disks        6 HDDs                          8 SSDs

    Results can be different in newer machines

    12

  • Node configuration

    24 GB on each node; 2 GB in total for daemons (DataNode 1 GB + NodeManager 1 GB); at most 12 simultaneous tasks per node.

    MapReduce-2.7.1: NodeManager (1 GB) hosting the ShuffleService, DataNode (1 GB), and up to 12 concurrent MapTasks/ReduceTasks of 1 GB each (task area labeled 13 GB in the figure).

    Tez-0.7.0: NodeManager (1 GB) hosting the ShuffleService, DataNode (1 GB), and up to 12 concurrent MapTasks/ReduceTasks of 1 GB each (task area labeled 13 GB in the figure).

    Spark-1.5.1: NodeManager (1 GB), DataNode (1 GB), Driver (1 GB), and one Executor (12 GB) with its internal memory layout, various managers, a thread pool, and task slots 1-12.

    Flink-0.9.1: NodeManager (1 GB), DataNode (1 GB), JobManager (1 GB), and one TaskManager (12 GB) with its internal memory layout, various managers, task threads, and task slots 1-12.

    13

  • Outline

    TeraSort for various engines

    Experimental setup

    Results & analysis

    Flink is faster than other engines due to its pipelined execution

    What else for better performance?

    Conclusion

    14

  • How to read a swimlane graph & throughput graphs

    [Figure: a swimlane graph (tasks vs. time since job starts, in seconds) and cluster network/disk throughput graphs (network In/Out, disk read/disk write), annotated with 1st and 2nd stages.]

    Swimlane graph: each line is the duration of one task, and different patterns appear for different stages. Here there are 6 waves of 1st-stage tasks (1st to 6th) and 1 wave of 2nd-stage tasks, and the two stages are hardly overlapped.

    Throughput graphs: no network traffic during the 1st stage.

    15

  • Result of sorting 80GB/node (3.2TB)

    [Figure: swimlane graphs showing 1st and 2nd stages for MapReduce in Hadoop-2.7.1 (2157 sec), Tez-0.7.0 (1887 sec), Spark-1.5.1 (2171 sec), and Flink-0.9.1 (1480 sec); the Flink graph is annotated with its four operators (1 DataSource, 2 Partition, 3 SortPartition, 4 DataSink). A bar chart summarizes the total times in seconds.]

    Flink is the fastest due to its pipelined execution.

    Tez and Spark do not overlap 1st and 2nd stages.

    MapReduce is slow despite overlapping stages.

    * Map output compression turned on for Spark and Tez

    16

  • Tez and Spark do not overlap 1st and 2nd stages

    [Figure: cluster network and disk throughput graphs (network In/Out, disk read/disk write) for Tez, Spark, and Flink. For Tez and Spark: the network is idle during the 1st stage; (1) the 2nd stage starts, (2) only then is the output of the 1st stage sent, and (3) the disk write to HDFS occurs after shuffling is done. For Flink (operators 1 DataSource, 2 Partition, 3 SortPartition, 4 DataSink): (1) network traffic occurs from the start, and (2) the write to HDFS occurs right after shuffling is done.]

    17

  • Tez does not overlap 1st and 2nd stages

    Tez has parameters to control the degree of overlap:
    tez.shuffle-vertex-manager.min-src-fraction : 0.2
    tez.shuffle-vertex-manager.max-src-fraction : 0.4
    (A configuration sketch follows after this slide.)

    However, the 2nd stage is scheduled early but launched late.

    [Figure: swimlane graph marking when 2nd-stage tasks are scheduled vs. when they are launched.]

    18
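    A hedged sketch of how these two knobs could be supplied when submitting TeraSort to Tez (on the same Configuration-based submission sketched earlier, or as -D options on the command line); the names TezOverlapSettings and withOverlapSettings are made up:

      import org.apache.hadoop.conf.Configuration

      object TezOverlapSettings {
        // min-src-fraction: downstream tasks may be scheduled once 20% of upstream tasks finish;
        // max-src-fraction: all downstream tasks are scheduled once 40% have finished.
        def withOverlapSettings(conf: Configuration): Configuration = {
          conf.set("tez.shuffle-vertex-manager.min-src-fraction", "0.2")
          conf.set("tez.shuffle-vertex-manager.max-src-fraction", "0.4")
          conf
        }
      }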

  • Spark does not overlap 1st and 2nd stages

    Spark cannot execute multiple stages simultaneously, as also mentioned in the following VLDB paper (2015):

    "Spark doesn't support the overlap between shuffle write and read stages. Spark may want to support this overlap in the future to improve performance."

    Experimental results of this paper: Spark is faster than MapReduce for WordCount, K-means, and PageRank; MapReduce is faster than Spark for Sort.

    19

  • MapReduce is slow despite overlapping stages

    mapreduce.job.reduce.slowstart.completedMaps : [0.0, 1.0] (see the configuration sketch after this slide)

    [Figure: MapReduce swimlane graphs (1st and 2nd stages) with slowstart = 0.05 (overlapping, the default): 2157 sec, and with slowstart = 0.95 (no overlapping): 2385 sec, i.e., about a 10% improvement from overlapping.]

    Wang's attempt to overlap Spark stages: Wang proposes to overlap stages to achieve better utilization.

    10%??? Why do Spark & MapReduce improve by just 10%?

    20
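    A hedged sketch of how the slow-start fraction can be set before submitting TeraSort (it is an ordinary job property, also settable via -D or mapred-site.xml); the names SlowStart and withSlowStart are made up:

      import org.apache.hadoop.conf.Configuration

      object SlowStart {
        // 0.05 (default): reducers start fetching once 5% of map tasks finish, so the stages overlap;
        // 0.95: reducers start only near the end of the map stage, i.e. almost no overlap.
        def withSlowStart(conf: Configuration, fraction: Double): Configuration = {
          conf.set("mapreduce.job.reduce.slowstart.completedMaps", fraction.toString)
          conf
        }
      }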

  • Data transfer between tasks of different stages

    [Figure, pull model: a producer task (1) writes its output file, split into partitions P1, P2, ..., Pn, to disk; each consumer task i then (2) requests Pi from the shuffle server, which (3) sends Pi.]

    Traditional pull model
    - Used in MapReduce, Spark, Tez
    - Extra disk access & simultaneous disk access
    - Shuffling affects the performance of producers

    Pipelined data transfer
    - Used in Flink
    - Data transfer from memory to memory
    - Flink causes fewer disk accesses during shuffling

    (A toy sketch contrasting the two models follows below.)
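    A toy, engine-agnostic illustration of the contrast (no real engine API; all names are made up): the pull model materializes every partition on disk before consumers fetch it, while the pipelined model hands records to per-consumer in-memory channels as soon as they are produced, so shuffling overlaps with the producer's work.

      import java.nio.file.{Files, Paths}
      import java.util.concurrent.LinkedBlockingQueue

      object ShuffleModels {
        type Record = String

        // Stand-in partitioning function; the transfer model, not the partitioner, is the point here.
        def partitionOf(r: Record, n: Int): Int = math.abs(r.hashCode) % n

        // Pull model: the producer writes each partition to disk; consumers later
        // fetch "their" partition file through a shuffle server (extra disk reads/writes).
        def pullModelProducer(output: Seq[Record], numConsumers: Int, dir: String): Unit =
          output.groupBy(partitionOf(_, numConsumers)).foreach { case (p, recs) =>
            Files.write(Paths.get(dir, s"partition-$p"), recs.mkString("\n").getBytes("UTF-8"))
          }

        // Pipelined model: the producer pushes each record into the consumer's
        // in-memory channel as soon as it is produced (memory-to-memory transfer).
        def pipelinedProducer(output: Iterator[Record],
                              channels: Vector[LinkedBlockingQueue[Record]]): Unit =
          output.foreach(r => channels(partitionOf(r, channels.size)).put(r))
      }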