Apache Beam and Google Cloud Dataflow - IDG - final

Posted on 13-Jan-2017

  • Google Cloud Dataflow: the next generation of managed big data service, based on the Apache Beam programming model

    Sub Szabolcs Feczak, Cloud Solutions Engineer

    Google

    9th Cloud & Data Center World 2016 - IDG

    mailto:sub@google.com

  • Goals

    You leave here understanding the fundamentals of
    the Apache Beam model and the Google Cloud Dataflow managed service.

    We have some fun.

  • Background and Historical overview

  • The trade-off quadrant of Big Data

    Completeness
    Speed
    Cost Optimization
    Complexity

    Time to Answer

  • MapReduce

    Hadoop

    Flume

    Storm

    Spark

    MillWheel

    Flink

    Apache Beam

    Batch
    Streaming
    Pipelines
    Unified API
    No Lambda
    Iterative
    Interactive
    Exactly Once
    State
    Timers
    Auto-Awesome
    Watermarks
    Windowing
    High-level API
    Managed Service
    Triggers
    Open Source
    Unified Engine*

    *Optimizer

  • Deep dive, probing familiarity with the subject

    1M Devices

    16.6K Events/sec

    43B Events/month

    518B Events/year

  • Before Apache Beam

    Batch: Accuracy, Simplicity, Savings

    OR

    Stream: Speed, Sophistication, Scalability

  • After Apache Beam

    Batch: Accuracy, Simplicity, Savings

    AND

    Stream: Speed, Sophistication, Scalability

    Balancing correctness, latency, and cost with a unified batch and streaming model

  • http://research.google.com/search.html?q=dataflow


  • Apache Beam (incubating)

    http://incubator.apache.org/projects/beam.html
    The Dataflow submission to the Apache Incubator was accepted on February 1, 2016, and the resulting project is now called Apache Beam.

    Software Development Kits:
    Java: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
    Python (ALPHA)
    Scala: https://github.com/darkjh/scalaflow and https://github.com/jhlch/scala-dataflow-dsl

    Runners:
    Spark runner: https://github.com/cloudera/spark-dataflow
    Flink runner: https://github.com/dataArtisans/flink-dataflow

  • Where might you use Apache Beam?

    Movement
    Filtering
    Enrichment
    Shaping
    Reduction
    Batch computation
    Continuous computation
    Composition
    External orchestration
    Simulation

    Analysis, ETL, Orchestration

  • Why would you go with a managed service?

  • Cloud Dataflow Managed Service advantages (GA since August 2015)

    (Diagram: User Code & SDK is deployed and scheduled onto GCP; the managed
    service runs a Work Manager and Job Manager and surfaces progress & logs
    in a Monitoring UI)

  • Cloud Dataflow Service: Worker Lifecycle Management

    Deploy, Schedule & Monitor, Tear Down

  • Challenge: cost optimization

    Time & life never stop
    Data rates & schema are not static
    Scaling models are not static
    Non-elastic compute is wasteful and can create lag

  • Auto-scaling (Cloud Dataflow Service)

    800 QPS at 10:00, 1200 QPS at 11:00, 5000 QPS at 12:00, 50 QPS at 13:00

  • Dynamic Work Rebalancing (Cloud Dataflow Service)

    100 mins. vs. 65 mins.

  • Graph Optimization (Cloud Dataflow Service)

    ParDo fusion: producer-consumer and sibling fusion, with intelligent fusion boundaries
    Combiner lifting: e.g. partial aggregations before reduction

    (Diagram: adjacent ParallelDos C and D are fused into a single C+D stage,
    in both the producer-consumer and sibling cases; the chain
    A, GroupByKey (GBK), CombineValues (+), B is rewritten by combiner lifting
    so that partial combines run before the GBK)

    http://research.google.com/search.html?q=flume%20java
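    The combiner-lifting rewrite can be sketched in plain Python (a conceptual
    illustration, not the Dataflow optimizer or Beam SDK; the function names are
    made up for this sketch). Instead of shuffling every (key, value) pair before
    combining, each worker pre-combines its local values, so only one small
    partial per (worker, key) crosses the shuffle boundary:

    ```python
    from collections import defaultdict

    def combine_lifted(shards, combine=sum):
        """Pre-combine per shard (worker), then merge partials after the shuffle.

        shards: one list of (key, value) pairs per worker.
        """
        # Stage 1: combiner lifting - partial aggregation before the shuffle.
        partials = []
        for shard in shards:
            local = defaultdict(list)
            for key, value in shard:
                local[key].append(value)
            partials.append({k: combine(vs) for k, vs in local.items()})

        # Stage 2: only the partials cross the shuffle; the final
        # CombineValues step merges them per key.
        merged = defaultdict(list)
        for partial in partials:
            for key, part in partial.items():
                merged[key].append(part)
        return {k: combine(parts) for k, parts in merged.items()}

    shards = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
    print(combine_lifted(shards))  # {'a': 8, 'b': 7}
    ```

    The result is identical to a full shuffle followed by one combine; what
    changes is how much data moves between stages.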

  • Deep dive into the programming model

  • The Apache Beam Logical Model

    What are you computing?

    Where in event time?

    When in processing time?

    How do refinements relate?

  • What are you computing?

    A Pipeline represents a graph
    Nodes are data processing transformations
    Edges are data sets flowing through the pipeline
    Optimized and executed as a unit for efficiency

  • What are you computing? PCollections

    A PCollection is a collection of data of the same type

    May be bounded or unbounded in size
    Each element has an implicit timestamp
    Initially created from backing data stores

  • Challenge: completeness when processing continuous data

    (Timeline: events with event times around 8:00 keep arriving throughout 9:00-14:00)

  • What are you computing? PTransforms transform PCollections into other PCollections.

    What Where When How

    Element-wise (Map + Reduce = ParDo)
    Aggregating (Combine, Join, Group)
    Composite

  • Composite PTransforms (Apache Beam SDK)

    Count = Pair With Ones + GroupByKey + Sum Values

    Define new PTransforms by building up subgraphs of existing transforms

    Some utilities are included in the SDK: Count, RemoveDuplicates, Join,
    Min, Max, Sum, ...

    You can define your own: DoSomething, DoSomethingElse, etc.

    Why bother? Code reuse, better monitoring experience
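    The Count composite (pair with ones, group by key, sum values) can be
    sketched in plain Python; this is a conceptual stand-in for the SDK's
    composite PTransform, not actual Beam code, and the function names are
    illustrative:

    ```python
    from collections import defaultdict

    def pair_with_ones(elements):
        # Element-wise step: each element becomes an (element, 1) pair.
        return [(e, 1) for e in elements]

    def group_by_key(pairs):
        # GroupByKey: gather all values for each key.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def sum_values(grouped):
        # CombineValues: reduce each key's values to their sum.
        return {k: sum(vs) for k, vs in grouped.items()}

    def count(elements):
        # The composite: a subgraph built from existing transforms.
        return sum_values(group_by_key(pair_with_ones(elements)))

    print(count(["cat", "dog", "cat"]))  # {'cat': 2, 'dog': 1}
    ```

    Wrapping the subgraph behind one name is what enables reuse and lets a
    monitoring UI show "Count" as a single collapsible step.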

  • Example: Computing Integer Sums

    What Where When How

  • What Where When How

    Example: Computing Integer Sums

  • Where in Event Time?

    Windowing divides data into event-time-based finite chunks.
    Required when doing aggregations over unbounded data.

    (Figure: Fixed, Sliding, and Session windows, each shown across Keys 1-3)

    What Where When How
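    The three window shapes can be sketched as plain-Python assignment
    functions (illustrative only, not the Beam windowing API; timestamps are
    plain integers here):

    ```python
    def fixed_windows(ts, size):
        """Assign a timestamp to the single fixed window containing it."""
        start = ts - ts % size
        return [(start, start + size)]

    def sliding_windows(ts, size, period):
        """Assign a timestamp to every sliding window that contains it.

        Window starts fall on multiples of `period`; each window spans `size`.
        """
        wins = []
        start = ts - ts % period
        while start > ts - size:
            wins.append((start, start + size))
            start -= period
        return sorted(wins)

    def session_windows(timestamps, gap):
        """Merge timestamps closer together than `gap` into session windows."""
        sessions = []
        for ts in sorted(timestamps):
            if sessions and ts < sessions[-1][1]:
                # Still inside the current session: extend its end.
                sessions[-1] = (sessions[-1][0], ts + gap)
            else:
                sessions.append((ts, ts + gap))
        return sessions

    print(fixed_windows(7, 2))             # [(6, 8)]
    print(sliding_windows(5, 4, 2))        # [(2, 6), (4, 8)]
    print(session_windows([1, 2, 10], 3))  # [(1, 5), (10, 13)]
    ```

    Fixed and sliding windows are data-independent; sessions are data-driven,
    which is why they must be computed per key from the observed timestamps.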

  • What Where When How

    Example: Fixed 2-minute Windows

  • What Where When How

    When in Processing Time?

    Triggers control when results are emitted.
    Triggers are often relative to the watermark.

    (Figure: processing time vs. event time, showing the watermark and its skew)
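    A watermark trigger can be simulated in a few lines of plain Python (a toy
    model, not the Beam trigger API; the function name and record shape are
    invented for this sketch). Records arrive in processing order carrying an
    event time; when the watermark passes a fixed window's end, that window's
    sum is emitted once, and anything arriving afterwards counts as late data:

    ```python
    def run_watermark_trigger(arrivals, window_size):
        """Toy watermark trigger over fixed event-time windows.

        arrivals: (event_time, value, watermark_after) in processing order.
        Returns (on_time_output, late_records).
        """
        windows = {}   # window start -> running sum
        fired = set()  # windows already emitted at the watermark
        output, late = [], []
        for event_time, value, watermark in arrivals:
            start = event_time - event_time % window_size
            if start in fired:
                late.append((start, value))  # arrived after the on-time firing
            else:
                windows[start] = windows.get(start, 0) + value
            # Fire every window whose end the watermark has now passed.
            ready = [s for s in windows
                     if s + window_size <= watermark and s not in fired]
            for s in sorted(ready):
                output.append((s, windows[s]))
                fired.add(s)
        return output, late

    out, late = run_watermark_trigger(
        [(0, 1, 1), (1, 2, 1), (3, 4, 4), (1, 8, 5)], 2)
    print(out, late)  # [(0, 3), (2, 4)] [(0, 8)]
    ```

    The last record has event time 1 but arrives after the watermark already
    passed 2, so it lands in the late list; real triggers let you choose
    whether such data produces an extra late firing.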

  • What Where When How

    Example: Triggering at the Watermark

  • What Where When How

    Example: Triggering for Speculative & Late Data

  • What Where When How

    How do Refinements Relate?

    How should multiple outputs per window accumulate?
    Appropriate choice depends on the consumer.

    Firing          Elements   Discarding   Accumulating   Acc. & Retracting
    Speculative     3          3            3              3
    Watermark       5, 1       6            9              9, -3
    Late            2          2            11             11, -9
    Total Observed  11         11           23             11
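    The three accumulation modes can be reproduced with a short plain-Python
    sketch (illustrative only; not Beam API code). Each firing delivers some
    elements for one window, and each mode decides what that pane emits:

    ```python
    def accumulation_modes(firings):
        """For one window, compute each firing's output under the three modes.

        firings: a list of element lists, one per firing (e.g. speculative,
        watermark, late). Returns (discarding, accumulating, retracting).
        """
        discarding, accumulating, retracting = [], [], []
        running = 0
        for elements in firings:
            pane = sum(elements)
            discarding.append([pane])          # only this pane's elements
            previous = running
            running += pane
            accumulating.append([running])     # the full running total
            # Accumulating & retracting: new total plus a retraction of the
            # previous total (no retraction on the first firing).
            retracting.append([running, -previous] if previous else [running])
        return discarding, accumulating, retracting

    d, a, r = accumulation_modes([[3], [5, 1], [2]])  # speculative, watermark, late
    print(d)  # [[3], [6], [2]]
    print(a)  # [[3], [9], [11]]
    print(r)  # [[3], [9, -3], [11, -9]]
    ```

    Summing everything a downstream consumer observes explains the totals:
    discarding sums to 11, accumulating over-counts to 23 unless the consumer
    keeps only the latest pane, and retractions cancel out to 11.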

  • What Where When How

    Example: Add Newest, Remove Previous

  • Customizing What Where When How

    1. Classic Batch
    2. Batch with Fixed Windows
    3. Streaming
    4. Streaming with Speculative + Late Data
    5. Streaming with Retractions

    What Where When How

  • The key takeaway

  • Optimizing Your Time To Answer

    More time to dig into your data

    Typical Data Processing: Programming, Resource provisioning, Performance tuning,
    Monitoring, Reliability, Deployment & configuration, Handling growing scale,
    Utilization improvements

    Data Processing with Cloud Dataflow: Programming

  • How much more time?

    You do not just save on processing time, but on code complexity and size as well!

    Source: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison

  • What do customers have to say about Google Cloud Dataflow?

    "We are utilizing Cloud Dataflow to overcome elasticity challenges with our current Hadoop cluster. Starting with some basic ETL workflow for BigQuery ingestion, we transitioned into full-blown clickstream processing and analysis. This has helped us significantly improve performance of our overall system and reduce cost."

    Sudhir Hasbe, Director of Software Engineering, Zulily.com

    "The current iteration of Qubit's real-time data supply chain was heavily inspired by the ground-breaking stream processing concepts described in Google's MillWheel paper. Today we are happy to come full circle and build streaming pipelines on top of Cloud Dataflow, which has delivered on the promise of a highly-available and fault-tolerant data processing system with an incredibly powerful a...
