apache beam and google cloud dataflow - idg - final

Google Cloud Dataflow: the next generation of managed big data service based on the Apache Beam programming model

Sub Szabolcs Feczak, Cloud Solutions Engineer

Google

9th Cloud & Data Center World 2016 - IDG Korea


Goals

1. You leave here understanding the fundamentals of the Apache Beam model and the Google Cloud Dataflow managed service.

2. We have some fun.

Background and Historical overview

The trade-off quadrant of Big Data

Completeness

Speed

Cost Optimization

Complexity

Time to Answer

MapReduce

Hadoop

Flume

Storm

Spark

MillWheel

Flink

Apache Beam


Batch

Streaming

Pipelines

Unified API

No Lambda

Iterative

Interactive

Exactly Once

State

Timers

Auto-Awesome

Watermarks

Windowing

High-level API

Managed Service

Triggers

Open Source

Unified Engine

Optimizer

Deep dive, probing familiarity with the subject

1M Devices

16.6K Events/sec

43B Events/month

518B Events/year
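A quick sanity check of these figures, as a sketch only. It assumes each device emits one event per minute, a 30-day month, and a 12-month year; none of these assumptions is stated on the slide, but together they reproduce its numbers.

```python
# Assumption (not stated on the slide): each device emits one event per minute.
DEVICES = 1_000_000

events_per_minute = DEVICES
events_per_sec = events_per_minute / 60
events_per_month = events_per_minute * 60 * 24 * 30  # 30-day month
events_per_year = events_per_month * 12              # 12 such months

print(f"{events_per_sec:,.0f} events/sec")
print(f"{events_per_month / 1e9:.1f}B events/month")
print(f"{events_per_year / 1e9:.1f}B events/year")
```

Under these assumptions the per-second rate is about 16,667, which the slide shows as 16.6K.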

Before Apache Beam

Batch OR Stream

Accuracy OR Speed

Simplicity OR Sophistication

Savings OR Scalability

After Apache Beam

Batch AND Stream

Accuracy AND Speed

Simplicity AND Sophistication

Savings AND Scalability

Balancing correctness, latency and cost with a unified batch and streaming model

http://research.google.com/search.html?q=dataflow



Apache Beam (incubating)

Software Development Kits

Java: https://github.com/GoogleCloudPlatform/DataflowJavaSDK

Python (ALPHA)

Scala: https://github.com/darkjh/scalaflow and https://github.com/jhlch/scala-dataflow-dsl

Runners

Spark runner: http://github.com/cloudera/spark-dataflow

Flink runner: http://github.com/dataArtisans/flink-dataflow

The Dataflow submission to the Apache Incubator was accepted on February 1, 2016, and the resulting project is now called Apache Beam. http://incubator.apache.org/projects/beam.html

Where might you use Apache Beam?

ETL

• Movement

• Filtering

• Enrichment

• Shaping

Analysis

• Reduction

• Batch computation

• Continuous computation

Orchestration

• Composition

• External orchestration

• Simulation

Why would you go with a managed service?

GCP

Managed Service

User Code & SDK

Work Manager

Deploy & Schedule

Monitoring UI

Job Manager

Cloud Dataflow Managed Service advantages (GA since August 2015)

Progress & Logs

Worker Lifecycle Management (Cloud Dataflow Service)

Deploy, Schedule & Monitor, Tear Down

Challenge: cost optimization

❯ Time & life never stop

❯ Data rates & schema are not static

❯ Scaling models are not static

❯ Non-elastic compute is wasteful and can create lag

Auto-scaling (Cloud Dataflow Service)

10:00: 800 QPS

11:00: 1200 QPS

12:00: 5000 QPS

13:00: 50 QPS
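To make the "non-elastic compute is wasteful" point concrete, here is a rough sketch using the QPS figures above. The per-worker capacity of 100 QPS is a hypothetical number chosen for illustration, and the sizing rule is a simplification, not a description of how the Cloud Dataflow autoscaler decides.

```python
import math

# QPS figures from the slide; PER_WORKER_QPS is a hypothetical capacity.
HOURLY_QPS = {"10:00": 800, "11:00": 1200, "12:00": 5000, "13:00": 50}
PER_WORKER_QPS = 100


def workers_needed(qps, per_worker_qps=PER_WORKER_QPS):
    """Smallest worker count that covers the load, never below one."""
    return max(1, math.ceil(qps / per_worker_qps))


# Elastic: size the pool separately for each hour.
elastic = {hour: workers_needed(qps) for hour, qps in HOURLY_QPS.items()}
elastic_worker_hours = sum(elastic.values())

# Fixed: provision for the peak hour and keep that size all the time.
peak = max(elastic.values())
fixed_worker_hours = peak * len(HOURLY_QPS)

print(elastic)
print(elastic_worker_hours, "vs", fixed_worker_hours, "worker-hours")
```

Provisioning below the peak instead would save cost but build up a backlog at 12:00, which is the lag the slide warns about.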

Dynamic Work Rebalancing (Cloud Dataflow Service)

100 mins. vs. 65 mins.

Graph Optimization (Cloud Dataflow Service)

● ParDo fusion

  ○ Producer Consumer

  ○ Sibling

  ○ Intelligent fusion boundaries

● Combiner lifting, e.g. partial aggregations before reduction

● http://research.google.com/search.html?q=flume%20java

...
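Producer-consumer fusion can be pictured as merging two element-wise steps into one, so the intermediate collection is never materialized. The sketch below is plain Python, not the service's optimizer; the step names `tokenize` and `keep_long_upper` and the helper `fuse` are hypothetical.

```python
def tokenize(line):
    """ParDo-like step: one line in, zero or more words out."""
    yield from line.split()


def keep_long_upper(word):
    """ParDo-like step: drops short words, upper-cases the rest."""
    if len(word) > 3:
        yield word.upper()


def run_unfused(lines):
    # The intermediate word list is fully built before the next step starts.
    words = [w for line in lines for w in tokenize(line)]
    return [out for w in words for out in keep_long_upper(w)]


def fuse(producer, consumer):
    """Return one step that feeds each producer output straight to the consumer."""
    def fused(element):
        for intermediate in producer(element):
            yield from consumer(intermediate)
    return fused


def run_fused(lines):
    step = fuse(tokenize, keep_long_upper)
    return [out for line in lines for out in step(line)]


lines = ["to be or not to be", "that is the question"]
fused_result = run_fused(lines)
unfused_result = run_unfused(lines)
print(fused_result)
```

Both runs produce the same output; the fused one simply skips the intermediate collection.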

[Diagram: fusion and combiner lifting]

Legend: GBK = GroupByKey; + = CombineValues; the remaining nodes (A, B, C, D) are ParallelDo operations.

Consumer-producer fusion: C, D → C+D

Sibling fusion: C, D → C+D

Combiner lifting: A, GBK, +, B → A+, GBK, +, B
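Combiner lifting means doing part of the aggregation before the GroupByKey, so fewer records cross the shuffle. A minimal sketch in plain Python follows. The shards, the data, and the function names are made up for illustration, and the "shuffle" is just a counter.

```python
from collections import defaultdict


def shuffle_raw(shards):
    """No lifting: every (key, value) pair crosses the shuffle."""
    grouped = defaultdict(list)
    shuffled = 0
    for shard in shards:
        for key, value in shard:
            grouped[key].append(value)
            shuffled += 1
    return {k: sum(v) for k, v in grouped.items()}, shuffled


def shuffle_lifted(shards):
    """Lifting: sum per key inside each shard, then shuffle only partial sums."""
    totals = defaultdict(int)
    shuffled = 0
    for shard in shards:
        partial = defaultdict(int)
        for key, value in shard:
            partial[key] += value
        for key, subtotal in partial.items():
            totals[key] += subtotal
            shuffled += 1
    return dict(totals), shuffled


# Hypothetical input, split across two workers.
shards = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("a", 5), ("b", 6)],
]
raw_sums, raw_records = shuffle_raw(shards)
lifted_sums, lifted_records = shuffle_lifted(shards)
print(raw_sums, raw_records, lifted_records)
```

This only works because the combine function (here, addition) is associative and commutative, so partial results can be merged in any order.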



Deep dive into the programming model

The Apache Beam Logical Model

What are you computing?

Where in event time?

When in processing time?

How do refinements relate?

What are you computing?

● A Pipeline represents a graph

● Nodes are data processing transformations

● Edges are data sets flowing through the pipeline

● Optimized and executed as a unit for efficiency
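The graph idea can be shown with a toy stand-in. The `Pipeline` class below is hypothetical and is not the Beam SDK's class; it records transforms as nodes and named data sets as edges, and runs nothing until `run()` is called. It makes no attempt at the optimization step.

```python
class Pipeline:
    """Toy pipeline: nodes are transforms, edges are named data sets."""

    def __init__(self):
        self.nodes = []  # (name, fn, input_name), kept in insertion order

    def apply(self, name, fn, input_name=None):
        """Register a transform; returns the name of its output data set."""
        self.nodes.append((name, fn, input_name))
        return name

    def run(self):
        """Execute the whole graph as a unit, in insertion order."""
        datasets = {}
        for name, fn, input_name in self.nodes:
            upstream = datasets[input_name] if input_name else None
            datasets[name] = fn(upstream)
        return datasets


p = Pipeline()
source = p.apply("read", lambda _: [1, 2, 3, 4])
squares = p.apply("square", lambda xs: [x * x for x in xs], source)
p.apply("evens", lambda xs: [x for x in xs if x % 2 == 0], squares)
p.apply("sum", lambda xs: sum(xs), squares)  # second consumer of "square"
results = p.run()
print(results["evens"], results["sum"])
```

Because the graph is built before anything runs, a real runner gets the chance to rewrite it first, which is what "optimized and executed as a unit" refers to.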

What are you computing? PCollections

● A PCollection is a collection of homogeneous data of the same type

● May be bounded or unbounded in size

● Each element has an implicit timestamp

● Initially created from backing data stores

Challenge: completeness when processing continuous data

[Figure: time axis from 8:00 to 14:00, with several elements labeled 8:00 placed along it]

What are you computing? PTransforms

transform PCollections into other PCollections.

What Where When How

Element-Wise (Map + Reduce = ParDo)

Aggregating (Combine, Join, Group)

Composite (e.g. Count = Pair With Ones, GroupByKey, Sum Values)

Composite PTransforms (Apache Beam SDK)

❯ Define new PTransforms by building up subgraphs of existing transforms

❯ Some utilities are included in the SDK

  • Count, RemoveDuplicates, Join, Min, Max, Sum, ...

❯ You can define your own:

  • DoSomething, DoSomethingElse, etc.

❯ Why bother?

  • Code reuse

  • Better monitoring experience
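As an illustration of building a composite out of smaller transforms, here is Count assembled from Pair With Ones, GroupByKey and Sum Values. This is a plain-Python sketch of the idea, not SDK code; the `compose` helper is hypothetical.

```python
from collections import defaultdict


def pair_with_ones(elements):
    """Element-wise step: x -> (x, 1)."""
    return [(e, 1) for e in elements]


def group_by_key(pairs):
    """Aggregating step: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def sum_values(groups):
    """Aggregating step: reduce each key's values to their sum."""
    return {key: sum(values) for key, values in groups.items()}


def compose(*transforms):
    """Build a composite transform that applies the given steps in order."""
    def composite(collection):
        for transform in transforms:
            collection = transform(collection)
        return collection
    return composite


count = compose(pair_with_ones, group_by_key, sum_values)
counts = count(["cat", "dog", "cat", "bird", "cat"])
print(counts)
```

Callers only see `count`, which is the code-reuse benefit; a monitoring UI can likewise show one named box instead of three.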

Example: Computing Integer Sums


[Figure: three windowing strategies, each shown for Key 1, Key 2 and Key 3: Fixed, Sliding, Sessions]

Where in Event Time?

● Windowing divides data into event-time-based finite chunks.

● Required when doing aggregations over unbounded data.
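The three window types named on this slide (fixed, sliding, sessions) can be sketched as functions from event timestamps to windows. This is a simplified illustration with integer timestamps in seconds and half-open windows `(start, end)`; the function names are hypothetical and the SDK's own windowing classes are not used.

```python
def fixed_window(ts, size):
    """The single fixed window of length `size` that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, period):
    """All windows of length `size`, starting every `period`, that contain ts."""
    last_start = ts - ts % period
    # Keep every start s with s + size > ts, walking backwards by `period`.
    return [(s, s + size) for s in range(last_start, ts - size, -period)]


def session_windows(timestamps, gap):
    """Per-key sessions: a new one starts once `gap` passes with no elements."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Overlaps the current session, so extend it.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            sessions.append((ts, ts + gap))
    return sessions


fixed = fixed_window(130, size=120)
sliding = sliding_windows(130, size=120, period=60)
sessions = session_windows([10, 20, 100, 110, 300], gap=30)
print(fixed, sliding, sessions)
```

Fixed and sliding windows depend only on the timestamp, while sessions depend on the data itself, which is why they are computed per key.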


Example: Fixed 2-minute Windows
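A rough plain-Python picture of this example: integer sums per key, grouped into fixed 2-minute event-time windows. The input elements are invented for illustration, and the function name is hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # fixed 2-minute windows


def sum_per_window(elements, size=WINDOW_SECONDS):
    """Sum values per (key, window start), using each element's event time."""
    sums = defaultdict(int)
    for key, value, event_time in elements:
        window_start = event_time - event_time % size
        sums[(key, window_start)] += value
    return dict(sums)


# Hypothetical (key, value, event time in seconds) elements.
elements = [
    ("k1", 5, 10),
    ("k1", 9, 70),
    ("k1", 7, 130),
    ("k2", 3, 200),
    ("k2", 8, 230),
]
window_sums = sum_per_window(elements)
print(window_sums)
```

The grouping uses event time only, so the result is the same regardless of the order in which elements arrive.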


When in Processing Time?

● Triggers control when results are emitted.

● Triggers are often relative to the watermark.

[Figure: processing time plotted against event time, showing the watermark and its skew]


Example: Triggering at the Watermark
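A simplified simulation of triggering at the watermark: a window's sum is emitted once the watermark passes the end of the window, and anything arriving for that window afterwards counts as late. The stream format, the explicit watermark records, and the function name are assumptions made for this sketch. A real runner estimates the watermark itself, and this version simply sets late data aside.

```python
WINDOW = 120  # seconds, fixed windows


def run_with_watermark_trigger(stream):
    """stream holds ("element", event_time, value) or ("watermark", time) records."""
    sums = {}       # window start -> running sum of values seen so far
    watermark = 0
    emitted, late = [], []
    for record in stream:
        if record[0] == "element":
            _, event_time, value = record
            start = event_time - event_time % WINDOW
            if start + WINDOW <= watermark:
                late.append((event_time, value))  # window already fired
            else:
                sums[start] = sums.get(start, 0) + value
        else:
            watermark = record[1]
            # Fire every window whose end is now behind the watermark.
            for start in sorted(sums):
                if start + WINDOW <= watermark:
                    emitted.append((start, sums.pop(start)))
    return emitted, late


# Hypothetical stream, listed in processing-time (arrival) order.
stream = [
    ("element", 10, 5),
    ("element", 130, 7),
    ("element", 70, 9),
    ("watermark", 120),
    ("element", 50, 2),   # arrives after its window has fired
    ("element", 200, 3),
    ("watermark", 240),
]
emitted, late = run_with_watermark_trigger(stream)
print(emitted, late)
```

The late element is the kind of data that speculative and late triggers are meant to handle.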


Example: Triggering for Speculative & Late Data


How do Refinements Relate?

● How should multiple outputs per window accumulate?

● Appropriate choice depends on consumer.

Firing           Elements   Discarding   Accumulating   Acc. & Retracting
Speculative      3          3            3              3
Watermark        5, 1       6            9              9, -3
Late             2          2            11             11, -9
Total Observed   11         11           23             11
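The three accumulation modes can be reproduced with a few lines of plain Python, using the firings and elements listed above (3, then 5 and 1, then 2). The function names are hypothetical; this is an illustration of the arithmetic, not SDK code.

```python
# Firings and their newly arrived elements, as listed on the slide.
FIRINGS = [("Speculative", [3]), ("Watermark", [5, 1]), ("Late", [2])]


def discarding(firings):
    """Each pane holds only the sum of elements since the previous firing."""
    return [[sum(elems)] for _, elems in firings]


def accumulating(firings):
    """Each pane holds the running sum of everything seen so far."""
    panes, total = [], 0
    for _, elems in firings:
        total += sum(elems)
        panes.append([total])
    return panes


def accumulating_and_retracting(firings):
    """Each pane holds the running sum plus a retraction of the previous pane."""
    panes, total, previous = [], 0, None
    for _, elems in firings:
        total += sum(elems)
        panes.append([total] if previous is None else [total, -previous])
        previous = total
    return panes


def total_observed(panes):
    """What a consumer gets if it naively adds up every value it receives."""
    return sum(sum(pane) for pane in panes)


for mode in (discarding, accumulating, accumulating_and_retracting):
    panes = mode(FIRINGS)
    print(mode.__name__, panes, total_observed(panes))
```

A consumer that sums all panes gets the right total (11) under discarding or retracting, but over-counts (23) under plain accumulating, which is why the appropriate choice depends on the consumer.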


Example: Add Newest, Remove Previous

Customizing What Where When How

1. Classic Batch

2. Batch with Fixed Windows

3. Streaming

4. Streaming with Speculative + Late Data

5. Streaming with Retractions

The key takeaway

Optimizing Your Time To Answer

Typical Data Processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling Growing Scale, Utilization improvements

Data Processing with Cloud Dataflow: Programming, More time to dig into your data

How much more time?

You save not just on processing, but on code complexity and size as well!

Source: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison





What do customers have to say about Google Cloud Dataflow

"We are utilizing Cloud Dataflow to overcome elasticity challenges with our current Hadoop cluster. Starting with some basic ETL workflow for BigQuery ingestion, we transitioned into full blown clickstream processing and analysis. This has helped us significantly improve performance of our overall system and reduce cost."

Sudhir Hasbe, Director of Software Engineering, Zulily.com

“The current iteration of Qubit’s real-time data supply chain was heavily inspired by the ground-breaking stream processing concepts described in Google’s MillWheel paper. Today we are happy to come full circle and build streaming pipelines on top of Cloud Dataflow - which has delivered on the promise of a highly-available and fault-tolerant data processing system with an incredibly powerful and expressive API.”

Jibran Saithi, Lead Architect, Qubit

"We are very excited about the productivity benefits offered by Cloud Dataflow and Cloud Pub/Sub. It took half a day to rewrite something that had previously taken over six months to build using Spark"

Paul Clarke, Director of Technology, Ocado

“Boosting performance isn’t the only thing we want to get from the new system. Our bet is that by using cloud-managed products we will have a much lower operational overhead. That in turn means we will have much more time to make Spotify’s products better.”

Igor Maravić, Software Engineer working at Spotify

http://engineering.zulily.com/


Demo Time!

Let’s build something - Demo!

1. Extract words from a Shakespeare corpus, count the occurrences of each word, write sharded results as blobs into a key value store (Cloud Storage)

2. Ingest stream from Wikipedia edits https://wikitech.wikimedia.org/wiki/Stream.wikimedia.org

Create a pipeline and run a Dataflow job to extract the top 10 active editors and top 10 pages edited

Inspect the result set in our data warehouse (BigQuery)
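For readers following along without the live demo, the word-count steps (extract words, count occurrences, write sharded results) can be sketched in plain Python. This is not the demo's Dataflow code: the shards are in-memory dictionaries rather than Cloud Storage blobs, the input is a single line, and all function names are hypothetical.

```python
import re
import zlib
from collections import Counter


def extract_words(lines):
    """Split each line into lower-cased words."""
    for line in lines:
        yield from re.findall(r"[a-z']+", line.lower())


def shard_for(word, num_shards):
    """Deterministic shard index, stable across runs (unlike built-in hash)."""
    return zlib.crc32(word.encode("utf-8")) % num_shards


def word_count(lines, num_shards=2):
    """Count words, then spread the counts across `num_shards` outputs."""
    counts = Counter(extract_words(lines))
    shards = [dict() for _ in range(num_shards)]
    for word, n in counts.items():
        shards[shard_for(word, num_shards)][word] = n
    return shards


corpus = ["To be, or not to be: that is the question"]
shards = word_count(corpus)
merged = {word: n for shard in shards for word, n in shard.items()}
print(merged)
```

Each word lands in exactly one shard, so the shards can be written and read independently.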



Thank You!

cloud.google.com/dataflow

cloud.google.com/blog/big-data/

cloud.google.com/solutions/articles#bigdata

cloud.google.com/newsletter

research.google.com
