The Next Generation of Data Processing & Open Source
James Malone, Google Product Manager, Apache Beam PPMC
Eric Schmidt, Google Developer Relations
Agenda
1. The Last Generation - Common historical challenges in large-scale data processing
2. The Next Generation - How large-scale data processing should work
3. Apache Beam - A solution for next generation data processing
4. Why Beam matters - A gaming example to show the power of the Beam model
5. Demo - Let's run a Beam pipeline on 3 engines in 2 separate clouds
6. Things to Remember - Recap and how you can get involved
01 The Last Generation
Common historical challenges in large-scale data processing
Setting up infrastructure
Steps: Decide on tool, Read docs, Get infrastructure, Setup tools, Tune tools, Productionize, Get specialists
Mood: Optimistic, then Frustrated
Programming models
Batch side: Batch model, Batch use case, Batch engine, Batch output
Streaming side: Streaming model, Streaming use case, Streaming engine, Streaming output
Join output
Mood: Optimistic, then Frustrated
Data pipeline portability
Data model, Data pipeline, Execution engine 1 (this stack appears three times on the slide)
Mood labels: Frustrated, Happy
Infrastructure is a pain
Models are disconnected
Pipelines are not portable
02 The Next Generation
How data processing should work
Infrastructure is an afterthought (no longer a pain)
Models are unified (no longer disconnected)
Pipelines are portable (no longer not portable)
Setting up infrastructure
Steps: Skim docs, Decide on product, Start service
Mood: Optimistic, then Happy
A flexible (unified) model
Batch use case and Streaming use case, Unified model, Runner(s), Output
Mood: Optimistic, then Happy
Portable data pipelines
Data model, Data pipeline, three Execution engines
Mood: Happy, then Happier
Why does this matter?
More time can be dedicated to examining data for actionable insights
Less time is spent wrangling code, infrastructure, and tools used to process data
Hands-on with data
Cloud setup and customization
03 Apache Beam (incubating)
A solution for next generation data processing
What is Apache Beam?
1. The (unified stream + batch) Beam programming model (formerly the Dataflow model)
2. Java and Python SDKs
3. Runners for Existing Distributed Processing Backends
a. Apache Flink (thanks to dataArtisans)
b. Apache Spark (thanks to Cloudera & PayPal)
c. Google Cloud Dataflow (fast, no-ops)
d. Local (in-process) runner for testing
+ Future runners for Beam - Apache Gearpump, Apache Apex, MapReduce, others!
The Apache Beam vision
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines
SDKs: Beam Java, Beam Python, Other Languages
Beam Model: Pipeline Construction
Beam Model: Fn Runners
Runners: Apache Flink, Apache Spark, Google Cloud Dataflow
Each runner has its own Execution layer
Joining several threads into Beam
MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel
Cloud Dataflow, Cloud Dataproc
Apache Beam
Creating an Apache Beam community
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors
Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is part of the larger OSS ecosystem
Learn - We (Google) are also learning a lot as this is our first data-related Apache contribution ;-)
Apache Beam Roadmap
02/01/2016: Enter Apache Incubator
Early 2016: Design for use cases, begin refactoring
02/25/2016: 1st commit to ASF repository
Mid 2016: Additional refactoring, non-production uses
June 2016: Python SDK moves to Beam
06/14/2016: 1st incubating release
Late 2016: Multiple runners execute Beam pipelines
End 2016: Beam pipelines run on many runners in production uses
04 Why Beam Matters
An example to show the power of the Beam model
Apache Beam - A next generation model
Improved abstractions let you focus on your business logic
Batch and stream processing are both first-class citizens -- no need to choose.
Clearly separates event time from processing time.
Processing time vs. event time
Beam model - asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
The Beam model - what is being computed?
PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());
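To make the "what" question concrete, here is a minimal plain-Java sketch of the result a per-key integer sum produces. It is not Beam code: the class `PerKeySum`, the `Score` type, and the sample data are hypothetical names made up for illustration, and it runs over an in-memory list rather than a PCollection.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the result of a per-key sum; not the Beam API.
class PerKeySum {
    // One (team, points) element, playing the role of KV<String, Integer>.
    static final class Score {
        final String team;
        final int points;

        Score(String team, int points) {
            this.team = team;
            this.points = points;
        }
    }

    // Adds up the points for each team; TreeMap keeps the keys sorted.
    static Map<String, Integer> sumPerKey(List<Score> input) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Score s : input) {
            totals.merge(s.team, s.points, Integer::sum);
        }
        return totals;
    }

    // Made-up sample data for the example.
    static List<Score> sample() {
        return Arrays.asList(
            new Score("teamA", 5),
            new Score("teamB", 7),
            new Score("teamA", 3));
    }

    public static void main(String[] args) {
        System.out.println(sumPerKey(sample())); // prints {teamA=8, teamB=7}
    }
}
```

The sketch answers only "what": it ignores time entirely, which is why it behaves like classic batch.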
The Beam model - where in event time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
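The "where in event time" step can be sketched in plain Java as follows, assuming two-minute fixed windows aligned to timestamp zero. This is an illustration, not Beam code: `FixedWindowSum`, `Event`, and the sample timestamps are hypothetical, and timestamps are plain milliseconds.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of fixed event-time windowing; not the Beam API.
class FixedWindowSum {
    // One element with the time the event happened (event time), in milliseconds.
    static final class Event {
        final String key;
        final int value;
        final long eventTimeMillis;

        Event(String key, int value, long eventTimeMillis) {
            this.key = key;
            this.value = value;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    // Groups events by the start of the fixed window their event time falls in,
    // then sums values per key inside each window. Arrival order does not matter.
    static Map<Long, Map<String, Integer>> sumPerWindow(List<Event> events, long windowMillis) {
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (Event e : events) {
            long start = e.eventTimeMillis - Math.floorMod(e.eventTimeMillis, windowMillis);
            result.computeIfAbsent(start, k -> new TreeMap<>())
                  .merge(e.key, e.value, Integer::sum);
        }
        return result;
    }

    // Made-up sample data; the third event arrives out of event-time order.
    static List<Event> sample() {
        return Arrays.asList(
            new Event("teamA", 5, 30_000L),
            new Event("teamA", 4, 130_000L),
            new Event("teamA", 3, 105_000L),
            new Event("teamB", 7, 239_000L));
    }

    public static void main(String[] args) {
        // prints {0={teamA=8}, 120000={teamA=4, teamB=7}}
        System.out.println(sumPerWindow(sample(), 120_000L));
    }
}
```

Because the window is chosen from the event's own timestamp, the out-of-order element at 105,000 ms still lands in the first window.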
The Beam model - when in processing time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
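A rough way to picture a watermark trigger is the plain-Java sketch below. It assumes a single key, two-minute windows, and a watermark supplied by the caller. `WatermarkTrigger` and its methods are hypothetical names, not the Beam API, and firing at "watermark reaches the window end" is a simplification of what a real runner does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "fire when the watermark passes the window"; not the Beam API.
class WatermarkTrigger {
    static final long WINDOW_MILLIS = 120_000L;

    // Window start -> running sum, for windows that have not fired yet.
    private final TreeMap<Long, Integer> open = new TreeMap<>();
    // Results in the order they were materialized, formatted "windowStart=sum".
    final List<String> emitted = new ArrayList<>();

    // Called in processing-time order; elements may arrive out of event-time order.
    void onElement(long eventTimeMillis, int value) {
        long start = eventTimeMillis - Math.floorMod(eventTimeMillis, WINDOW_MILLIS);
        open.merge(start, value, Integer::sum);
    }

    // Fires every open window whose end is at or before the watermark.
    // An element arriving after its window fired would start a fresh sum here;
    // real late-data handling is out of scope for this sketch.
    void onWatermark(long watermarkMillis) {
        while (!open.isEmpty() && open.firstKey() + WINDOW_MILLIS <= watermarkMillis) {
            Map.Entry<Long, Integer> fired = open.pollFirstEntry();
            emitted.add(fired.getKey() + "=" + fired.getValue());
        }
    }

    // Made-up sequence of arrivals and watermark advances.
    static List<String> demo() {
        WatermarkTrigger t = new WatermarkTrigger();
        t.onElement(30_000L, 5);
        t.onElement(130_000L, 4);
        t.onElement(105_000L, 3);  // out of order, but ahead of the watermark
        t.onWatermark(120_000L);   // first window fires
        t.onElement(239_000L, 7);
        t.onWatermark(240_000L);   // second window fires
        return t.emitted;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [0=8, 120000=11]
    }
}
```

Nothing is materialized until the watermark says the window's input is believed complete, which is the "when in processing time" decision.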
The Beam model - how do refinements relate?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
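The difference between accumulating and discarding fired panes can be sketched in plain Java. The example assumes one window that fires three times (for instance early, on time, and late), with each inner list holding the values that arrived since the previous firing. `PaneModes` and its sample data are hypothetical, not the Beam API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of how refinements relate across firings; not the Beam API.
class PaneModes {
    // Returns the value emitted at each firing of a single window.
    // accumulating = true:  each firing reports the running total so far.
    // accumulating = false: each firing reports only what arrived since the last one.
    static List<Integer> firePanes(List<List<Integer>> panes, boolean accumulating) {
        List<Integer> emitted = new ArrayList<>();
        int running = 0;
        for (List<Integer> pane : panes) {
            int paneSum = 0;
            for (int v : pane) {
                paneSum += v;
            }
            running += paneSum;
            emitted.add(accumulating ? running : paneSum);
        }
        return emitted;
    }

    // Made-up sample: three firings of the same window.
    static List<List<Integer>> samplePanes() {
        return Arrays.asList(Arrays.asList(5, 3), Arrays.asList(4), Arrays.asList(7));
    }

    public static void main(String[] args) {
        System.out.println(firePanes(samplePanes(), true));  // prints [8, 12, 19]
        System.out.println(firePanes(samplePanes(), false)); // prints [8, 4, 7]
    }
}
```

With accumulation, the latest pane alone is the current answer; with discarding, a downstream consumer has to add the panes up itself.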
Customizing what where when how
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
Apache Beam - the ecosystem
http://beam.incubator.apache.org/capability-matrix
05 Demo
Let's run a Beam pipeline on 3 engines in 2 separate locations
What we just did
Created 1 Beam pipeline
Ran that one pipeline on three execution engines in two places
● Google Cloud Platform
  ○ Google Cloud Dataflow
  ○ Apache Spark on Google Cloud Dataproc
● Local
  ○ Apache Beam local runner
  ○ Apache Flink
100% portability, 0 problems
06 Things to remember
Recap and how you can get involved
Apache Beam is designed to provide portable pipelines with a unified programming model
Get involved with Apache Beam
Apache Beam (incubating) http://beam.incubator.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Join the Beam mailing lists! [email protected]@beam.incubator.apache.org
Join the Apache Beam Slack channel
https://apachebeam.slack.com
Follow @ApacheBeam on Twitter
A special thank you
A special thank you to Frances Perry and Tyler Akidau for sharing Apache Beam content which was used in this presentation.
Thank you