with apache beam - talend · with apache beam william vambenepe, google @vambenepe. speakers info...
TRANSCRIPT
SPEAKERS INFO
WILLIAM VAMBENEPE
Group Product Manager
Data Processing and Analytics
Google Cloud Platform
@vambenepe
Open source (top-level Apache project)
Portable
Unifies batch and stream
Cloud-native
Built on 15 years of large scale data processing at Google
You don’t need to be a developer to benefit from Beam
APACHE BEAM: THE KEY TO MODERN DATA PROCESSING
MapReduce, Apache Beam
Cloud Dataflow
BigTable, Dremel, Colossus
Flume, Megastore, Spanner
PubSub
Millwheel
THE EVOLUTION OF DATA PIPELINES
Progressive evolution from batch to stream
- Stream as the new default
Cost/perf trade-offs without re-architecting
- Just turn the knob
ML: data preparation consistency between training & scoring
- Same pipeline to train in batch and score in stream
BENEFIT OF BATCH / STREAM UNIFICATION
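The point about training in batch and scoring in stream with the same pipeline can be sketched in plain Python. This is a minimal illustration, not the Beam API: `prepare`, `run`, `history`, and `live_events` are hypothetical names, and the feature logic is made up for the example. The idea it shows is that one preparation function serves both a bounded input and an unbounded one, so the two cannot drift apart.

```python
from typing import Dict, Iterable, Iterator, List


def prepare(record: Dict[str, float]) -> List[float]:
    """Hypothetical feature preparation shared by training and scoring."""
    # Scale the amount by 100 and clamp the result to the range [0, 1].
    amount = min(max(record["amount"] / 100.0, 0.0), 1.0)
    return [amount, float(record["clicks"])]


def run(records: Iterable[Dict[str, float]]) -> Iterator[List[float]]:
    """Apply the same preparation to any iterable, bounded or unbounded."""
    for record in records:
        yield prepare(record)


# Batch: a bounded collection of historical records, used for training.
history = [{"amount": 50.0, "clicks": 3}, {"amount": 250.0, "clicks": 1}]
training_features = list(run(history))


# Stream: records arrive one at a time (assumed source), used for scoring.
def live_events() -> Iterator[Dict[str, float]]:
    yield {"amount": 50.0, "clicks": 3}


scoring_features = next(run(live_events()))
```

Because both paths call `prepare`, a record seen at scoring time gets exactly the features it would have had at training time.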
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
THE BEAM MODEL: ASKING THE RIGHT QUESTIONS
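The four questions can be made concrete with a toy sketch in plain Python. It does not use the Beam SDK, and every name in it (`WINDOW_SIZE`, `window_start`, `process`) is an assumption for illustration. Here "what" is a sum, "where" is a fixed 60-second event-time window, "when" is after every arriving element, and "how" is either accumulating (each result includes earlier ones) or discarding (each result stands alone).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_SIZE = 60  # Where: fixed event-time windows of 60 seconds (assumed size)


def window_start(event_time: int) -> int:
    """Return the start of the fixed window that contains event_time."""
    return event_time - event_time % WINDOW_SIZE


def process(events: List[Tuple[int, int]], accumulating: bool) -> List[Tuple[int, int]]:
    """Emit one (window_start, result) pane per arriving element.

    events are (event_time, value) pairs listed in arrival order.
    """
    totals: Dict[int, int] = defaultdict(int)
    panes: List[Tuple[int, int]] = []
    for event_time, value in events:
        start = window_start(event_time)
        if accumulating:
            # How: each pane refines the previous one for the same window.
            totals[start] += value  # What: a running sum per window
            panes.append((start, totals[start]))
        else:
            # How: each pane carries only the new element's contribution.
            panes.append((start, value))
        # When: a pane is emitted immediately after every element.
    return panes


# The third element belongs to the first window but arrives out of order.
events = [(5, 3), (70, 4), (20, 2)]
accumulated = process(events, accumulating=True)
discarded = process(events, accumulating=False)
```

With accumulation, the late element updates the first window's sum from 3 to 5; with discarding, the consumer receives 3 and then 2 and must combine them itself. A real runner would typically decide "when" from watermarks and triggers rather than firing on every element.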
The Beam Model: the abstractions at the core of Apache Beam
Choice of API: Users write their pipelines in a language that’s familiar and integrated with their other tooling
Choice of Runtime: Users choose the right runner for their current needs -- on-prem / cloud, open source / not, fully managed / not
Scalability for Developers: Clean APIs allow developers to contribute modules independently
[Diagram: Language A, Language B, and Language C SDKs sit on top of the Beam Model, which in turn sits on top of Runner 1, Runner 2, and Runner 3]
BEAM VISION: MIX AND MATCH SDKS AND RUNTIMES
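The separation between a pipeline description and the runner that executes it can be sketched as follows. This is a toy illustration in plain Python, not how Beam is implemented: `pipeline`, `eager_runner`, and `lazy_runner` are hypothetical names. It only shows the idea that one description can be handed to different execution strategies and yield the same results.

```python
from typing import Callable, Iterable, Iterator, List

Transform = Callable[[int], int]

# The pipeline is a runner-independent description: an ordered list of steps.
pipeline: List[Transform] = [lambda x: x + 1, lambda x: x * 2]


def eager_runner(steps: List[Transform], data: Iterable[int]) -> List[int]:
    """Hypothetical runner 1: materializes the whole result after each step."""
    result = list(data)
    for step in steps:
        result = [step(x) for x in result]
    return result


def lazy_runner(steps: List[Transform], data: Iterable[int]) -> Iterator[int]:
    """Hypothetical runner 2: pushes one element at a time through all steps."""
    for x in data:
        for step in steps:
            x = step(x)
        yield x


eager_output = eager_runner(pipeline, [1, 2, 3])
lazy_output = list(lazy_runner(pipeline, [1, 2, 3]))
```

Switching runners changes how the work is scheduled, not what the pipeline computes, which is the property the mix-and-match vision relies on.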
APACHE SPARK
Open-source cluster-computing framework
Large ecosystem of APIs and tools
Runs on premise or in the cloud
APACHE FLINK
Open-source distributed data processing engine
High-throughput and low-latency stream processing
Runs on premise or in the cloud
EXAMPLE BEAM RUNNERS
GOOGLE CLOUD DATAFLOW
Fully managed service for batch and stream data processing
Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
[Diagram: data processing architecture on Google Cloud]
Stage labels: Capture, Store, Process (Stream, Batch), Analyze, Use
Components shown: GA 360, Adwords, DoubleClick, YouTube, Firebase, Stackdriver, Storage Transfer Service, Cloud Pub/Sub, Cloud Dataflow, Cloud Dataproc, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), BigQuery Analytics (SQL), Cloud Datalab, ML Engine, real-time analytics, real-time dashboard, real-time alerts, etc.
BEAM ON GOOGLE CLOUD: SERVERLESS DATA PROCESSING
Streaming 101 and 102: The World Beyond Batch
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
BEAM
MORE INFO
Apache Beam: https://beam.apache.org
Google Cloud Platform: https://cloud.google.com
The Dataflow Model paper from VLDB 2015: http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf