![Page 1: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/1.jpg)
Introduction to Apache Beam
JB Onofré - Talend
![Page 2: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/2.jpg)
Who am I?
● Talend
  ○ Software Architect
  ○ Apache team
● Apache
  ○ Member of the Apache Software Foundation
  ○ Champion/Mentor/PPMC/PMC/Committer for ~20 projects (Beam, Falcon, Lens, Brooklyn, Slider, Karaf, Camel, ActiveMQ, ACE, Archiva, Aries, ServiceMix, Syncope, jClouds, Unomi, Guacamole, BatchEE, Sirona, Incubator, …)
![Page 3: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/3.jpg)
What is Apache Beam?
1. Agnostic (unified batch + stream) Beam programming model
2. Dataflow Java SDK (soon Python, DSLs)
3. Runners for Dataflow
   a. Apache Flink (thanks to data Artisans)
   b. Apache Spark (thanks to Cloudera)
   c. Google Cloud Dataflow (fast, no-ops)
   d. Local (in-process) runner for testing
   e. OSGi/Karaf
![Page 4: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/4.jpg)
Why Apache Beam?
1. Portable - The same code runs on different runners (abstraction) and backends: on premises, in the cloud, or locally
2. Unified - Same unified model for batch and stream processing
3. Advanced features - Event windowing, triggering, watermarking, lateness, etc.
4. Extensible model and SDK - Extensible API; can define custom sources to read and write in parallel
![Page 5: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/5.jpg)
Beam Programming Model
Data processing pipeline (executed via a Beam runner)
Input → PTransform/IO → PTransform → PTransform → Output
![Page 6: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/6.jpg)
Beam Programming Model
1. Pipelines - data processing job as a directed graph of steps
2. PCollection - the data inside a pipeline
3. Transform - a step in the pipeline (takes PCollections as input and produces PCollections)
   a. Core transforms - common transformations provided by the SDK (ParDo, GroupByKey, …)
   b. Composite transforms - combine multiple transforms
   c. IO transforms - endpoints of a pipeline that create PCollections (consumer/root) or use PCollections to “write” data outside of the pipeline (producer)
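To make the idea of chained and composite transforms concrete, here is a minimal plain-Java analogy. It does not use the Beam SDK: immutable lists stand in for PCollections, `java.util.function.Function` stands in for a transform, and the names `TransformChainSketch`, `mapEach` and `trimmedLengths` are invented for this illustration only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Plain-Java analogy for chaining transforms; none of these names are Beam API. */
class TransformChainSketch {

    /** Element-wise step: builds a new read-only list instead of mutating its input. */
    static <A, B> Function<List<A>, List<B>> mapEach(Function<A, B> fn) {
        return input -> Collections.unmodifiableList(
            input.stream().map(fn).collect(Collectors.toList()));
    }

    /** Composite step: two simpler steps combined into one reusable unit. */
    static Function<List<String>, List<Integer>> trimmedLengths() {
        Function<List<String>, List<String>> trim = mapEach(String::trim);
        Function<List<String>, List<Integer>> length = mapEach(String::length);
        return trim.andThen(length);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(" a ", "beam ");
        // Apply the composite step to the input "collection".
        System.out.println(trimmedLengths().apply(input));
    }
}
```

The analogy stops at the shape of the graph: a real runner decides where and how each step executes, which a plain function call does not capture.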
![Page 7: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/7.jpg)
Beam Programming Model - PCollection
1. A PCollection is immutable, does not support random access to its elements, and belongs to a pipeline
2. Each element in a PCollection has a timestamp (set by the IO Source)
3. Coders support different data types
4. A PCollection is bounded (batch) or unbounded (streaming), depending on the IO Source
5. Unbounded PCollections are grouped with windowing (thanks to the timestamp)
   a. Fixed time window
   b. Sliding time window
   c. Session window
   d. Global window (for bounded PCollections)
   e. Triggers deal with time skew and data lag (late data): time-based with watermark, data-based with counting, or composite
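The fixed and sliding windows listed above come down to simple arithmetic on the element timestamp. The following plain-Java sketch shows one way to compute it, assuming millisecond timestamps and windows aligned to the epoch; the class and method names are hypothetical and are not part of the Beam SDK.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of event-time window assignment; not Beam API. */
class WindowAssignmentSketch {

    /** Start (inclusive) of the fixed window of length sizeMillis that contains the timestamp. */
    static long fixedWindowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /**
     * Starts of all sliding windows (length sizeMillis, a new one every periodMillis)
     * that contain the timestamp, latest window first.
     */
    static List<Long> slidingWindowStarts(long timestampMillis, long sizeMillis, long periodMillis) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMillis - Math.floorMod(timestampMillis, periodMillis);
        // Walk backwards while the window [start, start + size) still covers the timestamp.
        for (long start = lastStart; start > timestampMillis - sizeMillis; start -= periodMillis) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long timestamp = 5 * minute + 1_000L;
        System.out.println(fixedWindowStart(timestamp, 2 * minute));
        System.out.println(slidingWindowStarts(timestamp, 2 * minute, minute));
    }
}
```

An element belongs to exactly one fixed window but to several sliding windows whenever the window length exceeds the period, which is why the second method returns a list.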
![Page 8: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/8.jpg)
Beam Programming Model - IO
1. IO Sources (read data as PCollections) and Sinks (write PCollections)
2. Support Bounded and/or Unbounded PCollections
3. Provided IO - File, BigQuery, BigTable, Avro, and more coming (Kafka, JMS, …)
4. Custom IO - extensible IO API to create custom sources & sinks
5. An IO should deal with timestamps, watermarks, deduplication, and parallelism (depending on the needs)
![Page 9: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/9.jpg)
Apache Beam SDKs
1. API for Beam Programming Model (design pipelines, transforms, …)
2. Current SDKs
a. Java - First SDK and primary focus for refactoring and improvement
b. Python - Dataflow SDK preview for batch processing, will be migrated to Apache Beam once
the Java SDK has been stabilized (and APIs/interfaces redefined)
3. Coming (possible) SDKs/languages - Scala, Go, Ruby, etc.
4. DSLs - domain-specific languages on top of the SDKs (Java fluent DSL on top of Java SDK, …)
![Page 10: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/10.jpg)
Java SDK
public static void main(String[] args) {
  // Create a pipeline parameterized by command-line flags.
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
  p.apply(TextIO.Read.from("/path/to..."))   // Read input.
   .apply(new CountWords())                  // Do some processing.
   .apply(TextIO.Write.to("/path/to..."));   // Write output.
  // Run the pipeline.
  p.run();
}
![Page 11: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/11.jpg)
Beam Programming Model
// Trigger and window names here are slide shorthand, not literal SDK method names.
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(SessionWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
The Apache Beam Model (by way of the Dataflow model) includes many primitives and features which are powerful but hard to express in other models and languages.
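As an illustration of one of these primitives, the sketch below shows the idea behind session windows in plain Java: each event opens a window of one gap duration, and overlapping windows merge into a single session. This is an assumption-laden simplification (one key, all timestamps known up front, no triggers or late data), and `SessionWindowSketch` and `sessions` are hypothetical names, not Beam API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of session windowing for a single key; not Beam API. */
class SessionWindowSketch {

    /**
     * Groups timestamps into sessions. Each result is a {start, end} pair where
     * end is exclusive and equals the last timestamp in the session plus the gap.
     */
    static List<long[]> sessions(long[] timestampsMillis, long gapMillis) {
        long[] sorted = timestampsMillis.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long ts : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && ts < last[1]) {
                // The event falls inside the open session: extend its end.
                last[1] = ts + gapMillis;
            } else {
                // No overlap with the previous session: start a new one.
                result.add(new long[] {ts, ts + gapMillis});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        for (long[] s : sessions(new long[] {0L, minute, 5 * minute}, 2 * minute)) {
            System.out.println(Arrays.toString(s));
        }
    }
}
```

In a streaming setting the full list of timestamps is never available at once, which is where watermarks and the early and late firings from the snippet above come in.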
![Page 12: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/12.jpg)
Runners and Backends
● Runners “translate” the code to a target backend (the runner itself doesn’t provide the backend)
● Many runners are tied to other top-level Apache projects, such as Apache Flink and Apache Spark
● As a result, runners can execute on premises (for example, on your local Flink cluster) or in a public cloud (for example, using Google Cloud Dataproc or Amazon EMR)
● Apache Beam is focused on treating runners as a top-level use case (with APIs, support, etc.) so runners can be developed with minimal friction for maximum pipeline portability
![Page 13: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/13.jpg)
Beam Runners
● Google Cloud Dataflow
● Apache Flink*
● Apache Spark*
● Other runners* (local, OSGi, …)
[*] With varying levels of fidelity. The Apache Beam site (http://beam.incubator.apache.org) will have more details soon.
![Page 14: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/14.jpg)
Use Cases
Apache Beam is a great choice for both batch and stream processing and can handle bounded and unbounded datasets
Batch can focus on ETL/ELT, catch-up processing, daily aggregations, and so on
Stream can focus on handling real-time processing on a record-by-record basis
Real use cases
● Mobile gaming data processing, both batch and stream processing (https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/)
● Real-time event processing from IoT devices
![Page 15: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/15.jpg)
Use Case - Gaming
● A game stores its results in a CSV file:
  ○ player,team,score,timestamp
● Two pipelines:
  ○ UserScore (batch) sums the scores for each user
  ○ HourlyScore (batch) is similar to UserScore but uses a window (one hour): it sums the scores per team on fixed windows
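Before looking at pipeline code, it can help to see what the two jobs compute. The plain-Java sketch below reproduces both aggregations without any Beam dependency. It assumes timestamps are in milliseconds and that malformed lines are skipped; the class name, method names, and the `team@windowStart` key format are invented for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the two gaming aggregations; not Beam code. */
class GameScoreSketch {
    static final long HOUR_MILLIS = 3_600_000L;

    /** UserScore: total score per player over the whole input. */
    static Map<String, Integer> userScores(List<String> lines) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            try {
                String user = fields[0].trim();
                int score = Integer.parseInt(fields[2].trim());
                sums.merge(user, score, Integer::sum);
            } catch (ArrayIndexOutOfBoundsException | NumberFormatException e) {
                // Malformed line: skip it.
            }
        }
        return sums;
    }

    /** HourlyScore: total score per team within each one-hour fixed window. */
    static Map<String, Integer> hourlyTeamScores(List<String> lines) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            try {
                String team = fields[1].trim();
                int score = Integer.parseInt(fields[2].trim());
                long timestamp = Long.parseLong(fields[3].trim());
                // Align the timestamp to the start of its one-hour window.
                long windowStart = timestamp - Math.floorMod(timestamp, HOUR_MILLIS);
                sums.merge(team + "@" + windowStart, score, Integer::sum);
            } catch (ArrayIndexOutOfBoundsException | NumberFormatException e) {
                // Malformed line: skip it.
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "alice,red,10,1000", "bob,blue,5,2000", "alice,red,7,3600000", "garbage");
        System.out.println(userScores(lines));
        System.out.println(hourlyTeamScores(lines));
    }
}
```

The only difference between the two methods is the grouping key: the user alone, or the team combined with the window the event falls into.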
![Page 16: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/16.jpg)
Use Case - Gaming - UserScore - Pipeline
Pipeline pipeline = Pipeline.create(options);
// Read events from a text file and parse them.
pipeline.apply(TextIO.Read.from(options.getInput()))
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    // Extract and sum username/score pairs from the event data.
    .apply("ExtractUserScore", new ExtractAndSumScore("user"))
    .apply("WriteUserScoreSums",
        new WriteToBigQuery<KV<String, Integer>>(options.getTableName(),
            configureBigQueryWrite()));
// Run the batch pipeline.
pipeline.run();
![Page 17: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/17.jpg)
Use Case - Gaming - UserScore - Avro Coder
@DefaultCoder(AvroCoder.class)
static class GameActionInfo {
  @Nullable String user;
  @Nullable String team;
  @Nullable Integer score;
  @Nullable Long timestamp;
  public GameActionInfo(String user, String team, Integer score, Long timestamp) {
    …
  }
  …
}
![Page 18: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/18.jpg)
Use Case - Gaming - UserScore - Parse Event Fn
static class ParseEventFn extends DoFn<String, GameActionInfo> {
  // Log and count parse errors.
  private static final Logger LOG = LoggerFactory.getLogger(ParseEventFn.class);
  private final Aggregator<Long, Long> numParseErrors =
      createAggregator("ParseErrors", new Sum.SumLongFn());
  @Override
  public void processElement(ProcessContext c) {
    // Each input line has the form: player,team,score,timestamp
    String[] components = c.element().split(",");
    try {
      String user = components[0].trim();
      String team = components[1].trim();
      Integer score = Integer.parseInt(components[2].trim());
      Long timestamp = Long.parseLong(components[3].trim());
      GameActionInfo gInfo = new GameActionInfo(user, team, score, timestamp);
      c.output(gInfo);
    } catch (ArrayIndexOutOfBoundsException | NumberFormatException e) {
      // Count and log the malformed line, then drop it.
      numParseErrors.addValue(1L);
      LOG.info("Parse error on " + c.element() + ", " + e.getMessage());
    }
  }
}
![Page 19: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/19.jpg)
Use Case - Gaming - UserScore - Sum Score Tr
public static class ExtractAndSumScore
    extends PTransform<PCollection<GameActionInfo>, PCollection<KV<String, Integer>>> {
  // Name of the field to group by ("user" or "team").
  private final String field;
  ExtractAndSumScore(String field) {
    this.field = field;
  }
  @Override
  public PCollection<KV<String, Integer>> apply(
      PCollection<GameActionInfo> gameInfo) {
    return gameInfo
        // Map each event to a (key, score) pair.
        .apply(MapElements
            .via((GameActionInfo gInfo) -> KV.of(gInfo.getKey(field), gInfo.getScore()))
            .withOutputType(new TypeDescriptor<KV<String, Integer>>() {}))
        // Sum the scores for each key.
        .apply(Sum.<String>integersPerKey());
  }
}
![Page 20: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/20.jpg)
Use Case - Gaming - HourlyScore - Pipeline
pipeline.apply(TextIO.Read.from(options.getInput()))
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    // filter with byPredicate to ignore some data
    .apply("FilterStartTime", Filter.byPredicate((GameActionInfo gInfo)
        -> gInfo.getTimestamp() > startMinTimestamp.getMillis()))
    .apply("FilterEndTime", Filter.byPredicate((GameActionInfo gInfo)
        -> gInfo.getTimestamp() < stopMinTimestamp.getMillis()))
    // attach event timestamps, then use a fixed-time window
    .apply("AddEventTimestamps",
        WithTimestamps.of((GameActionInfo i) -> new Instant(i.getTimestamp())))
    .apply(Window.named("FixedWindowsTeam")
        .<GameActionInfo>into(FixedWindows.of(Duration.standardMinutes(60))))
    // extract and sum teamname/score pairs from the event data.
    .apply("ExtractTeamScore", new ExtractAndSumScore("team"))
    // write the result
    .apply("WriteTeamScoreSums",
        new WriteWindowedToBigQuery<KV<String, Integer>>(options.getTableName(),
            configureWindowedTableWrite()));
pipeline.run();
![Page 21: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/21.jpg)
Roadmap
● 02/01/2016 - Enter Apache Incubator
● 02/25/2016 - 1st commit to ASF repository
● Early 2016 - Design for use cases, begin refactoring
● Mid 2016 - Slight chaos
● Late 2016 - Multiple runners execute Beam pipelines
● End 2016 - Cloud Dataflow should run Beam pipelines
![Page 22: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/22.jpg)
More information and get involved!
1: Read about Apache Beam
Apache Beam website - http://beam.incubator.apache.org
2: See what the Apache Beam team is doing
Apache Beam JIRA - https://issues.apache.org/jira/browse/BEAM
Apache Beam mailing lists - http://beam.incubator.apache.org/mailing_lists/
3: Contribute!
Apache Beam git repo - https://github.com/apache/incubator-beam
![Page 23: Introduction to Apache Beam](https://reader034.vdocuments.site/reader034/viewer/2022050714/5877d9f61a28abaa6c8b5f71/html5/thumbnails/23.jpg)
Q&A