Google Cloud Dataflow
TRANSCRIPT
![Page 1: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/1.jpg)
Cloud Dataflow
A Google Service and SDK
![Page 2: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/2.jpg)
The Presenter
Alex Van Boxel, Software Architect, software engineer, devops, tester,...
at Vente-Exclusive.com
Introduction
Twitter @alexvb
Plus +AlexVanBoxel
E-mail [email protected]
Web http://alex.vanboxel.be
![Page 3: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/3.jpg)
Dataflow
what, how, where?
![Page 4: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/4.jpg)
Dataflow is...
A set of SDKs that define the programming model you use to build your streaming and batch processing pipelines (*)
Google Cloud Dataflow is a fully managed service that will run and optimize your pipeline
![Page 5: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/5.jpg)
Dataflow where...
ETL
● Move
● Filter
● Enrich
Analytics
● Streaming Computation
● Batch Computation
● Machine Learning
Composition
● Connecting in Pub/Sub
○ Could be other dataflows
○ Could be something else
![Page 6: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/6.jpg)
NoOps Model
● Resources allocated on demand
● Lifecycle managed by the service
● Optimization
● Rebalancing
Cloud Dataflow Benefits
![Page 7: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/7.jpg)
Unified Programming Model
Unified: Streaming and Batch
Open Sourced
● Java 8 implementation
● Python in the works
![Page 8: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/8.jpg)
Unified Programming Model (streaming)
[Diagram: Dataflow (streaming), Dataflow (batch), BigQuery]
![Page 9: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/9.jpg)
Unified Programming Model (stream and backup)
[Diagram: Dataflow (streaming), Dataflow (batch), BigQuery]
![Page 10: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/10.jpg)
Unified Programming Model (batch)
[Diagram: Dataflow (streaming), Dataflow (batch), BigQuery]
![Page 11: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/11.jpg)
Beam
Apache Incubation
![Page 12: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/12.jpg)
Dataflow SDK becomes Apache Beam
![Page 13: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/13.jpg)
Dataflow SDK becomes Apache Beam
● Pipeline first, runtime second
○ focus on your data pipelines, not on the characteristics of the runner
● Portability
○ portable across a number of runtime engines
○ choose the runtime based on any number of considerations
■ performance
■ cost
■ scalability
![Page 14: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/14.jpg)
Dataflow SDK becomes Apache Beam
● Unified model
○ Batch and streaming are integrated into a unified model
○ Powerful semantics, such as windowing, ordering and triggering
● Development tooling
○ tools you need to create portable data pipelines quickly and easily using open-source languages, libraries and tools.
![Page 15: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/15.jpg)
Dataflow SDK becomes Apache Beam
● Scala Implementation
● Alternative Runners
○ Spark Runner from Cloudera Labs
○ Flink Runner from data Artisans
![Page 16: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/16.jpg)
Dataflow SDK becomes Apache Beam
● Cloudera
● data Artisans
● Talend
● Cask
● PayPal
![Page 17: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/17.jpg)
SDK Primitives
Autopsy of a dataflow pipeline
![Page 18: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/18.jpg)
Pipeline
● Inputs: Pub/Sub, BigQuery, GCS, XML, JSON, …
● Transforms: Filter, Join, Aggregate, …
● Windows: Sliding, Sessions, Fixed, …
● Triggers: Correctness….
● Outputs: Pub/Sub, BigQuery, GCS, XML, JSON, ...
Logical Model
![Page 19: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/19.jpg)
PCollection<T>
● Immutable collection
● Can be bounded or unbounded
● Created by
○ a backing data store
○ a transformation from another PCollection
○ generated
● PCollection<KV<K,V>>
Dataflow SDK Primitives
![Page 20: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/20.jpg)
ParDo<I,O>
● Parallel processing of each element in the PCollection
● Developer provides a DoFn<I,O>
● Shards are processed independently of each other
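A minimal plain-Java sketch of the ParDo idea, not the Dataflow SDK itself. `ElementFn` is a hypothetical stand-in for `DoFn<I,O>`, and a parallel stream stands in for the shards the service would distribute. Each element is processed on its own and may emit zero or more outputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;

/** Plain-Java illustration of element-wise parallel processing (not the Dataflow SDK). */
final class ParDoSketch {

    /** Hypothetical stand-in for DoFn<I, O>: may emit zero or more outputs per input element. */
    interface ElementFn<I, O> {
        void processElement(I element, Consumer<O> output);
    }

    /** Applies fn to every element independently; no state is shared between elements. */
    static <I, O> List<O> parDo(List<I> input, ElementFn<I, O> fn) {
        return input.parallelStream()
            .flatMap(element -> {
                List<O> emitted = new ArrayList<>();
                fn.processElement(element, emitted::add);
                return emitted.stream();
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be", "", "or not");
        // Split each line into words; the empty line emits nothing.
        List<String> words = parDo(lines, (String line, Consumer<String> out) -> {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    out.accept(word);
                }
            }
        });
        System.out.println(words);
    }
}
```

Because the function only sees one element at a time, the work can be split across any number of workers without coordination.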
![Page 21: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/21.jpg)
ParDo<I,O>
Interesting concept of side outputs: allows for data sharing
![Page 22: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/22.jpg)
PTransform
● Combine small operations into bigger transforms
● Standard transforms (COUNT, SUM,...)
● Reuse and Monitoring
public class BlockTransform extends PTransform<PCollection<Legacy>, PCollection<Event>> {
    @Override
    public PCollection<Event> apply(PCollection<Legacy> input) {
        return input.apply(ParDo.of(new TimestampEvent1Fn()))
            .apply(ParDo.of(new FromLegacy2EventFn())).setCoder(AvroCoder.of(Event.class))
            .apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20))))
            .apply(ParDo.of(new ExtactChannelUserFn()));
    }
}
![Page 23: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/23.jpg)
Merging
● GroupByKey
○ Takes a PCollection of KV<K,V> and groups the values by key
○ Input must be keyed
● Flatten
○ Merges PCollections of the same type
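The two merging primitives can be illustrated with ordinary collections. This is a plain-Java sketch of the semantics only, not the Dataflow SDK: `groupByKey` and `flatten` are hypothetical helper names, and an in-memory `List` stands in for a PCollection.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of GroupByKey and Flatten semantics (not the Dataflow SDK). */
final class MergingSketch {

    /** Like GroupByKey: collects all values that share a key into one list per key. */
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), key -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Like Flatten: merges several collections of the same element type into one. */
    @SafeVarargs
    static <T> List<T> flatten(List<? extends T>... collections) {
        List<T> merged = new ArrayList<>();
        for (List<? extends T> collection : collections) {
            merged.addAll(collection);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> clicks = Arrays.asList(
            new AbstractMap.SimpleEntry<String, Integer>("alice", 1),
            new AbstractMap.SimpleEntry<String, Integer>("bob", 2),
            new AbstractMap.SimpleEntry<String, Integer>("alice", 3));
        System.out.println(groupByKey(clicks));
        System.out.println(flatten(Arrays.asList("a", "b"), Arrays.asList("c")));
    }
}
```

In the real model the grouped output is itself a keyed collection, which is why the input has to be a collection of key/value pairs in the first place.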
![Page 24: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/24.jpg)
Input/Output
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
// streamData is Unbounded; apply windowing afterward.
PCollection<String> streamData =
    p.apply(PubsubIO.Read.named("ReadFromPubsub")
        .topic("/topics/my-topic"));

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<TableRow> weatherData = p
    .apply(BigQueryIO.Read.named("ReadWeatherStations")
        .from("clouddataflow-readonly:samples.weather_stations"));
![Page 25: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/25.jpg)
Input/Output
● Multiple Inputs + Outputs
![Page 26: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/26.jpg)
Primitives on steroids
Autopsy of a dataflow pipeline
![Page 27: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/27.jpg)
Windows
● Split unbounded collections (e.g. Pub/Sub)
● Work on bounded collections as well
.apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20))))
Window.<CAL>into(SlidingWindows.of(Duration.standardMinutes(5)));
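To make the window types concrete, here is a plain-Java sketch of how a timestamp could be assigned to fixed, sliding and session windows. It is not SDK code: the class and method names are hypothetical, timestamps are plain milliseconds, the sliding period is passed explicitly as an assumption, and the session logic is simplified to one key with timestamps already sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of window assignment (not the Dataflow SDK). */
final class WindowSketch {

    /** Fixed windows: each timestamp falls into exactly one window [start, start + size). */
    static long fixedWindowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /** Sliding windows: a timestamp belongs to every window [start, start + size) that covers it. */
    static List<Long> slidingWindowStarts(long timestampMillis, long sizeMillis, long periodMillis) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMillis - Math.floorMod(timestampMillis, periodMillis);
        for (long start = lastStart; start > timestampMillis - sizeMillis; start -= periodMillis) {
            starts.add(start);
        }
        return starts;
    }

    /** Sessions: a silence of at least gapMillis between two timestamps starts a new session. */
    static List<List<Long>> sessions(List<Long> sortedTimestamps, long gapMillis) {
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = null;
        long previous = 0;
        for (long timestamp : sortedTimestamps) {
            if (current == null || timestamp - previous >= gapMillis) {
                current = new ArrayList<>();
                sessions.add(current);
            }
            current.add(timestamp);
            previous = timestamp;
        }
        return sessions;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        System.out.println(fixedWindowStart(125_000L, minute));
        System.out.println(slidingWindowStarts(450_000L, 5 * minute, minute));
        System.out.println(sessions(Arrays.asList(0L, 10 * minute, 50 * minute), 20 * minute));
    }
}
```

The same functions work for bounded data, since assignment depends only on each element's timestamp and not on whether the input ever ends.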
![Page 28: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/28.jpg)
Triggers
● Time-based
● Data-based
● Composite
AfterEach.inOrder(
    AfterWatermark.pastEndOfWindow(),
    Repeatedly.forever(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardMinutes(10)))
        .orFinally(AfterWatermark
            .pastEndOfWindow()
            .plusDelayOf(Duration.standardDays(2))));
![Page 29: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/29.jpg)
Input/Output
[Figure: event time vs. processing time, showing the watermark and the early and late trigger firings]
![Page 30: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/30.jpg)
Input/Output
[Figure: event time vs. processing time, showing the watermark and the early and late trigger firings]
![Page 31: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/31.jpg)
Google Cloud Dataflow
The service
![Page 32: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/32.jpg)
What you need
● Google Cloud SDK
○ utilities to authenticate and manage project
● Java IDE
○ Eclipse
○ IntelliJ
● Standard Build System
○ Maven
○ Gradle
![Page 33: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/33.jpg)
What you need
subprojects {
    apply plugin: 'java'

    repositories {
        maven {
            url "http://jcenter.bintray.com"
        }
        mavenLocal()
    }

    dependencies {
        compile 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:1.3.0'
        testCompile 'org.hamcrest:hamcrest-all:1.3'
        testCompile 'org.assertj:assertj-core:3.0.0'
        testCompile 'junit:junit:4.12'
    }
}
![Page 34: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/34.jpg)
How it works
Google Cloud Dataflow Service
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
    .apply(FlatMapElements.via((String word) -> Arrays.asList(word.split("[^a-zA-Z']+")))
        .withOutputType(new TypeDescriptor<String>() {}))
    .apply(Filter.byPredicate((String word) -> !word.isEmpty()))
    .apply(Count.<String>perElement())
    .apply(MapElements
        .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue())
        .withOutputType(new TypeDescriptor<String>() {}))
    .apply(TextIO.Write.to("gs://YOUR_OUTPUT_BUCKET/AND_OUTPUT_PREFIX"));
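For comparison, the same word-count logic can be run locally with plain Java streams. This is only a sketch of what each pipeline step computes, not how the service executes it: `LocalWordCount` is a hypothetical name, the input is an in-memory list instead of files on GCS, and the output is sorted by word purely to make it deterministic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Local, in-memory equivalent of the word-count pipeline steps (not the Dataflow SDK). */
final class LocalWordCount {

    /** Returns "word: count" lines, sorted by word. */
    static List<String> countWords(List<String> lines) {
        Map<String, Long> counts = lines.stream()
            // FlatMapElements step: split every line into words.
            .flatMap(line -> Arrays.stream(line.split("[^a-zA-Z']+")))
            // Filter step: drop the empty strings that split() can produce.
            .filter(word -> !word.isEmpty())
            // Count.perElement step: count occurrences of each distinct word.
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
        // MapElements step: format each key/value pair as text.
        return counts.entrySet().stream()
            .map(entry -> entry.getKey() + ": " + entry.getValue())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        countWords(Arrays.asList("to be or not to be", "that's it"))
            .forEach(System.out::println);
    }
}
```

The pipeline version expresses the same four steps, but as a graph the service can stage, optimize and spread over workers.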
![Page 35: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/35.jpg)
Service
● Stage
● Optimize (Flume)
● Execute
![Page 36: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/36.jpg)
Optimization
![Page 37: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/37.jpg)
Optimization
![Page 38: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/38.jpg)
Console
![Page 39: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/39.jpg)
Lifecycle
![Page 40: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/40.jpg)
Compute Worker Scaling
![Page 41: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/41.jpg)
Compute Worker Rescaling
![Page 42: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/42.jpg)
Compute Worker Rebalancing
![Page 43: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/43.jpg)
Compute Worker Rescaling
![Page 44: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/44.jpg)
Lifecycle
![Page 45: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/45.jpg)
BigQuery
![Page 46: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/46.jpg)
Testability
Test Driven Development Built-In
![Page 47: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/47.jpg)
Testability - Built in
Dataflow Testability
KV<String, List<CalEvent>> input = KV.of("", new ArrayList<>());
input.getValue().add(CalEvent.event("SESSION", "REFERRAL")
    .build());
input.getValue().add(CalEvent.event("SESSION", "START")
    .data("{\"referral\":{\"Sourc...}").build());

DoFnTester<KV<?, List<CalEvent>>, CalEvent> fnTester = DoFnTester.of(new SessionRepairFn());
fnTester.processElement(input);

List<CalEvent> calEvents = fnTester.peekOutputElements();
assertThat(calEvents).hasSize(2);

CalEvent referral = CalUtil.findEvent(calEvents, "REFERRAL");
assertNotNull("REFERRAL should still exist", referral);

CalEvent start = CalUtil.findEvent(calEvents, "START");
assertNotNull("START should still exist", start);
assertNull("START should have referral removed.",
![Page 48: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/48.jpg)
Appendix
Follow the yellow brick road
![Page 49: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/49.jpg)
Paper - Flume Java
FlumeJava: Easy, Efficient Data-Parallel Pipelines
http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
More Dataflow Resources
![Page 50: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/50.jpg)
Dataflow/Beam vs Spark
Dataflow/Beam & Spark: A Programming Model Comparison
https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
![Page 51: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/51.jpg)
Github - Integrate Dataflow with Luigi
Google Cloud integration for Luigi
https://github.com/alexvanboxel/luigiext-gcloud
![Page 52: Google Cloud Dataflow](https://reader030.vdocuments.site/reader030/viewer/2022021422/586fdd0c1a28ab18428b6709/html5/thumbnails/52.jpg)
Questions and Answers
The last slide
Twitter @alexvb
Plus +AlexVanBoxel
E-mail [email protected]
Web http://alex.vanboxel.be