Google Cloud Platform (files.meetup.com/18404940/Big Data Reference Architecture - Reza Rokni.pdf)
Google Cloud Platform Reference Architecture (Streaming)
Reza Rokni
Introduction: Data ..
GB's
Introduction: ...can be Big
TB's
Introduction: ... really really big! ... but at least always batch?
PB's
[Figure labels: Tuesday, Wednesday, Thursday]
Introduction: ... well... but at least it's on time..
[Figure: timeline with hourly ticks from 1:00 to 14:00]
Introduction: ... it doesn't even have the courtesy to be on time!
[Figure: timeline with hourly ticks from 8:00 to 14:00; three items labelled 8:00]
Google confidential │ Do not distribute
Processes: Reference Architecture
Let’s process some data
1,000,000's / sec
10 / sec
Cloud Pub/Sub: Async Messaging
Massive Scale NoSQL: NoSQL Database Service
Cloud Dataflow: Parallel data processing
BigQuery: Analytics Engine
CloudML: Machine Learning
File
Cloud Storage: Object Store
Exports
Cloud Dataproc: Managed Spark, Hadoop
Processes: Reference Architecture
Let’s process some data
1,000,000's / sec
100 / sec
Cloud Pub/Sub: Async Messaging
Cloud Storage: Object Store
Capture
• Globally redundant
• Batched read/write
• Custom labels
• Push & Pull
• Auto expiration
• 10 MB Message Size
• 7 Days storage for unacknowledged messages
[Diagram: Publishers A, B and C send Messages 1, 2 and 3 to Topics A, B and C in Cloud Pub/Sub. Subscriptions XA, XB, YC and ZC deliver them through the Cloud Pub/Sub API to Subscribers X, Y and Z; Message 3 appears twice on the subscriber side.]
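As a rough illustration of the fan-out in the diagram, the sketch below models topics and subscriptions in memory. It is not the Cloud Pub/Sub client library: the class `FanOutSketch` and its methods are hypothetical, delivery and acknowledgement are collapsed into a single `pull`, and retention limits are ignored. It assumes, as the subscription names suggest, that Subscriptions YC and ZC are both attached to Topic C.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory model of topic/subscription fan-out; NOT the Cloud Pub/Sub client API. */
final class FanOutSketch {
    // topic name -> names of the subscriptions attached to it
    private final Map<String, List<String>> subscriptionsByTopic = new LinkedHashMap<>();
    // subscription name -> messages waiting to be pulled
    private final Map<String, Deque<String>> backlog = new LinkedHashMap<>();

    void createSubscription(String topic, String subscription) {
        subscriptionsByTopic.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscription);
        backlog.put(subscription, new ArrayDeque<>());
    }

    /** Publishing copies the message into the backlog of every subscription on the topic. */
    void publish(String topic, String message) {
        List<String> subs = subscriptionsByTopic.getOrDefault(topic, Collections.emptyList());
        for (String s : subs) {
            backlog.get(s).addLast(message);
        }
    }

    /** Returns and removes the oldest waiting message, or null if there is none. */
    String pull(String subscription) {
        Deque<String> queue = backlog.get(subscription);
        return (queue == null) ? null : queue.pollFirst();
    }

    public static void main(String[] args) {
        FanOutSketch bus = new FanOutSketch();
        bus.createSubscription("Topic C", "Subscription YC");
        bus.createSubscription("Topic C", "Subscription ZC");
        bus.publish("Topic C", "Message 3");
        // Each subscription receives its own copy of the message.
        System.out.println(bus.pull("Subscription YC"));
        System.out.println(bus.pull("Subscription ZC"));
    }
}
```

Under that reading, one published Message 3 reaches both Subscriber Y and Subscriber Z, which would explain why it appears twice on the subscriber side of the diagram.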
Processes: Reference Architecture
Let’s process some data
1,000,000's / sec
10 / sec
Cloud Pub/Sub: Async Messaging
Cloud Dataflow: Parallel data processing
File
Cloud Storage: Object Store
Introduction: Google Cloud Dataflow (Apache Beam)
Apache Beam (incubating), Google Cloud Dataflow
Dataflow explained: Extra Reading (FlumeJava Combined with MillWheel)
Dataflow explained: FlumeJava - The What not the How
FlumeJava
TextIO.Read(MarketData)
ParDo(enrichData(bidsize, ask, bid, trade))
ParDo(filterData(bidsize > x))
BigQueryIO.Write
Code shown is pseudocode only
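To make the "what, not how" idea concrete, here is a minimal sketch of the same read, enrich, filter shape written with plain `java.util.stream`, not with FlumeJava or the Dataflow SDK. The `Quote` class, the comma-separated input format and the derived `spread` field are assumptions made for illustration; the slide only names bidsize, ask, bid and trade.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Declarative read -> enrich -> filter chain; a stand-in for the slide's pseudocode. */
final class MarketDataSketch {
    /** Hypothetical record shape, not taken from the slides. */
    static final class Quote {
        final double bid;
        final double ask;
        final int bidSize;
        final double spread; // derived field added by the enrich step

        Quote(double bid, double ask, int bidSize, double spread) {
            this.bid = bid;
            this.ask = ask;
            this.bidSize = bidSize;
            this.spread = spread;
        }
    }

    /** Parses one "bid,ask,bidSize" line (assumed format). */
    static Quote parse(String line) {
        String[] f = line.split(",");
        return new Quote(Double.parseDouble(f[0].trim()), Double.parseDouble(f[1].trim()),
                Integer.parseInt(f[2].trim()), 0.0);
    }

    /** Enrich step: adds the bid/ask spread. */
    static Quote enrich(Quote q) {
        return new Quote(q.bid, q.ask, q.bidSize, q.ask - q.bid);
    }

    /** The pipeline states what happens to each element; the runtime decides how to execute it. */
    static List<Quote> run(List<String> lines, int minBidSize) {
        return lines.stream()
                .map(MarketDataSketch::parse)
                .map(MarketDataSketch::enrich)
                .filter(q -> q.bidSize > minBidSize)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Quote> kept = run(Arrays.asList("10.0,10.5,100", "20.0,20.25,5"), 50);
        for (Quote q : kept) {
            System.out.println(q.bidSize + " " + q.spread);
        }
    }
}
```

The chain never says how many threads or machines run each step, which is the property the slide attributes to FlumeJava.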
Dataflow explained: MillWheel - Framework for low latency data processing
Processes: Optimizer fusion
[Diagram: two fusion cases, consumer-producer and sibling; in each, steps C and D are fused into a single step C+D.]
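The consumer-producer case can be sketched in plain Java as function composition. This is only an analogy for what an optimizer does, not Dataflow's implementation; the class `FusionSketch` and both method names are hypothetical, and sibling fusion is not shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Consumer-producer fusion as function composition; an analogy, not the Dataflow optimizer. */
final class FusionSketch {
    /** Unfused: step C materializes its whole output before step D starts. */
    static <A, B, R> List<R> unfused(List<A> input, Function<A, B> c, Function<B, R> d) {
        List<B> intermediate = input.stream().map(c).collect(Collectors.toList());
        return intermediate.stream().map(d).collect(Collectors.toList());
    }

    /** Fused: C and D become one step "C+D", so no intermediate collection is built. */
    static <A, B, R> List<R> fused(List<A> input, Function<A, B> c, Function<B, R> d) {
        Function<A, R> cPlusD = c.andThen(d);
        return input.stream().map(cPlusD).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Function<Integer, Integer> c = x -> x + 1;
        Function<Integer, Integer> d = x -> x * 2;
        List<Integer> input = Arrays.asList(1, 2, 3);
        // Both variants produce the same result; only the execution plan differs.
        System.out.println(unfused(input, c, d));
        System.out.println(fused(input, c, d));
    }
}
```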
Processes: Dynamic Worker Optimization
100 mins. vs. 65 mins.
Building a clickstream processing pipeline
● In this example we will
○ Read Data from Pub/Sub
○ Window and Aggregate the Data
○ Do something programmatically with the data
[Diagram stages: Stream, Parse Message, Window, Count, Detect Anomaly, BigQuery (twice)]
[Diagram stages: Clickstream, Batch Read, Parse Message, BigQuery]
Pipeline p = Pipeline.create();
p.begin()
PCollection<String> dataCollection = p.apply(TextIO.Read.from("gs://…"))
dataCollection.apply(new ParseMessage())
    ParDo.of(new TokenizesMessage())
    ParDo.of(new CreateRecords())
.apply(BigQueryIO.Write.to(...))
STEP 1 - Transport
Code shown is pseudocode only
[Diagram stages: Clickstream, Batch Read, Parse Message, Window, Count, Detect Anomaly, BigQuery (twice)]
Pipeline p = Pipeline.create();
p.begin()
.apply(Window.<Record>into(FixedWindows.of(Duration.standardSeconds(60))))
.apply(ParDo.of(new CreateEventKey()))
.apply(Count)
.apply(ParDo.of(new DetectAnomaly()))
STEP 2 - Detect
Code shown is pseudocode only
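The window, count and detect steps can be illustrated without the SDK. The sketch below is plain Java, not Dataflow code: it assigns timestamps to 60-second fixed windows, counts events per window, and flags windows above a threshold. The threshold rule is a hypothetical stand-in for `DetectAnomaly`, whose logic the slides do not show, and the per-key grouping done by `CreateEventKey` is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Fixed-window counting with a simple threshold check; not the Dataflow SDK. */
final class WindowedCountSketch {
    static final long WINDOW_MILLIS = 60_000L; // 60-second fixed windows, as on the slide

    /** Start of the fixed window containing the timestamp (non-negative timestamps assumed). */
    static long windowStart(long timestampMillis) {
        return timestampMillis - (timestampMillis % WINDOW_MILLIS);
    }

    /** Counts events per window, keyed by window start time. */
    static Map<Long, Integer> countPerWindow(List<Long> eventTimestampsMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : eventTimestampsMillis) {
            counts.merge(windowStart(ts), 1, Integer::sum);
        }
        return counts;
    }

    /** Hypothetical anomaly rule: a window is anomalous if its count exceeds the threshold. */
    static List<Long> detectAnomalies(Map<Long, Integer> counts, int threshold) {
        List<Long> anomalous = new ArrayList<>();
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                anomalous.add(e.getKey());
            }
        }
        return anomalous;
    }

    public static void main(String[] args) {
        List<Long> clicks = Arrays.asList(1_000L, 2_000L, 59_999L, 60_000L, 125_000L);
        Map<Long, Integer> counts = countPerWindow(clicks);
        System.out.println(counts);
        System.out.println(detectAnomalies(counts, 2));
    }
}
```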
[Diagram stages: Stream, Parse Message, Window, Count, Detect Anomaly, BigQuery (twice)]
Pipeline p = Pipeline.create();
p.begin()
.apply(PubsubIO.Write.topic(...))
.apply(TextIO.Read.from("gs://…"))
.apply(PubsubIO.Read.topic(...))
STEP 3 - Stream
Code shown is pseudocode only
Data Processing Tradeoffs
Completeness (1 + 1 = 2)
Latency
Cost ($$$)
Requirements: Billing Pipeline
Completeness Low Latency Low Cost
Important
Not Important
Requirements: Live Cost Estimate Pipeline
Completeness Low Latency Low Cost
Important
Not Important
Requirements: Abuse Detection Pipeline
Completeness Low Latency Low Cost
Important
Not Important
Requirements: Abuse Detection Backfill Pipeline
Completeness Low Latency Low Cost
Important
Not Important
Dataflow explained: Inherent issues when dealing with streams
Watermarks
Watermark triggers
PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
Code shown is pseudocode only
Approximate Triggers
PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
Code shown is pseudocode only
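To show what early, on-time and late firings mean for a single two-minute window, here is a small plain-Java simulation. It is a sketch under several assumptions and does not use the Beam or Dataflow API: the `Arrival` class and its fields are hypothetical, the watermark is supplied as input rather than estimated, early firings are only checked when an element arrives, panes accumulate (each pane reports the running sum), and late data is accepted indefinitely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simulates early / on-time / late panes for one window; not the Beam API. */
final class TriggerSketch {
    /** One element reaching the pipeline. All fields are hypothetical inputs to the sketch. */
    static final class Arrival {
        final long processingTime; // when the element reached the pipeline (ms)
        final long eventTime;      // when the event actually happened (ms)
        final int value;
        final long watermark;      // watermark estimate after this arrival (ms)

        Arrival(long processingTime, long eventTime, int value, long watermark) {
            this.processingTime = processingTime;
            this.eventTime = eventTime;
            this.value = value;
            this.watermark = watermark;
        }
    }

    /**
     * Emits running sums for the window [0, windowEnd):
     * EARLY panes at most once per earlyPeriod of processing time,
     * one ON_TIME pane when the watermark passes windowEnd,
     * and one LATE pane for every element that arrives after that.
     */
    static List<String> panes(List<Arrival> arrivals, long windowEnd, long earlyPeriod) {
        List<String> out = new ArrayList<>();
        int sum = 0;
        boolean onTimeFired = false;
        long nextEarly = -1; // processing time at which the next early pane is due
        for (Arrival a : arrivals) {
            if (a.eventTime < windowEnd) { // element belongs to this window
                sum += a.value;
                if (onTimeFired) {
                    out.add("LATE:" + sum);
                } else if (nextEarly < 0) {
                    nextEarly = a.processingTime + earlyPeriod;
                } else if (a.processingTime >= nextEarly) {
                    out.add("EARLY:" + sum);
                    nextEarly = a.processingTime + earlyPeriod;
                }
            }
            if (!onTimeFired && a.watermark >= windowEnd) {
                out.add("ON_TIME:" + sum);
                onTimeFired = true;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Arrival> arrivals = Arrays.asList(
                new Arrival(0L, 10_000L, 5, 5_000L),
                new Arrival(30_000L, 40_000L, 3, 35_000L),
                new Arrival(70_000L, 90_000L, 2, 80_000L),
                new Arrival(125_000L, 121_000L, 7, 121_000L),  // next window; advances watermark
                new Arrival(140_000L, 100_000L, 4, 130_000L)); // late element for this window
        System.out.println(panes(arrivals, 120_000L, 60_000L));
    }
}
```

The early pane gives a low-latency estimate, the on-time pane is the watermark's view of a complete result, and the late pane corrects it, which is the completeness versus latency tradeoff the Requirements slides describe.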
Requirements: Live Cost Estimate Pipeline
Completeness Low Latency Low Cost
Important
Not Important
GCP
Managed Service
User Code & SDK
Work Manager
Job Manager
Monitoring UI
Deploy & Schedule
Progress & Logs
Processes: Reference Architecture
Let’s process some data
1,000,000's / sec
10 / sec
Cloud Pub/Sub: Async Messaging
Massive Scale NoSQL: NoSQL Database Service
Cloud Dataflow: Parallel data processing
BigQuery: Analytics Engine
File
Cloud Storage: Object Store
Exports
Pipeline Consumers: BigQuery or Bigtable... or both??
Massive Scale NoSQL: NoSQL Database Service
BigQuery: Analytics Engine
Processes: Reference Architecture
Let’s process some data
1,000,000's / sec
10 / sec
Cloud Pub/Sub: Async Messaging
Massive Scale NoSQL: NoSQL Database Service
Cloud Dataflow: Parallel data processing
BigQuery: Analytics Engine
CloudML: Machine Learning
File
Cloud Storage: Object Store
Exports
Cloud Dataproc: Managed Spark, Hadoop
Machine Learning: CloudML - Data pre-processing stages
If machine learning is the new rocket ship...
Data is the fuel!
Processes: CloudML - APIs
Let’s process some data
Speech API
Vision API
It is well known that a vital ingredient of success is not knowing that what you're attempting can't be done.
Terry Pratchett