cloud native data pipelines anand.pdf · 2016-10-20 · data pipeline correctness operability...
TRANSCRIPT
Cloud Native Data Pipelines
Sid Anand QCon Shanghai & Tokyo 2016
1
About Me
2
Work [ed | s] @
Committer & PPMC on
Father of 2
Co-Chair for
Apache Airflow
Agari
3
What We Do!
Agari : What We Do
4
9
Enterprise Customers
email metadata
apply trust
models
email md + trust score
Agari’s Previous EP Version
Agari : What We Do
Batch
10
email metadata
apply trust
models
email md + trust score
Agari’s Current EP Version
Enterprise Customers
Agari : What We Do
Near-real time
Quarantine
Data Pipelines
BI vs Predictive
11
Data Pipelines (BI)
12
Web Servers
OLTP DB
Data Warehouse
Reporting Tools
Query Browsers
ETL (batch)
MySQL, Oracle, Cassandra
Teradata, Redshift, BigQuery
Data Pipelines (Predictive)
13
OLTP DB or cache
ETL (batch or streaming)
MySQL, Oracle, Cassandra, Redis
Spark, Flink, Beam, Storm
Web Servers
Data Products
Ranking (Search, News Feed), Recommender Products, Fraud Detection/Prevention
Data Source
Data Products
14
BI Predictive
Common Focus of this talk
Data Pipelines
15
Web Servers
OLTP DB
Data Warehouse
Reporting Tools
Query Browsers
ETL (batch)
MySQL, Oracle, Cassandra
Teradata, Redshift, BigQuery
OLTP DB or cache
ETL (batch or streaming)
MySQL, Oracle, Cassandra, Redis
Spark, Flink, Beam, Storm
Web Servers
Ranking (Search, News Feed), Recommender Products, Fraud Detection/Prevention
Data Source
Motivation
Cloud Native Data Pipelines
16
Cloud Native Data Pipelines
17
Big Data Companies like LinkedIn, Facebook, Twitter, & Google build custom, large scale data pipelines that run in their own Data Centers
Most start-ups run in the public cloud. Can they leverage aspects of the public cloud to build comparable pipelines?
Cloud Native Data Pipelines
19
Cloud Native Techniques
Open Source Technologies
Custom Data Pipeline Stacks seen in Big Data companies
~
Design Goals
Desirable Qualities of a Resilient Data Pipeline
20
21
Desirable Qualities of a Resilient Data Pipeline
Operability
Correctness
Timeliness
Cost
22
Desirable Qualities of a Resilient Data Pipeline
Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions
Timeliness
• All output within time-bound SLAs
Operability
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
Cost
• Pay-as-you-go
Quickly Recoverable
23
• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast radius
• Optimize for MTTR
Predictive Analytics @ Agari
Use Cases
24
Use Cases
25
Apply trust models (message scoring)
batch + near real time
Build trust models
batch
(Enterprise Protect)
Use-Case : Message Scoring (batch)
Batch Pipeline Architecture
26
Use-Case : Message Scoring
27
enterprise A, enterprise B, enterprise C
S3
An Avro file is uploaded to S3 every 15 minutes
Use-Case : Message Scoring
28
enterprise A, enterprise B, enterprise C
S3
Airflow kicks off a Spark message scoring job every hour (EMR)
Use-Case : Message Scoring
29
enterprise A, enterprise B, enterprise C
S3
Spark job writes scored messages and stats to another S3 bucket
S3
Use-Case : Message Scoring
30
enterprise A, enterprise B, enterprise C
S3
This triggers SNS/SQS message events
S3
SNS
SQS
Use-Case : Message Scoring
31
enterprise A, enterprise B, enterprise C
S3
An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
S3
SNS
SQS
Importers
ASG
32
enterprise A, enterprise B, enterprise C
S3
The importers rapidly ingest scored messages and aggregate statistics into the DB
S3
SNS
SQS
Importers
ASG
DB
Use-Case : Message Scoring
33
enterprise A, enterprise B, enterprise C
S3
Users receive alerts of untrusted emails & can review them in the web app
S3
SNS
SQS
Importers
ASG
DB
Use-Case : Message Scoring
34
enterprise A, enterprise B, enterprise C
S3
S3
SNS
SQS
Importers
ASG
DB
Airflow manages the entire process
Use-Case : Message Scoring
Tackling Cost & Timeliness
Leveraging the AWS Cloud
35
Tackling Cost
36
Between Daily Runs During Daily Runs
When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR
Tackling Cost
37
Between Hourly Runs During Hourly Runs
When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR
This does not help when runs are hourly since AWS charges at an hourly rate for EC2 instances!
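The cost argument on these two slides can be checked with a little arithmetic. The sketch below assumes that every started instance-hour is billed in full (the hourly charging the slide describes) and that instances are terminated between runs. The function name and the run durations (60 minutes for the daily run, 15 minutes for each hourly run) are illustrative assumptions, not figures from the talk.

```python
import math


def billed_instance_hours(runs_per_day: int, minutes_per_run: float) -> int:
    """Instance-hours billed per day for one instance under per-hour billing.

    Assumption: each run starts a fresh instance and every started hour is
    charged in full. The total is capped at 24, the cost of never shutting down.
    """
    hours_per_run = math.ceil(minutes_per_run / 60)
    return min(24, runs_per_day * hours_per_run)


daily = billed_instance_hours(runs_per_day=1, minutes_per_run=60)    # hypothetical duration
hourly = billed_instance_hours(runs_per_day=24, minutes_per_run=15)  # hypothetical duration
always_on = 24

print(daily, hourly, always_on)  # 1 24 24
```

Under these assumptions a daily run pays for 1 hour out of 24, while 24 short hourly runs are billed the same as an instance that never shuts down.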
Tackling Timeliness
Auto Scaling Group (ASG)
38
ASG - Overview
39
What is it?
A means to automatically scale out/in clusters to handle variable load/traffic
A means to keep a cluster/service of a fixed size always up
ASG - Data Pipeline
40
importer
importer
importer
importer
Importer ASG
scale out / in
SQS
DB
41
Sent
CPU
ACKd/Recvd
CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant
ASG : CPU-based
ASG : CPU-based
42
Sent
CPU
Recv
Premature Scale-in
Premature Scale-in:
• The CPU drops to noise-levels before all messages are consumed
• This causes scale in to occur while the last few messages are still being committed
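Premature scale-in can be illustrated with a simplified proportional rule that sizes the group so that average CPU moves toward a target. This is only a sketch of the general idea, not AWS's actual scaling-policy implementation; the function name, the 50% target, and the sample numbers are assumptions.

```python
import math


def cpu_desired_capacity(current: int, avg_cpu: float, target_cpu: float,
                         min_size: int = 0) -> int:
    """Simplified proportional rule: resize so that average CPU approaches the target."""
    return max(min_size, math.ceil(current * avg_cpu / target_cpu))


# Tail of a batch: 4 importers, CPU is down at noise level (2%),
# but 3 messages are still in flight and have not been committed yet.
in_flight = 3
desired = cpu_desired_capacity(current=4, avg_cpu=2.0, target_cpu=50.0)

print(desired)  # 1
```

The rule only sees CPU, so it shrinks the group from 4 instances to 1 even though 3 messages are still being processed.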
43
ASG : Queue-based
Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow.
Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink.
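The two queue-based rules can be sketched as a single decision function. The names below are hypothetical. In SQS terms, `visible` and `in_flight` would correspond to the CloudWatch metrics ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible. Giving scale-out precedence when both conditions hold is an assumption of this sketch, not something the slides state.

```python
from enum import Enum


class Action(Enum):
    SCALE_OUT = "scale_out"
    SCALE_IN = "scale_in"
    HOLD = "hold"


def queue_scaling_action(visible: int, in_flight: int) -> Action:
    """Decide what the importer ASG should do based on queue state alone."""
    if visible > 0:
        # Queue depth > 0: there is work waiting, so grow the group.
        return Action.SCALE_OUT
    if in_flight == 0:
        # Nothing waiting and the last in-flight message was ACK'd: shrink.
        return Action.SCALE_IN
    # Nothing waiting, but some messages are still being processed: keep the group.
    return Action.HOLD


print(queue_scaling_action(visible=0, in_flight=3))  # Action.HOLD
```

Unlike the CPU-based rule, this one does not shrink the group while messages are still in flight.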
44
Desirable Qualities of a Resilient Data Pipeline
Operability
Correctness
Timeliness
Cost
• ASG • EMR Spark
Daily • ASG • EMR Spark
Hourly ASG • No Cost Savings
Tackling Operability & Correctness
Leveraging Tooling
45
46
A simple way to author and manage workflows
Provides visual insight into the state & performance of workflow runs
Integrates with our alerting and monitoring tools
Tackling Operability : Requirements
Apache Airflow
Workflow Automation & Scheduling
47
48
Airflow: Author DAGs in Python! No need to bundle many config files!
Apache Airflow - Authoring DAGs
49
Airflow: Visualizing a DAG
Apache Airflow - Authoring DAGs
50
Airflow: It’s easy to manage multiple DAGs
Apache Airflow - Managing DAGs
Apache Airflow - Perf. Insights
51
Airflow: Gantt chart view reveals the slowest tasks for a run!
52
Apache Airflow - Perf. Insights
Airflow: Task Duration chart view shows task completion time trends!
53
Airflow: …And easy to integrate with Ops tools!
Apache Airflow - Alerting
54
Apache Airflow - Correctness
55
Desirable Qualities of a Resilient Data Pipeline
Operability
Correctness
Timeliness
Cost
Use-Case : Message Scoring (near-real time)
NRT Pipeline Architecture
56
Use-Case : Message Scoring
57
enterprise A, enterprise B, enterprise C
Kinesis batch put every second
K
Use-Case : Message Scoring
58
enterprise A, enterprise B, enterprise C
K
An ASG of scorers is scaled up to one process per core per Kinesis shard
Scorers
ASG
Use-Case : Message Scoring
59
enterprise A, enterprise B, enterprise C
K
Scorers
ASG
Kinesis
Scorers apply the trust model and send scored messages downstream
Use-Case : Message Scoring
60
enterprise A, enterprise B, enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
An ASG of importers is scaled up to rapidly import messages
DB
Use-Case : Message Scoring
61
enterprise A, enterprise B, enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
Imported messages are also consumed by the alerter
DB
K
Alerters
ASG
Innovations
NRT Pipeline Architecture
63
64
The architecture is composed of repeated patterns of:
• ASG-based compute consumer
• Kinesis transport streams (i.e. AWS’ managed “Kafka”)
• A Lambda-based Avro Schema Registry
Innovation 1 : Repeatable Units
Compute_i
Kinesis_i
ASG_i
SR
65
You can chain these repeatable units together to make arbitrary DAGs (Directed Acyclic Graphs)
The example above is a simple Linear DAG with 3 units
Innovation 1 : Repeatable Units
(Diagram: three Compute_i / Kinesis_i / ASG_i / SR units chained in sequence)
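Chaining units into a DAG can be sketched with Python's standard `graphlib` module. The unit names follow the scorer, importer, and alerter stages described earlier in the talk. Modelling each unit as a dictionary entry that maps to its upstream units is an assumption made for illustration, not the way the pipeline is actually wired.

```python
from graphlib import TopologicalSorter

# Each repeatable unit (compute ASG + Kinesis stream) maps to the set of
# units it consumes from. A linear DAG with 3 units:
pipeline = {
    "scorer": set(),
    "importer": {"scorer"},
    "alerter": {"importer"},
}

# static_order() yields every unit after all of its upstream units,
# and raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(pipeline).static_order())

print(order)  # ['scorer', 'importer', 'alerter']
```

A branching DAG is expressed the same way: a unit that consumes from two upstream units simply lists both in its set.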
66
The message body is Avro-encoded, with one detail:
The schema is not included in the Kinesis message!
The schema would be 99% overhead for the message
Instead, a schema_id is sent in the message header
Innovation 2 : Avro Schema Registry
ASG1
Compute1 Compute2 Kinesis2
ASG2
SR
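One way to carry a schema_id alongside a schema-less body is a small fixed-size header in front of the payload. The sketch below is an assumed layout, not Agari's actual wire format: a 4-byte big-endian schema_id followed by the body. JSON bytes stand in for the Avro-encoded body so that the example needs only the standard library. The names `frame` and `unframe` and the sample record are hypothetical.

```python
import json
import struct

# Hypothetical header layout: one unsigned 32-bit big-endian schema_id.
HEADER = struct.Struct(">I")


def frame(schema_id: int, body: bytes) -> bytes:
    """Producer side: prefix the encoded body with the schema_id header."""
    return HEADER.pack(schema_id) + body


def unframe(record: bytes) -> tuple[int, bytes]:
    """Consumer side: split a record back into (schema_id, body)."""
    (schema_id,) = HEADER.unpack_from(record)
    return schema_id, record[HEADER.size:]


# JSON is only a stand-in here; in the talk the body is Avro-encoded.
body = json.dumps({"msg_id": "m-1", "trust_score": 0.12}).encode()
record = frame(7, body)

print(len(record) - len(body))  # 4
```

The per-message overhead is a constant 4 bytes, regardless of how large the schema itself is.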
67
When the Compute 2 consumer receives the message, it:
• First reads the schema_id out of the message header
• Contacts the Schema Registry for the schema (and caches it)
• Deserializes the Avro body using the newly acquired schema
Innovation 2 : Avro Schema Registry
ASG
Compute1 Compute2 Kinesis2
ASG
SR SR.getSchemaById()…
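The consumer-side steps (look up the schema by id, cache it, then decode) can be sketched as follows. Everything here is a stand-in: `REGISTRY` is an in-memory dictionary playing the role of the remote Schema Registry, `get_schema_by_id` only mirrors the name of the `SR.getSchemaById()` call on the slide, and JSON decoding plus a field-name check substitutes for real Avro deserialization.

```python
import json
from functools import lru_cache

# Hypothetical in-memory stand-in for the remote Schema Registry, keyed by schema_id.
REGISTRY = {
    7: json.dumps({
        "type": "record",
        "name": "ScoredMessage",
        "fields": [
            {"name": "msg_id", "type": "string"},
            {"name": "trust_score", "type": "double"},
        ],
    }),
}
registry_calls = 0  # counts round trips to the (simulated) registry


@lru_cache(maxsize=None)
def get_schema_by_id(schema_id: int) -> str:
    """Fetch a schema once; later calls for the same id are served from the cache."""
    global registry_calls
    registry_calls += 1
    return REGISTRY[schema_id]


def consume(schema_id: int, body: bytes) -> dict:
    """Resolve the schema for this message, then decode the body against it."""
    schema = json.loads(get_schema_by_id(schema_id))
    record = json.loads(body)  # stand-in for Avro decoding with the schema
    expected = {field["name"] for field in schema["fields"]}
    if set(record) != expected:
        raise ValueError(f"body does not match schema {schema_id}")
    return record


body = json.dumps({"msg_id": "m-1", "trust_score": 0.12}).encode()
for _ in range(3):
    consume(7, body)

print(registry_calls)  # 1
```

Three messages with the same schema_id cost a single registry lookup, which is the point of caching in the consumer.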
68
enterprise A, enterprise B, enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
Imported messages are also consumed by the alerter
DB
K
Alerters
ASG
SR
SR
SR
Innovation 2 : Avro Schema Registry
Airflow Job Reactively Scales
Innovation 3 : Reactive-Scaling (WIP)
69
enterprise A, enterprise B, enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
DB
K
Alerters
ASG
SR
SR
SR
70
If the Anomaly-detector & Reverter (ADR) is triggered and a model build or code push was recently made to Compute 1, the ADR reverts the last code or model push to ASG Compute 1
Innovation 4 : Anomaly-based Rollback (WIP)
ASG
Compute1 Compute2 Kinesis
ASG
SR
Anomaly-detector & Reverter
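The rollback rule can be sketched as a pure decision function. The slide does not say what counts as "recently", so the one-hour `RECENT_WINDOW` is an assumption, as are the `Push` record, the function name, and the sample versions. The real component was still a work in progress at the time of the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RECENT_WINDOW = timedelta(hours=1)  # assumption: what counts as "recently"


@dataclass
class Push:
    kind: str          # "code" or "model"
    version: str
    pushed_at: datetime


def push_to_revert(anomaly_detected: bool, last_push: Optional[Push],
                   now: datetime) -> Optional[Push]:
    """Return the push that should be reverted, or None if no rollback applies."""
    if not anomaly_detected or last_push is None:
        return None
    if now - last_push.pushed_at > RECENT_WINDOW:
        # The anomaly is unlikely to be caused by a push this old.
        return None
    return last_push


now = datetime(2016, 10, 20, 12, 0)
recent = Push("model", "v42", now - timedelta(minutes=10))  # hypothetical push
stale = Push("code", "v41", now - timedelta(days=2))        # hypothetical push

print(push_to_revert(True, recent, now) is recent)  # True
```

Keeping the decision separate from the act of reverting makes the rule easy to test without touching a live ASG.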
Open Source Plans
71
Follow us to be notified when the following are open-sourced
• Avro Schema Registry
• Agari (Kinesis+ASG) scaling tool (Airflow Job)
• Anomaly-detector & Reverter
To be notified, follow @AgariEng & @r39132
Acknowledgments
72
None of this work would be possible without the contributions of the strong team below
• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones
• Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle
Questions? (@r39132)
73