Streaming Auto-Scaling in Google Cloud Dataflow

Manuel Fahndrich, Software Engineer, Google

TRANSCRIPT

Page 1: Streaming Auto-Scaling in Google Cloud Dataflow

Manuel Fahndrich
Software Engineer, Google

Page 2:

https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg

Addictive Mobile Game

Page 3:

[Leaderboard figure: Team Ranking / Individual Ranking and Hourly Ranking / Daily Ranking, with sample scores for players such as Sarah, Joe, and Milo]

Page 4:

An Unbounded Stream of Game Events

[Timeline of game events from 1:00 to 14:00]

Page 5:

… with unknown delays.

[Timeline figure: events from 8:00 arriving at various later times]

Page 6:

The Resource Allocation Problem

[Two plots of workload vs. time: in one, resources are over-provisioned; in the other, resources are under-provisioned]

Page 7:

Matching Resources to Workload

[Plot: auto-tuned resources tracking workload over time]

Page 8:

Resources = Parallelism

[Plot: auto-tuned parallelism tracking workload over time]

More generally: VMs (including CPU, RAM, network, IO).

Page 9:

Assumptions

Big Data Problem

Embarrassingly Parallel

Scaling VMs ==> Scales Throughput

Horizontal Scaling

Page 10:

Agenda

1. Streaming Dataflow Pipelines

2. Pipeline Execution

3. Adjusting Parallelism Automatically

4. Summary + Future Work

Page 11:

1. Streaming Dataflow

Page 12:

Google's Data-Related Systems

[Timeline figure, 2002–2016: GFS, MapReduce, Big Table, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow]

Page 13:

Google Dataflow SDK

Open Source SDK used to construct a Dataflow pipeline.

(Now Incubating as Apache Beam)

Page 14:

Computing Team Scores

// Collection of raw log lines
PCollection<String> raw = ...;

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> output =
    input.apply(Window.into(FixedWindows.of(Duration.standardMinutes(60))))
         .apply(Sum.integersPerKey());
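The slide leaves ParseFn undefined. A minimal sketch of what it might look like against the Dataflow 1.x SDK, assuming (hypothetically) log lines of the form "user,team,score":

class ParseFn extends DoFn<String, KV<String, Integer>> {
  @Override
  public void processElement(ProcessContext c) {
    // Hypothetical log format: "user,team,score".
    String[] fields = c.element().split(",");
    // Emit a (team, score) pair for the downstream per-key sum.
    c.output(KV.of(fields[1], Integer.parseInt(fields[2])));
  }
}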

Page 15:

Google Cloud Dataflow

● Given code in Dataflow (incubating as Apache Beam) SDK...

● Pipelines can run…

○ On your development machine

○ On the Dataflow Service on Google Cloud Platform

○ On third-party environments like Spark or Flink.
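A minimal sketch, assuming the Dataflow 1.x SDK, of constructing and running a pipeline; the runner named in the options decides which of these environments executes it:

// The runner (e.g. passed as --runner on the command line) selects local
// execution, the Dataflow service, or a third-party environment.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);
// ... apply the transforms from the previous slide ...
p.run();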

Page 16:

Google Cloud Dataflow

A fully managed cloud service and programming model for batch and streaming big data processing.

Page 17:

Google Cloud Dataflow

[Service diagram: a pipeline reading from GCS is optimized and scheduled by the Dataflow service, writing results back to GCS]

Page 18:

Back to the Problem at Hand

[Plot: auto-tuned parallelism tracking workload over time]

Page 19:

Auto-Tuning Ingredients

Signals measuring Workload
Policy making Decisions
Mechanism actuating Change

Page 20:

2. Pipeline Execution

Page 21:

Optimized Pipeline = DAG of Stages

[DAG figure: stage S0 consumes the raw input, stage S1 produces individual points, stage S2 produces team points]

Page 22:

Stage Throughput Measure

[DAG figure: throughput measured separately at each stage S0, S1, S2]

Page 23:

Picture by Alexandre Duret-Lutz, Creative Commons 2.0 Generic

Page 24:

Queues of Data Ready for Processing

[DAG figure: queues of pending data sit in front of stages S0, S1, S2]

Queue Size = Backlog

Page 25:

Backlog Growth vs. Backlog Size

Page 26:

Backlog Growth = Processing Deficit

Page 27:

Derived Signal: Stage Input Rate

[Figure: stage S1 with its measured throughput and backlog growth]

Input Rate = Throughput + Backlog Growth
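For example, a stage processing 80 MB/s whose backlog grows by 20 MB/s has an input rate of 100 MB/s.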

Page 28:

Constant Backlog... could be bad.

Page 29:

Backlog Time = Backlog Size / Throughput
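For example, a 10 GB backlog draining at 100 MB/s means a backlog time of 100 seconds.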

Page 30:

Backlog Time = Time to get through the backlog

Page 31:

Bad Backlog = Long Backlog Time

Page 32:

Backlog Growth and Backlog Time inform Upscaling.

What Signals indicate Downscaling?

Page 33:

Low CPU Utilization

Page 34:

Signals Summary

Throughput
Backlog growth
Backlog time
CPU utilization

Page 35:

Policy: making Decisions

Goals:
1. No backlog growth
2. Short backlog time
3. Reasonable CPU utilization

Page 36:

Upscaling Policy: Keeping Up

Given M machines.

For a stage, given:

  average stage throughput T

  average positive backlog growth G of the stage

Machines needed for the stage to keep up:

M' = M × (T + G) / T

Page 37:

Upscaling Policy: Catching Up

Given M machines.

Given R (time allowed to reduce the backlog).

For a stage, given:

  average backlog time B

Extra machines to remove the backlog:

Extra = M × B / R
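For example, with M = 10 machines, backlog time B = 30 s, and recovery target R = 60 s: Extra = 10 × 30 / 60 = 5 additional machines.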

Page 38:

Upscaling Policy: All Stages

Want all stages to:

1. keep up

2. have low backlog time

Pick the maximum over all stages of M' + Extra (a sketch combining both formulas follows).
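A minimal sketch, not taken from the slides, of how the keep-up and catch-up formulas might combine into a single scaling decision; the class, method, and parameter names are all hypothetical:

// Hypothetical policy sketch: for each stage, compute M' = M * (T + G) / T
// machines to keep up plus Extra = M * B / R machines to drain the backlog
// within R seconds, then take the maximum over all stages.
class UpscalingPolicy {
  static int desiredMachines(int m, double r, double[] throughput,
                             double[] backlogGrowth, double[] backlogTime) {
    double best = m;
    for (int s = 0; s < throughput.length; s++) {
      double t = throughput[s];
      double g = Math.max(0, backlogGrowth[s]); // only positive growth counts
      double keepUp = m * (t + g) / t;          // M' for this stage
      double extra = m * backlogTime[s] / r;    // machines to drain the backlog
      best = Math.max(best, keepUp + extra);
    }
    return (int) Math.ceil(best);
  }
}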

Pages 39–42: Example (signals)

[Animated plot sequence over four slides: input rate vs. throughput in MB/s, plus backlog growth and backlog time in seconds, evolving over time]

Pages 43–46: Example (policy)

[Animated plot sequence over four slides: machines M, keep-up target M', and Extra machines, with R = 60 s]

Page 47:

Preconditions for Downscaling

Low backlog time

No backlog growth

Low CPU utilization

Page 48:

How far can we downscale?

Stay tuned...

Page 49:

3. Adjusting Parallelism of a Running Streaming Pipeline

Mechanism: actuating Change

Pages 50–52: Optimized Pipeline = DAG of Stages

[Figure sequence over three slides: the DAG of stages S0, S1, S2, then the whole DAG placed on Machine 0]

Page 53:

Adding Parallelism

[Figure: stages S0, S1, S2 replicated into multiple copies, starting from Machine 0]

Page 54:

Adding Parallelism

[Figure: a full copy of stages S0, S1, S2 on Machine 0 and another on Machine 1]

Page 55:

Adding Parallelism = Splitting Key Ranges

[Figure: the key range of each stage S0, S1, S2 is split between Machine 0 and Machine 1]

Page 56:

Migrating a Computation

Page 57:

Adding Parallelism = Migrating Computation Ranges

[Figure: computation ranges of stages S0, S1, S2 migrating from Machine 0 to Machine 1]

Page 58:

Checkpoint and Recovery ~ Computation Migration

Page 59:

Key Ranges and Persistence

[Figure: four machines (Machine 0–3), each running stages S0, S1, S2 over its own slice of the key range, backed by persistent state]

Pages 60–62: Downscaling from 4 to 2 Machines

[Figure sequence over three slides: key ranges on two of the four machines are checkpointed and migrated, leaving two machines running stages S0, S1, S2]

Page 63:

Downscaling from 4 to 2 Machines

[Figure: two machines remain, each running stages S0, S1, S2]

Upsizing = Steps in Reverse

Page 64:

Granularity of Parallelism

As of March 2016, Google Cloud Dataflow:

• Splits Key Ranges initially Based on Max Machines

• At Max: 1 Logical Persistent Disk per Machine

Each disk has a slice of the key ranges from all stages

• Only (relatively) even Disk Distributions

• Results in Scaling Quanta
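For example, with a maximum of 60 machines there are 60 logical disks; running on 8 machines, four machines get 8 disks and four get 7 (the "7, 8" row in the table on the next slide).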

Page 65:

Example Scaling Quanta: Max = 60 Machines

Parallelism    Disks per Machine
     3         N/A
     4         15
     5         12
     6         10
     7         8, 9
     8         7, 8
     9         6, 7
    10         6
    12         5
    15         4
    20         3
    30         2
    60         1

Page 66:

Policy: making Decisions

Goals:
1. No backlog growth
2. Short backlog time
3. Reasonable CPU utilization

Page 67:

Preconditions for Downscaling

Low backlog time

No backlog growth

Low CPU utilization

Page 68:

Downscaling Policy

Next lower scaling quantum => M' machines

Estimate future per-machine CPU at M':

CPU_M' = CPU_M × M / M'

If the new CPU_M' < threshold (say 90%), downscale to M'.

Page 69:

4. Summary + Future Work

Page 70:

Artificial Experiment

Page 71:

Auto-Scaling Summary

Signals: throughput, backlog growth, backlog time, CPU utilization

Policy: keep up, reduce backlog, use CPUs

Mechanism: split key ranges, migrate computations

Page 72:

Future Work

• Experiment with non-uniform disk distributions to address hot ranges

• Dynamically splitting ranges finer than initially done

• Approximate model of the #VMs-to-throughput relation

Page 73:

Questions?

Further reading on the streaming model:

The world beyond batch: Streaming 101

The world beyond batch: Streaming 102