retro: targeted resource management in multi …...code change increases database usage, shifts...

Targeted Resource Management in Multi-tenant Distributed Systems Jonathan Mace Brown University Peter Bodik MSR Redmond Rodrigo Fonseca Brown University Madanlal Musuvathi MSR Redmond


Post on 27-Jul-2020


Targeted Resource Management in Multi-tenant Distributed Systems

Jonathan Mace, Brown University

Peter Bodik, MSR Redmond

Rodrigo Fonseca, Brown University

Madanlal Musuvathi, MSR Redmond

Resource Management in Multi-Tenant Systems

• Nov. 2014 – Visual Studio Online outage: code change increases database usage, shifts bottleneck to unmanaged application-level lock

• April 2011 – Amazon EBS failure: failure in availability zone cascades to shared control plane, causes thread pool starvation for all zones

• Aug. 2012 – Azure Storage outage: aggressive background task responds to increased hardware capacity with deluge of warnings and logging

• 2014 – Communication with Cloudera: shared storage layer bottlenecks circumvent resource management layer

Degraded performance, violated SLOs, system outages

OS, Hypervisor

Containers / VMs

Shared Systems: Storage, Database, Queueing, etc.

Monitors resource usage of each tenant in near real-time

Actively schedules tenants and activities

High-level, centralized policies: encapsulate resource management logic

Abstractions: not specific to resource type or system

Achieve different goals: guarantee average latencies, fair-share a resource, etc.

Hadoop Distributed File System (HDFS)

HDFS NameNode: filesystem metadata

HDFS DataNodes: replicated block storage

Example requests: Read, Rename

[Figure: plots over time for the HDFS NameNode and the HDFS DataNodes; axis label: time]

[Figure: Local storage; HDFS DataNode; HBase RegionServer; MapReduce Shuffler; Hadoop YARN NodeManager; Hadoop YARN Containers; MapReduce Tasks]

Goals

Co-ordinated control across processes, machines, and services

Handle system-level and application-level resources

Principals: tenants, background tasks

Real-time and reactive

Efficient: only control what is needed

Architecture

Tenant Requests

Workflows

Purpose: identify requests from different users and background activities

e.g., all requests from a tenant over time

e.g., data balancing in HDFS

Unit of resource measurement, attribution, and enforcement

Tracks a request across varying levels of granularity

Orthogonal to threads, processes, network flows, etc.
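A minimal sketch of the workflow idea, assuming a hypothetical in-process tagging scheme: every name here (current_workflow, record_usage, run_as) is invented for illustration and is not Retro's API. It shows resource usage being charged to a workflow ID that travels with the request rather than with a thread.

```python
import contextvars
from collections import defaultdict

# Hypothetical names for illustration only; not taken from Retro.
current_workflow = contextvars.ContextVar("current_workflow", default="unattributed")
usage_ms = defaultdict(float)  # (workflow_id, resource) -> time consumed in ms


def record_usage(resource: str, millis: float) -> None:
    """Charge resource time to whichever workflow is active in this context."""
    usage_ms[(current_workflow.get(), resource)] += millis


def run_as(workflow_id: str, fn, *args):
    """Run fn with workflow_id attached, without touching the caller's context."""
    ctx = contextvars.copy_context()

    def _tagged():
        current_workflow.set(workflow_id)
        return fn(*args)

    return ctx.run(_tagged)


def handle_read(block_count: int) -> None:
    # Made-up costs: 4 ms of disk time per block plus a short lock hold.
    record_usage("disk", 4.0 * block_count)
    record_usage("namenode_lock", 0.5)


run_as("tenant-A", handle_read, 3)         # a tenant request
run_as("hdfs-balancer", handle_read, 10)   # a background activity
```

The same handler code serves both principals; only the attached workflow ID decides where the usage is attributed.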

Workflows Resources

Purpose: cope with diversity of resources

What we need:

1. Identify overloaded resources

Slowdown: ratio of how slow the resource is now compared to its baseline performance with no contention.

Slowdown = (queue time + execute time) / execute time

e.g., 100 ms queue, 10 ms execute => slowdown 11

2. Identify culprit workflows

Load: fraction of current utilization that we can attribute to each workflow

Load = time spent executing

e.g., 10 ms execute => load 10
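The two metrics can be worked through in a few lines. This is a sketch under stated assumptions: the Op record and the choice to aggregate several operations by summing their queue and execute times are illustrative, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """One operation on a resource (hypothetical record, for illustration)."""
    workflow: str
    queue_ms: float
    execute_ms: float


def slowdown(ops):
    """(queue time + execute time) / execute time, summed over the ops."""
    total_exec = sum(o.execute_ms for o in ops)
    total_queue = sum(o.queue_ms for o in ops)
    if total_exec == 0:
        return 1.0  # assumption: an idle resource counts as uncontended
    return (total_queue + total_exec) / total_exec


def load(ops):
    """Time spent executing, per workflow."""
    per_workflow = {}
    for o in ops:
        per_workflow[o.workflow] = per_workflow.get(o.workflow, 0.0) + o.execute_ms
    return per_workflow


def load_fraction(ops):
    """Share of the resource's total load attributable to each workflow."""
    per_workflow = load(ops)
    total = sum(per_workflow.values())
    return {w: (v / total if total else 0.0) for w, v in per_workflow.items()}


# The slide's example: 100 ms queued, 10 ms executing.
single = [Op("tenant-A", queue_ms=100.0, execute_ms=10.0)]
print(slowdown(single), load(single))
```

Slowdown flags which resource is overloaded; the load fraction then points at the workflows responsible for that resource's utilization.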

Workflows Resources Control Points

Goal: enforce resource management decisions

Decoupled from resources

Rate-limits workflows, agnostic to underlying implementation, e.g.,

token bucket

priority queue
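As an illustration of one possible control point implementation, here is a minimal per-workflow token bucket. The class and method names (TokenBucket, ControlPoint, set_rate, admit) and the default burst size are assumptions made for this sketch; the slides name only the technique.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def _refill(self):
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def set_rate(self, rate):
        self._refill()  # settle tokens earned at the old rate first
        self.rate = rate

    def try_acquire(self, cost=1.0):
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class ControlPoint:
    """One bucket per workflow; workflows without a bucket pass unthrottled."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._buckets = {}

    def set_rate(self, workflow, rate, burst=None):
        bucket = self._buckets.get(workflow)
        if bucket is None:
            # Assumption: default burst equals one second's worth of tokens.
            capacity = rate if burst is None else burst
            self._buckets[workflow] = TokenBucket(rate, capacity, self._clock)
        else:
            bucket.set_rate(rate)

    def admit(self, workflow, cost=1.0):
        bucket = self._buckets.get(workflow)
        return True if bucket is None else bucket.try_acquire(cost)
```

A caller would invoke admit() before touching the resource and queue or delay the request when it returns False. Because the control point only sees workflow IDs and rates, it stays decoupled from whichever resource sits behind it.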

Workflows Resources Control Points

[Figure: Policies; Retro Controller API; Pervasive Measurement; Distributed Enforcement]

1. Pervasive Measurement: aggregated locally, then reported centrally once per second

2. Centralized Controller: global, abstracted view of the system

Policies run in a continuous control loop

3. Distributed Enforcement: co-ordinates enforcement using a distributed token bucket
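The measure, decide, enforce cycle can be sketched as follows. This is a simplified single-process model under stated assumptions: LocalAggregator, merge_reports, and control_loop_once are hypothetical names, and the real system reports over the network once per second rather than through direct calls.

```python
from collections import defaultdict


class LocalAggregator:
    """Per-process accumulator: sums usage locally between reports."""

    def __init__(self):
        self._acc = defaultdict(float)

    def record(self, workflow, resource, execute_ms):
        self._acc[(workflow, resource)] += execute_ms

    def flush(self):
        """Return the interval's totals and reset (called once per interval)."""
        report = dict(self._acc)
        self._acc.clear()
        return report


def merge_reports(reports):
    """Controller side: combine per-process reports into one global view."""
    view = defaultdict(float)
    for report in reports:
        for key, value in report.items():
            view[key] += value
    return dict(view)


def control_loop_once(aggregators, policies):
    """One iteration: gather reports, build the view, let each policy decide."""
    view = merge_reports(agg.flush() for agg in aggregators)
    return [policy(view) for policy in policies]


# Two processes report usage for the same tenant and a background task.
datanode, namenode = LocalAggregator(), LocalAggregator()
datanode.record("tenant-A", "disk", 10.0)
datanode.record("tenant-A", "disk", 5.0)
namenode.record("tenant-A", "disk", 1.0)
namenode.record("balancer", "network", 7.0)


def heaviest(view):
    """Toy policy: name the (workflow, resource) pair with the most usage."""
    return max(view, key=view.get)


decisions = control_loop_once([datanode, namenode], [heaviest])
```

Aggregating locally keeps the per-operation cost small, and the policy only ever sees the merged view, which is what lets it stay independent of individual processes.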

“Control Plane” for resource management

[Figure: Policies; Retro Controller API; Pervasive Measurement; Distributed Enforcement; Workflows, Resources, Control Points]

Global, abstracted view of the system

Easier to write

Reusable

Example: LatencySLO Policy

H: High Priority Workflows, “200ms average request latency”

L: Low Priority Workflows (use spare capacity)

monitor latencies

attribute interference

throttle interfering workflows

1 foreach candidate in H2 miss[candidate] = latency(candidate) / guarantee[candidate]3 W = candidate in H with max miss[candidate]

4 foreach rsrc in resources() // calculate importance of each resource for hipri5 importance[rsrc] = latency(W, rsrc) * log(slowdown(rsrc))

6 foreach lopri in L // calculate low priority workflow interference7 interference[lopri] = Σrsrc importance[rsrc] * load(lopri, rsrc) / load(rsrc)

8 foreach lopri in L // normalize interference9 interference[lopri] /= Σk interference[k]

10 foreach lopri in L11 if miss[W] > 1 // throttle12 scalefactor = 1 – α * (miss[W] – 1) * interference[lopri]13 else // release14 scalefactor = 1 + β

15 foreach cpoint in controlpoints() // apply new rates16 set_rate(cpoint, lopri, scalefactor * get_rate(cpoint, lopri)

Example: LatencySLOH High Priority Workflows

Policy

L Low Priority Workflows

Select the high priority workflow Wwith worst performance

Page 89: Retro: Targeted Resource Management in Multi …...Code change increases database usage, shifts bottleneck to unmanaged application-level lock • Nov. 2014 –Visual Studio Online

24

1 foreach candidate in H2 miss[candidate] = latency(candidate) / guarantee[candidate]3 W = candidate in H with max miss[candidate]

4 foreach rsrc in resources() // calculate importance of each resource for hipri5 importance[rsrc] = latency(W, rsrc) * log(slowdown(rsrc))

6 foreach lopri in L // calculate low priority workflow interference7 interference[lopri] = Σrsrc importance[rsrc] * load(lopri, rsrc) / load(rsrc)

8 foreach lopri in L // normalize interference9 interference[lopri] /= Σk interference[k]

10 foreach lopri in L11 if miss[W] > 1 // throttle12 scalefactor = 1 – α * (miss[W] – 1) * interference[lopri]13 else // release14 scalefactor = 1 + β

15 foreach cpoint in controlpoints() // apply new rates16 set_rate(cpoint, lopri, scalefactor * get_rate(cpoint, lopri)

Example: LatencySLOH High Priority Workflows

Policy

L Low Priority Workflows

Select the high priority workflow Wwith worst performance

Weight low priority workflows by their interference with W

Page 90: Retro: Targeted Resource Management in Multi …...Code change increases database usage, shifts bottleneck to unmanaged application-level lock • Nov. 2014 –Visual Studio Online

24

1 foreach candidate in H2 miss[candidate] = latency(candidate) / guarantee[candidate]3 W = candidate in H with max miss[candidate]

4 foreach rsrc in resources() // calculate importance of each resource for hipri5 importance[rsrc] = latency(W, rsrc) * log(slowdown(rsrc))

6 foreach lopri in L // calculate low priority workflow interference7 interference[lopri] = Σrsrc importance[rsrc] * load(lopri, rsrc) / load(rsrc)

8 foreach lopri in L // normalize interference9 interference[lopri] /= Σk interference[k]

10 foreach lopri in L11 if miss[W] > 1 // throttle12 scalefactor = 1 – α * (miss[W] – 1) * interference[lopri]13 else // release14 scalefactor = 1 + β

15 foreach cpoint in controlpoints() // apply new rates16 set_rate(cpoint, lopri, scalefactor * get_rate(cpoint, lopri)

Example: LatencySLOH High Priority Workflows

Policy

L Low Priority Workflows

Select the high priority workflow Wwith worst performance

Weight low priority workflows by their interference with W

Throttle low priority workflows proportionally to their weight

Page 91: Retro: Targeted Resource Management in Multi …...Code change increases database usage, shifts bottleneck to unmanaged application-level lock • Nov. 2014 –Visual Studio Online

25

1 foreach candidate in H2 miss[candidate] = latency(candidate) / guarantee[candidate]3 W = candidate in H with max miss[candidate]

4 foreach rsrc in resources() // calculate importance of each resource for hipri5 importance[rsrc] = latency(W, rsrc) * log(slowdown(rsrc))

6 foreach lopri in L // calculate low priority workflow interference7 interference[lopri] = Σrsrc importance[rsrc] * load(lopri, rsrc) / load(rsrc)

8 foreach lopri in L // normalize interference9 interference[lopri] /= Σk interference[k]

10 foreach lopri in L11 if miss[W] > 1 // throttle12 scalefactor = 1 – α * (miss[W] – 1) * interference[lopri]13 else // release14 scalefactor = 1 + β

15 foreach cpoint in controlpoints() // apply new rates16 set_rate(cpoint, lopri, scalefactor * get_rate(cpoint, lopri)

Example: LatencySLOH High Priority Workflows

Policy

L Low Priority Workflows

Weight low priority workflows by their interference with W

Select the high priority workflow Wwith worst performance

Throttle low priority workflows proportionally to their weight

Page 92: Retro: Targeted Resource Management in Multi …...Code change increases database usage, shifts bottleneck to unmanaged application-level lock • Nov. 2014 –Visual Studio Online

26

1 foreach candidate in H2 miss[candidate] = latency(candidate) / guarantee[candidate]3 W = candidate in H with max miss[candidate]

4 foreach rsrc in resources() // calculate importance of each resource for hipri5 importance[rsrc] = latency(W, rsrc) * log(slowdown(rsrc))

6 foreach lopri in L // calculate low priority workflow interference7 interference[lopri] = Σrsrc importance[rsrc] * load(lopri, rsrc) / load(rsrc)

8 foreach lopri in L // normalize interference9 interference[lopri] /= Σk interference[k]

10 foreach lopri in L11 if miss[W] > 1 // throttle12 scalefactor = 1 – α * (miss[W] – 1) * interference[lopri]13 else // release14 scalefactor = 1 + β

15 foreach cpoint in controlpoints() // apply new rates16 set_rate(cpoint, lopri, scalefactor * get_rate(cpoint, lopri)

Example: LatencySLOH High Priority Workflows

Policy

L Low Priority Workflows

Throttle low priority workflows proportionally to their weight

Select the high priority workflow Wwith worst performance

Weight low priority workflows by their interference with W

Page 93: Retro: Targeted Resource Management in Multi …...Code change increases database usage, shifts bottleneck to unmanaged application-level lock • Nov. 2014 –Visual Studio Online

27

1 foreach candidate in H2 miss[candidate] = latency(candidate) / guarantee[candidate]3 W = candidate in H with max miss[candidate]

4 foreach rsrc in resources() // calculate importance of each resource for hipri5 importance[rsrc] = latency(W, rsrc) * log(slowdown(rsrc))

6 foreach lopri in L // calculate low priority workflow interference7 interference[lopri] = Σrsrc importance[rsrc] * load(lopri, rsrc) / load(rsrc)

8 foreach lopri in L // normalize interference9 interference[lopri] /= Σk interference[k]

10 foreach lopri in L11 if miss[W] > 1 // throttle12 scalefactor = 1 – α * (miss[W] – 1) * interference[lopri]13 else // release14 scalefactor = 1 + β

15 foreach cpoint in controlpoints() // apply new rates16 set_rate(cpoint, lopri, scalefactor * get_rate(cpoint, lopri)

Example: LatencySLOH High Priority Workflows

Policy

L Low Priority Workflows

Select the high priority workflow Wwith worst performance

Weight low priority workflows by their interference with W

Throttle low priority workflows proportionally to their weight

Page 94: Retro: Targeted Resource Management in Multi …...Code change increases database usage, shifts bottleneck to unmanaged application-level lock • Nov. 2014 –Visual Studio Online

Example: LatencySLO Policy

H: High Priority Workflows
L: Low Priority Workflows

Select the high priority workflow W with worst performance:

    foreach candidate in H
        miss[candidate] = latency(candidate) / guarantee[candidate]
    W = candidate in H with max miss[candidate]

Weight low priority workflows by their interference with W:

    foreach rsrc in resources()    // calculate importance of each resource for hipri
        importance[rsrc] = latency(W, rsrc) * log(slowdown(rsrc))
    foreach lopri in L             // calculate low priority workflow interference
        interference[lopri] = Σ_rsrc importance[rsrc] * load(lopri, rsrc) / load(rsrc)
    foreach lopri in L             // normalize interference
        interference[lopri] /= Σ_k interference[k]

Throttle low priority workflows proportionally to their weight:

    foreach lopri in L
        if miss[W] > 1             // throttle
            scalefactor = 1 - α * (miss[W] - 1) * interference[lopri]
        else                       // release
            scalefactor = 1 + β
        foreach cpoint in controlpoints()    // apply new rates
            set_rate(cpoint, lopri, scalefactor * get_rate(cpoint, lopri))
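The policy pseudocode on this slide can be sketched as one runnable control round. This is a minimal illustration, not Retro's implementation: the function name, the dictionary-based inputs, the default values of alpha and beta, and the clamp that keeps the scale factor non-negative are all assumptions made for the example.

```python
import math


def latency_slo_round(guarantee, latency, resource_latency, slowdown,
                      load, total_load, rates, alpha=0.5, beta=0.05):
    """Run one LatencySLO control round and return a new rate table.

    guarantee[h], latency[h]   SLO target and measured latency of hipri workflow h
    resource_latency[h][r]     time workflow h spends at resource r
    slowdown[r]                slowdown of resource r (assumed >= 1.0)
    load[lo][r], total_load[r] load of lopri workflow lo on r, and overall load on r
    rates[c][lo]               current rate of lopri workflow lo at control point c
    """
    # Select the high priority workflow W with the worst SLO miss.
    miss = {h: latency[h] / guarantee[h] for h in guarantee}
    worst = max(miss, key=miss.get)

    # Importance of each resource to W.
    importance = {r: resource_latency[worst].get(r, 0.0) * math.log(s)
                  for r, s in slowdown.items()}

    # Interference of each low priority workflow, normalized to sum to 1.
    interference = {
        lo: sum(importance[r] * per_rsrc.get(r, 0.0) / total_load[r]
                for r in importance if total_load.get(r, 0.0) > 0)
        for lo, per_rsrc in load.items()
    }
    norm = sum(interference.values())
    if norm > 0:
        interference = {lo: v / norm for lo, v in interference.items()}

    # Throttle or release, then apply the new rate at every control point.
    new_rates = {c: dict(per_wf) for c, per_wf in rates.items()}
    for lo in load:
        if miss[worst] > 1:  # throttle
            scale = 1 - alpha * (miss[worst] - 1) * interference[lo]
            scale = max(scale, 0.0)  # assumption: never drive a rate negative
        else:  # release
            scale = 1 + beta
        for c in new_rates:
            if lo in new_rates[c]:
                new_rates[c][lo] = scale * rates[c][lo]
    return new_rates


# Hypothetical measurements: the disk is the slow resource, and "scan"
# places three times as much load on it as "mkdir" does.
example = latency_slo_round(
    guarantee={"hbase_read": 0.2},
    latency={"hbase_read": 0.4},
    resource_latency={"hbase_read": {"disk": 0.3, "cpu": 0.1}},
    slowdown={"disk": math.e, "cpu": 1.0},
    load={"scan": {"disk": 30, "cpu": 5}, "mkdir": {"disk": 10, "cpu": 50}},
    total_load={"disk": 100, "cpu": 100},
    rates={"hdfs_queue": {"scan": 100.0, "mkdir": 100.0}},
)
```

In this made-up input, "scan" is throttled harder than "mkdir" because it accounts for more of the load on the resource that is slowing down the high priority workflow.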

Other types of policy…

Policy: Bottleneck Fairness
  Detect most overloaded resource
  Fair-share resource between tenants using it

Policy: Dominant Resource Fairness
  Estimate demands and capacities from measurements

Concise
Any resource can be the bottleneck (policy doesn't care)
Not system specific
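The two Bottleneck Fairness steps can be illustrated with a short sketch. It assumes the bottleneck is the resource with the highest measured slowdown and that "fair share" means a max-min fair split of an estimated capacity; the function names and input shapes are hypothetical and are not taken from Retro.

```python
def max_min_fair(demands, capacity):
    """Max-min fair split of capacity: small demands are fully satisfied,
    and whatever remains is divided evenly among the larger ones."""
    allocation = {}
    remaining = capacity
    pending = len(demands)
    for tenant, demand in sorted(demands.items(), key=lambda kv: kv[1]):
        share = remaining / pending
        allocation[tenant] = min(demand, share)
        remaining -= allocation[tenant]
        pending -= 1
    return allocation


def bottleneck_fairness(slowdown, load, capacity):
    """Pick the most overloaded resource and fair-share it among its users.

    slowdown[r]   measured slowdown of resource r
    load[t][r]    load tenant t places on resource r
    capacity[r]   estimated capacity of resource r (an assumed input)
    """
    bottleneck = max(slowdown, key=slowdown.get)
    demands = {t: per_rsrc[bottleneck]
               for t, per_rsrc in load.items()
               if per_rsrc.get(bottleneck, 0) > 0}
    return bottleneck, max_min_fair(demands, capacity[bottleneck])


# Hypothetical measurements: disk is the bottleneck; tenant "d" does not use it.
resource, shares = bottleneck_fairness(
    slowdown={"disk": 8.0, "cpu": 1.2},
    load={"a": {"disk": 10, "cpu": 5}, "b": {"disk": 40},
          "c": {"disk": 60}, "d": {"cpu": 9}},
    capacity={"disk": 90, "cpu": 100},
)
```

Nothing in the sketch names a particular resource or system, which is the property the slide highlights: the same policy applies whichever resource turns out to be the bottleneck.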

Evaluation

Instrumentation

Retro implementation in Java
  Instrumentation library
  Central controller implementation

To enable Retro
  Propagate Workflow ID within application (like X-Trace, Dapper)
  Instrument resources with wrapper classes

Overheads
  Resource instrumentation automatic using AspectJ
  Overall 50-200 lines per system to modify RPCs
  Retro overhead: max 1-2% on throughput, latency
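The two instrumentation steps, propagating a workflow ID and wrapping resources, can be sketched as follows. Retro itself is implemented in Java; this Python version only illustrates the idea, and the names current_workflow, InstrumentedLock, and handle_request are hypothetical.

```python
import contextvars
import threading
import time
from collections import defaultdict

# Workflow ID carried along with the request being executed (assumed name).
current_workflow = contextvars.ContextVar("current_workflow", default="unknown")


class InstrumentedLock:
    """Wrapper around threading.Lock that attributes wait time and hold time
    to whichever workflow is current when the lock is used."""

    def __init__(self):
        self._lock = threading.Lock()
        self._acquired_at = 0.0
        self.wait_time = defaultdict(float)  # workflow -> seconds spent waiting
        self.hold_time = defaultdict(float)  # workflow -> seconds spent holding

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        self._acquired_at = time.perf_counter()
        # Counters are updated while the lock is held, so updates do not race.
        self.wait_time[current_workflow.get()] += self._acquired_at - start
        return self

    def __exit__(self, exc_type, exc, tb):
        self.hold_time[current_workflow.get()] += time.perf_counter() - self._acquired_at
        self._lock.release()
        return False


def handle_request(workflow_id, lock):
    """Tag the request with its workflow ID, then use the wrapped resource."""
    token = current_workflow.set(workflow_id)
    try:
        with lock:
            time.sleep(0.001)  # stand-in for work done under the lock
    finally:
        current_workflow.reset(token)


namenode_lock = InstrumentedLock()
handle_request("tenant-a", namenode_lock)
handle_request("background-replication", namenode_lock)
```

The application code inside handle_request uses the lock as usual; only the wrapper knows about workflows, which is why instrumenting a resource does not require changing its callers.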

Experiments

Systems
  HDFS
  HBase
  YARN MapReduce
  ZooKeeper

Workflows
  MapReduce Jobs (HiBench)
  HBase (YCSB)
  HDFS clients
  Background Data Replication
  Background Heartbeats

Resources
  CPU, Disk, Network (All systems)
  Locks, Queues (HDFS, HBase)

Policies
  LatencySLO
  Bottleneck Fairness
  Dominant Resource Fairness

Policies for a mixture of systems, workflows, and resources
Results on clusters up to 200 nodes

See paper for full experiment results
This talk: LatencySLO policy results

[Figure: SLO-normalized latency (log scale, 0.1 to 1000) vs. time [minutes] (0 to 30), with the SLO target marked. Workflows: HDFS read 8k; HBase read 1 cached row; HBase read 1 row; HBase Cached Table Scan; HDFS mkdir; HBase Table Scan.]

[Figure: slowdown (0 to 15) vs. time [minutes] (0 to 30), labeled Disk, CPU, HDFS NN Lock, HDFS NN Queue, and HBase Queue; shown together with SLO-normalized latency (log scale, 0.1 to 1000) vs. time [minutes], with the SLO target marked. Workflows: HDFS read 8k; HBase read 1 cached row; HBase read 1 row; HBase Cached Table Scan; HDFS mkdir; HBase Table Scan.]

[Figure: SLO-normalized latency vs. time [minutes] (0 to 30) in two panels, each with the SLO target marked: one on a log scale from 0.1 to 1000, and one labeled "+ SLO Policy Enabled" on a log scale from 0.1 to 10. Workflows: HDFS read 8k; HBase read 1 cached row; HBase read 1 row; HBase Cached Table Scan; HDFS mkdir; HBase Table Scan.]

[Figure: SLO-normalized latency (log scale, 0.1 to 10) vs. time [minutes] (0 to 30). Additional panels: HBase Cached Table Scan (axis 0.5 to 1), HBase Table Scan (axis 0.2 to 1), HDFS mkdir (axis 0.5 to 1). Workflows: HDFS read 8k; HBase read 1 cached row; HBase read 1 row; HBase Cached Table Scan; HDFS mkdir; HBase Table Scan.]

Conclusion

Resource management for shared distributed systems

Centralized resource management

Comprehensive: resources, processes, tenants, background tasks

Abstractions for writing concise, general-purpose policies:
  Workflows
  Resources (slowdown, load)
  Control points

http://cs.brown.edu/~jcmace
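To make the control point abstraction concrete, here is a minimal sketch of a per-workflow rate limiter exposing get_rate and set_rate. The token-bucket design, the class name, the burst_seconds parameter, and the explicit now argument (used so the example is deterministic) are assumptions for illustration, not a description of Retro's code.

```python
class TokenBucketControlPoint:
    """Hypothetical control point: one token bucket per workflow."""

    def __init__(self, burst_seconds=1.0):
        self.burst_seconds = burst_seconds
        self._rate = {}    # workflow -> permitted requests per second
        self._tokens = {}  # workflow -> tokens currently available
        self._stamp = {}   # workflow -> time of the last refill

    def get_rate(self, workflow):
        # Workflows without a configured rate are treated as unthrottled.
        return self._rate.get(workflow, float("inf"))

    def set_rate(self, workflow, rate, now):
        self._refill(workflow, now)  # settle tokens earned at the old rate
        self._rate[workflow] = rate

    def _refill(self, workflow, now):
        rate = self._rate.get(workflow)
        if rate is None:
            return
        cap = rate * self.burst_seconds
        elapsed = now - self._stamp.get(workflow, now)
        earned = self._tokens.get(workflow, cap) + elapsed * rate
        self._tokens[workflow] = min(cap, earned)
        self._stamp[workflow] = now

    def try_acquire(self, workflow, now, cost=1.0):
        """Return True if the request may proceed at time `now`."""
        if workflow not in self._rate:
            return True
        self._refill(workflow, now)
        if self._tokens[workflow] >= cost:
            self._tokens[workflow] -= cost
            return True
        return False


point = TokenBucketControlPoint()
point.set_rate("table-scan", 2.0, now=0.0)
decisions = [point.try_acquire("table-scan", now=t) for t in (0.0, 0.0, 0.0, 0.5, 0.5)]
```

A policy only ever calls get_rate and set_rate; how the control point enforces the rate is hidden behind that interface, which is what keeps policies independent of any one system.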