USENIX NSDI 2016 (Session: Resource Sharing)

USENIX NSDI 2016, Session: Resource Sharing, 2016-05-29, @oraccha


  • USENIX NSDI 2016, Session: Resource Sharing, 2016-05-29, @oraccha

  • Co-located Events
    ACM Symposium on SDN Research 2016 (SOSR), March 13-17, 2016
    Open Networking Summit (ONS), March 14-17
    The 12th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS16), March 17-19
    The 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI16)
    The USENIX Workshop on Cool Topics in Sustainable Data Centers (CoolDC16), March 19

  • Session: Resource Sharing
    Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica, University of California, Berkeley

    Cliffhanger: Scaling Performance Cliffs in Web Memory Caches. Asaf Cidon and Assaf Eisenman, Stanford University; Mohammad Alizadeh, MIT CSAIL; Sachin Katti, Stanford University

    FairRide: Near-Optimal, Fair Cache Sharing. Qifan Pu and Haoyuan Li, University of California, Berkeley; Matei Zaharia, Massachusetts Institute of Technology; Ali Ghodsi and Ion Stoica, University of California, Berkeley

    HUG: Multi-Resource Fairness for Correlated and Elastic Demands. Mosharaf Chowdhury, University of Michigan; Zhenhua Liu, Stony Brook University; Ali Ghodsi and Ion Stoica, University of California, Berkeley, and Databricks Inc.

  • Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics
    Who? UCB AMPLab (the group behind Spark and Mesos); SoCC'12, EuroSys'13, OSDI'14, SIGMOD'16

    What?

    Figure: Do choices matter? Running time (s) across cluster configurations (1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, 16 r3.large) for Matrix Multiply (400K by 1K) and QR Factorization (1M by 1K); one workload is network bound, the other memory-bandwidth bound.

    Figure: Do choices matter? Matrix multiply (matrix size: 400K by 1K), time (s) for 1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, and 16 r3.large. Every configuration has the same totals: cores = 16, memory = 244 GB, cost = $2.66/hr.

    KeystoneML TIMIT pipeline: Raw Data -> Cosine Transform -> Normalization -> Linear Solver (~100 iterations).

    Properties: iterative (each iteration runs many jobs), long running and expensive, numerically intensive.

    Figure: Actual vs. ideal scaling, time (s) vs. cores (0-600), r3.4xlarge instances, QR factorization 1M by 1K.

    Do choices matter? Computation + communication leads to non-linear scaling.

  • Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics

    How? Run a small number of training jobs on samples of the input data, then fit a performance model to predict running time at scale.

    Optimal design of experiments: choose which (input fraction, machines) configurations to run as training jobs, over a grid of input sizes (1%, 2%, 4%, 8%) and machines (1, 2, 4, 8); solved with an off-the-shelf solver (CVX).
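    As an illustration of this step, here is a minimal A-optimal experiment-design sketch in Python with cvxpy (a Python analogue of the CVX solver named on the slide); the configuration grid matches the slide, but the cost proxy, budget, and feature set are assumptions for illustration, not Ernest's exact formulation:

    import numpy as np
    import cvxpy as cp

    # Candidate configurations over the slide's grid.
    configs = [(m, f) for m in (1, 2, 4, 8) for f in (0.01, 0.02, 0.04, 0.08)]
    A = np.array([[1.0, f / m, np.log(m), m] for m, f in configs])  # model features
    cost = np.array([m * f for m, f in configs])   # rough cost proxy: machines x data

    lam = cp.Variable(len(configs), nonneg=True)   # how much to use each config
    M = sum(lam[i] * np.outer(A[i], A[i]) for i in range(len(configs)))
    # A-optimal design: minimize trace(M^{-1}), i.e., the overall variance of
    # the fitted model, subject to an experiment budget.
    trace_inv = sum(cp.matrix_frac(np.eye(4)[:, j], M) for j in range(4))
    prob = cp.Problem(cp.Minimize(trace_inv), [lam <= 1, cost @ lam <= 0.5])
    prob.solve()
    chosen = [c for c, l in zip(configs, lam.value) if l > 1e-3]  # configs to run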

    Using Ernest: given a job binary, experiment design picks (machines, input size) pairs; training jobs are run (using only a few iterations), and a linear model is fit to the measured times.

    Figure: Ernest's predicted time vs. number of machines (1 to 900), time axis 0-1000.

    Basic model:

    time = x1 + x2 * (input / machines) + x3 * log(machines) + x4 * machines

    where x1 captures serial execution, x2 computation that scales linearly with per-machine input, x3 a tree (aggregation) DAG, and x4 an all-to-one DAG. Collect training data, then fit a linear regression.
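    A minimal sketch of collecting training data and fitting the basic model in Python (the training runs below are made-up numbers; the non-negative least squares fit keeps every coefficient physically meaningful, which is the spirit of Ernest's fitting step):

    import numpy as np
    from scipy.optimize import nnls

    def features(machines, input_frac):
        # The four terms of the basic model above.
        return [1.0, input_frac / machines, np.log(machines), machines]

    # Hypothetical training runs: (machines, input fraction, measured time in s).
    runs = [(1, 0.01, 4.2), (2, 0.02, 4.5), (4, 0.04, 5.0),
            (8, 0.08, 5.9), (2, 0.01, 3.4), (8, 0.02, 4.1), (16, 0.08, 6.8)]
    A = np.array([features(m, f) for m, f, _ in runs])
    b = np.array([t for _, _, t in runs])
    x, _ = nnls(A, b)  # non-negative coefficients x1..x4

    def predict(machines, input_frac=1.0):
        # Extrapolate to the full input and larger clusters.
        return float(np.dot(features(machines, input_frac), x))

    print(predict(64))  # predicted time on 64 machines with 100% of the data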

  • Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics Results

    Training time: KeystoneML. TIMIT pipeline on r3.xlarge instances, 100 iterations. Experiment design: 7 data points, up to 16 machines, up to 10% of the data.

    Figure: training time vs. running time (s, 0-6000) on 42 machines.

    Is experiment design useful?

    Figure: prediction error (%, 0-100) for Regression, Classification, KMeans, PCA, and TIMIT, comparing experiment design with cost-based selection of training points.

  • Cliffhanger: Scaling Performance Cliffs in Web Memory Caches
    Who? Stanford CS (first author Asaf Cidon is also CEO of Sookasa); SIGCOMM'12, USENIX ATC'13, '15

    What? Performance cliffs in web memory caches such as Memcached with its slab allocator.

    Figure: Hit-rate curve (hit rate vs. number of items in the LRU queue, 0-18000) with its concave hull, for Application 19, Slab 0; the gap between the curve and the hull is a performance cliff (cf. Talus [HPCA15]).

    +1% cache hit rate can mean a +35% speedup: the cache hit rate of Facebook's Memcached pool is 98.2% [SIGMETRICS12].

  • Cliffhanger: Scaling Performance Cliffs in Web Memory Caches
    How? Shadow queues. A hill-climbing algorithm estimates the local gradient of each queue's (slab's) hit-rate curve and incrementally moves memory toward the queue that benefits most; a cliff-scaling algorithm handles performance cliffs. (A code sketch follows the next two slides.)

    Figure: Using shadow queues to estimate the local gradient. Each physical queue (Queue 1, Queue 2) is paired with a shadow queue that remembers recently evicted keys; shadow-queue hits earn credits (e.g., Queue 1: +2, Queue 2: -2) that drive resizing of the queues.

    Cliffhanger runs both algorithms in parallel: the original queue is partitioned into two queues, with shadow queues tracking the left and the right of the partition pointer as well as the hill climbing. Algorithm 1 incrementally optimizes memory across queues, across slab classes, and across applications; Algorithm 2 scales performance cliffs.
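    A rough, illustrative sketch of the shadow-queue/hill-climbing idea in Python (class and parameter names are invented; the real system resizes in byte-sized chunks and uses credits, as described above):

    from collections import OrderedDict
    import random

    CHUNK = 1  # capacity moved per resize step (items here; bytes in the paper)

    class ShadowLRU:
        # Physical LRU queue plus a key-only shadow queue of recent evictions.
        def __init__(self, capacity, shadow_capacity):
            self.capacity = capacity
            self.shadow_capacity = shadow_capacity
            self.items = OrderedDict()   # key -> value
            self.shadow = OrderedDict()  # keys only

        def lookup(self, key):
            # Returns (hit, shadow_hit) and admits the key on a miss.
            if key in self.items:
                self.items.move_to_end(key)
                return True, False
            shadow_hit = key in self.shadow
            if shadow_hit:
                del self.shadow[key]
            self.items[key] = None
            while len(self.items) > self.capacity:
                evicted, _ = self.items.popitem(last=False)
                self.shadow[evicted] = None
                if len(self.shadow) > self.shadow_capacity:
                    self.shadow.popitem(last=False)
            return False, shadow_hit

    def hill_climb(queues, trace):
        # trace: iterable of (queue_index, key). A shadow hit in queue i means a
        # little more memory would have turned a miss into a hit, so grow queue i
        # at the expense of another queue.
        for qi, key in trace:
            _, shadow_hit = queues[qi].lookup(key)
            if shadow_hit:
                donors = [j for j in range(len(queues))
                          if j != qi and queues[j].capacity > CHUNK]
                if donors:
                    queues[random.choice(donors)].capacity -= CHUNK
                    queues[qi].capacity += CHUNK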

  • Cliffhanger: Scaling Performance Cliffs in Web Memory Caches
    Results. (Fairness across users is out of scope here; that is FairRide's topic.)

    Cliffhanger reduces misses and can save memory: average misses reduced by 36.7%, average potential memory savings 45%.

    Cliffhanger outperforms default and optimized schemes: average Cliffhanger hit-rate increase 1.2%.

  • FairRide: Near-Optimal, Fair Cache Sharing
    Who? UCB AMPLab; MobiCom'13, SIGCOMM'15

    What? Three desirable properties: isolation guarantee, strategy-proofness, and Pareto efficiency.

    Figure: Statically allocated caches (one per user, in front of the backend storage/network) give isolation and strategy-proofness; a globally shared cache gives higher utilization and lets users share data. What we want is both.

    Properties: among existing policies (max-min fairness, priority allocation, max-min rate, static allocation), none achieves all three of isolation guarantee, strategy-proofness, and Pareto efficiency (SIP) simultaneously. FairRide is near-optimal: it keeps the isolation guarantee and strategy-proofness while giving up only a small amount of Pareto efficiency.

  • FairRide: Near-Optimal, Fair Cache Sharing
    How? Max-min fairness plus probabilistic blocking as a dis-incentive to cheating. Implemented on Alluxio (Tachyon) [SoCC14].

    Figure 3: Example with 2 users, 3 files, and a total cache size of 2. Numbers represent access frequencies. (a) Allocation under max-min fairness; (b) allocation under max-min fairness when the second user makes spurious accesses (red line) to file C and free-rides on file A; (c) blocking the free-riding access (blue dotted line).

    3.3 Cheating. While max-min fairness is strategy-proof when users access different files, this is no longer the case when files are shared. There are two types of cheating that could break strategy-proofness: (1) intuitively, when files are shared, a user can free-ride on files that have already been cached by other users; (2) a thrifty user can choose to cache files that are shared by more users, as such files are more economical due to cost-sharing.

    Free-riding. To illustrate free-riding, consider two users: user 1 accesses files A and B, and user 2 accesses files A and C. Assume the cache size is 2, and that we can cache a fraction of a file. Next, assume that every user uses the LFU replacement policy and that both users access A much more frequently than the other files. As a result, the system will cache file A and charge each user 1/2. In addition, each user will get half of their other file in the cache, i.e., half of file B for user 1, and half of file C for user 2, as shown in Figure 3(a). Each user gets a cache hit rate of 5*0.5 + 10 = 12.5 hits/sec.[1]

    Now assume user 2 cheats by spuriously accessing file C to artificially increase its access rate so that it exceeds A's access rate (Figure 3(b)), effectively setting the priority of C higher than B's. Since C now has the highest access rate for user 2, while A remains the most accessed file of user 1, the system will cache A for user 1 and C for user 2, respectively. The problem is that user 2 will still be able to benefit from accessing file A, which has already been cached by user 1. In the end, user 1 gets 10 hits/sec, and user 2 gets 15 hits/sec. In this way, user 2 free-rides on user 1's file A.
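    The arithmetic behind these hit rates, written out as a small Python check (access frequencies A=10, B=5, C=5 and cache size 2, as in Figure 3):

    def hit_rate(freqs, cached_frac):
        # Hit rate = access rate x fraction of the file cached (see footnote 1).
        return sum(f * cached_frac.get(name, 0.0) for name, f in freqs.items())

    user1 = {"A": 10, "B": 5}
    user2 = {"A": 10, "C": 5}

    # (a) Honest max-min: A fully cached (shared), half of B and half of C.
    print(hit_rate(user1, {"A": 1.0, "B": 0.5}))  # 12.5 hits/sec
    print(hit_rate(user2, {"A": 1.0, "C": 0.5}))  # 12.5 hits/sec

    # (b) User 2 cheats: the system caches A for user 1 and C for user 2;
    #     user 2 still free-rides on the shared, cached copy of A.
    print(hit_rate(user1, {"A": 1.0}))            # 10 hits/sec
    print(hit_rate(user2, {"A": 1.0, "C": 1.0}))  # 15 hits/sec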

    Thrifty-cheating. To explain the kind of cheating where a user carefully calculates cost-benefits and then changes file priorities accordingly, we first define cost/(hit/sec) as the amount of budget a user pays to get a 1 hit/sec access rate for a unit file. To optimize her utility, which is defined as the total hit rate, a user's optimal strategy is not to cache the files she accesses most frequently, but the ones with the lowest cost/(hit/sec). Compare a file of 100MB shared by 2 users with another file of 100MB shared by 5 users. Even though a user accesses the former 10 times/sec and the latter only 8 times/sec, it is overall more economical to cache the second file: 100MB/2 users at 10 hits/sec gives 5MB/(hit/sec), while 100MB/5 users at 8 hits/sec gives 2.5MB/(hit/sec).

    [1] When half of a file is in the cache, half of the page-level accesses to the file will result in cache misses. Numerically, this is equal to missing the entire file 50% of the time, so hit rate is calculated as access rate multiplied by the percentage cached.

    The consequence of thrifty-cheating, however, is more complicated. While it might appear to improve user and system performance at first glance, it does not lead to an equilibrium where all users are content with their allocations. This can cause users to constantly game the system, which leads to a worse outcome.

    In the above examples we have shown that a user can experience utility loss due to another user's cheating. A natural question to ask is: how bad could it be? That is, what is the upper bound on what a user can lose when being cheated? By construction, one can show that in two-user cases, a user can lose up to 50% of cache/hit rate when all her files are shared and free-ridden by the other, strategic user. As the free-rider evades charges for shared files, the honest user double-pays. This can be extended to the more general case with n (n > 2) users, where the loss can increase linearly with the number of cheating users. Suppose that cached files are shared by n users, so each user pays 1/n of the file sizes. If n-1 strategic users decide to cache other files, the only honest user left has to pay the total cost. In turn, the honest user has to evict up to (n-1)/n of her files to maintain the same budget.

    It is also worth mentioning that for many applications, moderate or even minor cache loss can result in a drastic performance drop. For example, in many file systems with an overall high cache hit ratio, the effective I/O latency with caching can be approximated as T_IO = Ratio_miss * Latency_miss. A slight difference in the cache hit ratio, e.g., from 99.7% to 99.4%, doubles the miss ratio (0.3% to 0.6%) and thus means a 2x increase in average I/O latency! This indeed necessitates strategy-proofness in cache policies.

    3.4 Blocking Access to Avoid Cheating

    At the heart of providing strategy-proofness is the question of how free-riding can be prevented. In the previous example, user 2 was incentivized to cheat because she was able to access the cached shared files regardless of her access patterns. Intuitively, if user 2 is blocked from accessing files that she tries to free-ride, she will be dis-incentivized to cheat.

    Applying blocking to our previous example, user 2 will not be allowed to access A, despite the fact that user 1 has already cached A (Figure 3(c)): the system blocks the free-riding access.

    Probabilistic blocking: FairRide blocks a user with probability p(nj) = 1/(nj + 1), where nj is the number of other users caching file j; e.g., p(1) = 50%, p(4) = 20%. This is the best one can do in the general case: any less blocking does not prevent cheating.
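    A tiny sketch of the blocking rule in Python (function and data-structure names are invented for illustration):

    import random

    def fairride_lookup(user, fname, cache, cachers):
        # cache: set of cached file names; cachers[f]: users paying for file f.
        # A user who is not paying for a cached file is blocked with
        # probability 1/(n_j + 1), where n_j = number of other users caching it.
        if fname not in cache:
            return "miss"
        paying = cachers.get(fname, set())
        if user not in paying:
            n_j = len(paying)  # all of them are "other" users here
            if random.random() < 1.0 / (n_j + 1):
                return "blocked"  # served from the backend, not the cache
        return "hit"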

  • FairRide: Near-Optimal, Fair Cache Sharing


    Figure: Cheating under FairRide; miss ratio (%) vs. time (s) for user 1 and user 2, marking the points where user 2 cheats and where user 1 cheats. FairRide dis-incentivizes users from cheating.

    Figure: Facebook experiments; average response time (ms, 0-400). FairRide outperforms max-min fairness by 29%.

    Figure: Reduction in median job time (%) by job bin (#tasks: 1-10, 11-50, 51-100, 101-500, 501-), comparing max-min and FairRide.

  • HUG: Multi-Resource Fairness for Correlated and Elastic Demands
    Who? UCB AMPLab; works on coflow-based networking, multi-resource allocation in datacenters, compute and storage for big data, and network virtualization; SIGCOMM papers; DRF [NSDI'11], FairCloud [SIGCOMM'12]

    What?

    Figure: Machines M1..MN with links L1..LN and LN+1..L2N connected through a congestion-less core; tenant-A's and tenant-B's VMs share the fabric. How can the links be shared between multiple tenants so as to (1) provide optimal performance guarantees and (2) maximize utilization?

  • HUG: Multi-Resource Fairness for Correlated and Elastic Demands Highest Utilization with the Optimal Isolation Guarantee


    Figure: Utilization vs. isolation guarantee. Per-flow fairness and PS-P are work-conserving but provide a low isolation guarantee; DRF provides the optimal isolation guarantee but low utilization; HUG sits at the optimal isolation guarantee with high utilization.

    HUG in the cooperative setting: (1) optimal isolation guarantee, (2) work conservation. HUG in the non-cooperative setting: (1) optimal isolation guarantee, (2) highest utilization, (3) strategyproof.


    Intuitively, we want to maximize the minimum progress over all tenants, i.e., maximize min_k M_k, where min_k M_k corresponds to the isolation guarantee of an allocation algorithm. We make three observations. First, when there is a single link in the system, this model trivially reduces to max-min fairness. Second, getting more aggregate bandwidth is not always better: for tenant-A in the example, (50Mbps, 100Mbps) is better than (90Mbps, 90Mbps) or (25Mbps, 200Mbps), even though the latter two have more bandwidth in total. Third, simply applying max-min fairness to individual links is not enough: in our example, max-min fairness allocates equal resources to both tenants on both links, resulting in allocations of 1/2 on both links (Figure 1b). The corresponding progress (M_A = M_B = 1/2) results in a suboptimal isolation guarantee (min{M_A, M_B} = 1/2).

    Dominant Resource Fairness (DRF) [33] extends max-min fairness to multiple resources and prevents such suboptimality. It equalizes the shares of dominant resources (link-2 for tenant-A, link-1 for tenant-B) across all tenants with correlated demands and maximizes the isolation guarantee in a strategyproof manner. As shown in Figure 1c, using DRF, both tenants have the same progress M_A = M_B = 2/3, 50% higher than using max-min fairness on individual links. Moreover, DRF's isolation guarantee (min{M_A, M_B} = 2/3) is optimal across all possible allocations and is strategyproof.
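    A small Python sketch of the equal-progress computation behind these numbers (link capacities are normalized to 1, and the demand vectors are read off the running example, where tenant-A's correlation vector is taken to be (1/2, 1) and tenant-B's (1, 1/2), consistent with the quoted progress values):

    def max_equal_progress(demands, capacity=1.0):
        # Largest M such that, on every link l, sum_k M * d_k[l] <= capacity.
        num_links = len(demands[0])
        return min(capacity / sum(d[l] for d in demands)
                   for l in range(num_links))

    d_A = (0.5, 1.0)  # tenant-A's per-link demand correlation vector
    d_B = (1.0, 0.5)  # tenant-B's per-link demand correlation vector
    M = max_equal_progress([d_A, d_B])
    print(M)  # 2/3: tenant-A gets (1/3, 2/3), tenant-B gets (2/3, 1/3)
    # Per-link max-min would instead give everyone 1/2 on both links,
    # for a progress of only min(0.5/0.5, 0.5/1.0) = 1/2.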

    However, DRF assumes inelastic demands [40], and it is not work-conserving. For example, the shaded bandwidth on link-2 is not allocated to either tenant. In fact, we show that DRF can result in arbitrarily low utilization (Lemma 6). This is wasteful, because unused bandwidth cannot be recovered.

    We start by showing that strategy-proofness is a necessary condition for providing the optimal isolation guarantee, i.e., for maximizing min_k M_k in non-cooperative environments (§2). Next, we prove that work conservation, i.e., letting tenants use unallocated resources such as the shaded area in Figure 1c without constraints, spurs a race to the bottom. It incentivizes each tenant to continuously lie about her demand correlations, and in the process, it decreases the amount of useful work done by all tenants! Meaning, simply making DRF work-conserving can do more harm than good.

    We propose a two-stage algorithm, High Utilization with Guarantees (HUG), to achieve our goals (§3). Figure 2 surveys the design space for cloud network sharing and places HUG in context by following the thick lines. At the highest level, unlike many alternatives [13, 14, 37, 44], HUG is a dynamic allocation algorithm. Next, HUG enforces its allocations at the tenant-/network-level, because flow- or (virtual) machine-level allocations [61, 62] do not provide an isolation guarantee.

    Figure 2: Design space for cloud network sharing. Reservation schemes (SecondNet, Oktopus, Pulsar, Silo) use admission control; dynamic sharing can be enforced at the flow level (per-flow fairness; no isolation guarantee), the VM level (Seawall, GateKeeper; no isolation guarantee), or the tenant-/network level. At the tenant/network level, non-cooperative environments require strategy-proofness, where HUG achieves the highest utilization for the optimal isolation guarantee and DRF achieves the optimal isolation guarantee at low utilization; cooperative environments do not require strategy-proofness, where HUG is work-conserving with the optimal isolation guarantee and PS-P, EyeQ, and NetShare are work-conserving with a suboptimal isolation guarantee.

    Due to the hard tradeoff between the optimal isolation guarantee and work conservation in non-cooperative environments, HUG ensures the highest utilization possible while maintaining the optimal isolation guarantee. It incentivizes tenants to expose their true demands, ensuring that they actually consume their allocations instead of causing collateral damage. In cooperative environments, where strategy-proofness might be a non-requirement, HUG simultaneously ensures both work conservation and the optimal isolation guarantee. In contrast, existing solutions [33, 45, 51, 58, 59] are suboptimal in both environments. Overall, HUG generalizes single- [25, 43, 55] and multi-resource max-min fairness [27, 33, 38, 56] and multi-tenant network sharing solutions [45, 51, 58, 59, 61, 62] under a unifying framework.

    HUG is easy to implement and scales well. Even with 100,000 machines, new allocations can be centrally calculated and distributed throughout the network in less than a second, faster than that suggested in the literature [13]. Moreover, each machine can locally enforce HUG-calculated allocations using existing traffic control tools without any changes to the network (§4).

    We demonstrate the effectiveness of our proposal using EC2 experiments and trace-driven simulations (§5). In non-cooperative environments, HUG provides the optimal isolation guarantee, which is 7.4x higher than existing network sharing solutions like PS-P [45, 58, 59] and 7000x higher than traditional per-flow fairness, with 1.4x better utilization than DRF for production traces. In cooperative environments, HUG outperforms PS-P and per-flow fairness by 1.48x and 17.35x in terms of the 95th percentile slowdown of job communication stages, and 70% of jobs experience lower slowdown w.r.t. DRF.

    We discuss current limitations and future research in Section 6 and compare HUG to related work in Section 7.


  • HUG: Multi-Resource Fairness for Correlated and Elastic Demands
    Evaluation on 100 EC2 machines: tenants A and C use pairwise one-to-one communication; tenant B uses all-to-all communication.

    Figure 10: [EC2] Bandwidth consumption (total allocation in Gbps vs. time in seconds) of three tenants arriving over time in a 100-machine EC2 cluster. Each tenant has 100 VMs, but each uses a different communication pattern (§5.1.1). We observe that (a) using TCP (per-flow fairness), tenant-B dominates the network by creating more flows; (b) HUG isolates tenants A and C from tenant-B.

    … per-flow fairness, PS-P [58], and DRF [33] (§5.2). Finally, we evaluate HUG's long-term impact on application performance using a 3000-machine Facebook cluster trace used by Chowdhury et al. [23] and compare against per-flow fairness, PS-P, and DRF, as well as Varys, which focuses only on improving performance (§5.3).

    5.1 Testbed Experiments

    Methodology: We performed our experiments on 100 m2.4xlarge Amazon EC2 [2] instances running Linux kernel 3.4.37 and used the default htb and tc implementations. While there exist proposals for more accurate qdisc implementations [45, 57], the default htb worked sufficiently well for our purposes. Each of the machines had 1 Gbps NICs, and we could use close to the full 100 Gbps aggregate bandwidth simultaneously.

    5.1.1 Network-Wide Isolation

    We consider a cluster with 100 EC2 machines, divided between three tenants A, B, and C that arrive over time. Each tenant has 100 VMs; i.e., VMs Ai, Bi, and Ci are collocated on the i-th physical machine. However, they have different communication patterns: tenants A and C have pairwise one-to-one communication patterns (100 VM-VM flows each), whereas tenant-B follows an all-to-all pattern using 10,000 flows. Specifically, Ai communicates with A(i+50)%100, Cj communicates with C(j+25)%100, and any Bk communicates with all Bl, where i, j, k, l ∈ {1, ..., 100}. Each tenant demands the entire capacity at each machine; hence, the entire capacity of the cluster should be equally divided among the active tenants to maximize isolation guarantees.
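    For concreteness, the flow pattern described above can be reconstructed as follows (a sketch; the VM naming is illustrative):

    flows = []
    for i in range(100):                                   # 0-based VM indices
        flows.append((f"A{i}", f"A{(i + 50) % 100}"))      # tenant-A: one-to-one
        flows.append((f"C{i}", f"C{(i + 25) % 100}"))      # tenant-C: one-to-one
        flows += [(f"B{i}", f"B{l}") for l in range(100)]  # tenant-B: all-to-all
    # 100 + 100 + 10,000 flows in total, matching the text.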

    Figure 10a shows that as soon as tenant-B arrives, she takes up the entire capacity in the absence of an isolation guarantee. Tenant-C receives only a marginal share, as she arrives after tenant-B and leaves before her. Note that tenant-A (when alone) uses only about 80% of the available capacity; this is simply because just one TCP flow per VM-VM pair often cannot saturate the link.

    Figure 10b presents the allocation using HUG. As tenants arrive and depart, allocations are dynamically calculated, propagated, and enforced in each machine of the cluster. As before, tenants A and C use marginally less than their allocations because they create only one flow between each VM-VM pair.

    5.1.2 Scalability

    The key challenge in scaling HUG is its centralized resource allocator, which must recalculate tenant shares and redistribute them across the entire cluster whenever any tenant changes her correlation vector.

    We found that the time to calculate new allocations using HUG is less than 5 microseconds in our 100-machine cluster. Furthermore, a recomputation due to a tenant's arrival, departure, or change of correlation vector would take about 8.6 milliseconds on average for a 100,000-machine datacenter.

    Communicating a new allocation takes less than 10 milliseconds for 100 machines and around 1 second for 100,000 emulated machines (i.e., sending the same message 1000 times to each of the 100 machines).

    5.2 Instantaneous Fairness

    While Section 5.1 evaluated HUG in controlled, synthetic scenarios, this section focuses on HUG's instantaneous allocation characteristics in the context of a large-scale cluster.

    Methodology: We use a one-hour snapshot with 100 concurrent jobs from a production MapReduce trace, which was extracted from a 3200-machine Facebook cluster by Popa et al. [58, Section 5.3]. Machines are connected to the network using 1 Gbps NICs. In the trace, a job with M mappers and R reducers, and hence a corresponding M x R shuffle, is described as a matrix with the amount of data to transfer between each M-R pair. We calculated the correlation vectors of individual shuffles from their communication matrices ourselves using the optimal rate allocation algorithm for a single shuffle [22, 23], ensuring that all the flows of each shuffle finish simultaneously.

    Given the workload, we calculate the progress of each job/shuffle using different allocation mechanisms and …


  • Wrap-up: in this NSDI 2016 session, papers from the UCB AMPLab feature prominently, and Facebook trace data is used throughout the evaluations. The NSDI 2016 proceedings and slides are available from USENIX.