USENIX NSDI 2016 (Session: Resource Sharing)

USENIX NSDI 2016, Session: Resource Sharing, 2016-05-29, @oraccha


  • USENIX NSDI 2016, Session: Resource Sharing, 2016-05-29, @oraccha

  • Co-located Events
    ACM Symposium on SDN Research 2016 (SOSR), March 13-17, 2016
    Open Networking Summit (ONS), March 14-17
    The 12th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS16), March 17-19
    The 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI16)
    The USENIX Workshop on Cool Topics in Sustainable Data Centers (CoolDC16), March 19

  • Session: Resource Sharing
    Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica, University of California, Berkeley

    Cliffhanger: Scaling Performance Cliffs in Web Memory Caches. Asaf Cidon and Assaf Eisenman, Stanford University; Mohammad Alizadeh, MIT CSAIL; Sachin Katti, Stanford University

    FairRide: Near-Optimal, Fair Cache Sharing. Qifan Pu and Haoyuan Li, University of California, Berkeley; Matei Zaharia, Massachusetts Institute of Technology; Ali Ghodsi and Ion Stoica, University of California, Berkeley

    HUG: Multi-Resource Fairness for Correlated and Elastic Demands. Mosharaf Chowdhury, University of Michigan; Zhenhua Liu, Stony Brook University; Ali Ghodsi and Ion Stoica, University of California, Berkeley, and Databricks Inc.

  • Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics
    Who? UCB AMPLab (the group behind Spark and Mesos); SoCC'12, EuroSys'13, OSDI'14, SIGMOD'16

    What?

    Figure: Do choices matter? Running time (s) across cluster configurations (1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, 16 r3.large) for Matrix Multiply (400K by 1K) and QR Factorization (1M by 1K); one workload is network bound, the other memory-bandwidth bound.

    Figure: Do choices matter? Matrix multiply (matrix size: 400K by 1K), time (s) for 1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, and 16 r3.large. Every configuration has the same totals: cores = 16, memory = 244 GB, cost = $2.66/hr.

    KeystoneML TIMIT pipeline: Raw Data -> Cosine Transform -> Normalization -> Linear Solver (~100 iterations).

    Properties: iterative (each iteration runs many jobs), long running and expensive, numerically intensive.

    Figure: Actual vs. ideal scaling, time (s) vs. cores (0-600), r3.4xlarge instances, QR factorization 1M by 1K.

    Do choices matter? Computation + communication leads to non-linear scaling.

  • Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics

    How? Run a small number of training jobs on samples of the input data, then fit a performance model to predict running time at scale.

    Optimal design of experiments: choose which (input fraction, machines) configurations to run as training jobs, over a grid of input sizes (1%, 2%, 4%, 8%) and machines (1, 2, 4, 8); solved with an off-the-shelf solver (CVX).
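    As an illustration of this step, here is a minimal A-optimal experiment-design sketch in Python with cvxpy (a Python analogue of the CVX solver named on the slide); the configuration grid matches the slide, but the cost proxy, budget, and feature set are assumptions for illustration, not Ernest's exact formulation:

    import numpy as np
    import cvxpy as cp

    # Candidate configurations over the slide's grid.
    configs = [(m, f) for m in (1, 2, 4, 8) for f in (0.01, 0.02, 0.04, 0.08)]
    A = np.array([[1.0, f / m, np.log(m), m] for m, f in configs])  # model features
    cost = np.array([m * f for m, f in configs])   # rough cost proxy: machines x data

    lam = cp.Variable(len(configs), nonneg=True)   # how much to use each config
    M = sum(lam[i] * np.outer(A[i], A[i]) for i in range(len(configs)))
    # A-optimal design: minimize trace(M^{-1}), i.e., the overall variance of
    # the fitted model, subject to an experiment budget.
    trace_inv = sum(cp.matrix_frac(np.eye(4)[:, j], M) for j in range(4))
    prob = cp.Problem(cp.Minimize(trace_inv), [lam <= 1, cost @ lam <= 0.5])
    prob.solve()
    chosen = [c for c, l in zip(configs, lam.value) if l > 1e-3]  # configs to run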

    Using Ernest: given a job binary, experiment design picks (machines, input size) pairs; training jobs are run (using only a few iterations), and a linear model is fit to the measured times.

    Figure: Ernest's predicted time vs. number of machines (1 to 900), time axis 0-1000.

    Basic model:

    time = x1 + x2 * (input / machines) + x3 * log(machines) + x4 * machines

    where x1 captures serial execution, x2 computation that scales linearly with per-machine input, x3 a tree (aggregation) DAG, and x4 an all-to-one DAG. Collect training data, then fit a linear regression.
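    A minimal sketch of collecting training data and fitting the basic model in Python (the training runs below are made-up numbers; the non-negative least squares fit keeps every coefficient physically meaningful, which is the spirit of Ernest's fitting step):

    import numpy as np
    from scipy.optimize import nnls

    def features(machines, input_frac):
        # The four terms of the basic model above.
        return [1.0, input_frac / machines, np.log(machines), machines]

    # Hypothetical training runs: (machines, input fraction, measured time in s).
    runs = [(1, 0.01, 4.2), (2, 0.02, 4.5), (4, 0.04, 5.0),
            (8, 0.08, 5.9), (2, 0.01, 3.4), (8, 0.02, 4.1), (16, 0.08, 6.8)]
    A = np.array([features(m, f) for m, f, _ in runs])
    b = np.array([t for _, _, t in runs])
    x, _ = nnls(A, b)  # non-negative coefficients x1..x4

    def predict(machines, input_frac=1.0):
        # Extrapolate to the full input and larger clusters.
        return float(np.dot(features(machines, input_frac), x))

    print(predict(64))  # predicted time on 64 machines with 100% of the data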

  • Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics Results

    Training time: KeystoneML. TIMIT pipeline on r3.xlarge instances, 100 iterations. Experiment design: 7 data points, up to 16 machines, up to 10% of the data.

    Figure: training time vs. running time (s, 0-6000) on 42 machines.

    Is experiment design useful?

    Figure: prediction error (%, 0-100) for Regression, Classification, KMeans, PCA, and TIMIT, comparing experiment design with cost-based selection of training points.

  • Cliffhanger: Scaling Performance Cliffs in Web Memory Caches
    Who? Stanford CS (first author Asaf Cidon is also CEO of Sookasa); SIGCOMM'12, USENIX ATC'13, '15

    What? Performance cliffs in web memory caches such as Memcached with its slab allocator.

    Figure: Hit-rate curve (hit rate vs. number of items in the LRU queue, 0-18000) with its concave hull, for Application 19, Slab 0; the gap between the curve and the hull is a performance cliff (cf. Talus [HPCA15]).

    +1% cache hit rate can mean a +35% speedup: the cache hit rate of Facebook's Memcached pool is 98.2% [SIGMETRICS12].

  • Cliffhanger: Scaling Performance Cliffs in Web Memory Caches
    How? Shadow queues. A hill-climbing algorithm estimates the local gradient of each queue's (slab's) hit-rate curve and incrementally moves memory toward the queue that benefits most; a cliff-scaling algorithm handles performance cliffs. (A code sketch follows the next two slides.)

    Figure: Using shadow queues to estimate the local gradient. Each physical queue (Queue 1, Queue 2) is paired with a shadow queue that remembers recently evicted keys; shadow-queue hits earn credits (e.g., Queue 1: +2, Queue 2: -2) that drive resizing of the queues.

    Cliffhanger runs both algorithms in parallel: the original queue is partitioned into two queues, with shadow queues tracking the left and the right of the partition pointer as well as the hill climbing. Algorithm 1 incrementally optimizes memory across queues, across slab classes, and across applications; Algorithm 2 scales performance cliffs.
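    A rough, illustrative sketch of the shadow-queue/hill-climbing idea in Python (class and parameter names are invented; the real system resizes in byte-sized chunks and uses credits, as described above):

    from collections import OrderedDict
    import random

    CHUNK = 1  # capacity moved per resize step (items here; bytes in the paper)

    class ShadowLRU:
        # Physical LRU queue plus a key-only shadow queue of recent evictions.
        def __init__(self, capacity, shadow_capacity):
            self.capacity = capacity
            self.shadow_capacity = shadow_capacity
            self.items = OrderedDict()   # key -> value
            self.shadow = OrderedDict()  # keys only

        def lookup(self, key):
            # Returns (hit, shadow_hit) and admits the key on a miss.
            if key in self.items:
                self.items.move_to_end(key)
                return True, False
            shadow_hit = key in self.shadow
            if shadow_hit:
                del self.shadow[key]
            self.items[key] = None
            while len(self.items) > self.capacity:
                evicted, _ = self.items.popitem(last=False)
                self.shadow[evicted] = None
                if len(self.shadow) > self.shadow_capacity:
                    self.shadow.popitem(last=False)
            return False, shadow_hit

    def hill_climb(queues, trace):
        # trace: iterable of (queue_index, key). A shadow hit in queue i means a
        # little more memory would have turned a miss into a hit, so grow queue i
        # at the expense of another queue.
        for qi, key in trace:
            _, shadow_hit = queues[qi].lookup(key)
            if shadow_hit:
                donors = [j for j in range(len(queues))
                          if j != qi and queues[j].capacity > CHUNK]
                if donors:
                    queues[random.choice(donors)].capacity -= CHUNK
                    queues[qi].capacity += CHUNK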

  • Cliffhanger: Scaling Performance Cliffs in Web Memory Caches
    Results. (Fairness across users is out of scope here; that is FairRide's topic.)

    Cliffhanger reduces misses and can save memory: average misses reduced by 36.7%, average potential memory savings 45%.

    Cliffhanger outperforms default and optimized schemes: average Cliffhanger hit-rate increase 1.2%.

  • FairRide: Near-Optimal, Fair Cache Sharing
    Who? UCB AMPLab; MobiCom'13, SIGCOMM'15

    What? Three desirable properties: isolation guarantee, strategy-proofness, and Pareto efficiency.

    Figure: Statically allocated caches (one per user, in front of the backend storage/network) give isolation and strategy-proofness; a globally shared cache gives higher utilization and lets users share data. What we want is both.

    Properties: among existing policies (max-min fairness, priority allocation, max-min rate, static allocation), none achieves all three of isolation guarantee, strategy-proofness, and Pareto efficiency (SIP) simultaneously. FairRide is near-optimal: it keeps the isolation guarantee and strategy-proofness while giving up only a small amount of Pareto efficiency.

  • FairRide: Near-Optimal, Fair Cache Sharing
    How? Max-min fairness plus probabilistic blocking as a dis-incentive to cheating. Implemented on Alluxio (Tachyon) [SoCC14].

    Figure 3: Example with 2 users, 3 files, and a total cache size of 2. Numbers represent access frequencies. (a) Allocation under max-min fairness; (b) allocation under max-min fairness when the second user makes spurious accesses (red line) to file C and free-rides on file A; (c) blocking the free-riding access (blue dotted line).

    3.3 Cheating. While max-min fairness is strategy-proof when users access different files, this is no longer the case when files are shared. There are two types of cheating that could break strategy-proofness: (1) intuitively, when files are shared, a user can free-ride on files that have already been cached by other users; (2) a thrifty user can choose to cache files that are shared by more users, as such files are more economical due to cost-sharing.

    Free-riding. To illustrate free-riding, consider two users: user 1 accesses files A and B, and user 2 accesses files A and C. Assume the cache size is 2, and that we can cache a fraction of a file. Next, assume that every user uses the LFU replacement policy and that both users access A much more frequently than the other files. As a result, the system will cache file A and charge each user 1/2. In addition, each user will get half of their other file in the cache, i.e., half of file B for user 1, and half of file C for user 2, as shown in Figure 3(a). Each user gets a cache hit rate of 5*0.5 + 10 = 12.5 hits/sec.[1]

    Now assume user 2 cheats by spuriously accessing file C to artificially increase its access rate so that it exceeds A's access rate (Figure 3(b)), effectively setting the priority of C higher than B's. Since C now has the highest access rate for user 2, while A remains the most accessed file of user 1, the system will cache A for user 1 and C for user 2, respectively. The problem is that user 2 will still be able to benefit from accessing file A, which has already been cached by user 1. In the end, user 1 gets 10 hits/sec, and user 2 gets 15 hits/sec. In this way, user 2 free-rides on user 1's file A.
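    The arithmetic behind these hit rates, written out as a small Python check (access frequencies A=10, B=5, C=5 and cache size 2, as in Figure 3):

    def hit_rate(freqs, cached_frac):
        # Hit rate = access rate x fraction of the file cached (see footnote 1).
        return sum(f * cached_frac.get(name, 0.0) for name, f in freqs.items())

    user1 = {"A": 10, "B": 5}
    user2 = {"A": 10, "C": 5}

    # (a) Honest max-min: A fully cached (shared), half of B and half of C.
    print(hit_rate(user1, {"A": 1.0, "B": 0.5}))  # 12.5 hits/sec
    print(hit_rate(user2, {"A": 1.0, "C": 0.5}))  # 12.5 hits/sec

    # (b) User 2 cheats: the system caches A for user 1 and C for user 2;
    #     user 2 still free-rides on the shared, cached copy of A.
    print(hit_rate(user1, {"A": 1.0}))            # 10 hits/sec
    print(hit_rate(user2, {"A": 1.0, "C": 1.0}))  # 15 hits/sec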

    Thrifty-cheating. To explain the kind of cheating where a user carefully calculates cost-benefits and then changes file priorities accordingly, we first define cost/(hit/sec) as the amount of budget a user pays to get a 1 hit/sec access rate for a unit file. To optimize her utility, which is defined as the total hit rate, a user's optimal strategy is not to cache the files she accesses most frequently, but the ones with the lowest cost/(hit/sec). Compare a file of 100MB shared by 2 users with another file of 100MB shared by 5 users. Even though a user accesses the former 10 times/sec and the latter only 8 times/sec, it is overall more economical to cache the second file: 100MB/2 users at 10 hits/sec gives 5MB/(hit/sec), while 100MB/5 users at 8 hits/sec gives 2.5MB/(hit/sec).

    [1] When half of a file is in the cache, half of the page-level accesses to the file will result in cache misses. Numerically, this is equal to missing the entire file 50% of the time, so hit rate is calculated as access rate multiplied by the percentage cached.

    The consequence of thrifty-cheating, however, is more complicated. While it might appear to improve user and system performance at first glance, it does not lead to an equilibrium where all users are content with their allocations. This can cause users to constantly game the system, which leads to a worse outcome.

    In the above examples we have shown that a user can experience utility loss due to another user's cheating. A natural question to ask is: how bad could it be? That is, what is the upper bound on what a user can lose when being cheated? By construction, one can show that in two-user cases, a user can lose up to 50% of cache/hit rate when all her files are shared and free-ridden by the other, strategic user. As the free-rider evades charges for shared files, the honest user double-pays. This can be extended to the more general case with n (n > 2) users, where the loss can increase linearly with the number of cheating users. Suppose that cached files are shared by n users, so each user pays 1/n of the file sizes. If n-1 strategic users decide to cache other files, the only honest user left has to pay the total cost. In turn, the honest user has to evict up to (n-1)/n of her files to maintain the same budget.

    It is also worth mentioning that for many applications, moderate or even minor cache loss can result in a drastic performance drop. For example, in many file systems with an overall high cache hit ratio, the effective I/O latency with caching can be approximated as T_IO = Ratio_miss * Latency_miss. A slight difference in the cache hit ratio, e.g., from 99.7% to 99.4%, doubles the miss ratio (0.3% to 0.6%) and thus means a 2x increase in average I/O latency! This indeed necessitates strategy-proofness in cache policies.

    3.4 Blocking Access to Avoid Cheating

    At the heart of providing strategy-proofness is the question of how free-riding can be prevented. In the previous example, user 2 was incentivized to cheat because she was able to access the cached shared files regardless of her access patterns. Intuitively, if user 2 is blocked from accessing files that she tries to free-ride, she will be dis-incentivized to cheat.

    Applying blocking to our previous example, user 2 will not be allowed to access A, despite the fact that user 1 has already cached A (Figure 3(c)): the system blocks the free-riding access.

    Probabilistic blocking: FairRide blocks a user with probability p(nj) = 1/(nj + 1), where nj is the number of other users caching file j; e.g., p(1) = 50%, p(4) = 20%. This is the best one can do in the general case: any less blocking does not prevent cheating.
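    A tiny sketch of the blocking rule in Python (function and data-structure names are invented for illustration):

    import random

    def fairride_lookup(user, fname, cache, cachers):
        # cache: set of cached file names; cachers[f]: users paying for file f.
        # A user who is not paying for a cached file is blocked with
        # probability 1/(n_j + 1), where n_j = number of other users caching it.
        if fname not in cache:
            return "miss"
        paying = cachers.get(fname, set())
        if user not in paying:
            n_j = len(paying)  # all of them are "other" users here
            if random.random() < 1.0 / (n_j + 1):
                return "blocked"  # served from the backend, not the cache
        return "hit"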

  • FairRide: Near-Optimal, Fair Cache Sharing


    Figure: Cheating under FairRide; miss ratio (%) vs. time (s) for user 1 and user 2, marking the points where user 2 cheats and where user 1 cheats. FairRide dis-incentivizes users from cheating.

    Figure: Facebook experiments; average response time (ms, 0-400). FairRide outperforms max-min fairness by 29%.

    Figure: Reduction in median job time (%) by job bin (#tasks: 1-10, 11-50, 51-100, 101-500, 501-), comparing max-min and FairRide.

  • HUG: Multi-Resource Fairness for Correlated and Elastic Demands
    Who? UCB AMPLab; works on coflow-based networking, multi-resource allocation in datacenters, compute and storage for big data, and network virtualization; SIGCOMM papers; DRF [NSDI'11], FairCloud [SIGCOMM'12]

    What?

    Figure: Machines M1..MN with links L1..LN and LN+1..L2N connected through a congestion-less core; tenant-A's and tenant-B's VMs share the fabric. How can the links be shared between multiple tenants so as to (1) provide optimal performance guarantees and (2) maximize utilization?

  • HUG: Multi-Resource Fairness for Correlated and Elastic Demands Highest Utilization with the Optimal Isolation Guarantee


    Figure: Utilization vs. isolation guarantee. Per-flow fairness and PS-P are work-conserving but provide a low isolation guarantee; DRF provides the optimal isolation guarantee but low utilization; HUG sits at the optimal isolation guarantee with high utilization.

    HUG in the cooperative setting: (1) optimal isolation guarantee, (2) work conservation. HUG in the non-cooperative setting: (1) optimal isolation guarantee, (2) highest utilization, (3) strategyproof.


    Intuitively, we want to maximize the minimum progress over all tenants, i.e., maximize min_k M_k, where min_k M_k corresponds to the isolation guarantee of an allocation algorithm. We make three observations. First, when there is a single link in the system, this model trivially reduces to max-min fairness. Second, getting more aggregate bandwidth is not always better: for tenant-A in the example, (50Mbps, 100Mbps) is better than (90Mbps, 90Mbps) or (25Mbps, 200Mbps), even though the latter two have more bandwidth in total. Third, simply applying max-min fairness to individual links is not enough: in our example, max-min fairness allocates equal resources to both tenants on both links, resulting in allocations of 1/2 on both links (Figure 1b). The corresponding progress (M_A = M_B = 1/2) results in a suboptimal isolation guarantee (min{M_A, M_B} = 1/2).

    Dominant Resource Fairness (DRF) [33] extends max-min fairness to multiple resources and prevents such suboptimality. It equalizes the shares of dominant resources (link-2 for tenant-A, link-1 for tenant-B) across all tenants with correlated demands and maximizes the isolation guarantee in a strategyproof manner. As shown in Figure 1c, using DRF, both tenants have the same progress M_A = M_B = 2/3, 50% higher than using max-min fairness on individual links. Moreover, DRF's isolation guarantee (min{M_A, M_B} = 2/3) is optimal across all possible allocations and is strategyproof.
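    A small Python sketch of the equal-progress computation behind these numbers (link capacities are normalized to 1, and the demand vectors are read off the running example, where tenant-A's correlation vector is taken to be (1/2, 1) and tenant-B's (1, 1/2), consistent with the quoted progress values):

    def max_equal_progress(demands, capacity=1.0):
        # Largest M such that, on every link l, sum_k M * d_k[l] <= capacity.
        num_links = len(demands[0])
        return min(capacity / sum(d[l] for d in demands)
                   for l in range(num_links))

    d_A = (0.5, 1.0)  # tenant-A's per-link demand correlation vector
    d_B = (1.0, 0.5)  # tenant-B's per-link demand correlation vector
    M = max_equal_progress([d_A, d_B])
    print(M)  # 2/3: tenant-A gets (1/3, 2/3), tenant-B gets (2/3, 1/3)
    # Per-link max-min would instead give everyone 1/2 on both links,
    # for a progress of only min(0.5/0.5, 0.5/1.0) = 1/2.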

    However, DRF assumes inelastic demands [40], and it is not work-conserving. For example, the shaded bandwidth on link-2 is not allocated to either tenant. In fact, we show that DRF can result in arbitrarily low utilization (Lemma 6). This is wasteful, because unused bandwidth cannot be recovered.

    We start by showing that strategy-proofness is a necessary condition for providing the optimal isolation guarantee, i.e., for maximizing min_k M_k in non-cooperative environments (§2). Next, we prove that work conservation, i.e., letting tenants use unallocated resources such as the shaded area in Figure 1c without constraints, spurs a race to the bottom. It incentivizes each tenant to continuously lie about her demand correlations, and in the process, it decreases the amount of useful work done by all tenants! Meaning, simply making DRF work-conserving can do more harm than good.

    We propose a two-stage algorithm, High Utilization with Guarantees (HUG), to achieve our goals (§3). Figure 2 surveys the design space for cloud network sharing and places HUG in context by following the thick lines. At the highest level, unlike many alternatives [13, 14, 37, 44], HUG is a dynamic allocation algorithm. Next, HUG enforces its allocations at the tenant-/network-level, because flow- or (virtual) machine-level allocations [61, 62] do not provide an isolation guarantee.

    Figure 2: Design space for cloud network sharing. Reservation schemes (SecondNet, Oktopus, Pulsar, Silo) use admission control; dynamic sharing can be enforced at the flow level (per-flow fairness; no isolation guarantee), the VM level (Seawall, GateKeeper; no isolation guarantee), or the tenant-/network level. At the tenant/network level, non-cooperative environments require strategy-proofness, where HUG achieves the highest utilization for the optimal isolation guarantee and DRF achieves the optimal isolation guarantee at low utilization; cooperative environments do not require strategy-proofness, where HUG is work-conserving with the optimal isolation guarantee and PS-P, EyeQ, and NetShare are work-conserving with a suboptimal isolation guarantee.

    Due to the hard tradeoff between the optimal isolation guarantee and work conservation in non-cooperative environments, HUG ensures the highest utilization possible while maintaining the optimal isolation guarantee. It incentivizes tenants to expose their true demands, ensuring that they actually consume their allocations instead of causing collateral damage. In cooperative environments, where strategy-proofness might be a non-requirement, HUG simultaneously ensures both work conservation and the optimal isolation guarantee. In contrast, existing solutions [33, 45, 51, 58, 59] are suboptimal in both environments. Overall, HUG generalizes single- [25, 43, 55] and multi-resource max-min fairness [27, 33, 38, 56] and multi-tenant network sharing solutions [45, 51, 58, 59, 61, 62] under a unifying framework.

    HUG is easy to implement and scales well. Even with 100,000 machines, new allocations can be centrally calculated and distributed throughout the network in less than a second, faster than that suggested in the literature [13]. Moreover, each machine can locally enforce HUG-calculated allocations using existing traffic control tools without any changes to the network (§4).

    We demonstrate the effectiveness of our proposal using EC2 experiments and trace-driven simulations (§5). In non-cooperative environments, HUG provides the optimal isolation guarantee, which is 7.4x higher than existing network sharing solutions like PS-P [45, 58, 59] and 7000x higher than traditional per-flow fairness, with 1.4x better utilization than DRF for production traces. In cooperative environments, HUG outperforms PS-P and per-flow fairness by 1.48x and 17.35x in terms of the 95th percentile slowdown of job communication stages, and 70% of jobs experience lower slowdown w.r.t. DRF.

    We discuss current limitations and future research in Section 6 and compare HUG to related work in Section 7.


  • HUG: Multi-Resource Fairness for Correlated and Elastic Demands
    Evaluation on 100 EC2 machines: tenants A and C use pairwise one-to-one communication; tenant B uses all-to-all communication.

    Figure 10: [EC2] Bandwidth consumption (total allocation in Gbps vs. time in seconds) of three tenants arriving over time in a 100-machine EC2 cluster. Each tenant has 100 VMs, but each uses a different communication pattern (§5.1.1). We observe that (a) using TCP (per-flow fairness), tenant-B dominates the network by creating more flows; (b) HUG isolates tenants A and C from tenant-B.

    … per-flow fairness, PS-P [58], and DRF [33] (§5.2). Finally, we evaluate HUG's long-term impact on application performance using a 3000-machine Facebook cluster trace used by Chowdhury et al. [23] and compare against per-flow fairness, PS-P, and DRF, as well as Varys, which focuses only on improving performance (§5.3).

    5.1 Testbed Experiments

    Methodology: We performed our experiments on 100 m2.4xlarge Amazon EC2 [2] instances running Linux kernel 3.4.37 and used the default htb and tc implementations. While there exist proposals for more accurate qdisc implementations [45, 57], the default htb worked sufficiently well for our purposes. Each of the machines had 1 Gbps NICs, and we could use close to the full 100 Gbps aggregate bandwidth simultaneously.

    5.1.1 Network-Wide Isolation

    We consider a cluster with 100 EC2 machines, divided between three tenants A, B, and C that arrive over time. Each tenant has 100 VMs; i.e., VMs Ai, Bi, and Ci are collocated on the i-th physical machine. However, they have different communication patterns: tenants A and C have pairwise one-to-one communication patterns (100 VM-VM flows each), whereas tenant-B follows an all-to-all pattern using 10,000 flows. Specifically, Ai communicates with A(i+50)%100, Cj communicates with C(j+25)%100, and any Bk communicates with all Bl, where i, j, k, l ∈ {1, ..., 100}. Each tenant demands the entire capacity at each machine; hence, the entire capacity of the cluster should be equally divided among the active tenants to maximize isolation guarantees.
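    For concreteness, the flow pattern described above can be reconstructed as follows (a sketch; the VM naming is illustrative):

    flows = []
    for i in range(100):                                   # 0-based VM indices
        flows.append((f"A{i}", f"A{(i + 50) % 100}"))      # tenant-A: one-to-one
        flows.append((f"C{i}", f"C{(i + 25) % 100}"))      # tenant-C: one-to-one
        flows += [(f"B{i}", f"B{l}") for l in range(100)]  # tenant-B: all-to-all
    # 100 + 100 + 10,000 flows in total, matching the text.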

    Figure 10a shows that as soon as tenant-B arrives, she takes up the entire capacity in the absence of an isolation guarantee. Tenant-C receives only a marginal share, as she arrives after tenant-B and leaves before her. Note that tenant-A (when alone) uses only about 80% of the available capacity; this is simply because just one TCP flow per VM-VM pair often cannot saturate the link.

    Figure 10b presents the allocation using HUG. As tenants arrive and depart, allocations are dynamically calculated, propagated, and enforced in each machine of the cluster. As before, tenants A and C use marginally less than their allocations because they create only one flow between each VM-VM pair.

    5.1.2 Scalability

    The key challenge in scaling HUG is its centralized resource allocator, which must recalculate tenant shares and redistribute them across the entire cluster whenever any tenant changes her correlation vector.

    We found that the time to calculate new allocations using HUG is less than 5 microseconds in our 100-machine cluster. Furthermore, a recomputation due to a tenant's arrival, departure, or change of correlation vector would take about 8.6 milliseconds on average for a 100,000-machine datacenter.

    Communicating a new allocation takes less than 10 milliseconds for 100 machines and around 1 second for 100,000 emulated machines (i.e., sending the same message 1000 times to each of the 100 machines).

    5.2 Instantaneous Fairness

    While Section 5.1 evaluated HUG in controlled, synthetic scenarios, this section focuses on HUG's instantaneous allocation characteristics in the context of a large-scale cluster.

    Methodology: We use a one-hour snapshot with 100 concurrent jobs from a production MapReduce trace, which was extracted from a 3200-machine Facebook cluster by Popa et al. [58, Section 5.3]. Machines are connected to the network using 1 Gbps NICs. In the trace, a job with M mappers and R reducers, and hence a corresponding M x R shuffle, is described as a matrix with the amount of data to transfer between each M-R pair. We calculated the correlation vectors of individual shuffles from their communication matrices ourselves using the optimal rate allocation algorithm for a single shuffle [22, 23], ensuring that all the flows of each shuffle finish simultaneously.

    Given the workload, we calculate the progress of each job/shuffle using different allocation mechanisms and …


  • Wrap-up: in this NSDI 2016 session, papers from the UCB AMPLab feature prominently, and Facebook trace data is used throughout the evaluations. The NSDI 2016 proceedings and slides are available from USENIX.