a generalized blind scheduling policysamir/dcscheduling18/slides/ttic-misra.pdfapache web server...

36
A Generalized Blind Scheduling Policy Hanhua Feng 1 , Vishal Misra 2,3 and Dan Rubenstein 2 1 Infinio Systems 2 Columbia University in the City of New York 3 Google TTIC SUMMER WORKSHOP: DATA CENTER SCHEDULING FROM THEORY TO PRACTICE Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 1 / 37

Upload: others

Post on 11-Aug-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

A Generalized Blind Scheduling Policy

Hanhua Feng1, Vishal Misra2,3 and Dan Rubenstein2

1Infinio Systems

2Columbia University in the City of New York

3Google

TTIC SUMMER WORKSHOP: DATA CENTER SCHEDULINGFROM THEORY TO PRACTICE

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 1 / 37

Page 2: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Overview

1 Introduction

2 PBS Policy

3 Properties of the PBS policy

4 Implementation and Experimental Results

5 PBS in the Data Center

6 Conclusion and Future Work

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 2 / 37

Page 3: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Introduction

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 3 / 37

Page 4: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Scheduling Policies in Queueing Models

Scheduling is a compromise . . .not only between individual tasks, but also . . .

between systems with different workload patterns,

between different performance requirements, including

mean response time, mean slowdown, responsiveness, . . .fairness measures: seniority, RAQFM, . . .

Our workDesign a flexible scheduling policy to balance these requirements.

Assumptions in this talkSingle-server queueing model

Work-conserving, preemption allowed

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 4 / 37

Page 5: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Blind Scheduling Policies

Non-blind policiesKnow required and remainingservice time when tasks arrive.

Blind policiesNo information about remainingservice until tasks complete.

Non-blind policy examplesSJF, SRPT, SMART . . .

Blind policy examplesFCFS, PS, LAS, LCFS, . . .

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 5 / 37

Page 6: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

How Do We Measure Fairness of a Policy?

Fairness criteria [cf. Raz,Levy&Avi-Itzhak 2004]

Task seniority (emphasis on ti) ⇒ FCFS

Task service requirements (emphasis on xi)

Equal attained service ⇒ LAS/FBPS

Combination of the two: Equal share of processor

Current: dxi (t)/dti (t) ≡ x ′i (t) ⇒ PS

Aggregated: xi (t)/ti (t) ⇒ GAS

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 6 / 37

Page 7: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

How to Measure Fairness of a Policy? (cont’d)

Fairness measures in literatureComparison vs FCFS [Wang & Morris 1985]

RAQFM: Comparison vs PS [Raz,Levy&Avi-Itzhak 2004]

A quantitative measure.Difficult to analyze: with results for FCFS, LCFS, PLCFS, andRandom in M/M/1.G/D/m [Raz,Levy&Avi-Itzhak 2005]

Expected slowdown for given required service E [S |X = x ]compared with PS [Wierman&Harchol-Balter 2004]

A classification: always fair/unfair, sometimes fair.Assume M/G/1.Extended in [Wierman&Harchol-Balter 2005].

SQF [Avi-Itzhak,Brosh&Levy 2007]

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 7 / 37

Page 8: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

PBS Policy

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 8 / 37

Page 9: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Balance Between Two Fairness Criteria

Two fairness criteria (cont’d)

Seniority — Prefer larger sojourn time ti(t)

Service requirements — Prefer smaller attained service xi(t)

Our idea: A configurable balance

Schedule a task with maximal ti(t)− αxi(t).

More general: g(ti(t))− αg(xi(t)), e,g., log ti(t)− α log xi(t).

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 9 / 37

Page 10: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Our Parameterized Scheduler: PBS

The PBS policy with a single serverFor every task i , compute its priority value

pi(t) = log ti(t)− α log xi(t), Equivalent to Pi(t) =ti(x)

[xi(t)]α

α is a configurable parameter in [0,∞).

At time t, serve the task with the highest priority pi (or Pi).

Randomly choose among equal-priority tasks.Preempt low-priority tasks, if currently been served.

Can be used in continuous time (theory)or in discrete time (practice).

PBS: A Unified Priority-Based Scheduler Sigmetrics 2007

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 10 / 37

Page 11: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

PBS: Priority-based Blind Scheduling (cont’d)

Why PBS?Tunable: Parameter α can be changed from 0 to ∞.

Emulate well-known policies: Pi = ti/xαi

α = 0: First-come first-serve (FCFS) Pi = tiα→∞: Least attained service (LAS), Pi ∼ 1/xi

a.k.a. Foreground-Background Processor-Sharing (FBPS)α = 1: Greatest Attained Slowdown (GAS), Pi = ti/xi

closely emulate Processor-Sharing (PS).α = other values: Hybrid policies.

Blind: Using only past information (ti , xi)

Simple: Easy to implement.

Dimensionless: Not dependent on scale of time unit (minute,second).

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 11 / 37

Page 12: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Behavior of PBS

An exampleFour tasks in 4 colors

Arrival time: 0s,1s,3s,5s

Service: 4.5s,2.5s,3s,2s

How to read the graphsX-axis: Time

Y-axis: CPU utilization per task.

Area: Service received.

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 12 / 37

Page 13: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Properties

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 13 / 37

Page 14: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Properties of PBS for 0 < α <∞

Some properties of PBS proved in the paperA new task immediately receives service after arrival.

Small CPU fraction for α < 1Large CPU fraction for α > 1.

Seniority: Earlier tasks get more attained service.

Time-shared: CPU may be shared by two or more tasks.

Hospitality: A new task always gets a CPU share.

Convergence: Converge to PS in a long run for long jobs.

Converge to DPS with an offset to log formula,

No Starvation: Priority values of temporarily blocked tasksincrease towards infinity, and will become highest-priority task.

For α close to 0 (FCFS) or ∞ (LAS), tasks may be blockedfor a long time.

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 14 / 37

Page 15: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

PBS Tunability: A Graphical Conclusion

PBS is monotonic in many aspectsGuidelines for tuning α manually.

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 15 / 37

Page 16: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Implementation

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 16 / 37

Page 17: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Implementation in Linux Kernel

CPU utilization measurementDiscrete time implementation in Linux 2.6.15.

50ms moving average of measured CPU utilization per task.

Measurement results are close to simulation results.

Difference is the roughness on small time scales.

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 17 / 37

Page 18: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Emulating Existing Linux Scheduler

A small tweakAdd a bonus priority γ to the current taskin order to limit context switch.

With α = 2 and γ = 0.07, PBS looks close to Linux nativescheduler.

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 18 / 37

Page 19: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Experimental model

A closed modelA fixed number of users.

Each user submits a taskafter thinking.

Exponentially distributedthinking time.

Response time of everytask is measured.

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 19 / 37

Page 20: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Experimental Results (Set A)

Computational tasks with almost deterministic CPU usage.

About 3-second processing for each task.

8 users, 25s average thinking time.

For this work load,small α works best.

PBS (α < 0.7)outperforms Linuxand Round-robin.

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 20 / 37

Page 21: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Experimental Results (Set B) (1/2)

Apache web server 2.0, dynamic pages with heavy processing.

Overloaded with 30 users, 10s average thinking time.

Processing time is heavy-tailed.

For this workload,big α works best.

PBS (α > 2)outperforms Linuxand Round-robin.

ConclusionDifferent α’s are betterfor different workloads.

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 21 / 37

Page 22: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Experimental Results (Set B) (2/2)

Apache web server 2.0, dynamic pages with heavy processing.

Overloaded with 30 users, 10s average thinking time.

Processing time is heavy-tailed.

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 22 / 37

Page 23: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Data center

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 23 / 37

Page 24: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Data center fabric: A giant switch

DC Fabric: Just a Giant Switch

H1H2

H3H4

H5H6

H7H8

H9H1

H2H3

H4H5

H6H7

H8H9

TX RX8

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 24 / 37

Page 25: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Transport in Data Centers

H1H2

H3H4

H5H6

H7H8

H9H1

H2H3

H4H5

H6H7

H8H9

Objective?Ø Minimize avg FCT

DC transport = Flow scheduling on giant switch

ingress & egress capacity constraints

TX RX9

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 25 / 37

Page 26: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

pFabric

pFabric: Minimal Near-Optimal Datacenter Transport (Alizadeh et al.Sigcomm 2013)

Goal: Complete Flows QuicklyRequires scheduling flows such that:

High throughput for large flowsFabric latency (no queuing delays) for small flows

Prior work: use rate control to schedule flowsvastly improve performance, but complex

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 26 / 37

Page 27: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

pFabric in one slide

pFabric PacketsPackets carry a single priority number, e.g., prio = remaining flow size

pFabric Switches

Very small buffers (20-30KB for 10Gbps fabric)Send highest priority / drop lowest priority pkts

pFabric Hosts

Send/retransmit aggressivelyMinimal rate control: just prevent congestion collapse

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 27 / 37

Page 28: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

pFabric Switch

pFabric Switch

SwitchPort

7 1

9 43

ØPriority Scheduling send highest priority packet first

ØPriority Dropping drop lowest priority packets first

6

5

small “bag” of packets per-port

13

prio = remaining flow size

H1H2H3

H4H5H6

H7H8H9

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 28 / 37

Page 29: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

pFabric results

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 29 / 37

Page 30: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Summary of results

SJF and SRPT achieve nearly equal performance (for large flowsSRPT ≈ 15% better than SJF).

LAS (BytesSent) for DataMining nearly as good asoptimal/SRPT, for WebSearch performance breaks down at highload.

Many jobs of similar sizes keep getting pre-empted at high loads,until the new job “catches up”, and then it starts again

TradeoffsSJF/SRPT work across workload distributions, but require JobSize information

Job size often not availabe ahead of time

LAS requires no job size information, but doesn’t work well withnon-heavytailed job size distributions.

PBS can achieve balance with right α

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 30 / 37

Page 31: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Homa: Practical Low Latency Datacenter

Transport (Sigcomm 2018)

● No persistent congestion in the core

Congestion At The Edge

Big Switch

Core

TOR

Servers

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 31 / 37

Page 32: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Approach and Performance

Schedule messages in shortest-remaining-first order (SRPT)Near-optimal average latency & good tail latency for short messages

Key IdeasReceiver-driven congestion control and packet scheduling

Reduce buffer occupancy & improve latency

Use of network priorities dynamically assigned by receivers

Bypass queues for short messages

Controlled overcommitment on receivers downlink

Avoid bandwidth waste, leads to high bandwidth utilization

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 32 / 37

Page 33: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Homa mechanism

SRPT to schedule packets● Homa receivers schedule incoming packets: one grant per packet● Problem: 1 RTT additional latency for scheduling (size unknown)● Solution: Transmit 1 RTT of packets per message blindly

Sender

Receivertime

Grants Scheduled Data Packets

Unscheduled Data Packets

PBS for HomaNo need to know size of flows

First packet has natural high priority (unless α = 0)

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 33 / 37

Page 34: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Conclusion

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 34 / 37

Page 35: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

Conclusion and Future Work

ContributionsWe introduce a novel configurable policy, PBS.

By varying the single parameter, we can tune for variousperformance and fairness requirements.

Demonstrate properties and advantages of PBS by analysis,simulations, implementation, and experiments.

Current/Future work

Closed form of mean response time in M/G/1 for any α.

Design an automatic mechanism to dynamically adapt α toworkload.

Implement PBS with Data Center scheduling/transport systems

Extend PBS to multi-core systems.

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 35 / 37

Page 36: A Generalized Blind Scheduling Policysamir/DCscheduling18/slides/ttic-misra.pdfApache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking

The End

The End

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 36 / 37