
Page 1: Curbing Delays in Datacenters: Need Time to Save Time?

Curbing Delays in Datacenters: Need Time to Save Time?

Mohammad Alizadeh

Sachin Katti, Balaji Prabhakar

Insieme Networks / Stanford University

Page 2: Curbing Delays in Datacenters: Need Time to Save Time?

Window-based rate control schemes (e.g., TCP) do not work at near-zero round-trip latency

Page 3: Curbing Delays in Datacenters: Need Time to Save Time?

Datacenter Networks

1000s of server ports (workloads: web, app, db, map-reduce, HPC, monitoring, cache)

10-40Gbps links, 1-5μs latency

Message latency is King → need very high throughput, very low latency

Page 4: Curbing Delays in Datacenters: Need Time to Save Time?

Transport in Datacenters

• TCP widely used, but has poor performance
– Buffer hungry: adds significant queuing latency

[Figure: queuing latency spectrum — TCP ~1–10ms, DCTCP ~100μs, ~zero latency; baseline fabric latency: 1-5μs. How do we get here?]

Page 5: Curbing Delays in Datacenters: Need Time to Save Time?

Reducing Queuing: DCTCP vs TCP

Experiment: 2 flows (Win 7 stack), Broadcom 1Gbps switch, ECN marking threshold = 30KB

[Figure: senders S1…Sn through a shared switch; queue length (KBytes) over time for TCP vs DCTCP]
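DCTCP keeps the queue short by having the switch ECN-mark packets once the queue passes a threshold (30KB in this experiment) and having senders cut their window in proportion to the fraction of marked packets. A minimal sketch of the standard DCTCP sender-side update from the DCTCP paper, not code from this talk; the gain g is illustrative:

```python
# Sketch of the standard DCTCP sender window update (from the DCTCP paper,
# not from these slides). The EWMA gain below is illustrative.
G = 1.0 / 16  # gain for the smoothed marking-fraction estimate

def dctcp_update(cwnd, alpha, marked_pkts, total_pkts):
    """Apply once per window of 'total_pkts' ACKs; returns (cwnd, alpha)."""
    frac = marked_pkts / max(total_pkts, 1)      # fraction of ECN-marked packets
    alpha = (1 - G) * alpha + G * frac           # smoothed congestion estimate
    if marked_pkts > 0:
        cwnd = max(1.0, cwnd * (1 - alpha / 2))  # gentle, proportional window cut
    else:
        cwnd += 1.0                              # standard additive increase
    return cwnd, alpha
```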

Page 6: Curbing Delays in Datacenters: Need Time to Save Time?

Towards Zero Queuing

[Figure: senders S1…Sn sharing a single switch, ECN marking at 90% utilization]

Page 7: Curbing Delays in Datacenters: Need Time to Save Time?

Towards Zero Queuing

ns2 sim: 10 DCTCP flows, 10Gbps switch, ECN at 9Gbps (90% util); senders S1…Sn share a single switch

[Figure: queueing latency and total latency (μs) vs. round-trip propagation time (μs); latency floor ≈ 23μs]

[Figure: throughput (Gbps) vs. round-trip propagation time (μs), against the target throughput]

Page 8: Curbing Delays in Datacenters: Need Time to Save Time?

Window-based Rate Control

Sender → Receiver, C = 1

RTT = 10 → C×RTT = 10 pkts

Cwnd = 1 → Throughput = 1/RTT = 10%

Page 9: Curbing Delays in Datacenters: Need Time to Save Time?

Window-based Rate Control

Sender → Receiver, C = 1

RTT = 2 → C×RTT = 2 pkts

Cwnd = 1 → Throughput = 1/RTT = 50%

Page 10: Curbing Delays in Datacenters: Need Time to Save Time?

Window-based Rate Control

Sender → Receiver, C = 1

RTT = 1.01 → C×RTT = 1.01 pkts

Cwnd = 1 → Throughput = 1/RTT = 99%

Page 11: Curbing Delays in Datacenters: Need Time to Save Time?

Window-based Rate Control

Sender 1 (Cwnd = 1) and Sender 2 (Cwnd = 1) → Receiver

RTT = 1.01 → C×RTT = 1.01 pkts

As propagation time → 0: queue buildup is unavoidable
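The arithmetic behind these slides is simple: a window-based sender always keeps Cwnd packets outstanding, so whatever does not fit in the propagation pipe (C×RTT) must sit in the switch queue. A minimal sketch of that relationship under the slides' one-link model (names and units are illustrative):

```python
# Minimal model of window-based rate control on one link (C = 1 pkt per
# time unit, as on the slides). Illustrative sketch, not from the talk.
def window_model(cwnds, C, rtt_prop):
    """cwnds: per-sender congestion windows (pkts).
    Returns (utilization, steady-state queue length in pkts)."""
    outstanding = sum(cwnds)                   # pkts kept in flight by all senders
    bdp = C * rtt_prop                         # what the propagation pipe can hold
    queue = max(0.0, outstanding - bdp)        # the rest sits in the switch queue
    utilization = min(1.0, outstanding / bdp)  # < 1 only if windows underfill the pipe
    return utilization, queue

print(window_model([1], C=1, rtt_prop=10))      # (0.1, 0.0)   -> 10% throughput
print(window_model([1], C=1, rtt_prop=1.01))    # (~0.99, 0.0) -> 99% throughput
print(window_model([1, 1], C=1, rtt_prop=1.01)) # (1.0, ~0.99) -> queue buildup
```

With the minimum window of 1 packet per sender, two senders already overflow a near-zero pipe; this is why queue buildup is unavoidable as propagation time goes to zero.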

Page 12: Curbing Delays in Datacenters: Need Time to Save Time?

So What?

Window-based rate control needs lag in the loop. Near-zero latency transport must:
1. Use timer-based rate control / pacing (see the sketch below)
2. Use small packet size

Both increase CPU overhead (not practical in software). Possible in hardware, but complex (e.g., HULL NSDI'12).

Or… Change the Problem!
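For concreteness, option 1 above amounts to a per-packet timer at the sender. A generic pacing loop, purely illustrative and not a mechanism from the talk:

```python
# Generic timer-based pacing loop (illustration of option 1 above, not a
# mechanism from the talk). Sends one packet every pkt_bytes*8/rate seconds.
import time

def paced_send(send_pkt, rate_bps, pkt_bytes=1500, count=100):
    interval = pkt_bytes * 8 / rate_bps        # seconds between packet departures
    next_tx = time.monotonic()
    for _ in range(count):
        send_pkt()                             # hand one packet to the NIC
        next_tx += interval
        delay = next_tx - time.monotonic()
        if delay > 0:
            time.sleep(delay)                  # one timer per packet: the CPU cost
```

At 10Gbps with 1500B packets the interval is ~1.2μs, well below typical OS timer resolution, which is why the slide calls this impractical in software.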

Page 13: Curbing Delays in Datacenters: Need Time to Save Time?

Changing the Problem…

FIFO queue (switch port): queue buildup is costly → need precise rate control

Priority queue (switch port): queue buildup is irrelevant → coarse rate control OK

Page 14: Curbing Delays in Datacenters: Need Time to Save Time?

pFabric

Page 15: Curbing Delays in Datacenters: Need Time to Save Time?

DC Fabric: Just a Giant Switch

[Figure: hosts H1–H9 interconnected by the datacenter fabric]

Page 16: Curbing Delays in Datacenters: Need Time to Save Time?

DC Fabric: Just a Giant Switch

[Figure: the fabric abstracted as one giant switch, with hosts H1–H9 on the TX side and H1–H9 on the RX side]

Page 17: Curbing Delays in Datacenters: Need Time to Save Time?

DC Fabric: Just a Giant Switch

[Figure: hosts H1–H9 as TX ports and H1–H9 as RX ports of the giant switch]

Page 18: Curbing Delays in Datacenters: Need Time to Save Time?

DC transport = Flow scheduling on giant switch

Objective? Minimize average FCT

Constraints: ingress & egress port capacities

[Figure: hosts H1–H9 as TX (ingress) and RX (egress) ports of the giant switch]

Page 19: Curbing Delays in Datacenters: Need Time to Save Time?

"Ideal" Flow Scheduling

Problem is NP-hard [Bar-Noy et al.]
– Simple greedy algorithm: 2-approximation (sketched below)

[Figure: example flows between ingress ports 1–3 and egress ports 1–3]
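The greedy rule can be phrased simply: at each instant, scan flows in increasing order of remaining size and schedule a flow whenever both its ingress and egress ports are still free. A minimal sketch of that rule in the giant-switch model, my own phrasing rather than code from the talk:

```python
# Greedy flow scheduling on the "giant switch": smallest remaining flow
# first, subject to ingress and egress ports being free. Sketch of the
# greedy 2-approximation referenced above, not code from the talk.
def greedy_schedule(flows):
    """flows: list of dicts with 'src', 'dst', 'remaining' (bytes).
    Returns the flows to serve at this instant."""
    busy_src, busy_dst, scheduled = set(), set(), []
    for f in sorted(flows, key=lambda f: f["remaining"]):
        if f["src"] not in busy_src and f["dst"] not in busy_dst:
            scheduled.append(f)                # serve this flow at full rate
            busy_src.add(f["src"])             # its ingress port is now taken
            busy_dst.add(f["dst"])             # and so is its egress port
    return scheduled
```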

Page 20: Curbing Delays in Datacenters: Need Time to Save Time?

pFabric in 1 Slide

Packets carry a single priority #
• e.g., prio = remaining flow size

pFabric Switches
• Very small buffers (~10-20 pkts for 10Gbps fabric)
• Send highest priority / drop lowest priority pkts

pFabric Hosts
• Send/retransmit aggressively
• Minimal rate control: just prevent congestion collapse
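On the host side, the only per-packet work is stamping the priority field with the flow's remaining size before transmission. A minimal sketch of that tagging step; the field names and the raw_send function are illustrative, not the actual pFabric implementation:

```python
# Host-side priority stamping: prio = remaining flow size, so a flow's
# packets rise in priority (smaller number) as it nears completion.
# Field names and raw_send() are illustrative, not the pFabric code.
def send_flow(flow_bytes, payload_chunks, raw_send):
    remaining = flow_bytes
    for chunk in payload_chunks:
        pkt = {"prio": remaining, "payload": chunk}  # smaller prio = higher priority
        raw_send(pkt)
        remaining -= len(chunk)
```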

Page 21: Curbing Delays in Datacenters: Need Time to Save Time?

Key Idea

Decouple flow scheduling from rate control

Switches implement flow scheduling via local mechanisms

Hosts use simple window-based rate control (≈TCP) to avoid high packet loss

Queue buildup does not hurt performance → window-based rate control OK

Page 22: Curbing Delays in Datacenters: Need Time to Save Time?

pFabric Switch

Priority Scheduling: send highest priority packet first

Priority Dropping: drop lowest priority packets first

Small "bag" of packets per port; prio = remaining flow size

[Figure: switch port holding a small bag of packets with priorities, e.g., 7, 1, 9, 4, 3, 5]
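The per-port behavior above fits in a few lines: keep a small bag of packets; on dequeue, send the packet with the smallest priority number (highest priority, i.e., least remaining flow size); when the bag is full, drop the packet with the largest number. A minimal sketch of that discipline, assuming prio is the remaining flow size; this is not the actual switch ASIC logic:

```python
import heapq, itertools

# Per-port pFabric queue discipline, as described on this slide: small bag,
# highest priority (smallest remaining-size number) dequeued first, lowest
# priority dropped on overflow. Illustrative sketch, not the ASIC logic.
class PFabricPort:
    def __init__(self, capacity_pkts=20):         # ~10-20 pkts for a 10Gbps fabric
        self.capacity = capacity_pkts
        self.bag = []                              # min-heap of (prio, seq, pkt)
        self._seq = itertools.count()              # tie-breaker for equal priorities

    def enqueue(self, prio, pkt):
        heapq.heappush(self.bag, (prio, next(self._seq), pkt))
        if len(self.bag) > self.capacity:          # full: drop lowest-priority packet
            victim = max(range(len(self.bag)), key=lambda i: self.bag[i][0])
            self.bag[victim] = self.bag[-1]
            self.bag.pop()
            heapq.heapify(self.bag)

    def dequeue(self):
        """Send the highest-priority (smallest remaining-size) packet first."""
        return heapq.heappop(self.bag)[2] if self.bag else None
```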

Page 23: Curbing Delays in Datacenters: Need Time to Save Time?

pFabric Switch Complexity

Buffers are very small (~2×BDP per port)
– e.g., C = 10Gbps, RTT = 15µs → buffer ~ 30KB
– Today's switch buffers are 10-30x larger

Priority Scheduling/Dropping
– Worst case: minimum-size packets (64B) → 51.2ns to find min/max of ~600 numbers
– Binary comparator tree: 10 clock cycles
– Current ASICs: clock ~ 1ns
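The timing budget follows directly from the numbers: a 64B packet at 10Gbps occupies 64×8/10Gbps = 51.2ns on the wire, a buffer in the 30-40KB range holds on the order of 500-600 minimum-size packets, and a binary comparator tree over that many entries needs ceil(log2(N)) ≈ 10 levels, i.e. about 10 cycles at ~1ns per cycle. A quick check of that arithmetic (the exact buffer size used below is an assumption for illustration):

```python
import math

# Back-of-envelope check of the numbers on this slide (illustrative only).
pkt_bits  = 64 * 8                          # minimum-size packet, in bits
line_rate = 10e9                            # 10Gbps port
slot_ns   = pkt_bits / line_rate * 1e9      # time budget per scheduling decision
print(slot_ns)                              # 51.2 ns

buffer_pkts = 36_000 // 64                  # assumed ~36KB buffer, in 64B packets
tree_depth  = math.ceil(math.log2(buffer_pkts))
print(buffer_pkts, tree_depth)              # ~562 entries, 10 comparator levels
print(tree_depth * 1.0 < slot_ns)           # 10 cycles at ~1ns fit the 51.2ns budget
```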

Page 24: Curbing Delays in Datacenters: Need Time to Save Time?

Why does this work?

Invariant for ideal scheduling: at any instant, have the highest priority packet (according to the ideal algorithm) available at the switch.

• Priority scheduling → high priority packets traverse the fabric as quickly as possible

• What about dropped packets? Lowest priority → not needed until all other packets depart; buffer > BDP → enough time (> RTT) to retransmit

Page 25: Curbing Delays in Datacenters: Need Time to Save Time?

Evaluation (144-port fabric; Search traffic pattern)

[Figure: FCT (normalized to optimal in an idle fabric) vs. load (0.1–0.8) for Ideal, pFabric, PDQ, DCTCP, and TCP-DropTail]

Recall: "Ideal" is REALLY idealized!
• Centralized with full view of flows
• No rate-control dynamics
• No buffering
• No pkt drops
• No load-balancing inefficiency

Page 26: Curbing Delays in Datacenters: Need Time to Save Time?

Mice FCT (<100KB): Average and 99th Percentile

[Figure: two panels — average and 99th-percentile normalized FCT vs. load (0.1–0.8) for Ideal, pFabric, PDQ, DCTCP, and TCP-DropTail]

Page 27: Curbing Delays in Datacenters: Need Time to Save Time?

Conclusion

• Window-based rate control does not work at near-zero round-trip latency

• pFabric: simple, yet near-optimal
– Decouples flow scheduling from rate control
– Allows use of coarse window-based rate control

• pFabric is within 10-15% of "ideal" for realistic DC workloads (SIGCOMM'13)

Page 28: Curbing Delays in Datacenters: Need Time to Save Time?


Thank You!
