Revisiting Network Support for RDMA

Radhika Mittal¹, Alex Shpiner³, Aurojit Panda¹, Eitan Zahavi³, Arvind Krishnamurthy², Sylvia Ratnasamy¹, Scott Shenker¹

(1: UC Berkeley, 2: Univ. of Washington, 3: Mellanox Inc.)

netseminar.stanford.edu/seminars/03_16_17.pdf


Rise of RDMA in datacenters

[Figure: Traditional Networking Stack (User Application, OS, Hardware NIC) next to RDMA (User Application, OS, Specialized NIC), each showing how data is copied.]

RDMA enables low CPU utilization, low latency, high throughput


Current Status

•  RoCE (RDMA over Converged Ethernet)
– Canonical approach for deploying RDMA in datacenters.
– Needs a lossless network to get good performance.

•  Network made lossless using Priority Flow Control (PFC)
– Complicates network management; causes congestion spreading and deadlocks.


Is losslessness really needed? No!

Simple changes to NIC design enable similar and often better performance without losslessness.


Background


Evolution of RDMA

•  Traditionally used in InfiniBand clusters.
– Losses are rare (credit-based flow control).

•  NICs were not designed to deal with losses efficiently.
– Receiver discards out-of-order packets.
– Sender does go-back-N on detecting packet loss (via timeouts/negative acks).


Go-Back-N Loss Recovery

[Figure: packets 1 2 3 4 5 are sent; packets 2 3 4 5 are retransmitted.]

Retransmit all packets sent after the last acknowledged packet.
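The rule on this slide can be sketched in a few lines of Python. This is only an illustration of the go-back-N idea; the function name `go_back_n_retransmissions` and its arguments are hypothetical and do not come from any NIC implementation.

```python
def go_back_n_retransmissions(sent, last_acked):
    """Return the packets a go-back-N sender retransmits.

    sent: sequence numbers already transmitted, in order.
    last_acked: highest cumulatively acknowledged sequence number.
    """
    # Everything sent after the last acknowledged packet goes out again,
    # whether or not the receiver actually got it.
    return [seq for seq in sent if seq > last_acked]


# Packets 1..5 were sent and only packet 1 has been acknowledged.
print(go_back_n_retransmissions([1, 2, 3, 4, 5], last_acked=1))  # [2, 3, 4, 5]
```

A single lost packet therefore costs four retransmissions here, which is why this scheme depends on losses being rare.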


RDMA over Converged Ethernet

•  RoCE: RDMA over Ethernet fabric.
•  RoCEv2: RDMA over IP-routed networks.
•  InfiniBand transport was adopted as is.
– Go-back-N loss recovery.
– Needs losslessness for good performance.

•  Network was made lossless using PFC.


Why not iWARP?

•  Designed to support RDMA over generic (non-lossless) networks.

•  Implements entire TCP stack in hardware along with multiple other layers.

•  Not as popular as RoCE
– Less mature, consumes more power, more expensive.

RoCE + PFC emerged as the popular choice.


Priority Flow Control (PFC)

•  An XOFF frame is sent to pause transmission when queuing exceeds a certain threshold.

[Figure: Switch A and Switch B. An XOFF frame pauses transmission; an XON frame resumes it.]
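The pause/resume behaviour can be sketched as a toy queue model. The slide only says that XOFF is sent when queuing exceeds a threshold; the separate, lower XON threshold, the class name `PfcIngressQueue`, and the packet-granularity counters are assumptions made for illustration.

```python
class PfcIngressQueue:
    """Toy model of one PFC-enabled ingress queue (sizes in packets)."""

    def __init__(self, xoff_threshold, xon_threshold):
        # Assumption: resume only once the queue drains below a lower mark.
        assert xon_threshold < xoff_threshold
        self.xoff_threshold = xoff_threshold
        self.xon_threshold = xon_threshold
        self.occupancy = 0
        self.upstream_paused = False

    def enqueue(self):
        """A packet arrives; return "XOFF" if a pause frame must be sent."""
        self.occupancy += 1
        if not self.upstream_paused and self.occupancy > self.xoff_threshold:
            self.upstream_paused = True
            return "XOFF"
        return None

    def dequeue(self):
        """A packet departs; return "XON" if the upstream may resume."""
        if self.occupancy > 0:
            self.occupancy -= 1
        if self.upstream_paused and self.occupancy <= self.xon_threshold:
            self.upstream_paused = False
            return "XON"
        return None


queue = PfcIngressQueue(xoff_threshold=3, xon_threshold=1)
events = [queue.enqueue() for _ in range(4)] + [queue.dequeue() for _ in range(3)]
print(events)  # [None, None, None, 'XOFF', None, None, 'XON']
```

The model ignores the time it takes the XOFF frame to reach the upstream switch, which is exactly the gap that buffer headroom has to cover.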


Drawbacks of PFC

•  Adds complexity to network management.
– Need enough headroom to absorb all packets in flight.
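A rough back-of-the-envelope sketch of the headroom requirement follows. The formula is a simplification assumed here for illustration (one link round trip of data plus one maximum-size packet at each end); it ignores PFC response time and other implementation delays, and the helper name `pfc_headroom_bytes` and the 1500-byte MTU are hypothetical.

```python
def pfc_headroom_bytes(bandwidth_bps, one_way_delay_ns, mtu_bytes=1500):
    """Estimate per-port headroom needed after the XOFF threshold is crossed."""
    # Data keeps arriving while the XOFF frame travels upstream and while
    # packets already on the wire travel downstream: one link round trip.
    in_flight = bandwidth_bps * 2 * one_way_delay_ns // (8 * 10**9)
    # Assumption: one packet may be mid-transmission at each end of the link.
    return in_flight + 2 * mtu_bytes


# Illustrative values: a 10 Gbps link with a 2 us one-way delay.
print(pfc_headroom_bytes(10 * 10**9, 2_000))  # 8000
```

Because the estimate grows with both link speed and link length, the headroom has to be re-derived whenever either changes, which is part of the management burden.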


Drawbacks of PFC

•  Performance issues
– Unfairness, head-of-line (HoL) blocking


Drawbacks of PFC

Congestion Spreading


Drawbacks of PFC

Deadlocks caused by cyclic buffer dependency

[Figure: three switches labeled A, B, and C.]
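One way to picture a cyclic buffer dependency is as a cycle in a "waits on" graph between switches. The sketch below is a generic depth-first cycle check, not a method from the talk; the graph encoding and the name `has_cyclic_dependency` are assumptions.

```python
def has_cyclic_dependency(waits_on):
    """Detect a cycle in a directed graph given as {node: [nodes it waits on]}."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {node: WHITE for node in waits_on}

    def visit(node):
        colour[node] = GREY
        for nxt in waits_on.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                return True  # reached a node already on the current path
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in list(waits_on))


# Each switch's paused buffer waits on the next switch draining.
print(has_cyclic_dependency({"A": ["B"], "B": ["C"], "C": ["A"]}))  # True
print(has_cyclic_dependency({"A": ["B"], "B": ["C"], "C": []}))     # False
```

When such a cycle exists and every buffer on it is paused, no switch can drain first, so none of them ever resumes.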


Congestion Control with RoCE

Advanced congestion control for RoCE:

•  DCQCN: Zhu et al., SIGCOMM 2015 (Microsoft)
– ECN-based congestion control
– Implemented on NIC hardware (Mellanox ConnectX4)

•  Timely: Mittal et al., SIGCOMM 2015 (Google)
– Delay-based congestion control


Recent Works highlighting PFC Issues


•  RDMA over commodity Ethernet at scale, SIGCOMM 2016

•  Deadlocks in datacenter: why do they form and how to avoid them, HotNets 2016

•  Unlocking credit loop deadlock, HotNets 2016


Our approach

•  Based on the principle of iWARP
– NIC deals with packet losses efficiently.

•  But built on the implementation of RoCE
– Incorporates only the bare-minimum necessary features.


Is losslessness needed for RDMA?


Experimental Setup

•  Mellanox simulator modeling MLX CX4 NICs, built on OMNeT++/INET.

•  Three-layered fat-tree topology.
•  Links with capacity 10 Gbps and delay 2 us.
•  Heavy-tailed workload distribution at 70% utilization.


Metrics

•  Average Flow Completion Time

•  99%ile Flow Completion Time

•  Average Slowdown
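The three metrics can be computed as sketched below. The slide does not define slowdown; the sketch assumes the common definition (a flow's completion time divided by its completion time in an otherwise empty network) and uses a nearest-rank percentile. The function name `flow_metrics` and the sample values are made up for illustration.

```python
def flow_metrics(fcts, ideal_fcts):
    """Return (average FCT, 99th-percentile FCT, average slowdown).

    fcts: measured flow completion times.
    ideal_fcts: completion time of each flow in an empty network (assumed
                to be the baseline for slowdown).
    """
    n = len(fcts)
    avg_fct = sum(fcts) / n
    # Nearest-rank percentile: rank = ceil(0.99 * n), done in integers.
    rank = (99 * n + 99) // 100
    p99_fct = sorted(fcts)[rank - 1]
    avg_slowdown = sum(f / i for f, i in zip(fcts, ideal_fcts)) / n
    return avg_fct, p99_fct, avg_slowdown


# Made-up sample values, in arbitrary time units.
print(flow_metrics([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 2.0, 4.0]))  # (5.0, 8.0, 2.25)
```

Average FCT is dominated by long flows, while average slowdown weights every flow equally, so the two can move differently under a heavy-tailed workload.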


Current RoCE NICs

PFC helps a lot.


Key results

•  PFC is needed with current RoCE NICs.

•  PFC is not needed with IRN.

•  IRN performs better than current RoCE NICs.


Improved RoCE NIC (IRN)

•  Three key changes:
– Selective retransmit instead of go-back-N
– Back-off on losses
– Cap the number of outstanding bytes by the bandwidth-delay product (BDP)


IRN: no advanced congestion control

Disabling PFC gives much better performance.


IRN with advanced congestion control

PFC is not needed.


IRN vs RoCE

IRN performs better than RoCE.


Robustness of results

•  Tested a wide range of experimental scenarios:
-  Higher link bandwidths of 40 Gbps and 100 Gbps
-  More uniform workload distribution
-  Varying link delay
-  Varying link utilization

•  Our key takeaways hold across all of these scenarios.


Efficient RDMA transport does not require a lossless network.

With simple NIC changes, a generic out-of-the-box Ethernet network outperforms a lossless network.


Deep Dive


Need for Selective Retransmit

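[Figure: Go-Back-N: packets 1 2 3 4 5 are sent and packets 2 3 4 5 are retransmitted. Selective Retransmit: packets 1 2 3 4 5 are sent and only packet 2 is retransmitted.]

Disabling selective retransmit for IRN+DCQCN increased avg FCT by 28% for our default scenario.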
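The difference between the two recovery schemes in the figure can be sketched as follows. Both functions are illustrative only; their names and the use of a `lost` set are assumptions, and real NICs learn about losses from acks rather than being told directly.

```python
def go_back_n(sent, lost):
    """Retransmit everything from the first lost packet onwards."""
    if not lost:
        return []
    first_lost = min(lost)
    return [seq for seq in sent if seq >= first_lost]


def selective_retransmit(sent, lost):
    """Retransmit only the packets that were actually lost."""
    return [seq for seq in sent if seq in lost]


sent, lost = [1, 2, 3, 4, 5], {2}
print(go_back_n(sent, lost))             # [2, 3, 4, 5]
print(selective_retransmit(sent, lost))  # [2]
```

Selective retransmit requires the receiver to keep out-of-order packets 3, 4, and 5 instead of discarding them.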


Benefit of Backing Off on Losses

•  Exploit losses as a congestion signal.

•  Found to be particularly useful:
o  When no advanced congestion control is being used.
-  Disabling this feature for IRN increased avg FCT by 158% for our default scenario.
o  For scenarios where advanced congestion control does not react fast enough.
-  Disabling this feature for IRN+DCTCP increased avg FCT by 140% when link bandwidth is 100 Gbps.


Benefit of capping by BDP

•  Upper bound on number of out-of-order packets.

•  Improves performance irrespective of whether PFC is enabled or disabled.
-  Disabling this feature for IRN+DCQCN increased the avg FCT by 64% for our default scenario.
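A minimal sketch of the BDP cap, under stated assumptions: the RTT below is derived from a hypothetical longest path of six 2 us links each way in a three-layer fat-tree, ignoring queuing and serialization delay, and the helper names `bdp_bytes` and `can_send` are made up for illustration.

```python
def bdp_bytes(bandwidth_bps, rtt_ns):
    """Bandwidth-delay product in bytes (integer arithmetic)."""
    return bandwidth_bps * rtt_ns // (8 * 10**9)


def can_send(bytes_in_flight, packet_bytes, bdp_cap_bytes):
    """Allow a new packet only if outstanding bytes stay within the cap."""
    return bytes_in_flight + packet_bytes <= bdp_cap_bytes


# Assumed RTT: 6 hops x 2 us per hop x 2 directions = 24 us, at 10 Gbps.
cap = bdp_bytes(10 * 10**9, 24_000)
print(cap)                         # 30000
print(can_send(29_000, 1_000, cap))  # True
print(can_send(29_500, 1_000, cap))  # False
```

Since a flow can never have more than `cap` bytes outstanding, the number of packets that can arrive out of order behind a loss is bounded by the same amount.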


Design Details

•  IRN supports both window mode and rate mode.
•  Start at line rate (or with cwnd = BDP).
•  Losses detected via:
o  Three dupacks
-  selective retransmit + reduce rate or cwnd by half
o  Partial ack
-  selective retransmit
o  Fixed timeout
-  go-back-N + reduce rate or cwnd by half
•  Option for additive increase on new acks.
•  For RDMA reads: requester generates read acks.
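The loss-detection rules listed above can be sketched as a small dispatch function in window mode. The signal names, the `Reaction` type, and the function `react_to_loss` are hypothetical; only the mapping from signal to recovery action and window change follows the slide.

```python
from dataclasses import dataclass


@dataclass
class Reaction:
    recovery: str    # which retransmission scheme to use
    cwnd_bytes: int  # congestion window after reacting


def react_to_loss(signal, cwnd_bytes):
    """Map a loss signal to the recovery action and new window."""
    if signal == "three_dupacks":
        return Reaction("selective_retransmit", cwnd_bytes // 2)
    if signal == "partial_ack":
        # Recovery continues, but the window is not reduced again.
        return Reaction("selective_retransmit", cwnd_bytes)
    if signal == "timeout":
        return Reaction("go_back_n", cwnd_bytes // 2)
    raise ValueError(f"unknown loss signal: {signal}")


# Starting from cwnd = BDP, taken here as an illustrative 30000 bytes.
print(react_to_loss("three_dupacks", 30_000))
print(react_to_loss("partial_ack", 15_000))
print(react_to_loss("timeout", 15_000))
```

Rate mode would follow the same mapping with the sending rate halved instead of the window.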


Implementation Feasibility

•  Support for out-of-order packet delivery at the receiver
– New feature in Mellanox CX5 NICs for adaptive routing.
– Straightforward to implement selective retransmit.

•  Other IRN changes: few additional instructions.

•  Increase in per-flow state: less than 3%.


Summary

•  IRN performs better than RoCE and does not need PFC.

•  Holds across various congestion control algorithms and experimental scenarios.

•  The NIC changes required are simple and feasible.


Thank you!