
Page 1:

Reducing Latency in Multi-Tenant Data Centers via Cautious Congestion Watch

The 49th International Conference on Parallel Processing (ICPP) 2020

Ahmed M. Abdelmoniem, Hengky Susanto, and Brahim Bensaou

Page 2:

Outline

• Introduction and background

• Preliminary Investigation

• Exploring solution design space

• The solution (HWatch) design

• Evaluation

• Conclusion

Page 3:

Living in the Big Data Era

• Massive amount of data being generated, collected, and processed daily.

› Data for recommendation, machine learning, business intelligence, scientific research, etc.

• To accelerate the processing speed, this massive amount of data must be processed in parallel.

Page 4:

Congestion Control in Data Center

• Data parallel processing is generally conducted in a data center.

• Modern data centers employ Data Center TCP (DCTCP).

• DCTCP is a TCP-like congestion control protocol designed for data center networks.

• DCTCP leverages Explicit Congestion Notification (ECN) to provide multi-bit feedback to the end host (its window update rule is sketched after the figure below).

[Figure: a data-parallel processing application spanning multiple hosts, each running DCTCP.]
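As a rough, illustrative sketch (not the authors' code), the Python snippet below mimics DCTCP's standard sender-side rule: the fraction of ECN-marked ACKs per window is smoothed into an estimate alpha, and the congestion window is cut in proportion to alpha instead of being halved. The gain g = 1/16 is the commonly recommended DCTCP value, not a number from this talk.

```python
# Minimal sketch of DCTCP's sender-side window update (illustrative, not the
# authors' implementation). alpha is an EWMA of the fraction of ECN-marked ACKs.

def dctcp_update(cwnd, alpha, marked_acks, total_acks, g=1.0 / 16):
    """Return (new_cwnd, new_alpha) after one window of ACKs."""
    frac_marked = marked_acks / total_acks if total_acks else 0.0
    alpha = (1 - g) * alpha + g * frac_marked      # smooth the congestion estimate
    if marked_acks:
        cwnd = max(1.0, cwnd * (1 - alpha / 2))    # cut in proportion to congestion
    else:
        cwnd += 1                                   # additive increase, as in TCP
    return cwnd, alpha

if __name__ == "__main__":
    cwnd, alpha = 10.0, 0.0
    for marked in (0, 2, 8, 10, 0):                # marked ACKs out of 10 per RTT
        cwnd, alpha = dctcp_update(cwnd, alpha, marked, 10)
        print(f"marked={marked:2d}  alpha={alpha:.3f}  cwnd={cwnd:.2f}")
```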

Page 5:

DCTCP Overview

[Figure: DCTCP overview. Sender, switch, and receiver: packets queued beyond the switch buffer's ECN threshold are ECN-marked, and the receiver's TCP ACKs carry the ECN echo back to the sender.]
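A toy model of the marking step in the figure above, with an assumed threshold K (the talk does not give concrete values): packets are marked once the instantaneous queue length exceeds K.

```python
# Toy model of threshold-based ECN marking at the switch (illustrative only).

ECN_THRESHOLD_K = 20   # packets; assumed value, DCTCP recommends a small K

def enqueue(queue, packet, buffer_size=100):
    """Enqueue a packet, marking it CE if the queue already exceeds K."""
    if len(queue) >= buffer_size:
        return "dropped"
    packet["ce"] = len(queue) >= ECN_THRESHOLD_K
    queue.append(packet)
    return "marked" if packet["ce"] else "queued"

if __name__ == "__main__":
    q = []
    outcomes = [enqueue(q, {"seq": i}) for i in range(30)]
    print(outcomes.count("queued"), "unmarked,", outcomes.count("marked"), "marked")
```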

Page 6:

Incast Problem in Datacenter Network

[Figure: Workers 1-3 sending simultaneously to an aggregator through a shared switch buffer; the ECN threshold is exceeded and packets are dropped.]

• Incast is a bloated-buffer incident caused by bursts of packets arriving at the same time.

• It is a common occurrence for data-parallel processing in data center networks.

• It often leads to network- and application-level performance degradation (a toy calculation follows below).

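The toy calculation promised above, with assumed numbers: when many workers release their bursts in the same instant, everything beyond the shared buffer's capacity is dropped.

```python
# Toy incast calculation (illustrative, with assumed numbers): synchronized
# bursts from many workers overflow a shallow shared switch buffer.

def incast_drops(num_workers, burst_pkts, buffer_pkts, drained_per_tick=0):
    """Packets dropped when all workers send their burst in the same instant."""
    arriving = num_workers * burst_pkts
    return max(0, arriving - buffer_pkts - drained_per_tick)

if __name__ == "__main__":
    # e.g. 40 workers x 5-packet responses into a 100-packet buffer
    print(incast_drops(num_workers=40, burst_pkts=5, buffer_pkts=100))  # -> 100 dropped
```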

Page 7:

Preliminary Investigation

A better understanding of how well DCTCP copes with the incast problem.

Page 8:

Investigating Incast Problem

We conduct a preliminary experiment with the NS2 simulator on a dumbbell network topology.

• Short-lived flows are sensitive to the choice of the initial sending window size.

• DCTCP marking results in more aggressive acquisition of the available bandwidth.

• This favors large flows (e.g. data backup traffic) over short-lived flows (e.g. search queries), which can degrade the performance of short-lived flows.

Page 9:

The Root of The Performance Degradation

• Observation 1: Short-lived flows (e.g. 1 to 5 packets) do not have enough packets to generate 3 DUPACKs and trigger TCP's fast retransmission scheme.

• Or the sender may lose the entire window of packets.

• The sender must then rely on the TCP RTO to detect packet drops (the default Linux RTO is 200 ms to 300 ms); a quick check of this arithmetic is sketched below.

• Observation 2: The incast problem primarily affects short-lived flows.

• Today's data-parallel processing applications (e.g. MapReduce) generate many short-lived flows.
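The quick check referenced above, under simplifying assumptions (a single loss, no SACK, the classic 3-duplicate-ACK trigger): a loss is recovered by fast retransmit only if at least three later packets of the same flow still arrive; otherwise the sender stalls for the RTO.

```python
# Simplified check of when fast retransmit can fire (assumes a single loss,
# no SACK, and the classic 3-duplicate-ACK trigger).

MIN_RTO_MS = 200   # default Linux minimum RTO cited in the talk (200-300 ms)

def recovery(flow_size_pkts, lost_pkt_index):
    """Return how a single loss in a flow of the given size is recovered."""
    dupacks = flow_size_pkts - lost_pkt_index - 1   # packets sent after the loss
    return "fast retransmit" if dupacks >= 3 else f"RTO (~{MIN_RTO_MS} ms stall)"

if __name__ == "__main__":
    for size in (2, 4, 5, 10):
        print(size, "pkts, first pkt lost ->", recovery(size, 0))
        print(size, "pkts, last pkt lost  ->", recovery(size, size - 1))
```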

Page 10:

Our findings show that short-lived flows in DCTCP still suffer from packet losses due to the use of large initial congestion windows, which results in the incast problem.

Page 11:

Research Question

What is the optimal choice of initial congestion window for DCTCP that mitigates the incast problem while minimizing the average completion time of short-lived flows?

Page 12:

Exploring the Solution Design Space

To provide a holistic view of the problem and a better understanding of the complex interactions between network components (switch, sender, and receiver).

Page 13:

Exploring The Design Space at Sender and Receiver


Due to the nature of TCP-based protocols, there is a wealth of information available at the sender.
• Number of transmitted packets, congestion window size, RTT, etc.

The receiver is naturally positioned to evaluate the ECN-marking information carried by inbound packets.

Combining the information gathered at the sender and the receiver gives the sender a richer and more holistic view of the network condition.
• For instance, the number of packets dropped can be approximated by subtracting the number of packets received by the receiver from the total number of packets transmitted by the sender (see the sketch below).
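A minimal sketch of that drop estimate (illustrative function and counter names):

```python
# Minimal sketch of the drop estimate described above: the sender's transmit
# counter minus the receiver's receive counter approximates in-network losses.

def estimate_drops(pkts_sent_by_sender, pkts_seen_by_receiver):
    return max(0, pkts_sent_by_sender - pkts_seen_by_receiver)

if __name__ == "__main__":
    print(estimate_drops(pkts_sent_by_sender=1000, pkts_seen_by_receiver=968))  # -> 32
```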

Page 14:

Exploring The Design Space at The Switch

[Figure: Workers 1-3 sending to an aggregator through the switch buffer; transmissions are staggered between time t and time t+1.]

Observation: Transmitting packets at different times helps mitigate packet drops.


Page 15:

Visualizing Queue Management as Bin-Packing

[Figure: packets from Workers 1-3 packed into the switch buffer (the bin) across time t and time t+1.]

Visualize the switch buffer as the bin.

Visualize packets as the items that are packed into the bin.

Page 16:

Visualizing Incast As Bin-Packing

• A different perspective for understanding the incast problem.
• A starting point for thinking about how to transmit a bulk of short-lived flows, each consisting of only a few packets.

• This allows us to inherit wisdom from earlier studies of the bin-packing problem.
• Our solution draws inspiration from the classic Next Fit algorithm (sketched after this list).

• We emulate the bin-packing problem by utilizing the ECN mechanism used in DCTCP.
• The ECN setup follows the recommendations for DCTCP.
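For reference, a sketch of the classic Next Fit rule mentioned above (generic bin packing, not the HWatch emulation itself): keep a single open bin and start a new one whenever the next item does not fit.

```python
# Classic Next Fit bin-packing (reference sketch, not the HWatch emulation):
# keep one open bin; if the next item does not fit, close it and open a new one.

def next_fit(items, bin_capacity):
    bins, current, free = [], [], bin_capacity
    for item in items:
        if item > free:                 # does not fit: open a new bin
            bins.append(current)
            current, free = [], bin_capacity
        current.append(item)
        free -= item
    if current:
        bins.append(current)
    return bins

if __name__ == "__main__":
    # items = packet bursts, capacity = free buffer space per round
    print(next_fit([4, 3, 5, 2, 6, 1], bin_capacity=8))
    # -> [[4, 3], [5, 2], [6, 1]]
```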

Page 17:

Theoretical Results

Key insights gained from our theoretical results:

• The initial congestion window size can be approximated by leveraging information based on the number of packets marked and unmarked by ECN.

• Given the number of packets marked by ECN, the first n packets should be transmitted in at least two batches (rounds); a rough sketch follows below.
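The talk states the insight but not the exact formula, so the sketch below only illustrates its shape under assumed rules: scale the initial window by the fraction of unmarked packets, and never send the first n packets in fewer than two rounds.

```python
# Illustrative sketch only: the talk states the insight (scale the initial
# window by ECN feedback and use at least two batches) but not the exact
# formula, so the scaling rule below is an assumption.

def initial_cwnd(marked, unmarked, base_icw=10, min_cwnd=2):
    """Shrink the initial window by the observed fraction of ECN-marked packets."""
    total = marked + unmarked
    unmarked_frac = unmarked / total if total else 1.0
    return max(min_cwnd, int(base_icw * unmarked_frac))

def batches(n_pkts, cwnd):
    """Split the first n packets into at least two rounds of cwnd packets each."""
    per_round = min(cwnd, max(1, n_pkts // 2))   # force >= 2 rounds when n > 1
    return [min(per_round, n_pkts - i) for i in range(0, n_pkts, per_round)]

if __name__ == "__main__":
    cwnd = initial_cwnd(marked=7, unmarked=3)    # heavy marking -> small window
    print(cwnd, batches(10, cwnd))               # -> 3 [3, 3, 3, 1]
```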

Page 18:

From Theory to Practice - Practical Challenges

Despite the encouraging theoretical results, turning the theory into practice requires addressing several practical challenges.

• The incast problem is a distributed, online problem.
• Bin-packing is an offline problem.

• Short-lived flows (e.g. 1 to 5 packets) may complete before the sender receives the ECN echo via ACK packets.
• Short-lived flows only learn about in-network congestion after receiving the ECN echo, which is too late.

Page 19:

System Design Requirements of The Solution

• Improves the performance of short-lived flows with minimal impact on large flows.

• Simple to deploy in the data center.

• Independent of the TCP variant.

• No modifications to the VM’s network stack.

• Addresses the practical challenges.

Page 20:

HWatch (Hypervisor-based congestion watching mechanism)

Our solution: an active network-probing scheme that incorporates insights from our theoretical results to determine the initial congestion window based on the level of congestion in the network.

Page 21:

HWatch System Design in a Nutshell

• Injects probe packets at connection start-up, during the TCP synchronization (handshake) stage.

• The probe packets carry the ECN marks to the receiver in the case of congestion.

• The receiver sets the receive window (RWND) according to the number of probe packets carrying an ECN mark.

• The RWND is conveyed to the sender via the ACK packet.

• The sender sets the initial congestion window (ICWND) according to the RWND (this hand-off is sketched after the figure below).

[Figure: HWatch in the hypervisor shim layer. Packets from VM flows S1-D1, S2-D2, and S3-D3 are intercepted at the IN hook (Pre_Route / ip_rcv) and OUT hook (Post_route / ip_finish); a per-flow table tracks ECN-marked probes (e.g. S1:D1 -> 7, S2:D2 -> 3, S3:D3 -> 5), the receiver returns the ECN echo, and rwnd_update rewrites the window carried in the ACKs D1:ACK, D2:ACK, D3:ACK before they reach the sender's NIC.]
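A hedged sketch of the hand-off described above (the per-flow table and the exact marked-probes-to-RWND mapping are assumptions for illustration): the receiver-side shim counts ECN-marked probes per flow and advertises a correspondingly smaller receive window, which the sender-side shim then uses to cap the initial congestion window.

```python
# Hedged sketch of the HWatch hand-off described above. The per-flow table and
# the exact marked-probes -> rwnd mapping are assumptions for illustration.

MSS = 1448             # bytes per segment (typical Ethernet MSS)
MAX_ICW_SEGMENTS = 10  # assumed cap on the initial window

ecn_table = {}         # flow id -> number of ECN-marked probe packets seen

def receiver_observe_probe(flow, ce_marked):
    """Receiver-side shim: count marked probes per flow."""
    ecn_table[flow] = ecn_table.get(flow, 0) + (1 if ce_marked else 0)

def receiver_rwnd(flow, probes_sent):
    """Advertise a window shrunk by the fraction of marked probes (assumed rule)."""
    marked = ecn_table.get(flow, 0)
    clear = probes_sent - marked
    segments = max(1, round(MAX_ICW_SEGMENTS * clear / probes_sent))
    return segments * MSS          # value carried in the ACK's window field

def sender_icwnd(rwnd_bytes):
    """Sender-side shim: clamp the initial congestion window to the echoed rwnd."""
    return min(MAX_ICW_SEGMENTS, max(1, rwnd_bytes // MSS))

if __name__ == "__main__":
    for i in range(8):
        receiver_observe_probe("S1:D1", ce_marked=(i % 2 == 0))  # 4 of 8 marked
    rwnd = receiver_rwnd("S1:D1", probes_sent=8)
    print(rwnd, "bytes ->", sender_icwnd(rwnd), "segments")
```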

Page 22:

HWatch Implementation Overview

• The HWatch module is implemented in the shim layer at both the sender and the receiver.
• HWatch is implemented via Netfilter by inserting hooks in the forwarding path.

• HWatch is realized by modifying the Open vSwitch kernel datapath module.
• This is done by adjusting the flow-processing table.

• By doing so, HWatch is deployment-friendly in production data centers (a simplified model of the interception path follows the figure below).

[Figure: HWatch implementation. In the Open vSwitch kernel datapath (receive_from_vport, extract_key, flow_lookup, action_execute, do_output, send_to_vport), a packet-interceptor hook maintains a TCP flow table with ECN tracking and performs the RWND update; new flows are punted to the vSwitch daemon in user space (send_packet_to_vswitchd, handle_packet_cmd). A corresponding HWATCH hook (packet interceptor) sits on the local TCP/IP stack's prerouting/forward/postrouting path.]
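The simplified model referenced above: a user-space stand-in for the interception path (the real HWatch module sits in Netfilter hooks and the Open vSwitch kernel datapath). Packets are plain dictionaries, and the RWND rule mirrors the assumed sketch earlier.

```python
# Simplified, user-space model of the interception path (the real HWatch module
# sits in Netfilter hooks / the Open vSwitch kernel datapath; packets here are
# plain dicts, and the rwnd rule mirrors the assumed sketch above).

flow_table = {}   # flow id -> {"marked": int, "total": int}

def in_hook(pkt):
    """PRE_ROUTING-style hook: track ECN marks on inbound data packets."""
    entry = flow_table.setdefault(pkt["flow"], {"marked": 0, "total": 0})
    entry["total"] += 1
    entry["marked"] += 1 if pkt.get("ce") else 0
    return pkt

def out_hook(ack, max_segments=10, mss=1448):
    """POST_ROUTING-style hook: shrink the advertised window on outbound ACKs."""
    entry = flow_table.get(ack["flow"])
    if entry and entry["total"]:
        clear_frac = 1 - entry["marked"] / entry["total"]
        ack["window"] = max(mss, int(max_segments * clear_frac) * mss)
    return ack

if __name__ == "__main__":
    for ce in (True, True, False, True):
        in_hook({"flow": "S1:D1", "ce": ce})
    print(out_hook({"flow": "S1:D1", "window": 65535}))   # window field rewritten
```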

Page 23:

Evaluation and Analysis

(1) Simulation experiments.
(2) Testbed experiments.

Page 24:


Simulation Experiment Setups

• NS2 simulator.

• Hundreds of servers connected by commodity switches.

• Compared against different TCP traffic sources (TCP-DropTail, TCP-RED, and DCTCP).

[Figure: fat-tree topology with core routers, aggregation switches, ToR switches, and servers organized into pods.]

Page 25:

Simulation Experiment Results

• HWatch improves the performance of short-lived flows by up to 10x compared to TCP-DropTail, TCP-RED, and DCTCP.
• Minimal impact on the performance of large flows.

[Figure: performance of short-lived and long-lived flows in a scenario with over 100 sources. Panels: average FCT of short-lived flows, average goodput of long-lived flows, persistent queue over time, and bottleneck utilization over time.]

Page 26:

Testbed Experiment Setup

• We build and deploy the HWatch prototype in a mini data center.

• A fat-tree topology connecting 4 racks of DC-grade servers installed with the Incast Guard end-host module.

• Commodity (EdgeCore) Top of Rack switches.

• The core switch is a PC equipped with a NetFPGA switch.

[Figure: testbed fat-tree topology with 4 racks, ToR switches, and a NetFPGA-based core switch; links run at 1 Gb/s.]

Page 27:

Testbed Experiment Results

Testbed experiments confirm that HWatch mitigates packet drops.

• Improves the performance of short-lived flows by up to 100%.

• Minimal impact on the performance of large flows.

[Figure: average FCT of short-lived flows and average goodput of long-lived flows.]

Page 28:

Key Insights From HWatch Performance

• Dispersing packet transmissions over time creates smaller incasts.
• This allows the buffer to absorb the incoming packets with fewer packet drops.

• Fewer packet drops allow short-lived flows to achieve faster completion times.

• HWatch stochastically prioritizes the available buffer space for short-lived flows, because probe packets indirectly reserve buffer space for them.

• The probing mechanism functions as an incast early-warning system, allowing long-lived flows that are actively sending data to scale back and release some buffer space for short-lived flows.

Page 29:

Conclusion

• HWatch mitigates the incast problem in the data center.

• HWatch strikes a balance between improving the performance of short-lived flows and imposing minimal impact on large flows.

• HWatch is deployment-friendly in production data centers.

Page 30:

Thank You

Page 31:

Exploring The Design Space at The Switch

[Figure: illustration of the switch buffer experiencing incast. At time t, X packets are already queued in the buffer when a burst of Y packets (the incast) arrives; packets beyond the buffer size are dropped due to buffer overflow, and the packets queued at time t take until time t' to drain.]
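The relationship in this backup figure can be written down directly; a small sketch with assumed numbers and an assumed link rate:

```python
# Arithmetic behind the backup figure (illustrative; the link rate is assumed):
# a burst of Y packets arriving on top of X queued packets overflows a buffer
# of size B, and the backlog at time t takes roughly queued / C to drain.

def incast_outcome(x_queued, y_burst, buffer_pkts, drain_rate_pps):
    dropped = max(0, x_queued + y_burst - buffer_pkts)
    queued = min(buffer_pkts, x_queued + y_burst)
    drain_time_ms = 1000.0 * queued / drain_rate_pps
    return dropped, drain_time_ms

if __name__ == "__main__":
    # e.g. 30 queued + a 120-packet burst into a 100-packet buffer
    # at ~83k pkt/s (roughly 1 Gb/s with 1500-byte packets)
    print(incast_outcome(30, 120, 100, 83_000))   # -> (50, ~1.2 ms)
```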