enabling wide-spread communications on optical fabric … · enabling wide-spread communications on...

31
Enabling Wide-spread Communications on Optical Fabric with MegaSwitch Li Chen Kai Chen, Zhonghua Zhu, MinlanYu, George Porter, Chunming Qiao, Shan Zhong

Upload: tranthien

Post on 06-Aug-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

Enabling Wide-spread Communications on Optical Fabric with MegaSwitch

Li Chen

Kai Chen, Zhonghua Zhu, MinlanYu, George Porter, Chunming Qiao, Shan Zhong

Optical Networking in Data Centers

• Data center traffic demand is growing

3/29/17 SING-CSE-HKUST 2

Optical Fabric

Jupiter Rising [Sigcomm’15]

• Optical networking in data centers• Low cost• Low power consumption• Low wiring complexity• High one-to-one bandwidth

How to design an optical fabric that enables high bisection bandwidth?

• Optical fabric usually provides one-to-one high-bandwidth circuits.• Data Center traffic is wide-spread

Optical Networking in Data Centers

3/29/17 SING-CSE-HKUST 3

How to design an optical fabric that supports high-bandwidth & wide-spread traffic?

Microsoft Data Center Network [ProjecToR, Sigcomm’16]

Prior Works

• 2010: C-Through, Helios• 2012: OSA• 2013: Mordia, ReacToR

3/29/17 SING-CSE-HKUST 4

Prior works reuse wavelengths temporally to meet traffic demand

t1 t2 time

Schedule 1 Schedule 2 Schedule 3

1 2 3

1 - 1 0

2 0 - 1

3 1 0 -

1 2 3

1 - 0 1

2 1 - 0

3 0 1 -

1 2 3

1 - 0 1

2 0 - 0

3 1 0 -1 2 3

1 - 1 1

2 1 - 1

3 2 1 -

Traffic Demand Matrix

Src

Dest

Wavelength/Circuit assignments

BvN decomp.

High Bisection Bandwidth + Wide-Spread Connectivity

• 2010: C-Through, Helios• 2012: OSA• 2013: Mordia, ReacToR

3/29/17 SING-CSE-HKUST 5

Prior works take several rounds to meet a wide-spread demand

MegaSwitch: Meet a wide-spread demand simultaneously

t1 t2 time

Schedule 1 Schedule 2 Schedule 3

timeParallel circuits

Circuit 1 Circuit 2 Circuit 3

…using MegaSwitch

MegaSwitch Data PlaneEnabling Spatial Reuse of WavelengthPrototype implementation

3/29/17 SING-CSE-HKUST 6

Multiplexer

• MUX: k input wavelengths, 1 output fiber

4x1MUX

3/29/17 SING-CSE-HKUST 7

Demultiplexer

• DEMUX: 1 input fiber, k output wavelengths

3/29/17 SING-CSE-HKUST 8

1x4DEMUX

Wavelength Selective Switch

• WSS: w input fibers, 1 output fiber• Same set of wavelengths can be reused on different fibers.

4x1WSS

3/29/17 SING-CSE-HKUST 9

Fiber 1

Fiber 2

Fiber 3

Fiber 4

Wavelength Selective Switch

• WSS: w input fibers, 1 output fiber• Same set of wavelengths can be reused on different fibers.

4x1WSS

3/29/17 SING-CSE-HKUST 10

Fiber 1

Fiber 2

Fiber 3

Fiber 4

Sending

3/29/17 SING-CSE-HKUST 11

ToR 1 ToR 2 ToR 3

Send using k wavelengths

Each ToR sends on its own fiber

Receiving

3/29/17 SING-CSE-HKUST 12

Recv from w other nodes

Select k wavelengths from w●k wavelengths

ToR 1 ToR 2 ToR 3

MegaSwitch: Full 3-Node Example

3/29/17 SING-CSE-HKUST 13

ToR 1 ToR 2 ToR 3

Recv from w other nodes

Send using k wavelengths

Select k wavelengths from w●k wavelengths

Each ToR sends on its own fiber

MegaSwitch: Scalability

3/29/17 SING-CSE-HKUST 14

Recv from w other nodes

Send using k wavelengths

Select k wavelengths from w●k wavelengths

Each ToR sends on its own fiber

ToR 1 ToR 2 ToR 3

Key parameters: w = 2, n = w+1 =3, k =4, Port Count = n x k

# nodes on ring

WSS port count # wavelength on a fiber

With current technology, MegaSwitch can scale to n x k = 33 x 192 = 6336 Ports

Unicast from H1 to H12

1, Control plane select Blue as the wavelength for the unicast

3/29/17 SING-CSE-HKUST 15

Unicast from H1 to H12

2, Configure WSS in Node 3 to select Blue in Fiber from Node 11, Control plane select Blue as the wavelength for the unicast

3/29/17 SING-CSE-HKUST 16

Unicast from H1 to H12

3, Setup routing in both EPSes2, Configure WSS in Node 3 to select Blue in Fiber from Node 11, Control plane select Blue as the wavelength for the unicast

3/29/17 SING-CSE-HKUST 17

Multicast from H1 to H5, H6, H7, and H10

3, Setup routing in both EPSes

2, Configure WSS in OWS2 and OWS3 to select Red in Fiber from OWS1

1, Control plane select Red as the wavelength for the multicast

3/29/17 SING-CSE-HKUST 18

MegaSwitch: Full 3-Node Example

3/29/17 SING-CSE-HKUST 19

OWS+PRF box OWS+PRF box OWS+PRF box

OWS+PRF box

Prototype Implementation

• Implemented OWS+PRF box for practical deployment.• Implemented prototype with 40 × 10Gbps

Ports • 5 nodes (OWS-EPS)• 8 wavelengths per node

3/29/17 SING-CSE-HKUST 20

High Port Count & Low Switching Latency

• Lowest WSS switching latency reported: 11.5us [Mordia, Sigcomm’13]

• Digital Light Processing (DLP) technology.• Cannot scale beyond 8 ports with 11.5us switching

• MegaSwitch need a large WSS port count to scale to more ports• Liquid Crystal tech. is a middle-ground in terms of both port count (10~100s) and

switching latency (milliseconds).• Measured WSS switching latency: ~3ms

• Milliseconds switching latency is a hard limit for now.• Optics community are working on it…

• How to mitigate impact to short flows?

3/29/17 SING-CSE-HKUST 21

…cannot be achieved at the same time…

MegaSwitch Control PlaneBasemesh for latency-sensitive applications

3/29/17 SING-CSE-HKUST 22

Basemesh

• Problem:• …when traffic matrix changes quickly• …when traffic matrix is estimated incorrectly

• Basemesh: a flexible overlay network on MegaSwitch to provide consistent connectivity for low latency traffic. • Each node dedicates b wavelengths to construct an overlay network on fully

connected fiber mesh.

3/29/17 SING-CSE-HKUST 23

Dynamic Allocation

BasemeshConnectivity

b wavelengths

k-b wavelengths

Basemesh: Learning from DHT literature

• MegaSwitch uses Symphony [USITS’ 13] DHT topology for basemeshconstruction. • b (“routing table size”) is adjustable for varying degree of traffic volatility • Guaranteed average latency (“average hop count per look-up”)

3/29/17 SING-CSE-HKUST 24

Basemesh b=5, Avg Hops =1Fully connected mesh network (w=5)Recommended for low latency apps

Basemesh b=3Avg Hops = 1.4

Basemesh b=1Avg Hops = 2.5

EvaluationsTestbed benchmarkReal application deployments

3/29/17 SING-CSE-HKUST 25

Prototype Evaluation

• Setting:• 5 nodes (OWS-EPS pair) • 8 wavelengths per node• 8 servers per node• Out-of-band control plane for EPS and OWS• Traffic demand matrices are known

3/29/17 SING-CSE-HKUST 26

Basic measurements

• Host-level stride• All-to-all pattern [Helios, Sigcomm’10]

• Every 10s, one wavelength changes per rack.

• Measured ~20ms total reconfiguration delay.• WSS (~3ms), EPS routing (~5ms),

transceiver initialization (~10ms)…

3/29/17 SING-CSE-HKUST 27

~50Gbps

MegaSwitch achieves full-bisection bandwidth when circuit is stable

Redis on MegaSwitch

• Latency-sensitive application• 1 Million GET/SET requests from all nodes to a server in Node1

3/29/17 SING-CSE-HKUST 28

Fully connected basemesh à Uniform latency (one-hop)

32 30 32

142

9167

12789

6692

68 6767 67 68

0

50

100

150

1 2 4

Mic

rose

cond

s

Number of Basemesh Wavelengths (b)

Average Query Completion Time

Node 1

Node 2

Node 3

Node 4

Node 5

Apache Spark on MegaSwitch

• Parallel computing applications• First connect the servers to a

single ToR switch, and measure the bandwidth demand • MegaSwitch updates wavelength

assignment every 1 sec.• <10 reconfigurations in run-time.

3/29/17 SING-CSE-HKUST 29

MegaSwitch performs similar to the optimal scenario: all servers in the same rack

212

320

415

218

336

423

0

100

200

300

400

500

KMeans WordCount WikiPageRank

Seco

nds

Job Completion Time

Single ToR MegaSwitch (1s)

Summary

• Spatial reuse of wavelengths to provide non-blocking connectivity for all ports.• Basemesh to provide consistent connectivity to latency-sensitive

flows.• Practical implementation of a 5-node ring of 40 ports.

MegaSwitch: An optical design that supports wide-spread, high-bandwidth traffic patterns in today’s production workloads.

More in our paper: Fault-tolerance, delay measurements, power budget, cost…

3/29/17 SING-CSE-HKUST 30

Got New Ideas for NSDI’18?Test them in APNet’17!

• The first Asia-Pacific Workshop on Networking• Aug. 3-4, 2017 @Hong Kong

• A good venue to test your innovative ideas and get feedback from the community.• Submit your 6-page paper on/before Apr. 21th, 2017

3/29/17 SING-CSE-HKUST 31