a resource disaggregated platform for realizing ultra-fast

A Resource disaggregated Platform for realizing Ultra-Fast Failover Recovery High Availability Systems

Deepak Pathania, Senior Technical Leader, NEC Corporation and NEC Technologies India Private Limited

Join the Conversation #OpenPOWERSummit


Post on 11-Nov-2021


Page 1: A Resource disaggregated Platform for realizing Ultra-Fast

Deepak Pathania, Senior Technical Leader, NEC Corporation and NEC Technologies India Private Limited

Join the Conversation #OpenPOWERSummit

A Resource disaggregated Platform for realizing Ultra-Fast Failover Recovery High Availability Systems

Page 2: A Resource disaggregated Platform for realizing Ultra-Fast

So, what can be a Resource Disaggregated Platform?

A technology that extends PCI Express beyond the confines of a computer chassis via Ethernet, without any modification of existing hardware and software; in other words, a PCIe switch over Ethernet (ExpEther, or EE).

[Figure: a server (CPU, memory, PCI Express, ExpEther NIC) connected over standard Ethernet through an L2 switch to an IO expansion unit with PCIe cards, where ExpEther Engines attach the IO devices by PCI Express.]

Page 3: A Resource disaggregated Platform for realizing Ultra-Fast

Just another implementation of PCIe Switch

● The ExpEther Engine is seen as a PCIe switch from the CPU
● The Ethernet region is invisible from the CPU

[Figure: two topologies side by side. PCIe switch: the CPU connects to an upstream port (PCI bridge), an internal PCI bus, and downstream ports (PCI bridges) leading to IO devices. ExpEther: the CPU and each IO device connect by PCI Express to an ExpEther Engine (PCI bridge), and the engines are joined by an Ethernet switch; the Ethernet fabric is invisible.]

Page 4: A Resource disaggregated Platform for realizing Ultra-Fast

Broad-Scale Single Computer

A PCI Express switch is equivalent to an Ethernet fabric.

[Figure: CPUs and IO devices attached through ExpEther Engines and Ethernet switches, in place of a single PCIe switch. IO devices are placed in the same rack, in the next rack, on another floor, and in another building; distances labeled on the links are 10 m, 100 m, and 2 km.]

ExpEther can build a new type of computing environment without physical constraints.

Page 5: A Resource disaggregated Platform for realizing Ultra-Fast

ExpEther Architecture

• Achieve the “System on Network”
• Merge the PCI Express technology into Ethernet technology
• Connect logically in the MAC layer
• No impact on the upper or lower layers of the PCIe and Ethernet standards, for future expansion

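[Figure: Ethernet and ExpEther protocol stacks side by side, split into software and hardware. Ethernet: Application, OS, NDIS driver (software); Ethernet logic, MAC, PHY at 10M/100M/1G/10G/40G (hardware). ExpEther: Application, OS, PCI driver, EFI/PCI BIOS (software); ExpEther logic, MAC, PHY at 1G/10G/40G (hardware). Labels: "Upper Compatible"; "No modification for future expansion of ExpEther or Ethernet".]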

Page 6: A Resource disaggregated Platform for realizing Ultra-Fast

Resource Disaggregated Platform or ExpEther features

(1) Equivalent to direct connection (Ethernet is invisible from CPU/IO)
(2) Low latency (L2 Ethernet without a software stack)
(3) No packet loss (adding reliability to Ethernet)
(4) I/O dynamic reconfiguration (hot-plug scheme)

[Figure: a CPU connected by PCI Express to an ExpEther Engine, then through an Ethernet switch (Ethernet fabric) to three ExpEther Engines, each attached by PCI Express to an I/O device. An Ethernet frame is shown with the labels "EE" and "PCI Express TLP".]

Page 7: A Resource disaggregated Platform for realizing Ultra-Fast

Dual Path for Throughput and Reliability

• Two Ethernet connections are established between the Host Chip and the I/O Chip
  • Load balancing for performance
  • Path redundancy for failure recovery

Load balancing: round-robin data packet transmission between the two redundant connections.

Failure recovery: quickly detects path failures and switches paths.

[Figure: a CPU with a dual-port 40G ExpEther NIC (ExpEther Host Chip) connected through Ethernet Fabric-I and Ethernet Fabric-II to ExpEther IO Chips, each attached to an I/O device.]
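The round-robin load balancing and path switching described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Path` and `DualPathSender`, the `up` flag, and the way a failure is signalled are all hypothetical, and the slides do not describe how the ExpEther hardware actually implements the arbiter.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Path:
    """One Ethernet connection between the host chip and the I/O chip."""
    name: str
    up: bool = True                      # hypothetical health flag
    sent: list = field(default_factory=list)


class DualPathSender:
    """Round-robin over healthy paths; a path marked down is skipped."""

    def __init__(self, paths):
        self.paths = list(paths)
        self._next = 0                   # index of the path to try first
        self._seq = count(1)             # sequence numbers start at 1

    def send(self, payload):
        for offset in range(len(self.paths)):
            idx = (self._next + offset) % len(self.paths)
            path = self.paths[idx]
            if path.up:
                seq = next(self._seq)
                path.sent.append((seq, payload))
                self._next = (idx + 1) % len(self.paths)
                return seq, path.name
        raise RuntimeError("no healthy path available")


fabric1, fabric2 = Path("fabric-I"), Path("fabric-II")
sender = DualPathSender([fabric1, fabric2])
first = [sender.send(f"tlp-{i}")[1] for i in range(4)]    # alternates paths
fabric2.up = False                                       # simulated path failure
after = [sender.send(f"tlp-{i}")[1] for i in range(4, 7)]  # one path only
```

While both paths are healthy the sender alternates between them; once one is marked down, all traffic moves to the survivor without changing the sequence numbering.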

Page 8: A Resource disaggregated Platform for realizing Ultra-Fast

Frame Rate Control

TCP/IP: rate control is triggered by packet loss (TCP Reno). Packet loss causes significant performance degradation because of retransmission.

ExpEther: rate control is always done by measuring network latency. Packet loss basically does not occur in ExpEther.

The ExpEther engine always measures the frame arrival time on the receive side and finely controls the frame rate to avoid packet loss.

[Figure: two plots of network bandwidth over time. TCP/IP: "Slow Start" followed by repeated "Avoid Congestion" phases. ExpEther: "Probing" followed by "Avoid Congestion". Also labeled: "Congestion (Packet Loss)".]
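To make the contrast with loss-triggered control concrete, here is a toy latency-driven rate controller in Python. It is not the ExpEther algorithm, which the slides do not specify; the class name, the thresholds (`delay_margin`, `backoff`, `probe_step_gbps`), and the use of the smallest observed delay as a baseline are all assumptions chosen for illustration.

```python
class DelayBasedRateController:
    """Toy latency-driven rate control (illustrative, not the ExpEther logic)."""

    def __init__(self, line_rate_gbps, start_gbps=1.0,
                 probe_step_gbps=1.0, backoff=0.8, delay_margin=1.2):
        self.line_rate = line_rate_gbps
        self.rate = start_gbps
        self.probe_step = probe_step_gbps
        self.backoff = backoff
        self.delay_margin = delay_margin
        self.base_delay_us = None        # smallest delay seen so far

    def on_delay_sample(self, delay_us):
        """Update the send rate from one measured frame delay."""
        if self.base_delay_us is None or delay_us < self.base_delay_us:
            self.base_delay_us = delay_us
        if delay_us > self.base_delay_us * self.delay_margin:
            # Delay is rising, so queues are building: slow down
            # before any frame has to be dropped.
            self.rate = max(self.rate * self.backoff, 0.1)
        else:
            # Delay is near the baseline: probe for more bandwidth.
            self.rate = min(self.rate + self.probe_step, self.line_rate)
        return self.rate


ctrl = DelayBasedRateController(line_rate_gbps=10.0)
rates = [ctrl.on_delay_sample(d) for d in (2.0, 2.0, 2.1, 3.0, 2.0)]
```

The point of the sketch is that the rate drops on the fourth sample, when delay rises, without any packet having been lost; a loss-triggered scheme would only react after a drop and a retransmission.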

Page 9: A Resource disaggregated Platform for realizing Ultra-Fast

Sequence of Network Path Failover (1/2)

• Both network paths are used as ACT-ACT (active-active).
• If a path fails, ExpEther resends the lost packets. The failover time is about 10 RTT (several microseconds).

[Figure, normal operation: the EE NIC (Tx side), with a retransfer buffer and an arbiter, sends numbered ExpEther packets over two chains of Ethernet switches to the EE (Rx side), which has one receive buffer per path and an ordering buffer.]

[Figure, path failure: the same setup with one path broken. Labels: "Lost Packet", "Sequence Number Check", "Resending", "Re-receive packets after several microseconds".]
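The roles of the retransfer buffer, the sequence number check, and the ordering buffer can be sketched as follows. This is a minimal model under stated assumptions: the class and method names are hypothetical, acknowledgements are modeled as a cumulative `ack_up_to` call, and the slides do not say how the hardware signals a gap back to the sender.

```python
class OrderingBuffer:
    """Receiver side: delivers packets in sequence order and reports gaps."""

    def __init__(self):
        self.expected = 1        # next sequence number to deliver
        self.pending = {}        # out-of-order packets keyed by sequence number
        self.delivered = []

    def receive(self, seq, payload):
        """Accept a packet from either path; return newly delivered payloads."""
        if seq < self.expected or seq in self.pending:
            return []            # duplicate, e.g. a resend that raced the original
        self.pending[seq] = payload
        out = []
        while self.expected in self.pending:
            out.append(self.pending.pop(self.expected))
            self.expected += 1
        self.delivered.extend(out)
        return out

    def missing(self):
        """Sequence numbers below the highest buffered one that never arrived."""
        if not self.pending:
            return []
        return [s for s in range(self.expected, max(self.pending))
                if s not in self.pending]


class RetransferBuffer:
    """Sender side: keeps unacknowledged packets so they can be resent."""

    def __init__(self):
        self.unacked = {}

    def record(self, seq, payload):
        self.unacked[seq] = payload

    def ack_up_to(self, seq):
        for s in [s for s in self.unacked if s <= seq]:
            del self.unacked[s]

    def resend(self, seqs):
        return [(s, self.unacked[s]) for s in seqs]


tx, rx = RetransferBuffer(), OrderingBuffer()
for seq in range(1, 6):
    tx.record(seq, f"tlp-{seq}")
# Odd packets travel on one path, even packets on the other;
# the second path fails after packet 2, so packet 4 is lost.
for seq in (1, 2, 3, 5):
    rx.receive(seq, f"tlp-{seq}")
lost = rx.missing()                      # gap found by the sequence number check
for seq, payload in tx.resend(lost):     # resent over the surviving path
    rx.receive(seq, payload)
tx.ack_up_to(rx.expected - 1)
```

Packet 5 waits in the ordering buffer until the resent packet 4 arrives, so the PCIe side still sees the original order.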

Page 10: A Resource disaggregated Platform for realizing Ultra-Fast

Sequence of Network Path Failover (2/2)

• The network path is recovered by an Ethernet recovery scheme, such as P-Flow, linked with the EE manager.
• When an ExpEther device receives a management packet indicating that the path has recovered, it starts reusing both network paths.

[Figure, during recovery: the EE NIC (Tx side) sends all numbered ExpEther packets over the surviving chain of Ethernet switches to the EE (Rx side); a packet labeled "DEVINFO" is shown.]

[Figure, after recovery: packets are again spread over two chains of Ethernet switches. Labels: "Lost Packet", "New path is enabled".]

Page 11: A Resource disaggregated Platform for realizing Ultra-Fast

Multi-path IO with Resource Disaggregation or ExpEther

• Multi-Path IO (MPIO)
  • MPIO is one of the techniques for achieving high reliability. If the target IO device supports MPIO, MPIO can be used even under ExpEther.
• Multi-Path Ethernet
  • It supports high-speed network path failover.

[Figure: two equivalent configurations. Left: a host with SAS HBA #0 and SAS HBA #1 connected to a SAS JBOD with active-active MPIO. Right: a host with EE NIC #0 connected through two Ethernet switches and EEs to SAS HBA #0 and SAS HBA #1 in front of a SAS JBOD, combining MPIO with high-speed network failover.]

Page 12: A Resource disaggregated Platform for realizing Ultra-Fast

Dynamic Reconfiguration and Hot-Plug Capability

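[Figure, logical view: four groups, each appearing as a host with its own PCIe switch. Group #1: host with devices B, D, G, I. Group #2: host with devices A, J. Group #3: host with devices C, E, H. Group #4: host with device F.]

[Figure, physical view: hosts 1 to 4 and devices A to J, each tagged with a group number from 1 to 4, all attached to one Ethernet fabric together with the EE Manager.]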

Page 13: A Resource disaggregated Platform for realizing Ultra-Fast

Dynamic Reconfiguration and Hot-Plug Capability

• Group ID (GID: 1 to 4,095)
  • A GID in the range 1 to 15 is set by a physical DIP switch residing on the card.
  • Setting the GID to 0 allows management software to program a soft GID.

ExpEther Manager
➢ Configuration
  • Group ID configuration
➢ Monitoring
  • ExpEther network status
  • PCIe device status
  • New ExpEther detection
  • Failure detection

Management Frame
➢ Special Ethernet frame
  • ExpEther hard-wired logic directly receives and sends the frames for configuration and management

[Figure: four hosts (EE1 to EE4) and a set of IO devices, each behind an EE and tagged with a group number from 1 to 4, attached to one Ethernet fabric. A management server sends group ID configuration and collects various information using management frames.]
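The GID rules on this slide can be expressed as a short Python sketch. The function names, the tuple layout of `cards`, and the idea that a host and the IO devices carrying the same GID form one group (as the logical view on the previous slide suggests) are assumptions made for illustration; the actual management software interface is not described in the slides.

```python
from collections import defaultdict

MAX_GID = 4095          # upper bound stated on the slide
DIP_MAX = 15            # the DIP switch covers 1..15; 0 defers to software


def effective_gid(dip_gid, soft_gid=None):
    """Return the group ID a card would use, or None if still unassigned."""
    if not 0 <= dip_gid <= DIP_MAX:
        raise ValueError("DIP switch GID must be in 0..15")
    if dip_gid != 0:
        return dip_gid                  # hardware setting is used as-is
    if soft_gid is None:
        return None                     # waiting for management software
    if not 1 <= soft_gid <= MAX_GID:
        raise ValueError("soft GID must be in 1..4095")
    return soft_gid


def build_groups(cards):
    """Map each GID to the names of the cards (hosts and IO) that share it."""
    groups = defaultdict(list)
    for name, dip_gid, soft_gid in cards:
        gid = effective_gid(dip_gid, soft_gid)
        if gid is not None:
            groups[gid].append(name)
    return dict(groups)


# Hypothetical inventory: (card name, DIP switch GID, soft GID or None)
cards = [
    ("host-1", 1, None), ("io-A", 1, None),
    ("host-2", 0, 200), ("io-B", 0, 200),
    ("io-C", 0, None),                  # not yet assigned to any group
]
groups = build_groups(cards)
```

Moving `io-C` into a group then amounts to giving it a soft GID, which is how a manager could reassign a device without touching the card.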

Page 14: A Resource disaggregated Platform for realizing Ultra-Fast

ExpEther Technology Architectural Possibilities

▐ Std-EE: Standard PCIe-over-Ethernet
  • Foundation of ExpEther
▐ MR-EE: I/O sharing
  • Multiple hosts are able to share an IO device by using an SR-IOV compliant device
▐ P2P-EE: I/O direct connection
  • Support for peer-to-peer data transfer between I/O devices
▐ NTB-EE: Remote direct memory access by NTB
  • High-speed data transfer between hosts

[Figure: four panels. Standard (PCIe-over-Ether, foundation of ExpEther): hosts and I/O devices with Std-EE connected by an Ethernet switch, with partitioning and hot-plug. Resource sharing: hosts with Std-EE sharing SR-IOV devices behind MR-EE through an Ethernet switch. Peer-to-peer: I/O devices with P2P-EE exchanging data directly through an Ethernet switch, contrasted with the current path via the host. NTB direct access: hosts with NTB-EE connected by an Ethernet switch.]

Page 15: A Resource disaggregated Platform for realizing Ultra-Fast

ExpEther Lineup

• 1G/10G ExpEther
  • ExpEther HBA
    ● 1G: x1 PCI Express, Dual 1000BASE-T
    ● 10G: x8 PCIe Gen2, Dual 10G SFP+
  • ExpEther Client
    ● 2x 1000BASE-T
    ● DVI x1, HDMI x1
    ● USB3.0 x1
    ● USB2.0 x3
    ● Headphone x1
    ● Microphone x1
  • ExpEther IO Expansion Unit
    ● 1G: x16 PCIe x 1 slot, Dual 1000BASE-T
    ● 10G: x16 PCIe2 x 2 slots (full height/full length), Dual 10G SFP+ per slot

• 40G ExpEther
  • ExpEther HBA
    ● IO Interface: x8 PCI Express 3.0
    ● Network I/F: QSFP+ x 2
    ● Form Factor: PCI Low Profile
  • IO Expansion Unit
    ● IO Interface: x8 PCI Express 3.0
    ● Slots: x16 Slot x 4
    ● Network I/F: QSFP+ x 4
    ● Support IO: GPGPU (K80, P100, etc)

Other labels on the slide: 19” Rack Size; 3U; 400mm; 1,000W PSU for 2-Slot IO Expansion Unit; 800W PSU for 4-Slot IO Expansion Unit.

Page 16: A Resource disaggregated Platform for realizing Ultra-Fast

Performance of EE vs Local with PCIe-based SSDs

name/depth            1            4            16           64
local                 2363699.2    2266555.73   2264849.07   2197777.07
ExpEther              2657348.27   2490436.27   2491665.07   2462105.6
ExpEther/local (%)    112.42       109.88       110.01       112.03

name/depth            1            4            16           64
local                 1039667.2    1056358.4    1093597.87   1068544
ExpEther              1053832.53   1060447.57   1072503.47   1054276.27
ExpEther/local (%)    101.36       100.39       98.07        98.66

ExpEther has no impact on bandwidth and can fully support PCIe x8 Gen3 (64 Gbps).
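As a quick check of the arithmetic on this slide, the following Python snippet recomputes the ExpEther/local row of the first table and the 64 Gbps figure. The measurement values are copied from the table; their unit is not stated in the slides, which does not matter for a ratio. The PCIe Gen3 parameters (8 GT/s per lane, 128b/130b encoding) are standard figures, not taken from the slides.

```python
def ratio_percent(expether, local):
    """ExpEther result as a percentage of the local result, two decimals."""
    return round(100.0 * expether / local, 2)


depths = (1, 4, 16, 64)
local = (2363699.2, 2266555.73, 2264849.07, 2197777.07)
expether = (2657348.27, 2490436.27, 2491665.07, 2462105.6)
ratios = [ratio_percent(e, l) for e, l in zip(expether, local)]

# PCIe Gen3: 8 GT/s per lane with 128b/130b line encoding.
lanes = 8
raw_gbps = 8 * lanes                     # 64, the figure quoted on the slide
payload_gbps = raw_gbps * 128 / 130      # about 63.0 after encoding overhead
```

The recomputed ratios match the percentages in the table, and the 64 Gbps on the slide corresponds to the raw signalling rate of eight Gen3 lanes.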

Page 17: A Resource disaggregated Platform for realizing Ultra-Fast

Performance of EE vs Local with PCIe-based SSDs

name/depth            1            4            16           64
local                 42467.67     157166.67    370562       394857.33
ExpEther              39113.67     145294.33    359693.67    392591
ExpEther/local (%)    92.10        92.45        97.07        99.43

name/depth            1            4            16           64
local                 216616.67    231208.67    226872.67    231524
ExpEther              130586.67    230349.33    232602.67    231147.33
ExpEther/local (%)    60.28        99.63        102.53       99.84

ExpEther can achieve the same IOPS as local by increasing the IO depth parameter to hide the latency of Ethernet.
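Why a deeper queue hides the added Ethernet latency can be illustrated with Little's law: sustained IOPS is roughly the number of outstanding requests divided by the per-request latency, up to the device's own ceiling. The sketch below uses that simple model; the latencies and the device ceiling are hypothetical values chosen for illustration, not measurements from the slides.

```python
def modeled_iops(queue_depth, latency_us, device_max_iops):
    """Little's law estimate: outstanding requests / latency, capped by the device."""
    latency_s = latency_us * 1e-6
    return min(queue_depth / latency_s, device_max_iops)


DEVICE_MAX = 400_000          # hypothetical device ceiling (IOPS)
LOCAL_LATENCY_US = 25.0       # hypothetical local request latency
REMOTE_LATENCY_US = 27.0      # hypothetical: local plus 2 us over Ethernet

ratios = []
for depth in (1, 4, 16, 64):
    local = modeled_iops(depth, LOCAL_LATENCY_US, DEVICE_MAX)
    remote = modeled_iops(depth, REMOTE_LATENCY_US, DEVICE_MAX)
    ratios.append(round(100 * remote / local, 1))
```

In this model the remote case trails at shallow depths, where every request pays the extra latency in full, and reaches parity once both cases are limited by the device rather than by latency. That is the same qualitative shape as the measured tables.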

Page 18: A Resource disaggregated Platform for realizing Ultra-Fast

Performance of EE vs Local with Tesla GPU(P100)

ExpEther is able to achieve almost the same performance as the local case. In deep learning applications, the data exchange between host and GPU is very small and most of the processing time is consumed in the GPU, so there is no performance impact with ExpEther.

[Figure: chart comparing the local case with the case via ExpEther.]

Page 19: A Resource disaggregated Platform for realizing Ultra-Fast

Ultra-Fast Failover Recovery for Database System with EE

High-speed data restore to the Standby Server for an in-memory DB.

✓ NVMe SSD is faster than Fibre Channel.
✓ Use NVMe SSD as the journal for the DB.
✓ When the Active Server fails, the NVMe SSDs' connection is switched, allowing the DB journal to be restored on the Standby Server.

[Figure: an Active Server and a Standby Server. The main DB sits on an FC SAN and the DB journal on NVMe SSDs attached through EE over Ethernet. The Active Server logs to the journal; when the server fails, failover switches the journal to the Standby Server, which restores from it.]
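The logging and restore flow on this slide can be modeled in a few lines of Python. Everything here is a stand-in: `Journal` represents the NVMe SSD holding the DB journal, `attach` models switching the SSD's connection from one server to the other, and `InMemoryDB` is a toy key-value store. The slides do not describe the actual database or the ExpressCluster X integration, so this is only a sketch of the idea.

```python
class Journal:
    """Stands in for the NVMe SSD that holds the DB journal."""

    def __init__(self):
        self.owner = None
        self.records = []

    def attach(self, server_name):
        self.owner = server_name         # models switching the SSD's connection


class InMemoryDB:
    def __init__(self, name, journal):
        self.name, self.journal, self.data = name, journal, {}

    def put(self, key, value):
        if self.journal.owner != self.name:
            raise PermissionError("journal is attached to another server")
        self.journal.records.append((key, value))   # log first
        self.data[key] = value                       # then apply in memory

    def restore(self):
        """Rebuild in-memory state by replaying the journal in order."""
        self.data = {}
        for key, value in self.journal.records:
            self.data[key] = value


journal = Journal()
active = InMemoryDB("active", journal)
standby = InMemoryDB("standby", journal)
journal.attach("active")
active.put("a", 1)
active.put("b", 2)
active.put("a", 3)
# The active server fails: switch the journal to the standby and replay it.
journal.attach("standby")
standby.restore()
```

Because the journal device itself moves to the standby, no data has to be copied between servers before the replay starts; restore time is then bounded by how fast the journal can be read.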

Page 20: A Resource disaggregated Platform for realizing Ultra-Fast

Ultra-Fast Failover Recovery for Database system with EE and ExpressCluster X

• <Real pictures of the setup to be added here on left hand side>

[Figure: failover from the Primary Power Server to the Secondary Power Server.]

Page 21: A Resource disaggregated Platform for realizing Ultra-Fast

Ultra-Fast Failover Recovery for Database system with EE and ExpressCluster X

• <Experimental results, observations and conclusions to be added here>

Page 22: A Resource disaggregated Platform for realizing Ultra-Fast

Ultra-Fast Failover Recovery for Database system with EE and ExpressCluster X

• <Comparisons with existing Failover system architectures to be added here>

Page 23: A Resource disaggregated Platform for realizing Ultra-Fast

Service Acceleration Platform with ExpEther

IO devices can be dynamically allocated to the appropriate host according to workload.

[Figure: compute nodes (CPU/chipset with ExpEther HBAs) connected through Ethernet switches, carrying PCIe TLPs over Ethernet, to ExpEther Engines in front of remote IO devices: an accelerator node and accelerator resource pool (GPGPUs, FPGA accelerators), high-speed storage nodes (NVMe SSDs), sensors, and an EE Client providing USB/VGA (KVM) through a USB controller.]

Page 24: A Resource disaggregated Platform for realizing Ultra-Fast

Future Roadmap of ExpEther or Universal Interconnect

Page 25: A Resource disaggregated Platform for realizing Ultra-Fast

Summary

• <To be Added>

Page 26: A Resource disaggregated Platform for realizing Ultra-Fast

Thank you