Scalable Cluster Interconnect
Overview and Technology Roadmap

Charles L. Seitz [email protected]
Linux Superclusters Users Conference, Albuquerque, NM, 13 September 2000

Myricom, Inc., 325 N. Santa Anita Ave., Arcadia, CA 91006
626-821-5555, Fax: 626-821-5316, http://www.myri.com




What is Myrinet?

• A high-performance, cost-effective, packet communication and switching technology
– ANSI Standard (ANSI/VITA 26-1998)
– Packets follow the route specified by the source host (source routing).
– Processing power at the hosts and in the interfaces
– This architecture allows an elegant, streamlined, switching technology
• A descendant of packet communication and routing in MPPs, but commodity and open
• Used principally for scalable clusters


Myrinet Products and Applications

Myricom supplies all that is required to make a high-performance cluster from a collection of computers.

[Diagram labels: Software Host Interface; PCI Interfaces; Link Cables: SAN (to 3m), Serial (to 10m), Fiber (to 200m), Long-wave Fiber; Cut-Through Switches; In-Cabinet Clusters; Desktop Hosts; VME Single-Board-Computer Clusters; Any Network Topology; 2+2 Gbits/s]


Myrinet Technology “in the Large”

Sandia National Laboratories Cplant™

2,576 Compaq Alpha Personal Workstations, 400 EV-5 + 768 EV-6 + 1408 EV-6, but not all in one cluster.

Compaq Custom Systems was the integrator. The system was built in three phases, in the summers of 1998, 1999, and 2000.

Cplant originally used 16-port Myrinet switches in each 8-host cabinet. The latest increment uses a mesh variant of the M2LM-Clos64 “Network in a Box” products for switching.

(Photo adapted from http://www.cs.sandia.gov/cplant/)


Myrinet Technology “in the Small”

CSPI Quad-PowerPC VME Signal-Processing Board

This CSPI two-level-multicomputer product uses the Myricom LANai-5 chip to interface the PowerPCs to the message-passing network.

This single-width VME board includes a packet-switched Myrinet network interconnecting the 4 nodes on the board and 4 external ports with an 8-port Myrinet switch (a chip not visible in this photo).


Why Myrinet? The “selling points.”

• Low latency
– ~8.5µs today (UNIX user process to user process, fully protected, with end-to-end data integrity checking)
– The lower the latency, the wider the application span
• High data rate
– 2+2 Gb/s shipping now
– 1.28+1.28 Gb/s legacy
– Copper and fiber links
• Unlimited scalability
• Very low host-CPU utilization
– LogP overhead = ~1µs
• “Peg-the-needle” PCI implementations
• High Availability features
– Self-mapping, self-healing
– Link-continuity monitoring
• Data Integrity features
– Memory and bus parity
– Link CRC
– Packet payload CRC
• More cost-effective than Gigabit Ethernet or Fibre Channel
– Cost per node < $1,500 today
– Cost per node < $1,000 soon
• Software drivers for all major platforms
– Download them from the Web
– Open source


Myrinet = ANSI/VITA 26-1998

Myrinet is defined at the Data-Link level (level 2 of the ISO reference model for computer networks) by its packet format and flow control. Think of Myrinet as the simplest packet-switched network you can devise.

Packet format (fields in order):
– Source route: used by the switches, which strip the bytes as they are used
– Type (allows multiple protocols on one Myrinet)
– Payload (any length)
– CRC

http://www.myri.com/open-specs/

There are multiple Physical-level implementations.
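The packet format described on this slide can be illustrated with a minimal Python sketch of source routing: the sending host prepends one route byte per switch, and each switch consumes the leading byte as the packet passes through. The function names, the 4-byte width of the Type field, the use of a raw output-port number as the route byte, and the one-byte additive check value standing in for the CRC are all assumptions made for illustration; the actual encodings are defined in ANSI/VITA 26-1998.

```python
TYPE_BYTES = 4  # assumed width of the Type field, for illustration only


def check_byte(data: bytes) -> int:
    """Stand-in 8-bit check value; NOT the CRC defined by the standard."""
    return sum(data) & 0xFF


def build_packet(route: list[int], ptype: int, payload: bytes) -> bytes:
    """Host side: route bytes, then Type, then payload, then the check byte."""
    body = bytes(route) + ptype.to_bytes(TYPE_BYTES, "big") + payload
    return body + bytes([check_byte(body)])


def switch_forward(packet: bytes) -> tuple[int, bytes]:
    """Switch side: strip the leading route byte and pick the output port.

    The check byte is verified, then recomputed over the shortened packet.
    """
    body, check = packet[:-1], packet[-1]
    if check_byte(body) != check:
        raise ValueError("corrupted packet")
    port, rest = body[0], body[1:]
    return port, rest + bytes([check_byte(rest)])


def host_receive(packet: bytes) -> tuple[int, bytes]:
    """Destination host: all route bytes are gone; return (type, payload)."""
    body, check = packet[:-1], packet[-1]
    if check_byte(body) != check:
        raise ValueError("corrupted packet")
    return int.from_bytes(body[:TYPE_BYTES], "big"), body[TYPE_BYTES:]


def deliver(route: list[int], ptype: int, payload: bytes):
    """Pass a packet through len(route) switches; return ports taken and result."""
    packet = build_packet(route, ptype, payload)
    ports = []
    for _ in route:
        port, packet = switch_forward(packet)
        ports.append(port)
    return ports, host_receive(packet)
```

In this sketch a switch needs no routing table: it reads one byte, selects an output port, and forwards the rest, which is why the slide can call Myrinet the simplest packet-switched network one can devise.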


Myrinet Switches -- “Just Technology”

16-port 2nd-generation Myrinet switch (M2LM-SW16) with 8 SAN ports and 8 LAN ports

• 20.48 Gb/s bisection data rate (!) from a single-chip 16x16 crossbar.

• Path-formation latency 100ns SAN-SAN, 200ns SAN-LAN, 300ns LAN-LAN.

• 32 Watts, 2U rack mount size, no fan.

• SNMP/Ethernet monitoring & control (out of band) + Myrinet heartbeat.

• $5K US-list. The “workhorse” 2nd-generation Myrinet switch.


2nd-Generation Myrinet “Network in a Box”

Clos network of 16 16-port switches, with 64 LAN host ports, and 64 SAN inter-switch ports.

Full (maximal) bisection data rate between the 64 host ports = 32 links (41+41 Gb/s). Data rate between the host ports and the inter-switch ports = 64 links (82+82 Gb/s).

160 Watts, 12U rack mount size

SNMP/Ethernet monitoring and control, with the full set of Myrinet high-availability features.

$40K US-list.
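The port and data-rate figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes one plausible arrangement of the 16 switches (8 leaf switches, each with 8 host ports and 8 up-links, and 8 spine switches, each with 8 down-links and 8 external ports); the slide does not spell out the wiring, so that split and the function name are assumptions. The 1.28 Gb/s per-direction link rate is the 2nd-generation figure quoted elsewhere in the talk.

```python
PORTS_PER_SWITCH = 16
LINK_GBPS = 1.28  # 2nd-generation link rate, each direction


def clos64():
    """Count ports and data rates for the assumed 8-leaf + 8-spine arrangement."""
    leaves, spines = 8, 8
    half = PORTS_PER_SWITCH // 2
    host_ports = leaves * half          # 8 host ports per leaf switch
    internal_links = leaves * half      # 8 up-links per leaf, one to each spine
    # Spine ports not used by internal links are available as inter-switch ports.
    inter_switch_ports = spines * PORTS_PER_SWITCH - internal_links
    bisection_links = host_ports // 2   # links crossing an even split of the hosts
    return {
        "switches": leaves + spines,
        "host_ports": host_ports,
        "inter_switch_ports": inter_switch_ports,
        "bisection_links": bisection_links,
        "bisection_gbps": bisection_links * LINK_GBPS,
        "host_to_inter_switch_gbps": inter_switch_ports * LINK_GBPS,
    }
```

Under these assumptions the bisection works out to 32 x 1.28 = 40.96 Gb/s and the host-to-inter-switch rate to 64 x 1.28 = 81.92 Gb/s per direction, which the slide rounds to 41+41 and 82+82 Gb/s.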


Myrinet Interfaces

[Block diagram labels: Network Interface; Fast SRAM; RISC; DMA controller & bus bridge; Packet DMA; SAN port. Callouts: Parts of the LANai chip; PCIDMA chip]

M3M-PCI64B-2 Universal 64/32-bit, 66/33MHz Myrinet-2000-SAN/PCI Interface

From a customer: “What makes Myrinet effective for clusters is the autonomy of the interfaces, which lets us get the OS out of the way.”


Myrinet Software Interfaces

[Diagram: software layering. Labels: Applications; MPI Middleware; TCP, UDP; IP; VIA; Host OS; OS-bypass APIs (multiple host processes); Myrinet Control Program (MCP) (executes in the Myrinet interface); Ethernet (10/100/1000 Mb/s); Myrinet (1280+1280 Mb/s)]


The GM Message-Passing System

No Compromises
• Concurrent, protected, user-level access
• Reliable, ordered message delivery
• Very low CPU overhead
• Robust under network faults
• Mapping
• Segmentation and reassembly of long messages
• High-level flow control
• “Clean” API, with exception handling
• Zero-copy layering of other APIs

[Chart: GM Data-Rate Performance (Myrinet-2000 SAN Interfaces)]

GM short-message latency (Myrinet-2000 interfaces): ~8.5µs (best numbers)

GM CPU overhead = 1-2µs per message (LogP)

UNIX user process to user process. Fully protected. End-to-end data integrity.
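The data-rate chart itself did not survive in this transcript, but the relationship between the quoted latency and the link rate can be sketched with the usual first-order model, time = latency + size / rate. The model, the function names, and the use of the 2 Gb/s one-direction link rate as an upper bound on the data rate are assumptions for illustration; they are not measured GM figures.

```python
LATENCY_S = 8.5e-6           # short-message latency quoted on the slide
LINK_BYTES_PER_S = 2e9 / 8   # 2 Gb/s link rate, one direction, used as an upper bound


def transfer_time(n_bytes: int) -> float:
    """First-order model: fixed latency plus serialization time."""
    return LATENCY_S + n_bytes / LINK_BYTES_PER_S


def effective_rate(n_bytes: int) -> float:
    """Bytes per second actually achieved for a message of n_bytes."""
    return n_bytes / transfer_time(n_bytes)


def half_rate_length() -> float:
    """Message length at which the effective rate is half the link rate."""
    return LATENCY_S * LINK_BYTES_PER_S
```

Under this model the half-rate message length is roughly 2 KB, which illustrates why low latency matters: the shorter the fixed cost, the smaller the message that can still use most of the link.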


GM and MPICH-over-GM Latencies

UNIX user process to user process. Fully protected. End-to-end data integrity.


MPICH over GM Data Rate

UNIX user process to user process. Fully protected. End-to-end data integrity.


Myrinet 2000 – Third-Generation Myrinet

This evolutionary step improves the links at the Physical level (both the performance and the “look and feel” of Myrinet) and introduces interfaces with 1.7x and 2.5x faster RISCs, but Myrinet-2000 is compatible with 2nd-generation Myrinet at the Data Link level and in the software. (Don’t try to innovate along too many dimensions at once! This is a technology push, not an architecture change.)

SAN-1280 → SAN-2000: circuit boards & ribbon cables (3m)

LAN → Serial: copper HSSDC, 2+2 Gb/s to 10m

Low-cost fiber: multimode fiber, 2+2 Gb/s to 200m


Myrinet-2000

• 2+2 Gb/s links using the same Physical media and signaling as {2.5GbE, 2xFC, & 1xInfiniBand}.
– HSSDC cables to 10m and low-cost fiber to 200m.
• 64/32-bit, 66/33MHz, Myrinet/PCI interfaces (LANai 9)
– 132 MHz RISC, 1,056 MB/s local-memory data rate (achieves 8.5µs GM latency)
– In 1Q01, 200MHz RISC, 1,600 MB/s local-memory data rate (~6.5µs GM latency)
• Modular Switches
– 16-port crossbar and 32|64|128-host Clos switches, with line-card options for SAN, serial, or fiber links.
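The two local-memory data rates quoted for the LANai 9 interfaces scale directly with the RISC clock. The short check below divides each rate by its clock; the helper name is hypothetical, and reading the result as a 64-bit-wide memory access once per clock is an inference, not something the slide states.

```python
def bytes_per_clock(mb_per_s: float, clock_mhz: float) -> float:
    """MB/s divided by MHz gives bytes moved per clock cycle."""
    return mb_per_s / clock_mhz


LANAI9_NOW = bytes_per_clock(1056, 132)    # interface shipping at the time of the talk
LANAI9_1Q01 = bytes_per_clock(1600, 200)   # interface announced for 1Q01
```

Both cases come to 8 bytes per clock. The clock rises by about 1.5x while the quoted GM latency improves by about 1.3x (8.5µs to ~6.5µs), so only part of the latency scales with the interface processor.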


Myrinet-2000 128-Host “Network in a Box”

This family of products supports hot-plugging of line cards, fans, and dual redundant power supplies. Microcomputer monitoring (SNMP over Ethernet) provides extensive diagnostic capabilities and the management features needed for high-availability applications.

Different types of line cards have Serial, Fiber, SAN, or legacy LAN ports

[Diagram: spine of the Clos network (backplane); Clos spreader network; 16 line cards with 8 hosts each, giving ports to up to 128 hosts]


The family of Myrinet-2000 switch products

• 17-slot enclosure, up to 128 hosts: 8 16-port switches on the backplane; up to 16 16-port switches on the line cards; Clos “spreader” network (128 links)
• 9-slot enclosure, up to 64 hosts: 4 16-port switches on the backplane; up to 8 16-port switches on the line cards; Clos “spreader” network (64 links)
• 5-slot enclosure, up to 32 hosts: 2 16-port switches on the backplane; up to 4 16-port switches on the line cards; … (32 links)
• 3-slot enclosure, up to 16 hosts: one line card with a 16-port switch, and one straight-through line card

Add the optional monitoring line card to provide SNMP/Ethernet monitoring and control. The monitoring line card includes a microcontroller and dual Ethernet ports. All line cards are interchangeable across the product family.


Why Clos Networks?

• Maximal performance under arbitrary traffic patterns
– Minimum bisection is the largest possible
– “Rearrangeable Network” (can route any permutation)
– Network looks the same from any host (simplifies cluster management)
• Multiple paths
– All progressive routes are deadlock-free
– Use multiple paths for redundancy
– Use multiple paths to avoid hot spots (random dispersion)
• Scales well. For n hosts (minimum bisection = n/2):
– Diameter varies as log(n)
– Cost varies as n log(n)
• Modular
– Economies of sharing the power supply and microcontroller between many switches, and implementing many of the inter-switch links on circuit boards rather than cables.
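The scaling claims above (diameter growing as log(n), cost as n log(n)) can be made concrete with a small sketch. It assumes a folded Clos network built from identical 16-port crossbars, in which every level below the top uses half of its ports downward and the top level uses all of its ports downward. The function name and this construction are assumptions for illustration, and the switch count is exact only for fully populated networks.

```python
def folded_clos(n_hosts: int, radix: int = 16):
    """Return (levels, switches, diameter) for a full-bisection folded Clos."""
    half = radix // 2
    levels, capacity = 1, radix      # one crossbar serves up to `radix` hosts
    while capacity < n_hosts:
        capacity *= half             # each added level multiplies capacity by radix/2
        levels += 1

    def ceil_div(a: int, b: int) -> int:
        return -(-a // b)

    # Lower levels: n/half switches each. Top level: n/radix switches.
    switches = (levels - 1) * ceil_div(n_hosts, half) + ceil_div(n_hosts, radix)
    diameter = 2 * levels - 1        # switches traversed on the longest route
    return levels, switches, diameter
```

For 128, 64, and 32 hosts the sketch gives 24, 12, and 6 switches, which agrees with the backplane-plus-line-card switch counts listed for the Myrinet-2000 enclosures. Multiplying the host count by 8 adds one level (two more hops), while the number of switches per host grows only with the number of levels.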


Myrinet Technology – History & Roadmap

[Timeline chart, 1994 to 2003, running from Past to Future]

Link generations:
• 1st Generation: 0.64+0.64 Gb/s links
• 2nd Generation: 1.28+1.28 Gb/s links
• 3rd Generation (“Myrinet 2000”): 2+2 Gb/s links

Products & Features (in the order listed on the slide):
• 32-bit SBus (SPARC) interfaces, 8-port switches
• 32-bit PCI interfaces (LANai 4), 8-port switches
• SAN PHY level
• Clos “network in a box” of 8-port switches
• 16-port switches, HA features
• 64-bit PCI interfaces (LANai 7), GM message system
• Clos “network in a box” of 16-port switches
• 64-bit PCI interfaces (LANai 9), SW16, Clos128
• PCI-X, multiple virtual channels
• Gigabit Ethernet ports on Myrinet switches
• Full interoperability with 1x InfiniBand
• 4x InfiniBand links


Myrinet Technology Roadmap

• In mid-2001, PCI-X interfaces
– PCI-X is not only 2x faster than 66MHz PCI; it also allows concurrent, interleaved transactions.
• Also in mid-2001, multiple virtual channels.
– Allows “express lanes” for latency-sensitive traffic.
– Coordinated with PCI-X, because today’s PCI would otherwise get in the way of latency-sensitive transactions.
– Required later for full interoperability with InfiniBand.
• Programmable bridges/routers between {Myrinet, Gigabit Ethernet, InfiniBand} with “Myrinet inside.”
• Support or converge with InfiniBand.
– We have all of the necessary technology now for the PHY layer.
– Track and support the protocols and APIs in firmware.