TRANSCRIPT
High-Bandwidth Integrated Optics for Server Applications
Future Directions in Packaging (FDIP) Workshop
(in conjunction with the IEEE International Conference on Electrical Performance of Electronic Packaging and Systems)
Alan Benner, [email protected]
IBM Corp. – Sr. Technical Staff Member, Systems & Technology Group
InfiniBand Trade Assoc. – Chair, ElectroMechanical Working Group
2
The Mandatory Top500 Slide: Exponential growth in system performance
System-level improvements will continue, at a faster-than-Moore's-law exponential rate.
System performance comes from aggregation of larger numbers of chips & boxes.
Bandwidth requirements must scale with the system, roughly 0.5 B/FLOP (memory + network): receive an 8-Byte word, do ~32 ops with it, then transmit it onward: 16 B / 32 operations = 0.5 B/FLOP.
Actual BW requirements vary by application & algorithm by >10x; 0.5 B/FLOP is an average.
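A minimal worked sketch of this rule of thumb and what it implies at the system level (the 8-Byte word, ~32 ops, and 0.5 B/FLOP figures come from the slide; the Python helper names and the 1 EFLOP/s example rate are illustrative):

    # Rough bytes-per-FLOP estimate from the slide's rule of thumb:
    # receive an 8-byte word, perform ~32 operations on it, send it onward.
    def bytes_per_flop(word_bytes=8, ops_per_word=32):
        traffic_bytes = 2 * word_bytes        # 8 B received + 8 B transmitted = 16 B
        return traffic_bytes / ops_per_word   # 16 B / 32 ops = 0.5 B/FLOP

    def required_bandwidth(flops_per_s, b_per_flop=0.5):
        # Aggregate memory + network bandwidth implied by the rule of thumb.
        return flops_per_s * b_per_flop       # bytes/s

    print(bytes_per_flop())                          # 0.5
    print(required_bandwidth(1e18) / 1e15, "PB/s")   # 500.0 PB/s for an Exascale machine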
[Chart: Top500 system performance (log scale) vs. time (linear), data of Nov. 13, 2009, http://www.top500.org. Trend lines: chip trend ~50-60% CAGR (2x/18 mo., ~100x per decade); uniprocessor ~50% CAGR, slowing; transistors & packaging 15-20% CAGR, slowing; box 70-80% CAGR, continuing; cluster/parallel systems ~95-100% CAGR (= CPU trend + more parallelism, ~1,000x per decade), continuing. Roadrunner marks 1 PFlop (10^15 Flop); extrapolation reaches Exa-scale around 2020. CAGR = Compound Annual Growth Rate.]
Note: Interconnect requirements for Top500's benchmark app (Linpack) are ~midway between many "real" supercomputing apps & data center apps. Similar trends & CAGRs apply to data centers.
3
The Landscape of Interconnect
PHYSICAL link types – distinguished by length & packaging:
- Intra-chip: 0 mm - 20 mm; 1-100s lanes per link; use of optics: later
- Intra-module: 5 mm - 100 mm; 1-100s lanes per link; use of optics: probably after 2015
- Intra-card: 0.1 m - 0.3 m; 1-100s lanes per link; use of optics: 2010-2015
- Backplane / card-to-card: 0.3 m - 1 m; 1-100s lanes per link; use of optics: 2009-2010
- Cables, short: 1 m - 10 m; 1-10s lanes per link; use of optics: now
- Cables, long: 10 m - 300 m; 1-10s lanes per link; use of optics: since the 90s
- MAN & WAN: multi-km; 1 lane per link; use of optics: since the 80s

LOGICAL link types – distinguished by function & link protocol:
- SMP coherency bus: load/store coherency ops to other CPUs' caches; Stds: HyperTransport; key characteristic: reliability, massive BW; optics: coming
- Memory bus: load/store to DRAM or memory fanout chip; Stds: DDR3/2/...; key characteristic: reliability & cost vs. DRAM; optics: coming later
- Mezzanine bus: load/store to hubs & bridges; Stds: HyperTransport; key characteristic: reliability; optics: not yet
- I/O: load/store to I/O adapters; Stds: PCI/PCIe; key characteristic: reliability, shared tech between servers & desktops; optics: scattered
- Direct-attach storage: read/write to disk, unshared; Stds: SAS, SATA; key characteristic: shared tech between servers & desktops; optics: not yet
- Storage area network: read/write to disk, shared; Std: Fibre Channel; key characteristic: dominated by FC; optics: since the 90s
- Cluster / data center: intra-application or intra-distributed-application traffic; Stds: InfiniBand, 1G Ethernet, 10/40/100G Ethernet; key characteristic: BW & latency to <60 meters; optics: since the 2000s
- Local area network: HTML pages to laptops, ...; Stds: 1G Ethernet, WiFi; key characteristic: 100-300 m over RJ-45 / CAT5 cabling, or wireless; optics: maybe never? (wireless, building re-wiring, BW demand)
- Internet (MAN & WAN): IP traffic; Stds: Ethernet, ATM, SONET, ...; key characteristic: inter-operability with "everybody"; optics: since the 80s

Link technology by reach (longest to shortest): single-mode optics, mixed multi-mode optics & copper, copper.
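As a compact restatement of the physical-link taxonomy above, the length boundaries can be captured in a small lookup (a sketch only; classifying purely by distance ignores the packaging distinctions, and the boundary values are simply taken from the table):

    # Physical link types keyed by Tx-Rx distance range, in meters (from the table).
    PHYSICAL_LINK_TYPES = [
        ("intra-chip",               0.0,   0.02),
        ("intra-module",             0.005, 0.1),
        ("intra-card",               0.1,   0.3),
        ("backplane / card-to-card", 0.3,   1.0),
        ("cables - short",           1.0,   10.0),
        ("cables - long",            10.0,  300.0),
        ("MAN & WAN",                300.0, float("inf")),
    ]

    def classify(distance_m):
        # Adjacent ranges overlap slightly (e.g. 0-20 mm vs. 5-100 mm), so return all matches.
        return [name for name, lo, hi in PHYSICAL_LINK_TYPES if lo <= distance_m <= hi]

    print(classify(0.5))    # ['backplane / card-to-card']
    print(classify(50.0))   # ['cables - long']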
4
Optical vs. Electrical - Cost-Effectiveness Link Crossover Length
Qualitative Summary:
At short distances, copper is "cheaper" (power, $$, design complexity, ...).
At longer distances, optics is cheaper.
System design requires finding the optimal crossover length.
[Chart: Link cost ($/Gbps, log scale) vs. Tx-Rx distance (0.001 m to 10,000 m, log scale), curves shown for ~2.5 Gbps. Distance regions: on-chip (traces on a single chip), PCB (traces on a circuit board), SAN/cluster (cables in one room), LAN (cables in walls), campus (cables underground), MAN/WAN (rented cables). The copper curve steps up at the cost of card-edge connectors and again at the cost of opening up walls for cabling; the optical curve steps up at the cost of the optical transceiver and again at the cost of single-mode optics. The curves intersect at the O/E cost-effectiveness crossover length.]
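The crossover can be illustrated with a toy cost model (a sketch only; the dollar figures are made-up placeholders, not values from the talk): copper cost rises super-linearly with distance (heavier cable, equalization, connectors), while optics pays a fixed transceiver cost and then grows slowly with fiber length.

    # Toy link-cost model: $/Gbps as a function of Tx-Rx distance.
    # All constants are illustrative placeholders, not measured data.
    def copper_cost(dist_m, base=0.5, per_m=0.125, exponent=1.5):
        # Copper gets disproportionately expensive with length.
        return base + per_m * dist_m ** exponent

    def optical_cost(dist_m, transceiver=6.3, fiber_per_m=0.05):
        # Optics pays the transceiver cost up front, then cheap fiber per meter.
        return transceiver + fiber_per_m * dist_m

    def crossover_length(step_m=0.01, max_m=1000.0):
        # First distance at which the optical link becomes cheaper than copper.
        d = step_m
        while d < max_m:
            if optical_cost(d) < copper_cost(d):
                return round(d, 2)
            d += step_m
        return None

    print(crossover_length(), "m")   # ~14 m with these placeholder constants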
5
Cost-Effectiveness Link Crossover Length – Dependence on bit-rate
Observations: Across the decades, the crossover length for a given bit-rate has stayed fairly constant – copper and optics get cheaper at roughly the same rate. As bit-rates have risen, a larger share of overall interconnect has moved to optics.
[Chart: Link cost ($/Gbps, log scale) vs. Tx-Rx distance (0.001 m to 10,000 m, log scale), with copper and optical curves for 0.6, 2.5, 10, and 40 Gb/s. Same distance regions as the previous chart (on-chip, PCB, SAN/cluster, LAN, campus, MAN/WAN). The O/E cost-effectiveness crossover length moves to shorter distances as the bit-rate rises.]
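Extending the same toy model (again with placeholder constants, chosen so the 2.5 Gb/s case matches the sketch above) shows the qualitative point: copper's cost climbs with bit-rate faster than the optical transceiver premium does, so the crossover length shrinks as rates rise.

    # Bit-rate-dependent version of the toy model above (placeholder constants only).
    def copper_cost_r(dist_m, gbps, base=0.5, per_m_per_gbps=0.05, exponent=1.5):
        return base + per_m_per_gbps * gbps * dist_m ** exponent

    def optical_cost_r(dist_m, gbps, txrx_per_sqrt_gbps=4.0, fiber_per_m=0.05):
        return txrx_per_sqrt_gbps * gbps ** 0.5 + fiber_per_m * dist_m

    def crossover(gbps, step_m=0.01, max_m=1000.0):
        d = step_m
        while d < max_m:
            if optical_cost_r(d, gbps) < copper_cost_r(d, gbps):
                return round(d, 2)
            d += step_m
        return None

    for rate in (0.6, 2.5, 10, 40):
        print(rate, "Gb/s ->", crossover(rate), "m")   # crossover length shrinks as the rate rises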
6
Rack-to-rack cable: Recent history in HPC systems
Over time: higher bit-rates, similar lengths, more use of optics, denser connector packing
- 2002 – NEC Earth Simulator: all copper, ~1 Gb/s
- 2005 – IBM Federation Switch for ASCI Purple (LLNL): copper for short-distance links (≤10 m), optical for longer links (20-40 m); ~3,000 parallel links, 12+12 channels @ 2 Gb/s/channel
- 2008 (1 PF/s) – IBM Roadrunner (LANL): 4X DDR InfiniBand (5 Gb/s), combination of electrical & optical cabling, 55 miles of Active Optical Cables (http://www.lanl.gov/roadrunner/)
- 2008 (1 PF/s) – Cray Jaguar (ORNL): InfiniBand, 3 miles of optical cables, longest = 60 m (http://www.nccs.gov/jaguar/)
7
Evolution of Supercomputer-scale systems – 1980s-2020s
In 2018-2020, we'll be building Exascale systems – 10^18 ops/sec – with 10s of millions of processing cores and near billion-way parallelism.
Yes, there are apps that can use this processing power: molecular-level cell simulations, modeling brain dynamics at the level of individual neurons, multi-scale & multi-rate fluid dynamics, ...
Massive interconnects will be required both within racks and between racks.
Supercomputing, 1980s: 1-8 processors in 1 rack.
Supercomputing, 2000s: 10,000s of CPUs in 100s of racks.
Supercomputing, 2020s: 10M to >100M CPU cores, >500 racks?
8
2011-2013: “Practical Petascale” Blue Waters System
Target: #1 productivity supercomputer in 2011: 1-2 PetaFLOP/s Sustained (~10 PF Peak)
Selected Statistics:
- More than 300,000 cores
- More than 1 PetaByte of memory
- More than 10 PetaBytes of user disk storage
- More than 0.5 Exabyte of archival storage
- Up to 400 Gbps external connectivity
- >2.5M optical channels, 10 Gb/s each
Uses: Modeling very complex systems – cells, organs, and organisms; hurricanes (incl. storm surge, etc.) and tornadoes; galaxy formation in the early universe; effect of the Sun's corona on Earth's ionosphere; design of aircraft, jet engines, fusion, ...; atomic-level design of new materials; ...
Maximum architected Power7-IH system is half-again bigger (500K cores)
The new Illinois NCSA Petascale Computing Facility that will house Blue Waters. Reference: www.ncsa.uiuc.edu/BlueWaters/
P7-IH Node Drawer: 8 32-way SMP nodes, 1 TF per SMP node
Per node ("octant"): 128 GB DRAM, >512 GB/s memory BW, >190 GB/s network BW
Optical transceivers tightly integrated, mounted within drawer
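A quick scale check on the optical content of the full system (the channel count and per-channel rate are from the statistics above; the conversion is ours):

    # Aggregate raw optical bandwidth implied by the Blue Waters statistics.
    channels = 2.5e6         # >2.5M optical channels (from the slide)
    rate_gbps = 10           # 10 Gb/s per channel
    total_tbps = channels * rate_gbps / 1e3
    print(total_tbps, "Tb/s")            # 25,000 Tb/s
    print(total_tbps / 8 / 1e3, "PB/s")  # ~3.1 PB/s aggregate raw optical bandwidth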
9
PERCS/Power7-IH System – Data-Center-In-A-Rack
Servers – 256 Power7 cores / drawer, 1-12 drawers / rack
- Compute: 8-core Power7 CPU chip, 3.7 GHz, 12s technology, 32 MB L3 eDRAM/chip, 4-way SMT, 4 FPUs/core, Quad-Chip Module; >90 TF / rack
- No accelerators: normal CPU instruction set, robust cache/memory hierarchy; easy programmability, predictable performance, mature compilers & libraries
- Memory: 512 GBytes/sec per QCM (0.5 Byte/FLOP), 12 Terabytes / rack
- External IO: 16 PCIe Gen2 x16 slots / drawer; SAS or external connections
- Network: Integrated Hub (HCA/NIC & Switch) per QCM (8 / drawer), with 54-port switch, for a total of 12 Tbits/s (1.1 TByte/s net BW) per Hub:
  - Host connection: 4 links, (96+96) GB/s aggregate (0.2 Byte/FLOP)
  - On-card electrical links: 7 links to other hubs, (168+168) GB/s aggregate
  - Local-remote optical links: 24 links to near hubs, (120+120) GB/s aggregate
  - Distant optical links: 16 links to far hubs (to 100 m), (160+160) GB/s aggregate
  - PCI-Express: 2-3 per hub, (16+16) to (20+20) GB/s aggregate
Integrated Storage – 384 2.5" drives / drawer, 0-6 drawers / rack
- 230 TBytes / drawer (w/ 600 GB 10K SAS disks), full RAID, 154 GB/s BW / drawer
- Storage drawers replace server drawers at 2-for-1 (up to 1.38 PetaBytes / rack)
Integrated Cooling – water pumps and heat exchangers
- All thermal load transferred directly to building chilled water – no load on the room
Integrated Power Regulation, Control, & Distribution
- Runs off any building voltage supply worldwide (200-480 VAC or 370-575 VDC), converts to 360 VDC for in-rack distribution
- Full in-rack redundancy and automatic fail-over, 4 line cords; up to 252 kW/rack max, 163 kW typical
All data center power & cooling infrastructure is included in the compute/storage/network rack
- No need for external power distribution or computer-room air handling equipment
- All components correctly sized for maximum efficiency – extremely good 1.18 Power Utilization Efficiency
- Integrated management for all compute, storage, network, power, & thermal resources
- Scales to 512K P7 cores (192 racks) – without any extraneous hardware except optical fiber cables
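A quick sanity check on the per-Hub aggregate (link counts and per-class bandwidths are taken from the list above and summed per direction):

    # Per-direction bandwidth of one Power7-IH Hub, in GB/s, by link class (from the slide).
    link_classes_gb_s = {
        "host connection (4 links)":         96,
        "on-card electrical (7 links)":      168,
        "local-remote optical (24 L-links)": 120,
        "distant optical (16 D-links)":      160,
        "PCI-Express (2-3 slots, upper)":    20,
    }
    per_direction = sum(link_classes_gb_s.values())   # 564 GB/s each way
    print(per_direction, "GB/s per direction;", 2 * per_direction, "GB/s total")
    # 564 + 564 = 1128 GB/s, i.e. the ~1.1 TByte/s net BW per Hub quoted above.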
10
P7-IH – Cable Density
Many, many optical fibers. Each of these cables is a 24-fiber multimode cable, carrying (10+10) GBytes/sec of traffic.
(For size reference.)
11
P7 IH System Hardware – Node Front View (Blue Waters: ~1200 node drawers)
[Labeled photo of the node drawer, 1 m W x 1.8 m D x 10 cm H. Callouts:]
- P7 QCM (8x)
- Hub Module (8x); MLC module / hub assembly
- Memory DIMMs (64x per side)
- D-Link optical interface – connects to other Super Nodes
- L-Link optical interface – connects 4 nodes to form a Super Node
- PCIe interconnect
- 360 VDC input power supplies
- Water connection
- Avago microPOD™ transceivers – all off-node communication is optical
Partially supported by IBM's HPCS Program.
12
Hub Module – MCM with Optical I/Os
This shows the Hub module with its full complement of optical I/Os. The module in the photo is partially assembled to show construction – the full module hardware is symmetric.
Callouts:
- Heat spreader for optical devices; cooling / load saddle for optical devices
- Optical transmitter/receiver devices, 12 channels x 10 Gb/s; 28 pairs per Hub – (2,800+2,800) Gb/s of optical I/O BW
- Heat spreader over Hub ASIC; Hub ASIC (under heat spreader)
- Strain relief for optical ribbons; total of 672 fiber I/Os per Hub, 10 Gb/s each
13
High-Density Optical Transceivers and Optical Connectors
For this program, we needed a new generation of optical components – denser, faster, more configurable, equally reliable, and much more tightly integrable into the system.
Joint development activities with Avago Technologies and US Conec, Ltd. have led to successful demonstration of these components.
We purposefully defined the interfaces (electrical, optical, mechanical, thermal, & management) to be compatible with industry standards:
- InfiniBand 12x QDR
- Ethernet 100 Gbit/sec SR
in order to make these technologies available to the rest of the IT industry.
Commercially available now, from multiple manufacturers.
PRIZM™ LightTurn® optical connector
MicroPOD™ transmitter / receiver module
[Photo callouts: optical TX, PRIZM connector, optical RX]
14
Avago MicroPOD – TX & RX Performance
This slide courtesy of Mitch Fields, Avago Technologies
15
MicroPOD Signal Integrity Features
This slide courtesy of Mitch Fields, Avago Technologies
16
MicroPOD Signal Integrity Features
This slide courtesy of Mitch Fields, Avago Technologies
17
MicroPOD Signal Integrity Features
This slide courtesy of Mitch Fields, Avago Technologies
18
MicroPOD TX Input Equalization
This slide courtesy of Mitch Fields, Avago Technologies
19
MicroPOD RX Output De-emphasis
This slide courtesy of Mitch Fields, Avago Technologies
20
Manufacture of MicroPOD – paradigm shift
Manufacturing volume change from ~50,000 per year, worldwide, to 500,000 per year ...10X more parts to build, test, and install
...drives massive changes in product design and delivery:
- Simple vertical stack design
- Investment in manufacturing technology for 100% automation
- Manufacture parallel optics in panel form
This slide courtesy of Mitch Fields, Avago Technologies
21
The Payoff –
1,000s of channels manufactured and demonstrated, all running at 10 Gbps, error-free.
[Charts: RX PAVE / TX LOP (y-axis 0 to 1.2) vs. channel number (x-axis 0-12), one panel per tested 12-channel link; six panels shown.]
22
Short history of supercomputing for Weather Simulation
[Timeline: ~1995, ~2000, ~2005, ~2009]
Thank you kindly
-- any questions?