TRANSCRIPT
High-Bandwidth Integrated Optics for Server Applications
Future Directions in Packaging (FDIP) Workshop
(in conjunction with the IEEE International Conference on Electrical Performance of Electronic Packaging and Systems)
Alan Benner, [email protected]
IBM Corp. – Sr. Technical Staff Member, Systems & Technology Group
InfiniBand Trade Assoc. – Chair, ElectroMechanical Working Group
2
The Mandatory Top500 Slide: Exponential growth in system performance
System-level improvements will continue, at a faster-than-Moore's-law exponential rate.
System performance comes from aggregation of larger numbers of chips & boxes.
Bandwidth requirements must scale with the system, roughly 0.5 B/FLOP (memory + network): receive an 8-Byte word, do ~32 ops with it, then transmit it onward: 16 B / 32 operations = 0.5 B/FLOP.
Actual BW requirements vary by application & algorithm by >10x; 0.5 B/FLOP is an average.
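A minimal worked sketch of this rule of thumb and what it implies at the system level (the 8-Byte word, ~32 ops, and 0.5 B/FLOP figures come from the slide; the Python helper names and the 1 EFLOP/s example rate are illustrative):

    # Rough bytes-per-FLOP estimate from the slide's rule of thumb:
    # receive an 8-byte word, perform ~32 operations on it, send it onward.
    def bytes_per_flop(word_bytes=8, ops_per_word=32):
        traffic_bytes = 2 * word_bytes        # 8 B received + 8 B transmitted = 16 B
        return traffic_bytes / ops_per_word   # 16 B / 32 ops = 0.5 B/FLOP

    def required_bandwidth(flops_per_s, b_per_flop=0.5):
        # Aggregate memory + network bandwidth implied by the rule of thumb.
        return flops_per_s * b_per_flop       # bytes/s

    print(bytes_per_flop())                          # 0.5
    print(required_bandwidth(1e18) / 1e15, "PB/s")   # 500.0 PB/s for an Exascale machine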
[Chart: Top500 system performance (log scale) vs. time (linear), data of Nov. 13, 2009, http://www.top500.org. Trend lines: chip trend ~50-60% CAGR (2x/18 mo., ~100x per decade); uniprocessor ~50% CAGR, slowing; transistors & packaging 15-20% CAGR, slowing; box 70-80% CAGR, continuing; cluster/parallel systems ~95-100% CAGR (= CPU trend + more parallelism, ~1,000x per decade), continuing. Roadrunner marks 1 PFlop (10^15 Flop); extrapolation reaches Exa-scale around 2020. CAGR = Compound Annual Growth Rate.]
Note: Interconnect requirements for Top500's benchmark app (Linpack) are ~midway between many "real" supercomputing apps & data center apps. Similar trends & CAGRs apply to data centers.
3
The Landscape of Interconnect
PHYSICAL link types – distinguished by length & packaging:
- Intra-chip: 0 mm - 20 mm; 1-100s lanes per link; use of optics: later
- Intra-module: 5 mm - 100 mm; 1-100s lanes per link; use of optics: probably after 2015
- Intra-card: 0.1 m - 0.3 m; 1-100s lanes per link; use of optics: 2010-2015
- Backplane / card-to-card: 0.3 m - 1 m; 1-100s lanes per link; use of optics: 2009-2010
- Cables, short: 1 m - 10 m; 1-10s lanes per link; use of optics: now
- Cables, long: 10 m - 300 m; 1-10s lanes per link; use of optics: since the 90s
- MAN & WAN: multi-km; 1 lane per link; use of optics: since the 80s

LOGICAL link types – distinguished by function & link protocol:
- SMP coherency bus: load/store coherency ops to other CPUs' caches; Stds: HyperTransport; key characteristic: reliability, massive BW; optics: coming
- Memory bus: load/store to DRAM or memory fanout chip; Stds: DDR3/2/...; key characteristic: reliability & cost vs. DRAM; optics: coming later
- Mezzanine bus: load/store to hubs & bridges; Stds: HyperTransport; key characteristic: reliability; optics: not yet
- I/O: load/store to I/O adapters; Stds: PCI/PCIe; key characteristic: reliability, shared tech between servers & desktops; optics: scattered
- Direct-attach storage: read/write to disk, unshared; Stds: SAS, SATA; key characteristic: shared tech between servers & desktops; optics: not yet
- Storage area network: read/write to disk, shared; Std: Fibre Channel; key characteristic: dominated by FC; optics: since the 90s
- Cluster / data center: intra-application or intra-distributed-application traffic; Stds: InfiniBand, 1G Ethernet, 10/40/100G Ethernet; key characteristic: BW & latency to <60 meters; optics: since the 2000s
- Local area network: HTML pages to laptops, ...; Stds: 1G Ethernet, WiFi; key characteristic: 100-300 m over RJ-45 / CAT5 cabling, or wireless; optics: maybe never? (wireless, building re-wiring, BW demand)
- Internet (MAN & WAN): IP traffic; Stds: Ethernet, ATM, SONET, ...; key characteristic: inter-operability with "everybody"; optics: since the 80s

Link technology by reach (longest to shortest): single-mode optics, mixed multi-mode optics & copper, copper.
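As a compact restatement of the physical-link taxonomy above, the length boundaries can be captured in a small lookup (a sketch only; classifying purely by distance ignores the packaging distinctions, and the boundary values are simply taken from the table):

    # Physical link types keyed by Tx-Rx distance range, in meters (from the table).
    PHYSICAL_LINK_TYPES = [
        ("intra-chip",               0.0,   0.02),
        ("intra-module",             0.005, 0.1),
        ("intra-card",               0.1,   0.3),
        ("backplane / card-to-card", 0.3,   1.0),
        ("cables - short",           1.0,   10.0),
        ("cables - long",            10.0,  300.0),
        ("MAN & WAN",                300.0, float("inf")),
    ]

    def classify(distance_m):
        # Adjacent ranges overlap slightly (e.g. 0-20 mm vs. 5-100 mm), so return all matches.
        return [name for name, lo, hi in PHYSICAL_LINK_TYPES if lo <= distance_m <= hi]

    print(classify(0.5))    # ['backplane / card-to-card']
    print(classify(50.0))   # ['cables - long']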
4
Optical vs. Electrical - Cost-Effectiveness Link Crossover Length
Qualitative Summary:
At short distances, copper is "cheaper" (power, $$, design complexity, ...).
At longer distances, optics is cheaper.
System design requires finding the optimal crossover length.
[Chart: Link cost ($/Gbps, log scale) vs. Tx-Rx distance (0.001 m to 10,000 m, log scale), curves shown for ~2.5 Gbps. Distance regions: on-chip (traces on a single chip), PCB (traces on a circuit board), SAN/cluster (cables in one room), LAN (cables in walls), campus (cables underground), MAN/WAN (rented cables). The copper curve steps up at the cost of card-edge connectors and again at the cost of opening up walls for cabling; the optical curve steps up at the cost of the optical transceiver and again at the cost of single-mode optics. The curves intersect at the O/E cost-effectiveness crossover length.]
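The crossover can be illustrated with a toy cost model (a sketch only; the dollar figures are made-up placeholders, not values from the talk): copper cost rises super-linearly with distance (heavier cable, equalization, connectors), while optics pays a fixed transceiver cost and then grows slowly with fiber length.

    # Toy link-cost model: $/Gbps as a function of Tx-Rx distance.
    # All constants are illustrative placeholders, not measured data.
    def copper_cost(dist_m, base=0.5, per_m=0.125, exponent=1.5):
        # Copper gets disproportionately expensive with length.
        return base + per_m * dist_m ** exponent

    def optical_cost(dist_m, transceiver=6.3, fiber_per_m=0.05):
        # Optics pays the transceiver cost up front, then cheap fiber per meter.
        return transceiver + fiber_per_m * dist_m

    def crossover_length(step_m=0.01, max_m=1000.0):
        # First distance at which the optical link becomes cheaper than copper.
        d = step_m
        while d < max_m:
            if optical_cost(d) < copper_cost(d):
                return round(d, 2)
            d += step_m
        return None

    print(crossover_length(), "m")   # ~14 m with these placeholder constants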
5
Cost-Effectiveness Link Crossover Length – Dependence on bit-rate
Observations: Across the decades, the crossover length for a given bit-rate has stayed fairly constant – copper and optics get cheaper at roughly the same rate. As bit-rates have risen, a larger share of overall interconnect has moved to optics.
[Chart: Link cost ($/Gbps, log scale) vs. Tx-Rx distance (0.001 m to 10,000 m, log scale), with copper and optical curves for 0.6, 2.5, 10, and 40 Gb/s. Same distance regions as the previous chart (on-chip, PCB, SAN/cluster, LAN, campus, MAN/WAN). The O/E cost-effectiveness crossover length moves to shorter distances as the bit-rate rises.]
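Extending the same toy model (again with placeholder constants, chosen so the 2.5 Gb/s case matches the sketch above) shows the qualitative point: copper's cost climbs with bit-rate faster than the optical transceiver premium does, so the crossover length shrinks as rates rise.

    # Bit-rate-dependent version of the toy model above (placeholder constants only).
    def copper_cost_r(dist_m, gbps, base=0.5, per_m_per_gbps=0.05, exponent=1.5):
        return base + per_m_per_gbps * gbps * dist_m ** exponent

    def optical_cost_r(dist_m, gbps, txrx_per_sqrt_gbps=4.0, fiber_per_m=0.05):
        return txrx_per_sqrt_gbps * gbps ** 0.5 + fiber_per_m * dist_m

    def crossover(gbps, step_m=0.01, max_m=1000.0):
        d = step_m
        while d < max_m:
            if optical_cost_r(d, gbps) < copper_cost_r(d, gbps):
                return round(d, 2)
            d += step_m
        return None

    for rate in (0.6, 2.5, 10, 40):
        print(rate, "Gb/s ->", crossover(rate), "m")   # crossover length shrinks as the rate rises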
6
Rack-to-rack cable: Recent history in HPC systems
Over time: higher bit-rates, similar lengths, more use of optics, denser connector packing
- 2002 – NEC Earth Simulator: all copper, ~1 Gb/s
- 2005 – IBM Federation Switch for ASCI Purple (LLNL): copper for short-distance links (≤10 m), optical for longer links (20-40 m); ~3,000 parallel links, 12+12 channels @ 2 Gb/s/channel
- 2008 (1 PF/s) – IBM Roadrunner (LANL): 4X DDR InfiniBand (5 Gb/s), combination of electrical & optical cabling, 55 miles of Active Optical Cables (http://www.lanl.gov/roadrunner/)
- 2008 (1 PF/s) – Cray Jaguar (ORNL): InfiniBand, 3 miles of optical cables, longest = 60 m (http://www.nccs.gov/jaguar/)
7
Evolution of Supercomputer-scale systems – 1980s-2020s
In 2018-2020, we'll be building Exascale systems – 10^18 ops/sec – with 10s of millions of processing cores and near billion-way parallelism.
Yes, there are apps that can use this processing power: molecular-level cell simulations, modeling brain dynamics at the level of individual neurons, multi-scale & multi-rate fluid dynamics, ...
Massive interconnects will be required both within racks and between racks.
Supercomputing, 1980s: 1-8 processors in 1 rack.
Supercomputing, 2000s: 10,000s of CPUs in 100s of racks.
Supercomputing, 2020s: 10M to >100M CPU cores, >500 racks?
8
2011-2013: “Practical Petascale” Blue Waters System
Target: #1 productivity supercomputer in 2011: 1-2 PetaFLOP/s Sustained (~10 PF Peak)
Selected Statistics:
- More than 300,000 cores
- More than 1 PetaByte of memory
- More than 10 PetaBytes of user disk storage
- More than 0.5 Exabyte of archival storage
- Up to 400 Gbps external connectivity
- >2.5M optical channels, 10 Gb/s each
Uses: Modeling very complex systems – cells, organs, and organisms; hurricanes (incl. storm surge, etc.) and tornadoes; galaxy formation in the early universe; effect of the Sun's corona on Earth's ionosphere; design of aircraft, jet engines, fusion, ...; atomic-level design of new materials; ...
Maximum architected Power7-IH system is half-again bigger (500K cores)
The new Illinois NCSA Petascale Computing Facility that will house Blue Waters. Reference: www.ncsa.uiuc.edu/BlueWaters/
P7-IH Node Drawer: 8 32-way SMP nodes, 1 TF per SMP node
Per node ("octant"): 128 GB DRAM, >512 GB/s memory BW, >190 GB/s network BW
Optical transceivers tightly integrated, mounted within drawer
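A quick scale check on the optical content of the full system (the channel count and per-channel rate are from the statistics above; the conversion is ours):

    # Aggregate raw optical bandwidth implied by the Blue Waters statistics.
    channels = 2.5e6         # >2.5M optical channels (from the slide)
    rate_gbps = 10           # 10 Gb/s per channel
    total_tbps = channels * rate_gbps / 1e3
    print(total_tbps, "Tb/s")            # 25,000 Tb/s
    print(total_tbps / 8 / 1e3, "PB/s")  # ~3.1 PB/s aggregate raw optical bandwidth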
9
PERCS/Power7-IH System – Data-Center-In-A-Rack
Servers – 256 Power7 cores / drawer, 1-12 drawers / rack
- Compute: 8-core Power7 CPU chip, 3.7 GHz, 12s technology, 32 MB L3 eDRAM/chip, 4-way SMT, 4 FPUs/core, Quad-Chip Module; >90 TF / rack
- No accelerators: normal CPU instruction set, robust cache/memory hierarchy; easy programmability, predictable performance, mature compilers & libraries
- Memory: 512 GBytes/sec per QCM (0.5 Byte/FLOP), 12 Terabytes / rack
- External IO: 16 PCIe Gen2 x16 slots / drawer; SAS or external connections
- Network: Integrated Hub (HCA/NIC & Switch) per QCM (8 / drawer), with 54-port switch, for a total of 12 Tbits/s (1.1 TByte/s net BW) per Hub:
  - Host connection: 4 links, (96+96) GB/s aggregate (0.2 Byte/FLOP)
  - On-card electrical links: 7 links to other hubs, (168+168) GB/s aggregate
  - Local-remote optical links: 24 links to near hubs, (120+120) GB/s aggregate
  - Distant optical links: 16 links to far hubs (to 100 m), (160+160) GB/s aggregate
  - PCI-Express: 2-3 per hub, (16+16) to (20+20) GB/s aggregate
Integrated Storage – 384 2.5" drives / drawer, 0-6 drawers / rack
- 230 TBytes / drawer (w/ 600 GB 10K SAS disks), full RAID, 154 GB/s BW / drawer
- Storage drawers replace server drawers at 2-for-1 (up to 1.38 PetaBytes / rack)
Integrated Cooling – water pumps and heat exchangers
- All thermal load transferred directly to building chilled water – no load on the room
Integrated Power Regulation, Control, & Distribution
- Runs off any building voltage supply worldwide (200-480 VAC or 370-575 VDC), converts to 360 VDC for in-rack distribution
- Full in-rack redundancy and automatic fail-over, 4 line cords; up to 252 kW/rack max, 163 kW typical
All data center power & cooling infrastructure is included in the compute/storage/network rack
- No need for external power distribution or computer-room air handling equipment
- All components correctly sized for maximum efficiency – extremely good 1.18 Power Utilization Efficiency
- Integrated management for all compute, storage, network, power, & thermal resources
- Scales to 512K P7 cores (192 racks) – without any extraneous hardware except optical fiber cables
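A quick sanity check on the per-Hub aggregate (link counts and per-class bandwidths are taken from the list above and summed per direction):

    # Per-direction bandwidth of one Power7-IH Hub, in GB/s, by link class (from the slide).
    link_classes_gb_s = {
        "host connection (4 links)":         96,
        "on-card electrical (7 links)":      168,
        "local-remote optical (24 L-links)": 120,
        "distant optical (16 D-links)":      160,
        "PCI-Express (2-3 slots, upper)":    20,
    }
    per_direction = sum(link_classes_gb_s.values())   # 564 GB/s each way
    print(per_direction, "GB/s per direction;", 2 * per_direction, "GB/s total")
    # 564 + 564 = 1128 GB/s, i.e. the ~1.1 TByte/s net BW per Hub quoted above.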
10
P7-IH – Cable Density
Many, many optical fibers. Each of these cables is a 24-fiber multimode cable, carrying (10+10) GBytes/sec of traffic.
(For size reference.)
11
P7 IH System Hardware – Node Front View (Blue Waters: ~1200 node drawers)
[Labeled photo of the node drawer, 1 m W x 1.8 m D x 10 cm H. Callouts:]
- P7 QCM (8x)
- Hub Module (8x); MLC module / hub assembly
- Memory DIMMs (64x per side)
- D-Link optical interface – connects to other Super Nodes
- L-Link optical interface – connects 4 nodes to form a Super Node
- PCIe interconnect
- 360 VDC input power supplies
- Water connection
- Avago microPOD™ transceivers – all off-node communication is optical
Partially supported by IBM's HPCS Program.
12
Hub Module – MCM with Optical I/Os
This shows the Hub module with its full complement of optical I/Os. The module in the photo is partially assembled to show construction – the full module hardware is symmetric.
Callouts:
- Heat spreader for optical devices; cooling / load saddle for optical devices
- Optical transmitter/receiver devices, 12 channels x 10 Gb/s; 28 pairs per Hub – (2,800+2,800) Gb/s of optical I/O BW
- Heat spreader over Hub ASIC; Hub ASIC (under heat spreader)
- Strain relief for optical ribbons; total of 672 fiber I/Os per Hub, 10 Gb/s each
13
High-Density Optical Transceivers and Optical Connectors
For this program, we needed a new generation of optical components – denser, faster, more configurable, equally reliable, and much more tightly integrable into the system.
Joint development activities with Avago Technologies and US Conec, Ltd. have led to successful demonstration of these components.
We purposefully defined the interfaces (electrical, optical, mechanical, thermal, & management) to be compatible with industry standards:
- InfiniBand 12x QDR
- Ethernet 100 Gbit/sec SR
in order to make these technologies available to the rest of the IT industry.
Commercially available now, from multiple manufacturers.
PRIZM™ LightTurn® optical connector
MicroPOD™ transmitter / receiver module
[Photo callouts: optical TX, PRIZM connector, optical RX]
14
Avago MicroPOD – TX & RX Performance
This slide courtesy of Mitch Fields, Avago Technologies
15
MicroPOD Signal Integrity Features
This slide courtesy of Mitch Fields, Avago Technologies
16
MicroPOD Signal Integrity Features
This slide courtesy of Mitch Fields, Avago Technologies
17
MicroPOD Signal Integrity Features
This slide courtesy of Mitch Fields, Avago Technologies
18
MicroPOD TX Input Equalization
This slide courtesy of Mitch Fields, Avago Technologies
19
MicroPOD RX Output De-emphasis
This slide courtesy of Mitch Fields, Avago Technologies
20
Manufacture of MicroPOD – paradigm shift
Manufacturing volume change from ~50,000 per year, worldwide, to 500,000 per year ...10X more parts to build, test, and install
...drives massive changes in product design and delivery:
- Simple vertical stack design
- Investment in manufacturing technology for 100% automation
- Manufacture parallel optics in panel form
This slide courtesy of Mitch Fields, Avago Technologies
21
The Payoff –
1,000s of channels manufactured and demonstrated, all running at 10 Gbps, error-free.
[Charts: RX PAVE / TX LOP (y-axis 0 to 1.2) vs. channel number (x-axis 0-12), one panel per tested 12-channel link; six panels shown.]
22
Short history of supercomputing for Weather Simulation
[Timeline: ~1995, ~2000, ~2005, ~2009]
Thank you kindly
-- any questions?