noc bergman
Post on 27-Oct-2015
50 Views
Preview:
TRANSCRIPT
Frontiers in Nanophotonics and PlasmonicsGuaruja, SP Brazil
November, 2007
Silicon Photonic On-Chip Optical Interconnection Networks
Keren Bergman, Columbia University
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 20072
Acknowledgements
Columbia
– Prof. Luca Carloni
– Dr. Assaf Shacham, Michele Petracca, Ben Lee, Caroline Lai, Howard Wang, Sasha Biberman
IBM
– Jeff Kash
– Yurii Vlasov
Cornell
– Michal Lipson
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 20073
Emerging Trend of Chip MultiProcessors (CMP)Emerging Trend of Chip MultiProcessors (CMP)
NiagaraSun 2004
CELL BEIBM 2005
MontecitoIntel 2004
TerascaleIntel 2007
BarcelonaAMD 2007
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 20074
Networks on Chip (NoC)Networks on Chip (NoC)
Shared, packet-switched, optimized for communicationsShared, packet-switched, optimized for communications
– Resource efficiencyResource efficiency
– Design simplicity Design simplicity
– IP reusabilityIP reusability
– High performanceHigh performance
But… no true relief in But… no true relief in power dissipationpower dissipation
Kolo
dn
y, 2
005
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 20075
Chip Multiprocessors: the IBM Cell
Parameter Value Technology process 90nm SOI with low- dielectrics and 8 metal
layers of copper interconnect Chip area 235mm^2 Number of transistors ~234M Operating clock frequency 4Ghz Power dissipation ~100W Percentage of power dissipation due to global interconnect
30-50%
Intra-chip, inter-core communication bandwidth
1.024 Tbps, 2Gb/sec/lane (four shared buses, 128 bits data + 64 bits address each)
I/O communication bandwidth 0.819 Tbps (includes external memory)
IBM Cell:
Columbia University
The Interconnection The Interconnection Challenge: Challenge:
Off-Chip BandwidthOff-Chip Bandwidth Off-chip bandwidth is risingOff-chip bandwidth is rising
Pin countPin count Signaling rateSignaling rate
Some examplesSome examples
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 20077
Why Photonics for CMP NoC?
OPTICS: Modulate/receive ultra-high
bandwidth data stream once per communication event
Transparency: broadband switch routes entire multi-wavelength high BW stream
Low power switch fabric, scalable Off-chip and on-chip can use
essentially the same technology Off-chip BW = On-chip BW
for nearly same power
TX RX TXRX
TX
RX
TX
RX
TX
RX
TXRX
ELECTRONICS: Buffer, receive and re-transmit
at every switch Off chip is pin-limited and
really power hungry
Photonics changes the rulesfor Bandwidth-per-WattOn-chip AND Off-chip
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 20078
Recent advances in photonic integrationRecent advances in photonic integration
Infinera, 2005 IBM, 2007 Lipson, Cornell, 2005
Luxtera, 2005Bowers, UCSB, 2006
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 20079
3DI CMP System Concept Future CMP system in 22nm
Chip size ~625mm2
3D layer stacking used to combine:
– Multi-core processing plane
– Several memory planes
– Photonic NoC
Processor System Stack
Opt
ical
I/O
Opt
ical
I/O
For 22nm scaling will enable 36 multithreaded cores similar to today’s Cell
Estimated on-chip local memory per complex core ~0.5GB
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200710
Optical NoC: Design Considerations Design to exploit optical advantages:
– Bit rate transparency: transmission/switching power independent of bandwidth
– Low loss: power independent of distance
– Bandwidth: exploit WDM for maximum effective bandwidths across network• (Over) provision maximized bandwidth per port• Maximize effective communications bandwidth
– Seamless optical I/O to external memory with same BW
Design must address optical challenges:– No optical buffering– No optical signal processing– Network routing and flow control managed in electronics
• Distributed vs. Central• Electronic control path provisioning latency
Packaging constraints: CMP chip layout, avoid long electronic interfaces, network gateways must be in close proximity on photonic plane
Design for photonic building blocks: low switch radix
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200711
Goal: Design a NoC for a chip multiprocessor (CMP)Goal: Design a NoC for a chip multiprocessor (CMP)
ElectronicsElectronics
Integration density Integration density abundant buffering and processing abundant buffering and processing
Power dissipation grows with data ratePower dissipation grows with data rate
PhotonicsPhotonics
Low loss, large bandwidth, bit-rate transparencyLow loss, large bandwidth, bit-rate transparency
Limited processing, no buffersLimited processing, no buffers
Our solution – a hybrid approach:Our solution – a hybrid approach:
A dual-network designA dual-network design
– Data transmission in a photonic networkData transmission in a photonic network
– Control in an electronic networkControl in an electronic network
– Paths reserved before transmission Paths reserved before transmission No optical buffering No optical buffering
Photonic On-Chip NetworkPhotonic On-Chip Network
PPP
PPP
PPP
GGG
GGG
GGG
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200712
PG
PG
PG
PG
PG
PG
PG
PG
PG
On-Chip Optical Network ArchitectureBufferless, Deflection-switch based
Cell Core (on processor plane)
Gateway to Photonic NoC(on processor and photonic planes)
Thin Electrical Control Network(~1% BW, small messages)
Photonic NoC
Deflection Switch
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200713
Key Building Blocks
5cm SOI nanowire 1.28Tb/s (32 x 40Gb/s)
LOW LOSS BROADBAND NANO-WIRES
HIGH-SPEED MODULATOR
CornellOFF ON
BROADBAND ROUTER SWITCH
Si
SiO2
GeSi
Ti/Al
n+ p+ n+ p+
SiO2
SWi
Wm
tGe
Si
SiO2
GeSi
Ti/Al
n+ p+ n+ p+
SiO2
SWi
Wm
tGe
HIGH-SPEED RECEIVER
IBM/Columbia
IBM;Cornell/Columbia
IBM
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200714
4x4 Photonic Switch Element4x4 Photonic Switch Element
4 deflection switches grouped with 4 deflection switches grouped with electronic controlelectronic control
4 waveguide pairs I/O links4 waveguide pairs I/O links
Electronic routerElectronic router
– High speed simple logicHigh speed simple logic
– Links optimized for high speedLinks optimized for high speed
Nearly no power consumption in Nearly no power consumption in OFFOFF state state
CMOSDriver
CMOSDriver
CMOSDriver
CMOSDriver
CMOSDriver
CMOSDriver
CMOSDriver
CMOSDriver
ER
PSE
PSE PSE
PSE
North
South
EastWestER
PSEPSE
PSEPSE PSEPSE
PSEPSE
North
South
EastWest
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200715
Non-Blocking 4x4 Switch DesignNon-Blocking 4x4 Switch Design
Original switch is internally blockingOriginal switch is internally blocking
Addressed by routing algorithm in original Addressed by routing algorithm in original designdesign
Limited topology choicesLimited topology choices
New designNew design
– Strictly non-blocking*Strictly non-blocking*
– Same number of ringsSame number of rings
– Negligible additional lossNegligible additional loss
– Larger areaLarger area
* U-turns not allowed* U-turns not allowed
W E
N
S
W E
N
S
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200716
Design of Nonblocking Network for CMP NoC
Begin with crossbar -- strictly non-blocking architecture
– Any unoccupied input can transmit to any unoccupied output without altering paths taken by other traffic in network
Connections from every input to every output
– Each node transmits and receives on independent paths ineach dimension
Unidirectional links
1 x 2 Switches
Simple routing algorithm
1 2 3 4 5 6 7 8
8
7
6
5
4
3
2
1
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200717
1 2 3 4
4
3
2
1
5 6 7 8
Bidirectionality provides for independent reception by two nodes from output (Y) dimension
Design of photonic nonblocking mesh
Utilizing nonblocking switch design with increased functionality and bidirectionality enables novel network architecture
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200718
3
4
1
2
Mapping onto a direct network Internalizing nodes in a crossbar (indirect network) produces
mesh/torus (direct network)
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200719
Nonblocking Torus Network Internalizing nodes maintains two nodes per dimension
– There is always an independent path available for a node to transmit/receive on/from in each dimension
4
3
2
1
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200720
Nonblocking Torus Network Internalizing nodes maintains two nodes per dimension
– There is always an independent path available for a node to transmit/receive on/from in each dimension
4
32
1
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200721
Nonblocking Torus Network Each node injects into the network on the X dimension
2
11
2
3
4
1GW
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200722
Nonblocking Torus Network Each node ejects from the network on the Y dimension
2
11
2
3
4
1GW
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200723
Nonblocking Torus Network Folding the torus to maintain equal path lengths
4 × 4 non-blocking photonic switch
1
2
4
3
Non-Blocking 4x4 Design
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200724
Power Analysisstrawman
Modulator power 0.4 pJ/bit
Receiver power 0.4 pJ/bit
On-chip Broadband Router Power 10 mW
Power per laser 10 mW
Number of wavelength channels 250
modulator
modulator
modulator
Off-chip lasers
Rx
am
p
Rx
Rx
?de
mux .
.
.
am
pa
mp
.
.
.
10 mW
. . .
0.4 pJ/bit1 Tb/s
Broadband router
Broadband router
10 mW 10 mW
0.4 pJ/bit
Super-core
Super-core
1 Tb/s
modulator
modulator
modulator
Off-chip lasers
Rx
am
p
Rx
Rx
?de
mux .
.
.
am
pa
mp
.
.
.
10 mW
. . .
0.4 pJ/bit1 Tb/s
Broadband router
Broadband router
10 mW 10 mW
0.4 pJ/bit
Super-core
Super-core
1 Tb/s
Cores per supercore 21
Supercores per chip 36
broadband switch routers 36
Number of lasers 250
Bandwidth per supercore 1 Tb/s
Optical Tx + Rx Power (36 Tb/s) 28.8 W
On-chip Broadband routers (36) 360 mW
External lasers (250) 2.5 W
Total power photonic interconnect 31.7 W
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200725
Performance Analysis
Goal to evaluate performance-per-Watt advantage of CMP system with photonic NoC
Developed network simulator using OMNeT++: modular, open-source, event-driven simulation environment
– Modules for photonic building blocks, assembled in network
– Multithreaded model for complex cores
Evaluate NoC performance under uniform random distribution
Performance-per-Watt gains of photonic NoC on FFT application
Opt
ical
I/O
Opt
ical
I/O
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200726
Multithreaded complex core model Model complex core as multithreaded processor with many
computational threads executed in parallel Each thread independently make a communications request to any core
Three main blocks:– Traffic generator – simulates core
threads data transfer requests, requests stored in back-pressure FIFO queue
– Scheduler – extracts requests from FIFO, generates path setup, electronic interface, blocked requests re-queued, avoids HoL blocking
– Gateway – photonic interface, send/receive, read/write data to local memory
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200727
Throughput per core
Throughput-per-core = ratio of time core transmits photonic message over total simulation time
– Metric of average path setup time
– Function of message length and network topology
Offered load considered when core is ready to transmit
For uncongested network: throughput-per-core = offered load
Simulation system parameters:
– 36 multithreaded cores
– DMA transfers of fixed size messages, 16kB
– Line rate = 960Gbps; Photonic message = 134ns
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200728
Throughput per core for 36-node photonic NoC
Multithreading enables better exploitation of photonic NoC high BWGain of 26% over single-threadNon-blocking mesh, shorter average path, improved by 13% over crossbar
HPEC 2007, Lexington, MA 18-20 September, 200729
FFT Computation Performance We consider the execution of Cooley-Tukey FFT algorithm using 32 of 36
available cores First phase: each core processes: k=m/M sample elements
– m = array size of input samples– M = number of cores
After first phase, log M iterations of computation-step followed by communication-step when cores exchange data in butterfly
Time to perform FFT computation depends on core architecture, time for data movement is function of NoC line rate and topology
Reported results for FFT on Cell processor, 224 samples FFT executes in ~43ms based on Bailey’s algorithm.
We assume Cell core with (2X) 256MB local-store memory, DP Use Bailey’s algorithm to complete first phase of Cooley-Tukey in 43ms Cooley-Tukey requires 5kLogk floating point operations, each iteration after
first phase is ~1.8ms for k= 224 Assuming 960Gbps, CMP non-blocking mesh NoC can execute 229 in 66ms
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200730
FFT Computation Power Analysis
For photonic NoC:
– Hop between two switches is 2.78mm, with average path of 11 hops and 4 switch element turns
– 32 blocks of 256MB and line rate of 960Gbps, each connection is 105.6mW at interfaces and 2mW in switch turns
– total power dissipation is 3.44W
Electronic NoC:
– Assume equivalent electronic circuit switched network
– Power dissipated only for length of optimally repeated wire at 22nm, 0.26pJ/bit/mm
Summary: Computation time is a function of the line rate, independent of medium
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200731
FFT Computation Performance Comparison
FFT computation: time ratio and power ratio as function of line rate
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200732
Performance-per-Watt
To achieve same execution time (time ratio = 1), electronic NoC must operate at the same line rate of 960Gbps, dissipating 7.6W/connection or ~70X over photonic
Total dissipated power is ~244W
To achieve same power (power ratio = 1), electronic NoC must operate at line rate of 13.5Gbps, a reduction of 98.6%.
Execution time will take ~1sec or 15X longer than photonic
Frontiers in Nanophotonics and Plasmonics, Guaruja, SP Brazil November, 200733
Summary
Opt
ical
I/O
Electronic NoCs dissipate prohibitively high power
a technology shift is required
Remarkable advances in Silicon Nanophotonics
Photonic NoCs provide enormous capacity at dramatically low power consumption required for future CMPs, both on- and off-chip
Performance-per-Watt gains on communications intensive applications
CMPs are clearly emerging for power efficient high performance computing capability
Future on-chip interconnects must provide large bandwidth to many cores
top related