Integrated Systems Group
Massachusetts Institute of Technology
Monolithic Integration of Energy-efficient CMOS Silicon Photonic Interconnects
Vladimir Stojanović
Manycore SOC roadmap fuels bandwidth demand
64-tile system (64-256 cores) - 4-way SIMD FMACs @ 2.5 – 5 GHz
- 5-10 TFlops on one chip
- Need 5-10 TB/s of off-chip I/O
- Even higher on-chip bandwidth
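As a rough check on the figures above, peak throughput can be estimated as cores x SIMD lanes x flops per FMAC x clock rate. The sketch below assumes one 4-way SIMD FMAC unit per core and counts each fused multiply-accumulate as 2 flops; neither is stated on the slide. The 1 Byte/Flop ratio used to convert throughput into I/O demand is likewise an assumption here, and the function names are illustrative.

```python
def peak_tflops(cores, simd_lanes, clock_ghz, flops_per_fmac=2):
    """Peak throughput in TFlop/s.

    Assumes one SIMD FMAC unit per core and 2 flops per FMAC
    (both assumptions, not taken from the slide).
    """
    return cores * simd_lanes * flops_per_fmac * clock_ghz / 1e3


def io_demand_tbytes(tflops, bytes_per_flop=1.0):
    """Off-chip I/O demand in TB/s for an assumed bytes-per-flop ratio."""
    return tflops * bytes_per_flop


for cores in (64, 256):
    for clock_ghz in (2.5, 5.0):
        t = peak_tflops(cores, 4, clock_ghz)
        print(f"{cores} cores @ {clock_ghz} GHz: "
              f"{t:.2f} TFlop/s, {io_demand_tbytes(t):.2f} TB/s")
```

Under these assumptions, the 256-core end of the range lands at roughly 5 to 10 TFlop/s, and at 1 Byte/Flop that corresponds to roughly 5 to 10 TB/s of I/O.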
[Figure: 2 cm x 2 cm die, labeled "Intel 48 core -Xeon"]
System Bottlenecks
[Figure: block diagrams of a CPU system and a manycore system. Labels: CPU, cores, Cache/MC, DRAM DIMM, Interconnect Network]
Bottlenecks due to energy and bandwidth density limitations
Wire and I/O scaling
Increased wire resistivity makes wire caps scale very slowly
Can’t get both energy-efficiency and high-data rate in I/O
[Figure, on-chip wires: copper resistivity]
[Figure, I/O: energy cost [pJ/b] vs. data rate [Gb/s] for the best electrical links; series Chip2Chip and Backplane, annotated Loss ~10dB and Loss ~20-25dB]
Bandwidth, pin count and power scaling
Need 16k pins in 2017 for HPC*
[Figure: package pin count over time, assuming 1 Byte/Flop and signal pins @ 20 Gb/s/link; points labeled from 2,4 cores up to 256 cores, 2 TFlop/s]
* More than half of the pins are for power supply
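The slide does not derive the 16k pin figure. One way to reach a number of that order is sketched below: convert the off-chip bandwidth to a link count at 20 Gb/s/link, convert links to signal pins, then add power pins. The 10 TB/s input is the upper end of the I/O demand quoted earlier in the talk. The 2 pins per link (differential signaling) and the exact 50% signal-pin share are assumptions made for this illustration, and the function name is hypothetical.

```python
import math


def package_pins(io_tbytes_per_s, link_gbps=20.0,
                 pins_per_link=2, signal_fraction=0.5):
    """Estimate (links, signal pins, total pins) for a given I/O bandwidth.

    pins_per_link=2 assumes differential signaling (assumption).
    signal_fraction=0.5 approximates "more than half of the pins are for
    power", so the total returned is a lower bound under that note.
    """
    total_gbps = io_tbytes_per_s * 8 * 1000  # TB/s -> Gb/s
    links = math.ceil(total_gbps / link_gbps)
    signal_pins = links * pins_per_link
    total_pins = math.ceil(signal_pins / signal_fraction)
    return links, signal_pins, total_pins


print(package_pins(10.0))  # upper end of the 5-10 TB/s I/O demand
print(package_pins(2.0))   # 2 TFlop/s at 1 Byte/Flop
```

With these assumptions, 10 TB/s needs 4000 links and 8000 signal pins, so at least 16000 package pins in total.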
Monolithic CMOS-Photonics in Computer Systems
[Figure: application range from supercomputers to embedded apps]
Si-photonics in advanced CMOS and DRAM processes
NO costly process changes
Many architectural studies show promise
[Shacham’07] [Petracca’08] [Vantrease’08] [Psota’07] [Kirman’06] [Joshi’09] [Pan’09] [Batten’08] [Kurian’10] [Koka’08-10]
Optimization requires full system insight
Developed cross-layer modeling framework (Kurian, Chen 2011)
[Figure: framework components. Labels: Cache & Core Energy & Area; DSENT electrical and optical link and network models]
Start at the link level:
Jointly optimize circuits and photonic devices
[Figure: link block diagram. Labels: Register, Mux, Pre-Driver, Mod-Driver, Receiver Front-end, Samplers & Monitoring, Demux, Register, PLL or Opt. Clk, Phase Adjust]
Dense WDM: 128 wavelengths/waveguide, >1 Tb/s per waveguide
Need 1000s of transceivers on die with <100 fJ/bit cost at >10 Gb/s!
- Optimized modulator circuits/devices
- Optimized receiver circuits/photo-detector
- Optimized thermal tuning
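The targets above can be put into numbers with simple unit arithmetic. The sketch below uses the slide's bounds as if they were exact values (10 Gb/s per wavelength, 100 fJ/bit, 1000 transceivers); treating the bounds as exact is an assumption, and the function names are illustrative.

```python
def waveguide_throughput_tbps(wavelengths, gbps_per_wavelength):
    """Aggregate throughput of one waveguide in Tb/s."""
    return wavelengths * gbps_per_wavelength / 1e3


def link_power_mw(gbps, fj_per_bit):
    """Power of one transceiver in mW.

    (Gb/s) * (fJ/bit) = 1e9 * 1e-15 W = 1e-6 W, i.e. microwatts,
    so divide by 1e3 to get milliwatts.
    """
    return gbps * fj_per_bit / 1e3


per_waveguide = waveguide_throughput_tbps(128, 10)
per_transceiver = link_power_mw(10, 100)
print(f"{per_waveguide:.2f} Tb/s per waveguide")
print(f"{per_transceiver:.1f} mW per transceiver")
print(f"{1000 * per_transceiver / 1e3:.1f} W for 1000 transceivers")
```

At these boundary values, 128 wavelengths give 1.28 Tb/s per waveguide, and 1000 transceivers draw about 1 W in total.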
Laser energy increases with data-rate
Limited Rx sensitivity
Modulation more expensive -> extinction ratio / insertion loss trade-off
Tuning costs decrease with data-rate
Moderate data rates most energy-efficient
512 Gb/s aggregate throughput, assuming 32nm CMOS
Georgas CICC 2011
Need to optimize carefully
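The trends listed above (laser and modulation energy rising with data rate, tuning cost falling with data rate) imply an energy minimum at a moderate rate. The toy model below illustrates that shape only. It is not the model from Georgas CICC 2011: the linear growth of laser and modulator energy, the fixed tuning power amortized over the bit rate, and every coefficient are hypothetical values chosen for illustration.

```python
def energy_fj_per_bit(rate_gbps, laser_slope=5.0, mod_slope=2.0,
                      tuning_uw=500.0):
    """Toy per-bit link energy in fJ/bit (all coefficients hypothetical).

    laser_slope, mod_slope: fJ/bit added per Gb/s of data rate (assumed linear).
    tuning_uw: fixed thermal tuning power per link in microwatts;
               uW / (Gb/s) = fJ/bit, so its per-bit cost falls with rate.
    """
    laser = laser_slope * rate_gbps
    modulation = mod_slope * rate_gbps
    tuning = tuning_uw / rate_gbps
    return laser + modulation + tuning


def best_rate(rates_gbps):
    """Return the candidate data rate with the lowest per-bit energy."""
    return min(rates_gbps, key=energy_fj_per_bit)


rates = range(1, 33)  # candidate per-wavelength data rates in Gb/s
best = best_rate(rates)
print(f"lowest energy at {best} Gb/s: {energy_fj_per_bit(best):.1f} fJ/bit")
```

With any coefficients of this form the minimum falls between the extremes: very low rates are dominated by tuning cost and very high rates by laser and modulation cost.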
DWDM link efficiency optimization
Optimize for min energy-cost
Bandwidth density dominated by circuit and photonics area (not coupler pitch)
- 10x better than electrical bump-limited
- 200x better than electrical package pin limit
[Figure annotations: electrical bump-pitch limited to <1 Tb/s/mm2 (>10x); package pin limit 0.05 Tb/s/mm2]
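The two ratios can be cross-checked against the quoted electrical limits. The sketch below multiplies each limit by its ratio; treating the "<1 Tb/s/mm2" bump limit as exactly 1 Tb/s/mm2 is an assumption, and the function name is illustrative.

```python
def implied_density_tbps_mm2(electrical_limit_tbps_mm2, improvement):
    """Photonic bandwidth density implied by an electrical limit and a ratio."""
    return electrical_limit_tbps_mm2 * improvement


from_pin_limit = implied_density_tbps_mm2(0.05, 200)  # package pin limit
from_bump_limit = implied_density_tbps_mm2(1.0, 10)   # bump limit taken as 1.0
print(f"from pin limit:  {from_pin_limit:.1f} Tb/s/mm2")
print(f"from bump limit: {from_bump_limit:.1f} Tb/s/mm2")
```

Both routes point to a photonic bandwidth density on the order of 10 Tb/s/mm2.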
Photonic DRAM Network Organization
Important Concepts
- Power/message switching (only to active DRAM chip in DRAM cube/super DIMM)
- Vertical die-to-die coupling (minimizes cabling, 8 dies per DRAM cube)
- Command distributed electrically (broadcast)
- Data photonic (single writer, multiple readers)
[Figure: photonic DRAM network. On the processor die, the CPU and memory scheduler feed memory controllers MC 1 ... MC K ... MC 16; each MC connects to a Super DIMM (1 ... K) holding DRAM cubes 1-4; each die from (cube 1, die 1) to (cube 1, die 8) has cmd, Dwr, and Drd connections. Legend: die-die switch, laser in, modulator bank, receiver/PD bank, tunable filterbank, through silicon via, through silicon via hole]
Beamer ISCA 2010
Optimizing DRAM with photonics
[Figure: floorplan, with labels P1 and P4]
Beamer ISCA 2010
Laser Power Guiding Effectiveness
Beamer ISCA 2010
Enables capacity scaling per channel and significant savings in laser energy
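A simplified way to see where the laser savings come from: if light is split across every die in a cube, the laser must supply the receiver power for all of them, while guiding it to the active die only costs the loss of the switches it passes. The sketch below is an illustration of that idea, not the analysis in Beamer ISCA 2010. The 0.5 dB per-switch loss, the 1 mW receiver requirement, and the worst-case path through every switch are all hypothetical assumptions.

```python
def laser_power_split_mw(p_rx_mw, dies):
    """Laser power when light is split equally across all dies (lossless split)."""
    return p_rx_mw * dies


def laser_power_guided_mw(p_rx_mw, dies, switch_loss_db=0.5):
    """Laser power when light is guided to one active die.

    Worst case: the light passes one die-die switch per die before
    reaching the last die. switch_loss_db is a hypothetical value.
    """
    total_loss_db = switch_loss_db * dies
    return p_rx_mw * 10 ** (total_loss_db / 10)


for dies in (1, 2, 4, 8):
    split = laser_power_split_mw(1.0, dies)
    guided = laser_power_guided_mw(1.0, dies)
    print(f"{dies} dies: split {split:.2f} mW, guided {guided:.2f} mW")
```

In this sketch the split case grows linearly with the number of dies, while the guided case grows only with accumulated switch loss, so guiding pays off once there are more than a few dies per channel.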
ATAC – On-Chip network Example
1000-core die
64 clusters connected via optical broadcast
Average Energy over Splash2 benchmarks
Ring tuning very expensive
Non-gated laser very expensive
Including the cores gives the full picture
Energy dominated by cores/caches
Faster network saves overall energy (leakage and clock)
Need aggressive clock-gating and supply/retention scaling
Execution time also matters
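The point about leakage and clock energy can be illustrated with a simple energy budget: static power is paid for the whole execution time, so a faster network can lower total energy even if the network itself spends more. All numbers below are hypothetical, chosen only to show the effect, and the function name is illustrative.

```python
def total_energy_j(core_dynamic_j, static_power_w, exec_time_s, network_j):
    """Total system energy: core dynamic + static (leakage, clock) + network."""
    return core_dynamic_j + static_power_w * exec_time_s + network_j


# Hypothetical workload: same core dynamic energy in both cases.
slow = total_energy_j(core_dynamic_j=10.0, static_power_w=20.0,
                      exec_time_s=1.0, network_j=1.0)
fast = total_energy_j(core_dynamic_j=10.0, static_power_w=20.0,
                      exec_time_s=0.8, network_j=2.0)
print(f"slow network: {slow:.1f} J, fast network: {fast:.1f} J")
```

Here the faster network doubles its own energy yet still reduces the total, because the 20% shorter run time cuts the static energy of the cores and caches.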
Feedback to device designers
Waveguide losses up to 2dB/cm o.k.
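To give a feel for what 2 dB/cm means, the sketch below converts a loss in dB to the fraction of optical power that survives. The 2 cm path length is borrowed from the die size shown earlier in the talk and is only an assumed example length; the function name is illustrative.

```python
def transmitted_fraction(loss_db_per_cm, length_cm):
    """Fraction of optical power remaining after a waveguide of given length."""
    total_loss_db = loss_db_per_cm * length_cm
    return 10 ** (-total_loss_db / 10)


for length_cm in (1.0, 2.0, 4.0):
    frac = transmitted_fraction(2.0, length_cm)
    print(f"{length_cm} cm at 2 dB/cm: {frac:.1%} of the power remains")
```

Across a 2 cm path, 2 dB/cm adds up to 4 dB, which leaves roughly 40% of the launched power.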
Conclusions
Biggest gains if photonics both on-chip and off-chip
- Core-to-MC network
- MC-to-DRAM bank network: immediate 10x gains
Need comprehensive modeling framework to see the full picture
- Link-level: tight interaction of circuits and photonics through good models
- System-level: include all system components (cores, network, caches, memory)
Acknowledgments
Krste Asanović, Rajeev Ram, Miloš Popović, Christopher Batten, Ajay Joshi
Anant Agarwal, Li-Shiuan Peh, Lionel Kimerling, Jurgen Michel, Dimitri Antoniadis
Jason Miller, Jeff Shainline
Jason Orcutt, Chen Sun, Ben Moss, Jonathan Leu, Michael Georgas, Stevan Urosević, Owen Chen, George Kurian, Yong-Jin Kwon, Scott Beamer
Dr. Jag Shah and Dr. Charles Holland, DARPA
FCRP IFC, NSF
Trusted Foundry, Intel Corporation, APIC