monolithic integration of energy-efficient cmos silicon

Integrated Systems Group

Massachusetts Institute of Technology

Monolithic Integration of Energy-efficient CMOS Silicon Photonic Interconnects

Vladimir Stojanović

Manycore SOC roadmap fuels bandwidth demand

64-tile system (64-256 cores) - 4-way SIMD FMACs @ 2.5 – 5 GHz

- 5-10 TFlops on one chip

- Need 5-10 TB/s of off-chip I/O

- Even higher on-chip bandwidth
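The throughput and I/O figures above can be reproduced with a short back-of-the-envelope calculation. This is a minimal sketch under stated assumptions: one 4-way SIMD FMAC unit per core, 2 flops per FMAC, the 256-core upper end of the range, and the 1 Byte/Flop ratio quoted later in the talk. The function names are illustrative.

```python
def peak_tflops(cores, simd_lanes, clock_ghz, flops_per_fmac=2):
    """Peak throughput in TFlop/s (assumes one SIMD FMAC unit per core)."""
    gflops = cores * simd_lanes * flops_per_fmac * clock_ghz
    return gflops / 1000.0


def offchip_tbytes_per_s(tflops, bytes_per_flop=1.0):
    """Off-chip bandwidth in TB/s for a given Byte/Flop ratio (assumed 1)."""
    return tflops * bytes_per_flop


# 256 cores, 4-way SIMD, at the two ends of the 2.5 to 5 GHz clock range.
low = peak_tflops(256, 4, 2.5)
high = peak_tflops(256, 4, 5.0)
io_low = offchip_tbytes_per_s(low)
io_high = offchip_tbytes_per_s(high)

print(f"peak compute: {low:.2f} to {high:.2f} TFlop/s")
print(f"off-chip I/O: {io_low:.2f} to {io_high:.2f} TB/s")
```

With these assumptions the result is roughly 5 to 10 TFlop/s and 5 to 10 TB/s, in line with the numbers on the slide.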

[Figure: 2 cm x 2 cm die; chip examples labeled Intel 48 core and Xeon]

System Bottlenecks

[Figure: block diagrams of a single-CPU system (CPU, cache/MC, DRAM DIMM) and a manycore system (cores/CPUs, interconnect networks, multiple cache/MC blocks with DRAM DIMMs)]

Bottlenecks due to energy and bandwidth density limitations

Wire and I/O scaling

Increased wire resistivity makes wire caps scale very slowly

Can’t get both energy efficiency and high data rate in I/O

[Figure, on-chip wires: copper resistivity]

[Figure, I/O: energy cost [pJ/b] versus data rate [Gb/s] for the best electrical links; chip-to-chip (loss ~10 dB) and backplane (loss ~20-25 dB)]

Bandwidth, pin count and power scaling

Need 16k pins in 2017 for HPC*

[Figure: package pin count scaling. Annotations: 1 Byte/Flop; 256 cores; 2 TFlop/s; signal pins @ 20 Gb/s/link; 2,4 cores]

*> half pins for power supply
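A rough pin-count estimate shows how quickly electrical I/O runs out of package pins. This is a sketch, not the calculation behind the slide's curve: differential signaling (2 pins per link) and a 50% share of pins for power are assumptions, and the function names are illustrative. Only the 1 Byte/Flop ratio and the 20 Gb/s/link rate come from the slide.

```python
import math


def signal_pins(tflops, bytes_per_flop=1.0, link_gbps=20.0, pins_per_link=2):
    """Signal pins needed to sustain bytes_per_flop at the given compute rate.

    pins_per_link=2 assumes differential signaling (an assumption).
    """
    bandwidth_gbps = tflops * 1e3 * bytes_per_flop * 8  # TFlop/s -> GB/s -> Gb/s
    links = math.ceil(bandwidth_gbps / link_gbps)
    return links * pins_per_link


def package_pins(n_signal, power_fraction=0.5):
    """Total pins when power_fraction of the package is power/ground."""
    return math.ceil(n_signal / (1.0 - power_fraction))


pins_2tf = package_pins(signal_pins(2.0))
pins_10tf = package_pins(signal_pins(10.0))
print(f"2 TFlop/s:  {pins_2tf} package pins")
print(f"10 TFlop/s: {pins_10tf} package pins")
```

Under these assumptions a 10 TFlop/s chip lands at 16,000 pins, the same order as the 16k figure on the slide. Whether the slide uses exactly these parameters cannot be recovered from the text.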


Monolithic CMOS-Photonics in Computer Systems

Supercomputers

Embedded apps

Si-photonics in advanced CMOS and DRAM process

NO costly process changes

Many architectural studies show promise

[Shacham’07], [Petracca’08], [Vantrease’08], [Psota’07], [Kirman’06], [Joshi’09], [Pan’09], [Batten’08], [Kurian’10], [Koka’08-10]

Optimization requires full system insight

Developed cross-layer modeling framework [Kurian, Chen 2011]

[Figure: framework components: cache and core models; energy and area models; DSENT electrical and optical link and network models]

Start at the link level:

Jointly optimize circuits and photonic devices

[Figure: WDM link block diagram. Transmit side: register, mux, pre-driver, modulator driver, clocked by a PLL or optical clock. Receive side: receiver front-end, samplers and monitoring with phase adjust, demux, register, clocked by a PLL or optical clock.]

Dense WDM: 128 wavelengths/waveguide, >1 Tb/s per waveguide

Need 1000’s of transceivers on die with < 100 fJ/bit cost at > 10 Gb/s!

- Optimized modulator circuits/devices

- Optimized receiver circuits/photo-detector

- Optimized thermal tuning
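The per-waveguide throughput and the transceiver power budget follow from simple unit arithmetic. The sketch below assumes each wavelength runs at exactly 10 Gb/s and each transceiver spends exactly 100 fJ/bit (the slide gives these as bounds), and it takes "1000's of transceivers" as 1000 for illustration. The function names are illustrative.

```python
def waveguide_throughput_tbps(wavelengths, gbps_per_wavelength):
    """Aggregate throughput of one waveguide in Tb/s."""
    return wavelengths * gbps_per_wavelength / 1000.0


def transceiver_power_mw(fj_per_bit, gbps):
    """Power of one transceiver in mW.

    fJ/bit * Gb/s = 1e-15 J * 1e9 1/s = 1e-6 W, so the product is in uW.
    """
    return fj_per_bit * gbps / 1000.0


per_waveguide = waveguide_throughput_tbps(128, 10.0)
per_transceiver = transceiver_power_mw(100.0, 10.0)
per_1000_transceivers_w = 1000 * per_transceiver / 1000.0

print(f"throughput per waveguide: {per_waveguide:.2f} Tb/s")
print(f"power per transceiver:    {per_transceiver:.1f} mW")
print(f"power for 1000 of them:   {per_1000_transceivers_w:.1f} W")
```

At these values one waveguide carries about 1.28 Tb/s, and a thousand transceivers at the 100 fJ/bit budget draw about 1 W in total.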

Laser energy increases with data-rate

Limited Rx sensitivity

Modulation more expensive -> extinction ratio / insertion loss trade-off

Tuning costs decrease with data-rate

Moderate data rates most energy-efficient
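The trade-off above can be illustrated with a toy energy model: a fixed tuning power per ring is amortized over more bits as the rate rises, while the laser energy per bit grows with the rate. Every coefficient below is hypothetical and chosen only to show the shape of the curve. The linear laser term and the constant circuit term are simplifying assumptions, not the model from the cited work. Only the 512 Gb/s aggregate throughput comes from the talk.

```python
AGGREGATE_GBPS = 512.0  # aggregate throughput quoted in the talk


def energy_per_bit_fj(rate_gbps, tuning_uw_per_ring=100.0,
                      circuit_fj=20.0, laser_fj_per_gbps=1.0):
    """Toy link energy in fJ/bit at a given per-wavelength data rate.

    All default coefficients are hypothetical placeholders.
    """
    tuning_fj = tuning_uw_per_ring / rate_gbps  # uW / (Gb/s) = fJ/bit
    laser_fj = laser_fj_per_gbps * rate_gbps    # assumed to grow linearly
    return tuning_fj + circuit_fj + laser_fj


def wavelengths_needed(rate_gbps):
    """Wavelengths required to reach the aggregate throughput."""
    return AGGREGATE_GBPS / rate_gbps


rates = [1, 2, 4, 8, 16, 32]
best_rate = min(rates, key=energy_per_bit_fj)

for r in rates:
    print(f"{r:>2} Gb/s x {wavelengths_needed(r):>5.0f} wavelengths: "
          f"{energy_per_bit_fj(r):6.2f} fJ/bit")
print(f"lowest energy in this sweep at {best_rate} Gb/s per wavelength")
```

With these placeholder coefficients the minimum falls at a moderate rate rather than at either end of the sweep, which is the qualitative point of the slide.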

512 Gb/s aggregate throughput, assuming 32nm CMOS [Georgas CICC 2011]

Need to optimize carefully

DWDM link efficiency optimization

Optimize for min energy-cost

Bandwidth density dominated by circuit and photonics area (not coupler pitch)

- 10x better than the electrical bump-pitch limit (<1 Tb/s/mm2)

- 200x better than the electrical package pin limit (0.05 Tb/s/mm2)
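As a quick consistency check, both comparisons on this slide point to the same photonic bandwidth density. The sketch takes the bump-pitch limit at its 1 Tb/s/mm2 upper bound and treats the 10x and 200x factors as exact, which is an assumption.

```python
BUMP_LIMIT_TBPS_MM2 = 1.0   # slide: bump-pitch limited to <1 Tb/s/mm2
PIN_LIMIT_TBPS_MM2 = 0.05   # slide: package pin limit

density_from_bump = 10 * BUMP_LIMIT_TBPS_MM2
density_from_pin = 200 * PIN_LIMIT_TBPS_MM2

print(f"implied by bump limit: {density_from_bump:.1f} Tb/s/mm2")
print(f"implied by pin limit:  {density_from_pin:.1f} Tb/s/mm2")
```

Both routes give roughly 10 Tb/s/mm2.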

Photonic DRAM Network Organization

Important Concepts

- Power/message switching (only to active DRAM chip in DRAM cube/super DIMM)

- Vertical die-to-die coupling (minimizes cabling; 8 dies per DRAM cube)

- Command distributed electrically (broadcast)

- Data photonic (single writer, multiple readers)

[Figure: CPU processor die with memory scheduler and memory controllers MC 1 to MC 16 (MC K), each driving a super DIMM (super DIMM K) of DRAM cubes 1 to 4; cmd, Dwr and Drd paths for cube 1, die 1 through die 8; die-die switch. Legend: laser in, modulator bank, receiver/PD bank, tunable filterbank, through silicon via, through silicon via hole. Beamer ISCA 2010]

Optimizing DRAM with photonics

Floorplan [Beamer ISCA 2010]

[Figure: floorplans labeled P1, P4]

Laser Power Guiding Effectiveness

Beamer ISCA 2010

Enables capacity scaling per channel and significant savings in laser energy
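The laser-energy saving can be illustrated with an idealized model: if light is split across every die in a cube, the laser must supply the receiver sensitivity once per die, whereas guiding the power to the active die supplies it only once. The sensitivity and path-loss values below are hypothetical, the model ignores any extra loss in the switching elements, and the function name is illustrative. Only the 8 dies per cube comes from the talk.

```python
def laser_power_mw(rx_sensitivity_mw, path_loss_db, dies_sharing_light):
    """Laser power so that each die receiving light sees rx_sensitivity_mw.

    Idealized: equal split among dies, one lumped path loss.
    """
    loss_factor = 10 ** (path_loss_db / 10.0)
    return rx_sensitivity_mw * loss_factor * dies_sharing_light


DIES_PER_CUBE = 8  # from the talk

# Hypothetical receiver sensitivity (0.01 mW) and path loss (10 dB).
split = laser_power_mw(0.01, 10.0, DIES_PER_CUBE)  # light split to all dies
guided = laser_power_mw(0.01, 10.0, 1)             # light guided to one die

print(f"split to all dies:     {split:.2f} mW per wavelength")
print(f"guided to active die:  {guided:.2f} mW per wavelength")
print(f"ideal saving factor:   {split / guided:.0f}x")
```

In this idealized picture the saving scales with the number of dies sharing a channel, which is also why guiding lets capacity grow without a matching increase in laser power.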

ATAC – On-Chip network Example

1000 core die

64 clusters connected via optical broadcast

Average Energy over Splash2 benchmarks

Ring tuning very expensive

Non-gated laser very expensive

Including the cores gives the full picture

Energy dominated by cores/caches

Faster network saves overall energy (leakage and clock)

Need aggressive clock-gating and supply/retention scaling

Execution time also matters
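The reasoning behind "faster network saves overall energy" is that leakage and clock power are paid for the whole execution time, so shortening the run can outweigh a more expensive network. All numbers below are hypothetical and serve only to show the accounting.

```python
def total_energy_j(exec_time_s, static_power_w, dynamic_energy_j):
    """Total energy: time-proportional (leakage + clock) plus dynamic energy."""
    return static_power_w * exec_time_s + dynamic_energy_j


# Hypothetical system: 50 W of leakage and clock power, 20 J of core/cache
# dynamic energy, and a network that costs 2 J (slow) or 4 J (fast).
slow_network = total_energy_j(1.0, 50.0, 20.0 + 2.0)
fast_network = total_energy_j(0.8, 50.0, 20.0 + 4.0)

print(f"slow network: {slow_network:.1f} J")
print(f"fast network: {fast_network:.1f} J")
```

In this example the faster network doubles the network energy yet lowers the total, because the 20% shorter run saves more in leakage and clock energy than the network adds.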


Feedback to device designers

Waveguide losses up to 2dB/cm o.k.
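To put the 2 dB/cm figure in perspective, the sketch below converts a per-length loss into the fraction of optical power that survives a path. The 2 cm path length is an assumption, taken from the die dimension shown earlier in the talk. Real routes may be longer and include other losses.

```python
def transmitted_fraction(loss_db_per_cm, length_cm):
    """Fraction of optical power remaining after a waveguide of given length."""
    total_loss_db = loss_db_per_cm * length_cm
    return 10 ** (-total_loss_db / 10.0)


across_die = transmitted_fraction(2.0, 2.0)  # 2 dB/cm over an assumed 2 cm
laser_overhead = 1.0 / across_die

print(f"power remaining after 2 cm: {across_die:.1%}")
print(f"laser power overhead:       {laser_overhead:.2f}x")
```

At 2 dB/cm a 2 cm path loses 4 dB, so about 40% of the light arrives and the laser must supply roughly 2.5x more power to compensate.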


Conclusions

- Biggest gains if photonics is used both on-chip and off-chip
  - Core-to-MC network
  - MC-to-DRAM bank network: immediate 10x gains

- Need comprehensive modeling framework to see the full picture
  - Link-level: tight interaction of circuits and photonics through good models
  - System-level: include all system components (cores, network, caches, memory)

Acknowledgments

Krste Asanović, Rajeev Ram, Miloš Popović, Christopher Batten, Ajay Joshi

Anant Agarwal, Li-Shiuan Peh, Lionel Kimerling, Jurgen Michel, Dimitri Antoniadis

Jason Miller, Jeff Shainline

Jason Orcutt, Chen Sun, Ben Moss, Jonathan Leu, Michael Georgas, Stevan Urosević, Owen Chen, George Kurian, Yong-Jin Kwon, Scott Beamer

Dr. Jag Shah and Dr. Charles Holland, DARPA

FCRP IFC, NSF

Trusted Foundry, Intel Corporation, APIC
