Clocking links in multi-chip packages: a case study


Page 1: Clocking links in multi-chip packages: a case study


Clocking links in multi-chip packages: a case study

Tamer Ali¹, D. Patil, F. Liu, E. Alon², J. Lexau, C.-K. K. Yang¹, R. Ho
Sun Labs, Oracle
¹ Also with UCLA; ² Also with UC-Berkeley

Page 2: Clocking links in multi-chip packages: a case study

A disclaimer


This is:
• Not a talk about upcoming Oracle products
• About research done at Sun Labs and academia
• ~100 people in Oracle working on advanced research
• Looking at hardware, software, OS, languages, DB…
• Working in conjunction with UCLA and UC-Berkeley
• Providing a technology toolbox for product groups

This work was supported in part by DARPA under Agreement HR0011-08-09-0001. The contents of this presentation should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

Page 3: Clocking links in multi-chip packages: a case study

It’s all about the wires
From chips to data centers

• Markets calling for ever-increasing performance
  – Databases: transactions to data-warehousing analytics
    • “Customers who bought this also purchased…”
  – High-performance and/or ubiquitous computing

• System designers responding with integration
  – Lots of cores on an individual chip
  – Lots of socketed chips on a single board
  – Lots of boards in a single chassis
  – Lots of chassis in a data center
  – Of course, they all have to communicate!


Page 4: Clocking links in multi-chip packages: a case study

Multicore CPUs need on-chip networks
To pass messages for cache sharing, coherence, and DMAs

• Current examples range from crossbars to meshes

• On-chip wires
  – Bandwidth is relatively free: no need for overclocking
  – Parallel wires with global synchrony or simple asynchrony


Image sources: www.amd.com, www.oracle.com, www.intel.com, www.nvidia.com

Page 5: Clocking links in multi-chip packages: a case study

Boards and chassis need traces and backplanes
Here the challenge is distance, bit rate, and power

• Impairments: skin and dielectric loss, reflections, crosstalk…

• Traces on boards and backplanes
  – Minimize chip connectors (solder balls): overclock them
  – Use parallel clocking (source-synchronous) with phase recovery


Image sources: www.oracle.com; Wikimedia Commons; Stojanovic, Ph.D. thesis, 2004

Page 6: Clocking links in multi-chip packages: a case study

Data centers need lots of cables
Optical or electrical

• Reach, cable management, reliability, replaceability
  – Full disclosure: “the ugly” was actually a Sun data center

• Long cables between boards and chassis
  – Minimize cabling (weight, cost): overclock them
  – TX and RX have completely different clocks, so full CDR is needed


[Figure panels: “The good”, “The bad”, “The ugly”]

Taken from Al Davis, Hot Chips tutorial, 2010

Page 7: Clocking links in multi-chip packages: a case study

These interconnects are fairly well understood
Or, at least their constraints are

• Long cables between boards and chassis
  – Minimize cabling (weight, cost): overclock them
  – TX and RX have completely different clocks, so full CDR is needed

• Traces on boards and backplanes
  – Minimize chip connectors (solder balls): overclock them
  – Use parallel clocking (source-synchronous) with phase recovery

• On-chip wires
  – Bandwidth is relatively free: no need for overclocking
  – Parallel wires with global synchrony or simple asynchrony


Recent work in a new layer here

Page 8: Clocking links in multi-chip packages: a case study

A package-level “macrochip” integrated system
Overcome pin/ball limits for much higher chip-to-chip bandwidth

• Integrate lots of silicon: multi-core CPUs, DRAM
  – Connect the chips using fine-pitch optical interconnect

• Chips sit face up in a silicon carrier lattice
  – Bridge chips interface CPU/DRAM to the carrier lattice


[Figure: optical bridges connect CPUs or DRAMs to a silicon lattice with waveguides]

Ho et al, IEEE Design and Test, July 2010

Page 9: Clocking links in multi-chip packages: a case study

An optical macrochip link

• Transmitter side: WDM allows waveguide sharing
• Receiver side: light is sensed by a diode and converted to current, which is amplified in the TIA
• A clock in the electrical domain at TX and RX synchronizes communication
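As a rough numeric sketch of the receive chain just described: photodiode current is responsivity times received optical power, and the TIA turns that current into a voltage. All three numeric values below are illustrative assumptions, not numbers from the deck.

```python
# Sketch of the optical receive chain: diode converts light to current,
# TIA converts current to voltage. Values are illustrative assumptions.

RESPONSIVITY_A_PER_W = 0.8   # assumed photodiode responsivity
P_OPTICAL_W = 100e-6         # assumed received optical power (100 uW)
TIA_GAIN_OHMS = 5e3          # assumed transimpedance gain (5 kOhm)

i_photo = RESPONSIVITY_A_PER_W * P_OPTICAL_W
v_out = TIA_GAIN_OHMS * i_photo

print(f"photocurrent: {i_photo * 1e6:.0f} uA")     # 80 uA
print(f"TIA output swing: {v_out * 1e3:.0f} mV")   # 400 mV
```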


Page 10: Clocking links in multi-chip packages: a case study

This is not a talk (or a paper) on packaging or optics
See Ashok Krishnamoorthy this afternoon for those topics

• Optical links have interesting circuit implications
  – Small overhead (20 µm pitch optical connector, 16× WDM)
    • No need to overclock these links; run them at the clock rate
    • Simplify timing recovery by having a wider bit time
  – Target energy/performance comparable to electrical interconnects (≈1 mW/Gbps):
    • Need significant reductions in optical component loss
    • Need efficient TX and RX circuits
  – Clocking is a big part of RX circuits (CDR, phase adjustment, etc.)
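One way to read the ≈1 mW/Gbps target: at the 10 Gbps link rate given on the next slide, it caps each link (TX, RX, and clocking combined) at about 10 mW, i.e. 1 pJ/bit. A quick check:

```python
# Reading the ~1 mW/Gbps target as a per-link budget, using the 10 Gbps
# link rate from the "ground rules" slide. 1 mW/Gbps equals 1 pJ/bit.

TARGET_MW_PER_GBPS = 1.0
LINK_RATE_GBPS = 10.0

budget_mw = TARGET_MW_PER_GBPS * LINK_RATE_GBPS
print(f"per-link budget (TX + RX + clocking): {budget_mw:.0f} mW")  # 10 mW
print(f"energy per bit: {TARGET_MW_PER_GBPS:.0f} pJ/bit")           # 1 pJ/bit
```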

• So, a question: how should these links be clocked?


Page 11: Clocking links in multi-chip packages: a case study

Clocking “ground rules”
For this case study

• Chips in a macrochip package share a clock source
  – Typically a slow (MHz) crystal oscillator
  – Here, chips run at 5 GHz and links at 10 Gbps

• Two basic choices in clock distribution and usage
  – Multiply the shared clock at the “root” of the system
    • Distribute a fast 5 GHz clock optically over the chip
    • More like on-chip clocking
  – Distribute the native slow clock globally
    • Local chips use their own multiplying PLL (mPLL) at each IO channel
    • More like board-level clocking
  – In both cases, do local phase adjustment at each IO channel

• Which one would you choose?
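Before answering, it helps to spell out the ratios these ground rules imply. In the sketch below, the 25 MHz crystal frequency is an assumed example (the slide says only “MHz”), and the dual-data-rate reading of 10 Gbps on a 5 GHz clock is an inference, not stated on the slide.

```python
# Clock ratios implied by the ground rules. The crystal frequency is an
# assumed example; the DDR inference is mine, not the slide's.

F_CRYSTAL_HZ = 25e6    # assumed crystal reference
F_CHIP_HZ = 5e9        # chip clock, per the slide
LINK_RATE_BPS = 10e9   # link rate, per the slide

print(f"clock multiplication needed: {F_CHIP_HZ / F_CRYSTAL_HZ:.0f}x")  # 200x
print(f"bits per clock cycle: {LINK_RATE_BPS / F_CHIP_HZ:.0f}")         # 2 (DDR)
```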


Page 12: Clocking links in multi-chip packages: a case study

Slow Root Clock vs. Fast Root Clock


• For slow clock distribution, the PLL accumulates jitter over long time constants
• For fast clock distribution, clock buffers dominate jitter
  – Maybe we can do something about that

σ²_Jitter,Total = 2·σ²_Jitter,PLL + σ²_Jitter,Tx + σ²_Jitter,Rx   (slow root clock)

σ²_Jitter,Total = 2·σ²_Jitter,DLL + σ²_Jitter,Tx + σ²_Jitter,Rx   (fast root clock)
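A quick way to feel the trade-off is to plug numbers into the variance sums above. The formulas come from the slide; the component jitter values in this sketch are illustrative assumptions, not measurements.

```python
import math

def total_rms_jitter(sigma_mult_ps, sigma_tx_ps, sigma_rx_ps):
    """Total rms jitter; the multiplier term (PLL or DLL) is counted
    twice because both the TX and RX channels contain one."""
    return math.sqrt(2 * sigma_mult_ps**2 + sigma_tx_ps**2 + sigma_rx_ps**2)

# Slow root clock: per-channel multiplying PLLs accumulate jitter over
# long time constants, so their term dominates (values assumed).
print(total_rms_jitter(sigma_mult_ps=1.2, sigma_tx_ps=0.3, sigma_rx_ps=0.3))

# Fast root clock: per-channel DLLs only phase-adjust, so their term
# stays small (values assumed).
print(total_rms_jitter(sigma_mult_ps=0.1, sigma_tx_ps=0.3, sigma_rx_ps=0.3))
```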

Page 13: Clocking links in multi-chip packages: a case study

LC-Tuned Clock Buffer for Fast Clock Distribution

• The inductor resonates with the capacitive on-chip wire
  – Maximum gain at the clock frequency
  – Filters noise (jitter) elsewhere
• Jitter is reduced by √Q
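A small numeric sketch of this intuition follows. The wire capacitance, Q, and input jitter are assumed values; only the standard resonance relation f₀ = 1/(2π√(LC)) and the slide's √Q jitter reduction are taken as given.

```python
import math

F_CLK = 5e9       # 5 GHz chip clock (from the "ground rules" slide)
C_WIRE = 1e-12    # assumed on-chip clock wire capacitance (1 pF)

# Pick L so the LC tank resonates at the clock frequency:
# f0 = 1 / (2*pi*sqrt(L*C))  =>  L = 1 / ((2*pi*f0)^2 * C)
L_tank = 1.0 / ((2 * math.pi * F_CLK) ** 2 * C_WIRE)
print(f"tank inductance: {L_tank * 1e9:.2f} nH")          # ~1.01 nH

Q = 8             # assumed tank quality factor
jitter_in_ps = 0.4                                        # assumed input jitter
jitter_out_ps = jitter_in_ps / math.sqrt(Q)               # sqrt(Q) reduction
print(f"jitter after one buffer: {jitter_out_ps:.2f} ps")  # ~0.14 ps
```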


Page 14: Clocking links in multi-chip packages: a case study

Clock Buffer Insertion Criteria

• Each buffer filters noise and reduces jitter
  – Can we insert multiple buffers to improve jitter further?
  – Keep the power constant


Page 15: Clocking links in multi-chip packages: a case study

Clock Buffer Insertion Criteria (Contd.)

• Cascading more clock buffers helps up to a point
  – Scale down the size and power of each buffer to keep the total power constant
  – After three stages, devices fall out of saturation and the gain drops

• Note: lower-frequency operation has more jitter!
  – Q actually drops at lower frequencies
  – The slope decreases ⇒ more susceptible to noise
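A deliberately crude toy model of the stage-count trade-off, under the slide's constant-total-power constraint: each stage filters incoming jitter by √Q, but its own added noise grows as its share of the power shrinks. That noise model is my assumption; only the three-stage saturation limit comes from the slide.

```python
import math

def output_jitter_ps(n_stages, source_jitter_ps=0.4, stage_noise_ps=0.05, q=8):
    """Jitter after a chain of n resonant buffers at constant total power."""
    per_stage_noise = stage_noise_ps * math.sqrt(n_stages)  # smaller devices -> noisier
    jitter = source_jitter_ps
    for _ in range(n_stages):
        # each stage filters incoming jitter by sqrt(Q), then adds its own noise
        jitter = math.sqrt((jitter / math.sqrt(q)) ** 2 + per_stage_noise ** 2)
    return jitter

for n in range(1, 5):
    note = "  <- slide: devices leave saturation beyond 3 stages" if n == 3 else ""
    print(f"{n} stage(s): {output_jitter_ps(n):.3f} ps{note}")
```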


[Figure: jitter vs. number of buffer stages; annotations: “devices get out of saturation”, 0.15 ps]

Page 16: Clocking links in multi-chip packages: a case study

Let’s Talk Numbers…

• Fast on-chip clock distribution gives 0.25 ps rms jitter
  – 0.15 ps for three clock distribution buffers
  – 0.1 ps for the phase-adjustment DLL
  – Equivalent to 3.8 ps peak-to-peak jitter for a BER of 10⁻¹²

• Slow on-chip clock distribution gives 1.8 ps rms jitter
  – Distribute a 500 MHz clock, multiply by 10 inside the receiver
  – Assumes no jitter from clock distribution
  – Equivalent to 25 ps peak-to-peak jitter for a BER of 10⁻¹²
  – That is 25% of the UI!

• Fast clock distribution is a clear winner
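For reference, these rms-to-peak-to-peak conversions follow the standard Gaussian-jitter relation p-p ≈ 2·Q(BER)·rms, with Q ≈ 7.03 at a BER of 10⁻¹². A quick check reproduces the slow-clock figure; the fast-clock result lands slightly below the quoted 3.8 ps, which may reflect how the deck combines the buffer and DLL terms.

```python
# Checking the quoted rms-to-peak-to-peak conversions with the standard
# Gaussian relation pp = 2 * Q(BER) * rms, Q ~= 7.03 for BER = 1e-12.

def peak_to_peak_ps(rms_ps, q_ber=7.03):
    return 2 * q_ber * rms_ps

print(f"slow clocking: {peak_to_peak_ps(1.8):.1f} ps p-p")   # ~25.3 ps (slide: 25 ps)
print(f"fast clocking: {peak_to_peak_ps(0.25):.1f} ps p-p")  # ~3.5 ps (slide: 3.8 ps)

# At 10 Gbps the UI is 100 ps, so ~25 ps p-p is ~25% of the UI, as noted.
print(f"slow clocking, fraction of UI: {peak_to_peak_ps(1.8) / 100:.0%}")
```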


Page 17: Clocking links in multi-chip packages: a case study

The Caveat…

• Tuning the clock buffers across the chip is an issue
  – A varactor or capacitor bank is needed
    • Reduces Q and degrades performance
  – Or use a low-Q inductor to tolerate resonance variation
    • Trades performance for ease of tuning

• Dynamic voltage/frequency scaling is tricky
  – Clock buffers have to be scaled to match the processor clock

• Power supply noise
  – Dominant unless good regulators are used in the DLLs

• Upside: improved system tolerance to supply noise
  – Due to the absence of on-chip PLLs
  – Works in favor of fast clock distribution


Page 18: Clocking links in multi-chip packages: a case study

Summary

• Improve performance/power with a many-core macrochip
  – Point-to-point optical links pose interesting clocking challenges

• The on-board clocking paradigm results in high jitter
  – Multiplying PLLs accumulate jitter
  – Instead, distribute a global fast clock optically to all chips

• Tuned on-chip clock buffers are the key enablers
  – Their narrow bandwidth filters noise, and hence jitter
  – Almost 10× better than using PLLs

• Caveats
  – Need to tune the buffers
  – Need to tame power supply noise in the DLLs


Page 19: Clocking links in multi-chip packages: a case study


Questions?

