Elastic circuits
Jordi Cortadella
Universitat Politècnica de Catalunya, Barcelona
EMicro 2013
Goals
• Convince ourselves that:
  – designing an asynchronous circuit is easy
  – synchronous and asynchronous circuits are similar
  – asynchronous circuits bring new advantages
• Not to cover exotic asynchronous schemes
• Elasticity can also be synchronous
Clocking
Nvidia Kepler™ GK110
28 nm, 7.1B transistors, 550 mm², 2688 CUDA cores, base clock: 836 MHz, memory clock: 6 GHz
• How to distribute the clock?
• How to determine the clock frequency?
• How to implement robust communications?
• How to reduce and manage energy?
Outline
• Synchronous and Source-synchronous circuits
• Completion detection
• Handshaking
• Performance analysis
• Why asynchronous?
• Design automation
• Synchronous elasticity
• Globally-asynchronous Locally-synchronous
Synchronous and Source-Synchronous
Synchronous circuit
[Figure: synchronous circuit driven by a PLL]
Synchronous circuit
[Figure: registers with combinational logic (CL) between them, clocked from a PLL through a clock tree]
Two competing paths:
• Launching path
• Capturing path
Launching path < Capturing path + Period
CLKtree + CL < CLKtree + Period
CL < Period (no clock skew)
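The inequalities above can be checked numerically. This is a minimal sketch, assuming delays given as integers in picoseconds and ignoring setup time and clock-to-Q delay (as the slide does); the function name `meets_setup` and its parameters are hypothetical.

```python
def meets_setup(clk_tree_launch, logic_delay, clk_tree_capture, period):
    """Return True if the launching path is shorter than capturing path + period.

    All arguments are delays in the same unit (here: picoseconds).
    """
    launching_path = clk_tree_launch + logic_delay   # CLKtree + CL
    capturing_path = clk_tree_capture                # CLKtree
    return launching_path < capturing_path + period


# Balanced clock tree (no skew): the check reduces to CL < Period.
balanced = meets_setup(clk_tree_launch=300, logic_delay=900,
                       clk_tree_capture=300, period=1000)

# Same logic, but the capturing clock arrives 200 ps before the launching clock.
skewed = meets_setup(clk_tree_launch=300, logic_delay=900,
                     clk_tree_capture=100, period=1000)

print(balanced, skewed)
```

With a balanced tree the clock-tree terms cancel; once the two branches differ, the skew eats into the period.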
Source-synchronous
[Figure: pipeline clocked by CLKgen through a chain of matched delays]
• No global clock required
• More tolerance to PVT variations
• Period > longest combinational path
• Good for acyclic pipelines
[Figure labels: launching path, capturing path]
Source-synchronous with forks and joins
How to synchronize incoming events?
C element (Muller 1959)
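[Figure: C element symbol with inputs A, B and output C, and its waveforms]
A B | C
0 0 | 0
0 1 | C
1 0 | C
1 1 | 1
(C in the output column: the output keeps its previous value)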
C element (Muller 1959)
[Figure: C element implemented with a majority gate (MAJ)]
(many implementations exist)
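A behavioral sketch of the C element, assuming the majority-gate implementation with the output fed back as the third input. The class name `CElement` is hypothetical.

```python
def majority(a, b, c):
    """Majority of three bits."""
    return (a & b) | (a & c) | (b & c)


class CElement:
    """Muller C element: the output follows the inputs when they agree,
    and holds its previous value otherwise."""

    def __init__(self, initial=0):
        self.c = initial

    def update(self, a, b):
        # Feeding the output back into a majority gate gives the hold behavior.
        self.c = majority(a, b, self.c)
        return self.c


gate = CElement()
trace = [gate.update(a, b) for a, b in [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]]
print(trace)  # [0, 0, 1, 1, 0]
```

The output rises only after both inputs have risen and falls only after both have fallen, which is what makes the gate useful for joining events.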
Completion detection
Completion detection
[Figure: pipeline stage clocked by CLKgen through a fixed delay]
The fixed delay must be longer than the worst-case logic delay (plus variability)
Q: could we detect when a computation has completed ASAP?
Delay-insensitive codes: Dual Rail
• Dual rail: every bit encoded with two signals

A.t A.f | A
0   0   | Spacer
0   1   | 0
1   0   | 1
1   1   | Not used

[Figure: waveforms of A.t and A.f for the sequence 1, SP, 0, SP, 1, SP, 1, SP]
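The dual-rail table can be written as a pair of conversion functions. This is only an illustrative sketch: a bit is a Python `0`/`1`, the spacer is represented by `None`, and the function names are hypothetical.

```python
SPACER = (0, 0)


def encode(bit):
    """Return the rail pair (A.t, A.f) for a bit, or the spacer for None."""
    if bit is None:
        return SPACER
    return (1, 0) if bit else (0, 1)


def decode(t, f):
    """Return 0, 1 or None (spacer) for a rail pair; reject the unused code."""
    if (t, f) == (1, 1):
        raise ValueError("code word 11 is not used in dual rail")
    if (t, f) == SPACER:
        return None
    return 1 if t else 0


# A data stream alternates valid code words and spacers: 1, SP, 0, SP.
stream = [encode(v) for v in (1, None, 0, None)]
print(stream)  # [(1, 0), (0, 0), (0, 1), (0, 0)]
```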
Dual Rail AND gate
A   B  | C
SP  SP | SP
0   -  | 0
-   0  | 0
SP  1  | SP
1   SP | SP
1   1  | 1

[Figure: dual-rail AND gate with inputs A.t, A.f, B.t, B.f and outputs C.t, C.f]
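One gate-level reading of this table, given as a sketch rather than as the circuit drawn on the slide: C.t is the AND of the true rails and C.f is the OR of the false rails. The constants `SP`, `ZERO`, `ONE` and the function `dr_and` are hypothetical names.

```python
SP = (0, 0)    # spacer
ZERO = (0, 1)  # (t, f) encoding of 0
ONE = (1, 0)   # (t, f) encoding of 1


def dr_and(a, b):
    """Dual-rail AND of two (t, f) rail pairs."""
    a_t, a_f = a
    b_t, b_f = b
    c_t = a_t & b_t  # the result is 1 only when both inputs are 1
    c_f = a_f | b_f  # a single 0 decides the result
    return (c_t, c_f)


for a, b in [(SP, SP), (ZERO, SP), (SP, ZERO), (SP, ONE), (ONE, SP), (ONE, ONE)]:
    print(a, b, dr_and(a, b))
```

Note the "-" rows: a 0 on one input produces a valid output even while the other input is still a spacer, so a valid output does not imply that all inputs have arrived.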
Dual Rail Inverter
A  | Z
SP | SP
0  | 1
1  | 0

[Figure: dual-rail inverter with inputs A.t, A.f and outputs Z.t, Z.f]
Dual Rail AND/OR gate
[Figure: dual-rail AND and OR gates over A.t/A.f, B.t/B.f and C.t/C.f; the second gate is drawn with the .t and .f rails swapped]
Dual rail: completion detection
[Figure: the outputs of the dual-rail logic feed a completion detection tree ending in a C element that produces "done"]
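A behavioral sketch of completion detection, under two assumptions that are not stated on the slide: each bit is flagged valid by OR-ing its two rails, and the tree is modelled as a single multi-input C element. All names are hypothetical.

```python
def c_element(inputs, prev):
    """Multi-input C element: switch only when all inputs agree."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return prev  # inputs disagree: hold the previous value


def bit_valid(t, f):
    """A dual-rail bit carries data when one of its rails is high."""
    return t | f


def done(word, prev_done):
    """Completion signal for a word given as a list of (t, f) rail pairs."""
    return c_element([bit_valid(t, f) for t, f in word], prev_done)


d = 0
history = []
for word in [
    [(0, 0), (0, 0)],  # all spacers
    [(1, 0), (0, 0)],  # first bit has arrived
    [(1, 0), (0, 1)],  # both bits valid: computation complete
    [(0, 0), (0, 1)],  # reset in progress
    [(0, 0), (0, 0)],  # all spacers again
]:
    d = done(word, d)
    history.append(d)
print(history)  # [0, 0, 1, 1, 0]
```

"done" rises only when every bit is valid and falls only when every bit has returned to the spacer.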
Multi-input C element
[Figure: tree of C elements combining the inputs a1 … a7 into the output c]
Dual rail: completion detection
[Figure: dual-rail netlist (AND, OR, INV, AND gates) clocked by CLKgen]
Dual rail: completion detection
[Figure: dual-rail netlist (AND, OR, INV, AND gates) with a C element and CLKgen]
Dual rail: operation
[Figure: dual-rail netlist (AND, OR, INV, AND gates) with a C element and CLKgen, alternating Reset and Compute phases]
For correct operation, all internal signals should be reset before the compute phase:
• Use a more complex implementation of dual-rail (e.g., DIMS), or
• Have internal completion detection, or
• Use timing assumptions
Other DI codes
• There are many DI codes:
  – k-out-of-n, Berger, Knuth, …
• Example: 1-out-of-4
  – 2 bits with 4 wires
  – Same wire efficiency as DR
  – Less power consuming
  – Good for communication
  – Bad for logic

Wires  | Value
0000   | Spacer
0001   | 0
0010   | 1
0100   | 2
1000   | 3
others | not used
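The 1-out-of-4 table can be sketched as follows. The wire state is a 4-character string in the same order as the table, the spacer is `None`, and the function names are hypothetical.

```python
def encode_1of4(value):
    """Return the 4-wire code for a 2-bit value (0..3), or the spacer for None."""
    if value is None:
        return "0000"
    if value not in (0, 1, 2, 3):
        raise ValueError("1-out-of-4 carries values 0..3 only")
    return format(1 << value, "04b")


def decode_1of4(wires):
    """Return the value for a code word, None for the spacer."""
    if wires == "0000":
        return None
    if len(wires) != 4 or wires.count("1") != 1:
        raise ValueError("not a 1-out-of-4 code word")
    return 3 - wires.index("1")


print([encode_1of4(v) for v in range(4)])  # ['0001', '0010', '0100', '1000']
```

Each 2-bit symbol raises exactly one wire, whereas two dual-rail bits raise two wires, which is where the lower switching activity comes from.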
Single rail data vs. dual rail
Some back-of-the-envelope estimations:

              | Single rail | Dual rail
Area          | 1           | 2
Delay         | 1           | << 1
Static power  | 1           | 2
Dynamic power | < 0.2       | 2

Dual rail:
• Good for speed
• Large area
• High power consumption
Handshaking
Handshaking
[Figure: pipeline stage clocked by CLKgen through an unknown delay]
Assume that the source module can provide data at any rate:
• When should the CLK generator send an event if the internal delays of the circuit are unknown?
Solution: handshaking
Handshaking
[Figure: handshake between two modules with the signals Data, Request and Acknowledge; captions: "I have data", "I want data"]
Asynchronous elastic pipeline
[Figure: chain of C elements with ReqIn/ReqOut and AckIn/AckOut]
• David Muller’s pipeline (late 50’s)
• Sutherland’s Micropipelines (Turing Award lecture, 1989)
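A small simulation of the Muller pipeline control, sketched under these assumptions: each stage is a C element whose inputs are the previous stage's output and the inverted output of the next stage, and all stages are updated at once per step (one possible execution order). The names `step`, `req_left` and `ack_right` are hypothetical.

```python
def c_element(a, b, prev):
    """Two-input C element: follow the inputs when they agree, else hold."""
    return a if a == b else prev


def step(stages, req_left, ack_right):
    """Compute the next value of every C element in the chain.

    stages    : current outputs of the C elements, left to right
    req_left  : request driven by the left environment
    ack_right : acknowledge driven by the right environment
    """
    n = len(stages)
    nxt = []
    for i in range(n):
        left = req_left if i == 0 else stages[i - 1]
        right = ack_right if i == n - 1 else stages[i + 1]
        nxt.append(c_element(left, 1 - right, stages[i]))
    return nxt


# A rising request propagates through an empty 3-stage pipeline.
state = [0, 0, 0]
history = []
for _ in range(3):
    state = step(state, req_left=1, ack_right=0)
    history.append(state)
print(history)  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
```

A stage accepts a new event only when its successor has released the previous one, which is what makes the pipeline elastic.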
Multiple inputs and outputs
[Figure: block with multiple inputs and outputs; a delay and a C element combine the Req/Ack handshakes of the channels]
Channel-based communication
• A channel contains data and handshake wires
[Figure: a single-rail channel (Data, Req, Ack) and a dual-rail channel (Data, Ack)]
Push/pull channels
• Push: the sender initiates the communication
• Pull: the receiver initiates the communication
[Figure: sender and receiver connected by a push channel (single-rail data, Req (push), Ack) and by a pull channel (single-rail data, Req (pull), Ack)]
Four-phase protocol
• Valid data on the active edge of Req
• Req/Ack must return to zero before the next transfer
• Different variations of the 4-phase protocol exist
[Figure: timing diagram of Req, Ack and Data (Data 1, Data 2, Data 3), marking each data transfer]
Two-phase protocol
• Every edge is active
• It may require double-edge triggered flip-flops or pulse generators
[Figure: timing diagram of Req, Ack and Data (Data 1, Data 2, Data 3), marking each data transfer]
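The difference between the two protocols shows up in the number of signal transitions per transfer. The sketch below lists the events on a push channel for one common variation of each protocol; the event names and functions are hypothetical.

```python
def four_phase(n_transfers):
    """Req/Ack events for n transfers: both signals return to zero each time."""
    events = []
    for _ in range(n_transfers):
        events += ["Req+", "Ack+", "Req-", "Ack-"]
    return events


def two_phase(n_transfers):
    """Req/Ack events for n transfers: every edge starts or ends a transfer."""
    events = []
    req = ack = 0
    for _ in range(n_transfers):
        req ^= 1
        events.append("Req+" if req else "Req-")
        ack ^= 1
        events.append("Ack+" if ack else "Ack-")
    return events


print(four_phase(2))  # 8 events for 2 transfers
print(two_phase(2))   # 4 events for 2 transfers
```

Two-phase halves the number of transitions, at the price of control logic that must react to both edges.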
How to memorize?
[Figure: combinational logic between two latches (L), with a delay and two C elements as controllers. Question: 2-phase or 4-phase?]
[Figure: 2-phase version, with pulse generators]
[Figure: 4-phase version]
Performance analysis
Ring oscillators
[Figure: network of C elements forming rings, with stages numbered 1 to 7]
• Every ring requires an odd number of inverters
• The cycle period is determined by the slowest ring
• The cycle period is adapted to the operating conditions (temperature, voltage)
Global Rings
[Figures: ring of six C elements holding a different number of tokens on each slide]
Throughput (Th) shown on the successive slides: 1 / 6, 2 / 6, 3 / 6, 1 / 6
[Plot: Th versus number of tokens (0 to N); Th peaks at 1/2 with N/2 tokens; the ring is token limited on one side of the peak and bubble limited on the other]
• Ramamoorthy and Ho, Performance evaluation of asynchronous concurrent systems with Petri nets, 1980
• T. Williams et al., A self-timed chip for division, 1987
• Greenstreet and Steiglitz, Bubbles can make self-timed pipelines fast, 1990
• Manohar and Martin, Slack elasticity in concurrent computing, 1998
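The token-limited / bubble-limited curve can be reproduced with a one-line model. This is a sketch under the assumption that every stage has the same unit delay for moving a token forward and a bubble backward; the function name `ring_throughput` is hypothetical.

```python
from fractions import Fraction


def ring_throughput(tokens, stages):
    """Throughput of a ring: limited by tokens or by bubbles (empty stages)."""
    bubbles = stages - tokens
    return Fraction(min(tokens, bubbles), stages)


# Throughput of a 6-stage ring for every possible number of tokens.
for k in range(7):
    print(k, ring_throughput(k, 6))
```

For a 6-stage ring the model gives 1/6, 2/6 and 3/6 for one, two and three tokens, and drops back to 1/6 with five tokens because only one bubble is left.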
A latch-based view of synchronous circuits
Flip-flop = Master + Slave
Multiple Rings
[Figure: several coupled rings annotated 2 / 4, 2 / 4, 2 / 5 and 5 / 7]
5 / 7 ? It’s bubble limited!!! 2 / 7
Slack matching
[Figure: the same rings annotated 2 / 4, 2 / 4, 2 / 5; the bubble-limited ring goes from 2 / 7 to 4 / 9]
• We can add as many bubbles as we want (but not tokens!)
• Slack matching can be solved optimally in polynomial time
• Slack matching is conceptually equivalent to buffer (FIFO) sizing or recycling
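The numbers on these slides can be replayed with simple arithmetic. This is a sketch, not the polynomial-time slack-matching formulation: it assumes each annotation is (tokens, stages), unit stage delays, and that the slowest ring sets the system throughput. All names are hypothetical.

```python
from fractions import Fraction


def ring_throughput(tokens, stages):
    """Throughput of one ring, limited by tokens or by bubbles."""
    return Fraction(min(tokens, stages - tokens), stages)


def system_throughput(rings):
    """The slowest ring determines the throughput of the whole system."""
    return min(ring_throughput(t, s) for t, s in rings)


rings = [(2, 4), (2, 4), (2, 5), (5, 7)]
before = system_throughput(rings)       # the (5, 7) ring has only 2 bubbles

# Slack matching: add two empty stages (bubbles) to the bubble-limited ring.
matched = rings[:-1] + [(5, 7 + 2)]
after = system_throughput(matched)

print(before, ring_throughput(5, 9), after)  # 2/7 4/9 2/5
```

After adding the bubbles, the bottleneck moves to the 2 / 5 ring.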
Performance analysis
[Figure: network of C elements; label: (Mean Cycle Ratio)]
Latch-based design
[Figure: pipeline of latches L1 to L4 and the waveforms of their enable signals, marking the launching path and the capturing path]
Matched delays can be adjustable
[Figure: pipeline of latches L1 to L4 with matched delays controlled by a delay selection input]
Delays can be adjusted:
• At testing/boot time (to adjust to static variability)
• At runtime (to compensate dynamic variability)
Why asynchronous?
Exploiting elasticity
[Figure: CLK in three modes: rigid clock, high performance, low energy]
[Plot: voltage (0.7 V, 0.8 V, 0.9 V, 1 V) versus performance (500 MHz, 1 GHz, 2 GHz), with the points rigid clock, high performance and low energy linked by voltage scaling]
Voltage scaling and power savings
3 ARM926 cores on the same die
[Figure labels: -24%, -14%]
Tracking variability
[Figure: circuit with a matched delay]
[Plot: delay of the critical paths and of the multi-corner matched delay at the best, typ and worst corners]
Good correlation for:
• Process variability (systematic)
• Global voltage fluctuations
• Temperature
• Aging (partially)
Margins
[Figure: composition of the cycle period]
Rigid clocks: gate and wire delays (typ) + P, V, T, aging, PLL jitter, skew
Elastic clocks: gate and wire delays (typ) + P, V, T, aging, skew
Margin reduction: speed-up / power savings
Clock elasticity
[Figure: cycle period of a rigid clock (computation time plus wasted time) compared with the cycle period of an elastic clock (computation time)]
Design Automation
Design automation paradigms
• Synthesis of asynchronous controllers
  – Logic synthesis from Petri nets or asynchronous FSMs
• Syntax-directed translation
  – Correct-by-construction composition of handshake components
• De-synchronization
  – Automatic transformation from synchronous to asynchronous
Synthesis of asynchronous controllers
[Figure: VME bus controller between the bus (DSr, DSw, DTACK) and the device (LDS, LDTACK), driving the data transceiver through D]
[Figure: timing diagram of the read cycle with the signals DSr, LDS, LDTACK, D and DTACK]
Synthesis of asynchronous controllers
[Figure: VME bus controller with the signals LDS, LDTACK, D, DSr and DTACK, and its Signal Transition Graph over the transitions DSr+, LDS+, LDTACK+, D+, DTACK+, DSr-, D-, DTACK-, LDS- and LDTACK-]
Synthesis of asynchronous controllers
[Figure: the Signal Transition Graph and the synthesized circuit over the signals DTACK, D, DSr, LDS and LDTACK]
Cortadella et al., Petrify
Syntax-directed translation
[Figure: handshake circuit for the program below, built from components labeled SEQ, do, MUX, DMX, the variables x and y (R/W ports), the operators -, <> and <, and the output out]
int = type [0..255]
& gcd: main proc (in? chan <<int,int>> & out! chan int)
begin
  x, y: var int
| forever do
    in?<<x,y>>
  ; do x <> y then
      if x < y then y:=y-x else x:=x-y fi
    od
  ; out!x
  od
end
Sources:
J. Kessels and A. Peeters. DESCALE: A Design Experiment for a SmartCard Application Consuming Low Energy, in Principles of Asynchronous Circuit Design, A Systems Perspective, Eds. J. Sparso and S. Furber, Kluwer Academic Publishers, 2001.
P.A. Beerel, R.O. Ozdag and M. Ferretti. A Designer’s Guide to Asynchronous VLSI, Cambridge University Press, 2010.
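For readers unfamiliar with the notation, the inner loop of the gcd program above computes a greatest common divisor by repeated subtraction. A Python rendering of that loop, as a sketch: the channel communication (`in?`, `out!`) becomes a function call and return value, and the check for positive inputs is an addition, since the loop would not terminate if one operand were 0.

```python
def gcd_by_subtraction(x, y):
    """Greatest common divisor by repeated subtraction."""
    if x <= 0 or y <= 0:
        raise ValueError("both operands must be positive")
    while x != y:          # do x <> y then ... od
        if x < y:
            y = y - x
        else:
            x = x - y
    return x               # out!x


print(gcd_by_subtraction(12, 18))  # 6
```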
De-synchronization
• Strategy: substitute the clock tree by local clocks and handshakes
• Combinational logic and latches are not modified
• More tolerance to variability
  – Similar area, less power and/or more speed
• Cortadella, Kondratyev, Lavagno and Sotiriou. Desynchronization: Synthesis of asynchronous circuits from synchronous specifications. IEEE TCAD, Oct 2006.
Synchronous operation
[Figure: synchronous circuit clocked by CLKgen]
Transforming a synchronous circuit into an asynchronous one (automatically)
De-synchronization
[Figure: the same circuit after de-synchronization]
System-level de-synchronization
[Figures: de-synchronization applied to a system with a global CLK]
Synchronous elasticity
Different flavors of elasticity
[Figure: an adder consuming the input streams 1, 4, 7, … and 2, 0, 1, … and producing 3, 4, 8, … in three flavors: Rigid (+), Elastic (+e) and Synchronous Elastic (+s)]
Carloni et al., Latency-insensitive systems.
Asynchronous elasticity
[Figure: stages communicating through req and ack]
Synchronous elasticity
[Figure: stages communicating through valid and stop, driven by a CLK; clock sources shown: PLL, ring oscillator]
Latch-based elasticity
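[Figure: sender and receiver connected through a chain of latches; data latches with enable inputs (En), control latches for the valid bit (V), and the channel signals Data, Valid and Stop on each side]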
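How the Valid and Stop wires are interpreted in each clock cycle can be sketched as below. The state names Transfer, Idle and Retry follow the usual description of the synchronous elastic protocol in the literature and are an assumption here, since the slide only names the signals; the functions are hypothetical.

```python
def channel_state(valid, stop):
    """Classify one clock cycle of a Valid/Stop channel."""
    if not valid:
        return "Idle"        # the sender offers no data
    if stop:
        return "Retry"       # data offered, but the receiver is not ready
    return "Transfer"        # data offered and accepted


def count_transfers(trace):
    """Number of data items moved over a trace of (valid, stop) pairs."""
    return sum(1 for v, s in trace if channel_state(v, s) == "Transfer")


trace = [(1, 0), (1, 1), (1, 0), (0, 0), (0, 1)]
print([channel_state(v, s) for v, s in trace])
print(count_transfers(trace))  # 2
```

During a Retry cycle the sender must keep the same data on the channel until Stop is released.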
Elastic netlists
[Figure: elastic netlist with elastic buffers (EB) and Fork, Join and Join / Fork controllers; the controllers generate the enable signals to the data latches]
Variable Latency Units
[Figure: unit taking [0 - k] cycles, with go, done and clear signals and V/S (valid/stop) interfaces on both sides]
Globally-asynchronous Locally-synchronous (GALS)
SoC design with GALS
• Most IPs are synchronous
• Different components may have different operating frequencies
• Some components have variable latencies (e.g., cache hit/miss latency)
• Multiple clock domains are essential
[Figure: SoC with the blocks P, DSP and Mem on a fast bus and a slow bus, in the clock domains CLK1, CLK2 and CLK3, connected through bridges with CDC]
Multiple clock domains
[Figure: three clocking schemes: single clock (mesochronous); rational clock frequencies (f1/f0, f2/f0, f3/f0 derived from CLK(f0)); independent clocks (CLK0, CLK1, CLK2, CLK3); label: controllable skew]
Synchronous handshakes
[Figure: sender (CLK1) and receiver (CLK2) connected by Data, Valid and Ack]
• The arrival of data is unpredictable
• Handshakes solve the problem
The problem: metastability
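[Figure: a flip-flop clocked by ФT feeding a flip-flop clocked by ФR; waveforms of D, ФR and Q, with Q undetermined (?) when D changes inside the setup/hold window]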
How long does it take to resolve metastability?
[Plot: metastability]
MTBF: Mean Time Between Failures
Classical synchronous solution
[Figure: chain of flip-flops clocked by ФT and ФR]

$\mathrm{MTBF} = \dfrac{e^{t_r/\tau}}{W \cdot f_\Phi \cdot f_D}$

Mean Time Between Failures
• fФ: frequency of the clock
• fD: frequency of the data
• tr: resolve time available
• W: metastability window
• τ: resolve time constant

Example:
# FFs | MTBF
1 FF  | 15 min
2 FF  | 9 days
3 FF  | 23 years
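The effect of adding flip-flops can be explored numerically, assuming the usual form MTBF = e^(tr/τ) / (W · fФ · fD) built from the symbols defined above. The parameter values below are hypothetical and are not the ones behind the table on the slide.

```python
import math


def mtbf(t_r, tau, window, f_clk, f_data):
    """Mean time between failures (seconds) of a synchronizer.

    t_r    : resolve time available (s)
    tau    : resolve time constant (s)
    window : metastability window W (s)
    f_clk  : frequency of the clock (Hz)
    f_data : frequency of the data (Hz)
    """
    return math.exp(t_r / tau) / (window * f_clk * f_data)


# Hypothetical technology and traffic parameters.
TAU, WINDOW, F_CLK, F_DATA = 50e-12, 100e-12, 1e9, 1e8
PERIOD = 1 / F_CLK

# Assume each extra flip-flop adds about one clock period of resolve time.
for n_ff in (1, 2, 3):
    print(n_ff, "FF:", mtbf(n_ff * PERIOD, TAU, WINDOW, F_CLK, F_DATA), "s")
```

Under that assumption each additional flip-flop multiplies the MTBF by e^(T/τ), where T is the clock period, which is why the figures grow so quickly.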
Handshake with synchronizers
[Figure: sender (CLK1) and receiver (CLK2) connected by Data, Valid and Ack through synchronizers]
• Simple solution
• Throughput can be highly degraded: a long round trip for every transaction
Asynchronous FIFOs
[Figure: circular buffer with FIFO control; input side (Clk In) and output side (Clk Out), each with Valid, Ack and Data]
• Ack is issued as soon as data has been delivered
• No impact on throughput (1 token/cycle)
• Min latency determined by the internal synchronizers
• Some tricky structures for the FIFO pointers (e.g., Gray encoding)
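Gray encoding is used for FIFO pointers because consecutive values differ in a single bit, so a pointer sampled in the other clock domain during a change is off by at most one position. A minimal sketch of the conversion; the function names are hypothetical.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def from_gray(g):
    """Inverse of to_gray."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


# 3-bit pointer: each increment (including the wrap-around) flips one bit.
codes = [to_gray(i) for i in range(8)]
print([format(c, "03b") for c in codes])
```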
SoC design with GALS
[Figure: SoC with the blocks P, DSP and Mem on a fast bus and a slow bus, in the clock domains CLK1, CLK2 and CLK3, connected through bridges with CDC]
• Bridges for Clock Domain Crossing usually contain asynchronous FIFOs
• Latency cost only when interfacing with synchronous domains
• No latency penalty between asynchronous domains
Conclusions
• Elasticity offers flexibility in time
  – Modularity
  – Dynamic adaptability
  – Tolerance to variability
• Better optimization of power/performance
• Why isn’t it an important trend in circuit design?
  – Lack of commercial EDA support (timing sign-off)
  – Designers do not feel comfortable with “unpredictable” timing
  – Other aspects: testing, verification, …
• De-synchronization might be a viable solution
Bibliography
• Carmona, Cortadella, Kishinevsky and Taubin, Elastic Circuits, IEEE Trans. on CAD, Oct. 2009.
• Beerel, Ozdag and Ferretti, A Designer’s Guide to Asynchronous VLSI, Cambridge, 2010.
• Sparso and Furber, Principles of Asynchronous Circuit Design: A Systems Perspective, Kluwer, 2001.
• Myers, Asynchronous Circuit Design, John Wiley & Sons, 2001.