relative timing driven multi-synchronous design: enabling

36
Relative Timing Driven Multi-Synchronous Design: Enabling Order-of-Magnitude Energy Reduction Kenneth S. Stevens University of Utah Granite Mountain Technologies 27 March 2013 UofU and GMT 1

Upload: others

Post on 28-Dec-2021

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Relative Timing Driven Multi-Synchronous Design: Enabling

Relative Timing Driven Multi-SynchronousDesign: Enabling Order-of-Magnitude

Energy Reduction

Kenneth S. Stevens

University of UtahGranite Mountain Technologies

27 March 2013 UofU and GMT 1

Page 2: Relative Timing Driven Multi-Synchronous Design: Enabling

Learn from Prof. Kajitana

l Think differently and deeply

l Apply thought to current challenges

Then collaborate

Goals of Presentation:

1. Define and propose “rule breaker” idea

2. Request support from physical design community

27 March 2013 UofU and GMT 2

Page 3: Relative Timing Driven Multi-Synchronous Design: Enabling

Multi-Synchronous Advantage

1. Efficiency in power and performance is new game in town

2. Multi-synchronous design provides optimization opportunity

3. New (asynchronous) timing model is one excellent path

4. Produces average 10× eτ2 improvement

l Pentium: eτ2 = 17.5×l FFT: eτ2 = 16.9×

5. But ... need improved physical design support

Design Energy Area Freq. Latency AggregatePentium F.E. 2.05 0.85 2.92 2.38 12.11×64-pt FFT 3.95 2.83 2.07 3.37 77.98×

27 March 2013 UofU and GMT 3

Page 4: Relative Timing Driven Multi-Synchronous Design: Enabling

Timing is a Key Issue

Multi-synchronous design produces best results

6

-s?

-

-�

6 6

Single frequency, low skew(small blocks, standard CAD)1. global block frequencies

2. higher clock power

3. clock design, distribution

Synchronous Clock at 1.8GHz

Synchronousvariable freq. Pausable

1.7GHz clk

Synchronous3.0GHz clk

Asynccircuit

Synchronous Clock at 1.5GHz

6

-s?

-

-�

6 6

Multiple frequencies(SoC reality – localization)1. blocks operate at best frequency

2. network not synchronized

3. synchronizing FIFOs

27 March 2013 UofU and GMT 4

Page 5: Relative Timing Driven Multi-Synchronous Design: Enabling

Energy Efficient DesignWine goblet model:

l Energy efficiency has two primary sources

u System architectureu Physical design

l Methodology and CAD unify sources

Best realization:

l Multi-synchronous

u Defined by system’s critical pathu Then optimal local power-delayu Asynchronous best methodology:

n no synchronization cost

arch

pd

27 March 2013 UofU and GMT 5

Page 6: Relative Timing Driven Multi-Synchronous Design: Enabling

Interface Matters!

Clocked design requires synchronizers when crossing all domains.

IP Clock Domain Network Clock Domain

data

clk sr S S

S S

Major location for buffering in a design.

27 March 2013 UofU and GMT 6

Page 7: Relative Timing Driven Multi-Synchronous Design: Enabling

Interface Matters!

No synchronization required into async domain.

IP Clock Domain Network Clock Domain

data

clk sr

S S

Improves power, performance, and modularity

27 March 2013 UofU and GMT 7

Page 8: Relative Timing Driven Multi-Synchronous Design: Enabling

Timed Asynchronous Designs

27 March 2013 UofU and GMT 8

Page 9: Relative Timing Driven Multi-Synchronous Design: Enabling

Multi-Synchronous Architecture

1. Make architectural bottleneck as fast as possible.

2. Make the rest of the design match bottleneck

l . . . normally as slow as possible

3. Optimize locally for power/performance.

tagin7

tagin1

irdyackbufreq

bufackirdy L1 L7

tagout7

tagout1

Asynchronous Pentium bottleneck circuit

27 March 2013 UofU and GMT 9

Page 10: Relative Timing Driven Multi-Synchronous Design: Enabling

Concurrency and Time

Architectural level timing experiment: Pentium front end

Output Buffer

Tag Units

Output Buffer

Tag Units

Output Buffer

Tag Units

Output Buffer

Tag Units

Len. Decoders

Cache Latch

Row

0R

ow1

Row

2R

ow3

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15Column

27 March 2013 UofU and GMT 10

Page 11: Relative Timing Driven Multi-Synchronous Design: Enabling

Concurrency and Time

Architectural level timing experiment: Pentium front end

Output Buffer

Tag Units

Output Buffer

Tag Units

Output Buffer

Tag Units

Output Buffer

Tag Units

Len. Decoders

Cache Latch

Target

3 3 9 4 1 7 2 1 6 3 5 3 5 1 3 4

1

2

27 March 2013 UofU and GMT 11

Page 12: Relative Timing Driven Multi-Synchronous Design: Enabling

Concurrency and Time

Architectural level timing experiment: Pentium front end

Output Buffer

Tag Units

Output Buffer

Tag Units

Output Buffer

Tag Units

Output Buffer

Tag Units

Len. Decoders

Cache Latch

3 1 7 2 1 6 3 5 3 5 1 3 4

1

2

3

27 March 2013 UofU and GMT 12

Page 13: Relative Timing Driven Multi-Synchronous Design: Enabling

Concurrency and Time

Architectural level timing experiment: Pentium front end

Output Buffer

Tag Units

Output Buffer

Tag Units

Output Buffer

Tag Units

Output Buffer

Tag Units

Len. Decoders

Cache Latch

3 2 1 4 7 2 1 6 3 5 3 5 1 3 4

1

2

3

4

27 March 2013 UofU and GMT 13

Page 14: Relative Timing Driven Multi-Synchronous Design: Enabling

Concurrency and Time

Architectural level timing experiment: Pentium front end

Output Buffer

Tag Units

Output Buffer

Tag Units

Output Buffer

Tag Units

Output Buffer

Tag Units

Len. Decoders

Cache Latch

3 2 1 4 2 5 1 3 4

5

2

3

4

27 March 2013 UofU and GMT 14

Page 15: Relative Timing Driven Multi-Synchronous Design: Enabling

Concurrency and Time

Architectural level timing experiment: Pentium front end

Output Buffer

Tag Units

Output Buffer

Tag Units

Output Buffer

Tag Units

Output Buffer

Tag Units

Len. Decoders

Cache Latch

2 1 4 2 3 1 7 9 4 2 3

5

6 2

3

4

27 March 2013 UofU and GMT 15

Page 16: Relative Timing Driven Multi-Synchronous Design: Enabling

Timing and Sequencing

Traditional representation of timing:

l Metric values

u On an IC we measure it to picosecondsu In track and ski racing, we measure it to milliseconds

But what do we really care about?

l it isn’t the number on the stop watch. . .

27 March 2013 UofU and GMT 16

Page 17: Relative Timing Driven Multi-Synchronous Design: Enabling

Timing and Sequencing

Traditional representation of timing:

l Metric values

u On an IC we measure it to picosecondsu In track and ski racing, we measure it to milliseconds

But what do we really care about?

l it isn’t the number on the stop watch. . .

We care about who wins!!

The key: Timing results in sequencing

Relative Timing formally represents the signal sequencingproduced by circuit timing

27 March 2013 UofU and GMT 17

Page 18: Relative Timing Driven Multi-Synchronous Design: Enabling

New Formal Abstract Model: Relative Timing

l Timing is both the technology differentiator and barrierl Relative Timing is the generalized solutionl The key property of time is the sequencing it imposes

Sequence gives winner, performance, etc.

l true in semiconductors as well as sportsl absolute stopwatch value is auxiliary

Novel relativistic formal logicrepresentation of time (relative timing):

pod 7→ poc1 ≺ poc2

Sequencing relative to common referencel can now evaluate sequencingl can now control sequencing

27 March 2013 UofU and GMT 18

Page 19: Relative Timing Driven Multi-Synchronous Design: Enabling

Relative Timing1. Relative Timing

l Sequences signals at poc (point of convergence)l Requires a common timing reference: pod (point of divergence)

2. Formal representation: pod 7→ poc1 + margin ≺ poc2

3. RT models timing in ALL systems

l Clocked: pod = clock poc = flopsl Async: pod = request poc = latches

4. RT enables direct commercial CAD support of general timing requirements

l formal RT constraints mapped to sdc constraints

POD POC

A

B

POD

POC0

POC1

FFi FFi+1

data

clk

i i+1

clk

data

m

27 March 2013 UofU and GMT 19

Page 20: Relative Timing Driven Multi-Synchronous Design: Enabling

Relative Timed Design: Bundled Data

Bundled data design is much like clocked.

CL CLFFi FFi+1 FFi+2n n

clock network

Frequency based (clocked) design.Clock frequency and datapath delay offirst pipeline stage is constrained byLi/clk↑i 7→ Li+1/d+s ≺ Li+1/clk↑i+1

CL CLLi Li+1 Li+2n n

Ctli Ctli+1 Ctli+2

reqiacki

reqi+1acki+1

reqi+2acki+2

reqi+3acki+3

delay delay

Timed (bundled data) handshakedesign. Delay element sized byRT constraint:reqi↑ 7→ Li+1/d+s ≺ Li+1/clk↑

Clocked physical design directly supports the clocked RelativeTiming constraints. The asynchronous circuit constraints must beprovided as min and max constraints, and are not well supported

27 March 2013 UofU and GMT 20

Page 21: Relative Timing Driven Multi-Synchronous Design: Enabling

Relative Timing Driven Flowset d0 fdel 0.600set d0 fdel margin [expr $d0 fdel + 0.050]set d0 bdel 0.060

set size only -all instances [find -hier cell lc1]set size only -all instances [find -hier cell lc3]set size only -all instances [find -hier cell lc4]

set disable timing -from A2 -to Y [find -hier cell lc1]set disable timing -from B1 -to Y [find -hier cell lc1]set disable timing -from A2 -to Y [find -hier cell lc3]set disable timing -from B1 -to Y [find -hier cell lc3]

set max delay $d0 fdel -from a -to l0/dset max delay $d0 fdel -from b -to l0/dset min delay $d0 fdel margin -from lr -to l0/clkset max delay $d0 bdel -from lr -to la#margin 0.050 -from a -to l0/d -from lr -to l0/clk#margin 0.050 -from b -to l0/d -from lr -to l0/clk

27 March 2013 UofU and GMT 21

Page 22: Relative Timing Driven Multi-Synchronous Design: Enabling

Multi-rate 64-Point FFT Architecture

Initial design target: high performance military applications

l Mathematically based on WN = e− j2πN notation

l Hierarchical multi-rate design: N = N1N2

l Decimate frequency (↓) by N2

u operate on N2 low frequency streams

l Transmute data & frequency to N1 low frequency streams

l Expand (↑) by N1 to reconstruct original frequency stream

27 March 2013 UofU and GMT 22

Page 23: Relative Timing Driven Multi-Synchronous Design: Enabling

Design Models

Hierarchical derivation of multi-frequency design:

Xm1(m2) = ∑N2−1n2=0

[W m1n2

N ∑N1−1n1=0 xn2(n1)W

m1n1N1

]W m2n2

N2

l N2 FFTs using N1 values as the inner summation

l Scaled and used to produce N1 FFTs of N2 values

Hierarchically scale design

l Base case when N = 4, X(m) =W 4x(n)

l 4-point FFT performed without multiplication

u Multiplication constants W 4 become ±1

27 March 2013 UofU and GMT 23

Page 24: Relative Timing Driven Multi-Synchronous Design: Enabling

FFT-64

Implemented on IBM’s 65nm 10sf process, Artisan academic library

Three design blocks:

l FFT-4

l FFT-16 N1,N2 = 4

l FFT-64 N1 = 16, N2 = 4

Two designs:

l Clocked Multi-Synchronous

l Relative Timed Multi-Synchronous

u near identical architecturesu additional RT area / pipeline optimized version for FFT-64

27 March 2013 UofU and GMT 24

Page 25: Relative Timing Driven Multi-Synchronous Design: Enabling

General Multi-rate FFT Architecture

1.25GHz 313MHz 313MHz to 78MHz

x(n) -t -x0(n1) - -

?

↓ N2 N1-pt. FFT

N1 Constants

������@@

?

-t -x1(n1) - -

?

↓ N2 N1-pt. FFT

N1 Constants

������@@

- -xN2−1(n1) - -

?

↓ N2 N1-pt. FFT

N1 Constants

������@@qqq qqq qqq qqq qqq

z−1

z−1

z−1

x0(0)

x1(0)

xN2−1(0)

x0(1)

e j 2π

N x1(1)

e j2π(N1−1)N xN2−1(1)

x0(N1−1)

e j 2π(N1−1)N x1(N1−1)

e j2π(N2−1)(N1−1)N xN2−1(N1−1)q q q

q q q

q q q

N2-pt. FFT

N2-pt. FFT

N2-pt. FFT

��

��

��

↑ N1

↑ N1

↑ N1

�X(m)

6t

6

qqq qqqz−1

z−1

z−1

1.25GHz 78MHz ASIC tool flow, 65nm technology

27 March 2013 UofU and GMT 25

Page 26: Relative Timing Driven Multi-Synchronous Design: Enabling

FFT-4 Building Block

Data flow graph of pipelined 4-Point FFT design:

Re{x[0]} + + Re{X[0]}

Im{x[0]} + + Im{X[0]}

Re{x[1]} + - Re{X[1]}

Im{x[1]} + - Im{X[1]}

Re{x[2]} - + Re{X[2]}

Im{x[2]} - + Im{X[2]}

Re{x[3]} - - Re{X[3]}

Im{x[3]} - - Im{X[3]}

27 March 2013 UofU and GMT 26

Page 27: Relative Timing Driven Multi-Synchronous Design: Enabling

Pipelined Asynchronous 4-Point Architecture

l Operates at 1/4 the input frequency

l Synchronization occurs between decimated rows

u Fast internal pipeline stages essential

LC0

LC13

LC12

LC11

LC10

LC23

LC22

LC21

LC20

LC33

LC32

LC31

LC30

LC43

LC42

LC41

LC40

LC5Dec4 Exp4

f 3

f 2

f 1

f 0

j3

j2

j1

j0

f 7

f 6

f 5

f 4

j7

j6

j5

j4

f 11

f 10

f 9

f 8

j11

j10

j9

j8

lrla

rrra

add/sub add/sub

Fork Join Fork Join Fork Join

27 March 2013 UofU and GMT 27

Page 28: Relative Timing Driven Multi-Synchronous Design: Enabling

Decimator-4 Design Comparison

l Clocked block requires pipeline to change frequency

l Async block latency combinational and concurrent

Shi f tReg

R0

R1

R2

R3

R4

R5

R6

R7Din

D1

D2

D3

D4

clk

clk/4 Shi f tReg

ri r1

r2

r3

r4

Din D1D2D3D4

a1a2a3a4

ai

Multi-Synchronous asynchronous design smaller, faster, lower power

27 March 2013 UofU and GMT 28

Page 29: Relative Timing Driven Multi-Synchronous Design: Enabling

ResultsThe 16-point FFT Comparison Result (* values are scaled ideally to 65 nm technology)

Points Word Time for 1K-point Clock Tech. Energy/point Area Power Energy Area Throughput

bits µs MHz nm pJ/data− point mW Benefit Benefit Benefit

Our Design(Async) 16-1024 32 0.83 1274 65 25.05 54 Kgates 30.9 8.01 2.77 8.32

Our Design(clock) 16-1024 32 1.73 588 65 41.83 71 Kgates 24.7 4.8 2.07 3.98

Guan [1] 16-1024 16 6.91∗ 653∗ 130 200.68 147 Kgates 29.7∗ 1 1 1

The 64-point FFT Comparison Result (* values are scaled ideally to 65 nm technology)

Points Word Time for 1K-point Clock Tech. Energy/point Area Power Energy Area Throughput

bits µs MHz nm pJ/data− point mW Benefit Benefit Benefit

Our Design(Async-opt) 64-1024 32 0.93 1284 65 62.41 0.41 mm2 68.5 6.1 0.46 30.16

Our Design(Async) 64-1024 32 0.84 1366 65 59.94 0.50 mm2 72.9 6.35 0.38 33.42

Our Design(clock) 64-1024 32 3.13 588 65 246.75 1.16 mm2 80.7 1.54 0.16 8.99

Baireddy [2] 64-4096 - 28.14∗ 514∗ 90 380.88 0.19 mm2∗ 13.86∗ 1 1 1

The 64-point async-opt design contains 229k gates, our clocked 454k.∗ For comparison, these designs were scaled to a 65nm process by scaling frequency, power, and area in

the 130nm technology by 2.0, 0.5, 0.25×, and in the 90nm design by 1.43, 0.7, and 0.49× respectively.

[1] X. Guan, Y. Fei, and H. Lin, “Hierarchical Design of an Application-Specific Instruction Set Processor for High-Throughput and ScalableFFT Processing” in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 20, No. 3, pp. 551–563, march 2012.

[2] V. Baireddy, H. Khasnis, and R. Mundhada, “A 64-4096 point FFT/IFFT/Windowing Processor for Multi Standard ADSL/VDSLApplications”, in IEEE International symposium on Signals, Systems and Electronics (ISSSE’07), pp. 403–405, 2007.

27 March 2013 UofU and GMT 29

Page 30: Relative Timing Driven Multi-Synchronous Design: Enabling

Multi-Synchronous Advantage

1. Efficiency in power and performance is new game in town

2. Multi-synchronous design provides optimization opprotunity

3. New (asynchronous) timing model is one excellent path

4. Produces average 10× eτ2 improvement

l Pentium: eτ2 = 17.5×l FFT: eτ2 = 16.9×

5. But ... need improved physical design support

Design Energy Area Freq. Latency AggregatePentium F.E. 2.05 0.85 2.92 2.38 12.11×64-pt FFT 3.95 2.83 2.07 3.37 77.98×

27 March 2013 UofU and GMT 30

Page 31: Relative Timing Driven Multi-Synchronous Design: Enabling

RT Physical Design Optimization

Timing, power, and performance optimizations driven by relativetiming constriants.

CL CLLi Li+1 Li+2n n

Ctli Ctli+1 Ctli+2

reqi

acki

reqi+1

acki+1

reqi+2

acki+2

reqi+3

acki+3delay delay

reqi↑ 7→ Li+1/d+m ≺ Li+1/clk↑Mapped to set max delay and set min delay constraints

Clock frequency determines min delay, async adds “hold time”

27 March 2013 UofU and GMT 31

Page 32: Relative Timing Driven Multi-Synchronous Design: Enabling

RT Physical Design Problems

CL CLLi Li+1 Li+2n n

Ctli Ctli+1 Ctli+2

reqi

acki

reqi+1

acki+1

reqi+2

acki+2

reqi+3

acki+3delay delay

1. Inconsistency between operation and results

l supported pins & formats, synthesis vs place and route, etc.2. Min-delay constraints not well supported

l Treated as “hold time fixing”l Create arbitrarily large delays

u Degrades performanceu Required matching max-delay constraint to bound delay

3. Poor job of optimizing competing constraints4. Placement can be substantially improved

27 March 2013 UofU and GMT 32

Page 33: Relative Timing Driven Multi-Synchronous Design: Enabling

RT Physical Design Problems

Simple experiment with inverterswith endpoints mapping either tomodule pin or library gate pin:

module i0 module i1

A B C D E F

Design Compiler SoC EncounterPath Result Iterations Type Result type

A→ E Yes 5 buffers No –A→ F Yes 5 buffers No –B→ E Yes 1 Dly Elts No –B→ F Yes 1 Dly Elts Yes Dly EltsC→ E Yes 1 Dly Elts No –C→ F Yes 1 Dly Elts Yes Dly EltsD→ E No – – No –D→ F No – – No –

Paths use both max and min delay constraints

27 March 2013 UofU and GMT 33

Page 34: Relative Timing Driven Multi-Synchronous Design: Enabling

RT Physical Design Problems

LC0

LC13

LC12

LC11

LC10

LC23

LC22

LC21

LC20

LC33

LC32

LC31

LC30

LC43

LC42

LC41

LC40

LC5Dec4 Exp4

f 3

f 2

f 1

f 0

j3

j2

j1

j0

f 7

f 6

f 5

f 4

j7

j6

j5

j4

f 11

f 10

f 9

f 8

j11

j10

j9

j8

lrla

rrra

add/sub add/sub

Fork Join Fork Join Fork Join

Min-delay constraints get dropped, even in relatively small design!

Design Compiler SoC SoC - timing closureModel #iter cyc. time #iter cyc. time energy/op #iter cyc. time energy/opwl0.5 9 738ps 1 728ps 5.16pJ 70 785ps 4.85pJwl0 7 666ps 1 764ps 5.07pJ 16 763ps 4.87pJ

27 March 2013 UofU and GMT 34

Page 35: Relative Timing Driven Multi-Synchronous Design: Enabling

RT Physical Design Potential

CL CLLi Li+1 Li+2n n

Ctli Ctli+1 Ctli+2

reqi

acki

reqi+1

acki+1

reqi+2

acki+2

reqi+3

acki+3delay delay

1. Low hanging fruit for performance improvements

2. Force directed algorithms

l Combine power/placement optimizationsl Drive cell clusteringl Drive pipeline/repeater placement and wire optimization

3. Tool performance: Convergence and run-time

27 March 2013 UofU and GMT 35

Page 36: Relative Timing Driven Multi-Synchronous Design: Enabling

Multi-Synchronous Advantage

1. Efficiency in power and performance is new game in town

2. Multi-synchronous design provides optimization opprotunity

3. New (asynchronous) timing model is one excellent path

4. Produces average 10× eτ2 improvement

l Pentium: eτ2 = 17.5×l FFT: eτ2 = 16.9×

5. But ... need improved physical design support

Design Energy Area Freq. Latency AggregatePentium F.E. 2.05 0.85 2.92 2.38 12.11×64-pt FFT 3.95 2.83 2.07 3.37 77.98×

27 March 2013 UofU and GMT 36