
1

Folding Technique: Compromising in Special Purpose Hardware Design

Ivan Milentijevic

Faculty of Electronic Engineering, University of Nis

Serbia and Montenegro

[email protected]

2

Outline

• Special purpose hardware design and DSP

• DSP application demands and technologies

• Representations of DSP algorithms and architectures

• Compromising

• Folding technique

• Simple example – ad hoc folding

• Folding equation and mathematical background

• Preparation of source architecture for folding – problems

• Case study 1: Folded Bit-Serial Multiplier

• Case study 2: Configurable Folded FIR Filter Architecture

3

Special purpose hardware design and DSP

• DSP systems can be realized using programmable processors or custom designed hardware circuits fabricated using very-large-scale-integrated (VLSI) circuit technology

• Two important features that distinguish DSP from other general purpose computations are the real-time throughput requirement and the data-driven property.

5

DSP application demands and technologies

6

Representations of DSP algorithms and architectures

DSP algorithms are initially described by mathematical formulas.

System architecture can be described by:

• Behavioral languages

– Applicative: a set of equations (not actions)

– Prescriptive: describe assignments

– Descriptive: VHDL, Verilog, ...

• Graphical representations

– BD (block diagram)

– SFG (signal flow graph)

– DFG (data flow graph)

– DG (dependence graph)
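As a small illustration of such a formula, the 3-tap FIR filter shown in the following figures can be written as y(n) = a·x(n) + b·x(n-1) + c·x(n-2). The Python sketch below evaluates it directly. The coefficient names a, b, c and the zero initial state are assumptions made for this example; they are not taken from the slides.

```python
def fir3(x, a, b, c):
    """Direct evaluation of y(n) = a*x(n) + b*x(n-1) + c*x(n-2).

    Samples before n = 0 are assumed to be zero.
    """
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0  # x(n-1)
        x2 = x[n - 2] if n >= 2 else 0  # x(n-2)
        y.append(a * x[n] + b * x1 + c * x2)
    return y
```

Each output sample needs three multiplications and two additions, which is exactly the set of operations that the graphical representations arrange as nodes.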

7


Block diagram of a 3-tap FIR filter

8


Signal Flow Graph of a 3-tap FIR filter

9


Data Flow Graph of a 3-tap FIR filter

10


Dependence Graph of a 3-tap FIR filter

11

Compromising

• Area – Time – Power

• Area – Time

• Goal: to meet the time (throughput) requirements with optimal chip area, i.e. with optimal use of chip resources

12

Folding technique

• The performance and cost of any digital circuit depend on the circuit design style. Therefore, when creating a given architecture, a careful choice of circuit design style is necessary to establish an optimal area-time-power tradeoff.

• In synthesizing DSP architectures, it is important to minimize the silicon area of the integrated circuits, which is achieved by reducing the number of functional units (such as multipliers and adders), registers, multiplexers, and interconnection wires.

13

Folding technique

• How?

• By executing multiple algorithm operations on a single functional unit, the number of functional units in the implementation is reduced, resulting in an integrated circuit with low silicon area.

14

Simple example – ad hoc folding

Two addition operations are folded to a single adder:

Folding factor* N=2

*Folding factor - the number of algorithm operations folded to a single functional unit

15

Clk   L.input      Up.input   Output
0     a(0)         b(0)       -
1     a(0)+b(0)    c(0)       -
2     a(1)         b(1)       a(0)+b(0)+c(0)
3     a(1)+b(1)    c(1)       -
4     a(2)         b(2)       a(1)+b(1)+c(1)
5     a(2)+b(2)    c(2)       -
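The schedule in the table can be reproduced with a short simulation. The sketch below is only a behavioral model: it assumes that the single adder has one pipeline register at its output, so that a sum computed in one clock cycle is available in the next one. The function name and the row format are hypothetical.

```python
def folded_adder(a, b, c):
    """Simulate y(l) = a(l) + b(l) + c(l) on ONE adder, folding factor N = 2.

    Assumption: the adder has one output register, so the sum computed
    in clock cycle t is visible in cycle t + 1.
    Returns a list of (clk, left_input, upper_input, output) rows.
    """
    rows = []
    reg = None  # register at the adder output
    for clk in range(2 * len(a)):
        l = clk // 2
        if clk % 2 == 0:
            # Even cycle: first addition a(l) + b(l); the register still
            # holds the finished result y(l-1) of the previous iteration.
            left, upper = a[l], b[l]
            out = reg
        else:
            # Odd cycle: second addition, the partial sum is fed back.
            left, upper = reg, c[l]
            out = None
        rows.append((clk, left, upper, out))
        reg = left + upper
    return rows
```

For a = [1, 2, 3], b = [10, 20, 30], c = [100, 200, 300] the output column is empty in cycles 0 and 1 and then delivers 111 in cycle 2 and 222 in cycle 4, matching the pattern of the table.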

16

Folding equation and math. background

The folding transformation provides a systematic technique for designing control circuits in folded systems.

• K. K. Parhi, VLSI Digital Signal Processing Systems (Design and Implementation), John Wiley & Sons, Inc., New York, 2000.

• T. C. Denk, K. K. Parhi, Synthesis of Folded Pipelined Architectures for Multirate DSP Algorithms, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 6, No. 4, Dec. 1998, pp. 595-607.

17


An edge $U \xrightarrow{e} V$ with $w(e)$ delays and the corresponding folded data path:

$$D_F(U \xrightarrow{e} V) = N\,w(e) - P_u + v - u$$

The data begin at the functional unit $H_u$, which has $P_u$ pipelining stages, pass through $D_F(U \xrightarrow{e} V)$ delays, and are switched into the functional unit $H_v$ at the time instances $Nl + v$, where $N$ is the number of operations folded to a single functional unit (folding factor), while $u$ and $v$ are the folding orders of nodes $U$ and $V$, which satisfy $0 \le u, v \le N - 1$.

18

• A folding set, S, is defined as an ordered set of operations, containing N entries, that are executed by the same functional unit.

• For a folded system to be realizable, $D_F(U \to V) \ge 0$ must hold for all of the edges in the DFG.
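The folding equation in its standard form, D_F(U -> V) = N·w(e) - P_u + v - u, and the realizability condition can be checked mechanically. The following Python sketch is a minimal illustration; the function names, the edge format (U, V, w) and the dictionaries are hypothetical choices for this example.

```python
def folded_delay(N, w, P_u, u, v):
    """Folding equation: D_F(U -> V) = N*w(e) - P_u + v - u."""
    return N * w - P_u + v - u


def is_realizable(edges, N, pipeline, order):
    """True if D_F(U -> V) >= 0 for every edge (U, V, w) of the DFG.

    pipeline[U] is the number of pipelining stages of the unit executing U,
    order[U] is the folding order of U (0 <= order[U] <= N - 1).
    """
    return all(
        folded_delay(N, w, pipeline[U], order[U], order[V]) >= 0
        for U, V, w in edges
    )
```

For the two-addition example with N = 2, assuming a one-stage pipelined adder and folding orders 0 and 1, the edge between the additions has w(e) = 0 and D_F = 0 - 1 + 1 - 0 = 0, so no extra register is needed. With the folding orders swapped, D_F = -2 and the folded system is not realizable without retiming.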


19

Preparation of source architecture for folding – problems

Important question:

How to prepare the source architecture / DFG for the successful application of the folding technique?

20

• Once valid folding sets have been assigned, retiming can be used to satisfy this property (nonnegative folded delays on every edge) or to determine that the folding sets are not feasible.

• Retiming is a transformation technique used to change the locations of delay elements without affecting the I/O characteristics of the circuit.

• Retiming in synchronous circuit design can be directed towards:

– Reducing the clock period,

– Reducing the number of registers,

– Reducing the power consumption, etc.


21

• Using the folding equations, a set of retiming inequalities can be obtained

– A solution for architecture retiming can be found by mapping the set of inequalities onto a constraint graph.

– Algorithms: Bellman-Ford or Floyd-Warshall

• Assignment of folding sets to the functional units of the retimed graph

– Rechecking of the folding condition
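The constraint-graph step can be sketched as follows. In the textbook formulation (Parhi), each edge U -> V yields the inequality r(U) - r(V) <= floor(D_F(U -> V) / N). A system of such difference constraints is solved by shortest paths from a virtual source node, and a negative cycle means that no retiming exists. The code below is a minimal Bellman-Ford sketch under these assumptions; the function names and the constraint format (U, V, c) are hypothetical.

```python
def retiming_constraint(D_F, N):
    """Right-hand side of r(U) - r(V) <= floor(D_F(U -> V) / N)."""
    return D_F // N  # Python floor division also floors negative values


def solve_retiming(nodes, constraints):
    """Solve r[U] - r[V] <= c for all (U, V, c) with Bellman-Ford.

    Each constraint is an edge V -> U of length c in the constraint graph.
    A virtual source with zero-length edges to all nodes is modeled by
    starting every distance at 0. Returns a dict of retiming values, or
    None if the constraint graph contains a negative cycle.
    """
    r = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for U, V, c in constraints:
            if r[V] + c < r[U]:
                r[U] = r[V] + c
                changed = True
        if not changed:
            return r
    return None  # still relaxing after |nodes| passes: infeasible
```

After a solution has been found, the folding condition is rechecked on the retimed graph, as listed in the last bullet.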


22

Case study 1: Folded Bit-Serial Multiplier

• Public-key cryptography: special features are required for multiplier units.

– In RSA encryption and decryption, large integers (typically 1024 bits or more) must be multiplied.

• Elliptic curve cryptosystems: multiplication in finite fields is required.

23

Source architecture: basic serial-parallel-serial multiplier


24

[Figure sequence, slides 24-28: operation of the basic multiplier for 4-bit operands. The partial-product rows a3b0 a2b0 a1b0 a0b0, a3b1 a2b1 a1b1 a0b1, a3b2 a2b2 a1b2 a0b2 and a3b3 a2b3 a1b3 a0b3 enter the adder array one row per step, and the product bits p0, p1, p2 leave the array serially.]
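The row-by-row accumulation of partial products can be mimicked by a purely behavioral shift-and-add model. The sketch below does not describe the cell structure or the exact input format of the source architecture; it only assumes that one row a_i·b_j is added per clock cycle and that one product bit p_j is emitted per cycle. The function name is hypothetical.

```python
def serial_parallel_multiply(a, b, n):
    """Behavioral model of an n-bit shift-and-add multiplier.

    Operand a is applied in parallel, the bits of b arrive serially
    (LSB first), and one product bit p_j is emitted per clock cycle.
    Returns the 2n product bits, LSB first.
    """
    acc = 0
    bits = []
    for j in range(2 * n):
        b_j = (b >> j) & 1 if j < n else 0  # zeros flush the array
        acc += a * b_j        # add the row of partial products a_i * b_j
        bits.append(acc & 1)  # product bit p_j leaves the array
        acc >>= 1             # remaining partial sum moves one position
    return bits
```

The first emitted bit is p0 = a0·b0, and the complete product is recovered as the sum of p_j·2^j over all emitted bits.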


29

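Case study 1: Folding Set Assignment

[Figure: nodes of the source DFG labeled with their folding sets and folding orders: $(S_{k-1}|N-1)$, $(S_{k-1}|N-2)$, ..., $(S_{k-1}|0)$, $(S_{k-2}|N-1)$, ..., $(S_0|1)$, $(S_0|0)$]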

30

Case study 1: Folding Equations

• Two neighboring nodes that are folded onto one node:

$D_F(i \to i+1) = N \cdot 1 - 0 + 0 - 1 = N - 1, \quad i \ne N-1, 2N-1, \dots, L_a-N-1$

• Neighboring nodes that are folded onto different nodes:

$D_F(i \to i+1) = N \cdot 1 - 0 + (N-1) - 0 = 2N - 1, \quad i = N-1, 2N-1, \dots, L_a-N-1$

The folded architecture contains max(N-1, 2N-1) = 2N-1 latches between nodes j and j+1, which are used for data buffering.

• Carry data paths:

$D_F(i \to i) = N \cdot 1 - 0 + 0 - 1 = N - 1, \quad i = 0, 1, \dots, L_a-1$

In all cases the condition $D_F(U \to V) \ge 0$ is satisfied.
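The two delay values N-1 and 2N-1 can be reproduced by applying the folding equation D_F = N·w(e) - P_u + v - u along the chain of nodes. The sketch below rests on an assumed numbering that is not stated explicitly on the slides: node i is executed by unit k-1-(i div N) with folding order N-1-(i mod N), every edge i -> i+1 carries one delay, and the units are not pipelined. The function name is hypothetical.

```python
def chain_folded_delays(k, N):
    """D_F(i -> i+1) for a chain of k*N nodes folded onto k units.

    Assumed numbering: node i is executed by unit k - 1 - i // N with
    folding order N - 1 - i % N; each edge carries one delay (w = 1) and
    the units are not pipelined (P = 0).
    """
    delays = []
    for i in range(k * N - 1):
        u = N - 1 - i % N          # folding order of node i
        v = N - 1 - (i + 1) % N    # folding order of node i + 1
        delays.append(N * 1 - 0 + v - u)
    return delays
```

For k = 2 and N = 3 this gives [2, 2, 5, 2, 2]: the value N-1 = 2 inside a folding set and 2N-1 = 5 at the single boundary between the two sets (i = N-1 = 2).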

31

Case study 1: Folded Architecture

32

Case study 1: Functional Description for case N=2, k=2

33

Case study 1: Functional Description

[Figure sequence, slides 33-37: cycle-by-cycle operation of the folded multiplier for N=2, k=2 with 4-bit operands. The partial products a0b0 and a2b0 are formed first, followed by a1b0 and a3b0; in later cycles the products with b1 and b2 are added to the stored partial sums, and the product bits p0 and p1 appear at the output.]

38

Case study 1: Implementation of Folded Architecture (Spartan II xc2s2000-5pq208)

        Basic multiplier               |        Folded multiplier
Op.     No. of  Slices  Clock          | Folding  No. of  Slices  Clock
length  PEs     used    period [ns]    | factor   PEs     used    period [ns]
8       8       8       3.957          | 1        8       8       4.088
                                       | 2        4       4       4.105
                                       | 4        2       2       3.803
16      16      16      4.706          | 2        8       8       4.898
                                       | 4        4       4       4.785
                                       | 8        2       2       4.502
32      32      32      4.580          | 4        8       8       4.590
                                       | 8        4       4       4.214
                                       | 16       2       4       4.367
64      64      64      6.682          | 8        8       8       7.211
                                       | 16       4       8       7.202
                                       | 32       2       6       7.003
128     128     128     8.129          | 16       8       16      8.056
                                       | 32       4       12      7.855
                                       | 64       2       12      7.841

39

Case study 1: Conclusions

• The folded architecture makes it possible to find an optimal area-time solution for the given requirements.

• The Spartan II "shift register" property was used to relax the constraints caused by the relatively large number of latches in the folded architecture.

• The generated architecture has kept almost all desirable features of the source bit-serial architecture. The reduction of active arithmetic elements by the factor N is achieved at the cost of execution time.

40

Case study 2:

• Cellular-phone technology is changing rapidly. There is an increasing number of wireless-communications standards, including variants of the IEEE 802.11 wireless LAN specification, etc.

– Traditionally, devices need a separate chip to work with each standard.

• Providers differentiate themselves by offering new features, such as multimedia capabilities.

– Providing each feature typically requires a separate chip or, in essence, multiple circuitry systems physically joined on a piece of silicon.

41

Case study 2:

• The additional circuitry adds cost, takes up space, increases power usage in mobile devices, and increases product-design time.

42

Case study 2: Configurable Folded FIR Filter Architecture

• The synthesis of a configurable folded bit-plane architecture for FIR filtering.

• Why?

– Wider application area

– Finding of suitable A-T tradeoffs

– Increasing the versatility of folded systems

43

Case study 2: FIR filtering

Output words {yi} of the FIR filter are computed as

$$y_i = c_0 x_i + c_1 x_{i-1} + \dots + c_{k-1} x_{i-k+1}$$

where $c_0, c_1, \dots, c_{k-1}$ are coefficients while {xi} are input words.

m – coefficient word length,

k – number of taps,

$c_i^j$ – bit of coefficient $c_i$ (with weight $2^j$),

n – input word length.

The bit-plane architecture (BPA) is obtained by re-sorting the partial products of the different multipliers.
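The re-sorting of partial products can be checked numerically. The following sketch compares the direct FIR sum with a version in which the partial products are regrouped by coefficient bit, so that each "bit plane" collects one bit of every coefficient. It assumes unsigned m-bit coefficients and zero input before the first sample; the function names are hypothetical and the code models arithmetic only, not the hardware array.

```python
def fir_direct(c, x, i):
    """y_i = c_0*x_i + c_1*x_{i-1} + ... + c_{k-1}*x_{i-k+1}."""
    # Inputs with a negative index are treated as zero.
    return sum(cj * x[i - j] for j, cj in enumerate(c) if i - j >= 0)


def fir_bit_plane(c, x, i, m):
    """Same output, with the partial products regrouped by coefficient bit.

    Bit plane b collects c_j^b * x_{i-j} over all taps j and is weighted
    by 2**b. Coefficients are assumed unsigned and m bits wide.
    """
    y = 0
    for b in range(m):  # one bit plane per coefficient bit
        plane = sum(((cj >> b) & 1) * x[i - j]
                    for j, cj in enumerate(c) if i - j >= 0)
        y += plane << b
    return y
```

For c = [5, 3, 6], x = [7, 2, 9, 4] and m = 3, both functions return the same value for every output index, for example 93 for i = 2.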

45

Case study 2: Bit-plane FIR filter architecture

• highly regular architecture

• allows extensive pipelining

• regular layout

• high computational throughput

• truncation of LSBs of intermediate results without any loss of accuracy

• programmability of coefficients

[Noll 1986], [Reuver & Klar 1992]

46

Case study 2: DFG for the source BPA for case k=3 and m=4

[Figure: the DFG for the basic BPA with k=3 and m=4]

47

Case study 2: Assignment of folding sets

The operation at position p of the DFG is assigned to the folding set and folding order

$$(S_s, r), \quad s = p \bmod k, \quad r = p \bmod N$$

[Figure: DFG of the BPA with each node labeled by its folding set and folding order, for example $(S_0, ik \bmod N)$ and $(S_{k-1}, (ik-1) \bmod N)$]

48

Case study 2: Assignment of folding sets

• Folding set assignment enables the changing of operations in folding sets.

• Different operations can be mapped onto different hardware units in a fixed array structure.

• There are k folding sets, where each folding set contains N operations.

• For kc coefficients with coefficient length mc, the total number of operations, L, is:

L = kc·mc = k·N
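The relation L = kc·mc = k·N fixes the folding factor once the array size k and the filter configuration (kc, mc) are chosen. The sketch below computes N and lists a folding set assignment. The assignment rule used here (operation p goes to set S_(p mod k) with folding order p mod N) is an assumption read off the node labels of the assignment figure, and both function names are hypothetical.

```python
def folding_factor(k, kc, mc):
    """Folding factor N that satisfies L = kc*mc = k*N for k hardware units."""
    L = kc * mc
    if L % k:
        raise ValueError("kc*mc must be a multiple of k")
    return L // k


def folding_sets(k, N):
    """Map each operation position p (0 <= p < k*N) to (set, order).

    Assumed rule: operation p goes to folding set S_(p mod k) with
    folding order p mod N.
    """
    return {p: (p % k, p % N) for p in range(k * N)}
```

For k = 3, kc = 2 and mc = 6 the folding factor is N = 4, and the twelve operations occupy twelve distinct (set, order) slots, so each of the three units executes exactly four operations.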

49

Case study 2: Folding equations and retiming

• General form of the folding equations:

$$D_F(p \to p+1) = N\,w(e) - 0 + [(p+1) \bmod N] - [p \bmod N] = \begin{cases} -(N-1), & (p+1) \bmod N = 0 \\ N+1, & (p+1) \bmod m_c = 0 \\ 1, & \text{otherwise} \end{cases}$$

The condition $D_F(U \to V) \ge 0$ is not satisfied for neighboring nodes U and V when the following holds for the position p of node U: p mod (N-1) = 0 or p mod N = 0.

50

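Case study 2: Folding equations and retiming

General form of the retiming inequalities:

$$r(p) - r(p+1) \le \begin{cases} -1, & (p+1) \bmod N = 0 \\ 1, & (p+1) \bmod m_c = 0 \\ 0, & \text{otherwise} \end{cases}$$

The general form of the solution for r(p) is:

$$r(p) = \left\lceil \frac{L-p}{m_c} \right\rceil - \left\lceil \frac{L-p}{N} \right\rceil$$

The existence of this solution provides the retiming of the DFG and allows the application of the folding technique.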
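Whether a candidate r(p) really removes the negative folded delays can be checked numerically. The sketch below assumes that the solution has the form r(p) = ceil((L-p)/mc) - ceil((L-p)/N), that an edge p -> p+1 carries one delay exactly when (p+1) mod mc = 0, and that the retimed edge weight is w(e) + r(p+1) - r(p). These are assumptions made for the illustration, and all function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def bpa_folded_delay(p, N, mc):
    """D_F(p -> p+1) before retiming, with folding order p mod N."""
    w = 1 if (p + 1) % mc == 0 else 0  # assumed delay at coefficient boundaries
    return N * w + (p + 1) % N - p % N


def r(p, L, N, mc):
    """Assumed retiming solution r(p)."""
    return ceil_div(L - p, mc) - ceil_div(L - p, N)


def retimed_delays(k, N, kc, mc):
    """Folded delays of all edges p -> p+1 after retiming with r(p)."""
    L = kc * mc
    if L != k * N:
        raise ValueError("configuration must satisfy kc*mc = k*N")
    return [bpa_folded_delay(p, N, mc)
            + N * (r(p + 1, L, N, mc) - r(p, L, N, mc))
            for p in range(L - 1)]
```

For k = 3, N = 4, kc = 2, mc = 6 the delays before retiming include the value -3 (at p = 3 and p = 7), while after retiming every edge has a nonnegative folded delay, so the folding condition is satisfied.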

51

Case study 2: Graphical representations of retiming

[Figure: plots of r(p) versus p for a) kc=1, mc=L and b) kc=3, mc=L/3; the p axis is marked at multiples of N and of mc]

52

Case study 2: Life cycle analysis

[Figure: life cycle analysis chart with time marks 0, N, 2N, mc, mc+N, mc+2N, 2mc, 2mc+N, 2mc+2N, ..., (kc-1)mc, (kc-1)mc+N, (kc-1)mc+2N, L=kc·mc]

53

Case study 2: General allocation table

[Table: general allocation table with rows for the clock cycles 0, 1, 2, ..., k-1, k, ..., N, N+1, N+2, ..., iN, (i+1)N, ..., (k-1)N, (k-1)N+1, (k-1)N+2, ..., k·N-1 and columns for the input and the delay elements D; the cell contents are not recoverable]

54

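Case study 2: Module for input data entering

[Figure: module for input data entering, with input x, delay elements D0, D1, ..., Dk-1 and connections to S0, S1, ..., Sk-1. Switch conditions: {T = lN + 0} and {T = i·mc}, with their complements {T ≠ lN + 0} and {T ≠ i·mc}, for l = 1, 2, 3, ... and i = 1, 2, 3, ...]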

55

Case study 2: Folded FIR filter architecture

[Figure: folded FIR filter architecture with input x, output y and processing elements S0, S1, ..., Sk-1. Switch labels: {0}, {1}, ..., {N-1}, {0, 1, 2, ..., N-2}, {T = lN + 0} and {T = i·mc}, for l = 1, 2, 3, ... and i = 1, 2, 3, ...]

56

Case study 2: Functional block diagram

[Figure: functional block diagram for k=3, N=4, kc=2 and mc=6, showing the parallel input of x0 ... x4, the control signals clk, ck0 and ck1, the adder, and the outputs y0 ... y10]

57

Case study 2: Data flow for folded architecture

k=3, N=4, kc=2, mc=6

$y_0 = 2^0 c_0^0 x_0 + 2^1 c_0^1 x_0 + 2^2 c_0^2 x_0 + 2^3 c_0^3 x_0 + 2^4 c_0^4 x_0 + 2^5 c_0^5 x_0 = c_0 x_0$

$y_1 = 2^0 c_1^0 x_0 + 2^1 c_1^1 x_0 + 2^2 c_1^2 x_0 + 2^3 c_1^3 x_0 + 2^4 c_1^4 x_0 + 2^5 c_1^5 x_0 + 2^0 c_0^0 x_1 + 2^1 c_0^1 x_1 + 2^2 c_0^2 x_1 + 2^3 c_0^3 x_1 + 2^4 c_0^4 x_1 + 2^5 c_0^5 x_1 = c_1 x_0 + c_0 x_1$

$y_2 = 2^0 c_1^0 x_1 + 2^1 c_1^1 x_1 + 2^2 c_1^2 x_1 + 2^3 c_1^3 x_1 + 2^4 c_1^4 x_1 + 2^5 c_1^5 x_1 + \dots = c_1 x_1 + c_0 x_2$

$y_3 = 2^0 c_1^0 x_2 + 2^1 c_1^1 x_2 + 2^2 c_1^2 x_2 + 2^3 c_1^3 x_2 + \dots = c_1 x_2 + c_0 x_3$

58

Case study 2: Implementation - Chip occupation as a function of maximal folding factor Nmax (Spartan II xc2s2000-5pq208)

[Plot: number of used slices (axis from 0 to 2500) versus Nmax (axis from 4 to 64), with one curve for each of k = 4, 8, 16, 32 and 64]

59

Case study 2: Implementation - Throughput as a function of chosen folding factor

[Plot: throughput [ns] (axis from 0 to 100) versus folding factor N]

60

• Folding set assignment supports the changing of operations in folding sets.

• The prerequisites for the application of the folding technique are satisfied.

• Using the proposed folding set assignment, different operations can be mapped onto different hardware units in the fixed array structure.


61

• The derived folded processing array can be configured to perform FIR filtering with different numbers of taps and different coefficient lengths.

• The synthesized architecture has kept desirable features of the source architecture, such as extensive pipelining, high regularity, and truncation of LSBs of intermediate results without any loss of accuracy.

• The number of basic cells is reduced to the number of basic cells in one plane of the source architecture.

• The obtained folded semi-systolic architecture is described by its DFG, allocation table, and data flow diagram.

