1
Folding Technique: Compromising in Special Purpose Hardware Design
Ivan Milentijevic
Faculty of Electronic Engineering
University of Nis
Serbia and Montenegro
2
Outline
• Special purpose hardware design and DSP
• DSP application demands and technologies
• Representations of DSP algorithms and architectures
• Compromising
• Folding technique
• Simple example – ad hoc folding
• Folding equation and mathematical background
• Preparation of source architecture for folding – problems
• Case study 1: Folded Bit-Serial Multiplier
• Case study 2: Configurable Folded FIR Filter Architecture
3
Special purpose hardware design and DSP
• DSP systems can be realized using programmable processors or custom designed hardware circuits fabricated using very-large-scale-integrated (VLSI) circuit technology
• Two important features that distinguish DSP from other general-purpose computations are the real-time throughput requirement and the data-driven property.
4
5
DSP application demands and technologies
6
Representations of DSP algorithms and architectures
System architecture can be described by:
• Behavioral languages
– Applicative: a set of equations (not actions)
– Prescriptive: describe assignments
– Descriptive: VHDL, Verilog, ...
• Graphical representations
– Block diagram (BD)
– Signal flow graph (SFG)
– Data flow graph (DFG)
– Dependence graph (DG)
DSP algorithms are initially described by mathematical formulas.
7
Representations of DSP algorithms and architectures
Block diagram of a 3-tap FIR filter
8
Representations of DSP algorithms and architectures
Signal Flow Graph of a 3-tap FIR filter
9
Representations of DSP algorithms and architectures
Data Flow Graph of a 3-tap FIR filter
10
Representations of DSP algorithms and architectures
Dependence Graph of a 3-tap FIR filter
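All four graphical representations describe the same computation. As a rough sketch, and assuming the 3-tap filter has the usual form y(n) = a*x(n) + b*x(n-1) + c*x(n-2) (the coefficient names a, b, c and the function name are illustrative, not taken from the slides), the behavior can be written as:

```python
def fir3(x, a, b, c):
    """3-tap FIR filter: y(n) = a*x(n) + b*x(n-1) + c*x(n-2).

    Samples before n = 0 are assumed to be zero.
    """
    d1 = 0  # delay element holding x(n-1)
    d2 = 0  # delay element holding x(n-2)
    y = []
    for sample in x:
        y.append(a * sample + b * d1 + c * d2)
        d2, d1 = d1, sample  # shift the delay line by one sample
    return y


print(fir3([1, 2, 3, 4], a=1, b=10, c=100))  # [1, 12, 123, 234]
```

The two variables d1 and d2 play the role of the delay elements in the block diagram, and the three products correspond to the three multiplier nodes.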
11
Compromising
• Area – Time – Power
• Area – Time
• Goal:
– to meet the time (throughput) requirements with minimal chip area, or with the best use of the available chip resources
12
Folding technique
• The performance and cost of any digital circuit depend on the circuit design style. Therefore, when creating a given architecture, the design style must be chosen carefully to establish the best area-time-power tradeoff.
• In synthesizing DSP architectures, it is important to minimize the silicon area of the integrated circuits, which is achieved by reducing the number of functional units (such as multipliers and adders), registers, multiplexers, and interconnection wires.
13
Folding technique
• How?
• By executing multiple algorithm operations on a single functional unit, the number of functional units in the implementation is reduced, resulting in an integrated circuit with low silicon area.
14
Simple example – ad hoc folding
Two addition operations are folded to a single adder:
Folding factor* N=2
*Folding factor - the number of algorithm operations folded to a single functional unit
15
Clk   Lower input   Upper input   Output
0     a(0)          b(0)          -
1     a(0)+b(0)     c(0)          -
2     a(1)          b(1)          a(0)+b(0)+c(0)
3     a(1)+b(1)     c(1)          -
4     a(2)          b(2)          a(1)+b(1)+c(1)
5     a(2)+b(2)     c(2)          -
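The schedule in the table can be reproduced with a short simulation. This is only a sketch: it assumes a single adder whose result is held in one register and fed back on odd clock cycles, and the function name and tuple layout are illustrative.

```python
def folded_adder_schedule(a, b, c):
    """Simulate one adder computing a(l) + b(l) + c(l) with folding factor N = 2.

    Returns rows of (clk, lower_input, upper_input, output), where output is
    None on cycles in which no finished sum is available.
    """
    rows = []
    reg = None  # register holding the adder result of the previous cycle
    for clk in range(2 * len(a)):
        l = clk // 2  # iteration index
        if clk % 2 == 0:
            # Even cycle: start a new sum; the previous sum (if any) is ready.
            lower, upper = a[l], b[l]
            out = reg
        else:
            # Odd cycle: add c(l) to the partial sum fed back from the register.
            lower, upper = reg, c[l]
            out = None
        rows.append((clk, lower, upper, out))
        reg = lower + upper  # adder result, stored for the next cycle
    return rows


for row in folded_adder_schedule([1, 2, 3], [10, 20, 30], [100, 200, 300]):
    print(row)
```

With these sample values the output column shows 111 at clock 2 and 222 at clock 4, matching the positions of a(0)+b(0)+c(0) and a(1)+b(1)+c(1) in the table.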
16
• K. K. Parhi, VLSI Digital Signal Processing Systems (Design and Implementation), John Wiley & Sons, Inc., New York, 2000.
• T. C. Denk, K. K. Parhi, Synthesis of Folded Pipelined Architectures for Multirate DSP Algorithms, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 6, No. 4, Dec. 1998, pp. 595-607.
The folding transformation provides a systematic technique for designing control circuits in folded systems.
Folding equation and math. background
17
Folding equation and math. background
An edge U → V with w(e) delays, and the corresponding folded data path:
$D_F(U \to V) = N\,w(e) - P_U + v - u$
The data begin at the functional unit $H_U$, which has $P_U$ pipelining stages, pass through $D_F(U \to V)$ delays, and are switched into the functional unit $H_V$ at the time instances $Nl + v$, where N is the number of operations folded to a single functional unit (folding factor), while u and v are the folding orders of nodes U and V that satisfy $0 \le u, v \le N-1$.
18
• A folding set, S, is defined as an ordered set of operations, which contains N entries, executed by the same functional unit.
• For a folded system to be realizable, $D_F(U \to V) \ge 0$ must hold for all of the edges in the DFG.
Folding equation and math. background
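As a small worked sketch, the folding equation D_F(U→V) = N·w(e) − P_U + v − u and the realizability check can be written directly in code. The function names and the tuple layout of an edge are assumptions made for this illustration.

```python
def folding_delay(N, w, P_u, u, v):
    """Number of delays on the folded edge U -> V.

    N   : folding factor
    w   : delays w(e) on the original edge
    P_u : pipelining stages of the functional unit executing U
    u, v: folding orders of U and V (0 <= u, v <= N - 1)
    """
    return N * w - P_u + v - u


def is_realizable(edges, N):
    """A folded system is realizable if every edge has a non-negative delay.

    edges is a list of (w, P_u, u, v) tuples (illustrative layout).
    """
    return all(folding_delay(N, w, P_u, u, v) >= 0 for w, P_u, u, v in edges)


# Ad hoc adder example: N = 2, unpipelined adder, first addition has order 0
# and feeds the second addition (order 1) over an edge without delays.
print(folding_delay(N=2, w=0, P_u=0, u=0, v=1))  # 1
```

The single delay obtained for the ad hoc adder example corresponds to the one register that holds a(l)+b(l) until c(l) is added in the next clock cycle.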
19
Important question:
How should the source architecture / DFG be prepared so that the folding technique can be applied successfully?
Preparation of source architecture for folding – problems
20
• Once valid folding sets have been assigned, retiming can be used either to satisfy the condition $D_F(U \to V) \ge 0$ or to determine that the folding sets are not feasible.
• Retiming is a transformation technique used to change the locations of delay elements without affecting the input/output characteristics of the circuit.
• Retiming in synchronous circuit design can be directed towards:
– Reducing the clock period,
– Reducing the number of registers,
– Reducing the power consumption, etc.
Preparation of source architecture for folding – problems
21
• Using the folding equations, a set of retiming inequalities can be obtained
– A solution for architecture retiming can be found by mapping the set of inequalities onto a constraint graph.
– Algorithms: Bellman-Ford or Floyd-Warshall
• Assignment of folding sets to the functional units of the retimed graph
– Rechecking of the folding condition
Preparation of source architecture for folding – problems
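The procedure can be sketched as follows. After retiming, an edge U → V carries w(e) + r(V) − r(U) delays, so the condition D_F ≥ 0 becomes r(U) − r(V) ≤ floor(D_F(U→V)/N). Each such inequality is an edge of the constraint graph, and Bellman-Ford either finds retiming values or detects a negative cycle (infeasible folding sets). The data layout and function name below are assumptions made for this illustration.

```python
def solve_retiming(nodes, edges, N):
    """Solve the retiming inequalities for folding with Bellman-Ford.

    edges: list of (U, V, w, P_u, u, v) tuples (illustrative layout).
    Returns a dict of retiming values r, or None if no solution exists.
    """
    constraints = []
    for U, V, w, P_u, u, v in edges:
        bound = (N * w - P_u + v - u) // N  # floor(D_F / N)
        # r(U) - r(V) <= bound is a constraint-graph edge V -> U of that weight.
        constraints.append((V, U, bound))

    # Distances start at 0, which models a virtual source connected to
    # every node by a zero-weight edge.
    dist = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for src, dst, weight in constraints:
            if dist[src] + weight < dist[dst]:
                dist[dst] = dist[src] + weight
                changed = True
        if not changed:
            return dist  # shortest-path distances are valid retiming values
    return None  # still relaxing after |V| passes: negative cycle


# Two nodes in a loop, N = 2, one pipelining stage each, folding order 0.
edges = [("A", "B", 0, 1, 0, 0), ("B", "A", 2, 1, 0, 0)]
print(solve_retiming(["A", "B"], edges, N=2))  # {'A': -1, 'B': 0}
```

In this example the retiming moves one of the two loop delays onto the edge A → B, after which both folded edges have one delay.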
22
Case study 1: Folded Bit-Serial Multiplier
• Public-key cryptography: special features are required for multiplier units.
– For RSA encryption and decryption, large integers (typically 1024 bits or more) must be multiplied.
– For elliptic curve cryptosystems, multiplication in finite fields is required.
23
Source architecture: basic serial-parallel-serial multiplier
Case study 1: Folded Bit-Serial Multiplier
24
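[Figure, step 1: partial products a3b0 a2b0 a1b0 a0b0 in the array.]
Case study 1: Folded Bit-Serial Multiplier
25
[Figure, step 2: row a3b1 a2b1 a1b1 a0b1 is added to row a3b0 a2b0 a1b0 a0b0; product bit p0 is output.]
Case study 1: Folded Bit-Serial Multiplier
26
[Figure, step 3: row a3b2 a2b2 a1b2 a0b2 is added to rows a3b1 a2b1 a1b1 a0b1 and a3b0 a2b0 a1b0; product bit p1 is output.]
Case study 1: Folded Bit-Serial Multiplier
27
[Figure, step 4: row a3b3 a2b3 a1b3 a0b3 is added to rows a3b2 a2b2 a1b2 a0b2, a3b1 a2b1 a1b1 and a3b0 a2b0; product bit p2 is output.]
Case study 1: Folded Bit-Serial Multiplier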
28
Case study 1: Folded Bit-Serial Multiplier
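The behavior of the source architecture can be sketched in software. The sketch below assumes unsigned operands, with the multiplicand a applied in parallel, the multiplier b applied serially (least significant bit first), and the product bits leaving serially. It models only the arithmetic, not the cell-level structure, and the function name is illustrative.

```python
def serial_parallel_multiply(a, b, n):
    """Multiply two unsigned n-bit numbers bit-serially.

    Each step adds one row of partial products a_i * b_j to the accumulator
    and shifts out one product bit. Returns the 2n product bits, LSB first.
    """
    acc = 0
    bits = []
    for j in range(n):
        b_j = (b >> j) & 1
        acc += a * b_j        # add the row a_(n-1)b_j ... a_0b_j
        bits.append(acc & 1)  # product bit p_j leaves the array
        acc >>= 1             # remaining sum moves one position right
    for _ in range(n):        # flush the upper half of the product
        bits.append(acc & 1)
        acc >>= 1
    return bits


bits = serial_parallel_multiply(13, 11, n=4)
print(bits)                                       # [1, 1, 1, 1, 0, 0, 0, 1]
print(sum(bit << i for i, bit in enumerate(bits)))  # 143
```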
29
(Sk-1|N-1) (Sk-1|N-2) (Sk-1|0)
(Sk-2|N-1) (S0|1) (S0|0)
Case study 1: Folding Set Assignment
30
• Two neighboring nodes that are folded onto one node:
$D_F(i \to i+1) = N \cdot 1 - 0 + 0 - 1 = N - 1$, $i = N-1, 2N-1, \dots, L_a-N-1$
• Neighboring nodes that are folded onto different nodes:
$D_F(i \to i+1) = N \cdot 1 - 0 + (N-1) - 0 = 2N - 1$, $i \ne N-1, 2N-1, \dots, L_a-N-1$
• Carry data paths:
$D_F(i \to i) = N \cdot 1 - 0 + 0 - 1 = N - 1$, $i = 0, 1, \dots, L_a-1$
In all cases $D_F(U \to V) \ge 0$.
The folded architecture contains max(N-1, 2N-1) = 2N-1 latches between nodes j and j+1, which are used for data buffering.
Case study 1: Folding Equations
31
Case study 1: Folded Architecture
32
Case study 1: Functional Description for case N=2, k=2
33
[Figure: cell contents 0, 0, a2b0, 0, a0b0, 0.]
Case study 1: Functional Description
34
[Figure: cell contents a3b0, a2b0, 0, a1b0, a0b0, 0.]
Case study 1: Functional Description
35
[Figure: cell contents a3b0, a1b0, a2b1+a3b0, a2b0, a0b1+a1b0, a0b0; product bit p0 is output.]
Case study 1: Functional Description
36
[Figure: cell contents a3b1, a2b1+a3b0, a3b0, a1b1+a2b0, a0b1+a1b0, a1b0.]
Case study 1: Functional Description
37
[Figure: cell contents a3b1, a2b1+a3b0, a2b0+a1b1, a0b1+a1b0, a2b2+a3b1, a0b2+a2b0+a1b1; product bit p1 is output.]
Case study 1: Functional Description
38
Basic multiplier                               | Folded multiplier
Op.     No. of  Slices  Clock period           | Folding  No. of  Slices  Clock period
length  PEs     used    [ns]                   | factor   PEs     used    [ns]
8       8       8       3.957                  | 1        8       8       4.088
                                               | 2        4       4       4.105
                                               | 4        2       2       3.803
16      16      16      4.706                  | 2        8       8       4.898
                                               | 4        4       4       4.785
                                               | 8        2       2       4.502
32      32      32      4.580                  | 4        8       8       4.590
                                               | 8        4       4       4.214
                                               | 16       2       4       4.367
64      64      64      6.682                  | 8        8       8       7.211
                                               | 16       4       8       7.202
                                               | 32       2       6       7.003
128     128     128     8.129                  | 16       8       16      8.056
                                               | 32       4       12      7.855
                                               | 64       2       12      7.841
Case study 1: Implementation of Folded Architecture
(Spartan II xc2s2000-5pq208)
39
• Folding makes it possible to find the optimal area-time solution for the given requirements.
• The Spartan II "shift register" property was used to relax the constraints caused by the relatively large number of latches in the folded architecture.
• The generated architecture has kept almost all desirable features of the source bit-serial architecture. The reduction of active arithmetic elements by the factor N comes at the cost of execution time.
Case study 1: Conclusions
40
• Cellular-phone technology is changing rapidly. There is an increasing number of wireless-communications standards, including variants of the IEEE 802.11 wireless LAN specification.
– Traditionally, devices need a separate chip to work with each standard.
• Providers differentiate themselves by offering new features, such as multimedia capabilities.
– Providing each feature typically requires a separate chip or, in essence, multiple circuitry systems physically joined on a piece of silicon.
Case study 2:
41
• The additional circuitry adds cost, takes up space, increases power usage in mobile devices, and increases product-design time.
Case study 2:
42
• The synthesis of a configurable folded bit-plane architecture for FIR filtering.
• Why?
– Wider application area
– Finding of suitable A-T tradeoffs
– Increasing the versatility of folded systems
Case study 2: Configurable Folded FIR Filter Architecture
43
Output words {y_i} of the FIR filter are computed as
$y_i = c_0 x_i + c_1 x_{i-1} + \dots + c_{k-1} x_{i-k+1}$
where $c_0, c_1, \dots, c_{k-1}$ are coefficients while {x_i} are input words.
m – coefficient word length,
k – number of taps,
$c_i^j$ – bit of coefficient $c_i$ (with weight $2^j$),
n – input word length.
The bit-plane architecture (BPA) is obtained by re-sorting the partial products of the different multipliers.
Case study 2: FIR filtering
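The re-sorting of partial products can be illustrated numerically. The sketch below assumes unsigned coefficients and zero input before the first sample. It groups the partial products by coefficient bit weight, one bit plane per weight 2^j, and compares the result with the direct form. Function names are illustrative.

```python
def fir_direct(x, c):
    """Direct form: y_i = c_0*x_i + c_1*x_(i-1) + ... + c_(k-1)*x_(i-k+1)."""
    k = len(c)
    return [sum(c[t] * x[i - t] for t in range(k) if i - t >= 0)
            for i in range(len(x))]


def fir_bit_plane(x, c, m):
    """Same filter, with partial products grouped into m bit planes."""
    k = len(c)
    y = [0] * len(x)
    for j in range(m):  # bit plane for coefficient bits of weight 2^j
        for i in range(len(x)):
            plane_sum = sum(((c[t] >> j) & 1) * x[i - t]
                            for t in range(k) if i - t >= 0)
            y[i] += plane_sum << j
    return y


x = [3, 1, 4, 1, 5]
c = [5, 9, 2]  # k = 3 taps, each fits in m = 4 bits
print(fir_bit_plane(x, c, m=4) == fir_direct(x, c))  # True
```

Each bit plane needs only single-bit coefficient products and adders, which is what makes the architecture regular and easy to pipeline.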
44
45
• highly regular architecture
• allows extensive pipelining
• regular layout
• high computational throughput
• truncation of LSBs of intermediate results without any loss of accuracy
• programmability of coefficients
[Noll 1986], [Reuver & Klar 1992]
Case study 2: Bit-plane FIR filter architecture
46
The DFG for the basic BPA with k=3 and m=4
Case study 2: DFG for the source BPA for case k=3 and m=4
47
[Figure: DFG of the source BPA with folding sets assigned. The operation at position p is assigned to folding set $S_s$ with folding order r, written $(S_s, r)$, where s = p mod k and r = p mod N.]
Case study 2: Assignment of folding sets
48
• Folding set assignment enables the changing of operations in folding sets.
• Different operations can be mapped onto the different hardware units in fixed array structure.
• There are k folding sets where each folding set contains N operations.
• For $k_c$ coefficients of length $m_c$, the total number of operations, L, is:
$L = k_c m_c = k N$
Case study 2: Assignment of folding sets
49
• General form of the folding equations:
$D_F(p \to p+1) = N\,w(e) - 0 + [(p+1) \bmod N] - [p \bmod N] = \begin{cases} -(N-1), & (p+1) \bmod N = 0 \\ N+1, & (p+1) \bmod m_c = 0 \\ 1, & \text{otherwise} \end{cases}$
The condition $D_F(U \to V) \ge 0$ is not satisfied for neighboring nodes U and V when the position p of node U satisfies $(p+1) \bmod N = 0$.
Case study 2: Folding equations and retiming
50
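General form of retiming inequalities:
$r(p) - r(p+1) \le \begin{cases} -1, & (p+1) \bmod N = 0 \\ 1, & (p+1) \bmod m_c = 0 \\ 0, & \text{otherwise} \end{cases}$
The general form of the solution for r(p) is:
$r(p) = \lceil (L-p)/m_c \rceil - \lceil (L-p)/N \rceil$
The existence of this solution provides the retiming of the DFG and allows the application of the folding technique.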
Case study 2: Folding equations and retiming
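A quick numerical check of the retiming solution is sketched below. It rests on two assumptions that are not stated explicitly in the slides: that the solution has the ceiling form r(p) = ceil((L-p)/mc) - ceil((L-p)/N), and that an edge p → p+1 carries one delay when (p+1) mod mc = 0 and none otherwise. The bound for each edge is computed directly from the folding equation as floor(D_F/N).

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def r(p, L, m_c, N):
    """Assumed retiming solution: ceil((L-p)/m_c) - ceil((L-p)/N)."""
    return ceil_div(L - p, m_c) - ceil_div(L - p, N)


def retiming_bound(p, m_c, N):
    """floor(D_F(p -> p+1) / N), with w(e) = 1 only at bit-plane borders."""
    w = 1 if (p + 1) % m_c == 0 else 0  # assumption about the source DFG
    d_f = N * w + (p + 1) % N - p % N
    return d_f // N


def solution_is_valid(k, N, k_c, m_c):
    L = k * N
    assert L == k_c * m_c, "configuration must satisfy L = kc*mc = k*N"
    return all(r(p, L, m_c, N) - r(p + 1, L, m_c, N) <= retiming_bound(p, m_c, N)
               for p in range(L))


print(solution_is_valid(k=3, N=4, k_c=2, m_c=6))  # True
```

For k=3, N=4, kc=2, mc=6 the retimed graph ends up with a delay exactly on the edges where (p+1) mod N = 0, so every folded edge has one delay.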
51
[Figure: retiming value r(p) as a function of position p for a) kc=1, mc=L and b) kc=3, mc=L/3.]
Case study 2: Graphical representations of retiming
52
[Figure: life cycle chart over the clock cycles 0 to L = kc*mc, with time marks at multiples of N and of mc.]
Case study 2: Life cycle analysis
53
[Figure: general allocation table; rows are the clock cycles 0, 1, 2, ..., k*N-1 and columns are the input and the delay elements D.]
Case study 2: General allocation table
54
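[Figure: module for input data entering. The input x is routed to the delay elements D0, D1, ..., Dk-1 through switches controlled by the conditions T = lN + 0 and T = i*mc (l = 1, 2, 3, ...; i = 1, 2, 3, ...) and their complements T ≠ lN + 0 and T ≠ i*mc.]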
Case study 2: Module for input data entering
55
[Figure: folded FIR filter architecture with input x and output y. Switches are controlled by the conditions T = lN + 0 and T = i*mc (l = 1, 2, 3, ...; i = 1, 2, 3, ...), and the cells are time-multiplexed over the folding orders {0}, {1}, ..., {N-1}.]
Case study 2: Folded FIR filter architecture
56
[Figure: functional block diagram with parallel inputs x0 to x4, outputs y0 to y10, an adder, and clock signals clk, clk0 and clk1, for N=4 and mc=6.]
k=3, N=4, kc=2 and mc=6
Case study 2: Functional block diagram
57
k=3, N=4, kc=2, mc=6
$y_0 = 2^0 c_0^0 x_0 + 2^1 c_0^1 x_0 + 2^2 c_0^2 x_0 + 2^3 c_0^3 x_0 + 2^4 c_0^4 x_0 + 2^5 c_0^5 x_0 = c_0 x_0$
$y_1 = 2^0 c_1^0 x_0 + 2^1 c_1^1 x_0 + 2^2 c_1^2 x_0 + 2^3 c_1^3 x_0 + 2^4 c_1^4 x_0 + 2^5 c_1^5 x_0 + 2^0 c_0^0 x_1 + 2^1 c_0^1 x_1 + 2^2 c_0^2 x_1 + 2^3 c_0^3 x_1 + 2^4 c_0^4 x_1 + 2^5 c_0^5 x_1 = c_1 x_0 + c_0 x_1$
$y_2 = 2^0 c_1^0 x_1 + 2^1 c_1^1 x_1 + 2^2 c_1^2 x_1 + 2^3 c_1^3 x_1 + 2^4 c_1^4 x_1 + 2^5 c_1^5 x_1 + \dots = c_1 x_1 + c_0 x_2$
$y_3 = 2^0 c_0^0 x_2 + 2^1 c_0^1 x_2 + 2^2 c_0^2 x_2 + 2^3 c_0^3 x_2 + \dots = c_0 x_2 + c_1 x_3$
Case study 2: Data flow for folded architecture
58
Case study 2: Implementation - Chip occupation as a function of maximal folding factor Nmax (Spartan II xc2s2000-5pq208)
[Figure: number of used slices (axis 0 to 2500) versus Nmax (axis 4 to 64), with one curve for each of k = 4, 8, 16, 32, 64.]
59
Case study 2: Implementation - Throughput as a function of chosen folding factor
[Figure: throughput [ns] (axis 0 to 100) versus folding factor N.]
60
• The folding set assignment supports changing the operations in the folding sets.
• The prerequisites for applying the folding technique are satisfied.
• Using the proposed folding set assignment, different operations can be mapped onto different hardware units in the fixed array structure.
Case study 2: Conclusions
61
• The derived folded processing array can be configured to perform FIR filtering with different numbers of taps and different coefficient lengths.
• The synthesized architecture has kept desirable features of the source architecture, such as extensive pipelining, high regularity, and truncation of LSBs of intermediate results without any loss of accuracy.
• The number of basic cells is reduced to the number of basic cells in one plane of the source architecture.
• The obtained folded semi-systolic architecture is presented by a DFG, an allocation table, and a data flow diagram.
Case study 2: Conclusions