ch09 pipeline and parallel recursive filterssocdsp.ee.nchu.edu.tw › class › download ›...
TRANSCRIPT
VLSI DSP 2010 Y.T. Hwang 9-1
Chapter 9Chapter 9Pipeline and Parallel Recursive Pipeline and Parallel Recursive and Adaptive Filtersand Adaptive Filters
Yin-Tsung Hwang
VLSI DSP 2010 Y.T. Hwang 9-2
Introduction (1)Introduction (1)
Any required digital filter spectrum can be realized using FIR or IIR filters
IIR filters require lower tap order but have potential stability problem
adaptive filters are used in time varying systems
coefficients of the adaptive digital filters are adapted at each iteration
recursive and adaptive filters cannot be easily pipelined or processed in parallel due to the feedback loops
VLSI DSP 2010 Y.T. Hwang 9-3
Introduction (2)Introduction (2)
Parallelizing and pipelining recursive digital filterslook ahead computation
incremental block processing
relaxed look ahead transform
VLSI DSP 2010 Y.T. Hwang 9-4
Pipeline Interleaving in Digital FiltersPipeline Interleaving in Digital Filters
Three forms of pipeline interleavinginefficient single/multi-channel interleaving
efficient single-channel interleaving
efficient multi-channel interleaving
inefficient single/multi-channel interleavingconsider a 1st order LTI recursion
y(n+1)=a*y(n)+b*u(n)
VLSI DSP 2010 Y.T. Hwang 9-5
Pipeline Interleaving in Digital FiltersPipeline Interleaving in Digital Filters
(a) a simple 1st order recursion
(b) inserting M-1 delays
(c) 5-way interleaving
VLSI DSP 2010 Y.T. Hwang 9-6
Single/MultiSingle/Multi--Channel Interleaving (1)Channel Interleaving (1)
Iteration period in (a) is Tm+Ta
M-stage pipelined version in (b), where M-1 additional latches are added
clock period can be reduced by M times
sample period must be increased M times for correct operation
M times rescaling
effective initiation interval and computing latency remains unchanged
overhead of M-1 additional latches
VLSI DSP 2010 Y.T. Hwang 9-7
Single/MultiSingle/Multi--Channel Interleaving (2)Channel Interleaving (2)
in (c), M independent time series are operated on the same hardware
may correspond to each cascade stage of a large filter
or independent channels requiring identical filtering operations
also known as M-slow circuit
potential drawbackssample rate is M times slower than the clock rate
inefficient processor utilization if M independent computing series are not available
VLSI DSP 2010 Y.T. Hwang 9-8
Efficient Single Channel Interleaving (1)Efficient Single Channel Interleaving (1)
1-step Look ahead transformationy(n+1)=a*y(n)+b*u(n)
y(n+2)=a[a*y(n)+b*u(n)]+b*u(n+1)
⇒ iteration bound = 2(Tm+Ta)/2
y(n+2)=a2*y(n)+a*b*u(n)+b*u(n+1)
⇒ iteration bound = (Tm+Ta)/2
M-step look ahead transformation
coefficients can be pre-computed
∑−
=
−−+⋅⋅+=+1
0
)1()()(M
i
iM iMnubanyaMny
VLSI DSP 2010 Y.T. Hwang 9-9
Efficient Single Channel Interleaving (2)Efficient Single Channel Interleaving (2)
(a) equivalent realization of unfolding 1st order IIR filter 1 time
(b) equivalent 1st order recursion after look ahead transform
VLSI DSP 2010 Y.T. Hwang 9-10
Efficient Single Channel Interleaving (3)Efficient Single Channel Interleaving (3)
M-step look ahead transformationallows a single serial computation transformed into M independent concurrent computations
loop delay is M ⇒ computation is completed in M cycles
iteration bound is (Tm+Ta) / M
hardware complexity and computing latency are linearly increased
M times higher sampling rate than than the original graph
initial conditions
y(-i)=a-i * y(0), i = 1,2, …., M-1
VLSI DSP 2010 Y.T. Hwang 9-11
Efficient Single Channel Interleaving (4)Efficient Single Channel Interleaving (4)
(a) equivalent 1st order recursion after (M-1) step look ahead transform
(b) a partial scheduling for M = 5
VLSI DSP 2010 Y.T. Hwang 9-12
Efficient MultiEfficient Multi--Channel Interleaving Channel Interleaving
Look ahead transform + pipeline interleavingassume P independent time series (channels)
loop is M-stage pipelined
⇒ the recursion must be iterated (Q -1) times, where M = P*Q
Example for M = 6, P = 2
a partial schedule for a 2-channel with 6 pipeline stages obtained using 2-step LA
2,1 ),2()1()()()3( 23 =+++++=+ inbunabunbuanyany iiiii
VLSI DSP 2010 Y.T. Hwang 9-13
Pipelining in 1st Order IIR FiltersPipelining in 1st Order IIR Filters
Look ahead transform introduces canceling poles and zeros
if the introduced poles and zeros do not cancel each other exactly due to e.g. finite precision HW limitation
the poles outside unit circle will cause stability problem
pipelined realization for a 1st order IIR filter is always stable provided the original is stable
decomposition techniqueto implement the non-recursive portion due to look ahead transform
to achieve logarithmic HW complexity increase w.r.t. the No. of pipeline stages
VLSI DSP 2010 Y.T. Hwang 9-14
Look Ahead for 1st order IIR FiltersLook Ahead for 1st order IIR Filters
Consider a first order filtercontains a pole at z = a, a ≤ 1
a 3-stage pipelined version filtercontains added poles and zeros at
because the introduced poles have magnitude always less than 1
⇒ the filter is always stable even the added poles are not exactly canceled out
11
1)( −⋅−=
zazH
33
221
1
1)( −
−−
⋅−++
=za
zaazzH
)3/2( πjaez ±=
VLSI DSP 2010 Y.T. Hwang 9-15
Look Ahead Pipelining with Power of 2 Look Ahead Pipelining with Power of 2 Decomposition (1)Decomposition (1)
Non-recursive part of an M (power of 2) -stage pipeline is decomposed into log2M sets of transformation
consider a 1st order recursive filter
has a single pole at z = a
after look ahead
adding (M - 1) poles and zeros at identical locations
has poles at locations
1
1
1)( −
−
⋅−⋅
=za
zbzH
MM
M
i
za
zazbzH
ii
−
−
=−−
⋅−
⋅+⋅= ∏
1
)1()(
1log
0
221 2
MMjMjMj aeaeaea /)2)(1(/)2(2/2 ,, , , πππ −
VLSI DSP 2010 Y.T. Hwang 9-16
Power of 2 Decomposition (2)Power of 2 Decomposition (2)
The decomposition of the canceling zeros
the i-th stage implements 2 i zeros located at
requires a single pipelined multiplication operation independent of the stage number i.
A total of (log2M+2) multiplication is needed
)12(,0,1, , 2/)12( −== + inj naeziπ
VLSI DSP 2010 Y.T. Hwang 9-17
Power of 2 Decomposition (3)Power of 2 Decomposition (3)
(a) pole representation of a 1st order recursive filter
(b) pole zero representation of a 1st order LTI recursive system with 8 loop pipelining stages
(c) decomposition based pipeline implementation for M =8
VLSI DSP 2010 Y.T. Hwang 9-18
Power of 2 Decomposition (4)Power of 2 Decomposition (4)
Time domain explanation
for M=8 example,where
∑−
=
−−+⋅⋅+⋅=+⇒
⋅+⋅=+1
0
)1()()(
)()()1(M
i
iM iMnubanyaMny
nubnyany
∑
∑
∑
∑
=
=
=
=
−+⋅+⋅=
−+⋅+⋅=
−+⋅+⋅=
−+⋅⋅+⋅=+
1
02
48
3
01
28
7
00
8
7
0
8
)47()(
)27()(
)7()(
)7()()8(
i
i
i
i
i
i
i
i
infanya
infanya
infanya
inubanyany
)()2()(
)()1()(
)()(
112
2
001
0
nfnfanf
nfnfanf
nubnf
+−⋅=
+−⋅=⋅=
VLSI DSP 2010 Y.T. Hwang 9-19
Power of 2 Decomposition (5)Power of 2 Decomposition (5)
In finite precision implementation, the poles are located at
where Δ corresponds to finite precision error in representing aM
the deviation becomes severe when |a| <<1
not a problem
inexact pole-zero cancellationleads to phase and magnitude error
can be reduced by increasing the word length
)1()( /1M
MM
aMaap
⋅Δ
+≈Δ+=
VLSI DSP 2010 Y.T. Hwang 9-20
Power of 2 Decomposition (6)Power of 2 Decomposition (6)
Inexact pole zero cancellation due to finite precision implementation
VLSI DSP 2010 Y.T. Hwang 9-21
General Decomposition (1)General Decomposition (1)
If M=M1M2···Mp
⇒ the non-recursive stages implement (M1-1), M1(M2-1), ··· , M1M2···(Mp -1) zeros
a 12-stage pipeline decomposition example12 = 2 × 3 × 2
1st stage implements the zero at -a
2nd stage implements 4 zeros at
3rd stage implements 6 zeros at
1212
6644221
1212
11
0
1
)1)(1)(1(
1)(
−
−−−−
−=
−
−++++
=
−
⋅= ∑
za
zazazaaz
za
zazH i
ii
3/23/ , ππ jj eaea ±± ⋅⋅6/56/ , , ππ jj eaeaja ±± ⋅⋅±
VLSI DSP 2010 Y.T. Hwang 9-22
General Decomposition (2)General Decomposition (2)
(a) pole zero location of a 12-stage pipelined 1st order recursive filter
(b) decomposition of the zeros of the pipelined filter for a 2 × 3 × 2 decomposition
VLSI DSP 2010 Y.T. Hwang 9-23
General Decomposition (3)General Decomposition (3)
An alternative 2 × 2 × 3 pipeline decomposition
1st stage: -a 2nd stage: ±ja
3rd stage:
An alternative 3 × 2 × 2 pipeline decomposition
1st stage: 2nd stage:
3rd stage:
1212
8844221
1
)1)(1)(1()( −
−−−−
−++++
=za
zazazaazzH
6/53/23/6/ , , , ππππ jjjj eaeaeaea ±±±± ⋅⋅⋅⋅
1212
6633221
1
)1)(1)(1()( −
−−−−
−++++
=za
zazazaazzH
3/2πjea ±⋅ 3/ , πjeaa ±⋅−6/56/ , , ππ jj eajaea ±± ⋅±⋅
VLSI DSP 2010 Y.T. Hwang 9-24
Pipelining in High Order IIR FiltersPipelining in High Order IIR Filters
Clustered look ahead transformslinear hardware complexity w.r.t. pipeline stages
but not always guaranteed to be stable
scattered look ahead transformsguaranteed to be stable
higher hardware complexity, i.e. N*M
both transforms reduce to the same form in the 1st order IIR filter case
VLSI DSP 2010 Y.T. Hwang 9-25
Clustered LookClustered Look--Ahead Pipelining (1)Ahead Pipelining (1)
Consider an N-th order direct form IIR filter
the sample rate is limited by the throughput of 1 multiplication and 1 addition
)()(
)()()(
1)(
1
01
1
0
nzinya
inubinyany
za
zbzH
N
i i
N
i i
N
i i
N
i
ii
N
i
ii
+−=
−+−=
⋅−
⋅=
∑∑∑
∑∑
=
==
=−
=−
VLSI DSP 2010 Y.T. Hwang 9-26
Clustered LookClustered Look--Ahead Pipelining (2)Ahead Pipelining (2)
adding canceling poles and zeros such that the coefficients of z-1, ··· ,z-(M-1) in the denominator becomes zero
look ahead the cluster of past N outputs by M-1 steps
y(n-1), y(n-2), ··· , y(n-N)
⇒ y(n-M), y(n-M-1), ··· , y(n-M-N+1)
the critical loop contains M delay elements and a single multiplication
sample rate can be increased by a factor M
M-stage clustered look-ahead pipelining
VLSI DSP 2010 Y.T. Hwang 9-27
Clustered LookClustered Look--Ahead Pipelining (3)Ahead Pipelining (3)
Consider an all-pole 2nd order IIRwith poles at z = 1/2 and z = 3/4
original transfer function
for 2-stage pipelining, the coefficient of z-1 must be nullified
multiply the numerator and denominator by (1+5/4z-1)
I.e. adding canceling pole and zero at z = -5/4
the transformed transfer function
21
83
45
1
1)(
−− +−=
zzzH
32
1
3215
1619
1
45
1)(
−−
−
+−
+=
zz
zzH
VLSI DSP 2010 Y.T. Hwang 9-28
Clustered LookClustered Look--Ahead Pipelining (4)Ahead Pipelining (4)
3-stage pipeliningmultiply the numerator and denominator by
(1+5/4z-1+19/16z-2)
the transformed transfer function
general form of multiplying factors
, where
43
21
12857
6465
1
1619
45
1)(
−−
−−
+−
++=
zz
zzzH
∑ −
=−⋅
1
0
M
i
ii zr
0for ,
1
1,,2,1for ,0
1
0
>=
=−==
∑ = −
−
irar
r
Nir
N
k kiki
i
VLSI DSP 2010 Y.T. Hwang 9-29
Clustered LookClustered Look--Ahead Pipelining (5)Ahead Pipelining (5)
Hardware implementation complexitycoefficients of the filters are pre-computed off-line
numerator (non-recursive) part requires N+M MPYs
denominator (recursive) part requires N MPYs
total (N + N + M) multiplications ⇒ linear HW implementation complexity
Filter stability problemadding poles may lie outside the unit circle
example: assume the 2nd order IIR has poles at z = 0.5 and z = 0.75
2-stage pipelining introduce an additional pole at
z = -1.25
VLSI DSP 2010 Y.T. Hwang 9-30
Clustered LookClustered Look--Ahead Pipelining (6)Ahead Pipelining (6)
(a) pole zero representation of a stable 2nd order recursive filter
(b) poles & zeros after 2-stage pipelining using clustered look ahead
(c) poles & zeros after 3-stage pipelining using clustered look ahead
VLSI DSP 2010 Y.T. Hwang 9-31
Clustered LookClustered Look--Ahead Pipelining (7)Ahead Pipelining (7)
Filter stability problem (cont.)3-stage pipelining introduce two additional poles at
z = 0.625 ± j 0.893
note that the original filter is stable!
VLSI DSP 2010 Y.T. Hwang 9-32
Stable Clustered Look ahead Filter Stable Clustered Look ahead Filter Design (1)Design (1)
Consider a stable recursive digital filter
P(z) represents superfluous poles needed to create the desired pipeline delay M
for M > Mc (some critical delay), the filter is always stable
numerical search methods are needed to obtain optimal pipelining level M and P(z)
Example: consider
Mc = 5 and for M = 6,
)(1
)()(
)()(
)()()(
)(
)()(
zQz
zPzN
zPzD
zPzNzH
zD
zNzH M−−
==⇒=
21 6889.05336.11
1)( −− ⋅+⋅−=
zzzH
76
11111
5011.03265.11
7275.01454.14939.1663.15336.11)( −−
−−−−−
⋅+⋅−⋅+⋅+⋅+⋅+⋅+
=zz
zzzzzzH
VLSI DSP 2010 Y.T. Hwang 9-33
Stable Clustered Look ahead Filter Stable Clustered Look ahead Filter Design (2)Design (2)
(a) original poles
(b) poles & zeros of a 6-stage pipelined filter using a stable clustered look ahead
when the no. of denominator multipliers is larger, pipelined filter suffers from large round-off noise
the denominator cannot be decomposed in order to maintain M level of pipelining
VLSI DSP 2010 Y.T. Hwang 9-34
Scattered Look Ahead Pipelining (1)Scattered Look Ahead Pipelining (1)
The denominator of H(z) is transformed to containing N terms, Z-M, Z-2M, ··· , Z-NM
equivalently, y(n) is computed in terms of N past scattered states y(n-M), y(n-2M), ··· , y(n-NM)
for each pole, (M-1) canceling poles and zeros are introduced
with equal angular spacing
same distance from the origin as that of the original pole
e.g. if the original pole is z = p
added poles and zeros are )1(,,2,1for ,/2 −== Mkpez Mkj π
VLSI DSP 2010 Y.T. Hwang 9-35
Scattered Look Ahead Pipelining (2)Scattered Look Ahead Pipelining (2)
Assume the denominator of H(z) can be factorized as
after scattered look-ahead transform
consider the 2nd order filter, for M = 3
∏=
−⋅−=N
ii zpzD
1
1)1()(
)('
)('
)1(
)1()(
)(
)()(
1
1
0
1/2
1
1
1
1/2
MN
i
M
k
Mkji
N
i
M
k
Mkji
ZD
zN
zep
zepzN
zD
zNzH =
−
−==
∏ ∏∏ ∏=
−
=−
=
−
=−
π
π
632
321
31
422
321
22
21
11
22
11
)3(1
)(1)(
1
1)(
−−
−−−−
−−
−+−+−+++
=
⇒−−
=
zazaaa
zazaazaazazH
zazazH
VLSI DSP 2010 Y.T. Hwang 9-36
Scattered Look Ahead Pipelining (3)Scattered Look Ahead Pipelining (3)
Consider the 2nd-order filter with complex conjugate poles at
for 3-stage pipelining
when θ=2π/3, only 1 pole & zero at z = r are added
θjerz ±⋅=
221cos21
1)( −− ⋅+⋅⋅−=
zrzrzH
θ
6633
4433221
3cos21
cos2)2cos21(cos21)( −−
−−−−
⋅+⋅⋅−⋅+⋅+⋅++⋅⋅+
=zrzr
zrzrzrzrzH
θθθθ
33
1
1221
1
1
1
)1)(1(
)1()( −
−
−−−
−
⋅−⋅−
=⋅−⋅+⋅+
⋅−=
zr
zr
zrzrzr
zrzH
VLSI DSP 2010 Y.T. Hwang 9-37
Scattered Look Ahead Pipelining (4)Scattered Look Ahead Pipelining (4)
Consider the 2nd-order filter with poles at z = r1, r2
for 3-stage pipelining adding poles at
221
121 )(1
1)( −− ⋅+⋅+−=
zrrzrrzH
3/22
3/21 , ππ jj erzerz ±± ⋅=⋅=
632
31
332
31
422
21
32121
22221
21
121
)(1
)()()(1)( −−
−−−−
⋅+⋅+−⋅+⋅++⋅+++⋅++
=zrrzrr
zrrzrrrrzrrrrzrrzH
VLSI DSP 2010 Y.T. Hwang 9-38
Scattered Look Ahead Pipelining (5)Scattered Look Ahead Pipelining (5)
Pole zero representation of a 3-stage pipelined equivalent stable filter using scattered look ahead
VLSI DSP 2010 Y.T. Hwang 9-39
Power of 2 Decomposition in Scattered Power of 2 Decomposition in Scattered Look Ahead Scheme (1)Look Ahead Scheme (1)
The multiplication complexity in scattered look ahead transform
non-recursive portion : (NM+1)
recursive portion is : N
much greater than that in clustered look ahead scheme
consider a recursive filter
by multiplying in both numerator and denominator
)(
)(
1)('
1
0
1
zD
zN
za
zbzH N
i
ii
N
i i =⋅−
⋅=
∑∑
=−
=−
))1(1(1∑ =
−−−N
i
ii
i za
VLSI DSP 2010 Y.T. Hwang 9-40
Power of 2 Decomposition (2)Power of 2 Decomposition (2)
For 2-stage pipelining
the denominator contains terms of z-2, z-4, ··· ,z-2N
subsequent transforms can be applied to obtain 4,8 and 16 stage pipelined implementations
log2M sets of transforms are needed to achieve M-stage pipelining
each transform double the speed but also increase the hardware complexity by N
)('
)('
])1(1][1[
))1(1()('
2
1 1
0 1
1
zD
zN
zaza
zazbzH N
i
N
i
ii
iii
N
i
N
i
ii
ii =
⋅−−⋅−
⋅−−⋅=
∑ ∑∑ ∑
= =−−
= =−−
VLSI DSP 2010 Y.T. Hwang 9-41
Power of 2 Decomposition (3)Power of 2 Decomposition (3)
Applying (log2M-1) sets of transforms
leads to M-stage pipelining
requires (2N+N* log2M+1) multiplications
logarithmic complexity w.r.t. M (speed up)
total number of delays ≈NM*(log2M+1)
NM delays are used in the non-recursive portion
NM*log2M delays are required to pipeline each
N*log2M multipliers by M stages
VLSI DSP 2010 Y.T. Hwang 9-42
Power of 2 Decomposition (4)Power of 2 Decomposition (4)
N(M-1) poles and zeros are addedN(M-1) zeros are decomposed and implemented in log2M stages1st stage realizes N-th order nonrecursive sectionfollowing stages implements 2N, 4N, ··· , NM/2-order non-recursive sectionseach section requires N multiplication
due to the symmetry of coefficientsindependent of the order of the section
VLSI DSP 2010 Y.T. Hwang 9-43
Power of 2 Decomposition (5)Power of 2 Decomposition (5)
Consider a 2nd-order recursive filter example
the poles are located at
the pipelined function
the 2M poles are located at
implementation complexity is (2log2M+5) multiplications
221
22
110
cos21)(
)()( −−
−−
+⋅⋅−++
==zrzr
zbzbb
zU
zYzH
θ
θjre±
∏∑ −
=
−−−−
=−
++
+⋅⋅+×+⋅⋅−
=1log
0
222222
2
02
11
)2cos21(cos21
)()(
M
i
iMMMM
i
ii iiii
zrzrzrzMr
zbzH θ
θ
)1(,,2,1,0 ,))/2(( −== +± Mirez Mij πθ
VLSI DSP 2010 Y.T. Hwang 9-44
Power of 2 Decomposition (6)Power of 2 Decomposition (6)
(a) pole diagram of the 2nd order filter
(b) poles & zeros of the pipelined 2nd order filter with 8 loop pipelining stages
(c) decomposition of poles and zeros of the pipelined filter
VLSI DSP 2010 Y.T. Hwang 9-45
Power of 2 Decomposition (7)Power of 2 Decomposition (7)
Implementation of the original and the pipelined scattered look-ahead recursive filters
VLSI DSP 2010 Y.T. Hwang 9-46
Scattered LA Pipelining with General Scattered LA Pipelining with General decompositiondecomposition
For an N-th order filter with M levels of pipeliningN(M-1) canceling zeros
for M=M1M2 ···Mp decomposition
the 1st stage implements N(M1-1) zeros
the 2nd stage implements NM1 (M2-1) zeros
the Pth stage implements NM1M2 ···Mp-1(Mp-1) zeros
the non-recursive portion requires NM delays and
multipliers
generation decomposition of high order IIR filterdecompose into cascade of 1st order sections
apply general decomposition to each 1st order section
∑ =−⋅
P
i iMN1
)1(
VLSI DSP 2010 Y.T. Hwang 9-47
Parallel Processing for IIR Filters (1)Parallel Processing for IIR Filters (1)
1st order IIR filter casewhere |a| ≤ 1
y(n+1) = ay(n)+u(n)
for a 4-parallel architecture implementation
y(n+4)=a4y(n)+a3u(n)+a2u(n+1)+au(n+2)+u(n+3)
by substituting n = 4k
y(4k+4)=a4y(4k)+a3u(4k)+a2u(4k+1)+au(4k+2)+u(4k+3)
pole of the original system is at z = a
pole of the parallel system is at z = a4
pole is moved toward the origin ⇒ improved robustness to the round-off error
1
1
1)( −
−
−=
az
zzH
VLSI DSP 2010 Y.T. Hwang 9-48
Parallel Processing for IIR Filters (2)Parallel Processing for IIR Filters (2)
Note that in pipelined processing case, canceling poles and zeros are introduced
substituting n with 4k+4, 4k+5, 4k+6, 4k+7 leading to 4 parallel processing blocks
VLSI DSP 2010 Y.T. Hwang 9-49
Parallel Processing for IIR Filters (3)Parallel Processing for IIR Filters (3)
require L2 MAC operations
one delay element represents L sample delays
VLSI DSP 2010 Y.T. Hwang 9-50
Parallel Processing for IIR Filters (4)Parallel Processing for IIR Filters (4)
Incremental block processinguse y(4k) to compute y(4k+1),
i.e. y(4k+1) = a·y(4k) + u(4k)
and then use y(4k+1) to compute y(4k+2), then y(4k+3)
the hardware complexity is reduced from L2 to 2L-1
increased computation time for y(4k+1), y(4k+2), y(4k+3)
VLSI DSP 2010 Y.T. Hwang 9-51
Parallel Processing for IIR Filters (5)Parallel Processing for IIR Filters (5)
Incremental block filter structure with L = 4
VLSI DSP 2010 Y.T. Hwang 9-52
Parallel Processing for IIR Filters (6)Parallel Processing for IIR Filters (6)
A second order IIR example for block processing
3-parallel systemcompute y(3k+3) and y(3k+4) using y(3k) and y(3k+1)
for y(3k), y(3k+1) →y(3k+3) ⇒ 2-stage clustered LA
for y(3k), y(3k+1) →y(3k+4) ⇒ 3-stage clustered LA
)2()1(2)( where
),()2(8
3)1(
4
5)(
83
45
1
)1()(
21
21
−+−+
+−−−=
+−
+=
−−
−
nununu
nfnynyny
zz
zzH
VLSI DSP 2010 Y.T. Hwang 9-53
Parallel Processing for IIR Filters (7)Parallel Processing for IIR Filters (7)
2 & 3-stage clustered look ahead
2-loop update equation
)43()33(4
5)23(
16
19)3(
128
57)13(
64
65)43(
)33()23(4
5)3(
32
15)13(
16
19)33(
++++++−+=+
++++−+=+
kfkfkfkykyky
kfkfkykyky
)()1(4
5)3(
32
15)2()4(
8
3)3(
4
5
16
19
)()2(8
3)1()3(
8
3)2(
4
5
4
5
)()2(8
3)1(
4
5)(
nfnfnynfnyny
nfnynfnyny
nfnynyny
+−+−−⎥⎦⎤
⎢⎣⎡ −+−−−=
+−−⎥⎦⎤
⎢⎣⎡ −+−−−=
+−−−=
VLSI DSP 2010 Y.T. Hwang 9-54
Parallel Processing for IIR Filters (8)Parallel Processing for IIR Filters (8)
Pole zero plots of the transfer function
loop update for block size = 3
relationship of recursion outputs
VLSI DSP 2010 Y.T. Hwang 9-55
Parallel Processing for IIR Filters (9)Parallel Processing for IIR Filters (9)
y(3k+2) can be obtained incrementally
)23()3(8
3)13(
4
5)23( ++−+=+ kfkykyky
VLSI DSP 2010 Y.T. Hwang 9-56
CommentsComments
The original system has 2 poles at 1/2 and 3/4
transformed system
the matrix has eigenvalues (1/2)3 and (3/4)3
the parallel system is more stable
the parallel system has the same number of poles as the original system
hardware complexityfor a 2nd order IIR filter (N=2)
3L + [(L-2) + (L-1)] + 4 + 2(L-2) = 7L -3 MPY operations
⎟⎟⎠
⎞⎜⎜⎝
⎛+⎟⎟
⎠
⎞⎜⎜⎝
⎛+⎟⎟
⎠
⎞⎜⎜⎝
⎛−−
=⎟⎟⎠
⎞⎜⎜⎝
⎛++
2
1
6465
12857
1619
3215
)13(
)3(
)43(
)33(
f
f
ky
ky
ky
ky
VLSI DSP 2010 Y.T. Hwang 9-57
Combined Parallel & Pipelining Combined Parallel & Pipelining Processing (1)Processing (1)
To achieve a speed up L × ML is the block size and M is the no. of pipelining stage
for H(z)=1/(1-a·z-1) with M = 4 and L = 3
H(z) is 1st order ⇒ 1 loop update equation is required and the other two can be computed incrementally
M = 4 ⇒ loop contains 4 delay elements
L = 3 ⇒ y(3k+12) ← y(3k)
VLSI DSP 2010 Y.T. Hwang 9-58
Combined Parallel & Pipelining Combined Parallel & Pipelining Processing (2)Processing (2)
12-step look ahead transform
substituting n=3k+12
)()10()11()12(
)()1()(101112 nunuanuanya
nunayny
++−+−+−=
+−=
)123()93()63()3(
)123()113()103(
)93()83()73(
)63()53()43(
)33()23()13(
)3()123(
113
2612
2
345
678
91011
12
++++++=
++++++
++++++
++++++
++++++
=+
kfkfakfakya
kukaukua
kuakuakua
kuakuakua
kuakuakua
kyaky
VLSI DSP 2010 Y.T. Hwang 9-59
Combined Parallel & Pipelining Combined Parallel & Pipelining Processing (3)Processing (3)
Where
)123()93()123(
)123()113()103()123(
113
2
21
+++=+
+++++=+
kfkfakf
kukaukuakf
VLSI DSP 2010 Y.T. Hwang 9-60
Combined Parallel & Pipelining Combined Parallel & Pipelining Processing (4)Processing (4)
Commentshas 4 poles a3, -a3, ja3, -ja3
decomposition is used for the non-recursive portion
multiplication complexity is 2L-1+log2M
2nd order filter example with L = 3 and M = 2
loop update operations
original poles 1/2, 3/4
new poles ±(1/2)3, ±(3/4)3
21
21
83
45
1
)1()(
−−
−
+−
+=
zz
zzH
VLSI DSP 2010 Y.T. Hwang 9-61
CommentsComments
Systematic approach to compute pole locationswrite loop update equations using LM-level look ahead
write the state space representation of the parallel pipelined filter, where state matrix AN×N
compute the eienvalues λi of matrix A, 1 ≤ I ≤ N
NM poles of the new parallel pipelined system correspond to the M-th roots of the eigenvalues of A, M
i
1
λ
VLSI DSP 2010 Y.T. Hwang 9-62
Low Power IIR Filter Design Low Power IIR Filter Design
Parallel and pipelined processing can be used for low power design
consider a 4-th order Chebyshev low-pass filter
assume multiplier capacitance is dominant
assume Vdd = 5V and Vt = 1v
using scattered look ahead with power of 2 decomposition
)8482.04996.11)(6493.05548.11(
)1(001836.0)(
2121
41
−−−−
−
+−+−+
=zzzz
zzH
VLSI DSP 2010 Y.T. Hwang 9-63
Pipelined processing to reduce power (1)Pipelined processing to reduce power (1)
4-level pipelinable transfer function N(z)/D(z)
pipelined design insert delays into the recursive
shorter critical path after retiming
smaller amount of charging/discharging capacitance in each pipeline stage
use lower power supply yet maintain the original speed
the pipelined supplied βV0, with β < 1
)5175.01337.11)(1777.04085.01()(
)7194.05524.01)(4216.01188.11(
)8482.04996.11)(6493.05548.11(
)1(001836.0)(
8484
4242
2121
41
−−−−
−−−−
−−−−
−
+++−=
++++
×++++
×+=
zzzzzD
zzzz
zzzz
zzN
VLSI DSP 2010 Y.T. Hwang 9-64
Pipelined processing to reduce power (2)Pipelined processing to reduce power (2)
propagation delaysequential system is
4-level pipelined system
⇒ β=0.476 and new Vdd = 2.38V
power dissipationthe sample period is equal to Tpd
sequential system is
4-level pipelined system
2arg
2arg
)15(
5
)( −×
=−×
=k
C
VVk
VCT ech
tdd
ddechpd
2arg
)15(
5
4 −=
ββ
k
CT ech
pd
sMseqpd
seqtotal
seq fCmT
CP 2
2)(
55
=×
=
sMpippd
piptotal
pip fCmT
CP 2
2)(
38.238.2
=×
=
VLSI DSP 2010 Y.T. Hwang 9-65
Pipelined processing to reduce power (3)Pipelined processing to reduce power (3)
Power rationumber of multiplications, mseq=5, mpip = 13
Example: a second order IIR filter
5891.05
38.2
5
132
=⎟⎠⎞
⎜⎝⎛⎟⎠⎞
⎜⎝⎛=
seq
pip
P
PRatio
)2()1(2)()2(8
3)1(
4
5)( −+−++−−−= nunununynyny
VLSI DSP 2010 Y.T. Hwang 9-66
Parallel processing to reduce the power (1)Parallel processing to reduce the power (1)
3-parallel filter design ⇒ 3 outputs per clock cycle
clock cycle can be three times the sample cycle
the critical path of the sequential filter is 1 MPY + 1 Add
feedforward part of the pipelined filter design is retimed
VLSI DSP 2010 Y.T. Hwang 9-67
Parallel processing to reduce the power (2)Parallel processing to reduce the power (2)
Propagation delayThe critical path of the parallel design is 1 MPY + 2 Add
assume the charge/discharge capacitances and the delays are dominated by the multipliers
assume the parallel filter supplied βV0, with β < 1 and the threshold voltage Vt=0.6V
propagation delay (clock cycle) for sequential filter is
propagation delay (clock cycle) for parallel filter is
20
0
)( t
Mseq VVk
VCT
−=
20
0
)( t
Mpar VVk
VCT
−=
ββ
VLSI DSP 2010 Y.T. Hwang 9-68
Choose Tpar = 3 Tseq ⇒
we get β=0.4673 and βV0=2.3365V
power consumption
comments
the parallel or pipelined design can be used either to achieve high speed or low power operations
Parallel processing to reduce the power (3)Parallel processing to reduce the power (3)
20
02
0
0
)()(3
tt VV
V
VV
V
−=
−⋅
ββ
%116.293
4
43
)(12
3
2
20
220
20
===
==
=
β
ββ
seq
par
sMs
Mpar
sMseq
P
PRatio
fVCf
VCP
fVCP
VLSI DSP 2010 Y.T. Hwang 9-69
Pipelined Adaptive Digital Filters (1)Pipelined Adaptive Digital Filters (1)
Adaptive digital filters are difficult becausepresence of long feedback loop
look ahead transform leads to prohibitively large hardware overhead
note that look ahead transform maintains exact input-output mapping
since coefficients of the adaptive filters continue to change ⇒ exact input-output mapping is not necessary
adaptive filter measurement indexesmis-adjustment error
adaptation time constant
VLSI DSP 2010 Y.T. Hwang 9-70
Pipelined Adaptive Digital Filters (2)Pipelined Adaptive Digital Filters (2)
Relaxed look-ahead transformaims to pipeline adaptive filter with little hardware overhead and at the expense of marginal degradation of adaptation behavior
product relaxation
sum relaxation
delay relaxation
the adaptation behavior of the pipelined adaptive filters should be analyzed using stochastic techniques
VLSI DSP 2010 Y.T. Hwang 9-71
Relaxed Look Ahead (1)Relaxed Look Ahead (1)
Consider the 1st order time varying recursiony(n+1)=a(n)y(n)+u(n), where coefficient a(n) is time varying
using look ahead transform
for M = 4, the iteration period is limited to (Tm+Ta)/4
under certain circumstances, we can substitute approximation expressions to simplify the terms in the RHS
( )[ ] )1()1()1(
)()1()(
1
1
1
0
1
0
−++−−+⋅−−++
⋅−−+=+
∑ ∏∏
−
=
−
=
−
=
MnuiMnujMna
nyiMnaMny
M
i
i
j
M
i
VLSI DSP 2010 Y.T. Hwang 9-72
Relaxed Look ahead Transform (2)Relaxed Look ahead Transform (2)
A 4-level look head pipelined 1st order
VLSI DSP 2010 Y.T. Hwang 9-73
Product Relaxation (1)Product Relaxation (1)
Assumptionsa(n) is close to 1 and a(n) = (1-ε(n)), where ε(n) is close to zero
if u(n+i) is close to zero, then
))1(1(1)1(1
))1(1()1()(1
0
−+−−=−+−=
−+−=−+≈+∏ −
=
MnaMMNM
MnMnainaM
i
MM
ε
ε
[ ]∑
∏−
=
−
=
−−++−+−−=+⇒
−−+≈−−+⋅−−+1
0
1
0
)1()()))1(1(1()(
)1()1()1(
M
i
i
j
iMnunyMnaMMny
iMnuiMnujMna
VLSI DSP 2010 Y.T. Hwang 9-74
Product Relaxation (2)Product Relaxation (2)
A 4-level product relaxed lookahead pipelined 1st order recursion
VLSI DSP 2010 Y.T. Hwang 9-75
Product Relaxation (3)Product Relaxation (3)
Another approximationa(n) is assumed to be close to 1
∑∏
−
=
−
=
−−++−+=+⇒
−+≈−−+1
0
1
0
)1()()1()(
)1()1(
M
i
i
j
iMnunyMnaMny
MnajMna
VLSI DSP 2010 Y.T. Hwang 9-76
Sum Relaxation (1)Sum Relaxation (1)
If the input u(n) varies slowly over M cycles
if u(n) is close to zero, then M u(n) ≈ u(n)
[ ])()()1(
)1()1()1(
)()1()(
1
0
1
1
1
0
1
0
nMunyiMna
MnuiMnujMna
nyiMnaMny
M
i
M
i
i
j
M
i
+−−+≈
−++−−+⋅−−++
−−+=+
∏∑ ∏∏
−
=
−
=
−
=
−
=
)()()1()(1
0nunyiMnaMny
M
i+−−+≈+ ∏ −
=
VLSI DSP 2010 Y.T. Hwang 9-77
Sum Relaxation (2)Sum Relaxation (2)
A 4-level sum-relaxed look-ahead pipelined 1st-order recursion
VLSI DSP 2010 Y.T. Hwang 9-78
Delay Relaxation (1)Delay Relaxation (1)
Consider the recursion y(n)=y(n-1)+a(n)u(n)
M-level look-ahead pipelined version
delay relaxation involves the use of delayed input u(n-M’) and delayed coefficient a(n-M’)
assume the product a(n)u(n) is more or less constant over M’ samples
a reasonable assumption for stationary or slowly varying product a(n)u(n)
)()()()(1
0inuinaMnyny
M
i−−+−= ∑ −
=
)'()'()()(1
0iMnuiMnaMnyny
M
i−−−−+−≈ ∑ −
=
VLSI DSP 2010 Y.T. Hwang 9-79
Comments of relaxation techniquesComments of relaxation techniques
Three relaxation techniques can be applied individually or in different combinations to derive a rich variety of architectures
each derived architecture has different convergence characteristics
the relaxed look-ahead is a transform in the stochastic sense
VLSI DSP 2010 Y.T. Hwang 9-80
Pipelined LMS Adaptive Filter (1)Pipelined LMS Adaptive Filter (1)
LMS algorithm
2 recursive loops in the LMS architecture weight update loop
error feedback loop
direct look ahead
M2 is the delay in the weight update loop
)()()1()(
)()1()()(ˆ)()(
nUnenWnW
nUnWndndndne T
⋅⋅+−=−−=−=
μ
∑ −
=−⋅−⋅+−=
⋅⋅+−=1
022 )()()(
)()()1()(M
iinUineMnW
nUnenWnW
μ
μ
VLSI DSP 2010 Y.T. Hwang 9-81
Pipelined LMS Adaptive Filter (2)Pipelined LMS Adaptive Filter (2)
By delay relaxationassume the gradient estimate e(n)U(n) varies slowly
the hardware complexity is N(M2-1) MACs
further sum relaxation taking only the first M’2 sum terms, where M’2 ≤ M2
∑∑
−
=
−
=
−−⋅−−⋅+−=
−⋅−⋅+−=1
0 112
1
02
2
2
)()()(
)()()()(M
i
M
i
iMnUiMneMnW
inUineMnWnW
μ
μ
)()]1()1(
)1([)(
)()1()()(
1
1
0 1
2
'2 nUiMnUiMne
MnWnd
nUnWndne
M
i
T
T
−−−−−−+
−−−=
−−=
∑ −
=μ
VLSI DSP 2010 Y.T. Hwang 9-82
Pipelined LMS Adaptive Filter (3)Pipelined LMS Adaptive Filter (3)
Assume μ is sufficiently small and replacing W(n-M2-1) by W(n-M2)
)()()()( 2 nUMnWndne T −−=