fast fourier transform - mit opencourseware
6.973 Communication System Design – Spring 2006
Massachusetts Institute of Technology
Vladimir Stojanović
Cite as: Vladimir Stojanovic, course materials for 6.973 Communication System Design, Spring 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology.
Downloaded on [DD Month YYYY].
Fast Fourier Transform: Practical aspects and Basic Architectures
Lecture 9
Multiplication complexity per output point
CTFFT and SRFFT
[Figure: CTFFT and SRFFT. Multiplications per output point, M/N (axis 0.0 to 10.0), versus n = log2 N (n = 3 to 10). Curves: radix 2, radix 4, split-radix, and lower bound; curve families labeled Mr and Mc. Figure by MIT OpenCourseWare.]
Multiplies and adds
Real multiplies

N      Radix 2   Radix 4   SRFFT
16        24        20        20
32        88         -        68
64       264       208       196
128      712         -       516
256     1800      1392      1284
512     4360         -      3076
1024   10248      7856      7172
2048   23560         -     16388

N       PFA   Winograd
30      100       68
60      200      136
120     460      276
240    1100      632
504    2524     1572
1008   5804     3548
2520  17660     9492

Real adds

N      Radix 2   Radix 4   SRFFT
16       152       148       148
32       408         -       388
64      1032       976       964
128     2504         -      2308
256     5896      5488      5380
512    13566         -     12292
1024   30728     28336     27652
2048   68616         -     61444

N       PFA   Winograd
30      384      384
60      888      888
120    2076     2076
240    4812     5016
504   13388    14540
1008  29548    34668
2520  84076    99628

Figure by MIT OpenCourseWare.
Structural considerations
How to compare different FFT algorithms?
Many metrics to choose from
  The ease of obtaining the inverse FFT
  In-place computation
  Regularity
    Computation
    Interconnect
  Parallelism and pipelining
  Quantization noise
Inverse FFT
FFTs often used for computing FIR filtering
  Fast convolution (FFT + pointwise multiply + IFFT)
In some applications (like 802.11a)
  Can reuse FFT block to do the IFFT (half-duplex scheme)
Simple trick [Duhamel88]
  Swap the real and imaginary inputs and outputs
  If FFT(xR,xI,N) computes the FFT of sequence xR(k)+jxI(k)
  Then FFT(xI,xR,N), with the real and imaginary parts of the output exchanged as well, computes the IFFT of xR(k)+jxI(k) up to the 1/N scale factor
  Exchange the real and imaginary parts
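The swap trick can be checked numerically. The sketch below is only an illustration: it uses a direct O(N^2) DFT in place of a real FFT block, and the helper names `dft`, `swap`, and `idft_via_forward` are hypothetical, not from the lecture.

```python
import cmath

def dft(x):
    """Direct forward DFT: X[k] = sum_n x[n] * exp(-2j*pi*n*k/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def swap(z):
    """Exchange the real and imaginary part of every sample."""
    return [complex(v.imag, v.real) for v in z]

def idft_via_forward(X):
    """Inverse DFT using only the forward transform: swap, transform, swap, scale."""
    N = len(X)
    return [v / N for v in swap(dft(swap(X)))]

x = [1 + 2j, -0.5 + 0.25j, 3 - 1j, 0.75j]
roundtrip = idft_via_forward(dft(x))
max_err = max(abs(a - b) for a, b in zip(x, roundtrip))
print(max_err)  # expected to be at rounding-error level
```

The same forward block therefore serves both directions; only the wiring of the real and imaginary buses and a final scaling change.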
In-place computation
Most algorithms allow in-place computation
  Cooley-Tukey, SRFFT, PFA
  No auxiliary storage of size dependent on N is needed
WFTA (Winograd Fourier Transform Algorithm) does not allow in-place computation
  A drawback for large sequences
Cooley-Tukey and SRFFT are most compatible with longer size FFTs
Regularity, parallelism
Regularity
  Cooley-Tukey FFT very regular
    Repeat butterflies of same type
      Sums and twiddle multiplies
  SRFFT slightly more involved
    Different butterfly types in parallel
    e.g. radix-2 and radix-4 used in parallel on even/odd samples
  PFA even more involved
    Repetitive use of more complicated modules (like cyclic convolution, for prime length DFTs)
  WFTA most involved
    Repetition of parts of the cyclic conv. modules from PFA
Parallelization
  Fairly easy for CTFFT and SRFFT
    Small modules applied on sets of data that are separable and contiguous
  More difficult for PFA
    Data required for each module not in contiguous locations
Quantization noise
Roundoff noise generated by finite precision of operations inside FFT (adds, multiplies)
CTFFT (lengths $2^n$)
  Four error sources per butterfly (variance $2^{-2B}/12$ each)
    Total variance per butterfly $2^{-2B}/3$
  Each output node receives signals from a total of N-1 butterflies in the flow graph (N/2 from the first stage, N/4 in the second, …)
    Total variance for each output ~ $(N/3) \cdot 2^{-2B}$
  Assuming input power $1/(3N^2)$ ($|x(n)| < 1/N$ to avoid overflow)
    Output power is $1/(3N)$
    Error-to-signal ratio is then $N^2 \cdot 2^{-2B}$ (needs 1 additional bit per stage to maintain SER)
  Since the maximum magnitude increases by less than 2x from stage to stage, we can prevent overflow by requiring that $|x(n)| < 1$ and scaling by 1/2 from stage to stage
    The output will be 1/N of the previous case, but the input magnitude can be N times larger, improving the SER
    Error-to-signal ratio is then $4N \cdot 2^{-2B}$ (needs 1/2 additional bit per stage to maintain SER)
  Radix-4 and SRFFT generate less roundoff noise than radix-2
WFTA
  Fewer multiplications (hence fewer noise sources)
  More difficult to include proper rescaling in the algorithm
    Error-to-signal ratio is higher than in CTFFT or SRFFT
    Two more bits are necessary to represent data in WFTA for same error order as CTFFT
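As a worked example of the two error-to-signal expressions for the radix-2 CTFFT, the sketch below finds the smallest word length B that meets a target ratio. The target of 1e-6 and the function names are assumptions chosen only for illustration.

```python
def esr_input_scaled(N, B):
    """Error-to-signal ratio when |x(n)| < 1/N at the input: N^2 * 2^(-2B)."""
    return N ** 2 * 2.0 ** (-2 * B)

def esr_stage_scaled(N, B):
    """Error-to-signal ratio when |x(n)| < 1 and each stage scales by 1/2: 4N * 2^(-2B)."""
    return 4 * N * 2.0 ** (-2 * B)

def bits_for_target(esr_fn, N, target):
    """Smallest integer B for which esr_fn(N, B) <= target."""
    B = 1
    while esr_fn(N, B) > target:
        B += 1
    return B

target = 1e-6  # hypothetical requirement, not from the lecture
for N in (64, 256, 1024):
    print(N,
          bits_for_target(esr_input_scaled, N, target),
          bits_for_target(esr_stage_scaled, N, target))
# prints:
# 64 16 14
# 256 18 15
# 1024 20 16
```

Going from N = 64 to N = 256 adds two radix-2 stages: the input-scaled scheme needs two more bits, the stage-scaled scheme only one, which matches the "1 bit per stage" versus "1/2 bit per stage" statements.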
Particular cases
DFT algorithms for real data sequence $x_k$
  $X_k$ has Hermitian symmetry ($X_{N-k} = X_k^*$)
    $X_0$ is real, and when N even, $X_{N/2}$ real as well
  N input values map to
    2 real and N/2-1 complex conjugate values when N even
    1 real and (N-1)/2 complex conjugate values when N odd
  Can exploit the redundancy
    Reduce complexity and storage by a factor of 2
    If take the real DFT of xR and xI separately
      2N additions are sufficient to obtain complex DFT
    Goal to obtain real DFT with half multiplies and half adds
  Example DIF SRFFT
    $X_{2k}$ requires half-length DFT on real data
    Then b/c of Hermitian symmetry $X_{4k+1} = X^*_{4(N/4-k-1)+3}$
    Only need to compute one DFT of size N/4 (not two)
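The symmetry relations for real input can be verified with a short numerical check. This is a sketch using a direct DFT and an arbitrary real test vector; nothing here is specific to the split-radix implementation itself.

```python
import cmath

def dft(x):
    """Direct forward DFT of a sequence."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

x = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, 3.0]  # arbitrary real data, N = 8
N = len(x)
X = dft(x)

# Hermitian symmetry: X[N-k] equals the conjugate of X[k]
herm_err = max(abs(X[(N - k) % N] - X[k].conjugate()) for k in range(N))

# Split-radix odd outputs: X[4k+1] is the conjugate of X[4(N/4-k-1)+3],
# so the X[4k+3] branch carries no new information
odd_err = max(abs(X[4 * k + 1] - X[4 * (N // 4 - k - 1) + 3].conjugate())
              for k in range(N // 4))

print(herm_err, odd_err)          # both at rounding-error level
print(X[0].imag, X[N // 2].imag)  # X[0] and X[N/2] are real
```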
DFT pruning
In practice, may only need to compute a few tones
  Or only a few inputs are different from zero
  Typical cases: spectral analysis, interpolation, fast conv
  Computing a full FFT can be wasteful
Goertzel algorithm
  Can be obtained by simply pruning the FFT flow graph
  Alternately, looks just like a recursive 1-tap filter for each tone
[Figure: first-order recursive filter with input x(n), a delay $z^{-1}$ in the feedback path multiplied by $W_N^{-k}$, and output X(k).]
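The recursive 1-tap filter view can be sketched as follows, assuming the usual convention $W_N = e^{-j2\pi/N}$ (the slide does not restate it). The function name `goertzel_first_order` is hypothetical.

```python
import cmath

def goertzel_first_order(x, k):
    """Compute the single DFT bin X(k) with the recursion y[n] = W_N^{-k} * y[n-1] + x[n]."""
    N = len(x)
    w = cmath.exp(2j * cmath.pi * k / N)  # W_N^{-k}, assuming W_N = exp(-2j*pi/N)
    y = 0j
    for sample in x:
        y = w * y + sample                # one complex multiply-add per input sample
    return w * y                          # one more filter step with zero input gives X(k)

x = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]
k = 3
direct = sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / len(x)) for n in range(len(x)))
print(abs(goertzel_first_order(x, k) - direct))  # rounding-error level
```

Each tone costs N complex multiply-adds, so this only pays off when the number of tones needed is small compared with log N.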
Related transforms
Mostly focused on efficient matrix-vector product involving Fourier matrix
  No assumption made on the input/output vector
  Some assumptions on these lead to related transforms
    Discrete Hartley Transform (DHT)
    Discrete Cosine (and Sine) Transform (DCT, DST)
Related transforms: DHT
Proposed as an alternative to DFT
Initial (false) claims of improved arithmetic complexity
  Real-valued FFT complexity is equivalent
Self-inverse
  Provided that $X_0$ further weighted by 1/sqrt(2)
  Inverse real DFT on Hermitian data
    Same complexity as the real DFT so no significant gain from self-inverse property of DHT
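The defining formula did not survive on this slide, so the sketch below assumes the standard unnormalized DHT with the cas kernel, $H_k = \sum_n x_n (\cos(2\pi nk/N) + \sin(2\pi nk/N))$. Under that assumption the DHT of real data is just Re(X_k) - Im(X_k) of the DFT, and applying the DHT twice returns N times the input.

```python
import cmath
import math

def dft(x):
    """Direct forward DFT."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def dht_direct(x):
    """Unnormalized DHT with the cas kernel (assumed definition)."""
    N = len(x)
    return [sum(x[n] * (math.cos(2 * math.pi * n * k / N) + math.sin(2 * math.pi * n * k / N))
                for n in range(N))
            for k in range(N)]

def dht_from_dft(x):
    """DHT of real data obtained from the DFT: H[k] = Re(X[k]) - Im(X[k])."""
    return [X.real - X.imag for X in dft(x)]

x = [1.0, 2.0, -0.5, 0.25, 3.0]
map_err = max(abs(a - b) for a, b in zip(dht_direct(x), dht_from_dft(x)))
twice = dht_direct(dht_direct(x))
inv_err = max(abs(t / len(x) - v) for t, v in zip(twice, x))
print(map_err, inv_err)  # both at rounding-error level
```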
Related transforms: DCT
Lots of applications in image and video processing
Scale factor of 1/sqrt(2) for $X_0$ left out
  Formula above appears as a sub-problem in length-4N real DFT
  Multiplicative complexity can be related to real DFT
Practical algorithms depend on the transform length
  N odd: Permutations and sign changes map to real DFT
  N even: Map into same length real DFT + N/2 rotations
Relationship with FFT
DHT, DCT, DST and related transforms can all be mapped to DFT
All transforms use split-radix algorithms
  For minimum number of operations
[Figure: map of the relationships among the complex DFT (CDFT), real DFT (RDFT), real symmetric DFT (RSDFT), odd DFT (ODFT), DHT, DCT, PT, and the 2-D complex DFT, with links numbered 1 to 6 in directions a and b, and a table giving, for each link, a length-$2^n$ transform expressed through other transforms plus extra multiplications and additions. Figure by MIT OpenCourseWare.]
Implementation issues
General purpose computers
Digital signal processors
Vector/multi processors
VLSI ASICs
Implementation on general purpose computers
FFT algorithms built by repetitive use of basic building blocks
  CTFFT and SRFFT butterflies are small – easily optimizable
  PFA/WFTA blocks are larger
More time is spent on load/store operations than in actual arithmetic (cache miss and memory access latency problem)
  Locality is of utmost importance
  This is the reason why PFA and WFTA do not meet the performance expected from their computation complexity!
  PFA drawback partially compensated since only a few coefficients have to be stored
Compilers can optimize the FFT code by loop-unrolling (lots of parallelism) and tailoring to cache size (aspect ratio)
Digital Signal Processors
Built for multiply/accumulate based algorithms
  Not matched by any of the FFT algorithms
    Sums of products changed to fewer but less regular computations
Today’s DSPs take into account some FFT requirements
  Modulo counters (a power of 2 for CTFFT and SRFFT)
  Bit-reversed addressing
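Bit-reversed addressing maps index i to the index whose binary representation is i read backwards. A DSP does this in its address generator; the software sketch below (with hypothetical helper names) only shows the resulting order.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # shift the result left, append the lowest bit of i
        i >>= 1
    return r

def bit_reversed_order(N):
    """Bit-reversed index sequence for a power-of-2 length N."""
    bits = N.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(N)]

print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```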
Vector and multi-processors
Must deal with two interconnected problems
  The vector size of the data that can be processed at the maximal rate
    Has to be full as often as possible
  Loading of the data should be made from data available inside the cache memory to save time
In multi-processors performance dependent on interconnection network
  Since FFT deterministic, resource allocation can be solved off-line
  Arithmetic units specialized for butterfly operations
  Arrays with attached shuffle networks
  Pipelines of arithmetic units with intermediate storage and reordering
  Mostly favor CTFFTs
ASICs
Area and throughput are important
  A – area, T – time between two successive DFT computations
  Asymptotic lower bound for $AT^2$
    Achieved by several micro-architectures
      Shuffle-exchange networks
      Square grids
    Outperform the more traditional micro-architectures only for very large N
      Cascade connection with variable delay
Dedicated chips often based on traditional micro-architectures efficiently mapped to layout
  Cost dominated by number of multiplies but also by cost of communication
    Communication cost very hard to estimate
Dedicated arithmetic units
  Butterfly unit
  CORDIC unit
Still, many heuristics and local tricks to reduce complexity and improve communication
Architectures
1, N, $N^2$ cell type – direct transform
Cascade (pipelined) FFT
FFT network
Perfect-shuffle FFT
CCC network FFT
The Mesh FFT
The naive approach
Compute all terms in the matrix-vector product
  $N^2$ multiplications required
Three degrees of parallelism
  Calculate on one multiply-add cell
  On N multiply-add cells
  On $N^2$ multiply-add cells
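A minimal sketch of the direct transform as a matrix-vector product, with a counter confirming the N^2 multiply-adds. This corresponds to the single multiply-add cell case, where every term is computed sequentially; the function name `naive_dft` is illustrative.

```python
import cmath

def naive_dft(x):
    """Direct DFT as a matrix-vector product; returns the result and the multiply count."""
    N = len(x)
    # Fourier matrix entries W[k][n] = exp(-2j*pi*n*k/N)
    W = [[cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N)] for k in range(N)]
    mults = 0
    X = []
    for k in range(N):            # one output point per row of the matrix
        acc = 0j
        for n in range(N):        # one multiply-add per matrix entry
            acc += W[k][n] * x[n]
            mults += 1
        X.append(acc)
    return X, mults

X, mults = naive_dft([1.0, 0.0, 0.0, 0.0])
print(mults)  # 16, i.e. N^2 for N = 4
```

With N cells each row's inner loop runs on its own cell, and with N^2 cells every product is formed at once, leaving only the summation.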
1 multiply-add cell
Performance $O(N^2 \log N)$
For large FFTs storage of intermediate results is a problem
N-long FFT requires
  $(N/r) \log_r N$ radix-r butterfly operations
  $2N \log_r N$ read or write RAM accesses
E.g. to do the 8K FFT in 1 ms, need to access internal RAM every 9 ns, using radix-4
  Need a very high rate clock
To speed up
  Either use higher radix (to reduce the overall number of memory accesses at the price of increase in arithmetic complexity)
  Or partition the memory to r banks accessed simultaneously (more complex addressing and higher area)
[Figure: single-processor FFT block diagram with COEF ROM, INPUT BUFFER, FFT RAM, BUTTERFLY, and N CONTROL blocks, with DATA IN and DATA OUT ports. Figure by MIT OpenCourseWare.]
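The 9 ns figure follows directly from the access-count formula. The sketch below applies the slide's expressions as written; since 8192 is not a power of 4, log_4 N comes out as 6.5 stages, which is an idealization of a mixed-radix schedule. The function name is hypothetical.

```python
import math

def single_cell_budget(N, r, t_total):
    """Butterfly count, RAM access count, and time per RAM access for a radix-r FFT."""
    stages = math.log(N, r)            # log_r N (may be fractional)
    butterflies = (N / r) * stages     # (N/r) * log_r N
    accesses = 2 * N * stages          # 2N * log_r N reads or writes
    return butterflies, accesses, t_total / accesses

butterflies, accesses, t_access = single_cell_budget(8192, 4, 1e-3)
print(round(butterflies), round(accesses), t_access * 1e9)
# about 13312 butterflies, 106496 accesses, roughly 9.4 ns per access
```

Raising the radix to 8 under the same formula shrinks the stage count to 13/3, which is the access reduction the first speed-up option refers to.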
Cascade FFT
Cascade of logN multiply-add cells
Nicely suited for decimation in frequency FFT
  E.g. 8pt DIF FFT
  Produces the output values in bit-reversed order
[Figure: cascade of cells with inputs x and y. Figure by MIT OpenCourseWare.]
[Figure: 8-point FFT flow graph built from one N = 4 DFT block per half and N = 2 DFT blocks, with twiddle factors $w_8^1$, $w_8^2$, $w_8^3$; stages labeled "DFT of N/2", "Multiplication by twiddle factors", and "DFT of 2"; inputs split into $\{x_{2k}\}$ and $\{x_{2k+1}\}$; outputs X0 to X7.]
Figure by MIT OpenCourseWare.
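To make the bit-reversed output order concrete, here is a minimal in-place radix-2 decimation-in-frequency FFT. It is a software sketch of the algorithm the cascade implements, not a model of the hardware pipeline; each pass of the outer loop corresponds to one cell of the cascade.

```python
import cmath

def fft_dif_inplace(a):
    """In-place radix-2 DIF FFT; a[i] ends up holding X[bit_reverse(i)]."""
    N = len(a)                       # N must be a power of 2
    span = N // 2
    while span >= 1:                 # one pass per stage, log2(N) passes
        for start in range(0, N, 2 * span):
            for j in range(span):
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))  # twiddle factor
                u, v = a[start + j], a[start + j + span]
                a[start + j] = u + v                # sum branch
                a[start + j + span] = (u - v) * w   # difference branch, then twiddle
        span //= 2
    return a

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

x = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]
out = fft_dif_inplace([complex(v) for v in x])
natural = [out[bit_reverse(k, 3)] for k in range(8)]  # undo the bit-reversed order
print(natural[0])  # X[0] is the sum of the inputs, 5.0
```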
FFT network
One of the most obvious implementations
  Provide a multiply-add cell for each execution statement
  Each cell also has a register holding a particular value of zj (twiddle factor)
  How many such cells do we need for length-N (radix-2 DIT)?
One possible layout
  logN rows, N/2 cells each row
Pipelined performance O(logN)
  A new problem instance can enter the network as soon as the previous one has left the first row
  Delay limited by cell’s multiply-add and long-wire driver to the next row, O(logN)
Total network delay is $O(\log^2 N)$
[Figure: FFT network for N = 8, three rows of four cells. Row 1: (X0,X4) (X1,X5) (X2,X6) (X3,X7). Row 2: (X0,X2) (X1,X3) (X6,X4) (X7,X5). Row 3: (X0,X1) (X3,X2) (X4,X5) (X7,X6). Outputs: y0 y4 y2 y6 y1 y5 y3 y7.]
Figure by MIT OpenCourseWare.
FFT network
Inputs are in “bit-shuffled” order (decimated)
Outputs are in “bit-reversed” order
  Minimizes the amount of interconnects
General scheme for interconnections
  Number the cells naturally
    0 to N/2-1, from left to right
  Cell i in the first row is connected to two cells in the second row
    Cell i and (i+N/4) mod N/2
  Cell i in the second row is connected to cells i and (N/4)*floor(i/(N/4)) + ((i+N/8) mod N/4) in the third row
  Cell i in the k-th row (k=1,…,logN-1) is connected to the (k+1)-th row
    Cell i and cell $(N/2^k) \lfloor i/(N/2^k) \rfloor + ((i + N/2^{k+1}) \bmod N/2^k)$
[Figure: the same N = 8 FFT network, three rows of four cells with outputs y0 y4 y2 y6 y1 y5 y3 y7.]
Figure by MIT OpenCourseWare.
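The interconnection rule for the FFT network can be sketched in a few lines. One assumption is made explicit here: the block index floor(i/(N/2^k)) is multiplied by the block size N/2^k so that the partner cell stays inside the same block of the row. The function name `next_row_cells` is hypothetical.

```python
def next_row_cells(i, k, N):
    """Cells in row k+1 that cell i of row k feeds (rows k = 1 .. log2(N) - 1)."""
    block = N >> k       # N / 2^k cells per block in row k
    half = block >> 1    # N / 2^(k+1), the offset to the partner cell
    # Assumption: block base = block * floor(i / block), keeping the partner in-block
    partner = (i // block) * block + (i + half) % block
    return i, partner

N = 8
for k in (1, 2):
    print(k, [next_row_cells(i, k, N) for i in range(N // 2)])
# 1 [(0, 2), (1, 3), (2, 0), (3, 1)]
# 2 [(0, 1), (1, 0), (2, 3), (3, 2)]
```

The wire span halves from row to row, which is the usual butterfly pattern.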
The perfect-shuffle network
N/2 element network perfectly suited for FFT, radix-2 DIT
  Each multiply-add cell associated with $x_k$ and $x_{k+1}$ (k an even number between 0 and N-1)
  A connection from cell with $x_k$ to cell with $x_j$ when j = 2k mod (N-1) (this mapping is one-to-one)
    Represents “circular left shift” of the logN-bit binary representation of k
First the $x_k$ values are loaded into cells
  In each iteration, output values are shuffled among cells
  At the end of logN steps, final data is in cell registers in bit-reversed order
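The equivalence between the shuffle mapping and a circular left shift can be checked directly. In this sketch the index N-1 is treated as a fixed point, an assumption needed because 2k mod (N-1) would otherwise send it to 0; the function names are illustrative.

```python
def shuffle(k, N):
    """Perfect-shuffle destination of index k: 2k mod (N-1), with N-1 mapped to itself."""
    return k if k == N - 1 else (2 * k) % (N - 1)

def rotate_left(k, bits):
    """Circular left shift of the `bits`-bit binary representation of k."""
    return ((k << 1) | (k >> (bits - 1))) & ((1 << bits) - 1)

N, bits = 8, 3
print([shuffle(k, N) for k in range(N)])  # [0, 2, 4, 6, 1, 3, 5, 7]
print(all(shuffle(k, N) == rotate_left(k, bits) for k in range(N)))  # True
```

Applying the shuffle logN times rotates every index through all its bit positions and returns the data to its starting cells.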
[Figure: perfect-shuffle network for N = 8 with four cells holding (X0,X1), (X2,X3), (X4,X5), (X6,X7).]
Figure by MIT OpenCourseWare.
Cube-Connected-Cycles (CCC) network
N cells capable of performing N-element FFT in O(logN) steps
Closely related to the FFT network
  Just has circular connections between first and last rows (and uses N instead of (N/2)logN cells)
Does not exist for all N (only for N=(K/2)*logK for integer K)
Figure by MIT OpenCourseWare.
The Mesh implementation
Approximately sqrt(N) rows and columns
N-long FFT in logN steps
  View as time-multiplexed version of the FFT network
  In each step, N/2 nodes take the role of N/2 cells in FFT network
  Other half routes the data to other nodes
Figure by MIT OpenCourseWare.
Performance summary
Design           Area             Time         Delay        Area*Time^2
1-cell DFT       N log N          N^2 log N    N^2 log N    N^5 log^3 N
N-cell DFT       N log N          N log N      N^2 log N    N^3 log^3 N
N^2-cell DFT     N^2 log N        log N        N^2 log N    N^2 log^3 N
1-proc FFT       N log N          N log^2 N    N log^2 N    N^3 log^5 N
Cascade          N log N          N log N      N log^2 N    N^3 log^3 N
FFT Network      N^2              log N        log^2 N      N^2 log^2 N
Perfect Shuffle  N^2 / log^2 N    log^2 N      log^2 N      N^2 log^2 N
CCC              N^2 / log^2 N    log^2 N      log^2 N      N^2 log^2 N
Mesh             N log^2 N        sqrt(N)      sqrt(N)      N^2 log^2 N
Figure by MIT OpenCourseWare.
Cascade FFT has the best trade-off
  Less complicated wiring and NlogN delay
FFT network is as fast as the $N^2$-cell DFT but uses much less area (only (N/2)logN cells)
Perfect-Shuffle and CCC use fewer cells than FFT network, but take a bit more time
Readings
[1] C.D. Thompson, "Fourier Transforms in VLSI," no. UCB/CSD-82-105, 1982.
  Same, but hard to find publication
[2] C.D. Thompson, "Fourier Transforms in VLSI," IEEE Trans. Computers, vol. 32, no. 11, pp. 1047-1057, 1983.
[3] P. Duhamel and M. Vetterli, "Fast Fourier transforms: a tutorial review and a state of the art," Signal Process., vol. 19, no. 4, pp. 259-299, 1990.