fast fourier transform - mit opencourseware

6.973 Communication System Design – Spring 2006Massachusetts Institute of Technology

Vladimir Stojanović

Cite as: Vladimir Stojanovic, course materials for 6.973 Communication System Design, Spring 2006.MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology.

Downloaded on [DD Month YYYY].

Fast Fourier Transform:Practical aspects and Basic Architectures

Lecture 9

Multiplication complexity per output point

CTFFT and SRFFT



6.973 Communication System Design 2

CTFFT and SRFFT

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

9.0

10.0

3 4 5 6 7 8 9 10n = log2 N

M/N

- radix 2- radix 4- split-radix

- lower boundMr

Mc

Figure by MIT OpenCoursWare.

3

Multiplies and adds

Real multiplies

Real adds



6.973 Communication System Design

N Radix 2

Radix 2

Radix 4

Radix 4

SRFFT

SRFFT

PFA

PFA

Winograd

Winograd

16

32

64

128

256

512

10242048

16

32

64

128

256

512

10242048

30

60

120

240

504

1008

2520

30

60

120

240

504

1008

2520

24

88

264

712

1800

4360

1024823560

20

208

1392

7856

20

6868

196

516

1284

3076

717216388

136

276

632

1572

3548

9492

N

152

408

1032

2504

5896

13566

3072868616

148

976

5488

28336

148

388

964

2308

5380

12292

2765261444

384

888

2076

4812

13388

29548

84076

384

888

2076

5016

14540

34668

99628

100

200

460

1100

2524

5804

17660

Figure by MIT OpenCourseWare.

Structural considerationsHow to compare different FFT algorithms?Many metrics to choose from

The ease of obtaining the inverse FFTIn-place computationRegularity

ComputationInterconnect

Parallelism and pipeliningQuantization noise




Inverse FFTFFTs often used for computing FIR filtering

Fast convolution (FFT + pointwise multiply + IFFT)In some applications (like 802.11a)

Can reuse FFT block to do the IFFT (half-duplex scheme)Simple trick [Duhamel88]

Swap the real and imaginary inputs and outputsIf FFT(xR,xI,N) computes the FFT of sequence xR(k)+jxI(k)Then FFT(xI,xR,N) computes the IFFT of jxR(k)+xI(k)

Exchange the real and imag part




In-place computationMost algorithms allow in-place computation

Cooley-Tukey, SRFFT, PFANo auxilary storage of size dependent on N is neededWFTA (Winograd Fourier Transform Algorithm) does not allow in-place computation

A drawback for large sequences

Cooley-Tukey and SRFFT are most compatible with longer size FFTs




Regularity, parallelismRegularity

Cooley-Tukey FFT very regularRepeat butterflies of same type

Sums and twiddle multipliesSRFFT slightly more involved

Different butterfly types in parallele.g. radix-2 and radix-4 used in parallel on even/odd samples

PFA even more involvedRepetitive use of more complicated modules (like cyclic convolution, for prime length DFTs)

WFTA most involvedRepetition of parts of the cyclic conv. modules from PFA

ParallelizationFairly easy for C-TFFT and SRFFT

Small modules applied on sets of data that are separable and contiguousMore difficult for PFA

Data required for each module not in contiguous locations





Quantization noiseRoundoff noise generated by finite precision of operations inside FFT (adds, multiplies)CTFFT (lengths 2n)

Four error sources per butterfly (variance 2-2B/12)Total variance per butterfly 2-2B/3

Each output node receives signals from a total of N-1 butterflies in the flow graph (N/2 from the first stage, N/4 in the second, …)

Total variance for each output ~ N/3*2-2B

Assuming input power 1/3N2 (|x(n)|<1/N to avoid overflow)Output power is 1/3N

Error-to-signal ratio is then N22-2B (needs 1 additional bit per stage to maintain SER)Since a maximum magnitude increases by less than 2x from stage to stage we can prevent the overflow by requiring that |x(n)|<1 and scaling by ½ from stage-to-stage

The output will be 1/N of the previous case, but the input magnitude can be Nx larger, improving the SERError-to-signal ratio is then 4N*2-2B (needs 1/2 additional bit per stage to maintain SER)

Radix-4 and SRFFT generate less roundoff noise than radix-2WFTA

Fewer multiplications (hence fewer noise sources)More difficult to include proper rescaling in the algorithm

Error-to-signal ratio is higher than in CTFFT or SRFFTTwo more bits are necessary to represent data in WFTA for same error order as CTFFT



Particular casesDFT algorithms for real data sequence xk

Xk has Hermitian symmetry (XN-k=Xk*)

X0 is real, and when N even, XN/2 real as wellN input values map to

2 real and N/2-1 complex conjugate values when N even1 real and (N-1)/2 complex conjugate values when N odd

Can exploit the redundancyReduce complexity and storage by a factor of 2

If take the real DFT of xR and xI separately2N additions are sufficient to obtain complex DFTGoal to obtain real DFT with half multiplies and half adds

Example DIF SRFFTX2k requires half-length DFT on real dataThen b/c of Hermitian symmetry X4k+1=X*

4(N/4-k-1)+3

Only need to compute one DFT of size N/4 (not two)





DFT pruningIn practice, may only need to compute a few tones

Or only a few inputs are different from zeroTypical cases: spectral analysis, interpolation, fast convComputing a full FFT can be wasteful

Goertzel algorithmCan be obtained by simply pruning the FFT flow graphAlternately, looks just like a recursive 1-tap filter for each tone

x(n)

WN-k

X(k)

z-1



Related transformsMostly focused on efficient matrix-vector product involving Fourier matrix

No assumption made on the input/output vectorSome assumptions on these leads to related transformsDiscrete Hartley Transform (DHT)Discrete Cosine (and Sine) Transform (DCT, DST)





Related transforms: DHT

Proposed as an alternative to DFTInitial (false) claims of improved arithmetic complexity

Real-valued FFT complexity is equivalent

Self-inverseProvided that X0 further weighted by 1/sqrt(2)Inverse real DFT on Hermitian data

Same complexity as the real DFT so no significant gain from self-inverse property of DHT



Related transforms: DCT

Lots of applications in image and video processingScale factor of 1/sqrt(2) for X0 left out

Formula above appears as a sub-problem in length-4N real DFTMultiplicative complexity can be related to real DFT

Practical algorithms depend on the transform lengthN odd: Permutations and sign changes map to real DFTN even: Map into same length real DFT + N/2 rotations




Relationship with FFTDHT, DCT, DST and related transforms can all be mapped to DFT

All transforms use split-radix algorithms Figure by MIT OpenCourseWare.

For minimum number of operationsCite as: Vladimir Stojanovic, course materials for 6.973 Communication System Design, Spring 2006.

MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].


RSDFT

RDFTCDFT

ODFT DHT

PT

A

DCT2

43

5

6

b b

bb

ba a

a a

a1

(2D)

1

2

3

4

5

6

a

b

a

b

a

b

a

b

a

b

Complex DFT 2n

Complex DFT 2nx2n

Odd DFT 2n-1

1 odd DFT 2n-1+ 1 complex DFT 2n-1

1 real DFT 2n

Complex DFT 2n

Real DFT 2n

Real DFT 2n

Real DFT 2n

1 real DFT 2n-1 + 1 complex DFT 2n-2

1 real DFT 2n-1 + 2 DCTs 2n-2

Real DFT 2n

Real symmDFT 2n

DCT 2n

DHT 2n

+2n+1 - 4 additions

+ (3.2n-2 - 4) multiplications + (2n +3.2n-2 -n) additions

+ (3.2n-1 -2) multiplications + (3.2n-1 -3) additions

+ 3.2n-1 -2 additions

+2n+1 additions2 complex DFT's 2n-2

+ 2(3.2n-2-4) multiplications + (2n+3.2n-1 -8) additions

1 DHT 2n

-2 additions1 real DFT 2n

+2 additions

3.2n-1 odd DFT 2n-1 + 1 complex DFT 2n-1x2n-1

+ n.2n additions

1 real symmetric DFT 2n + 1 real antisymmetric DFT 2n

+ (6n+10), 4n-1 additions1 real symmetric DFT 2n-1 + 1 inverse real DFT

+ 3(2n-3-1)+1 multiplications + (3n-4).2n-3+1 additions

2 real DFT's 2n

Implementation issues

General purpose computersDigital signal processorsVector/multi processorsVLSI ASICs







16

Implementation on general purpose computersFFT algorithms built by repetitive use of basic building blocks

CTFFT and SRFFT butterflies are small – easily optimizablePFA/WFTA blocks are larger

More time is spent on load/store operationsThan in actual arithmetic (cache miss and memory access latency problem)Locality is of utmost importanceThis is the reason why PFA and WFTA do not meet the performance expected from their computation complexity!PFA drawback partially compensated since only a few coefficients have to be stored

Compilers can optimize the FFT code by loop-unrolling (lots of parallelism) and tailoring to cache size (aspect ratio)

Digital Signal Processors

Built for multiply/accumulate based algorithmsNot matched by any of the FFT algorithms

Sums of products changed to fewer but less regular computations

Today’s DSPs take into account some FFT requirements

Modulo counters (a power of 2 for CTFFT and SRFFT)Bit-reversed addressing




18

Vector and multi-processorsMust deal with two interconnected problems

The vector size of the data that can be processed at the maximalrate

Has to be full as often as possibleLoading of the data should be made from data available inside the cache memory to save time

In multi-processors performance dependent on interconnection network

Since FFT deterministic, resource allocation can be solved off-lineArithmetic units specialized for butterfly operationsArrays with attached shuffle networksPipelines of arithmetic units with intermediate storage and reorderingMostly favor CTFFTs




ASICsArea and throughput are important

A – area, T- time between two successive DFT computationsAsymptotic lower bound for AT2

Achieved by several micro-architecturesShuffle-exchange networksSquare grids

Outperform the more traditional micro-architectures only for very large NCascade connection with variable delay

Dedicated chips often based on traditional micro-architectures efficiently mapped to layout

Cost dominated by number of multiplies but also by cost of communicationCommunication cost very hard to estimate

Dedicated arithmetic unitsButterfly unitCORDIC unit

Still, many heuristics and local tricks to reduce complexity and improve communication




Architectures

1, N, N2 cell type – direct transformCascade (pipelined) FFTFFT networkPerfect-shuffle FFTCCC network FFTThe Mesh FFT




The naive approach

Compute all terms in the matrix-vector productN2 multiplications required

Three degrees of parallelismCalculate on one multiply-add cellOn N multiply-add cellsOn N2 multiply-add cells




1 multiply-add cellPerformance O(N2logN)

For large FFTs storage of intermediate results is a problemN-long FFT requires

N/r*logrN, radix-r butterfly operations2N*logrN read or write RAM accesses

E.g. to do the 8K FFT in 1ms, need to access internal RAM every 9ns, using radix-4

To speed upEither use higher radix (to reduce the overall number of memory accesses at the price of increase in arithmetic complexity)Or partition the memory to r banks accessed simultaneously (morecomplex addressing and higher area)

Need a very high rate clockCite as: Vladimir Stojanovic, course materials for 6.973 Communication System Design, Spring 2006.




COEFROM

INPUTBUFFER

FFTRAMBUTTERFLY

N CONTROL

DATA IN DATA OUT

Cascade FFTCascade of logN multiply-add cells

Nicely suited for decimation in frequency FFT

E.g. 8pt DIF FFT

Produces the output values in bit-reversed orderCite as: Vladimir Stojanovic, course materials for 6.973 Communication System Design, Spring 2006.



x y


DFTN = 4

DFTN = 2

DFTN = 2

DFTN = 2

DFTN = 2

X0 X0

X4

X4

X1

X1

X5

X5

X2

X2

X6

X6X3

X3

X7 X7

w18

w28

w38

DFT of2

DFT ofN / 2

Multiplicationby twiddle

factors

{ }x2k

DFTN = 4

{ }x2k+1


FFT networkOne of the most obvious implementations

Provide a multiply-add cell for each execution statementEach cell also has a register holding a particular value of zj (twiddle factor)How many such cells do we need for length-N (radix-2 DIT)?

One possible layoutlogN rows, N/2 cells each row

Pipelined performance O(logN)A new problem instance can enter the network as soon as the previous one has left the first rowDelay limited by cell’s multiply-add and long-wire driver to the next row O(logN)

Total network delay is O(log2N)Cite as: Vladimir Stojanovic, course materials for 6.973 Communication System Design, Spring 2006.



X0X4 X1X5 X2X6

X0X2 X1X3 X6X4

X0X1 X3X2 X4X5

y0 y4 y2 y6 y1 y5 y3 y7

X7X6

X7X5

X3X7


25

FFT networkInputs are in “bit-shuffled” order (decimated)Outputs are in “bit-reversed” order

Minimizes the amount of interconnects

General scheme for interconnectionsNumber the cells naturally

0 to N/2-1, from left to rightCell i in the first row is connected to two cells in

d

X0X4 X1X5 X2X6

X0X2 X1X3 X6X4

X0X1 X3X2 X4X5

y0 y4 y2 y6 y1 y5 y3 y7

X7X6

X7X5

X3X7

the second rowCell i and (i+N/4) mod N/2

Cell i in the second row is connected to cellsi and floor(i/(N/4))+((i+N/8) mod N/4) in the thirdrow

Cell i in the k-th row (k=1,…logN-1) is connecteto (k+1)-th row

Cell i and cell floor(i/(N/2k))+((i+N/2k+1) mod N/2k)





The perfect-shuffle networkN/2 element network perfectly suited for FFT, radix-2 DIT

26

Each multiply-add cell associated with xk and xk+1 (k- even number between 0 and N-1)A connection from cell with xk to cell with xj when j=2k mod N-1 (this mapping is one-to-one)

Represents “circular left shift” of the logN-bit binary representation of k

First the xk values are loaded into cellsIn each iteration, output values are shuffled among cellsAt the end of logN steps, final data is in cell registers in bit-reversed order




X0,X1 X2,X3 X4,X5 X6,X7


Cube-Connected-Cycles (CCC) network

N cells capable of performing N-element FFT in O(logN) stepsClosely related to the FFT network

Just has circular connections between first and last rows (and uses N instead of N/2logN cells)

Does not exist for all N (only for N=(K/2)*logK for integer KCite as: Vladimir Stojanovic, course materials for 6.973 Communication System Design, Spring 2006.




28

The Mesh implementationApproximately sqrt(N) rows and columnsN-long FFT in logN steps

View as time-multiplexed version of the FFT network

In each step, N/2 nodes take the role of N/2 cells in FFT networkOther half routes the data other nodes





Performance summary




Design Area Time DelayArea*Time2

1-cell DFT

N-cell DFT

N log N

N log NN log N

N5log3N

N3log3N

N2log N

N2log N

N2log N

Perfect Shuffle N2 / log2 N N2log2N log2N log2 N

CCC log2NN2log2NN2 / log2 N log2 N

1-proc FFT N3log5N N log2NN log2 NN log N

Cascade N3log3N N log2NN log NN log N

FFT Network N2log2N log2Nlog NN2

N2-cell DFT log N N2log3N N2log NN2log N

Mesh N2log2NN log2 N N N


Cascade FFT has the best trade-offLess complicated wiring and NlogN delay

FFT network is as fast as N2 cell FFT but much less area (only N/2logN cells)Perfect-Shuffle and CCC use less cells than FFT network, but take a bit more time

30

Readings[1] C.D. Thompson "Fourier Transforms in VLSI," no. UCB/CSD-82-105, 1982.

Same, but hard to find publication[2] C.D. Thompson "Fourier Transforms in VLSI," IEEE Trans. Computers vol. 32, no. 11, pp. 1047-1057, 1983.

[3] P. Duhamel and M. Vetterli "Fast fouriertransforms: a tutorial review and a state of the art," Signal Process. vol. 19, no. 4, pp. 259-299, 1990.




fast fourier transform - mit opencourseware

Documents