A Fast Fourier Transform Compiler - Columbia University (aho/cs6998/lectures/11-10-25_Le_FFTW.pdf)


A Fast Fourier Transform Compiler

Matteo Frigo

MIT Laboratory for Computer Science

February 16, 1999

Presented by Tam Le

October 25, 2011

Matteo Frigo A Fast Fourier Transform Compiler


The Fast Fourier Transform

“The FFT has been called the most important numerical algorithm of our lifetime...” [Ken02]


The Discrete Fourier Transform Defined

- The forward DFT:

  $$Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij}$$

  where $\omega_n = e^{2\pi\sqrt{-1}/n}$ and $0 \le i < n$

- When X is real, the transform Y has Hermitian symmetry:

  $$Y[n-i] = Y^*[i]$$

  where $Y^*[i]$ is the complex conjugate of $Y[i]$
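
To make the definition concrete, here is a minimal Python sketch that evaluates the forward DFT directly from the sum and checks the Hermitian symmetry on a real input. The function names `forward_dft` and `is_hermitian` are invented for this illustration and are not part of FFTW.

```python
import cmath

def forward_dft(x):
    """Forward DFT from the definition: Y[i] = sum_j X[j] * w_n^(-i*j)."""
    n = len(x)
    w = cmath.exp(2j * cmath.pi / n)  # w_n = e^(2*pi*sqrt(-1)/n)
    return [sum(x[j] * w ** (-i * j) for j in range(n)) for i in range(n)]

def is_hermitian(y, tol=1e-9):
    """Check Y[n - i] == conj(Y[i]) for 0 < i < n."""
    n = len(y)
    return all(abs(y[n - i] - y[i].conjugate()) < tol for i in range(1, n))

real_input = [1.0, 2.0, 0.5, -1.0, 3.0]
spectrum = forward_dft(real_input)
print(is_hermitian(spectrum))
```

For a real input only about half of the outputs carry independent information, which is why real-to-complex transforms can be cheaper than general complex ones.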


The Discrete Fourier Transform Defined

- The backward DFT flips the sign in the exponent of $\omega_n$ and is defined as:

  $$Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{ij}$$

- The backward DFT is the “scaled inverse” of the forward DFT, i.e. the backward transform of the forward transform yields the original array multiplied by n
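
The "scaled inverse" property can be checked numerically. The sketch below is an illustration only: the helper `dft` and its `sign` parameter are assumptions made for this example, not FFTW's interface.

```python
import cmath

def dft(x, sign):
    """DFT from the definition; sign=-1 is the forward transform, sign=+1 the backward one."""
    n = len(x)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(x[j] * w ** (sign * i * j) for j in range(n)) for i in range(n)]

x = [1 + 2j, -0.5j, 3.0, 4 - 1j]
round_trip = dft(dft(x, sign=-1), sign=+1)   # equals n * x up to rounding
recovered = [v / len(x) for v in round_trip] # divide by n to undo the scaling
```

Leaving the division by n to the caller saves n multiplications in applications that can fold the scale factor into another step.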


Fast Fourier Transform Algorithms

Cooley-Tukey [CT65]

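- If n can be factored as $n = n_1 n_2$, rewrite the DFT as:

  $$Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\,\omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2}$$

  where $j = j_1 n_2 + j_2$ and $i = i_1 + i_2 n_1$

- This divide-and-conquer scheme recursively breaks down a DFT of size n into smaller DFTs of sizes $n_1$ and $n_2$

- The factors $\omega_n^{-i_1 j_2}$ are called twiddle factors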
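
The decomposition can be sketched in Python as a single, non-recursive Cooley-Tukey step and compared against the definition. This is an illustration under assumptions: `omega`, `naive_dft` and `cooley_tukey_step` are names chosen here, and a real implementation would recurse on the inner and outer DFTs instead of summing them directly.

```python
import cmath

def omega(n, k):
    """Return w_n^k = exp(2*pi*sqrt(-1)*k/n)."""
    return cmath.exp(2j * cmath.pi * k / n)

def naive_dft(x):
    n = len(x)
    return [sum(x[j] * omega(n, -i * j) for j in range(n)) for i in range(n)]

def cooley_tukey_step(x, n1, n2):
    """One Cooley-Tukey step for n = n1*n2, with j = j1*n2 + j2 and i = i1 + i2*n1."""
    n = n1 * n2
    assert len(x) == n
    y = [0j] * n
    for i1 in range(n1):
        # Inner sums: one output of a size-n1 DFT per j2, times the twiddle factor.
        inner = []
        for j2 in range(n2):
            s = sum(x[j1 * n2 + j2] * omega(n1, -i1 * j1) for j1 in range(n1))
            inner.append(s * omega(n, -i1 * j2))
        # Outer sum: a size-n2 DFT over j2.
        for i2 in range(n2):
            y[i1 + i2 * n1] = sum(inner[j2] * omega(n2, -i2 * j2) for j2 in range(n2))
    return y

x = [complex(k % 5, -k) for k in range(12)]
y = cooley_tukey_step(x, 3, 4)
```

Even without recursion, this step costs on the order of n(n1 + n2) multiplications instead of n², which is where the savings come from.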


Fast Fourier Transform Algorithms

Prime Factor [OS89]

- Works for $n = n_1 n_2$ when $\gcd(n_1, n_2) = 1$

- Avoids the multiplication by twiddle factors, at the cost of a more involved computation of indices
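
A hedged Python sketch of the index computation follows. It uses one common choice of index maps (a plain modular map for inputs and a Chinese-remainder map for outputs); the slides do not say which maps genfft uses, so treat the mapping and the name `prime_factor_dft` as assumptions made for illustration.

```python
import cmath
from math import gcd

def omega(n, k):
    return cmath.exp(2j * cmath.pi * k / n)

def naive_dft(x):
    n = len(x)
    return [sum(x[j] * omega(n, -i * j) for j in range(n)) for i in range(n)]

def prime_factor_dft(x, n1, n2):
    """DFT of size n1*n2 with gcd(n1, n2) = 1 and no twiddle factors."""
    n = n1 * n2
    assert len(x) == n and gcd(n1, n2) == 1
    inv2 = pow(n2, -1, n1)  # n2 * inv2 == 1 (mod n1); negative exponent needs Python 3.8+
    inv1 = pow(n1, -1, n2)  # n1 * inv1 == 1 (mod n2)
    y = [0j] * n
    for i1 in range(n1):
        # Size-n1 DFTs along j1, one per j2; no twiddle multiplication follows.
        rows = [sum(x[(j1 * n2 + j2 * n1) % n] * omega(n1, -i1 * j1) for j1 in range(n1))
                for j2 in range(n2)]
        for i2 in range(n2):
            i = (i1 * n2 * inv2 + i2 * n1 * inv1) % n  # output index via the CRT
            y[i] = sum(rows[j2] * omega(n2, -i2 * j2) for j2 in range(n2))
    return y

x = [complex(k * k % 7, k) for k in range(15)]
y = prime_factor_dft(x, 3, 5)
```

Compared with the Cooley-Tukey step, the arithmetic between the two stages disappears and the complexity moves into the modular index expressions.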


Fast Fourier Transform Algorithms

Split-Radix [DV90]

- Works when n is a multiple of 4 (n = 4k)

- Can save some operations compared with Cooley-Tukey


Fast Fourier Transform Algorithms

Rader’s [Rad68]

- Works when n is prime

- Re-expresses the DFT as a “cyclic convolution” of size n − 1

- A special case of the Winograd algorithm [Win78]
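
The re-expression as a cyclic convolution can be sketched as follows. This is a minimal illustration assuming an odd prime n, a brute-force search for a primitive root, and a direct (quadratic) convolution; the names `primitive_root` and `rader_dft` are hypothetical, and a practical implementation would compute the convolution with a fast algorithm.

```python
import cmath

def naive_dft(x):
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n) for j in range(n))
            for i in range(n)]

def primitive_root(p):
    """Brute-force search for a generator of the multiplicative group mod an odd prime p."""
    for g in range(2, p):
        if len({pow(g, k, p) for k in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no primitive root found; is p an odd prime?")

def rader_dft(x):
    n = len(x)
    m = n - 1
    g = primitive_root(n)
    # a[q] = X[g^q]: the input permuted by powers of g.
    a = [x[pow(g, q, n)] for q in range(m)]
    # b[k] = w_n^(-g^(-k)); g^(-k) equals g^(m-k) because g^m = 1 (mod n).
    b = [cmath.exp(-2j * cmath.pi * pow(g, (m - k) % m, n) / n) for k in range(m)]
    y = [0j] * n
    y[0] = sum(x)
    for p in range(m):
        conv = sum(a[q] * b[(p - q) % m] for q in range(m))  # cyclic convolution, size n-1
        y[pow(g, (m - p) % m, n)] = x[0] + conv              # output index g^(-p)
    return y

x = [complex(k, (k * k) % 5) for k in range(13)]
y = rader_dft(x)
```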


Computational Bounds

- Calculating the DFT by straightforward application of the definition requires $O(n^2)$ arithmetic operations

- Calculating the DFT with an FFT algorithm has an upper-bound time complexity of $O(n \log n)$


The Fastest Fourier Transform in the West (FFTW)

- Original 1999 paper covers FFTW revision 2.0

- Latest version (3.0) will be discussed later

- Website: http://www.fftw.org/


What is FFTW?

- Software library of fast C routines to compute one- and multi-dimensional real and complex DFTs of arbitrary size

- Currently the fastest FFT implementation available, as upheld by regular benchmarks

- Speed advantage is due to two distinguishing features:

  - FFTW’s computational routines adapt automatically to the hardware, providing both portability and speed

  - The inner loop of FFTW is generated by a special-purpose compiler written in Objective Caml


genfft: a domain-specific FFT compiler

- The genfft compiler is the magic behind FFTW

- Written in Objective Caml 2.0

- From a complex-number FFT algorithm, automatically derives a real-number algorithm [Soren87]

- Automatically generates the inner loop of FFTW, which comprises 95% of the total code base


Benchmark: Just how fast compared to other FFTs?

Test System: 3.0 GHz Intel Core Duo, Intel compilers, 32-bit mode


Reasons for Speed Advantage?

- FFTW does not implement any single fixed DFT algorithm

- Instead, the DFT is computed using a structured library of highly optimized blocks of C code called codelets, which can be composed in many ways

- A composition of codelets is called a plan; it determines which codelets are executed and in what order


Reasons for Speed Advantage?

- At runtime, FFTW finds the optimal composition of codelets by measuring the speed of different plans and choosing the fastest

- FFTW contains 120 codelets with a total of approximately 55,000 lines of optimized code to compute forward, backward, real-to-complex, and complex-to-real transforms
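
The idea of measuring plans and keeping the fastest can be sketched in a few lines of Python. Everything here is a toy model under stated assumptions: a "plan" is just a tuple of radices, the "codelet" at the leaves is the naive DFT, and `execute` and `choose_plan` are names made up for this example rather than FFTW's planner API.

```python
import cmath
import time

def omega(n, k):
    return cmath.exp(2j * cmath.pi * k / n)

def naive_dft(x):
    n = len(x)
    return [sum(x[j] * omega(n, -i * j) for j in range(n)) for i in range(n)]

def execute(plan, x):
    """Run a plan: a tuple of radices n2; an empty plan falls back to the definition."""
    if not plan:
        return naive_dft(x)
    n, n2 = len(x), plan[0]
    n1 = n // n2
    # n2 sub-transforms of size n1 on the strided inputs x[j2], x[j2 + n2], ...
    sub = [execute(plan[1:], x[j2::n2]) for j2 in range(n2)]
    y = [0j] * n
    for i1 in range(n1):
        t = [sub[j2][i1] * omega(n, -i1 * j2) for j2 in range(n2)]  # twiddle factors
        for i2 in range(n2):
            y[i1 + i2 * n1] = sum(t[j2] * omega(n2, -i2 * j2) for j2 in range(n2))
    return y

def choose_plan(n, candidates, repeats=3):
    """Time every candidate plan on a probe input and return the fastest one."""
    probe = [complex(k, -k) for k in range(n)]

    def cost(plan):
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            execute(plan, probe)
            timings.append(time.perf_counter() - start)
        return min(timings)

    return min(candidates, key=cost)

candidates = [(), (8,), (4, 4), (2, 2, 2, 2, 2)]
best = choose_plan(64, candidates)
```

Which plan wins depends on the machine and the interpreter, which is the point of measuring instead of fixing the choice in advance.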


Outline of the Compilation Strategy

1. Creation: genfft produces a directed acyclic graph (dag) of the codelet according to an algorithm for the DFT; genfft contains a number of such algorithms and applies the most appropriate

2. Simplification: genfft applies rewriting rules to each dag node in order to simplify the node


Outline of the Compilation Strategy

3. Scheduling: genfft applies a topological sort of the dag which minimizes the number of register spills “no matter how many registers the target machine has...”

4. Unparsing: genfft finally unparses to C (or to any other language by swapping out the unparser)


dag Representation

Definition of the node data type, which represents an arithmetic expression dag. Cited [Aho86] for syntax tree representation.
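
The slide's code listing did not survive in this transcript. As a rough stand-in, the sketch below models an arithmetic expression dag in Python. The constructor names (`Num`, `Load`, `Store`, `Plus`, `Times`, `Uminus`) and the `evaluate` helper are assumptions chosen for illustration and should not be read as genfft's actual OCaml definition.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Num:
    value: float              # numeric constant

@dataclass(frozen=True)
class Load:
    name: str                 # read an input variable

@dataclass(frozen=True)
class Plus:
    terms: Tuple["Node", ...] # n-ary sum

@dataclass(frozen=True)
class Times:
    left: "Node"
    right: "Node"

@dataclass(frozen=True)
class Uminus:
    arg: "Node"

@dataclass(frozen=True)
class Store:
    name: str                 # write a result variable
    value: "Node"

Node = Union[Num, Load, Plus, Times, Uminus, Store]

def evaluate(node, env):
    """Interpret a node against a dict of variable values."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Load):
        return env[node.name]
    if isinstance(node, Plus):
        return sum(evaluate(t, env) for t in node.terms)
    if isinstance(node, Times):
        return evaluate(node.left, env) * evaluate(node.right, env)
    if isinstance(node, Uminus):
        return -evaluate(node.arg, env)
    if isinstance(node, Store):
        env[node.name] = evaluate(node.value, env)
        return env[node.name]
    raise TypeError(f"unknown node: {node!r}")

# y = 2*x0 + (-x1)
expr = Store("y", Plus((Times(Num(2.0), Load("x0")), Uminus(Load("x1")))))
env = {"x0": 3.0, "x1": 1.0}
result = evaluate(expr, env)
```

Because the nodes are immutable and hashable, a shared subexpression can be referenced from several parents, which is what turns the tree into a dag.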


dag Creation

- The function fftgen produces the expression dag

- fftgen performs symbolic evaluation of an FFT algorithm to produce the dag for a DFT of size n

- No single FFT algorithm is optimal for all sizes n, so genfft contains many algorithms and fftgen chooses the most appropriate

- For example, for a complex transform of size n = 13, the generator employs Rader’s algorithm in a variant formulated by Tolimieri et al. [Tol97]. That algorithm performs 214 real floating-point additions and 76 real multiplications, while the generated FFTW code executes only 176 additions and 68 multiplications: genfft found simplifications overlooked by the authors!


dag Creation

- For FFTW version 2.0, fftgen implemented:

  1. Cooley-Tukey for n = n1n2 where n ≠ 1
  2. Split-Radix for n a multiple of 4
  3. Prime Factor if n factors into n1n2, n ≠ 1, and gcd(n1, n2) = 1
  4. Rader’s for prime length if n = 5 or n ≥ 13
  5. Direct application of the DFT definition


dag Creation

OCaml code for the Cooley-Tukey FFT algorithm. The infix operator @* computes the complex product, while the function exp n k computes the constant $\exp(2\pi k\sqrt{-1}/n)$.


Simplification

- The simplifier traverses the dag bottom-up and applies a series of “improvements” at every node

- Common, well-known optimizations [Aho86]:

  1. Algebraic transformations: constant folding, and simplification of multiplication by 0, 1, −1 and of addition of 0
  2. Common-subexpression elimination (CSE): the simplifier is implemented in monadic style [Wad97], in which the monad performs CSE
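
The two generic optimizations can be illustrated with a small bottom-up rewriter. This Python sketch is not genfft's monadic OCaml simplifier: it represents nodes as plain tuples and performs CSE with a lookup table that returns one shared copy of each distinct node, a swapped-in technique (hash-consing) chosen for brevity. The names `rewrite` and `simplify` are hypothetical.

```python
def rewrite(kind, a, b):
    """Apply algebraic rules to one binary node whose operands are already simplified."""
    if a[0] == "num" and b[0] == "num":       # constant folding
        return ("num", a[1] + b[1] if kind == "plus" else a[1] * b[1])
    if b[0] == "num":                          # keep any constant on the left
        a, b = b, a
    if a[0] == "num":
        if kind == "plus" and a[1] == 0:       # 0 + b = b
            return b
        if kind == "times" and a[1] == 0:      # 0 * b = 0
            return ("num", 0)
        if kind == "times" and a[1] == 1:      # 1 * b = b
            return b
        if kind == "times" and a[1] == -1:     # -1 * b = -b
            return ("neg", b)
    return (kind, a, b)

def simplify(node, table=None):
    """Bottom-up simplification; `table` maps each distinct node to one shared copy (CSE)."""
    if table is None:
        table = {}
    kind = node[0]
    if kind == "neg":
        node = ("neg", simplify(node[1], table))
    elif kind in ("plus", "times"):
        node = rewrite(kind, simplify(node[1], table), simplify(node[2], table))
    return table.setdefault(node, node)

x, y = ("load", "x"), ("load", "y")
expr = ("plus",
        ("times", ("num", 1), ("plus", x, y)),
        ("times", ("plus", x, y), ("plus", ("num", 2), ("num", -2))))
simplified = simplify(expr)   # collapses to x + y
```

After simplification, structurally equal subexpressions are the same Python object, so a later code generator would emit each of them only once.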


Simplification

- DFT-specific:

  1. Eliminate negative constants. Constants generally appear in pairs (a and −a) in a DFT dag; a C compiler would store both values in the program text and load both constants into registers at runtime. Making all constants positive therefore halves the number of constant loads, speeding up generated codelets by 10-15%

  2. Network transposition. Based on the fact that a network is a dag that computes a linear function [Cro75]


Simplification

genfft’s simplifier performs three passes over the dag:

Optimize(G) =
    E := Simplify(G)
    F^T := Simplify(E^T)
    RETURN Simplify(F)
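
Here the superscript T denotes the transposed network: every edge is reversed and inputs and outputs swap roles, so the network computes the transposed linear function. The Python sketch below illustrates only that transposition step on a tiny weighted dag; the edge-list representation and the names `network_matrix` and `transpose_network` are assumptions for this example, and it presumes that inputs have no incoming edges and outputs no outgoing ones.

```python
def network_matrix(edges, inputs, outputs):
    """Matrix M with M[r][c] = coefficient of inputs[c] in outputs[r]."""
    incoming = {}
    for src, dst, weight in edges:
        incoming.setdefault(dst, []).append((src, weight))

    def coeff(node, source, memo):
        # Sum over all paths from source to node of the product of edge weights.
        if node == source:
            return 1
        if node not in memo:
            memo[node] = sum(w * coeff(s, source, memo)
                             for s, w in incoming.get(node, []))
        return memo[node]

    return [[coeff(out, inp, {}) for inp in inputs] for out in outputs]

def transpose_network(edges, inputs, outputs):
    """Reverse every edge and swap the roles of inputs and outputs."""
    return [(dst, src, w) for src, dst, w in edges], outputs, inputs

# t = x0 + x1;  y0 = 2*t;  y1 = x0 - 3*x1 + 5*t
edges = [("x0", "t", 1), ("x1", "t", 1), ("t", "y0", 2),
         ("x0", "y1", 1), ("x1", "y1", -3), ("t", "y1", 5)]
m = network_matrix(edges, ["x0", "x1"], ["y0", "y1"])
mt = network_matrix(*transpose_network(edges, ["x0", "x1"], ["y0", "y1"]))
```

Simplifying the transposed dag exposes common subexpressions that are not visible in the original orientation, which is why the middle pass operates on the transpose.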


Summary of dag transposition benefits


Scheduling

- The genfft scheduler produces a topological sort of the dag so that the register allocator of the C compiler can minimize the number of register spills

- It has been proven [HK81] that for DFTs whose size is a power of 2 ($n = 2^k$), there exists a schedule that is asymptotically optimal
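
For readers unfamiliar with the term, a topological sort orders the dag's nodes so that every operand is computed before it is used. The Python sketch below shows a plain depth-first topological sort; it is not genfft's scheduler, which additionally picks, among the many valid orders, one that keeps register spills low. The dag encoding and the name `topological_schedule` are assumptions for this example.

```python
def topological_schedule(operands):
    """Return nodes ordered so every node appears after all of its operands.

    `operands` maps each node name to the list of nodes it reads; it must be acyclic.
    """
    order, done = [], set()

    def visit(node):
        if node in done:
            return
        done.add(node)
        for dep in operands.get(node, ()):
            visit(dep)
        order.append(node)  # emitted only after all operands were emitted

    for node in operands:
        visit(node)
    return order

# A size-2 butterfly with one twiddle multiplication: t = w*x1, y0 = x0 + t, y1 = x0 - t
dag = {"y1": ["x0", "t"], "y0": ["x0", "t"], "t": ["x1"], "x0": [], "x1": []}
schedule = topological_schedule(dag)
```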


Scheduling

- genfft’s schedule is cache-oblivious, i.e. it does not depend on the number R of registers on a machine and yet is asymptotically optimal for every R

- In fact, execution of an FFT dag of size $n = 2^k$ on a machine with R registers, where R ≤ n, has:

  1. a lower bound of $\Omega(n \log n / \log R)$ register spills

  2. an upper bound: genfft’s output program incurs at most $O(n \log n / \log R)$ register spills


Runtime & Memory Footprint

- genfft takes approximately 75 seconds to generate the C code for a DFT of size n = 64 on a 200 MHz Pentium Pro machine running Linux 2.2

- genfft needs less than 3 MB of memory to complete this generation, which results in a codelet containing 912 additions and 248 multiplications

- Regeneration of the whole FFTW system can be done in approximately 15 minutes


Some Conclusions to Draw

- Optimal Performance: the main goal of the project was achieved, since up-to-date benchmarks show FFTW’s performance still ahead of competing FFTs


Some Conclusions to Draw

- Correctness: in the words of the author, “surprisingly easy.” The DFT algorithms in genfft are encoded straightforwardly in a high-level language (OCaml), and the simplification phase of the compiler transforms them into optimized code by applying simple algebraic rules that are easy to verify


Some Conclusions to Draw

- Rapid Turnaround: only around 15 minutes (back in 1999) to regenerate FFTW from scratch


Some Conclusions to Draw

- Domain-specific code enhancements: the topological sort in the scheduling phase is effective only for DFT dags and performs poorly for other computations, while the simplification phase performs certain improvements that rely on the DFT being a linear transformation

- genfft “derived” or “discovered” new algorithms, as in the case of n = 13 discussed earlier


FFTW v3.0

- Released April 2003

- Latest stable release: v3.3, Jul 26, 2011

- Major enhancements:

  1. Complete rewrite adding new algorithms and FFTs (Bluestein’s, etc.)
  2. Improved speed: programs often 20% faster than comparable FFTW 2.x code
  3. New set of APIs to support more general semantics
  4. Single Instruction, Multiple Data (SIMD) support for parallel processing CPUs (SSE, SSE2, 3DNow!, Altivec)
  5. Read release notes for full list of improvements and bug fixes: http://www.fftw.org/release-notes.html


Awards & Recognition

- 1999 J. H. Wilkinson Prize for Numerical Software (awarded every 4 years)

- 2009 Most Influential PLDI Paper Award (http://sigplan.org/award-pldi.htm)


Questions & Answers?


References

- [Ken02] Kent, Ray D. and Read, Charles (2002). Acoustic Analysis of Speech. ISBN 0-7693-0112-6. Cites Strang, G. (1994, May-June). Wavelets. American Scientist, 82, 250-255.

- [CT65] J. W. Cooley and J. W. Tukey. An algorithm for the machine computation of the complex Fourier series. Mathematics of Computation, 19:297-301, April 1965.

- [OS89] A. V. Oppenheim and R. W. Schafer. Discrete-time Signal Processing. Prentice-Hall, Englewood Cliffs, NJ 07632, 1989.

- [DV90] P. Duhamel and M. Vetterli. Fast Fourier transforms: a tutorial review and a state of the art. Signal Processing, 19:259-299, April 1990.


References

- [Rad68] C. M. Rader. Discrete Fourier transforms when the number of data samples is prime. Proc. of the IEEE, 56:1107-1108, June 1968.

- [Win78] S. Winograd. On computing the discrete Fourier transform. Mathematics of Computation, 32(1):175-199, January 1978.

- [Aho86] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers, principles, techniques, and tools. Addison-Wesley, March 1986.

- [Tol97] Richard Tolimieri, Myoung An, and Chao Lu. Algorithms for Discrete Fourier Transform and Convolution. Springer Verlag, 1997.


References

- [Wad97] Philip Wadler. How to declare an imperative. ACM Computing Surveys, 29(3):240-263, September 1997.

- [Cro75] R. E. Crochiere and A. V. Oppenheim. Analysis of linear digital networks. Proceedings of the IEEE, 63:581-595, April 1975.

- [Soren87] H. V. Sorensen, D. L. Jones, M. T. Heideman, and C. S. Burrus. Real-valued fast Fourier transform algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35(6):849-863, June 1987.
