carnegie mellon high-performance code generation for fir filters and the discrete wavelet transform...
Post on 05-Jan-2016
217 Views
Preview:
TRANSCRIPT
Car
neg
ie M
ello
n
High-Performance Code Generation for High-Performance Code Generation for FIR Filters and the Discrete Wavelet FIR Filters and the Discrete Wavelet
Transform Using SPIRALTransform Using SPIRAL
Aca Gačić
Markus Püschel
José M. F. Moura
SPIRAL project http://www.spiral.net
Electrical and Computer Engineering Department
Carnegie Mellon University
Car
neg
ie M
ello
n
We want to close this gapWe want to close this gap
Automatic performance tuningAutomatic performance tuningATLAS, PHiPAC, SPARSITY, FFTW, SPIRALATLAS, PHiPAC, SPARSITY, FFTW, SPIRAL
No filtering and wavelet kernels !
Existing FIR filtering and DWT softwareExisting FIR filtering and DWT softwareMatlabMatlab©©, Mathematica, S+Wavelets, WaveLab, Wave++, IPP, Mathematica, S+Wavelets, WaveLab, Wave++, IPP
Automatic platform-adapted solutions not available !
High-Performance Code Generation for FIR Filters High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRALand the Discrete Wavelet Transform Using SPIRAL
Aca Gačić, Markus Püschel, José M. F. Moura
SPIRAL project: http://www.spiral.netElectrical and Computer Engineering Department
Carnegie Mellon University
Support: DARPA-DSO & NSF
Car
neg
ie M
ello
n
DSP transformspecifies
userS
P I
R A
L
Sea
rch
En
gin
e
controlsimplementation options
controlsalgorithm generation
C/Fortran/SIMDcode
Formula Generator
SPL Compiler
fast algorithmas SPL formula
goes for a coffee
(or an
espres
so fo
r small tran
sfo
rms)
SPIRAL SPIRAL systemsystemAutomatic generation Automatic generation ofof fast fast implementations for DSPimplementations for DSP
runtime on given platform
platform-adaptedimplementation
comes back
nmmmn
nmnmnnm LDFTITIDFTDFT
breakdown rules
82222
84428 LFIITIFDFT
Car
neg
ie M
ello
n
• Arithmetic cost not a reliable predictor of performance • Best implementation not obvious for a human programmer • Best implementation is platform dependent
• Extremely large number of algorithms even for modest size transforms• Automatically generated code includes implementation options• Optimized through intelligent search
Capture Capture implementations inimplementations in redesigned redesigned SPIRAL frameworkSPIRAL framework
FIR Filters and Wavelets in SPIRALFIR Filters and Wavelets in SPIRAL
*,,2/2/ ),(I),(DWT),(DWT rlnnnnn H Eghghgh
Discrete Wavelet TransformDiscrete Wavelet Transform
222/
2/2/ L....)()()( kn
eTn
no
Tnn FFH hhgh,
FIR Filter TransformFIR Filter Transform hh b
kbnn FIF 1
/
...I 1/1 hhh TbkbmkLn FTF
Car
neg
ie M
ello
nMotivationMotivation
Choose Fast AlgorithmFast Algorithm Arithmetic cost – only a rough estimate of performance Many algorithms with similar cost – which one?
Design Efficient CodeEfficient Code Architecture conscious – machine dependent Obsolete when platform is changed/upgraded
Best implementationBest implementation: Algorithm + Machine + Compiler Requires experts in algorithms and computer architecture Frequent re-implementation
Better way: automatic performance tuning
FIR filters - image enhancement, equalization, speech synthesis, biomedical signal processing, …
Wavelets - image compression, detection, denoising, signal recovery Numerically intensive - significant impact on application performance
High-Performance Implementations (How-To)High-Performance Implementations (How-To)
Car
neg
ie M
ello
n
We want to close this gapWe want to close this gap
Automatic performance tuning packagesAutomatic performance tuning packagesATLAS, PHiPAC, SPARSITY, FFTW, SPIRALATLAS, PHiPAC, SPARSITY, FFTW, SPIRAL
Include several basic linear algebra and DSP functions
No filtering and wavelet kernels !
Existing FIR filtering and DWT softwareExisting FIR filtering and DWT softwareMatlabMatlab©©, Mathematica, S+Wavelets, WaveLab, Wave++, IPP, Mathematica, S+Wavelets, WaveLab, Wave++, IPP
Application oriented – not machine specific, or
Hand-tuned code for specific platform Automatic platform-adapted solutions not available !
Car
neg
ie M
ello
nSPIRAL SPIRAL systemsystemGenerator of optimized DSP transform implementationsGenerator of optimized DSP transform implementations
DSP transformspecifies
usergoes for a coffee
Formula Generator
SPL Compiler Sea
rch
En
gin
e
runtime on given platform
controlsimplementation options
controlsalgorithm generation
fast algorithmas SPL formula
C/Fortran/SIMDcode
S P
I R
A L
(or an
espres
so fo
r small tran
sfo
rms)
platform-adaptedimplementation
comes back
Car
neg
ie M
ello
nSPIRAL’s Four Key ConceptsSPIRAL’s Four Key Concepts
RuleRule
FormulaFormula
nmmmn
nmnmnnm LDFTITIDFTDFT
82222
84428 LFIITIFDFT
• a breakdown strategy - product of sparse matrices• captures important structural information
• recursive application of rules• uniquely defines an algorithm• efficient representation• easy manipulation
Rule TreeRule Tree
8DFT
2DFT 4DFT
2DFT 2DFT
• few constructs and primitives• uniquely defines an algorithm• can be translated into code
TransformTransform parameterized matrix 1,....,0,/2
nDFT nlk
njkle
Diagonal matrix (twiddles)
Kronecker product Identity Permutation
Car
neg
ie M
ello
n
x[0]
x[4]
x[2]
x[6]
x[1]
x[5]
x[3]
x[7]
X[0]
X[1]
X[2]
X[3]
X[4]
X[5]
X[6]
X[7]
W N2
W N2
W N1
W N3
W N2
x[0]
x[4]
x[1]
x[5]
x[2]
x[6]
x[3]
x[7]
X[0]
X[1]
X[2]
X[3]
X[4]
X[5]
X[6]
X[7]
W N2
W N2
W N1
W N3
W N2
Rule Trees = Formulas = AlgorithmsRule Trees = Formulas = AlgorithmsR
ule
Tre
esR
ule
Tre
esF
orm
ula
sF
orm
ula
sA
lgo
rith
ms
Alg
ori
thm
s
DFT8
DFT2DFT4
Cooley-Tukey
DFT2DFT 2
Cooley-Tukey
DFT8
DFT2 DFT4
Cooley-Tukey
DFT2 DFT2
Cooley-Tukey
Transform
Rule
8224
84248 L)DFT(IT)IDFT(DFT 8
44282428 L)DFT(IT)I(DFTDFT
4222
4222 L)DFT(IT)I(DFT
Incr
easi
ng
lev
el o
f ab
stra
ctio
nIn
crea
sin
g l
evel
of
abst
ract
ion
Car
neg
ie M
ello
n
Number of rule trees grows exponentially with Size of the transform
Number of applicable rules
Trigonometric Rule
nnn i SinDFTCosDFTDFT
DCT-Recursive Rule n
nIInnnn
nn
IInnnn
LSS
LSS
2)(
4/2/
2)(
4/2/
DCTSinDFTSinDFT
DCTCosDFTCosDFT
Type-Split Rule
nn PPIVn
IInn
IIn FP
2n/2
)(2/
)(2/
)( IDCTDCTDCT
Examples: Rules and Rule TreesExamples: Rules and Rule Trees
Type-Conversion Rule n
IInn
IVn DS )()( DCTDCT
DFT16
CosDFT16SinDFT16
Trigonometric
SinDFT8 DCT(II)4
DCT-Recursive
SinDFT 4 DCT(II)2 DCT(IV)
2DCT(II)2
DCT-Recursive Type-Split
......
SPIRAL also includes search over implementation options, such as the degree of loop unrolling
Recursive RHT Rule k
kkkk L2222222 111 IFIRHTRHT
Car
neg
ie M
ello
n
FIR Filters and the DWT in SPIRAL: ChallengesFIR Filters and the DWT in SPIRAL: Challenges
Existing filtering and wavelet algorithm representations do not capture all important structural information
SPIRAL uses a concise math framework to represent many algorithms for several transforms (DFT, DCTs, DHT, RDFT, …)
Filtering and wavelet algorithms have considerably different structure from the current SPIRAL transforms Current framework not enough to capture algorithms’ structure Existing algorithm representations need to be “spiralized” – i.e., captured
concisely using the rule formalism
Need changes on several levels: New transforms, constructs, rules, translation templates Modified search Need to rephrase several concepts of the framework
Existing framework has to be redesigned to accommodate new constructions and decomposition strategies
New framework has to be integrated in SPIRAL
Car
neg
ie M
ello
nFIR Filters: Rules and AlgorithmsFIR Filters: Rules and Algorithms
BA k BA k AIsk
New constructsNew constructs
column and row overlapped direct sum overlapped tensor product
1
0
k
iininnn xhxhy
1
11
01
1
01
0
)(
k
k
kn
h
hh
hh
h
hh
h
F
h
FIR Filter TransformFIR Filter Transform
TknnF hh 1I
Overlap-Add RuleOverlap-Add Rule
hh bk
bnn FIF 1/
• Localized access of the input data• Input is blocked into length b segments
Time Domain MethodsTime Domain Methods
standard “sum” notation
SPIRAL transform notation
SPIRAL rulecaptures the structure
concise rule notation
Baseline RuleBaseline Rule
Car
neg
ie M
ello
n
ibk
i
bvbnn TIF hh
/
1/
Blocking RuleBlocking Rule
Overlap-Save RuleOverlap-Save Rule Rk
TbkbmkLn TFTF hhhh 11/1 I
• Localized storage of the output• Output segments are computed independently
Toeplitz matrix
• Multilevel blocking• Control over input & output locality
Frequency Domain MethodsFrequency Domain Methods
1,1, knnkn EECF hh
nk x
n, 0
IknEExpanded Circulant RuleExpanded Circulant Rule
nnnC RDFTRDFTXRDFT -1 aaCyclic-RDFT RuleCyclic-RDFT Rule
Circulant matrix real DFT
nn
klnn w 222/22/ LRDFTI)X(IRDFTRDFT Radix-2 Cooley-Radix-2 Cooley-
Tukey RDFT RuleTukey RDFT Rule
Time domain - better data access structureFrequency domain - potentially reduce arithmetic cost
sparse matrix
Car
neg
ie M
ello
n
Filter Bank – Mallat’s AlgorithmFilter Bank – Mallat’s Algorithm
g(z-1)
{xk} = {cj,k }
{dj-1,k }
d0,k
c0,k
{dj-2,k }
{cj-1,k
}
2
2
2
2
2
2
{c1,k
}
g(z-1)
g(z-1)
h(z-1)
h(z-1)
h(z-1)
DWT of {xn}
DWT: Rules and AlgorithmsDWT: Rules and Algorithms
*,,2/12/2/2/ )()(I),(DWT),(DWT rlnnknnnnn FF Egh2ghgh
Mallat RuleMallat Rule Extension matrix
*,,2/2/ ),(I),(DWT),(DWT rlnnnnn H Eghghgh
Fused Mallat RuleFused Mallat Rule
T
kn
Tkn
nH g
hgh
22/
22/
I
I),(
single-level DWT• Redundant computations removed• Filter structure lost
Car
neg
ie M
ello
n
g(z-1)
h(z-1)
{xk }
2
2
{xk }
2
2
z-1
P(z-1)
{xk }odd
{xk }even
)()(
)()()(
11
111
zgzg
zhzhzP
oe
oe
1)(
01
10
)(1
1)(
01
10
)(1
/10
0
)()(
)()(1
0
10
1
1
11
11
zt
zs
zt
zs
K
K
zgzg
zhzh
i
i
oe
oe
Primal(Predict)
Dual(Update) Lifting steps
Polyphase RepresentationPolyphase Representation
222/
2/2/22/
2/2/ L)()()()()(
kne
Tn
no
Tnkne
Tn
no
Tnn FFFFH gghhgh,
Polyphase RulePolyphase Rule
Lifting SchemeLifting Scheme
Gateway to frequency domain computation
2
22/2/
2/2/
2/2/2/
2/2/2/
2/
I)(I
I)(II/1I)(
kn
nn
oTnnn
nniTn
nnn
nnn
LF
FKKH
o
i
t
sgh,
t
s Lifting RuleLifting Rule
• Asymptotically reduce cost by 50%• In-place computations
Car
neg
ie M
ello
n
Fn(h)
Fb(h)
T(h1) T(hk/b)
Overlap-Add
Blocking
.....
T(h1) T(hk/b1)
Blocking
.....F
n(h)
Fb(h)
T(h1) T(h
k/b)
Overlap-Add
Blocking
.....
T(h1) T(h
k/b1)
Blocking
.....Fn(h)
Fb(h)
T(h1) T(hk/b)
Overlap-Add
Blocking
.....
T(h1) T(hk/b1)
Blocking
.....
Fn(h)
Cb+l-1(h)
Fb(h)
RDFTb+l-1
RDFTb+l-1/2 RDFT2
T(hL) T(hR)
Overlap-Save
Expanded-Circulant
Cyclic-RDFT
RDFT Cooley-Tukey
Fn(h)
Fb(h)
T(h1) T(hk/b)
Overlap-Add
Blocking
.....
T(h1) T(hk/b1)
Blocking
.....
DWTn(h,g)
Cb+l-1(h)
Fn/2(he)
RDFTb+l-1
RDFT b+l-1/2 RDFT2
Polyphase
Expanded-Circulant
Cyclic-RDFT
RDFT Cooley-Tukey
DWT n/2(h,g) Fn/2(ho) Fn/2(ge) Fn/2(go)
........
........
........
........
Examples: FIR Filter and DWT Rule TreesExamples: FIR Filter and DWT Rule Trees
DWTn(h,g)
Fn/2(h)
Mallat
DWTn/2(h,g) Fn/2(g)
........
........
T(h1) T(hk/b)
Blocking
.....
All these rule trees form an extensive search space of alternative algorithms in SPIRAL
Exceedingly large number of trees even for modest size transforms
Car
neg
ie M
ello
nTime-Domain MethodsTime-Domain Methods
60-70% improvement over baselineSame arithmetic cost
unrolled loops 1- level blockingblocks size 2
Pentium 4Pentium 4 SUNSUN
• overlap-add, overlap-save, and blocking• all methods have the same cost• baseline method = overlap-add with b=1• sanity check – percentage of the peak performance
flops / peak performance
comparison of different time-domain methods
FIR FilterFIR Filter
Car
neg
ie M
ello
nFrequency Domain vs. Time DomainFrequency Domain vs. Time Domain
circulant too sparse
FD methods better on large “fat” circulantsFilter lengths > 64
Considerable discrepancy between arithmetic costs and actual runtimes
best time-domain
FIR FilterFIR Filter
arithmetic cost cross-over
runtime cross-over
CirculantCirculant
“half–dense” circulants emerge in frequency domain methods
Frequency domain – potentially lower costTime domain – beter data access patterns
Car
neg
ie M
ello
nDaubechies DDaubechies D4 4 DWT DWT
XeonXeon AthlonAthlon
cache size reacheddirect method
flops / peak performance for D4 and D30
• single-level DWT using D4
• 3 different classes of rule trees based on the initial rule
•Fused Mallat Rule (direct method)•Polyphase Rule (frequency domain)•Lifting Rule (lifting scheme)
comparison of runtimes
Frequency domain trees not foundLifting scheme – 30% lower cost – slower
Short WaveletShort Wavelet
Car
neg
ie M
ello
nDaubechies DDaubechies D3030 DWT Experiment DWT Experiment
Long WaveletLong Wavelet
XeonXeon AthlonAthlon
cache size reachedextension overhead
liftingbetter
tempcache
Different methods are found for different size rangesBest implementations vary from machine to machine
Car
neg
ie M
ello
nSummarySummary
Fast automatically generated code for FIR filters and DWT Significant contribution – no existing solutions
Preliminary results Arithmetic cost not a reliable predictor of performance Best implementation not obvious for a human programmer Best implementations vary over different computer platforms
What lies ahead?What lies ahead? Further algorithmic and implementation optimizations
Fusion of the extension, exploiting in-place structure, scheduling,..
Entire class of wavelets and extension methods Orthogonal, biorthogonal, frames,… Periodic, symmetric, polynomial extension, zero-padding
Extend to other wavelet transforms 2D-DWT, M-band, wavelet packet transform
Utilize special instruction set extensions Single Instruction Multiple Data (SIMD) Fused Multiply-Add (FMA)
top related