carnegie mellon high-performance code generation for fir filters and the discrete wavelet transform...

21
Carnegie Mellon High-Performance Code Generation High-Performance Code Generation for FIR Filters and the for FIR Filters and the Discrete Wavelet Transform Using Discrete Wavelet Transform Using SPIRAL SPIRAL Aca Gačić Markus Püschel José M. F. Moura SPIRAL project http://www.spiral.net Electrical and Computer Engineering Department Carnegie Mellon University

Upload: winfred-shields

Post on 05-Jan-2016

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

n

High-Performance Code Generation for High-Performance Code Generation for FIR Filters and the Discrete Wavelet FIR Filters and the Discrete Wavelet

Transform Using SPIRALTransform Using SPIRAL

Aca Gačić

Markus Püschel

José M. F. Moura

SPIRAL project http://www.spiral.net

Electrical and Computer Engineering Department

Carnegie Mellon University

Page 2: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

n

We want to close this gapWe want to close this gap

Automatic performance tuningAutomatic performance tuningATLAS, PHiPAC, SPARSITY, FFTW, SPIRALATLAS, PHiPAC, SPARSITY, FFTW, SPIRAL

No filtering and wavelet kernels !

Existing FIR filtering and DWT softwareExisting FIR filtering and DWT softwareMatlabMatlab©©, Mathematica, S+Wavelets, WaveLab, Wave++, IPP, Mathematica, S+Wavelets, WaveLab, Wave++, IPP

Automatic platform-adapted solutions not available !

High-Performance Code Generation for FIR Filters High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRALand the Discrete Wavelet Transform Using SPIRAL

Aca Gačić, Markus Püschel, José M. F. Moura

SPIRAL project: http://www.spiral.netElectrical and Computer Engineering Department

Carnegie Mellon University

Support: DARPA-DSO & NSF

Page 3: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

n

DSP transformspecifies

userS

P I

R A

L

Sea

rch

En

gin

e

controlsimplementation options

controlsalgorithm generation

C/Fortran/SIMDcode

Formula Generator

SPL Compiler

fast algorithmas SPL formula

goes for a coffee

(or an

espres

so fo

r small tran

sfo

rms)

SPIRAL SPIRAL systemsystemAutomatic generation Automatic generation ofof fast fast implementations for DSPimplementations for DSP

runtime on given platform

platform-adaptedimplementation

comes back

nmmmn

nmnmnnm LDFTITIDFTDFT

breakdown rules

82222

84428 LFIITIFDFT

Page 4: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

n

• Arithmetic cost not a reliable predictor of performance • Best implementation not obvious for a human programmer • Best implementation is platform dependent

• Extremely large number of algorithms even for modest size transforms• Automatically generated code includes implementation options• Optimized through intelligent search

Capture Capture implementations inimplementations in redesigned redesigned SPIRAL frameworkSPIRAL framework

FIR Filters and Wavelets in SPIRALFIR Filters and Wavelets in SPIRAL

*,,2/2/ ),(I),(DWT),(DWT rlnnnnn H Eghghgh

Discrete Wavelet TransformDiscrete Wavelet Transform

222/

2/2/ L....)()()( kn

eTn

no

Tnn FFH hhgh,

FIR Filter TransformFIR Filter Transform hh b

kbnn FIF 1

/

...I 1/1 hhh TbkbmkLn FTF

Page 5: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

nMotivationMotivation

Choose Fast AlgorithmFast Algorithm Arithmetic cost – only a rough estimate of performance Many algorithms with similar cost – which one?

Design Efficient CodeEfficient Code Architecture conscious – machine dependent Obsolete when platform is changed/upgraded

Best implementationBest implementation: Algorithm + Machine + Compiler Requires experts in algorithms and computer architecture Frequent re-implementation

Better way: automatic performance tuning

FIR filters - image enhancement, equalization, speech synthesis, biomedical signal processing, …

Wavelets - image compression, detection, denoising, signal recovery Numerically intensive - significant impact on application performance

High-Performance Implementations (How-To)High-Performance Implementations (How-To)

Page 6: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

n

We want to close this gapWe want to close this gap

Automatic performance tuning packagesAutomatic performance tuning packagesATLAS, PHiPAC, SPARSITY, FFTW, SPIRALATLAS, PHiPAC, SPARSITY, FFTW, SPIRAL

Include several basic linear algebra and DSP functions

No filtering and wavelet kernels !

Existing FIR filtering and DWT softwareExisting FIR filtering and DWT softwareMatlabMatlab©©, Mathematica, S+Wavelets, WaveLab, Wave++, IPP, Mathematica, S+Wavelets, WaveLab, Wave++, IPP

Application oriented – not machine specific, or

Hand-tuned code for specific platform Automatic platform-adapted solutions not available !

Page 7: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

nSPIRAL SPIRAL systemsystemGenerator of optimized DSP transform implementationsGenerator of optimized DSP transform implementations

DSP transformspecifies

usergoes for a coffee

Formula Generator

SPL Compiler Sea

rch

En

gin

e

runtime on given platform

controlsimplementation options

controlsalgorithm generation

fast algorithmas SPL formula

C/Fortran/SIMDcode

S P

I R

A L

(or an

espres

so fo

r small tran

sfo

rms)

platform-adaptedimplementation

comes back

Page 8: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

nSPIRAL’s Four Key ConceptsSPIRAL’s Four Key Concepts

RuleRule

FormulaFormula

nmmmn

nmnmnnm LDFTITIDFTDFT

82222

84428 LFIITIFDFT

• a breakdown strategy - product of sparse matrices• captures important structural information

• recursive application of rules• uniquely defines an algorithm• efficient representation• easy manipulation

Rule TreeRule Tree

8DFT

2DFT 4DFT

2DFT 2DFT

• few constructs and primitives• uniquely defines an algorithm• can be translated into code

TransformTransform parameterized matrix 1,....,0,/2

nDFT nlk

njkle

Diagonal matrix (twiddles)

Kronecker product Identity Permutation

Page 9: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

n

x[0]

x[4]

x[2]

x[6]

x[1]

x[5]

x[3]

x[7]

X[0]

X[1]

X[2]

X[3]

X[4]

X[5]

X[6]

X[7]

W N2

W N2

W N1

W N3

W N2

x[0]

x[4]

x[1]

x[5]

x[2]

x[6]

x[3]

x[7]

X[0]

X[1]

X[2]

X[3]

X[4]

X[5]

X[6]

X[7]

W N2

W N2

W N1

W N3

W N2

Rule Trees = Formulas = AlgorithmsRule Trees = Formulas = AlgorithmsR

ule

Tre

esR

ule

Tre

esF

orm

ula

sF

orm

ula

sA

lgo

rith

ms

Alg

ori

thm

s

DFT8

DFT2DFT4

Cooley-Tukey

DFT2DFT 2

Cooley-Tukey

DFT8

DFT2 DFT4

Cooley-Tukey

DFT2 DFT2

Cooley-Tukey

Transform

Rule

8224

84248 L)DFT(IT)IDFT(DFT 8

44282428 L)DFT(IT)I(DFTDFT

4222

4222 L)DFT(IT)I(DFT

Incr

easi

ng

lev

el o

f ab

stra

ctio

nIn

crea

sin

g l

evel

of

abst

ract

ion

Page 10: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

n

Number of rule trees grows exponentially with Size of the transform

Number of applicable rules

Trigonometric Rule

nnn i SinDFTCosDFTDFT

DCT-Recursive Rule n

nIInnnn

nn

IInnnn

LSS

LSS

2)(

4/2/

2)(

4/2/

DCTSinDFTSinDFT

DCTCosDFTCosDFT

Type-Split Rule

nn PPIVn

IInn

IIn FP

2n/2

)(2/

)(2/

)( IDCTDCTDCT

Examples: Rules and Rule TreesExamples: Rules and Rule Trees

Type-Conversion Rule n

IInn

IVn DS )()( DCTDCT

DFT16

CosDFT16SinDFT16

Trigonometric

SinDFT8 DCT(II)4

DCT-Recursive

SinDFT 4 DCT(II)2 DCT(IV)

2DCT(II)2

DCT-Recursive Type-Split

......

SPIRAL also includes search over implementation options, such as the degree of loop unrolling

Recursive RHT Rule k

kkkk L2222222 111 IFIRHTRHT

Page 11: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

n

FIR Filters and the DWT in SPIRAL: ChallengesFIR Filters and the DWT in SPIRAL: Challenges

Existing filtering and wavelet algorithm representations do not capture all important structural information

SPIRAL uses a concise math framework to represent many algorithms for several transforms (DFT, DCTs, DHT, RDFT, …)

Filtering and wavelet algorithms have considerably different structure from the current SPIRAL transforms Current framework not enough to capture algorithms’ structure Existing algorithm representations need to be “spiralized” – i.e., captured

concisely using the rule formalism

Need changes on several levels: New transforms, constructs, rules, translation templates Modified search Need to rephrase several concepts of the framework

Existing framework has to be redesigned to accommodate new constructions and decomposition strategies

New framework has to be integrated in SPIRAL

Page 12: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

nFIR Filters: Rules and AlgorithmsFIR Filters: Rules and Algorithms

BA k BA k AIsk

New constructsNew constructs

column and row overlapped direct sum overlapped tensor product

1

0

k

iininnn xhxhy

1

11

01

1

01

0

)(

k

k

kn

h

hh

hh

h

hh

h

F

h

FIR Filter TransformFIR Filter Transform

TknnF hh 1I

Overlap-Add RuleOverlap-Add Rule

hh bk

bnn FIF 1/

• Localized access of the input data• Input is blocked into length b segments

Time Domain MethodsTime Domain Methods

standard “sum” notation

SPIRAL transform notation

SPIRAL rulecaptures the structure

concise rule notation

Baseline RuleBaseline Rule

Page 13: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

n

ibk

i

bvbnn TIF hh

/

1/

Blocking RuleBlocking Rule

Overlap-Save RuleOverlap-Save Rule Rk

TbkbmkLn TFTF hhhh 11/1 I

• Localized storage of the output• Output segments are computed independently

Toeplitz matrix

• Multilevel blocking• Control over input & output locality

Frequency Domain MethodsFrequency Domain Methods

1,1, knnkn EECF hh

nk x

n, 0

IknEExpanded Circulant RuleExpanded Circulant Rule

nnnC RDFTRDFTXRDFT -1 aaCyclic-RDFT RuleCyclic-RDFT Rule

Circulant matrix real DFT

nn

klnn w 222/22/ LRDFTI)X(IRDFTRDFT Radix-2 Cooley-Radix-2 Cooley-

Tukey RDFT RuleTukey RDFT Rule

Time domain - better data access structureFrequency domain - potentially reduce arithmetic cost

sparse matrix

Page 14: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

n

Filter Bank – Mallat’s AlgorithmFilter Bank – Mallat’s Algorithm

g(z-1)

{xk} = {cj,k }

{dj-1,k }

d0,k

c0,k

{dj-2,k }

{cj-1,k

}

2

2

2

2

2

2

{c1,k

}

g(z-1)

g(z-1)

h(z-1)

h(z-1)

h(z-1)

DWT of {xn}

DWT: Rules and AlgorithmsDWT: Rules and Algorithms

*,,2/12/2/2/ )()(I),(DWT),(DWT rlnnknnnnn FF Egh2ghgh

Mallat RuleMallat Rule Extension matrix

*,,2/2/ ),(I),(DWT),(DWT rlnnnnn H Eghghgh

Fused Mallat RuleFused Mallat Rule

T

kn

Tkn

nH g

hgh

22/

22/

I

I),(

single-level DWT• Redundant computations removed• Filter structure lost

Page 15: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

n

g(z-1)

h(z-1)

{xk }

2

2

{xk }

2

2

z-1

P(z-1)

{xk }odd

{xk }even

)()(

)()()(

11

111

zgzg

zhzhzP

oe

oe

1)(

01

10

)(1

1)(

01

10

)(1

/10

0

)()(

)()(1

0

10

1

1

11

11

zt

zs

zt

zs

K

K

zgzg

zhzh

i

i

oe

oe

Primal(Predict)

Dual(Update) Lifting steps

Polyphase RepresentationPolyphase Representation

222/

2/2/22/

2/2/ L)()()()()(

kne

Tn

no

Tnkne

Tn

no

Tnn FFFFH gghhgh,

Polyphase RulePolyphase Rule

Lifting SchemeLifting Scheme

Gateway to frequency domain computation

2

22/2/

2/2/

2/2/2/

2/2/2/

2/

I)(I

I)(II/1I)(

kn

nn

oTnnn

nniTn

nnn

nnn

LF

FKKH

o

i

t

sgh,

t

s Lifting RuleLifting Rule

• Asymptotically reduce cost by 50%• In-place computations

Page 16: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

n

Fn(h)

Fb(h)

T(h1) T(hk/b)

Overlap-Add

Blocking

.....

T(h1) T(hk/b1)

Blocking

.....F

n(h)

Fb(h)

T(h1) T(h

k/b)

Overlap-Add

Blocking

.....

T(h1) T(h

k/b1)

Blocking

.....Fn(h)

Fb(h)

T(h1) T(hk/b)

Overlap-Add

Blocking

.....

T(h1) T(hk/b1)

Blocking

.....

Fn(h)

Cb+l-1(h)

Fb(h)

RDFTb+l-1

RDFTb+l-1/2 RDFT2

T(hL) T(hR)

Overlap-Save

Expanded-Circulant

Cyclic-RDFT

RDFT Cooley-Tukey

Fn(h)

Fb(h)

T(h1) T(hk/b)

Overlap-Add

Blocking

.....

T(h1) T(hk/b1)

Blocking

.....

DWTn(h,g)

Cb+l-1(h)

Fn/2(he)

RDFTb+l-1

RDFT b+l-1/2 RDFT2

Polyphase

Expanded-Circulant

Cyclic-RDFT

RDFT Cooley-Tukey

DWT n/2(h,g) Fn/2(ho) Fn/2(ge) Fn/2(go)

........

........

........

........

Examples: FIR Filter and DWT Rule TreesExamples: FIR Filter and DWT Rule Trees

DWTn(h,g)

Fn/2(h)

Mallat

DWTn/2(h,g) Fn/2(g)

........

........

T(h1) T(hk/b)

Blocking

.....

All these rule trees form an extensive search space of alternative algorithms in SPIRAL

Exceedingly large number of trees even for modest size transforms

Page 17: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

nTime-Domain MethodsTime-Domain Methods

60-70% improvement over baselineSame arithmetic cost

unrolled loops 1- level blockingblocks size 2

Pentium 4Pentium 4 SUNSUN

• overlap-add, overlap-save, and blocking• all methods have the same cost• baseline method = overlap-add with b=1• sanity check – percentage of the peak performance

flops / peak performance

comparison of different time-domain methods

FIR FilterFIR Filter

Page 18: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

nFrequency Domain vs. Time DomainFrequency Domain vs. Time Domain

circulant too sparse

FD methods better on large “fat” circulantsFilter lengths > 64

Considerable discrepancy between arithmetic costs and actual runtimes

best time-domain

FIR FilterFIR Filter

arithmetic cost cross-over

runtime cross-over

CirculantCirculant

“half–dense” circulants emerge in frequency domain methods

Frequency domain – potentially lower costTime domain – beter data access patterns

Page 19: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

nDaubechies DDaubechies D4 4 DWT DWT

XeonXeon AthlonAthlon

cache size reacheddirect method

flops / peak performance for D4 and D30

• single-level DWT using D4

• 3 different classes of rule trees based on the initial rule

•Fused Mallat Rule (direct method)•Polyphase Rule (frequency domain)•Lifting Rule (lifting scheme)

comparison of runtimes

Frequency domain trees not foundLifting scheme – 30% lower cost – slower

Short WaveletShort Wavelet

Page 20: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

nDaubechies DDaubechies D3030 DWT Experiment DWT Experiment

Long WaveletLong Wavelet

XeonXeon AthlonAthlon

cache size reachedextension overhead

liftingbetter

tempcache

Different methods are found for different size rangesBest implementations vary from machine to machine

Page 21: Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura

Car

neg

ie M

ello

nSummarySummary

Fast automatically generated code for FIR filters and DWT Significant contribution – no existing solutions

Preliminary results Arithmetic cost not a reliable predictor of performance Best implementation not obvious for a human programmer Best implementations vary over different computer platforms

What lies ahead?What lies ahead? Further algorithmic and implementation optimizations

Fusion of the extension, exploiting in-place structure, scheduling,..

Entire class of wavelets and extension methods Orthogonal, biorthogonal, frames,… Periodic, symmetric, polynomial extension, zero-padding

Extend to other wavelet transforms 2D-DWT, M-band, wavelet packet transform

Utilize special instruction set extensions Single Instruction Multiple Data (SIMD) Fused Multiply-Add (FMA)