parallelism in spiral - rice universitycscads.rice.edu/franchetti.pdfspiral dct4 fftw 3.1.2 dct4...

Carnegie Mellon

Parallelism in Spiral

This work was supported by DARPA DESA program, NSF-NGS/ITR, NSF-ACR, and Intel

Franz Franchetti

Electrical and Computer EngineeringCarnegie Mellon University

Joint work withYevgen VoronenkoMarkus Püschel

… and the Spiral team (only part shown)

Carnegie Mellon

The Problem

0

2

4

6

8

10

12

14

16

18

20

22

24

26

16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144

Problem size

Discrete Fourier Transform (single precision): 2 x Core2 Extreme 3 GHzPerformance [Gflop/s]

What’s going on?

30xbest code

Numerical Recipes

Carnegie Mellon

Automatic Performance TuningCurrent vicious circle: Whenever a new platform comes out, the same functionality needs to be rewritten and reoptimized

Automatic Performance TuningBLAS: ATLAS Linear algebra: Bebop, Spike, FlameSorting Fourier transform: FFTW Linear transforms: Spiral…othersNew compiler techniques

Proceedings of the IEEE special issue, Feb. 2005But what about parallelism … ?

Carnegie Mellon

Vision Behind Spiral

Numerical problem

Computing platform

algorithm selection

compilation

hum

an e

ffort

auto

mat

ed

implementationC program

auto

mat

edalgorithm selection

compilation

implementation

Numerical problem

Computing platform

Current Future

C code a singularity: Compiler hasno access to high level information

Challenge: conquer the high abstraction level for complete automation

Carnegie Mellon

Organization

Spiral overview

Parallelization in Spiral

Results

Concluding remarks

Carnegie Mellon

SpiralLibrary generator for linear transforms (DFT, DCT, DWT, filters, ….) and recently more …

Wide range of platforms supported: scalar, fixed point, vector, parallel, Verilog, GPU

Research Goal: “Teach” computers to write fast librariesComplete automation of implementation and optimizationConquer the “high” algorithm level for automation

When a new platform comes out: Regenerate a retuned library

When a new platform paradigm comes out (e.g., vector or CMPs):Update the tool rather than rewriting the library

Intel has started to use Spiral to generate parts of their MKL library

Carnegie Mellon

How Spiral Works

Algorithm GenerationAlgorithm Optimization

ImplementationCode Optimization

CompilationCompiler Optimizations

algorithm

C code

performance

Problem specification (transform)

Fast executable

Sear

ch

controls

controls

Spiral

Spiral: Complete automation of the implementation and optimization task

Basic idea:Declarative representation of algorithms

Rewriting systems to generate and optimize algorithms

Carnegie Mellon

What is a (Linear) Transform?Mathematically: Matrix-vector multiplication

Example: Discrete Fourier transform (DFT)

input vector (signal)output vector (signal) transform = matrix

Carnegie Mellon

Transform Algorithms: Example 4-point FFTCooley/Tukey fast Fourier transform (FFT):

Algorithms reduce arithmetic cost O(n2) → O(nlog(n))Product of structured sparse matricesMathematical notation exhibits structure: SPL (signal processing language)

1 1 1 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 1 1 11 1 1 1 1 1 1

j j

j j j

⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥− − ⋅ ⋅ ⋅ ⋅ ⋅ − ⋅ ⋅ ⋅ ⋅ ⋅⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥=⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥− − ⋅ − ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥− − ⋅ ⋅ − ⋅ ⋅ ⋅ ⋅ ⋅ − ⋅ ⋅ ⋅⎣ ⎦ ⎣ ⎦ ⎣ ⎦ ⎣ ⎦ ⎣ ⎦

Fourier transform

Identity Permutation

Diagonal matrix (twiddles)

Kronecker product

Carnegie Mellon

Examples: Transforms

Spiral currently contains 55 transforms

Carnegie Mellon

Examples: Breakdown Rules (currently ≈220)

Base case rules

“Teaches” Spiral about existing algorithm knowledge (~200 journal papers)

Includes many new ones (algebraic theory, Pueschel, Moura, Voronenko)

Carnegie Mellon

SPL to Sequential Code

Example: tensor product

Carnegie Mellon

Program Generation in Spiral (Sketched)Transformuser specified

C Code:

Fast algorithmin SPLmany choices

∑-SPL:

Iteration of this process to search for the fastest

But that’s not all …

parallelizationvectorization

loop optimizations

constant foldingscheduling……

Optimization at allabstraction levels

Carnegie Mellon

Organization

Spiral overview


Results

Concluding remarks

Carnegie Mellon

SPL to Shared Memory Code: Basic IdeaGoverning construct: tensor product

Independent operation, load-balanced

AAAA

x y

Processor 0Processor 1Processor 2Processor 3

Problematic construct: permutations produce false sharing

Task: Rewrite formulas to extract tensor product + keep contiguous blocks

x y

[SC 06]

Carnegie Mellon

Step 1: Shared Memory TagsIdentify crucial hardware parameters

Number of processors: pCache line size: μ

Introduce them as tags in SPL

This means: formula A is to be optimized for p processors and cache line size μ

Carnegie Mellon

Step 2: Identify “Good” FormulasLoad balanced, avoiding false sharing

Tagged operators (no further rewriting necessary)

Definition: A formula is fully optimized if it is one of the above or of the form

where A and B are fully optimized.

Carnegie Mellon

Step 3: Identify Rewriting RulesGoal: Transform formulas into fully optimized formulas

Formulas rewritten, tags propagatedThere may be choices

Carnegie Mellon

Simple Rewriting Example

fully optimized

Loop splitting + loop exchange

Carnegie Mellon

Parallelization by Rewriting

Fully optimized (load-balanced, no false sharing) in the sense of our definition

Carnegie Mellon

Same Approach for Other Parallel ParadigmsVectorization: [IPDPS 02, VecPar 06]Message Passing: [ISPA 06]

Cg/OpenGL for GPUs: Verilog for FPGAs: [DAC 05]

MPI

With Bonelli, Lorenz, Ueberhuber, TU Vienna

With Shen, TU Denmark With Milder, Hoe, CMU

Carnegie Mellon

Going Beyond TransformsTransform = linear operator with one vector input and one vector output

Key ideas: Generalize to (possibly nonlinear) operators with several inputs and severaloutputsGeneralize SPL (including tensor product) to OL (operator language)

Cooley-Tukey FFT in OL:

Viterbi in OL:

Mat-Mat-Mult:

Carnegie Mellon

OL Rewriting RulesSPL rules reusedOnly few OL-specific rules required

New OL rules

Carnegie Mellon

Example: Viterbi Decoder in OLViterbi decoder (forward part) as operator

Viterbi kernel (butterfly)

Viterbi algorithm as breakdown rule

First non-transform supported by Spiral

Carnegie Mellon

Viterbi: Vectorization Through Rewriting

Sufficient to vectorize one inputVectorized kernelIn-register shuffle operation

Carnegie Mellon

Organization

Spiral overview


Results

Concluding remarks

Carnegie Mellon

Benchmarks

platforms

kernels

vector dual/quad coreGPU FPGA

DFT

All Spiral code shownis “push-button” generatedfrom scratch

“click”

FPGA+CPU

Carnegie Mellon

DFT (single precision): on 3 GHz 2 x Core 2 Extremeperformance [Gflop/s]

0

5

10

15

20

25

30

256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144

input size

Spiral 5.0 SPMDSpiral 5.0 sequentialIntel IPP 5.0FFTW 3.2 alpha SMPFFTW 3.2 alpha sequential

Benchmark: Vector and SMP

2 processors

4 processors

2 processors 4 processors

Memory footprint < L1$ of 1 processor

25 Gflop/s!

4-way vectorized + up to 4-threaded + adapted to the memory hierarchy

Carnegie Mellon

Benchmark: Cell (1 processor = SPE)DFT (single precision) on 3.2 GHz Cell BE (Single SPE)performance [Gflop/s]

0

2

4

6

8

10

12

14

16

16 32 48 64 80 96 128 160 256 384 512 1024input size

Generated using the simulator; run at Mercury (thanks to Robert Cooper)

Joint work with Th. Peter (ETH Zurich), S. Chellappa, M. Telgarsky, J. Moura (CMU)

Carnegie Mellon

DCT4, Multiples of 32: 4-way VectorizedDCT (single precision) 2.66 GHz Core2 (4-way 32-bit SSE)performance [Gflop/s]

0

1

2

3

4

5

6

7

8

9

10

32 64 96 128 160 192 224 256

input size

Spiral DCT4FFTW 3.1.2 DCT4 (k11)IPP 5.1 DCT2 (?)

novel algorithm (algebraic algorithm theory)

Carnegie Mellon

DFT, 8-way Vectorized: All Sizes Up To 80

0

2

4

6

8

10

12

14

16

18

2 7 12 17 22 27 32 37 42 47 52 57 62 67 72 77input size

Spiral SSE2

Intel IPP 5.1

DFT (16-bit integer) on 2.66 GHz Core2 Duo (8-way SSE2)performance [Gflop/s]

first 8-way DFTs for all sizesarbitrary vector length /arbitrary DFT sizes in principle solved

Carnegie Mellon

Benchmark: GPU

0

1

2

3

4

5

6

8 10 12 14 16 18 20 22 24log2(input size)

WHT (single precision) on 3.6 GHz Pentium 4 with Nvidia 7900 GTXperformance [Gflops/s]

Spiral CPU Spiral GPU

Joint work with H. Shen, TU Denmark

CPU + GPU

Carnegie Mellon

Benchmark: FPGADFT 256 on Xilinx Virtex 2 Pro FPGAinverse throughput (gap) [us]

0

1

2

3

4

5

6

7

8

0 5000 10000 15000 20000 25000Area [slices]

Xilinx Logicore 3.2

Spiral

better

Joint work with P. Milder, J. Hoe (CMU)

competitive with professional designsmuch larger set of performance/area trade-offs

(Pareto-optimal designs)

Carnegie Mellon

Benchmark: Hardware Accelerator (FPGA)Xilinx Virtex 2 Pro FPGA: 1M gates @ 100 MHz + 2 PowerPC 405 @ 300 MHz

Fixed set of accelerators speed up a whole library

better

0

100

200

300

400

500

600

700

16 32 64 128 256 512 1024 2048 4096 8192

Problem size

[Mflop/s]

Software only

Software + hardware

Joint work with P. D’Alberto (Yahoo), A. Sandryhaila, P. Milder, J. Hoe, J . M. F. Moura (CMU), J. Johnson (Drexel)

6.5x

native sizes

Carnegie Mellon

Benchmarks

platforms

kernels

vector dual/quad coreGPU FPGA

DFT

filter

GEMM

Viterbi

All Spiral code shownis “push-button” generatedfrom scratch

“click”

FPGA+CPU

Carnegie Mellon

Benchmark: Finite Impulse Response Filter

0

5

10

15

20

25

30

35

40

45

6 7 8 9 10 11 12 13 14 15 16 17 18 19 20log2(input size)

theoretical peak performance: 74.66 Gflop/s Spiral 8 taps

Spiral 32 taps

IPP 32 taps

IPP 8 taps

FIR filter (double precision) on 2.33 GHz 2x Core 2 Quad (8 threads)performance [Gflop/s]

Carnegie Mellon

Beyond Transforms : Viterbi DecodingViterbi decoding (8-bit) on 2.66 GHz Core 2 Duoperformance [Gbutterflies/s]

0

0.5

1

1.5

2

2.5

6 7 8 9 10 11 12 13log2(size of state machine)

Spiral 16-way vectorized

Spiral scalar

Karn's Viterbi decoder(hand-tuned assembly)

1 butterfly = ~22 ops

Vectorized using practically the same rules as for DFT

Joint work with S. Chellappa, CMU Karn: http://www.ka9q.net/code/fec/

Carnegie Mellon

First Results: Matrix-Matrix-Multiply

DGEMM on 3 GHz Core 2 Duo (1 thread)performance [Gflop/s]

0

2

4

6

8

10

12

2 4 8 16 32 64 128 256 512

N (matrix size is N x N)

Goto

Spiral

triple loop

work with F. de Mesmay, CMU

Carnegie Mellon

Organization

Spiral overview


Results

Concluding remarks

Carnegie Mellon

ConclusionsAutomatic generation of very fast and fastest numerical kernels is possible and desirable

High level language and approachAlgorithm generationAlgorithm optimization

Same approach for loop optimization, different forms of parallelism, SW and HW implementations

Carnegie Mellon

Spiral Web Interface @spiral.net (beta version)

1. Select platform

2. Select functionality

3.“click”

http://www.spiral.net/

parallelism in spiral - rice universitycscads.rice.edu/franchetti.pdfspiral dct4 fftw 3.1.2 dct4...

Documents