

Datacenters are increasingly adopting FPGAs

Motherboards with FPGAs are now on the horizon

Automatic Generation of Efficient Accelerator Designs for Reconfigurable Hardware

David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun

{dkoeplin, raghup17, yaqiz}@stanford.edu

[Figure: Sample design spaces for TPCHQ6, GDA, and BlackScholes (panels a-i). Each panel plots runtime in cycles (log scale, roughly 10^6 to 10^9) against resource usage as a percentage of maximum ALMs, DSPs, and BRAMs (20%-100%). Legend: metapipelines only; Pareto-optimal, metapipelines only; one or more Sequentials; Pareto-optimal, 1+ Sequentials; invalid design; Pareto-optimal design used in the CPU comparison.]

Benchmark        Designs   Search Time   Time per Design
Dot Product       86,628   24 sec        2 ms/design
Outer Product     57,752   15 sec        2 ms/design
TPCHQ6            43,314   16 sec        3 ms/design
BlackScholes       4,000   64 sec        128 ms/design
Matrix Multiply  200,000   38 min        90 ms/design
K-Means          200,000    9 min        21 ms/design
GDA              200,000   40 min        96 ms/design

Comparison with Vivado HLS (using GDA as an example):

Outer Loop Pipelining       Designs   Search Time   Time per Design
Always disabled                 250   20 min        4.75 sec/design
Optional as part of space       250   7.7 hours     1.85 min/design

- Searched the smaller of the full space and 200K random designs

- Loop pipelining and parallelization always explored when valid

- Time per design normalized to single-thread performance

Depending on the space, our DSE is 49.5× to 1156× faster than Vivado HLS
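A quick arithmetic check of those factors from the per-design times in the tables above (GDA: 96 ms/design for our DSE, versus 4.75 sec and 1.85 min per design for Vivado HLS):

```python
# Per-design search times for GDA, taken from the tables above.
ours_s = 0.096             # our DSE: 96 ms per design
vivado_fixed_s = 4.75      # Vivado HLS, outer-loop pipelining always disabled
vivado_full_s = 1.85 * 60  # Vivado HLS, pipelining explored as part of the space

print(f"{vivado_fixed_s / ours_s:.1f}x")  # → 49.5x
print(f"{vivado_full_s / ours_s:.0f}x")   # → 1156x
```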

General intuition: max out main memory bandwidth if possible, and duplicate computation to match the rate of incoming data. In reality, hardware design complexity grows exponentially with program complexity: more arrays and loops mean more hardware parameters. DHDL can drive an automatic search of this parameter space without exposing artifacts of low-level hardware design to the end user. We use random sampling here to give a view of the full space, but better search methods exist.
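A minimal sketch of this style of search, assuming a made-up cost model in place of the DHDL area and latency estimators: sample random parameter settings and keep the Pareto-optimal (area, runtime) points.

```python
import random

def evaluate(tile, par, pipeline):
    """Hypothetical stand-in for the DHDL area/runtime estimators."""
    area = tile * par * (2 if pipeline else 1)          # more parallelism -> more area
    runtime = 1e6 / (par * (1.5 if pipeline else 1.0))  # more parallelism -> fewer cycles
    return area, runtime

def pareto(points):
    """Keep points not weakly dominated in both area and runtime."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

random.seed(0)
samples = [evaluate(tile=random.choice([16, 32, 64, 128]),
                    par=random.choice([1, 2, 4, 8]),
                    pipeline=random.choice([True, False]))
           for _ in range(200)]
frontier = pareto(samples)
```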

Benchmark        ALMs    DSPs    BRAMs   Runtime
Dot Product       3.3%    0.0%    9.5%    2.8%
Outer Product     2.2%   41.9%    8.3%    3.4%
TPCHQ6            2.4%    0.0%    3.4%    3.1%
BlackScholes      3.4%    4.3%   17.6%    3.6%
Matrix Multiply  13.8%   15.8%   14.2%   18.4%
K-Means           7.0%    0.0%   21.9%    7.7%
GDA               2.9%   13.3%   10.6%    8.6%
Average           5.0%   10.7%   12.2%    6.8%

- Estimated each application's Pareto frontier using simple random sampling

- Generated, synthesized, and ran five Pareto-optimal designs per benchmark

- Average absolute error across those designs is shown in the table above

The model's self-relative accuracy is sufficient to drive design space exploration.
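The accuracy numbers above are averages of absolute relative error between modeled and measured values. A small sketch of that metric (hypothetical ALM counts, not the benchmark data):

```python
def mean_abs_rel_error(estimated, measured):
    """Average of |estimate - measurement| / measurement across designs."""
    errs = [abs(e - m) / m for e, m in zip(estimated, measured)]
    return sum(errs) / len(errs)

# Hypothetical ALM counts for five designs of one benchmark.
estimated = [1000, 2100, 4000, 7800, 16000]
measured  = [1040, 2000, 4200, 8000, 15500]
print(f"{100 * mean_abs_rel_error(estimated, measured):.1f}%")  # → 3.9%
```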

Key advantages of FPGAs:
- Performance/watt (compared to CPUs and GPUs)
- Design cost and greater reconfigurability (compared to ASICs)

But accelerator design and optimization remains difficult:

[Figure: Speedup of the FPGA designs over a single CPU core, one bar per benchmark (axis 0-20): 0.69, 0.79, 1.05, 2.54, 18.42, 2.36, 1.28.]

// Core GDA computation
val sigma = sum(0, m){i =>
  val mu = if (y(i)) mu1 else mu0
  (x(i)-mu).t ** (x(i)-mu)
}
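In conventional terms, this pattern is a sum of outer products of mean-shifted samples, with the mean selected per sample by its label. A plain-Python sketch of the same computation (tiny hand-made data, no DSL):

```python
def gda_sigma(x, y, mu0, mu1):
    """Sum over i of (x[i]-mu)^T (x[i]-mu), choosing mu by the label y[i]."""
    d = len(mu0)
    sigma = [[0.0] * d for _ in range(d)]
    for xi, yi in zip(x, y):
        mu = mu1 if yi else mu0
        diff = [a - b for a, b in zip(xi, mu)]      # mean-shifted sample
        for r in range(d):
            for c in range(d):
                sigma[r][c] += diff[r] * diff[c]    # accumulate outer product
    return sigma

x = [[1.0, 2.0], [3.0, 4.0]]
y = [False, True]
mu0, mu1 = [1.0, 1.0], [2.0, 2.0]
print(gda_sigma(x, y, mu0, mu1))  # → [[1.0, 2.0], [2.0, 5.0]]
```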

[Diagram: Compilation and exploration flow. A high-level user program in a DSL with parallel patterns (Map, Reduce, Filter, etc.) is staged [1,2] into a parallel pattern IR; pattern transformations [3] lower it to DHDL. Design space exploration proposes parameter settings (Params) within the DHDL design space and reads back estimates, converging on a design proposal.]

Each design proposal fixes:
- Pattern and array tile sizes
- Pattern parallelization factors
- Coarse-grain pipelining toggles

Per-design estimates:
- Total ALMs
- Total DSPs
- Total BRAM
- Runtime

Modeled runtime includes off-chip main memory accesses. Off-chip access time is estimated using a target-specific contention model.
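As a loose illustration of such a contention model (an assumed form, not the paper's actual target-specific model): each concurrent tile load divides the available off-chip bandwidth, so load time grows with the number of contending streams.

```python
def tile_load_cycles(words, word_bytes, peak_bytes_per_cycle, streams, fixed_latency=100):
    """Cycles to load one tile when `streams` loads contend for off-chip bandwidth.

    Illustrative model only: effective bandwidth = peak / streams, plus a
    fixed DRAM access latency. Not the paper's target-specific model.
    """
    effective_bw = peak_bytes_per_cycle / streams
    return fixed_latency + (words * word_bytes) / effective_bw

# One 1024-word tile of 4-byte floats at 16 bytes/cycle peak, alone vs. shared.
alone = tile_load_cycles(1024, 4, 16.0, streams=1)   # → 356.0
shared = tile_load_cycles(1024, 4, 16.0, streams=4)  # → 1124.0
```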

[1] Arvind K. Sujeeth, et al. Delite: A compiler architecture for performance-oriented embedded domain-specific languages. In TECS, July 2014.
[2] Arvind K. Sujeeth, et al. Composition and reuse with compiled domain-specific languages. In ECOOP, 2013.
[3] Raghu Prabhakar, et al. Generating configurable hardware from parallel patterns. In ASPLOS, April 2016.

HDL? VHDL, Verilog
- Detailed hardware design; slow evaluation of design tradeoffs
- Too low level for most software programmers

HLS? Vivado, OpenCL SDK
- Require pragmas and a hardware-specific coding style for good results
- Still too low level for most software programmers

Key domains:
- machine learning
- image processing
- financial analytics
- internet search

[Poster section headings: Motivation · Sample Design Space · Design Space Exploration Runtimes · Model Accuracy · Accelerator Design Flow · CPU Comparison]

[Diagram: Accelerator design flow. Transformed DHDL → DHDL Compiler (backed by the area and latency model) → MaxJ code generation → bitstream generation → FPGA config → FPGA.]

Delite Hardware Definition Language

DHDL is a new intermediate language for describing parameterized hardware designs.

- Captures parallelism, pipelining, and locality information at any loop nest level

- Supports arbitrarily nested coarse-grain pipelines (“metapipelines”)

- Can be generated automatically from higher level DSLs with parallel patterns [3]

Tiled dot product in DHDL:

// Communication between FPGA and host through memory-mapped I/O
val out = ArgOut[Flt]
val vectorA = OffChipMem[Flt](N)
val vectorB = OffChipMem[Flt](N)
MetaPipe.reduce(N by B, out){i =>   // pipelined, parallelizable outer sum
  // Declaration of local memories and scalar accumulators
  val tileA = BRAM[Flt](B)
  val tileB = BRAM[Flt](B)
  val acc = Reg[Flt]
  Parallel {
    // Data transfer between off-chip arrays and local memories
    tileA := vectorA(i::i+B)
    tileB := vectorB(i::i+B)
  }
  Pipe.reduce(B by 1, acc){j =>     // pipelined, parallelizable inner sum
    tileA(j) * tileB(j)
  }{_+_}
  acc.value
}{_+_}
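The structure above mirrors an ordinary blocked reduction. As a plain-Python analogue of the same tiling (illustrative only; DHDL constructs like Parallel and MetaPipe have no direct software equivalent):

```python
def tiled_dot(vector_a, vector_b, B):
    """Blocked dot product: load a tile of each vector, reduce it, accumulate."""
    out = 0.0
    for i in range(0, len(vector_a), B):  # outer loop over tiles (MetaPipe.reduce)
        tile_a = vector_a[i:i + B]        # tile loads (the Parallel block)
        tile_b = vector_b[i:i + B]
        acc = sum(a * b for a, b in zip(tile_a, tile_b))  # inner sum (Pipe.reduce)
        out += acc                        # outer combine, {_+_}
    return out

print(tiled_dot([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], B=2))  # → 70.0
```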

[Diagram: Hardware generated for the tiled dot product. Two parallel TileLoads stream vectorA and vectorB into tileA and tileB; Pipe 1 multiplies elements and accumulates into acc; Pipe 2 adds each tile's result into out; both pipes are stages of MetaPipe 1. Annotated design parameters: tile size (B), load rate, parallelism factors #1 and #2, pipelining toggle.]

DHDL Hardware Templates

- Synthesized and ran a single Pareto-optimal design per benchmark on a Max4 MAIA board at a 150 MHz fabric clock

- Compared against single-threaded, optimized C benchmarks on an Intel Xeon E5-2697 CPU at 2.70 GHz

- Speedups relative to a single CPU core are shown in the figure above

Category       Template           Description                            Parameters
Primitives     +, -, *, /, etc.   Basic math, logic, and control         Vector width
               Local Load/Store   Load/store from on-chip memories       Vector width, stride
Memories       OffChipMem         N-dimensional off-chip array           Dimensions
               BRAM               On-chip scratchpad                     Size, buffering, banking
               Priority Queue     On-chip sorting queue                  Length, buffering
               Reg                Accumulator register                   Buffering
Controllers    Counter            Loop indices                           Parallelization, pattern
               Parallel           Fork-join coarse-grained parallelism   (none)
               Pipe               Pipelined inner-loop body              Parallelization, pattern
               Sequential         Non-pipelined, sequential stages       Parallelization, pattern
               MetaPipe           Coarse-grained pipeline                Parallelization, pattern
Data Transfer  Tile Load/Store    Load/store from off-chip arrays        Tile size, load rate