- Datacenters are increasingly adopting FPGAs
- Motherboards with FPGAs are now on the horizon
Automatic Generation of Efficient Accelerator Designs for Reconfigurable Hardware
David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun
Contact: {dkoeplin, raghup17, yaqiz}@stanford.edu
[Figure: Sample design spaces for TPCHQ6, GDA, and BlackScholes. Nine panels (a-i) plot runtime in cycles (log scale, roughly 10^6 to 10^9) against resource usage (% of maximum ALMs, DSPs, and BRAMs). Legend: metapipelines only; Pareto-optimal, metapipelines only; one or more Sequentials; Pareto-optimal, 1+ Sequentials; invalid design; Pareto-optimal design used in CPU comparison.]
Benchmark | Designs | Space Search Time | Time per Design
Dot Product | 86,628 | 24 sec | 2 ms/design
Outer Product | 57,752 | 15 sec | 2 ms/design
TPCHQ6 | 43,314 | 16 sec | 3 ms/design
BlackScholes | 4,000 | 64 sec | 128 ms/design
Matrix Multiply | 200,000 | 38 min | 90 ms/design
K-Means | 200,000 | 9 min | 21 ms/design
GDA | 200,000 | 40 min | 96 ms/design
Comparison with Vivado HLS (using GDA as an example):
Outer Loop Pipelining | Designs | Design Space Search Time | Time per Design
Always disabled | 250 | 20 min | 4.75 sec/design
Optional as part of space | 250 | 7.7 hours | 1.85 min/design
- Searched minimum of full space or 200K random designs
- Loop pipelining and parallelization always explored when valid
- Time per design normalized to single thread performance
Depending on the space, our DSE is 49.5× to 1156× faster than Vivado HLS
General intuition: max out main memory bandwidth if possible, and duplicate computation to match the rate of incoming data.
Reality: hardware design complexity grows exponentially with program complexity; more arrays and loops mean more hardware parameters.
DHDL can drive automatic search of the parameter space without exposing artifacts of low-level hardware design to the end user. We demonstrate random sampling here to give a view of the full space, but better search methods exist.
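The random-sampling DSE above can be sketched in a few lines: sample parameter settings, score each with a cost model, and keep the (area, runtime) Pareto frontier. The cost model and parameter values below are made-up stand-ins for illustration, not DHDL's actual area/latency model.

```python
import random

def evaluate(tile, par, pipelined):
    # Hypothetical cost model: more parallelism costs area but saves time.
    factor = par * (2 if pipelined else 1)
    area = tile * factor
    runtime = 1e6 / factor + tile
    return (area, runtime)

def pareto(points):
    # A point survives if no other point is at least as good in both
    # dimensions and strictly different.
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in points)]

random.seed(0)
samples = [evaluate(random.choice([64, 128, 256, 512]),
                    random.choice([1, 2, 4, 8, 16]),
                    random.choice([True, False]))
           for _ in range(1000)]
frontier = pareto(samples)
```

Real searches would replace `evaluate` with the compiler's area and latency estimates, and random sampling with a smarter strategy.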
Benchmark | ALMs | DSPs | BRAMs | Runtime
Dot Product | 3.3% | 0.0% | 9.5% | 2.8%
Outer Product | 2.2% | 41.9% | 8.3% | 3.4%
TPCHQ6 | 2.4% | 0.0% | 3.4% | 3.1%
BlackScholes | 3.4% | 4.3% | 17.6% | 3.6%
Matrix Multiply | 13.8% | 15.8% | 14.2% | 18.4%
K-Means | 7.0% | 0.0% | 21.9% | 7.7%
GDA | 2.9% | 13.3% | 10.6% | 8.6%
Average | 5.0% | 10.7% | 12.2% | 6.8%
- Estimated each application's Pareto frontier using simple random sampling
- Generated, synthesized, and ran five Pareto-optimal designs per benchmark
- Average of absolute value of errors across designs shown in the table
Model’s self-relative accuracy is sufficient to drive design space exploration
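For clarity, the "average of absolute value of errors" reported above is the mean absolute relative error between estimated and measured values. A small sketch with made-up numbers (not the benchmark data):

```python
def avg_abs_pct_error(estimated, measured):
    # Mean of |estimate - measurement| / measurement, as a percentage.
    errs = [abs(e - m) / m for e, m in zip(estimated, measured)]
    return 100.0 * sum(errs) / len(errs)

# Hypothetical estimated vs. measured resource counts for three designs.
est  = [105.0, 98.0, 88.0]
meas = [100.0, 100.0, 80.0]
error = avg_abs_pct_error(est, meas)  # (5% + 2% + 10%) / 3
```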
Key advantages of FPGAs:
- Performance/watt (compared to CPUs and GPUs)
- Lower design cost and greater reconfigurability (compared to ASICs)
But accelerator design and optimization remains difficult:
[Figure: Speedup over a single CPU core per benchmark (bar chart, scale 0 to 20): 0.69x, 0.79x, 1.05x, 2.54x, 18.42x, 2.36x, 1.28x.]
// Core GDA computation
val sigma = sum(0, m){ i =>
  val mu = if (y(i)) mu1 else mu0
  (x(i) - mu).t ** (x(i) - mu)
}
High-level user program in a DSL with parallel patterns
  -> Staging [1,2]
  -> Parallel Pattern IR (Map, Reduce, Filter, etc.)
  -> Pattern Transformations [3]
  -> DHDL
  -> Design Space Exploration (proposes Params, consumes Estimates)
  -> Design Proposal

Explored parameters:
- Pattern and array tile sizes
- Pattern parallelization factors
- Coarse-grain pipelining toggles
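The design space is the cross product of these parameters; spaces of tens of thousands of designs (e.g. 86,628 for Dot Product) arise naturally this way. A sketch with illustrative parameter values, not the benchmarks' real ranges:

```python
from itertools import product

# Hypothetical tunable parameters for one small kernel.
tile_sizes = [96, 192, 384, 768]   # tile size per array/pattern
outer_par  = [1, 2, 4, 8]          # coarse-grain parallelization factors
inner_par  = [1, 2, 4, 8, 16]      # inner-loop parallelization factors
pipelining = [True, False]         # metapipeline vs. sequential toggle

# Every combination is one candidate design.
space = list(product(tile_sizes, outer_par, inner_par, pipelining))
print(len(space))  # 4 * 4 * 5 * 2 = 160 candidate designs
```

Each added array or loop nest multiplies the space by its own parameter ranges, which is why the spaces grow so quickly.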
References
Estimates per design:
- Total ALMs
- Total DSPs
- Total BRAMs
- Runtime
Modeled runtime includes off-chip main memory accesses. Off-chip access time is estimated based on a target-specific contention model.
[1] Arvind K. Sujeeth, et al. Delite: A compiler architecture for performance-oriented embedded domain-specific languages. In TECS, July 2014.
[2] Arvind K. Sujeeth, et al. Composition and reuse with compiled domain-specific languages. In ECOOP, 2013.
[3] Raghu Prabhakar, et al. Generating configurable hardware from parallel patterns. In ASPLOS, April 2016.
HDL? (VHDL, Verilog)
- Detailed hardware design; slow evaluation of design tradeoffs
- Too low level for most software programmers
HLS? (Vivado HLS, OpenCL SDK)
- Require pragmas and a hardware-specific coding style for good results
- Still too low level for most software programmers
Key domains:
- machine learning
- image processing
- financial analytics
- internet search
Sample Design Space
Design Space Exploration Runtimes
Model Accuracy
Accelerator Design Flow
Transformed DHDL -> DHDL Compiler (with Area and Latency Model) -> MaxJ Code Generation -> FPGA Config Bitstream Generation -> FPGA
Motivation
Delite Hardware Definition Language
CPU Comparison
New intermediate language for describing parameterized hardware designs.
- Captures parallelism, pipelining, and locality information at any loop nest level
- Supports arbitrarily nested coarse-grain pipelines (“metapipelines”)
- Can be generated automatically from higher level DSLs with parallel patterns [3]
Tiled dot product in DHDL:
val out = ArgOut[Flt]
val vectorA = OffChipMem[Flt](N)
val vectorB = OffChipMem[Flt](N)
MetaPipe.reduce(N by B, out){ i =>
  val tileA = BRAM[Flt](B)
  val tileB = BRAM[Flt](B)
  val acc = Reg[Flt]
  Parallel {
    tileA := vectorA(i::i+B)
    tileB := vectorB(i::i+B)
  }
  Pipe.reduce(B by 1, acc){ j =>
    tileA(j) * tileB(j)
  }{_+_}
  acc.value
}{_+_}
- Communication between FPGA and host through memory-mapped I/O (ArgOut)
- Declaration of local memories and scalar accumulators (BRAM, Reg)
- Pipelined, parallelizable outer sum (MetaPipe.reduce)
- Data transfer between off-chip arrays and local memories (Parallel tile loads)
- Pipelined, parallelizable inner sum (Pipe.reduce)
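The computation the DHDL dot product above performs can be modeled in plain software: the outer loop reduces per-tile partial sums, and the inner loop reduces element products within one tile. A sketch assuming B evenly divides N:

```python
def tiled_dot(vectorA, vectorB, B):
    # Software model of the DHDL metapipeline; assumes B divides len(vectorA).
    N = len(vectorA)
    out = 0.0
    for i in range(0, N, B):        # MetaPipe.reduce(N by B, out)
        tileA = vectorA[i:i+B]      # tileA := vectorA(i::i+B)
        tileB = vectorB[i:i+B]      # tileB := vectorB(i::i+B)
        acc = 0.0
        for j in range(B):          # Pipe.reduce(B by 1, acc)
            acc += tileA[j] * tileB[j]
        out += acc                  # outer {_+_} combines tile sums
    return out

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
result = tiled_dot(a, b, 2)  # 70.0, same as an untiled dot product
```

In hardware, the tile loads, the inner reduction, and the outer accumulation run as concurrent metapipeline stages rather than sequential loops.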
[Diagram: generated dot product hardware. Two parallel TileLoad units stream vectorA and vectorB into tileA and tileB; Pipe 1 multiplies elements and accumulates into acc; Pipe 2 adds partial sums into out; both stages form MetaPipe 1. Tunable parameters: tile size (B), load rate, parallelism factors #1 and #2, and a pipelining toggle.]
DHDL Hardware Templates
- Synthesized and ran a single Pareto-optimal design per benchmark on a Max4 MAIA board at 150 MHz fabric clock
- Compared to single-threaded optimized C benchmarks on an Intel Xeon E5-2697 CPU at 2.70 GHz
- Speedups relative to a single CPU core shown on the right
Template | Description | Parameters
Primitives
+, -, *, /, etc. | Basic math, logic, and control | Vector width
Local Load/Store | Load/store from on-chip memories | Vector width, stride
Memories
OffChipMem | N-dimensional off-chip array | Dimensions
BRAM | On-chip scratchpad | Size, buffering, banking
Priority Queue | On-chip sorting queue | Length, buffering
Reg | Accumulator register | Buffering
Controllers
Counter | Loop indices | Parallelization, pattern
Parallel | Fork-join coarse-grained parallelism |
Pipe | Pipelined inner-loop body | Parallelization, pattern
Sequential | Non-pipelined, sequential stages | Parallelization, pattern
MetaPipe | Coarse-grained pipeline | Parallelization, pattern
Data Transfer
Tile Load/Store | Load/store from off-chip arrays | Tile size, load rate