Fast Design Space Exploration using Vivado HLS: Non-Binary LDPC Decoders
Joao Andrade∗, Nithin George†, Kimon Karras‡, David Novo†, Vitor Silva∗, Paolo Ienne†, Gabriel Falcao∗
∗ Instituto de Telecomunicacoes, Dept. Electrical and Computer Engineering, Univ. of Coimbra, Portugal
† Ecole Polytechnique Federale de Lausanne (EPFL), School of Comp. and Comm. Sciences, Switzerland
‡ Xilinx Research Labs, Dublin, Ireland
Introduction: Non-binary LDPC Decoder on FPGAs
- We explore a complex error-correction signal processing algorithm:
  - non-binary LDPC decoding (FFT-SPA)
[Figure: factor graph with check nodes CN1 to CN3 and variable nodes VN1 to VN6; messages mv(x), mvc(x), mcv(x), mvc(z), mcv(z) and m∗v(x) pass through the permute, Walsh-Hadamard Transform (F) and depermute stages.]
Fig. 1 Non-binary LDPC factor graph example and message-passing algorithm.
- We utilize a high-level synthesis tool to design an LDPC decoder FPGA accelerator
- Vivado HLS allows:
  - fast design space exploration via directive optimizations
  - C/C++ code as input for generating an FPGA accelerator
Proposed LDPC Decoder Accelerator
[Figure: block diagram of the kernels vn_proc, permute, fwht, cn_proc, fwht and depermute placed between DRAM and BRAMs; each kernel is drawn as nested loops that fetch data, compute and store data.]
Fig. 2 Non-binary LDPC decoder base solution block diagram.
- LDPC decoder characteristics
  - 3 dimensions of computation:
    - N×d_v / M×d_c probability mass functions (pmfs)
    - 2^m probabilities per pmf
    - d_v/d_c pmfs per node
    - 2^m is the Galois field dimension
  - each dimension is defined over a computation loop
- Applied LDPC computation:
  - Fast Walsh-Hadamard transform (fwht)
  - Hadamard products (vn_proc/cn_proc)
  - Cyclic permutations ((de)permute)
- Under-the-hood transformations:
  - 3 different nested loop structures:
    - cn_proc/vn_proc: 3 loops, triple-nested
    - depermute/permute: 2 loops, double-nested
    - fwht: 5 loops, triple-nested
  - no computation is performed directly on DRAM data → high bandwidth is available, but access latency is high
  - data is moved to BRAM for computation in the prologue and back to DRAM in the epilogue
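The fetch/compute/store pattern and two of the kernels named above can be sketched in plain C++ of the kind an HLS tool takes as input. This is only an illustrative sketch, not the poster's code: the sizes `M_BITS`, `Q` and `N_PMFS`, the function signatures, the loop organization, and the assumption that `permute` sees the pmf stored in exponent order (so that the permutation is a cyclic shift of the nonzero-symbol entries) are all hypothetical.

```cpp
// Hypothetical sizes, chosen only for illustration.
constexpr int M_BITS = 2;        // m
constexpr int Q = 1 << M_BITS;   // 2^m probabilities per pmf
constexpr int N_PMFS = 4;        // pmfs processed by one call

// In-place, unnormalized fast Walsh-Hadamard transform of every pmf.
// `dram` stands in for external memory, `bram` for on-chip storage.
void fwht(float dram[N_PMFS * Q]) {
    float bram[N_PMFS][Q];

    // Prologue: fetch data from DRAM into the local buffer.
    for (int p = 0; p < N_PMFS; ++p)
        for (int x = 0; x < Q; ++x)
            bram[p][x] = dram[p * Q + x];

    // Compute: one loop per dimension (pmf, butterfly stage, butterfly).
    for (int p = 0; p < N_PMFS; ++p) {
        for (int s = 0; s < M_BITS; ++s) {
            const int half = 1 << s;
            for (int b = 0; b < Q / 2; ++b) {
                // Insert a zero bit at position s of b to get the upper index.
                const int i = ((b >> s) << (s + 1)) | (b & (half - 1));
                const int j = i + half;
                const float u = bram[p][i];
                const float v = bram[p][j];
                bram[p][i] = u + v;
                bram[p][j] = u - v;
            }
        }
    }

    // Epilogue: store the results back to DRAM.
    for (int p = 0; p < N_PMFS; ++p)
        for (int x = 0; x < Q; ++x)
            dram[p * Q + x] = bram[p][x];
}

// Cyclic permutation of the Q-1 nonzero-symbol entries by `shift` >= 0
// positions; entry 0 (the zero symbol) stays in place.
// Assumption: the pmf is stored in exponent order.
void permute(const float in[Q], float out[Q], int shift) {
    out[0] = in[0];
    for (int i = 0; i < Q - 1; ++i)
        out[1 + (i + shift) % (Q - 1)] = in[1 + i];
}

// Inverse of permute() for the same `shift`.
void depermute(const float in[Q], float out[Q], int shift) {
    permute(in, out, (Q - 1) - shift % (Q - 1));
}
```

Keeping each dimension in its own loop, as here, is what later lets a directive be attached to exactly one dimension.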
High-Level Architecture
[Figure: board with DRAM 0 and DRAM 1 attached to an FPGA containing the memory interface, AXI4 interconnects, BRAMs 1 to K and HLS IP cores 1 to K; the die shot labels core 0, core 1, core 2, the AXI4 interconnect and the memory interface.]
Fig. 3 High-level architecture and die shot with 3 decoders P&R’d.
- Vivado HLS exports an accelerator design as an IP-XACT core without external I/O, clock interface or AXI4 data movers
  - 1 DRAM and AXI-M controller per SODIMM (2)
  - 1 port on the AXI-M controllers per instantiated accelerator (K)
Proposed Accelerator Optimizations
Table 2 Optimizations carried out for each solution.
Solutions: I II III IV V VI
Unrolling: X X X X
Pipelining: X X
Array partitioning: X X X
- We combined the following optimizations across the 6 tested solutions:
  - loop unrolling (II, V)
  - loop pipelining (III, VI)
  - array partitioning (IV, V, VI)
- In some cases, optimization directives are not applied until the code is refactored
- Every dimension where parallelism is exploited must be defined in its own loop; otherwise unrolling or pipelining becomes unmanageable
  - in fact, some optimization configurations do not complete C synthesis
- Pipelining targets an initiation interval of II=1
- Unrolling is complete
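As a rough illustration of how the three directive types attach to a loop nest, the sketch below applies them to a Hadamard (element-wise) product. The `ARRAY_PARTITION`, `PIPELINE` and `UNROLL` pragmas are standard Vivado HLS directives, but the function name, the array shapes and the choice of which loop receives which directive are assumptions made for this example and do not reproduce any of the six solutions. An ordinary C++ compiler ignores the pragmas, so the function can still be checked in software.

```cpp
// Hypothetical sizes, chosen only for illustration.
constexpr int Q = 4;        // 2^m probabilities per pmf
constexpr int N_PMFS = 8;   // number of pmfs

// Element-wise product of two sets of pmfs.
void hadamard_product(const float a[N_PMFS][Q],
                      const float b[N_PMFS][Q],
                      float out[N_PMFS][Q]) {
// Split the probability dimension into separate memories so that all
// Q elements of a pmf can be accessed in the same cycle.
#pragma HLS ARRAY_PARTITION variable=a complete dim=2
#pragma HLS ARRAY_PARTITION variable=b complete dim=2
#pragma HLS ARRAY_PARTITION variable=out complete dim=2
    for (int p = 0; p < N_PMFS; ++p) {
// Ask the tool to start a new pmf every clock cycle.
#pragma HLS PIPELINE II=1
        for (int x = 0; x < Q; ++x) {
// Complete unrolling: Q multipliers operate in parallel.
#pragma HLS UNROLL
            out[p][x] = a[p][x] * b[p][x];
        }
    }
}
```

Because each dimension (pmf index, probability index) has its own loop, each directive can be enabled or removed independently, which is the kind of variation a design space exploration steps through.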
Experimental Results: Latency vs. LUTs utilization
[Figure: three panels; x-axis: Optimizations (I to VI); left y-axis: Latency [cycles], log scale from 10^0 to 10^6; right y-axis: Frequency [MHz], 160 to 260; legend: vn_proc, cn_proc, permute, depermute, fwht.]
Fig. 4 Latency and clock frequency of operation of each LDPC accelerator solution for GF({2^2, 2^3, 2^4}).
- Applying the different optimizations, we obtain a set of Pareto points with tradeoffs in frequency and LUT utilization:
  - providing more memory ports (higher bandwidth) is useful only if the ALUs consume the data
  - clock frequencies vary widely across the solutions (160 to 260 MHz)
  - pipelining has diminishing returns in latency reduction (depermute/permute) for increasing Galois field dimensions
Comparison with RTL-based Decoders
Table 1 Decoding throughput, FPGA utilization and frequency of operation.
Decoder               m   K   LUT [%]  FF [%]  BRAM [%]  DSP [%]  Thr. [Mbit/s]  Clk [MHz]
This work             2   1   14       7       0.5       0.5      1.17           250
                      2   14  80       35      6         6        14.54          219
                      3   1   21       9       0.9       0.9      0.95           250
                      3   6   81       34      5         5        4.81           210
                      4   1   30       13      2         2        0.66           216
                      4   3   73       32      5         5        1.85           201
Emden @ ISTCIIP’10    2       N/A                                 33.16          100
                      4       N/A                                 13.22          100
                      8       N/A                                 1.56           100
Zhang @ TCS–I’11      4   1   48 (Slices)      41        N/A      9.3            N/A
Boutillon @ TCS–I’13  6   1   19       6       1         N/A      2.95           61
Scheiber @ ICECS’13   1   1   14 (Slices)      21        N/A      13.4           122
Andrade @ ICASSP’14   8   1   85 (LEs)         62        7        1.1            163
[Figure: scatter plot; x-axis: LUTs [%], 0 to 20; y-axis: Latency [us], log scale; series: GF(4), GF(8), GF(16); annotations: "Pareto optimal points", "larger circuit size, same latency", "same circuit size, lower latency".]
Fig. 5 Pareto and non-Pareto optimization points measured in latency (µs) vs. LUTs utilization (%).
- LUT utilization grows with the Galois field dimension
  - the observed Pareto points clearly illustrate the diminishing returns in the latency-for-LUTs tradeoff
- We can settle for the optimized solution VI and increase the number K of LDPC decoder accelerators instantiated in the high-level architecture
- RTL-based circuits still achieve higher performance, but the HLS-based design comes close
  - approx. 50% of the decoding throughput
  - but only with several (K) instantiated decoders
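Separating Pareto from non-Pareto points in a latency vs. LUTs plot can be sketched as below. The `DesignPoint` struct and the `pareto_front` function are hypothetical names introduced for this illustration; the poster does not say how its Pareto points were extracted. A point is kept only if no other point has both equal-or-smaller LUT utilization and equal-or-smaller latency, which matches the two annotations in Fig. 5 (a larger circuit with the same latency, or the same circuit size with higher latency, is not Pareto optimal).

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// One synthesized solution: circuit size and latency (hypothetical type).
struct DesignPoint {
    double lut_pct;     // LUT utilization [%]
    double latency_us;  // latency [us]
};

// Returns the Pareto-optimal subset, ordered by increasing LUT utilization.
std::vector<DesignPoint> pareto_front(std::vector<DesignPoint> pts) {
    // Sort by size first, then by latency, so the best point of each size
    // is visited before the worse ones.
    std::sort(pts.begin(), pts.end(),
              [](const DesignPoint& a, const DesignPoint& b) {
                  if (a.lut_pct != b.lut_pct) return a.lut_pct < b.lut_pct;
                  return a.latency_us < b.latency_us;
              });

    std::vector<DesignPoint> front;
    double best_latency = std::numeric_limits<double>::infinity();
    for (const DesignPoint& p : pts) {
        // Keep a point only if it is strictly faster than every
        // smaller-or-equal circuit seen so far.
        if (p.latency_us < best_latency) {
            front.push_back(p);
            best_latency = p.latency_us;
        }
    }
    return front;
}
```

Along the resulting front, each additional percent of LUTs buys a smaller latency reduction than the previous one, which is the diminishing-returns shape described above.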
Conclusions
- We show that, by combining the correct optimizations, we are able to reach within 50% of the decoding throughput of RTL-based LDPC decoders
- The programming language is the same, but the programming model is different
  - code refactoring is still required
  - exploited parallelism dimensions must be exposed in proper loop structures
- By instantiating the accelerators in a suitable high-level architecture, we are able to fit multiple accelerators, further raising the level of parallelism
Created with LaTeX beamerposter
http://www.it.pt/site_group_detail_p.asp?id=23
23rd IEEE FCCM, May 3-5, Vancouver, BC, Canada
This work was supported by the Portuguese Fundação para a Ciência e Tecnologia (FCT) under grants UID/EEA/50008/2013 and SFRH/BD/78238/2011.