
Fast Design Space Exploration using Vivado HLS: Non-Binary LDPC Decoders

João Andrade∗, Nithin George†, Kimon Karras‡, David Novo†, Vítor Silva∗, Paolo Ienne†, Gabriel Falcão∗

∗ Instituto de Telecomunicações, Dept. of Electrical and Computer Engineering, Univ. of Coimbra, Portugal
† École Polytechnique Fédérale de Lausanne (EPFL), School of Comp. and Comm. Sciences, Switzerland

‡ Xilinx Research Labs, Dublin, Ireland

Introduction: Non-binary LDPC Decoder on FPGAs

I We explore a complex error-correction signal processing algorithm:
. non-binary LDPC decoding (FFT-SPA)

[Figure: factor graph with check nodes CN1-CN3 and variable nodes VN1-VN6, edges labeled by GF elements (1, α, α^2); channel pmfs m_v(x) enter the variable nodes, messages m_vc(x)/m_cv(x) pass through permute/depermute blocks and the Walsh-Hadamard transform (m_vc(z)/m_cv(z)), producing m*_v(x).]

Fig. 1 Non-binary LDPC factor graph example and message-passing algorithm.

I We utilize a high-level synthesis tool to design an LDPC decoder FPGA accelerator

I Vivado HLS allows:
. fast design space exploration via directive optimizations
. C/C++ code as input for generating an FPGA accelerator

Proposed LDPC Decoder Accelerator

[Figure: block diagram of the base solution; the vn_proc, cn_proc, fwht, permute and depermute kernels consist of nested loops (fetch data / compute / store data) operating on BRAM buffers, with data moved to and from DRAM.]

Fig. 2 Non-binary LDPC decoder base solution block diagram.

I LDPC decoder characteristics
. 3 dimensions of computation:
  I N×d_v / M×d_c probability mass functions (pmfs)
  I 2^m probabilities per pmf (e.g., 16 for GF(2^4))
  I d_v/d_c pmfs per node
  I 2^m is the Galois field dimension
. each dimension is defined over a computation loop

I Applied LDPC computation (a minimal FWHT sketch follows this list):
. Fast Walsh-Hadamard transform (fwht)
. Hadamard products (vn_proc/cn_proc)
. Cyclic permutations ((de)permute)
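To make the kernel shapes concrete, below is a minimal sketch of the in-place fast Walsh-Hadamard transform over a single pmf of length 2^m, here fixed to GF(16) for illustration; the function name, data type and sizes are assumptions rather than the authors' code, and the outer loops over all pmfs are omitted.

/* Minimal FWHT sketch (assumed shape, not the authors' code): in-place
   transform of one pmf with Q = 2^m entries, here GF(16). */
#define M 4
#define Q (1 << M)

void fwht(float pmf[Q]) {
    for (int len = 1; len < Q; len <<= 1) {        /* m butterfly stages     */
        for (int i = 0; i < Q; i += (len << 1)) {  /* blocks of 2*len        */
            for (int j = i; j < i + len; j++) {    /* butterflies in a block */
                float a = pmf[j];
                float b = pmf[j + len];
                pmf[j]       = a + b;              /* unnormalized WHT       */
                pmf[j + len] = a - b;
            }
        }
    }
}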

I Under-the-hood transformations:
. 3 different nested loop structures:
  I cn_proc/vn_proc: 3 loops, triple-nested
  I depermute/permute: 2 loops, double-nested
  I fwht: 5 loops, triple-nested
. no computation is performed directly on DRAM data
  → high bandwidth is available, but access latency is high
. data is moved into BRAM for computation in a prologue and back to DRAM in an epilogue (see the staging sketch below)
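The prologue/compute/epilogue pattern above can be sketched as follows; the buffer size, argument names and the placeholder compute body are illustrative assumptions only, chosen so that all arithmetic happens on BRAM-resident data.

/* Sketch of the DRAM -> BRAM -> DRAM staging (assumed sizes and names). */
#define N_PMF 1024                   /* pmfs staged on-chip (hypothetical)   */
#define QLEN  16                     /* 2^m probabilities per pmf, GF(16)    */

void kernel(const float *dram_in, float *dram_out) {
    float buf[N_PMF][QLEN];          /* local array, mapped to BRAM by HLS   */

fetch:                               /* prologue: copy DRAM data on-chip     */
    for (int i = 0; i < N_PMF; i++)
        for (int q = 0; q < QLEN; q++)
            buf[i][q] = dram_in[i * QLEN + q];

compute:                             /* all computation on BRAM data only    */
    for (int i = 0; i < N_PMF; i++)
        for (int q = 0; q < QLEN; q++)
            buf[i][q] *= 0.5f;       /* placeholder for the real update      */

store:                               /* epilogue: write results back to DRAM */
    for (int i = 0; i < N_PMF; i++)
        for (int q = 0; q < QLEN; q++)
            dram_out[i * QLEN + q] = buf[i][q];
}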

High-Level Architecture

[Figure: board-level view with DRAM 0 and DRAM 1 behind a memory interface, AXI4 interconnects, and per-accelerator BRAMs 1..K feeding HLS IP Cores 1..K on the FPGA; the die shot highlights cores 0-2, the AXI4 interconnect and the memory interface.]

Fig. 3 High-level architecture and die shot with 3 decoders P&R’d.

I Vivado HLS exports the accelerator design as an IP-XACT core without external I/O, a clock interface, or AXI4 data movers (a hedged interface sketch is given below)
. 1 DRAM controller and AXI-M controller per SODIMM (2)
. 1 port on the AXI-M controllers per instantiated accelerator (K)
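As a hedged illustration of how an exported core can expose its ports for this kind of system integration, the sketch below uses standard Vivado HLS interface directives to request a BRAM port and a block-level start/done handshake; the port layout, names and sizes are assumptions, not the authors' exact integration flow.

/* Hypothetical top-level port sketch (assumed, not the authors' design). */
#define BUF_WORDS 4096

void decoder_top(float buf[BUF_WORDS]) {
#pragma HLS INTERFACE bram       port=buf    /* attach to the per-core BRAMs */
#pragma HLS INTERFACE ap_ctrl_hs port=return /* start/done/idle handshake    */
    /* ... decoding kernels operate on the BRAM-resident buffer ... */
}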

Proposed Accelerator Optimizations

Table 2 Optimizations carried out for each solution.

Optimization          I    II   III   IV   V    VI
Unrolling             -    X    X     -    X    X
Pipelining            -    -    X     -    -    X
Array partitioning    -    -    -     X    X    X

I We combined the following optimizations across the 6 tested solutions:
. loop unrolling (II, V)
. loop pipelining (III, VI)
. array partitioning (IV, V, VI)

I In some cases, optimization directives can only be applied after code refactoring
I Every dimension where parallelism is exploited must be defined in its own loop; otherwise unrolling or pipelining becomes unmanageable
. in fact, some optimization configurations do not complete C-synthesis

I Pipelining is targeted at an initiation interval of II=1
I Unrolling is complete (full unroll); a directive sketch is given below
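To illustrate how these directives attach to the loop structure, the sketch below applies complete unrolling to the 2^m-wide inner loop, pipelining at II=1 to the pmf loop, and array partitioning to the buffers, roughly in the spirit of solution VI; sizes, names and the element-wise compute body are illustrative assumptions, not the authors' kernel.

/* Directive sketch (assumed kernel shape, in the spirit of solution VI). */
#define N_PMF 128                     /* pmfs processed per call (hypothetical) */
#define Q      16                     /* 2^m probabilities per pmf, GF(16)      */

void vn_proc_opt(float msg[N_PMF][Q], const float prior[N_PMF][Q]) {
#pragma HLS ARRAY_PARTITION variable=msg   complete dim=2
#pragma HLS ARRAY_PARTITION variable=prior complete dim=2

pmf_loop:
    for (int i = 0; i < N_PMF; i++) {
#pragma HLS PIPELINE II=1             /* accept one pmf per clock cycle         */
prob_loop:
        for (int q = 0; q < Q; q++) {
#pragma HLS UNROLL                    /* all 2^m products computed in parallel  */
            msg[i][q] *= prior[i][q]; /* Hadamard (element-wise) product        */
        }
    }
}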

Experimental Results: Latency vs. LUT Utilization

[Figure: three panels, one per Galois field; each plots latency in cycles (log scale, 10^0-10^6) and clock frequency (160-260 MHz) versus optimization solutions I-VI for the vn_proc, cn_proc, permute, depermute and fwht kernels.]

Fig. 4 Latency and clock frequency of operation of each LDPC accelerator solution for GF({2^2, 2^3, 2^4}).

I Applying the different optimizations, we obtain a set of Pareto points with tradeoffs in frequency and LUT utilization:
. providing more memory ports (higher bandwidth) is useful only if the ALUs can consume the data
. clock frequencies vary widely across the solutions (160 to 260 MHz)
. pipelining shows diminishing returns in latency reduction (depermute/permute) as the Galois field dimension increases

Comparison with RTL-based Decoders

Table 1 Decoding throughput, FPGA utilization and frequency of operation.

Decoder               m    K   LUT [%]       FF [%]   BRAM [%]   DSP [%]   Thr. [Mbit/s]   Clk [MHz]
This work             2    1   14            7        0.5        0.5       1.17            250
                      2   14   80            35       6          6         14.54           219
                      3    1   21            9        0.9        0.9       0.95            250
                      3    6   81            34       5          5         4.81            210
                      4    1   30            13       2          2         0.66            216
                      4    3   73            32       5          5         1.85            201
Emden @ ISTCIIP'10    2    -   N/A           N/A      N/A        N/A       33.16           100
                      4    -   N/A           N/A      N/A        N/A       13.22
                      8    -   N/A           N/A      N/A        N/A       1.56
Zhang @ TCS-I'11      4    1   48 (Slices)   -        41         N/A       9.3             N/A
Boutillon @ TCS-I'13  6    -   19            6        1          N/A       2.95            61
Scheiber @ ICECS'13   1    -   14 (Slices)   -        21         N/A       13.4            122
Andrade @ ICASSP'14   8    -   85 (LEs)      -        62         7         1.1             163

(Slices) and (LEs) entries report combined logic utilization rather than separate LUT/FF counts.

[Figure: latency (µs, log scale 10^1-10^5) versus LUT utilization (0-20%) for GF(4), GF(8) and GF(16); Pareto-optimal points are highlighted, with annotations marking designs of larger circuit size at the same latency and of the same circuit size at lower latency.]

Fig. 5 Pareto and non-Pareto optimization points measured in latency (µs) vs. LUT utilization (%).

I LUT utilization grows with the Galois field dimension
. the observed Pareto points clearly illustrate the diminishing returns in the latency-for-LUTs tradeoff

I We can settle for the optimized solution VI and increase the number K of instantiated LDPC decoder accelerators on the high-level architecture

I RTL-based circuits still achieve higher performance, but we come quite close even though HLS is used:
. approx. 50% of the decoding throughput
. but only when several (K) decoders are instantiated

Conclusions

I We show that, by combining the right optimizations, we are able to reach within 50% of RTL-based LDPC decoders

I The programming language is the same, but the programming model is different:
. code refactoring is still required
. the parallelism dimensions to be exploited must be exposed in proper loop structures

I By instantiating the accelerators in a suitable high-level architecture, we are able to fit multiple accelerators, further elevating the level of parallelism

http://www.it.pt/site_group_detail_p.asp?id=23
23rd IEEE FCCM, May 3-5, Vancouver, BC, Canada


This work was supported by the Portuguese Fundação para a Ciência e Tecnologia (FCT) under grants UID/EEA/50008/2013 and SFRH/BD/78238/2011.