Post on 05-Oct-2020

FPGA Smart Dust

John McAllister

Institute of Electronics, Communications and Information Technology (ECIT),

Queen’s University Belfast

jp.mcallister@qub.ac.uk

FPGA Then & Now

Then… Virtex-II

Multipliers, Look-Up Tables, Block RAM

Then… VHDL, Verilog, Constraints/Directives

Synthesis (Synplify, XST)

Place and Route (ISE/Vivado)

Now… Virtex-UltraScale

DSP Slices, Look-Up Tables, Block RAM

Now… C/C++, SystemC, Constraints/Directives

High Level Synthesis Tool (Vivado)

VHDL, Verilog
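
The "Now" flow takes C/C++ plus directives as the input to a high-level synthesis tool. As a rough illustration of what such an input might look like, the sketch below shows a multiply-accumulate loop in C++. The function name `fir_mac`, the tap count `kTaps` and the choice of the `#pragma HLS PIPELINE II=1` directive are assumptions made for this example and do not come from the slides. An ordinary C++ compiler ignores the unknown pragma (possibly with a warning), so the same function can also be checked in software.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical tap count, chosen only for illustration.
constexpr std::size_t kTaps = 4;

// Multiply-accumulate over a fixed-size window.
// The loop bound is a compile-time constant so a synthesis tool
// can determine the trip count statically.
std::int32_t fir_mac(const std::int16_t samples[kTaps],
                     const std::int16_t coeffs[kTaps]) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < kTaps; ++i) {
// Directive (assumed for this sketch): request a pipelined loop
// with an initiation interval of one cycle.
#pragma HLS PIPELINE II=1
        acc += static_cast<std::int32_t>(samples[i]) *
               static_cast<std::int32_t>(coeffs[i]);
    }
    return acc;
}
```

Called from a software testbench with samples {1, 2, 3, 4} and all-ones coefficients, `fir_mac` returns 10. The point of the sketch is only that the designer writes the algorithm and a few directives, and leaves the RTL structure to the tool.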

The HLS Advantage

VHDL/Verilog

C/C++/SystemC

Why HLS?

Design abstraction

Design productivity

Design time

Control of results

Performance or efficiency

Fewer things for the designer to manage

Reduced from 100s or 1000s to 10s

FPGA Compute

[Figure: LUT MACs and DSP48E1 MACs for the devices V585T, V1500T, V2000T, VX330T, VX415T, VX485T, VX550T, VX690T, VX980T, VX1140T, VH290T, VH580T and VH870T; vertical axis from 0.00 to 100.00]

An Alternative

Software

Constraints/Directives

Architectural Synthesis

Compilation

Processing Elements

FPGA Processors

Vector coprocessor

Conventional Soft Processors (e.g. MicroBlaze, Nios, MIPS, LEON)

Lean Processors (e.g. iDEA)

‘Smart Dust’ Processing Elements

90 LUTs

Scaling Up

Pile-em Up

[Figure: block diagram with an Interface Controller, several SPU blocks each with a row of PEs (PE PE PE ...), and Point-to-Point FIFO Connections]

When It Worked

[Figure: Preprocessing (QR Decomposition) and Tree Search blocks, repeated across four diagram panels; label: 108]

As Good As Custom RTL Circuits

Realisation                      Throughput (Mbps)   DSP48e   LUTs     BRAM
FPE                              502.5               144      16,601   0
Barbero & Thompson, ICC ’08      600                 160      13,197   49
Qi & Chakrabarti, SiPS ’10       200                 64       18,893   12
Wu & Masera, Euromicro DSD ’10   27.7                0        6,587    0

When It Didn’t: Low Compute/Communication Ratio

Low Compute/Data Access Ratio Is A Problem

[Figure panels: SIMD FFT; MIMD FFT]

The Issue

Streaming Processing Elements

Stream Processing

The Effect of Streaming

[Figure panels: SIMD FFT; MIMD FFT]

When It Didn’t: Large Data Objects

1024² Matrix-Matrix Multiplication

CIF Full Search Motion Estimation

Token Processing

Block Memory Access & Zero-Overhead Repeat

Dramatic Reductions in No. of Instructions

1024² Matrix-Matrix Multiplication

Class   FPE      sFPE   δ (%)
ALU     32768    32     -99.9
COMM    2048     6      -99.7
CTRL    559      4      -99.7
NOP     0        4
Total   32375    54     -99.8

CIF Full Search Motion Estimation

Class   FPE      sFPE   δ (%)
ALU     268353   26     -99.9
COMM    2467     14     -99.4
CTRL    12582    12     -99.9
NOP     1026     6      -99.6
Total   284428   58     -99.9
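
The δ (%) column can be read as the relative change in instruction count from FPE to sFPE. The slides do not state the formula, so the definition used below, (sFPE - FPE) / FPE × 100 rounded to one decimal place, is an assumption, and the helper name `percent_change` is hypothetical. Values printed on the slide may have been rounded or truncated differently.

```cpp
#include <cmath>
#include <limits>

// Relative change from `before` to `after`, in percent,
// rounded to one decimal place.
// Assumed definition of the delta column; not stated in the slides.
// Returns NaN when `before` is zero, where the change is undefined
// (the slide leaves that cell blank).
double percent_change(double before, double after) {
    if (before == 0.0) {
        return std::numeric_limits<double>::quiet_NaN();
    }
    const double delta = (after - before) / before * 100.0;
    return std::round(delta * 10.0) / 10.0;
}
```

For example, `percent_change(2048, 6)` gives -99.7 and `percent_change(2467, 14)` gives -99.4, matching the COMM rows of the two tables under this assumed definition.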

1024² Matrix Multiplication

[Figure: four bar charts comparing sFPE, FPE, VEGAS and VENICE; axis labels: ×10⁶/s, LUTs, DSP48e, BRAM. Bar labels as extracted, per chart: 14284671, 43900, 6100; 2.8, 2.1, 1.4, 0.6; 32, 64, 132, 20; 16, 96, 32, 17]

Full Search Motion Estimation

[Figure: four bar charts comparing sFPE, FPE, VIPERS, VEGAS and VENICE; axis labels: Frames/s, LUTs, DSP48e, BRAM. Bar labels as extracted, per chart: 106.9, 56.4, 4.8, 10.9, 15.8; 1.9, 4.7, 9.4, 8.4, 66.1; 1, 22, 54, 20, 20; 32, 44, 10, 64, 17]

FFTs

[Figure: four charts comparing sFPE, Spiral and Xilinx over the categories 64, 128, 256 and 512; axis labels: Frames/s, LUTs, DSP48e, BRAM. Data labels as extracted, per chart:
0.50, 1.01, 3.89, 6.03; 0.84, 2.68, 3.16, 7.08; 2.24, 3.51, 10.11, 21.13
0.23, 0.60, 0.99, 2.10; 0.21, 0.48, 0.75, 2.11; 0.50, 0.65, 1.23, 4.22
12, 32, 64, 160; 8, 16, 24, 24; 24, 48, 136, 272
5, 9, 10, 15; 8, 10, 22, 24; 4, 8, 16, 28]

Summary

Goal: productivity gain with performance/cost benefit.

One instance: multicores, requiring a designer to handle tens of components.

HLS undermines a key reason for using FPGA.

Domain-specific, configurable and programmable RTL components?

Are there others?

Good performance/cost, much greater productivity.
