
A Flexible Interconnection Structure for Reconfigurable FPGA Dataflow Applications

Gianluca Durelli, Alessandro A. Nacci, Riccardo Cattaneo, Christian Pilato, Donatella Sciuto and Marco Domenico Santambrogio

Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria

Milano, IT

[durelli, nacci, rcattaneo, pilato, sciuto]@elet.polimi.it
marco.santambrogio@polimi.it


20th Reconfigurable Architectures Workshop May 20-21, 2013, Boston, USA

Rationale

• Strive for performance in compute-intensive applications

• Reconfigurable HW well suited for certain classes of applications:
  – Multimedia, computational biology, physical simulation

• FPGAs used in HPC systems
• High maintenance costs:
  – Need to share resources among users
• Need to dynamically share and reuse components on the FPGA among different users


Outline

• Goals
• State of the Art
• Proposed Solution
• Design and Evaluation
• Case Study
• Conclusions and Future Work


Goals

• Design an interconnection able to:
  – Create different pipelines reusing available components on the FPGA
  – Share resources between different applications
  – Not insert any stall into the pipeline
• Target: FPGAs for the HPC scenario


State of the Art

• Bus interconnection:
  – Congestion problems
  – Does not scale
• Network-on-Chip:
  – Possible congestion problems
  – Good scalability


• Both introduce unexpected delays in computation:
  – Cannot guarantee performance when sharing the device between different users

Proposed Solution

• Switch-based interconnection:
  – Core inputs connected to interconnection outputs
  – Core outputs connected to interconnection inputs
  – Fully pipelined point-to-point communication
• Data are read/written only when all the inputs are available
• Can be configured by setting, for each input and output channel:
  – Switching configuration:
    • Multiplexer configuration to route information
  – From which clock cycle the channel is active
  – How much data has to be read/written through that channel
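As an illustration, the per-channel configuration just described (multiplexer selection, activation cycle, and data count) can be modeled with a short Python sketch. The field names, the activity rule, and the `route` helper are assumptions for illustration only, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class ChannelConfig:
    # Hypothetical parameter names mirroring the three settings above.
    source: int       # multiplexer selection: which input channel feeds this output
    start_cycle: int  # clock cycle from which the channel is active
    word_count: int   # how many data words pass through the channel

def route(configs, inputs, cycle):
    """Route one clock cycle's worth of data through the switch.

    inputs[i] is the word offered on input channel i (None if absent).
    An output channel forwards data only while it is active and its
    source has data, mirroring the rule that data is read/written
    only when the inputs are available.
    """
    outputs = {}
    for out_ch, cfg in configs.items():
        active = cfg.start_cycle <= cycle < cfg.start_cycle + cfg.word_count
        if active and inputs.get(cfg.source) is not None:
            outputs[out_ch] = inputs[cfg.source]
    return outputs

# Example: output 0 routed from input 1 starting at cycle 0;
# output 1 routed from input 0 starting at cycle 2; 4 words each.
cfg = {0: ChannelConfig(source=1, start_cycle=0, word_count=4),
       1: ChannelConfig(source=0, start_cycle=2, word_count=4)}
print(route(cfg, {0: 10, 1: 20}, cycle=0))  # only output 0 is active yet
```

In a real switch the selection would be a multiplexer and the activity window a counter; here both collapse into the `active` comparison.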

Proposed Solution

• Suited for dataflow/pipelined applications
• Parameters can be extracted from a high-level description of the application and its pipeline structure:
  – Possibility to automate the parameter extraction and interconnection design
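A minimal sketch of how such parameters might be derived from a linear pipeline description. The function name, the uniform-latency model, and the dictionary fields are illustrative assumptions, not the authors' extraction algorithm:

```python
def extract_configs(stages, words_per_stage, latency):
    """Derive per-channel parameters from an ordered list of pipeline stages.

    Stage i's output feeds stage i+1's input; each downstream channel is
    assumed to become active `latency` cycles after its predecessor.
    """
    configs = []
    for i in range(len(stages) - 1):
        configs.append({
            "from": stages[i],              # core driving the channel
            "to": stages[i + 1],            # core fed by the channel
            "start_cycle": (i + 1) * latency,  # when the channel activates
            "word_count": words_per_stage,     # data to transfer
        })
    return configs

# Gray scale -> Gaussian blur -> edge detection, as in the case study.
pipeline = ["GS", "GB", "ED"]
print(extract_configs(pipeline, words_per_stage=256, latency=8))
```

Automating this step is what makes it possible to regenerate the interconnection configuration for a new pipeline without redesigning the switch.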



Implementation


• Solution implemented with HLS:
  – HLS well suited for dataflow/stencil loop synthesis
  – Simplifies HW development
  – Generation of compatible interfaces
• Maxeler Technologies:
  – HPC dataflow computing exploiting FPGAs
  – Proprietary HLS starting from a Java-like description:
    • Proposed interconnection solution easily described in Java
• MaxWorkstation 3A:
  – Intel i7 quad-core
  – Xilinx Virtex-6 XC6VSX475T
  – PCIe communication:
    • Maximum of 8 channels/streams

Evaluation: Area Occupation


• Area increment (10-30%) due to the increase in switching logic
• The interconnection consumes up to 6% of the FPGA:
  – Plenty of space remains for user cores

Evaluation: Frequency


• Tested with pass-through cores to evaluate the maximum working frequency of the interconnection (300 MHz)
• For real-life applications (brain-network application with cores working at 200 MHz), the interconnection does not affect the critical path

Case Study

• Application:
  – Image processing pipeline (up to 4 stages):
    • Gray scale (GS), Gaussian blur (GB), and edge detection (ED) filters
    • Their combinations
• Tested architectures:
• Experiments:
  – Single execution of an N-stage pipeline
  – Batch execution of a workload of 100 random applications


[Figure: tested architectures (A), (B), (C), (D)]
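The filter combinations and the 100-application random workload can be sketched as follows, assuming each pipeline chains distinct filters drawn from the three named stages; the helper names and the seed are illustrative, not from the paper:

```python
import random
from itertools import permutations

# The three filter cores from the case study.
FILTERS = ["GS", "GB", "ED"]  # gray scale, Gaussian blur, edge detection

def pipelines(max_stages=3):
    """Enumerate every ordered chain of distinct filters (illustrative)."""
    result = []
    for n in range(1, max_stages + 1):
        result.extend(permutations(FILTERS, n))
    return result

def random_workload(size=100, seed=42):
    """Draw a batch of applications, as in the batch-execution experiment."""
    rng = random.Random(seed)      # fixed seed so the draw is reproducible
    choices = pipelines()
    return [rng.choice(choices) for _ in range(size)]

apps = random_workload()
print(len(apps), apps[0])
```

With three distinct filters this yields 3 + 6 + 6 = 15 possible pipelines, from which the batch experiment samples its 100 applications.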

Case Study: Single execution


Case Study: Batch execution


• The proposed solution (D) does not introduce overhead in the overall execution time w.r.t. the other two architectures

• Low system load:
  – Up to 30% reduction in the overall workload execution time

Case Study: Batch execution


• Low system load (1-2 applications):
  – The proposed solution (D) does not introduce delays in the execution of a single application of the workload
• Higher system loads (more than 2 applications):
  – 10%-30% reduction in single-application execution time

Conclusions and Future work

• Conclusions:
  – Design of an interconnection to support HW resource sharing in multi-application scenarios
  – Solution suited for dataflow/pipelined systems
  – Possibility to realize different pipeline configurations at run-time
• Future work:
  – Design of a mapping/reconfiguration strategy to allocate user cores and configure new core instances at run-time


