
24th International Conference on Field Programmable Logic and Applications, September 3rd, 2014

September 3rd, 2014 - C. Huriaux, O. Sentieys and R. Tessier

FPGA Architecture Support for Heterogeneous, Relocatable Partial Bitstreams

Christophe HURIAUX (1), Olivier SENTIEYS (1, 2), Russell TESSIER (3)

(1) University of Rennes 1, France
(2) Inria, France
(3) University of Massachusetts, USA

Outline

§ Introduction
    § Overview of the FlexTiles project
    § Architecture Overview
    § Advantages of 3-D Stacking
§ Principles
    § Task Migration in an FPGA
    § Task Migration in FlexTiles
    § Heterogeneous case
§ Approach
    § Coping with Heterogeneity
    § Design Constraints
§ Results
    § Implementation in VPR
§ Conclusion

FP7 FlexTiles Project

§ FlexTiles: Self adaptive heterogeneous manycore based on Flexible Tiles
§ Provide a heterogeneous many-core architecture offering
    § Large flexibility
    § High performance, energy efficiency
    § Raised programming efficiency
    § Self-adaptation through virtualization

Architecture Overview

§ 3D-stacked heterogeneous manycore
    § General Purpose Processors (GPP)
        § for flexibility and programming homogeneity
    § Network on Chip
    § Dedicated hardware accelerators mapped at run-time on a reconfigurable layer
§ Reconfigurable layer with seamless task migration capabilities
§ Virtualization layer to provide an abstraction of the manycore and self-adaptive services
§ Tool-chain for parallelization and compilation

Architecture Overview

[Figure: architecture diagram with callouts: 3D interface to the NoC, DSP blocks, memory blocks]

Task migration

§ Classical problem in dynamic reconfiguration [1]
§ Enhance resource usage

[Figure: task placement example, labeled "4x4?"]

[1] K. Compton, Z. Li, J. Cooley, S. Knol, and S. Hauck, “Configuration relocation and defragmentation for run-time reconfigurable computing,” IEEE Transactions on VLSI Systems, vol. 10, no. 3, pp. 209–220, 2002.
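The fragmentation problem behind task migration can be illustrated with a toy model. The sketch below assumes the fabric is a simple boolean occupancy grid and that tasks are rectangles; the layout, the function names, and the first-fit scan are all illustrative assumptions, not the method of [1] or of the authors. It shows a 4x4 request that fails until one small task is relocated.

```python
def fits(grid, top, left, height, width):
    """Return True if the height x width window at (top, left) is inside the grid and free."""
    if top < 0 or left < 0 or top + height > len(grid) or left + width > len(grid[0]):
        return False
    return all(not grid[r][c]
               for r in range(top, top + height)
               for c in range(left, left + width))


def find_free_region(grid, height, width):
    """Scan row by row and return the first free (top, left) origin, or None."""
    for top in range(len(grid) - height + 1):
        for left in range(len(grid[0]) - width + 1):
            if fits(grid, top, left, height, width):
                return (top, left)
    return None


def mark(grid, top, left, height, width, used):
    """Mark a rectangular region as used (True) or free (False)."""
    for r in range(top, top + height):
        for c in range(left, left + width):
            grid[r][c] = used


# Toy 6x6 fabric holding two 2x2 tasks (made-up layout).
fabric = [[False] * 6 for _ in range(6)]
mark(fabric, 0, 4, 2, 2, True)   # task A, top-right corner
mark(fabric, 2, 2, 2, 2, True)   # task B, in the middle of the fabric

before = find_free_region(fabric, 4, 4)   # every 4x4 window overlaps task B

# Migrate task B to the top-left corner, then retry the same request.
mark(fabric, 2, 2, 2, 2, False)
mark(fabric, 0, 0, 2, 2, True)
after = find_free_region(fabric, 4, 4)

print(before, after)
```

The total free area is the same before and after the move; only its shape changes, which is why relocation improves resource usage.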


3D Stacking

§ 3D-stacked reconfigurable accelerators
    § Improved resource usage
    § Improved bandwidth/latency
    § Improved performance and energy efficiency

[Figure: 3D stack showing the reconfigurable layer and the multicore layer, the latter drawn as a grid of cores]

Task Migration in an FPGA

§ Predefined reconfigurable regions
§ Bit-stream depends on task location

[Figure: FPGA fabric surrounded by I/O blocks; HW Accelerator #1 is shown at two locations, with bit-streams BS #1 and BS #2]

Task Migration in FlexTiles

§ A task is synthesized, placed & routed into a Virtual Bit-Stream (VBS)
    § Independent from the task's physical location in the fabric
    § No predefined configuration domains
§ Easier resource sharing/distribution, simplified task migration
§ Reconfiguration controller generates the final BS at run-time

[Figure: numbered tasks (1, 2, 3) placed at different locations in the fabric]
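The idea of a position-independent bit-stream finalized at run-time can be sketched as follows. This is a minimal illustration under strong assumptions: the `VirtualBitstream` class, its relative-coordinate layout, and the `finalize` function are hypothetical, and the real VBS also abstracts routing details that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualBitstream:
    """Position-independent task image (hypothetical model, not the FlexTiles format)."""
    width: int
    height: int
    cells: dict  # (dx, dy) relative to the task origin -> configuration word


def finalize(vbs, origin, fabric_width, fabric_height):
    """Translate a VBS to absolute coordinates, as a run-time controller might."""
    ox, oy = origin
    if ox < 0 or oy < 0 or ox + vbs.width > fabric_width or oy + vbs.height > fabric_height:
        raise ValueError("task does not fit in the fabric at this origin")
    return {(ox + dx, oy + dy): word for (dx, dy), word in vbs.cells.items()}


# One stored image, made-up configuration words.
task = VirtualBitstream(width=2, height=2,
                        cells={(0, 0): 0b1010, (1, 0): 0b0110,
                               (0, 1): 0b0001, (1, 1): 0b1111})

# The same image is finalized for two different locations.
bs_a = finalize(task, origin=(0, 0), fabric_width=8, fabric_height=8)
bs_b = finalize(task, origin=(4, 3), fabric_width=8, fabric_height=8)
```

Only one image is stored per task; the location-specific bit-stream exists only after the controller has chosen an origin.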

Task Migration in FlexTiles

[Figure: fabric with RAM and DSP blocks and 3D NI blocks spread over it; HW Accelerator #1 (VBS #1) and HW Accelerator #2 (VBS #2) are placed on the fabric]

Heterogeneity

§ Homogeneous case
    § No constraint on task placement
    § Regular routing architecture
§ Cope with heterogeneity
    § RAM, DSP, 3D I/Os
    § Migration is limited
        § vertically to the same column
        § to the next column containing the same complex blocks

[Figure legend: Task, Configured LE, Logic Element (LE)]
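The horizontal migration constraint can be made concrete with a small sketch. It assumes a column-based fabric in which each column holds a single block type, so that vertical moves within the same columns are always legal and horizontal moves require the same sequence of column types. The column layout and the function name are illustrative assumptions.

```python
def legal_column_origins(columns, start, width):
    """Return every starting column whose block-type pattern matches the task's footprint."""
    if start < 0 or width < 1 or start + width > len(columns):
        raise ValueError("task footprint lies outside the fabric")
    pattern = columns[start:start + width]
    return [s for s in range(len(columns) - width + 1)
            if columns[s:s + width] == pattern]


# Made-up fabric: one block type per column.
columns = ["LE", "LE", "RAM", "LE", "DSP", "LE", "LE", "RAM", "LE", "DSP"]

# A task spanning an LE, a RAM, and an LE column can only land where that pattern repeats.
ram_task_origins = legal_column_origins(columns, start=1, width=3)

# A pure-logic task one column wide can go to any LE column.
logic_task_origins = legal_column_origins(columns, start=0, width=1)

print(ram_task_origins, logic_task_origins)
```

In this toy fabric the task using a RAM column has two legal origins while the pure-logic task has six, which is the loss of placement freedom that the slide describes.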


Proposed architecture

§ Heterogeneous block routing is abstracted from logic routing
§ Long lines allow a trade-off between placement flexibility and routing complexity
§ A two-level routing is performed at runtime:
    § Logic routing (as in the homogeneous case)
    § Heterogeneous block routing through long lines

Design Constraints

§ I/Os are made through 3D Network Interfaces, spread over the reconfigurable fabric

[Figure: floorplan of the reconfigurable fabric showing DSP and MEM blocks, 3D NI blocks with adjacent AI blocks, the Reconfiguration CTRL, and the Reconfiguration RAM]

Implementation in VPR

§ Versatile Place and Route (VPR), open source CAD tool for placement and routing
    § Part of the Verilog To Routing (VTR) framework
§ Source code modified to implement our techniques and deal with our constraints
    § Horizontal long lines spread over partitions
    § Separate homogeneous and heterogeneous routing

VPR and VTR: https://code.google.com/p/vtr-verilog-to-routing/

Implementation in VPR

VPR Original Routing Model

§ Logic grid
§ Block placement
    § X: simple block
    § Y: 2 blocks tall
§ Mesh routing lines
§ Switch boxes
§ Interconnect

[Figure: logic grid with X and Y blocks; pin connections to the routing channels are annotated Fc=0.5 and Fc=1]
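The Fc annotations in the figure refer to connection-block flexibility: the fraction of tracks in the adjacent channel that a block pin can connect to. The sketch below illustrates the arithmetic only. The even spreading of tracks, the rounding up, and the function name are assumptions for illustration and are not claimed to match VPR's actual connection pattern.

```python
import math


def connected_tracks(fc, channel_width, pin_index=0):
    """Return the track indices a pin reaches for a fractional Fc (illustrative pattern)."""
    if not 0 < fc <= 1:
        raise ValueError("fc must be in (0, 1]")
    if channel_width < 1:
        raise ValueError("channel_width must be positive")
    # Number of tracks reached by the pin; rounding up is an assumption.
    count = min(channel_width, max(1, math.ceil(fc * channel_width)))
    # Spread the connections evenly, offset by the pin index.
    return sorted((pin_index + (i * channel_width) // count) % channel_width
                  for i in range(count))


half = connected_tracks(0.5, 8)   # Fc=0.5: the pin reaches half of the 8 tracks
full = connected_tracks(1.0, 8)   # Fc=1: the pin reaches every track
print(half, full)
```

A lower Fc means fewer programmable switches per pin, at the price of less routing freedom.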


Implementation in VPR

Enhanced Routing Model

§ Logic grid
§ Block placement
§ Block typing
    § X: homogeneous
    § Y: heterogeneous
§ Mesh routing lines
§ Long lines
§ Switch boxes
§ Interconnect
    § Homogeneous
    § Heterogeneous

[Figure: logic grid with X (homogeneous) and Y (heterogeneous) blocks, mesh routing lines, and long lines]

Results

§ Architecture based on a simplified Stratix IV with:
    § Dual-port 144k memories
    § Fracturable 36x36 multipliers
§ Evaluation on two criteria
    § Delay of the critical path
    § Minimum channel width
        § Number of tracks in the homogeneous routing channels
§ Minimum channel width determined by VPR
    § Not directly related to silicon area
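A minimum channel width is typically found by repeating routing attempts at different widths. The sketch below shows a plain binary search of that kind, assuming routability is monotone in the width. The `is_routable` predicate is a hypothetical stand-in for a full routing run, and this is not VPR's actual code.

```python
def min_channel_width(is_routable, upper_bound):
    """Return the smallest width in [1, upper_bound] that routes, or None if none does.

    Assumes that if a width routes, every larger width routes as well.
    """
    if upper_bound < 1 or not is_routable(upper_bound):
        return None
    lo, hi = 1, upper_bound
    while lo < hi:
        mid = (lo + hi) // 2
        if is_routable(mid):
            hi = mid          # mid works, so the minimum is mid or smaller
        else:
            lo = mid + 1      # mid fails, so the minimum is larger
    return lo


# Stand-in predicate with a made-up threshold; a real one would invoke the router.
example = min_channel_width(lambda width: width >= 42, upper_bound=200)
print(example)
```

Each probe of the predicate corresponds to one complete routing attempt, so the logarithmic number of probes matters in practice.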


Results

§ Benchmark set: VTR framework circuits [1]

Circuit             # Mem   # Mult    # LB
bgm                     0       11   2,174
boundtop                1        0   2,977
ch_intrinsics           1        0     272
diffeq1                 0        5      41
diffeq2                 0        5      43
LU8PEEng               45        8      30
mkDelayWorker32B       41        0     497
mkPktMerge             15        0      17
mkSMAdapter4B           5        0     181
or1200                  2        1     273
raygentop               1        7     192
stereovision1           0       38     990

[1] J. Rose, J. Luu, C. W. Yu, et al., “The VTR project: architecture and CAD for FPGAs from Verilog to routing,” in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ACM, 2012, pp. 77–86.

Results: Delay

§ Estimation of the worst-case delay
    § Impossible to predict where connections to long lines will be made
    § Some channels crossing fixed-function blocks are longer

Results: Delay

§ Only 2% delay increase (on average)

[Figure: bar chart of critical path delay in ns per benchmark, classic vs. enhanced architecture, with the proposed/classic ratio on a second axis]

Results: Min. Channel Width

§ 1.8X channel width increase on average
§ Need for specific routing algorithms to deal with the heterogeneous interconnection network

[Figure: bar chart of minimum channel width (min W, in number of tracks) per benchmark, classic vs. enhanced architecture, with the proposed/classic ratio on a second axis]

Conclusion

§ FPGA embedded in a 3D architecture
§ More flexibility for task placement and/or relocation
§ Low impact on delay, but a cost in routing resources
§ Need to find a trade-off between flexibility and the area increase of additional connections

Thank you for your attention

More info on FlexTiles: http://www.flextiles.eu


Virtual Bit-Stream: Example

§ Hiding routing details
    § Full BS is 129 bits
    § Could be reduced by giving less detail

[Figure: routing resources around a logic block, with connection points numbered 0 to 20, inputs CLBIN[0] to CLBIN[3], and output CLBOUT]

Jan. 2014, CAIRN project-team

Virtual Bit-Stream: Example

§ Hiding routing details
    § List of I/O and connections
        § 20 → 8
        § 1 → 9
        § 5 → 18

[Figure: the same block reduced to its numbered connection points 0 to 20]
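A rough size estimate shows why a connection list can be smaller than the full bit-stream. The sketch below takes the 129-bit figure and the three connections from the slides and assumes, purely for illustration, that each endpoint is stored as a fixed-width index over the 21 numbered points. This encoding is an assumption, not the authors' VBS format, and a real format would also need a count field or a terminator.

```python
FULL_BS_BITS = 129                        # full bit-stream size stated on the slide
CONNECTIONS = [(20, 8), (1, 9), (5, 18)]  # source -> destination pairs from the slide
NUM_POINTS = 21                           # connection points are numbered 0 to 20


def connection_list_bits(connections, num_points):
    """Bits needed if each endpoint is a fixed-width index (assumed encoding)."""
    bits_per_index = (num_points - 1).bit_length()
    return 2 * bits_per_index * len(connections)


vbs_bits = connection_list_bits(CONNECTIONS, NUM_POINTS)
ratio = vbs_bits / FULL_BS_BITS
print(vbs_bits, round(ratio, 2))
```

Under this assumed encoding the list takes 30 bits, under a quarter of the full 129 bits; the saving shrinks as the number of connections grows, since the list size is proportional to it.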

Results: BS Sizes on MCNC Benchmarks

[Figure: stacked bar chart of bit-stream size in kilobits (axis 0 to 1600), split into routing and logic, for the benchmarks tseng, tseng, diffeq, diffeq, apex4, des, ex5p, and misex3]

Results: VBS Sizes on MCNC Benchmarks

[Figure: bar chart comparing BS size and VBS size in kilobits (axis 0 to 1600) for the benchmarks tseng, tseng, diffeq, diffeq, apex4, des, ex5p, and misex3, with the compression ratio on a second axis (0% to 100%). Ratio labels as extracted: 44.4%, 49.2%, 47.2%, 55.2%, 49.7%, 29.5%, 27.4%, 26.6%]

Introduction: Architecture Overview

[Figure: 3D Access Point to the NoC]

[Figure: General Architecture Overview]