fpl'2014 - flextiles workshop - 6 - flextiles embedded fpga accelerators

www.flextiles.eu

FlexTiles

Runtime Mapping of Hardware Accelerators on the Embedded FPGA Layer

FPL’14, FlexTiles Workshop

September 1st 2014

Olivier SENTIEYS★, Christophe HURIAUX, Antoine COURTAY University of Rennes 1

★ Inria

The information contained in this document and any attachments are the property of FlexTiles consortium. You are hereby notified that any review, dissemination, distribution, copying or otherwise use of this document must be done in accordance with the CA of the project (TRT/DJ/624412785.2011). Template version 1.0

University of Rennes 1 – FPL’14 FlexTiles Workshop

The Multicore Era is Hitting the Utilization Wall

The multicore era has been a reality since 2005-2008, but what’s next?

Energy efficiency is not scaling along with integration capacity

Transistor and power budgets are no longer balanced

Scaling by a factor S per process generation:

Quantity              Classical scaling   Leakage-limited scaling
Device count          S^2                 S^2
Device frequency      S                   S
Device power (cap)    1/S                 1/S
Device power (Vdd)    1/S^2               ~1
Utilization           1                   1/S^2

Power of core i: $P_i = a_i f_i C_i V_{dd,i}^2$

[Venkatesh et al., ASPLOS’10]
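
The two scaling columns can be checked with a few lines of arithmetic. The sketch below is only an illustration: it multiplies the per-device factors from the slide to get the relative power of a fully active chip, then derives the fraction of the chip that fits in a constant power budget. The function names and the value S = 1.4 are assumptions made for the example.

```python
# Relative full-chip power after one generation, from the factors on the slide.
def full_chip_power(S, vdd_power_factor):
    device_count = S ** 2   # S^2 more devices
    frequency = S           # each device runs S times faster
    cap_power = 1 / S       # capacitance contribution to device power
    return device_count * frequency * cap_power * vdd_power_factor


def utilization(S, vdd_power_factor):
    """Fraction of the chip usable at full frequency under a constant power budget."""
    return min(1.0, 1.0 / full_chip_power(S, vdd_power_factor))


S = 1.4  # assumed scaling factor, for illustration only

classical = utilization(S, vdd_power_factor=1 / S ** 2)  # Vdd scales with the process
leakage_limited = utilization(S, vdd_power_factor=1.0)   # Vdd stays roughly constant

print(f"classical: {classical:.2f}, leakage-limited: {leakage_limited:.2f}")
```

With classical scaling the factors cancel and utilization stays at 1; when the Vdd term no longer shrinks, utilization falls as 1/S^2.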



The Utilization Wall

With each successive process generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints

At 8nm in 2018: best-case average speedup of 3.7x, about 14% per year (highly parallel codes and optimal per-benchmark)

[Esmaeilzadeh et al., ISCA’11]
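
The 14% per year figure is consistent with a 3.7x speedup compounded over ten years. The sketch below is a quick check; the ten-year window (a 2008 baseline up to 2018) is an assumption, since the slide only gives the end point.

```python
def annual_growth(total_speedup, years):
    """Compound annual growth rate that yields total_speedup after `years` years."""
    return total_speedup ** (1.0 / years) - 1.0


YEARS = 10  # assumption: baseline in 2008, projection for 2018

rate = annual_growth(3.7, YEARS)
print(f"{rate:.1%} per year")
```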



Multicore and Dark Silicon

[Doug Burger, HiPEAC’13]

[Figure: speedup (axis 0 to 20) versus technology node (45nm, 32nm, 22nm, 16nm, 11nm, 8nm) for three curves: Historical Scaling (18x), ITRS Scaling (7.9x) and Realistic Scaling (3.7x). Dark silicon labels: 47%, 36%, 71%, 51%, 62%, 40%, 17%, 1%. Time labels: 2014, >2016, >2018]



The Efficiency of Specialization

* Source: Ning Zhang and Bob Brodersen, ISSCC data

100-1000X Gap in Efficiency … but Specialization comes with Penalties in Programmability

[Figure labels: ASICs, FPGAs]



Heterogeneous Multicores

Different cores on a single chip

GPPs, HW accelerators, memory, network-on-chip

Reconfigurable HW accelerators keep flexibility while increasing area and energy efficiency

Self-adapting devices

Dynamically adapt the hardware to the application and to changing environments

[Figure: heterogeneous multicore chip combining cores, a processor, reconfigurable HW, memory and a HW accelerator]



Can 3D Stacking Help?

3D-Stacked Reconfigurable Accelerators

Improved bandwidth/latency between cores and accelerators

Improved resource usage

Improved performance and energy efficiency

[Figure: 3D stack of a reconfigurable layer and a multicore layer (grid of cores)]



Outline

eFPGA Reconfigurable Fabric

General architecture overview

Expected features

Task migration in FPGA vs. task migration in eFPGA

Virtual Bit-Stream

Coping with Heterogeneous Blocks

Development Flow

Achievements & Conclusion



FlexTiles Architecture Overview


3D interface to the NoC

DSP blocks

Memory blocks



Expected Features of the Reconfigurable Layer

Main expected features

Low reconfiguration time (and power) overhead

Double-context configuration memory

Low complexity reconfiguration control

Easy resource sharing/distribution, simplified task migration

No predefined configuration domains

Bit-stream independent from task location

Smaller bit-stream size in configuration memory

Virtual Bit-Stream (VBS)



Task Allocation & Migration in an FPGA

Predefined reconfigurable regions

Bit-stream depends on task location

[Figure: FPGA surrounded by an I/O ring; HW Accelerator #1 placed in two different regions needs two different bit-streams, BS #1 and BS #2]



Task Migration in eFPGA

[Figure: eFPGA fabric with 3D network interfaces (3D NI) and RAM blocks, hosting HW Accelerator #1 (BS #1) and HW Accelerator #2 (BS #2)]



Outline


Virtual Bit-Stream

Concept

Abstraction of routing details

Results

Coping with Heterogeneous Fabric

Development Flow




Concept of Virtual Bit-Stream

A task is synthesized, placed and routed into a Virtual Bit-Stream (VBS)

Hides routing details that are architecture dependent

Removes details that come from the task's physical location in the fabric

No predefined configuration domains

The final bit-stream is generated at run time

Resource sharing/distribution becomes easier, task migration is simplified

Quartus II



Interconnection Architecture

Hiding routing details

Full BS is 129 bits

Could be reduced by giving fewer details

[Figure: logic block with inputs CLBIN[0] to CLBIN[3] and output CLBOUT, surrounded by interconnect pins numbered 0 to 20]



Virtual Bit Stream

Hiding routing details

List of I/O and connections

Example connection list (pairs of pin numbers): 20 8, 1 9, 5 18

[Figure: interconnect pins numbered 0 to 20]
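
To give a feel for why a connection list is more compact than the full bit-stream, the sketch below counts the bits needed to store the three pin pairs from this slide. The encoding (a fixed-width binary index per pin) is an assumption for illustration, not the actual VBS format, and the count ignores logic configuration bits.

```python
FULL_BS_BITS = 129   # full bit-stream size quoted on the previous slide
NUM_PINS = 21        # pins are numbered 0 to 20 in the figure

# Connection list from the slide, as pairs of pin numbers.
connections = [(20, 8), (1, 9), (5, 18)]

# Assumed encoding: each pin index stored as a fixed-width binary number.
bits_per_pin = (NUM_PINS - 1).bit_length()
routing_bits = len(connections) * 2 * bits_per_pin

print(f"{bits_per_pin} bits per pin, {routing_bits} bits for the connection list "
      f"versus {FULL_BS_BITS} bits for the full bit-stream")
```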



Results

VBS is independent of task location and smaller than the BS

[Figure: bar chart of BS size and VBS size in kilo-bits (axis 0 to 1600) with the compression ratio (axis 0% to 100%) for the benchmarks tseng, tseng, diffeq, diffeq, apex4, des, ex5p, misex3. Compression ratio labels: 44.4%, 49.2%, 47.2%, 55.2%, 49.7%, 29.5%, 27.4%, 26.6%]

3-4 times smaller for large bit-streams
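
The percentages on the chart can be turned into size reduction factors. The sketch below assumes the compression ratio is VBS size divided by BS size, which is consistent with the "3-4 times smaller" remark; the mapping of each percentage to a benchmark is not recoverable from the extracted text, so the values are treated as an unlabeled list.

```python
# Compression ratio labels from the chart, in percent (assumed: VBS size / BS size).
ratios_percent = [44.4, 49.2, 47.2, 55.2, 49.7, 29.5, 27.4, 26.6]

# A ratio of r% means the VBS is 100 / r times smaller than the BS.
reduction_factors = [100.0 / r for r in ratios_percent]

for ratio, factor in zip(ratios_percent, reduction_factors):
    print(f"{ratio:5.1f}%  ->  {factor:.2f}x smaller")

best = max(reduction_factors)
```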



eFPGA Architecture using VBS

Reconfiguration controller

On request from the GPP: can place, duplicate and migrate tasks

Finalizes VBS

[Figure: the reconfiguration controller reads VBS 1 to VBS N from external memory through a buffer memory, with data and control paths and steps numbered 1 and 2]
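
The place, duplicate and migrate operations can be sketched with a location-independent task description that is bound to fabric coordinates only at run time. Everything below is a hypothetical model for illustration: the `VirtualBitStream` and `Fabric` classes, the relative-offset representation and the instance names are assumptions, and routing is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualBitStream:
    """Hypothetical location-independent task: logic elements as (dx, dy) offsets."""
    name: str
    cells: tuple


class Fabric:
    def __init__(self, width, height):
        self.width, self.height = width, height
        self.owner = {}  # (x, y) -> task instance occupying that logic element

    def _absolute(self, vbs, origin):
        ox, oy = origin
        return [(ox + dx, oy + dy) for dx, dy in vbs.cells]

    def can_place(self, vbs, origin):
        return all(0 <= x < self.width and 0 <= y < self.height
                   and (x, y) not in self.owner
                   for x, y in self._absolute(vbs, origin))

    def place(self, vbs, origin, instance):
        """Finalize the VBS at `origin` (same VBS, any free location)."""
        if not self.can_place(vbs, origin):
            raise ValueError(f"cannot place {instance} at {origin}")
        for cell in self._absolute(vbs, origin):
            self.owner[cell] = instance

    def remove(self, instance):
        self.owner = {c: o for c, o in self.owner.items() if o != instance}

    def migrate(self, vbs, instance, new_origin):
        previous = {c: o for c, o in self.owner.items() if o == instance}
        self.remove(instance)
        try:
            self.place(vbs, new_origin, instance)
        except ValueError:
            self.owner.update(previous)  # keep the task where it was
            raise


fabric = Fabric(8, 8)
acc = VirtualBitStream("acc1", ((0, 0), (1, 0), (0, 1), (1, 1)))

fabric.place(acc, (0, 0), "acc1#0")    # place
fabric.place(acc, (4, 4), "acc1#1")    # duplicate: same VBS, second location
fabric.migrate(acc, "acc1#0", (2, 5))  # migrate the first copy
```

The point of the sketch is that a single stored description serves every location, so the external memory holds one VBS per task rather than one bit-stream per task and region.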



Outline


Virtual Bit-Stream


Heterogeneous Blocks

Task placement in a Homogeneous context

Task placement in a Heterogeneous context

Development Flow





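Logic Elements

Cluster of four 6-input LUTs

3309 µm²

Arithmetic Elements

18x18 multiplier, 48-bit adder/subtractor

4351 µm²

[Figure: logic element with CLBIN inputs, four LUTs and a CLBOUT output; arithmetic element with 18-bit inputs A and B, a 36-bit product and a 48-bit adder/subtractor]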




Memories

1024 x 16-bit word SRAM

6570 µm²

3D TSV and Accelerator Interface

[Figure: floorplan with the reconfiguration controller, the reconfiguration RAM, 3D TSV blocks (3D) and 3D network interfaces (3DNI)]

NoC Link (400 I/O)

Pitch   X    Y    Size X   Size Y   Area (mm²)
40      20   20   800      800      0.64

26.95 mm²

Work In Progress
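
The NoC link row is internally consistent if the pitch and sizes are read in micrometres, which is an assumption since the table does not state the unit. The sketch below recomputes the I/O count, the footprint and the area from the pitch and the array dimensions.

```python
PITCH_UM = 40          # pitch from the table; the micrometre unit is an assumption
TSV_X, TSV_Y = 20, 20  # array dimensions (X and Y columns of the table)

io_count = TSV_X * TSV_Y        # number of I/Os in the link
size_x_um = TSV_X * PITCH_UM    # footprint along X
size_y_um = TSV_Y * PITCH_UM    # footprint along Y
area_mm2 = (size_x_um / 1000) * (size_y_um / 1000)

print(io_count, size_x_um, size_y_um, round(area_mm2, 2))
```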



eFPGA Floorplan (heterogeneous)

Logic Block

Arithmetic Accelerator

Memories

Accelerator Interface



Task Placement & Migration

Homogeneous case

No constraint on task placement

Regular routing architecture

Easy! (thanks to the Virtual Bit-Stream)

Cope with heterogeneity

RAM, DSP, 3D I/Os

Migration is limited: vertically within the same column, or horizontally to the next column containing the same complex blocks

Task

Configured LE

Logic Element (LE)
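
The column constraint can be illustrated with a small search for legal horizontal targets. The sketch below is hypothetical: it models the fabric as a list of column types and accepts a target only when the task would see exactly the same sequence of column types. The column layout and the function name are assumptions, and the vertical constraint is not modeled.

```python
def legal_column_shifts(fabric_columns, task_start, task_width):
    """Start columns where the task sees the same column types as at task_start."""
    pattern = fabric_columns[task_start:task_start + task_width]
    last_start = len(fabric_columns) - task_width
    return [start for start in range(last_start + 1)
            if fabric_columns[start:start + task_width] == pattern]


# Hypothetical fabric: logic element (LE) columns with periodic RAM and DSP columns.
columns = ["LE", "LE", "RAM", "LE", "LE", "DSP",
           "LE", "LE", "RAM", "LE", "LE", "DSP"]

# A task spanning an LE, a RAM and an LE column can only move to a matching group.
ram_task_targets = legal_column_shifts(columns, task_start=1, task_width=3)

# A pure-logic task spanning two LE columns has more freedom.
logic_task_targets = legal_column_shifts(columns, task_start=3, task_width=2)

print(ram_task_targets, logic_task_targets)
```

In this model the task using a RAM column has two legal positions while the pure-logic task has four, which mirrors the contrast between the heterogeneous and homogeneous cases.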



eFPGA: Handling of Complex Blocks

Heterogeneous block routing is abstracted from logic routing

Long lines allow a trade-off between placement flexibility and routing complexity

A two-level routing is performed at runtime:

Logic routing (as in the homogeneous case)

Heterogeneous block routing through long lines



eFPGA: Handling of Complex Blocks

Delay depends on final placement

Only worst-case delay can be estimated offline

Flexibility is still limited in the vertical axis

Multiple of block height

The length of long lines and the number of connections between long lines and routing resources should be limited

Area overhead, but only a slight delay penalty

(see our paper at FPL’14 on Wednesday)



Outline


Virtual Bit-Stream


Development Flow




Development Flow

Custom development flow from C to Virtual Bit-Stream

[Figure: flow diagram. Tool stages in order: High-level Synthesis, HDL Synthesis, Technology mapping, Placer, Router, Bitstream generation. Data exchanged: High-level task description, RTL task description, HDL task description, Flat logic netlist, Mapped logic netlist, Placement data, Routing data, Arch. netlist, Arch. description, Virtual bit-stream]

Integrated within the FlexTiles development flow

Generates VBS from a C description or an HDL description



Development Flow


Relies on Catapult C from Calypto Design Systems

High-level synthesis from C to VHDL



Development Flow


Uses the Verilog To Routing (VTR) academic tool flow to generate netlist and routing data from Verilog

[Figure: flow stages covered by VTR: RTL/HDL task description, HDL Synthesis, Flat logic netlist, Technology mapping, Mapped logic netlist, Placer, Router, Placement data, Routing data, Arch. netlist, Arch. description]



Development Flow


A custom back-end generates the VBS from the data generated by VTR

The VBS can be loaded on the FlexTiles platform



Conclusions

Overall results and achievements

3-D stacked embedded FPGA coupled to a processor layer

Flexible resource allocation/sharing

Seamless task migration

Virtual Bit-Stream

VBS also reduces bitstream size

eFPGA Chip “Proof of Concept”

65nm CMOS

Homogeneous Fabric of LBs

I/O Ring (not 3D…)

External Reconfiguration Controller



Results

Thank you for your attention



Where do the energy savings come from?

MIPS baseline (91 pJ/instr.): D-cache 6%, Datapath 38%, Reg. File 14%, Fetch/Decode 19%, I-cache 23%

Specialized core (8 pJ/instr.): D-cache 6%, Datapath 3%, Energy Saved 91%

[Goulding et al., Hot Chips’10]
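
The 91% saving follows directly from the two per-instruction energies. The sketch below checks that, and also that each breakdown sums to 100%. The assignment of the breakdowns to the two cores is an assumption based on those sums, since the chart layout is lost in the extracted text.

```python
baseline_pj = 91.0     # MIPS baseline, pJ per instruction
specialized_pj = 8.0   # specialized core, pJ per instruction

saved_fraction = 1.0 - specialized_pj / baseline_pj

# Breakdowns in percent of the baseline energy (assignment to cores is assumed).
baseline_breakdown = {"D-cache": 6, "Datapath": 38, "Reg. File": 14,
                      "Fetch/Decode": 19, "I-cache": 23}
specialized_breakdown = {"D-cache": 6, "Datapath": 3, "Energy Saved": 91}

print(f"energy saved: {saved_fraction:.1%}")
```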



Energy per operation: 45nm CMOS, 40nm V6 FPGA

HW operators (45nm)

32-bit addition: 0.5pJ

16-bit multiply: 2.2pJ

64-bit FPU: 50pJ/op

40nm V6 FPGA

16/32-bit multiply and add: 114pJ (DSP blocks), 170pJ (LUT)

32-bit I/O access: 1.47nJ

32-bit memory read: 660 pJ

32-bit register R/W: 1.12 pJ

Embedded RISC Processor (45nm)

32-bit register R/W: 0.33pJ

32-bit cache R/W: 3.5pJ

add instruction⋆⋆: 5.32 pJ

⋆⋆add instruction (best case) = fetch, decode, read 2 operands from RF, execute, write back (into local reg. first, then copy into RF)

[Dally et al., Computer, 2010] [Bonamy et al., 2013]
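
Comparing the 45nm figures above shows how much of an instruction's energy is overhead around the arithmetic itself. The sketch below is a rough comparison that assumes the hardware operator and processor numbers are directly comparable, which the slide does not state.

```python
add_operator_pj = 0.5      # 32-bit addition, HW operator (45nm)
add_instruction_pj = 5.32  # add instruction on the embedded RISC processor (45nm)

overhead_pj = add_instruction_pj - add_operator_pj
overhead_share = overhead_pj / add_instruction_pj
ratio = add_instruction_pj / add_operator_pj

print(f"instruction/operator ratio: {ratio:.1f}x, "
      f"overhead share: {overhead_share:.0%}")
```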



The Energy Cost of Data Movement

Fetching operands costs more than computing

Energy cost of cache coherence is huge!

28nm CMOS

64-bit DP: 20 pJ

256-bit access, 8 kB SRAM: 50 pJ

256-bit buses: 26 pJ, 256 pJ, 1 nJ

Efficient off-chip link: 500 pJ

DRAM Rd/Wr: 16 nJ

[Dally, IPDPS’11]



Efficient Hardware Task Swapping

Hiding reconfiguration time with computing

Single-context memory

Double-context memory

eFPGA will use double-context memory

Gain in dynamic reconfiguration efficiency

At the cost of ~50% overhead

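[Figure: timelines for single-context and double-context memory showing Task 1, Task 2 and their configurations Cfg. 1, Cfg. 2; configuration cell built from a flip-flop (FF, ConfClk) and a latch (ConfEn)]

CB: one configuration bit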
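
The benefit of the second context can be sketched with a simple timing model. The model below is an assumption for illustration: with a single context, each task waits for its own configuration; with two contexts, the next configuration loads while the current task runs and the context switch itself is taken as instantaneous. The task durations are made-up values.

```python
def single_context_time(tasks):
    """Each task is configured, then run; nothing overlaps."""
    return sum(cfg + run for cfg, run in tasks)


def double_context_time(tasks):
    """The next configuration is loaded into the second context during execution."""
    if not tasks:
        return 0
    total = tasks[0][0]  # the first configuration cannot be hidden
    for i, (_, run) in enumerate(tasks):
        next_cfg = tasks[i + 1][0] if i + 1 < len(tasks) else 0
        total += max(run, next_cfg)  # wait for whichever finishes last
    return total


# (configuration time, execution time) per task, arbitrary units
tasks = [(2, 5), (3, 4), (2, 6)]

t_single = single_context_time(tasks)
t_double = double_context_time(tasks)
print(t_single, t_double)
```

When every execution is longer than the following configuration, as in this example, only the first configuration remains visible in the total time.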



eFPGA(V1) Architecture

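Logic Block

Switch Block

[Figure: logic block with a LUT fed by CLBIN, a flip-flop (FF), an output multiplexer driving CLBOUT, and clk/rstb inputs; switch block connecting NORTH(i), SOUTH(i), EAST(i) and WEST(i); configuration bits (CB) with ScanIn and ScanOut ports]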



eFPGA Architecture

Interconnection Block

[Figure: interconnection block linking CLBIN[0] to CLBIN[3] and CLBOUT to tracks 0 to 3 on the NORTH, SOUTH, EAST and WEST sides]



eFPGA Architecture

[Figure: eFPGA macro made of logic block CLB(i,j) with CLBIN[0] to CLBIN[3] and CLBOUT, routing channels CHANX(i,j) and CHANY(i,j), switch block SB(i,j), and the neighbouring CLBs, channels and switch blocks]



eFPGA Floorplan
