FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators

www.flextiles.eu FlexTiles Runtime Mapping of Hardware Accelerators on the Embedded FPGA Layer FPL’14, FlexTiles Workshop September 1st 2014 Olivier SENTIEYS , Christophe HURIAUX , Antoine COURTAY University of Rennes 1 Inria

Upload: flextiles-team

Post on 20-Jun-2015


DESCRIPTION

Slides presented at the FlexTiles Workshop at FPL'2014. Presentation #6: FlexTiles Embedded FPGA Accelerators FlexTiles is a heterogeneous many-core platform reconfigurable at run-time developed within an FP7 project.

TRANSCRIPT

Page 1: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators

www.flextiles.eu

FlexTiles

Runtime Mapping of Hardware Accelerators on the Embedded FPGA Layer

FPL’14, FlexTiles Workshop

September 1st 2014

Olivier SENTIEYS★, Christophe HURIAUX, Antoine COURTAY University of Rennes 1

★ Inria

Page 2: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators

The information contained in this document and any attachments are the property of FlexTiles consortium. You are hereby notified that any review, dissemination, distribution, copying or otherwise use of this document must be done in accordance with the CA of the project (TRT/DJ/624412785.2011). Template version 1.0

University of Rennes 1 – FPL’14 FlexTiles Workshop

The Multicore Era is Hitting the Utilization Wall

The multicore era has been a reality since 2005-2008, but what’s next?

Energy efficiency is not scaling along with integration capacity

Transistor and power budgets are no longer balanced

Parameter             Classical scaling    Leakage-limited scaling
Device count          S^2                  S^2
Device frequency      S                    S
Device power (cap)    1/S                  1/S
Device power (Vdd)    1/S^2                ~1
Utilization           1                    1/S^2

Power of core $i$: $P_i = \alpha_i f_i C_i V_{dd,i}^2$

[Venkatesh et al., ASPLOS’10]
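
The leakage-limited column of the scaling table can be turned into a small worked example. The sketch below assumes a per-generation scaling factor S of 1.4, an illustrative value that is not taken from the slide, and the function name is hypothetical. It applies the rule that utilization shrinks as 1/S^2 with each process generation.

```python
def utilization_after(generations: int, s: float = 1.4) -> float:
    """Fraction of the chip that can switch at full frequency under a
    fixed power budget, assuming leakage-limited scaling (utilization
    scales as 1/S^2 per process generation)."""
    return (1.0 / s**2) ** generations


# S = 1.4 is an assumed, illustrative scaling factor.
for g in range(4):
    print(f"generation {g}: utilization = {utilization_after(g):.2f}")
```

Under this assumption roughly half of the chip must stay idle after one generation, and about three quarters after two.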

Page 3: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


The Utilization Wall

With each successive process generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints

8nm in 2018: best-case average 3.7x speedup, 14% per year (highly parallel codes and optimal per-benchmark)

[Esmaeilzadeh et al., ISCA’11]
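
As a quick consistency check of the two figures quoted on this slide, the sketch below compounds 14% per year and compares the result with the 3.7x total. The 10-year span is an assumption made for this example, since the slide does not state the starting year.

```python
def annual_rate(total_speedup: float, years: int) -> float:
    """Compound annual growth rate that yields total_speedup after `years`."""
    return total_speedup ** (1.0 / years) - 1.0


YEARS = 10  # assumption: span of the projection, not stated on the slide
print(f"{annual_rate(3.7, YEARS):.1%} per year")
print(f"1.14^{YEARS} = {1.14 ** YEARS:.2f}x")
```

With that span the two numbers agree: a 3.7x total corresponds to about 14% per year.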

Page 4: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


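Multicore and Dark Silicon

[Figure: speedup (axis 0 to 20) versus technology node (45nm, 32nm, 22nm, 16nm, 11nm, 8nm) for Historical Scaling, ITRS Scaling and Realistic Scaling, with end points labeled 18x, 7.9x and 3.7x. Dark silicon percentages as extracted: 47%, 36%, 71%, 51%, 62%, 40%, 17%, 1%. Time markers: 2014, >2016, >2018]

[Doug Burger, HiPEAC’13]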

Page 5: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


The Efficiency of Specialization

* Source: Ning Zhang and Bob Brodersen, ISSCC data

100-1000X gap in efficiency … but specialization comes with penalties in programmability

[Figure labels: ASICs, FPGAs]

Page 6: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Heterogeneous Multicores

Different cores on a single chip

GPPs, HW accelerators, memory, network-on-chip

Reconfigurable HW accelerators keep flexibility while increasing area and energy efficiency

Self-adapting devices

Dynamically adapt the hardware to the application and to changing environments

[Figure: grid of cores mixing processors (Proc.), reconfigurable hardware (Reconf. HW), memory (Mem.) and hardware accelerators (HW Acc.)]

Page 7: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Can 3D Stacking Help?

3D-Stacked Reconfigurable Accelerators

Improved bandwidth/latency between cores and accelerators

Improved resource usage

Improved performance and energy efficiency

[Figure: a reconfigurable layer stacked with a multicore layer (grid of cores)]

Page 8: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Outline

eFPGA Reconfigurable Fabric

General architecture overview

Expected features

Task migration in FPGA vs. task migration in eFPGA

Virtual Bit-Stream

Coping with Heterogeneous Blocks

Development Flow

Achievements & Conclusion

Page 9: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


FlexTiles Architecture Overview


3D interface to the NoC

DSP blocks

Memory blocks

Page 10: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Expected Features of the Reconfigurable Layer

Main expected features

Low reconfiguration time (and power) overhead

Double-context configuration memory

Low complexity reconfiguration control

Easier resource sharing/distribution, simplified task migration

No predefined configuration domains

Bit-stream independent from task location

Smaller bit-stream size in configuration memory

Virtual Bit-Stream (VBS)

Page 11: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Task Allocation & Migration in an FPGA

Predefined reconfigurable regions

Bit-stream depends on task location

[Figure: FPGA fabric surrounded by I/O blocks; HW Accelerator #1 shown at two locations, with bit-streams BS #1 and BS #2]

Page 12: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Task Migration in eFPGA

[Figure: eFPGA fabric with 3D network interfaces (3D NI) and RAM blocks; HW Accelerator #1 (BS #1) and HW Accelerator #2 (BS #2)]

Page 13: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Outline

eFPGA Reconfigurable Fabric

Virtual Bit-Stream

Concept

Abstraction of routing details

Results

Coping with Heterogeneous Fabric

Development Flow

Achievements & Conclusion

Page 14: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Concept of Virtual Bit-Stream

A task is synthesized, placed and routed into a Virtual Bit-Stream (VBS)

Hide some routing details which are architecture dependent

Remove details coming from task physical location in the fabric

No predefined configuration domains

The final bit-stream is generated at run time

Resource sharing/distribution becomes easier, task migration is simplified

[Figure label: Quartus II]
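
The idea of a bit-stream that does not depend on the task location can be illustrated with a minimal sketch. The slides do not describe the actual VBS format, so everything here is hypothetical: the `VbsEntry` record, the `finalize` function and the configuration words are illustration only. The sketch covers relocation alone, not the run-time resolution of hidden routing details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VbsEntry:
    dx: int      # column offset relative to the task origin
    dy: int      # row offset relative to the task origin
    config: int  # opaque configuration word for that logic element


def finalize(vbs: list[VbsEntry], origin_x: int, origin_y: int) -> dict[tuple[int, int], int]:
    """Turn a location-independent VBS into an absolute configuration map."""
    return {(origin_x + e.dx, origin_y + e.dy): e.config for e in vbs}


# Hypothetical three-element task, stored once with relative coordinates.
task = [VbsEntry(0, 0, 0xA5), VbsEntry(1, 0, 0x3C), VbsEntry(0, 1, 0xFF)]
first = finalize(task, origin_x=2, origin_y=3)
moved = finalize(task, origin_x=7, origin_y=3)  # same VBS, new location
print(sorted(first))
print(sorted(moved))
```

The same stored description produces both placements, which is why a single VBS per task is enough for placement, duplication and migration in this simplified model.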

Page 15: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Interconnection Architecture

Hiding routing details

Full BS is 129 bits

Could be reduced by giving fewer details

[Figure: interconnection block with ports numbered 0 to 20, connected to logic block pins CLBIN[0] to CLBIN[3] and CLBOUT]

Page 16: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Virtual Bit Stream

Hiding routing details

List of I/O and connections:

20 8

1 9

5 18

[Figure: interconnection block with ports numbered 0 to 20]
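
To see why a connection list can be smaller than the 129-bit full bit-stream quoted on the Interconnection Architecture slide, the sketch below counts the bits needed for the three pairs listed here. The encoding is an assumption made for illustration: fixed-width port indices, two endpoints per connection and no header. The slides do not specify the real format.

```python
import math

NUM_PORTS = 21      # ports are numbered 0..20 in the figure
FULL_BS_BITS = 129  # full bit-stream size quoted for this block
connections = [(20, 8), (1, 9), (5, 18)]

# Assumed encoding: each endpoint is a fixed-width port index.
bits_per_port = math.ceil(math.log2(NUM_PORTS))
vbs_bits = len(connections) * 2 * bits_per_port
print(vbs_bits, "bits versus", FULL_BS_BITS)
```

Under these assumptions the list takes 30 bits. The actual gain depends on how many connections a block uses and on the real encoding.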

Page 17: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Results

VBS is independent of task location with a smaller size than BS

[Figure: bar chart of BS size and VBS size in kilobits (axis 0 to 1600) with the compression ratio (axis 0% to 100%). Benchmark labels as extracted: tseng, tseng, diffeq, diffeq, apex4, des, ex5p, misex3. Compression ratios as extracted: 44.4%, 49.2%, 47.2%, 55.2%, 49.7%, 29.5%, 27.4%, 26.6%]

3-4 times smaller for large bit-streams
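
The compression ratios from the chart can be converted into size-reduction factors. The sketch below assumes that the ratio means VBS size divided by BS size. That reading is an assumption, though it is consistent with the "3-4 times smaller" remark for the lowest ratios.

```python
ratios_percent = [44.4, 49.2, 47.2, 55.2, 49.7, 29.5, 27.4, 26.6]


def reduction_factor(ratio_percent: float) -> float:
    """How many times smaller the VBS is, assuming ratio = VBS size / BS size."""
    return 100.0 / ratio_percent


for r in ratios_percent:
    print(f"{r:5.1f}% -> {reduction_factor(r):.2f}x smaller")
```

The three lowest ratios (29.5%, 27.4%, 26.6%) give factors between 3 and 4. The others correspond to roughly a 2x reduction.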

Page 18: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


eFPGA Architecture using VBS

Reconfiguration controller

Upon GPP request: can place, duplicate and migrate tasks

Finalizes VBS

[Figure: reconfiguration controller between an external memory holding VBS 1, VBS 2, VBS 3 ... VBS N and a buffer memory, with data and control paths and steps numbered 1 and 2]

Page 19: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Outline

eFPGA Reconfigurable Fabric

Virtual Bit-Stream

Coping with Heterogeneous Fabric

Heterogeneous Blocks

Task placement in a Homogeneous context

Task placement in a Heterogeneous context

Development Flow

Achievements & Conclusion

Page 20: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Heterogeneous Blocks

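Logic Elements

Cluster of four 6-input LUTs

3309 µm²

Arithmetic Elements

18x18 multiplier, 48-bit adder/subtractor

4351 µm²

[Figure: logic element with four LUTs between CLBIN and CLBOUT; arithmetic element with 18-bit inputs A and B, a 36-bit product and a 48-bit adder/subtractor]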

Page 21: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Heterogeneous Blocks

Memories

1024 x 16-bit word SRAM

6570 µm²

3D TSV and Accelerator Interface

NoC Link (400 I/O)
Pitch    X     Y     Size X    Size Y    Area (mm²)
40       20    20    800       800       0.64

26.95 mm²

Work In Progress

[Figure: floorplan with reconfiguration controller, reconfiguration RAM, 3D TSV areas (3D) and 3D network interfaces (3DNI)]
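
The NoC link figures can be cross-checked with simple arithmetic. The sketch below assumes that the pitch and sizes are in micrometres. The slide does not give the unit, but it is the one that makes the quoted area in mm² come out right.

```python
PITCH_UM = 40      # assumption: pitch is in micrometres
N_X, N_Y = 20, 20  # connections along each axis

io_count = N_X * N_Y        # number of I/O in the array
size_x_um = N_X * PITCH_UM  # footprint along X
size_y_um = N_Y * PITCH_UM  # footprint along Y
area_mm2 = (size_x_um / 1000) * (size_y_um / 1000)
print(io_count, size_x_um, size_y_um, round(area_mm2, 2))
```

A 20 x 20 array at this pitch gives the 400 I/O, the 800 x 800 footprint and the 0.64 mm² listed in the table.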

Page 22: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


eFPGA Floorplan (heterogeneous)

Logic Block

Arithmetic Accelerator

Memories

Accelerator Interface

Page 23: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Task Placement & Migration

Homogeneous case

No constraint on task placement

Regular routing architecture

Easy! (thanks to the Virtual Bit-Stream)

Cope with heterogeneity

RAM, DSP, 3D I/Os

Migration is limited:

vertically, to the same column

or to the next column containing the same complex blocks

[Figure legend: Task, Configured LE, Logic Element (LE)]
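
The column constraint can be sketched as a pattern match over column types. The model below is hypothetical: the fabric layout, the single-letter column codes and the `legal_columns` helper are all invented for illustration. It treats a task as a sequence of column types and lists every horizontal offset where that sequence lines up with the fabric.

```python
def legal_columns(fabric: list[str], footprint: list[str]) -> list[int]:
    """Return every x offset where the task's column types line up with the fabric."""
    width = len(footprint)
    return [
        x
        for x in range(len(fabric) - width + 1)
        if fabric[x:x + width] == footprint
    ]


# 'L' = logic column, 'D' = DSP column, 'R' = RAM column (hypothetical layout)
fabric = list("LLDLLRLLDLLR")
task = list("LDL")  # a task spanning logic, DSP, logic columns
print(legal_columns(fabric, task))
```

In this layout the task fits at offsets 1 and 7 only, whereas a purely logic task could start at any logic column, matching the slide's contrast between the homogeneous and heterogeneous cases.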

Page 24: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


eFPGA: Handling of Complex Blocks

Heterogeneous block routing is abstracted from logic routing

Long lines allow a trade-off between placement flexibility and routing complexity

A two-level routing is performed at runtime:

Logic routing (as in the homogeneous case)

Heterogeneous block routing through long lines

Page 25: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


eFPGA: Handling of Complex Blocks

Delay depends on final placement

Only worst-case delay can be estimated offline

Flexibility is still limited in the vertical axis

Multiple of block height

The length of long lines, and the connections between long lines and routing resources, should be limited

Area overhead, but slight delay penalty

(see our paper at FPL’14 on Wednesday)

Page 26: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Outline

eFPGA Reconfigurable Fabric

Virtual Bit-Stream

Coping with Heterogeneous Fabric

Development Flow

Achievements & Conclusion

Page 27: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Development Flow

Custom development flow from C to Virtual Bit-Stream

[Figure: flow diagram. Tools: High-level Synthesis, HDL Synthesis, Technology mapping, Placer, Router, Bitstream generation. Data: High-level task description, RTL/HDL task description, Flat logic netlist, Mapped logic netlist, Placement data, Routing data, Arch. netlist, Arch. description, Virtual bit-stream]

Integrated within the FlexTiles development flow

Generates VBS from a C description or an HDL description

Page 28: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Development Flow

Custom development flow from C to Virtual Bit-Stream

Relies on Catapult C from Calypto Design Systems

High-level synthesis from C to VHDL

Page 29: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Development Flow

Custom development flow from C to Virtual Bit-Stream

Uses the Verilog To Routing (VTR) academic tool flow to generate netlist and routing data from Verilog

[Figure: flow diagram excerpt. Tools: HDL Synthesis, Technology mapping, Placer, Router. Data: RTL/HDL task description, Flat logic netlist, Mapped logic netlist, Placement data, Routing data, Arch. netlist, Arch. description]

Page 30: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Development Flow

Custom development flow from C to Virtual Bit-Stream

A custom back-end generates the VBS from the data generated by VTR

The VBS can be loaded on the FlexTiles platform

Page 31: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Conclusions

Overall results and achievements

3-D stacked embedded FPGA coupled to a processor layer

Flexible resource allocation/sharing

Seamless task migration

Virtual Bit-Stream

VBS also reduces bitstream size

eFPGA Chip “Proof of Concept”

65nm CMOS

Homogeneous Fabric of LBs

I/O Ring (not 3D…)

External Reconfiguration Controller

Page 32: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Results

Thank you for your attention

Page 33: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Where do the energy savings come from?

MIPS baseline (91 pJ/instr.): D-cache 6%, Datapath 38%, Reg. File 14%, Fetch/Decode 19%, I-cache 23%

Specialized core (8 pJ/instr.): D-cache 6%, Datapath 3%, Energy Saved 91%

[Goulding et al., Hot Chips’10]
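
The quoted energies and percentages can be cross-checked. The sketch below assumes that the specialized-core percentages are fractions of the baseline energy, which is consistent with the three of them summing to 100%.

```python
BASELINE_PJ = 91.0    # MIPS baseline, pJ per instruction
SPECIALIZED_PJ = 8.0  # specialized core, pJ per instruction

saved = 1.0 - SPECIALIZED_PJ / BASELINE_PJ
# Assumption: D-cache (6%) and datapath (3%) are fractions of the baseline.
remaining_pj = BASELINE_PJ * (0.06 + 0.03)
print(f"energy saved: {saved:.0%}")
print(f"remaining energy: {remaining_pj:.2f} pJ/instr.")
```

Both views agree: going from 91 pJ to 8 pJ saves about 91%, and the remaining 9% of the baseline is about 8 pJ per instruction.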

Page 34: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Energy per operation: 45nm CMOS, 40nm V6 FPGA

HW operators (45nm)

32-bit addition: 0.5pJ

16-bit multiply: 2.2pJ

64-bit FPU: 50pJ/op

40nm V6 FPGA

16/32-bit multiply and add: 114pJ (DSP blocks), 170pJ (LUT)

32-bit I/O access: 1.47nJ

32-bit memory read: 660 pJ

32-bit register R/W: 1.12 pJ

Embedded RISC Processor (45nm)

32-bit register R/W: 0.33pJ

32-bit cache R/W: 3.5pJ

add instruction⋆⋆: 5.32 pJ

⋆⋆add instruction (best case) = fetch, decode, read 2 operands from RF, execute, write back (into local reg. first, then copy into RF)

[Dally et al., Computer, 2010] [Bonamy et al., 2013]
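
A short calculation makes the instruction overhead explicit. The sketch below compares the two 45nm figures listed on this slide for a 32-bit addition. Treating them as directly comparable is an assumption of this example.

```python
HW_ADD_PJ = 0.5     # 32-bit addition, hardware operator (45nm)
RISC_ADD_PJ = 5.32  # add instruction on the embedded RISC processor (45nm)

overhead = RISC_ADD_PJ / HW_ADD_PJ
print(f"instruction overhead: {overhead:.1f}x")
```

Executing the addition as a processor instruction costs roughly ten times the energy of the bare operator, because of fetch, decode and register file accesses.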

Page 35: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


The Energy Cost of Data Movement

Fetching operands costs more than computing

Energy cost of cache coherence is huge!

[Figure: energy costs in 28nm CMOS. Labels as extracted: 64-bit DP, 20 pJ; 26 pJ; 256 pJ; 1 nJ; 256-bit buses; 50 pJ, 256-bit access 8 kB SRAM; 500 pJ, efficient off-chip link; 16 nJ, DRAM Rd/Wr]

[Dally, IPDPS’11]

Page 36: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


Efficient Hardware Task Swapping

Hiding reconfiguration time with computing

Single-context memory

Double-context memory

eFPGA will use double-context memory

Gain in dynamic reconfiguration efficiency

At the cost of ~50% overhead

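[Figure: timelines of Task 1 and Task 2 with their configurations Cfg. 1 and Cfg. 2, for single-context and double-context memory; circuit of a double-context configuration bit built from an FF and a latch, with ConfClk and ConfEn signals]

CB: one configuration bit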
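
The benefit of hiding reconfiguration behind computation can be illustrated with a simple timing model. The sketch below is an assumption-laden illustration: the function names and the time values are hypothetical, and the next configuration is assumed to start loading into the second context as soon as the current task starts.

```python
def single_context_time(cfg: list[float], run: list[float]) -> float:
    """Each task must be fully configured before it runs; nothing overlaps."""
    return sum(cfg) + sum(run)


def double_context_time(cfg: list[float], run: list[float]) -> float:
    """Task i+1 is configured in the second context while task i runs."""
    total = cfg[0]  # the first configuration cannot be hidden
    for i, r in enumerate(run):
        next_cfg = cfg[i + 1] if i + 1 < len(cfg) else 0.0
        total += max(r, next_cfg)  # wait for whichever finishes last
    return total


cfg = [2.0, 2.0]  # hypothetical configuration times (arbitrary units)
run = [5.0, 5.0]  # hypothetical execution times
print(single_context_time(cfg, run), double_context_time(cfg, run))
```

With these values the second configuration is fully hidden behind the first task, so the total drops from 14 to 12 time units. The gain shrinks when configuration takes longer than execution.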

Page 37: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


eFPGA(V1) Architecture

Logic Block

Switch Block

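[Figure: logic block with a LUT, an FF and a mux between CLBIN and CLBOUT, clk and rstb inputs, and configuration bits (CB) chained from ScanIn to ScanOut; switch block connecting NORTH(i), SOUTH(i), EAST(i) and WEST(i), with configuration bits chained from ScanIn to ScanOut]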

Page 38: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


eFPGA Architecture

Interconnection Block

[Figure: interconnection block linking CLBIN[0] to CLBIN[3] and CLBOUT with tracks 0 to 3 on each of the NORTH, SOUTH, EAST and WEST sides]

Page 39: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


eFPGA Architecture

[Figure: eFPGA macro. CLB(i,j) with its switch block SB(i,j) and routing channels CHANX(i,j) and CHANY(i,j), connected through CLBIN[0] to CLBIN[3] and CLBOUT to the neighbouring CLBs, switch blocks and channels]

Page 40: FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators


eFPGA Floorplan
