FPL'2014 - FlexTiles Workshop - 6 - FlexTiles Embedded FPGA Accelerators
DESCRIPTION
Slides presented at the FlexTiles Workshop at FPL'2014. Presentation #6: FlexTiles Embedded FPGA Accelerators. FlexTiles is a heterogeneous many-core platform, reconfigurable at run-time, developed within an FP7 project.
TRANSCRIPT
www.flextiles.eu
FlexTiles
Runtime Mapping of Hardware
Accelerators on the Embedded
FPGA Layer
FPL’14, FlexTiles Workshop
September 1st 2014
Olivier SENTIEYS★, Christophe HURIAUX, Antoine COURTAY University of Rennes 1
★ Inria
The information contained in this document and any attachments are the property of FlexTiles consortium. You are hereby notified that any review, dissemination, distribution, copying or otherwise use of this document must be done in accordance with the CA of the project (TRT/DJ/624412785.2011). Template version 1.0
University of Rennes 1 – FPL’14 FlexTiles Workshop
The Multicore Era is Hitting the Utilization Wall
The multicore era has been a reality since 2005-2008,
but what’s next?
Energy efficiency is not scaling along
with integration capacity
Transistor and power budgets
no longer balanced
                      Classical scaling   Leakage-limited scaling
Device count          S^2                 S^2
Device frequency      S                   S
Device power (cap)    1/S                 1/S
Device power (Vdd)    1/S^2               ~1
Utilization           1                   1/S^2
Power of core i: $P_i = a_i f_i C_i V_{dd,i}^2$
[Venkatesh et al., ASPLOS’10]
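The scaling factors listed on this slide can be checked numerically. The sketch below is only an illustration, assuming a fixed chip area, a fixed power budget normalized to 1, and a constant activity factor; the function name `scaling` and its result keys are hypothetical and do not come from the slides. It applies P = a f C Vdd^2 per device and derives the fraction of devices that can switch at full frequency.

```python
def scaling(S, leakage_limited):
    """Scaling factors for one process generation with linear shrink factor S."""
    device_count = S ** 2      # the same area holds S^2 times more devices
    frequency = S              # devices can switch S times faster
    capacitance = 1 / S        # per-device capacitance shrinks by 1/S
    # Classical scaling lowers Vdd by 1/S; leakage-limited scaling keeps it roughly constant.
    vdd = 1.0 if leakage_limited else 1 / S
    # Dynamic power of one device at full frequency (activity factor held constant).
    device_power = frequency * capacitance * vdd ** 2
    full_chip_power = device_count * device_power
    # With a power budget of 1, only this fraction of devices can switch at full frequency.
    utilization = min(1.0, 1 / full_chip_power)
    return {
        "device_count": device_count,
        "full_chip_power": full_chip_power,
        "utilization": utilization,
    }


if __name__ == "__main__":
    for label, flag in (("classical", False), ("leakage-limited", True)):
        print(label, scaling(2, flag))
```

For S = 2, the classical case keeps utilization at 1, while the leakage-limited case drops it to 1/S^2 = 0.25.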
The Utilization Wall
With each successive process generation, the
percentage of a chip that can switch at full frequency
drops exponentially due to power constraints
8nm in 2018: best-case average 3.7x speedup, 14% per year
(highly parallel codes and optimal per-benchmark)
[Esmaeilzadeh et al., ISCA’11]
Multicore and Dark Silicon
[Figure: speedup (axis 0 to 20) versus technology node (45nm, 32nm, 22nm, 16nm, 11nm, 8nm) for historical scaling (18x), ITRS scaling (7.9x) and realistic scaling (3.7x); dark-silicon percentages shown: 47%, 36%, 71%, 51%, 62%, 40%, 17%, 1%; time markers: 2014, >2016, >2018]
[Doug Burger, HiPEAC’13]
The Efficiency of Specialization
* Source: Ning Zhang and Bob Brodersen, ISSCC data
100-1000X Gap in Efficiency … but Specialization
comes with Penalties in Programmability
ASICs
FPGAs
Heterogeneous Multicores
Different cores on a single chip
GPPs, HW accelerators, memory, network-on-chip
Reconfigurable HW accelerators keep flexibility while increasing
area and energy efficiency
Self-adapting devices
Dynamically adapt the hardware to the application and to changing
environments
[Figure: chip with several cores (Core), a processor (Proc.), reconfigurable hardware (Reconf. HW), memory (Mem.) and a hardware accelerator (HW Acc.)]
Can 3D Stacking Help?
3D-Stacked Reconfigurable Accelerators
Improved bandwidth/latency between cores and accelerators
Improved resource usage
Improved performance and energy efficiency
[Figure: a reconfigurable layer stacked with a multicore layer made of cores]
Outline
eFPGA Reconfigurable Fabric
General architecture overview
Expected features
Task migration in FPGA vs. task migration in eFPGA
Virtual Bit-Stream
Coping with Heterogeneous Blocks
Development Flow
Achievements & Conclusion
FlexTiles Architecture Overview
3D interface to the NoC
DSP blocks
Memory blocks
Expected Features of the Reconfigurable Layer
Main expected features
Low reconfiguration time (and power) overhead
Double-context configuration memory
Low complexity reconfiguration control
Easy resource sharing/distribution, simplified task migration
No predefined configuration domains
Bit-stream independent from task location
Smaller bit-stream size in configuration memory
Virtual Bit-Stream (VBS)
Task Allocation & Migration in an FPGA
Predefined
reconfigurable
regions
Bit-stream depends
on task location
[Figure: FPGA fabric surrounded by an I/O ring; HW Accelerator #1 is shown at two locations, with bit-stream BS #1 at one and BS #2 at the other]
Task Migration in eFPGA
[Figure: eFPGA fabric with 3D network interfaces (3D NI) and RAM blocks; HW Accelerator #1 with BS #1 and HW Accelerator #2 with BS #2]
Outline
eFPGA Reconfigurable Fabric
Virtual Bit-Stream
Concept
Abstraction of routing details
Results
Coping with Heterogeneous Fabric
Development Flow
Achievements & Conclusion
Concept of Virtual Bit-Stream
A task is synthesized, placed and routed into a Virtual Bit-Stream (VBS)
Hide some routing details which are architecture dependent
Remove details coming from the task's physical location in the fabric
No predefined configuration domains
The final Bit-Stream is generated at run time
Resource sharing/distribution becomes easier, task migration is simplified
Quartus II
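To make the idea of a location-independent bit-stream concrete, here is a minimal sketch of run-time finalization. It assumes, purely for illustration, that a VBS stores each block's configuration with coordinates relative to the task origin; the slides do not describe the actual VBS encoding, and `VirtualEntry`, `finalize` and the `config` word are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualEntry:
    dx: int      # column offset from the task origin (assumed layout)
    dy: int      # row offset from the task origin (assumed layout)
    config: int  # opaque configuration word for that block (hypothetical)


def finalize(vbs, origin_x, origin_y, width, height):
    """Map location-independent entries to absolute (x, y) -> config words."""
    final = {}
    for entry in vbs:
        x, y = origin_x + entry.dx, origin_y + entry.dy
        # Reject placements that would fall outside the fabric.
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"block at ({x}, {y}) is outside the fabric")
        final[(x, y)] = entry.config
    return final


if __name__ == "__main__":
    task = [VirtualEntry(0, 0, 0b1010), VirtualEntry(1, 0, 0b0110)]
    # The same VBS is finalized for two different locations.
    print(finalize(task, 2, 3, width=8, height=8))
    print(finalize(task, 5, 1, width=8, height=8))
```

Because only the origin changes between the two calls, a single stored VBS can serve any legal location, which is the property that makes migration simpler.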
Interconnection Architecture
Hiding routing details
Full BS is 129 bits
Could be reduced by giving fewer details
[Figure: interconnection block with inputs CLBIN[0] to CLBIN[3], output CLBOUT, and ports numbered 0 to 20]
Virtual Bit Stream
Hiding routing details
List of I/O and connections: (20, 8), (1, 9), (5, 18)
[Figure: interconnection block with ports numbered 0 to 20]
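A rough size estimate shows why listing connections can be smaller than the full 129-bit configuration. The sketch below assumes a naive encoding that is not taken from the slides: each connection is stored as two port indices, with ports numbered 0 to 20 as in the figure. The real VBS format may carry additional fields, so the result is only indicative.

```python
FULL_BS_BITS = 129   # full bit-stream size for one interconnection block (from the slide)
NUM_PORTS = 21       # ports numbered 0..20 in the figure
CONNECTIONS = [(20, 8), (1, 9), (5, 18)]   # connection list from the slide


def connection_list_bits(connections, num_ports):
    """Bits needed to store each connection as a pair of port indices."""
    bits_per_port = (num_ports - 1).bit_length()   # 5 bits address 21 ports
    return len(connections) * 2 * bits_per_port


if __name__ == "__main__":
    vbs_bits = connection_list_bits(CONNECTIONS, NUM_PORTS)
    print(f"{vbs_bits} bits as a connection list vs {FULL_BS_BITS} bits in full")
```

Under this assumed encoding the three connections take 30 bits, and the size grows with the number of connections actually used rather than with the number of configurable switches.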
Results
VBS is independent of task location with a
smaller size than BS
[Figure: BS size and VBS size in kilobits (axis 0 to 1600) and compression ratio (axis 0% to 100%) for the benchmarks tseng, tseng, diffeq, diffeq, apex4, des, ex5p, misex3; compression ratios shown: 44.4%, 49.2%, 47.2%, 55.2%, 49.7%, 29.5%, 27.4%, 26.6%]
3-4 times smaller for large bit-streams
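The "3-4 times smaller" figure can be cross-checked against the compression ratios printed on the chart. This small sketch assumes, which the slide does not state explicitly, that the compression ratio is the VBS size divided by the BS size; `shrink_factor` is an illustrative name.

```python
# Compression ratios as printed on the chart, in percent.
RATIOS_PERCENT = [44.4, 49.2, 47.2, 55.2, 49.7, 29.5, 27.4, 26.6]


def shrink_factor(ratio_percent):
    """How many times smaller the VBS is, assuming ratio = VBS size / BS size."""
    return 100.0 / ratio_percent


if __name__ == "__main__":
    for ratio in RATIOS_PERCENT:
        print(f"{ratio:5.1f}% -> {shrink_factor(ratio):.2f}x smaller")
```

Under that assumption, the three lowest ratios (29.5%, 27.4%, 26.6%) correspond to shrink factors between 3 and 4, and the remaining ones to roughly 2.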
eFPGA Architecture using VBS
Reconfiguration controller
On request from the GPP: can place, duplicate and migrate tasks
Finalizes VBS
[Figure: an external memory stores VBS 1 to VBS N; the reconfiguration controller and its buffer memory are linked by data and control paths, with steps labelled 1 and 2]
Outline
eFPGA Reconfigurable Fabric
Virtual Bit-Stream
Coping with Heterogeneous Fabric
Heterogeneous Blocks
Task placement in a Homogeneous context
Task placement in a Heterogeneous context
Development Flow
Achievements & Conclusion
Heterogeneous Blocks
Logic Elements
Cluster of four 6-input LUTs
3309 µm²
Arithmetic Elements
18x18 multiplier, 48-bit adder/subtractor
4351 µm²
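[Figure: logic element with four LUTs between CLBIN and CLBOUT; arithmetic element with 18-bit inputs A and B, a 36-bit product and a 48-bit adder/subtractor]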
Heterogeneous Blocks
Memories
1024 x 16-bit word SRAM
6570 µm²
3D TSV and Accelerator Interface
[Figure: reconfiguration controller and reconfiguration RAM surrounded by 3D interfaces (3D) and 3D network interfaces (3DNI)]
NoC Link (400 I/O):
Pitch   X    Y    Size X   Size Y   Area (mm²)
40      20   20   800      800      0.64
26.95 mm²
Work In Progress
eFPGA Floorplan (heterogeneous)
Logic Block
Arithmetic Accelerator
Memories
Accelerator Interface
Task Placement & Migration
Homogeneous case
No constraint on task placement
Regular routing architecture
Easy! (thanks to the Virtual Bit-Stream)
Cope with heterogeneity
RAM, DSP, 3D I/Os
Migration is limited:
vertically, to the same column
or to the next column containing the same complex blocks
Task
Configured LE
Logic Element (LE)
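The column constraint can be illustrated with a small placement check. This is a sketch under a simplifying assumption that is not taken from the slides: the fabric is modelled as a list of column types, and a task may start at any column where its own column types line up with those of the fabric. The names `legal_start_columns`, `FABRIC` and the block labels are illustrative.

```python
def legal_start_columns(fabric_cols, task_cols):
    """Start columns where the task's column types match the fabric's."""
    span = len(task_cols)
    return [
        col
        for col in range(len(fabric_cols) - span + 1)
        if fabric_cols[col:col + span] == task_cols
    ]


# Hypothetical fabric: mostly logic elements (LE) with RAM and DSP columns.
FABRIC = ["LE", "LE", "RAM", "LE", "LE", "DSP", "LE", "LE", "RAM", "LE"]

if __name__ == "__main__":
    # A task using a RAM column can only move to another matching RAM column.
    print(legal_start_columns(FABRIC, ["LE", "RAM", "LE"]))
    # A logic-only task has more freedom.
    print(legal_start_columns(FABRIC, ["LE", "LE"]))
```

In a homogeneous fabric every column matches, so the list contains every start column that leaves room for the task; each complex block type added to the fabric narrows it.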
eFPGA: Handling of Complex Blocks
Heterogeneous blocks routing is abstracted from
logic routing
Long lines allow a trade-off between placement flexibility and routing complexity
A two-level routing is performed at runtime:
Logic routing (as in the homogeneous case)
Heterogeneous block routing through long lines
eFPGA: Handling of Complex Blocks
Delay depends on final placement
Only worst-case delay can be estimated offline
Flexibility is still limited in the vertical axis
Multiple of block height
The length of long lines and the connections between long lines and routing resources should be limited
Area overhead, but only a slight delay penalty
(see our paper at FPL’14 on Wednesday)
Outline
eFPGA Reconfigurable Fabric
Virtual Bit-Stream
Coping with Heterogeneous Fabric
Development Flow
Achievements & Conclusion
Development Flow
Custom development flow from C to Virtual Bit-Stream
Flow: High-level task description -> High-level synthesis -> RTL task description (or HDL task description) -> HDL synthesis -> Flat logic netlist -> Technology mapping -> Mapped logic netlist -> Placer and router -> Placement data, routing data -> Bitstream generation -> Virtual bit-stream
Architecture data shown in the diagram: arch. description, arch. netlist
Integrated within the FlexTiles development flow
Generates a VBS from a C description or an HDL description
Development Flow
Custom development flow from C to Virtual Bit-Stream
Relies on Catapult C
from Calypto Design
Systems
High-level synthesis
from C to VHDL
Development Flow
Custom development flow from C to Virtual Bit-Stream
Use the Verilog To
Routing (VTR)
academic tool flow
to generate netlist
and routing data
from Verilog
Flow: RTL task description or HDL task description -> HDL synthesis -> Flat logic netlist -> Technology mapping -> Mapped logic netlist -> Placer and router -> Placement data, routing data
Architecture data shown in the diagram: arch. description, arch. netlist
Development Flow
Custom development flow from C to Virtual Bit-Stream
A custom back-end
generates the VBS
from the data
generated by VTR
The VBS can be
loaded on the
FlexTiles platform
Conclusions
Overall results and achievements
3-D stacked embedded FPGA coupled to a processor layer
Flexible resource allocation/sharing
Seamless task migration
Virtual Bit-Stream
VBS also reduces bitstream size
eFPGA Chip “Proof of Concept”
65nm CMOS
Homogeneous Fabric of LBs
I/O Ring (not 3D…)
External Reconfiguration Controller
Results
Thank you for your
attention
Where do the energy savings come from?
MIPS baseline, 91 pJ/instr.: D-cache 6%, Datapath 38%, Reg. File 14%, Fetch/Decode 19%, I-cache 23%
Specialized core, 8 pJ/instr.: D-cache 6%, Datapath 3%, Energy Saved 91%
[Goulding et al., Hot Chips’10]
Energy per operation: 45nm CMOS, 40nm V6 FPGA
HW operators (45nm)
32-bit addition: 0.5pJ
16-bit multiply: 2.2pJ
64-bit FPU: 50pJ/op
40nm V6 FPGA
16/32-bit multiply and add: 114pJ (DSP blocks), 170pJ (LUT)
32-bit I/O access: 1.47nJ
32-bit memory read: 660 pJ
32-bit register R/W: 1.12 pJ
Embedded RISC Processor (45nm)
32-bit register R/W: 0.33pJ
32-bit cache R/W: 3.5pJ
add instruction⋆⋆: 5.32 pJ
⋆⋆add instruction (best case) = fetch, decode, read 2 operands from RF, execute, write back (into local reg. first, then copy into RF)
[Dally et al., Computer, 2010] [Bonamy et al., 2013]
The Energy Cost of Data Movement
Fetching operands costs more than computing
Energy cost of cache coherence is huge!
[Figure, 28nm CMOS: 64-bit DP 20 pJ; 256-bit access to an 8 kB SRAM 50 pJ; 256-bit buses 26 pJ, 256 pJ and 1 nJ; efficient off-chip link 500 pJ; DRAM Rd/Wr 16 nJ]
[Dally, IPDPS’11]
Efficient Hardware Task Swapping
Hiding reconfiguration time with computing
Single-context memory
Double-context memory
eFPGA will use double-context memory
Gain in dynamic reconfiguration efficiency
At the cost of ~50% overhead
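[Figure: two timelines showing Task 1 and Task 2 with configurations Cfg. 1 and Cfg. 2, one for single-context and one for double-context memory; configuration cell built from a flip-flop (FF) clocked by ConfClk and a latch enabled by ConfEn]
CB: one configuration bit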
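The benefit of hiding reconfiguration behind computation can be estimated with a simple timing model. The sketch below rests on assumptions that are not from the slides: tasks run one after another, only one configuration is loaded at a time, the context switch itself takes no time, and time units are arbitrary. Function names are illustrative.

```python
def single_context_time(tasks):
    """Each task is configured, then executed; nothing overlaps."""
    return sum(cfg + run for cfg, run in tasks)


def double_context_time(tasks):
    """The next configuration loads into the idle context while a task runs."""
    if not tasks:
        return 0
    total = tasks[0][0]   # the first configuration cannot be hidden
    for i, (_, run) in enumerate(tasks):
        next_cfg = tasks[i + 1][0] if i + 1 < len(tasks) else 0
        # The fabric moves on only when both the run and the next load are done.
        total += max(run, next_cfg)
    return total


if __name__ == "__main__":
    schedule = [(4, 10), (4, 6), (4, 3)]   # (configuration time, run time)
    print("single-context:", single_context_time(schedule))
    print("double-context:", double_context_time(schedule))
```

In this model a reconfiguration is fully hidden whenever the running task lasts at least as long as the next configuration load; shorter tasks expose the difference.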
eFPGA(V1) Architecture
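[Figure: logic block (LUT fed by CLBIN, flip-flop (FF), output mux to CLBOUT, clk and rstb, configuration bits (CB) on a ScanIn/ScanOut chain) and switch block (NORTH(i), SOUTH(i), EAST(i), WEST(i), configuration bits on a ScanIn/ScanOut chain)]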
eFPGA Architecture
Interconnection Block
[Figure: interconnection block connecting CLBIN[0] to CLBIN[3] and CLBOUT to four tracks (0 to 3) on each of the NORTH, SOUTH, WEST and EAST sides]
eFPGA Architecture
[Figure: eFPGA macro at position (i,j), made of CLB(i,j) with inputs CLBIN[0] to CLBIN[3] and output CLBOUT, switch block SB(i,j), and routing channels CHANX(i,j) and CHANY(i,j); neighbouring elements shown: CHANY(i,j+1), SB(i-1,j), CHANX(i+1,j), CLB(i+1,j), SB(i,j-1), CLB(i,j+1)]
eFPGA Floorplan