Augmenting FPGAs with Embedded Networks-on-Chip
Mohamed ABDELFATTAH, Vaughn BETZ
2
Outline
1. Why NoCs on FPGAs?
2. Embedded NoCs
3. Comparison Against Buses
3
Interconnect
1. Why NoCs on FPGAs?
Motivation
Logic Blocks
Switch Blocks
Wires
4
1. Why NoCs on FPGAs?
Motivation
Logic Blocks
Switch Blocks
Wires
Hard Blocks:
• Memory
• Multiplier
• Processor
5
1. Why NoCs on FPGAs?
Motivation
Logic Blocks
Switch Blocks
Wires
Hard Interfaces: DDR/PCIe ..
Interconnect still the same
Hard Blocks:
• Memory
• Multiplier
• Processor
1600 MHz
200 MHz
800 MHz
6
1. Why NoCs on FPGAs?
Motivation
DDR3 PHY and Controller
PCIe Controller
Gigabit Ethernet
Problems:
1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
1600 MHz
200 MHz
800 MHz
7
1. Why NoCs on FPGAs?
Motivation
DDR3 PHY and Controller
PCIe Controller
Gigabit Ethernet
Problems:
1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   – Huge CAD problem
   – Slow compilation
   – Power/area utilization
4. Wire speed not scaling:
   – Delay is interconnect-dominated
Barcelona Los Angeles
Keep the “roads”, but add “freeways”.
Hard Blocks
Logic Cluster
Source: Google Earth
9
1. Why NoCs on FPGAs?
DDR3 PHY and Controller
PCIe Controller
Gigabit Ethernet
Problems:
1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   – Huge CAD problem
   – Slow compilation
   – Power/area utilization
4. Wire speed not scaling:
   – Delay is interconnect-dominated
FPGA with NoC
NoC
Routers
Links
Router forwards data packet
Router moves data to local interconnect
10
1. Why NoCs on FPGAs?
DDR3 PHY and Controller
PCIe Controller
Gigabit Ethernet
Problems:
1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   – Huge CAD problem
   – Slow compilation
   – Power/area utilization
4. Wire speed not scaling:
   – Delay is interconnect-dominated
5. Abstraction favours modularity:
   – Parallel compilation
   – Partial reconfiguration
   – Multi-chip interconnect
FPGA with NoC
• Pre-design NoC to requirements
• NoC links are “re-usable”
• NoC is heavily “pipelined”
• NoC abstraction favors modularity
High bandwidth endpoints known
11
1. Why NoCs on FPGAs?
DDR3 PHY and Controller
PCIe Controller
Gigabit Ethernet
FPGA with NoC
• Latency-tolerant communication
• NoC abstraction favors modularity
Problems:
1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   – Huge CAD problem
   – Slow compilation
   – Power/area utilization
4. Wire speed not scaling:
   – Delay is interconnect-dominated
5. Abstraction favours modularity:
   – Parallel compilation
   – Partial reconfiguration
   – Multi-chip interconnect
12
1. Why NoCs on FPGAs?
Compute Acceleration
• Maxeler
   – Geoscience (14x, 70x)
   – Financial analysis (5x, 163x)
• Altera OpenCL
   – Video compression (3x, 114x)
   – Information filtering (5.5x)
GPU CPU
15
1. Why NoCs on FPGAs?
Compute Acceleration
NoC
16
Outline
1. Why NoCs on FPGAs?
2. Embedded NoCs
   – Mixed NoCs
   – Hard NoCs
3. Comparison Against Buses
Embedded NoCs
[Figure: FPGA with an embedded NoC. Labels: DDRx Interface, PCIe Interface, Router, Compute Module, Links (Hard or Soft), Fabric Port (Hard or Soft)]
2. Embedded NoCs
Soft Routers + Soft Links = “Soft” NoC
Hard Routers + Soft Links = “Mixed” NoC
Hard Routers + Hard Links = “Hard” NoC
18
Methodology
[Figure: evaluation flow. Labels: Soft (FPGA CAD Tools), Hard (ASIC CAD Tools, Design Compiler), Mixed, HSPICE, Area, Speed, Power, Toggle rates, Gate-level simulation]
19
Router Logic
Programmable Interconnect
FPGA
Router
2. Embedded NoCs
Mixed NoCs
Logic blocks
Programmable “soft” interconnect
Baseline Router
Width   VCs   Ports   Buffer
32      2     5       10/VC
Hard Routers + Soft Links = “Mixed” NoC
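To put the baseline router parameters in perspective, the sketch below estimates how much buffering one router carries. The slide gives no units, so this assumes the width is in bits, that "10/VC" means ten flit-wide words per virtual channel, and that every port buffers every VC. The function name is made up for illustration.

```python
# Baseline router parameters from the slide; the units are assumptions.
WIDTH_BITS = 32     # assumed: flit/link width in bits
NUM_VCS = 2         # virtual channels per port
NUM_PORTS = 5       # router ports
DEPTH_PER_VC = 10   # assumed: flit-wide buffer words per VC


def router_buffer_bits(width_bits, num_vcs, num_ports, depth_per_vc):
    """Total buffer storage, assuming each port holds depth_per_vc words per VC."""
    return width_bits * num_vcs * num_ports * depth_per_vc


total_bits = router_buffer_bits(WIDTH_BITS, NUM_VCS, NUM_PORTS, DEPTH_PER_VC)
print(f"{total_bits} bits ({total_bits // 8} bytes) of buffering per router")
```

Under these assumptions the buffers are small, which fits with hardening the whole router rather than building it from logic blocks.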
20
Router Logic
Programmable Interconnect
FPGA
Router
2. Embedded NoCs
Mixed NoCs
Router Logic
Hard Routers + Soft Links = “Mixed” NoC
21
Router Logic
Programmable Interconnect
Router
Assumed a mesh
Can form any topology
FPGA
2. Embedded NoCs
Mixed NoCs
Special Feature: Configurable topology
22
Router Logic
Dedicated Interconnect
FPGA
Router
2. Embedded NoCs
Hard NoCs
Logic blocks
Dedicated “hard” interconnect
Programmable “soft” interconnect
Hard Routers + Hard Links = “Hard” NoC
23
Router Logic
Dedicated Interconnect
FPGA
Router
2. Embedded NoCs
Hard NoCs
Router Logic
Hard Routers + Hard Links = “Hard” NoC
24
Router Logic
Dedicated Interconnect
FPGA
Router
2. Embedded NoCs
Hard NoCs
Special Feature: Low-V mode
1.1 V → 0.9 V
Save 33% Dynamic Power
~15% slower
Hard Routers + Hard Links = “Hard” NoC
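The 33% saving is consistent with the usual first-order model in which dynamic power scales with the square of the supply voltage. The sketch below checks that arithmetic. It assumes the clock frequency and switching activity are held fixed, which is a simplification and not something the slide states.

```python
V_NOMINAL = 1.1  # volts, from the slide
V_LOW = 0.9      # volts, from the slide


def dynamic_power_ratio(v_new, v_old):
    """Ratio of dynamic power at v_new to v_old, assuming P ~ C * V^2 * f
    with capacitance, frequency and activity unchanged."""
    return (v_new / v_old) ** 2


ratio = dynamic_power_ratio(V_LOW, V_NOMINAL)
saving = 1.0 - ratio
print(f"Dynamic power saving in Low-V mode: {saving:.1%}")
```

If the router also runs at the ~15% lower clock, power at full speed drops further, but the energy spent per bit moved still follows the squared-voltage ratio.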
3. Area/Power Analysis
Soft, Mixed and Hard
64-node NoC on Stratix III [65 nm]

               Soft            Mixed / Hard
Area           ~12,500 LBs     576 LBs / 448 LBs
               33% of FPGA     ~1.5% of FPGA
Speed          166 MHz         730 – 940 MHz
Bisection BW   ~10 GB/s        ~50 GB/s
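The bisection bandwidth figures can be roughly reproduced from the other numbers in the deck. The sketch below is one reading that happens to land close to them, not the authors' stated method: it assumes the 64 nodes form an 8×8 mesh, that links are 32 bits wide as in the baseline router, and that both directions across the cut are counted.

```python
def mesh_bisection_bandwidth_gbps(side, width_bits, freq_mhz):
    """Bisection bandwidth in GB/s of a side x side mesh.

    Assumes cutting the mesh in half severs `side` links, each carrying
    width_bits per cycle in each of the two directions.
    """
    channels = 2 * side                      # unidirectional channels across the cut
    bytes_per_cycle = channels * width_bits / 8
    return bytes_per_cycle * freq_mhz * 1e6 / 1e9


soft_bw = mesh_bisection_bandwidth_gbps(8, 32, 166)      # soft NoC clock
embedded_low = mesh_bisection_bandwidth_gbps(8, 32, 730)  # lower end of 730 – 940 MHz
embedded_high = mesh_bisection_bandwidth_gbps(8, 32, 940)  # upper end of 730 – 940 MHz

print(f"Soft: {soft_bw:.1f} GB/s")
print(f"Mixed/Hard: {embedded_low:.1f} to {embedded_high:.1f} GB/s")
```

With these assumptions the soft NoC comes out near 10 GB/s and the embedded NoCs bracket 50 GB/s, matching the rounded values on the slide.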
3. Area/Power Analysis
Soft, Mixed and Hard
64-node NoC on Stratix III [65 nm]

               Soft            Mixed / Hard (Low-V)
Area           ~12,500 LBs     576 LBs / 448 LBs
               33% of FPGA     ~1.5% of FPGA
Speed          166 MHz         730 – 940 MHz
Bisection BW   ~10 GB/s        ~50 GB/s

Provides ~50 GB/s peak bisection bandwidth
Very cheap! Less than the cost of 3 soft nodes
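The "3 soft nodes" comparison can be checked from the area figures alone. The sketch below assumes the soft NoC's ~12,500 LBs are spread evenly over its 64 nodes, and takes 576 LBs and 448 LBs as the two embedded NoC areas. The variable names are illustrative.

```python
SOFT_NOC_LBS = 12_500          # approximate, from the slide
NUM_NODES = 64
EMBEDDED_NOC_LBS = (576, 448)  # whole-NoC areas from the slide

# Assumed: soft NoC area divides evenly across nodes.
soft_lbs_per_node = SOFT_NOC_LBS / NUM_NODES
three_soft_nodes = 3 * soft_lbs_per_node

for area in EMBEDDED_NOC_LBS:
    cheaper = area < three_soft_nodes
    print(f"{area} LBs < {three_soft_nodes:.0f} LBs (3 soft nodes): {cheaper}")
```

Each soft node costs roughly 195 LBs, so an entire 64-node embedded NoC fits in the area of about three soft routers.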
29
3. Area/Power Analysis
NoC Power Budget
How much is used for system-level communication?
Typical FPGA Dynamic Power: 17.4 W (largest Stratix-III device)
Legend: Soft NoC, Mixed NoC, Hard NoC, Hard NoC (Low-V)
NoC power at 250 GB/s total bandwidth, as a share of typical FPGA dynamic power:
Soft NoC: 123%
30
3. Area/Power Analysis
NoC Power Budget
Typical FPGA Dynamic Power: 17.4 W
Legend: Soft NoC, Mixed NoC, Hard NoC, Hard NoC (Low-V)
NoC power at 250 GB/s total bandwidth, as a share of typical FPGA dynamic power:
Soft NoC: 123%
Mixed NoC: 15%
31
3. Area/Power Analysis
NoC Power Budget
Typical FPGA Dynamic Power: 17.4 W
Legend: Soft NoC, Mixed NoC, Hard NoC, Hard NoC (Low-V)
NoC power at 250 GB/s total bandwidth, as a share of typical FPGA dynamic power:
Soft NoC: 123%
Mixed NoC: 15%
Hard NoC: 11%
32
3. Area/Power Analysis
NoC Power Budget
Typical FPGA Dynamic Power: 17.4 W
NoC power at 250 GB/s total bandwidth, as a share of typical FPGA dynamic power:
Soft NoC: 123%
Mixed NoC: 15%
Hard NoC: 11%
Hard NoC (Low-V): 7%
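To make the percentages concrete, the sketch below converts each share into watts against the 17.4 W figure. The pairing of each percentage with a NoC type follows the order in which the values appear across the slide builds, which is an inference from the transcript and not an explicit statement in it.

```python
TYPICAL_FPGA_DYNAMIC_POWER_W = 17.4  # from the slide

# Share of typical dynamic power at 250 GB/s total bandwidth.
# Assumed mapping: percentages matched to NoC types in slide-build order.
noc_power_share = {
    "Soft NoC": 1.23,
    "Mixed NoC": 0.15,
    "Hard NoC": 0.11,
    "Hard NoC (Low-V)": 0.07,
}

noc_power_watts = {
    name: share * TYPICAL_FPGA_DYNAMIC_POWER_W
    for name, share in noc_power_share.items()
}

for name, watts in noc_power_watts.items():
    print(f"{name}: {watts:.2f} W")
```

Read this way, a soft NoC carrying 250 GB/s would draw more than the device's entire typical dynamic power, while the hard variants stay in the range of one to two watts.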
33
3. Area/Power Analysis
Bandwidth in Perspective
[Figure: DDR3, PCIe, Module 1 and Module 2 attached to the NoC; link labels 4 × 14.6 GB/s and 4 × 17 GB/s]
Full theoretical BW
Aggregate Bandwidth: 126 GB/s
NoC Power Budget: 3.5%
Cross whole chip!
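The aggregate figure is the sum of the eight labeled streams, and the 3.5% share lines up with scaling the earlier power budget down to this bandwidth. The sketch below shows both steps. It assumes NoC power scales linearly with total bandwidth and uses the 7% at 250 GB/s value as the reference point; neither assumption is spelled out on the slide.

```python
# Eight streams labeled in the figure (GB/s).
streams_gbps = [14.6] * 4 + [17.0] * 4
aggregate_gbps = sum(streams_gbps)

# Assumed reference point: 7% of typical dynamic power at 250 GB/s.
REFERENCE_SHARE_PERCENT = 7.0
REFERENCE_BANDWIDTH_GBPS = 250.0

# Assumed: NoC power grows linearly with the bandwidth it carries.
power_share_percent = (
    REFERENCE_SHARE_PERCENT * aggregate_gbps / REFERENCE_BANDWIDTH_GBPS
)

print(f"Aggregate bandwidth: {aggregate_gbps:.1f} GB/s")
print(f"Estimated NoC power share: {power_share_percent:.1f}%")
```

The sum comes to about 126 GB/s and the scaled share to about 3.5%, which agrees with the slide under these assumptions.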
34
Outline
1. Why NoCs on FPGAs?
2. Embedded NoCs
3. Comparison Against Buses
   – Design Effort
   – Area/Power Efficiency
35
4. Comparison
DDR3: Qsys Bus vs. NoC
Qsys bus: Build logical bus from fabric
Embedded NoC: 16 Nodes, hard routers & links
36
4. Comparison
DDR3: Qsys Bus vs. NoC
Qsys bus: Build logical bus from fabric
Embedded NoC: 16 Nodes, hard routers & links
“The Case for Embedded Networks-on-Chip on FPGAs”
To appear in IEEE Micro Magazine (February)
37
4. Comparison
Design Effort
• Steps to close timing using Qsys
close
FPGA
38
4. Comparison
Design Effort
• Steps to close timing using Qsys
far
FPGA
39
4. Comparison
Design Effort
• Steps to close timing using Qsys
far
FPGA
Timing closure can be simplified with an embedded NoC
42
4. Comparison
Area Comparison
Entire NoC smaller than bus for 3 modules!
43
4. Comparison
Area Comparison
Even with only 1/8 of its bandwidth used, the Hard NoC already takes less area for most systems
44
4. Comparison
Power Comparison
Hard NoC saves power for even the simplest systems
1. Why NoCs on FPGAs?
• Big city needs freeways to handle traffic
2. Embedded NoCs: Mixed & Hard
• Area: 20-23X, Speed: 5-6X, Power: 9-15X
• Area Budget for 64 nodes: ~1%
• Power Budget for 100 GB/s: 3-7%
3. Comparison Against P2P/Buses
• Raw efficiency close to simplest P2P links
• NoC more efficient & lower design effort
46
Thank You!
www.eecg.utoronto.ca/~mohamed