The Case for Embedded NoCs on FPGAs
Mohamed ABDELFATTAH, Vaughn BETZ
Outline
1. Why NoCs on FPGAs?
2. Embedded NoCs
3. Area & Power Analysis
4. Comparison Against P2P/Buses
Interconnect
1. Why NoCs on FPGAs?
Motivation
Logic Blocks
Switch Blocks
Wires
Hard Blocks:
• Memory
• Multiplier
• Processor
Hard Interfaces: DDR/PCIe ..
Interconnect still the same
1600 MHz
200 MHz
800 MHz
DDR3 PHY and Controller
PCIe Controller
Gigabit Ethernet
Problems:
1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
– Huge CAD Problem
– Slow compilation
– Power/area utilization
4. Wire speed not scaling:
– Delay is interconnect-dominated
Barcelona Los Angeles
Keep the “roads”, but add “freeways”.
Hard Blocks
Logic Cluster
Source: Google Earth
FPGA with NoC
NoC: Routers, Links
Router forwards data packet
Router moves data to local interconnect
5. Abstraction favours modularity:
– Parallel compilation
– Partial reconfiguration
– Multi-chip interconnect
FPGA with NoC
• High bandwidth endpoints known
• Pre-design NoC to requirements
• NoC links are “re-usable”
• NoC is heavily “pipelined”
• NoC abstraction favors modularity
DDR3 PHY and Controller
1. Why NoCs on FPGAs?PCIe Controller
Gigabit Ethernet
FPGA with NoC
Latency-tolerant communication NoC abstraction favors modularity
Problems:1. Bandwidth requirements for
hard logic/interfaces2. Timing closure3. High interconnect utilization:
– Huge CAD Problem– Slow compilation– Power/area utilization
4. Wire speed not scaling:– Delay is interconnect-dominated
5. Abstraction favours modularity:– Parallel compilation– Partial reconfiguration– Multi-chip interconnect
NoCs can simplify FPGA design
Does the NoC abstraction come at a high area/power cost?
How to integrate NoCs in FPGAs?
How do embedded NoCs compare to current interconnects?
Outline
1. Why NoCs on FPGAs?
2. Embedded NoCs: Mixed NoCs, Hard NoCs
3. Area & Power Analysis
4. Comparison Against P2P/Buses
2. Embedded NoCs
Embedded NoCs
[Figure: FPGA with DDRx Interface, PCIe Interface, Router, Compute Module, Links (Hard or Soft), Fabric Port (Hard or Soft)]
“Soft” NoC = Soft Links + Soft Routers
“Mixed” NoC = Soft Links + Hard Routers
“Hard” NoC = Hard Links + Hard Routers
Methodology
[Figure: Soft (FPGA CAD Tools); Hard (ASIC CAD Tools, Design Compiler); Mixed; Area, Speed, Power; Toggle rates; Gate-level simulation; HSPICE]
2. Embedded NoCs
Mixed NoCs
“Mixed” NoC = Soft Links + Hard Routers
[Figure: FPGA, Router, Router Logic, Logic blocks, Programmable “soft” interconnect]
Baseline Router
Width   VCs   Ports   Buffer
32      2     5       10/VC
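As a rough sense of scale, the baseline router parameters imply a total buffer capacity per router. The sketch below is only an estimate under two assumptions that the slide does not state: the depth of 10 per VC is counted in 32-bit words, and each of the 5 ports has its own buffer per VC. `router_buffer_bits` is a hypothetical helper name.

```python
def router_buffer_bits(width_bits, num_vcs, num_ports, depth_per_vc):
    """Total buffer storage in bits.

    Assumes one buffer per VC per port and a depth counted in
    width-wide words (both are assumptions, not stated on the slide).
    """
    return width_bits * num_vcs * num_ports * depth_per_vc


# Baseline router from the slide: Width 32, VCs 2, Ports 5, Buffer 10/VC.
baseline_bits = router_buffer_bits(32, 2, 5, 10)
print(baseline_bits, "bits")  # 3200 bits
```

Under these assumptions the storage grows linearly with each of the four parameters, so doubling the link width or the VC count doubles the buffering.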
Special Feature: Configurable topology
Assumed a mesh; can form any topology
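To make "router forwards data packet" concrete for the assumed mesh, here is a minimal sketch of hop-by-hop forwarding. The slides do not say which routing algorithm the routers use; dimension-ordered (XY) routing is assumed here purely for illustration, and `xy_route` is a hypothetical name.

```python
def xy_route(src, dst):
    """Return the (x, y) routers a packet visits from src to dst.

    Dimension-ordered routing (an assumption): travel along X until the
    column matches the destination, then travel along Y.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    # Packet from the router at (0, 0) to the router at (2, 1).
    print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

With soft links the same routers could be wired into a different topology, in which case the routing function would change but the forwarding idea stays the same.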
2. Embedded NoCs
Hard NoCs
“Hard” NoC = Hard Links + Hard Routers
[Figure: FPGA, Router, Router Logic, Logic blocks, Dedicated “hard” interconnect, Programmable “soft” interconnect]
Special Feature: Low-V mode
1.1 V → 0.9 V
Save 33% Dynamic Power
~15% slower
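The 33% figure is consistent with the standard first-order model in which dynamic power scales with the square of the supply voltage at a fixed clock frequency. The snippet below is a back-of-the-envelope check under that model, not the authors' measurement method; `dynamic_power_ratio` is a hypothetical helper.

```python
def dynamic_power_ratio(v_new, v_old, freq_ratio=1.0):
    """First-order CMOS model: P_dyn is proportional to V^2 * f."""
    return (v_new / v_old) ** 2 * freq_ratio


# Low-V mode: 1.1 V -> 0.9 V at the same clock frequency.
saving = 1.0 - dynamic_power_ratio(0.9, 1.1)
print(f"{saving:.1%}")  # 33.1%
```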
2. Embedded NoCs
Fabric Port
Bridge NoC and FPGA fabric:
• Width adaptation
• Frequency adaptation
• Voltage adaptation
• Bus protocol e.g. AXI
Outline
1. Why NoCs on FPGAs?
2. Embedded NoCs
3. Area & Power Analysis: Soft vs. Mixed vs. Hard, System Area/Power
4. Comparison Against P2P/Buses
3. Area/Power Analysis: Routers and Links
Router Microarchitecture
State-of-the-art router architecture from Stanford:
1. The NoC community has excelled at building on-chip routers:
   We just use it
2. To meet FPGA bandwidth requirements:
   High-performance router
3. Complex functionality such as virtual channels:
   Assigning traffic priority could be useful
[Figure: Input Modules 1 to 5, Output Modules 1 to 5, Virtual Channel (VC) Allocator, Switch Allocator, Crossbar Switch]
3. Area/Power Analysis
Hard Router vs. Soft Router
9X smaller, 2.4X faster, 1.4X lower power
30X smaller, 6X faster, 14X lower power
Hard Links vs. Soft Links
3. Area/Power Analysis
Soft, Mixed and Hard
64-node NoC on Stratix III [65 nm]

                Mixed / Hard           Soft
Area            576 LBs / 448 LBs      ~12,500 LBs
                (~1.5% of FPGA)        (33% of FPGA)
Speed           730 – 940 MHz          166 MHz
Bisection BW    ~ 50 GB/s              ~ 10 GB/s
Provides ~50 GB/s peak bisection bandwidth
Very cheap! Less than the cost of 3 soft nodes
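The quoted bisection bandwidths can be sanity-checked from the clock speeds. The sketch below assumes (none of this is stated on the slide) that the 64 nodes form an 8×8 mesh, that links are 32 bits wide as in the baseline router, and that both directions of each link crossing the cut are counted. `mesh_bisection_bw_gbytes` is a hypothetical helper.

```python
def mesh_bisection_bw_gbytes(side, link_bits, freq_mhz, bidirectional=True):
    """Peak bisection bandwidth in GB/s of a side x side mesh.

    Cutting the mesh in half severs `side` links, counted once per
    direction when bidirectional is True.
    """
    links = side * (2 if bidirectional else 1)
    bytes_per_cycle = links * link_bits / 8
    return bytes_per_cycle * freq_mhz * 1e6 / 1e9


for name, mhz in [("Soft", 166), ("Mixed/Hard (low end)", 730),
                  ("Mixed/Hard (high end)", 940)]:
    print(f"{name}: {mesh_bisection_bw_gbytes(8, 32, mhz):.1f} GB/s")
```

Under these assumptions the soft NoC lands near 10.6 GB/s and the embedded NoCs between roughly 47 and 60 GB/s, in line with the "~10 GB/s" and "~50 GB/s" on the slide.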
3. Area/Power Analysis
NoC Power Budget
How much is used for system-level communication?
Typical FPGA Dynamic Power: 17.4 W (largest Stratix-III device)
NoC power at 250 GB/s total bandwidth, relative to typical FPGA dynamic power:
Soft NoC            123%
Mixed NoC           15%
Hard NoC            11%
Hard NoC (Low-V)    7%
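For readers who prefer absolute numbers, the percentages on this slide (123%, 15%, 11% and 7% for the Soft, Mixed, Hard and Hard Low-V NoCs) can be converted to watts against the 17.4 W typical dynamic power. This is plain arithmetic on the slide's figures; the dictionary and function names are illustrative.

```python
TYPICAL_DYNAMIC_POWER_W = 17.4  # typical FPGA dynamic power from the slide

# NoC power at 250 GB/s total bandwidth, as a fraction of the budget above.
NOC_SHARE = {
    "Soft NoC": 1.23,
    "Mixed NoC": 0.15,
    "Hard NoC": 0.11,
    "Hard NoC (Low-V)": 0.07,
}


def noc_power_watts(share, budget_w=TYPICAL_DYNAMIC_POWER_W):
    """Convert a fractional share of the power budget into watts."""
    return share * budget_w


for name, share in NOC_SHARE.items():
    print(f"{name}: {noc_power_watts(share):.2f} W")
```

The soft NoC alone would draw about 21.4 W, more than the whole typical budget, while the embedded variants stay between roughly 1.2 W and 2.6 W.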
3. Area/Power Analysis
Bandwidth in Perspective
[Figure: DDR3, PCIe, Module 1, Module 2; four labels of 14.6 GB/s and four labels of 17 GB/s]
Full theoretical BW
Aggregate Bandwidth: 126 GB/s
NoC Power Budget: 3.5%
Cross whole chip!
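The aggregate figure is the sum of the eight bandwidth labels in the figure. The 3.5% power budget can also be approximated from the earlier power-budget slide if two things are assumed, neither of which the slides state: NoC power scales linearly with bandwidth, and the 7% at 250 GB/s figure (Hard NoC, Low-V) is the reference point. `scaled_power_share` is a hypothetical helper.

```python
# Eight streams from the figure: four at 14.6 GB/s and four at 17 GB/s.
streams_gbytes = [14.6] * 4 + [17.0] * 4
aggregate = sum(streams_gbytes)


def scaled_power_share(share_at_ref, bw_gbytes, ref_bw_gbytes=250.0):
    """Scale a power share linearly with bandwidth (an assumption)."""
    return share_at_ref * bw_gbytes / ref_bw_gbytes


print(f"Aggregate: {aggregate:.1f} GB/s")                    # 126.4 GB/s
print(f"Share: {scaled_power_share(0.07, aggregate):.1%}")   # 3.5%
```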
Outline
1. Why NoCs on FPGAs?
2. Embedded NoCs
3. Area & Power Analysis
4. Comparison Against P2P/Buses: Point-to-point links, Qsys Buses
4. Comparison
FPGA Interconnect
Interconnect = Just wires
• Point-to-point Links (1 to 1)
• Broadcast (1 to n)
Interconnect = Wires + Logic
• Multiple Masters (n to 1, Mux + Arbiter)
• Multiple Masters, Multiple Slaves (n to n, Mux + Arbiter)
Interconnect = NoC
• Nodes 1 .. n
Compare “wires” interconnect to NoCs
4. Comparison
NoC Power vs. FPGA Interconnect
Hard and Mixed NoCs: Area/Power Efficient
• 1% area overhead on Stratix 5
• Runs at 730-943 MHz
• Power on-par with simplest FPGA interconnect
• High Performance / Packet Switched
[Plot labels: Length of 1 NoC Link; 200 MHz]
4. Comparison
DDR3: Qsys Bus vs. NoC
Qsys bus: Build logical bus from fabric
Embedded NoC: 16 Nodes, hard routers & links
4. Comparison
Design Effort
• Steps to close timing using Qsys
[Figure labels: FPGA; close; far]
Timing closure can be simplified with an embedded NoC
4. Comparison
Area Comparison
Entire NoC smaller than bus for 3 modules!
With 1/8 of Hard NoC BW used, already less area for most systems
4. Comparison
Power Comparison
Hard NoC saves power for even the simplest systems
1. Why NoCs on FPGAs?
• Big city needs freeways to handle traffic
2. Embedded NoCs: Mixed & Hard
• Area: 20-23X
• Speed: 5-6X
• Power: 9-15X
3. Area & Power Analysis
• Area Budget for 64 nodes: ~1%
• Power Budget for 100 GB/s: 3-7%
4. Comparison Against P2P/Buses
• Raw efficiency close to simplest P2P links
• NoC more efficient & lower design effort
Thank You!
eecg.utoronto.ca/~mohamed/noc_designer.html
2. Embedded NoCs
Fabric Port
200 MHz 128-bit module, 900 MHz 32-bit router?
• Configurable time-domain mux / demux: match bandwidth
• Asynchronous FIFO: cross clock domains
• Full NoC bandwidth, w/o clock restrictions on modules
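The bandwidth matching in this example can be checked with a few lines of arithmetic. The sketch below only compares raw width × frequency on each side of the fabric port and derives the time-domain multiplexing ratio; it ignores packet headers and FIFO behaviour, and all function names are hypothetical.

```python
def raw_bandwidth_gbits(width_bits, freq_mhz):
    """Raw bandwidth in Gb/s for a given datapath width and clock."""
    return width_bits * freq_mhz / 1000.0


def tdm_ratio(module_bits, router_bits):
    """Number of router-width words needed per module-width word."""
    return -(-module_bits // router_bits)  # ceiling division


def min_router_freq_mhz(module_bits, module_mhz, router_bits):
    """Slowest router clock that still keeps up with the module."""
    return module_mhz * module_bits / router_bits


module_bw = raw_bandwidth_gbits(128, 200)   # 25.6 Gb/s
router_bw = raw_bandwidth_gbits(32, 900)    # 28.8 Gb/s
print(module_bw, router_bw, tdm_ratio(128, 32), min_router_freq_mhz(128, 200, 32))
```

Each 128-bit word is sent as four 32-bit words, so any router clock of 800 MHz or more keeps up with the 200 MHz module; the 900 MHz router in the example has some headroom, and the asynchronous FIFO handles the clock-domain crossing between the two.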
1. Why NoCs on FPGAs?
Compute Acceleration
• Maxeler
  • Geoscience (14x, 70x)
  • Financial analysis (5x, 163x)
• Altera OpenCL
  • Video compression (3x, 114x)
  • Information filtering (5.5x)
Legend: GPU, CPU
NoC