High-Level Interconnect Architectures for FPGAs
Nick Barrow-Williams
2
Introduction
The semiconductor industry has grown rapidly for several decades
Continued shrinking of device dimensions introduces new design challenges
Moving data around a chip can now be the limiting factor of performance
Existing solutions do not scale well
3
Why do existing solutions not scale?
Global connections are longer
Wire depth increased to counter width decrease
Parasitic capacitive effects increase and cause slow signal propagation
4
Why do existing solutions not scale?
Existing system-level connection uses buses
Buses increase resource efficiency and decrease wiring congestion
Not suitable for a large number of modules
A network based alternative would offer higher aggregate bandwidth
5
Why design for FPGA systems?
FPGA silicon area already dominated by wiring
Global wires are limited in number
Increasing gate count only increases wiring congestion
6
The Solution: Network-on-Chip
Use technologies from network systems
Replace inefficient global wiring with high-level interconnection network
Create scalable systems to handle large numbers of modules
7
Existing Solutions
Most existing systems are for ASIC designs: Stanford Interconnect, RAW, SCALE, SPIN
PNoC: a solution for FPGAs, but complex with a high hardware cost
Other simulated solutions exist but few are implemented
8
Proposal: Two network systems
Existing solutions use either packet switching or circuit switching techniques
Design, implement, test and synthesise one of each to compare performance and hardware cost
Map solutions to an FPGA platform to evaluate hardware cost in current generation systems
9
Network Architecture Design
Topology requirements: simple, scalable, two-dimensional
Solution: 2D mesh topology
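To make the 2D mesh concrete, here is a minimal Python sketch that builds the adjacency map of a width x height mesh. The `(x, y)` coordinate convention and the name `build_mesh` are illustrative assumptions, not taken from the slides.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (x, y) router coordinate; this convention is an assumption


def build_mesh(width: int, height: int) -> Dict[Node, List[Node]]:
    """Return an adjacency map for a width x height 2D mesh."""
    mesh: Dict[Node, List[Node]] = {}
    for x in range(width):
        for y in range(height):
            neighbours: List[Node] = []
            # Each router links only to its nearest neighbours in X and Y.
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    neighbours.append((nx, ny))
            mesh[(x, y)] = neighbours
    return mesh


if __name__ == "__main__":
    mesh = build_mesh(4, 4)
    links = sum(len(n) for n in mesh.values())
    print(len(mesh), "routers,", links, "unidirectional links")
```

Every link is short and local, and adding a row or column of routers leaves the existing routers unchanged, which is why the mesh meets the "simple" and "scalable" requirements.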
10
Network Architecture Design
Routing Algorithm
Deterministic: data always follows the same path through the network; simple hardware; sensitive to congestion
Adaptive: paths through the network can change according to load; complex hardware; avoids congestion
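The slides do not name the deterministic algorithm that was used. As an illustration only, the sketch below assumes dimension-order (XY) routing, a common deterministic scheme for meshes: a message first travels along X until the column matches, then along Y. The function names are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (x, y) coordinate; the ordering is an assumption


def xy_route(src: Node, dst: Node) -> List[Node]:
    """Return every node visited from src to dst, X dimension first, then Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hop_count(src: Node, dst: Node) -> int:
    """Number of links traversed, equal to the Manhattan distance."""
    return len(xy_route(src, dst)) - 1


if __name__ == "__main__":
    print(xy_route((0, 3), (1, 0)))
    print(hop_count((0, 3), (1, 0)))
```

Because the path depends only on source and destination, the same pair always takes the same route, whatever the load. The hop count equals the Manhattan distance, which agrees with the hop counts in the sample testbench output later in the deck (for example, 4 hops from (0,3) to (1,0)).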
11
Network Architecture Design
When choosing a routing algorithm, two hazards must be avoided:
Deadlock. Solution: use unidirectional wiring and allow each node to make two connections
Livelock. Solution: use deterministic routing
12
Network Architecture Design
Flow control methods
Circuit switched: a circuit request propagates through the network; a path is reserved to the destination; a grant signal propagates back; data is sent, then the circuit is deallocated
Packet switched: packets use a header, body and tail; wormhole routing forwards the header and body without waiting for the tail; buffers are needed to store stalled packets
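The circuit switched sequence (request, reserve, grant, send, deallocate) can be sketched as link reservation along a path. This is a behavioural model under stated assumptions, not the routers' VHDL. The class `CircuitNetwork` is hypothetical, and the choice to cancel a blocked request rather than make it wait is an assumption the slides do not confirm.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]
Link = Tuple[Node, Node]  # unidirectional link (from_node, to_node)


class CircuitNetwork:
    """Tracks which unidirectional links are currently reserved by a circuit."""

    def __init__(self) -> None:
        self.reserved: Set[Link] = set()

    def request(self, path: List[Node]) -> bool:
        """Reserve links hop by hop; True models the grant returning to the source."""
        taken: List[Link] = []
        for link in zip(path, path[1:]):
            if link in self.reserved:
                # Assumption: a blocked request gives up and frees what it reserved.
                for held in taken:
                    self.reserved.discard(held)
                return False
            self.reserved.add(link)
            taken.append(link)
        return True

    def release(self, path: List[Node]) -> None:
        """Deallocate the circuit once the data has been sent."""
        for link in zip(path, path[1:]):
            self.reserved.discard(link)


if __name__ == "__main__":
    net = CircuitNetwork()
    first = [(0, 0), (1, 0), (2, 0)]
    second = [(1, 1), (1, 0), (2, 0)]
    print(net.request(first))   # circuit set up
    print(net.request(second))  # blocked on the shared link (1,0)->(2,0)
    net.release(first)
    print(net.request(second))  # succeeds after deallocation
```

No buffering is needed once a circuit is granted, which fits the statement that FIFOs are present only in the packet switched router. The cost is that a reserved path blocks other traffic for the whole transfer.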
13
Router Design Each router contains a number of modules
FIFOs (only present in packet switched router)
Address to port-request decoder
Arbiter
Control finite state machines
Crossbar
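The Arbiter decides which competing input port may drive a crossbar output. The slides do not state the arbitration policy, so the sketch below assumes round-robin, with five requesters to match the five queue modules in the packet switched router diagram. `RoundRobinArbiter` is a hypothetical name.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Grant one of n requesters at a time, rotating priority after each grant."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.last = n - 1  # requester 0 has the highest priority initially

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Return the index of the granted requester, or None if nobody requests."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx  # the winner becomes lowest priority next time
                return idx
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(5)
    requests = [True, False, True, False, False]
    print([arbiter.grant(requests) for _ in range(4)])
```

Rotating the priority prevents one port from starving the others. A fixed-priority arbiter would need less hardware, but it can starve low-priority ports under sustained load.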
14
Circuit Switched Router Structure
[Figure: circuit switched router block diagram. Blocks: In & Out Ports, Crossbar, FSM, Arbiter, Address to Port Decoder. Signals: Request In, Request Out, Grant In, Grant Out, Data In, Data Out]
15
Packet Switched Router Structure
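[Figure: packet switched router block diagram. Blocks: In & Out Ports, Crossbar, Control, Arbiter, Address to Port Decoder, 5 Queue Modules (each with a FIFO and FSM). Router signals: Request From FIFOs, Data From FIFOs, Request In, Write Out, Full In, Grant Out, Data In, Data Out. Queue module signals: Full, Write, Grant, Req, Data]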
16
Router Implementation and Testing
Both routers were coded using VHDL
Simulation and testing used a combination of ModelSim and Xilinx ISE 9.1
Ad-hoc tests used for individual modules
VHDL testbench used for system verification
17
Testbench Structure
[Figure: testbench block diagram containing the Mesh Network, ReadInput, Input Tables, TestTable, Source, Sink, OutputTable, Compare, Clock Gen, Reset Gen and Cycle Count blocks, with a Command File and an Output File]
Sample output file:
Success: ID: 1 Source : (0,3) Dest : (1,0) Hops : 4 Latency: 34
Success: ID: 2 Source : (0,2) Dest : (1,0) Hops : 3 Latency: 27
Success: ID: 3 Source : (3,2) Dest : (1,1) Hops : 3 Latency: 22
Success: ID: 4 Source : (1,3) Dest : (0,1) Hops : 3 Latency: 22
Success: ID: 5 Source : (3,0) Dest : (3,1) Hops : 1 Latency: 12
Sample command file:
#START  SOURCE  DEST  SIZE  ID
# ------------------------------------------------------
2       3 0     0 1   8     1
3       2 0     0 1   2     2
3       2 3     1 1   2     3
4       3 1     1 0   8     4
5       0 3     1 3   7     5
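A command file of this shape is straightforward to read mechanically. The sketch below assumes one transfer per line with seven integers (START, two for SOURCE, two for DEST, SIZE, ID). That layout is inferred from the five column names and the 35 values shown, and the names `Transfer` and `parse_commands` are hypothetical. It also checks that the Manhattan distance of each transfer matches the hop count reported in the sample output.

```python
from typing import List, NamedTuple, Tuple


class Transfer(NamedTuple):
    start: int               # START column (units not stated in the slides)
    source: Tuple[int, int]  # SOURCE column, two coordinates
    dest: Tuple[int, int]    # DEST column, two coordinates
    size: int                # SIZE column
    ident: int               # ID column


def parse_commands(text: str) -> List[Transfer]:
    """Parse command-file text, skipping blank lines and '#' comment lines."""
    transfers: List[Transfer] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = [int(tok) for tok in line.split()]
        if len(f) != 7:
            raise ValueError(f"expected 7 fields, got {len(f)}: {line!r}")
        transfers.append(Transfer(f[0], (f[1], f[2]), (f[3], f[4]), f[5], f[6]))
    return transfers


def manhattan(t: Transfer) -> int:
    """Minimum number of hops between source and destination on a mesh."""
    return abs(t.source[0] - t.dest[0]) + abs(t.source[1] - t.dest[1])


COMMANDS = """\
#START SOURCE DEST SIZE ID
# ------------------------------------------------------
2 3 0 0 1 8 1
3 2 0 0 1 2 2
3 2 3 1 1 2 3
4 3 1 1 0 8 4
5 0 3 1 3 7 5
"""

if __name__ == "__main__":
    for t in parse_commands(COMMANDS):
        print(t.ident, t.source, t.dest, manhattan(t))
```

The output file prints each coordinate pair in the opposite order to the command file (ID 1 is `3 0` to `0 1` in the command file, and (0,3) to (1,0) in the output). The Manhattan distance is unaffected by that swap.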
18
Synthesis
Each router was synthesised for a Virtex-4 LX platform
Post-synthesis verification
Resource usage
Timing
19
Circuit Switched Resource Usage
[Figure: two pie charts, LUTs and Flip-Flops, each split by Crossbar, Address to Port Decoder, Arbiter and FSM. Data labels as extracted: 223, 60, 203, 62, 16080, 23, 24]
Total of 586 4-input LUTs
~0.1% of a Virtex 5
Total of 202 flip-flops
20
Packet Switched Resource Usage
[Figure: two pie charts, LUTs and Flip-Flops, each split by Crossbar, Address to Port Decoder, Arbiter, Control and Queue. Data labels as extracted: 211, 34, 203, 40, 240, 90, 15185, 105]
Total of 786 4-input LUTs
+34% compared to circuit switched
Total of 237 flip-flops
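The +34% figure follows directly from the LUT totals quoted on the two resource usage slides. The short calculation below reproduces it and applies the same formula to the flip-flop totals. The flip-flop percentage is not stated on the slides and is derived here only from the quoted totals.

```python
def relative_increase(new: int, base: int) -> float:
    """Percentage increase of new over base."""
    return (new - base) / base * 100.0


# Totals as quoted on the resource usage slides.
CIRCUIT = {"luts": 586, "ffs": 202}
PACKET = {"luts": 786, "ffs": 237}

lut_overhead = relative_increase(PACKET["luts"], CIRCUIT["luts"])
ff_overhead = relative_increase(PACKET["ffs"], CIRCUIT["ffs"])

if __name__ == "__main__":
    print(f"LUT overhead: {lut_overhead:.1f}%")
    print(f"Flip-flop overhead: {ff_overhead:.1f}%")
```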
21
Timing Results
              Circuit Switched   Packet Switched
Max Freq      126.330 MHz        144.533 MHz
Setup time    5.308 ns           6.125 ns
Hold time     0.272 ns           0.272 ns
Critical path is through the Arbiter in both designs
22
Project Appraisal
Maintaining an accurate software simulation proved difficult
A great deal was learnt during the implementation of the circuit switched network
HDL implementations are only prototypes
Testbench provides a good framework but more time is needed to gather performance data
23
Conclusions
Possible to make low complexity network-on-chip systems suitable for FPGAs
Latency has to be traded for throughput
Hard to collect performance data without application driven benchmarks
Both networks are viable so why not use both?
24
Future Work
Cycle accurate software simulations
Application driven benchmarking
Serial transmission
Power efficiency
Industry standard solution