High-Level Interconnect Architectures for FPGAs

Nick Barrow-Williams

2

Introduction

The semiconductor industry has grown rapidly for several decades

Continued shrinking of device dimensions introduces new design challenges

Moving data around a chip can now be the limiting factor of performance

Existing solutions do not scale well

3

Why do existing solutions not scale?

Global connections are longer

Wire depth increased to counter width decrease

Parasitic capacitive effects increase and cause slow signal propagation

4

Why do existing solutions not scale?

Existing system-level connection uses buses

Buses increase resource efficiency and decrease wiring congestion

Not suitable for a large number of modules

A network-based alternative would offer higher aggregate bandwidth

5

Why design for FPGA systems?

FPGA silicon area already dominated by wiring

Global wires are limited in number

Increasing gate count only increases wiring congestion

6

The Solution: Network-on-Chip

Use technologies from network systems

Replace inefficient global wiring with high-level interconnection network

Create scalable systems to handle large numbers of modules

7

Existing Solutions

Most existing systems are for ASIC designs: Stanford Interconnect, RAW, SCALE, SPIN

PNoC: a solution for FPGAs, but complex and with high hardware cost

Other simulated solutions exist but few are implemented

8

Proposal: Two network systems

Existing solutions use either packet switching or circuit switching techniques

Design, implement, test and synthesise one of each to compare performance and hardware cost

Map solutions to an FPGA platform to evaluate hardware cost in current generation systems

9

Network Architecture Design

Topology: simple, scalable, 2-dimensional

Solution: 2D mesh topology

10

Network Architecture Design

Routing Algorithm

Deterministic: data always follows the same path through the network; simple hardware; sensitive to congestion

Adaptive: paths through the network can change according to load; complex hardware; avoids congestion
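As one concrete instance of deterministic routing on a 2D mesh, the sketch below uses dimension-order (XY) routing: move along X until the column matches, then along Y. This is an assumption for illustration only. The slides do not say which deterministic algorithm the routers use, and the name `xy_route` and the (x, y) coordinate convention are hypothetical.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst, correcting X first, then Y."""
    x, y = src
    path = [(x, y)]
    # Step along the X dimension until the destination column is reached.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then step along the Y dimension until the destination row is reached.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 3), (1, 0))
hops = len(route) - 1  # number of links traversed
```

Because the path depends only on source and destination, the same pair always takes the same route, which is what keeps the hardware simple and also what makes it sensitive to congestion.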

11

Network Architecture Design

When choosing a routing algorithm, two conditions must be avoided:

Deadlock. Solution: use unidirectional wiring and allow each node to make two connections

Livelock. Solution: use deterministic routing

12

Network Architecture Design

Flow control methods

Circuit switched: a circuit request propagates through the network; a path is reserved to the destination; a grant signal propagates back; data is sent, then the circuit is deallocated

Packet switched: packets use a header, body and tail; wormhole routing forwards the header and body without waiting for the tail; buffers are needed to store stalled packets
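The circuit switched sequence (request, reserve, grant, send, deallocate) can be sketched as a small software model. This is a toy illustration, not the VHDL design: the class `CircuitNetwork`, its method names, and the policy of releasing a partially reserved path when a request is blocked are all assumptions.

```python
class CircuitNetwork:
    """Toy model of circuit setup and teardown over directed links."""

    def __init__(self):
        # Directed links (from_node, to_node) currently held by a circuit.
        self.reserved = set()

    def request(self, path):
        """Reserve each link along path in order; True means a grant comes back."""
        taken = []
        for link in zip(path, path[1:]):
            if link in self.reserved:
                # Assumed policy: a blocked request gives back its partial path.
                self.reserved.difference_update(taken)
                return False
            self.reserved.add(link)
            taken.append(link)
        return True

    def release(self, path):
        """Deallocate the circuit once the data has been sent."""
        self.reserved.difference_update(zip(path, path[1:]))


net = CircuitNetwork()
a = [(0, 0), (1, 0), (2, 0)]
b = [(1, 0), (2, 0), (2, 1)]
first = net.request(a)   # circuit granted
second = net.request(b)  # blocked: link (1,0)->(2,0) is already reserved
net.release(a)
third = net.request(b)   # granted once the first circuit is deallocated
```

The model makes the cost of circuit switching visible: while a circuit is held, every link on its path is unavailable to other traffic, whereas a packet switched network would buffer the stalled packet instead.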

13

Router Design

Each router contains a number of modules:

FIFOs (only present in packet switched router)

Address to port-request decoder

Arbiter

Control finite state machines

Crossbar
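To make two of these modules concrete, the sketch below models an address to port-request decoder and an arbiter in software. Everything specific here is an assumption: the five port names, the choice of "north" as increasing y, X-first decoding, and the round-robin arbitration policy are illustrative and are not taken from the VHDL design.

```python
def decode_port(router, dest):
    """Map a destination address to the output port to request (X first, then Y)."""
    rx, ry = router
    dx, dy = dest
    if dx > rx:
        return "east"
    if dx < rx:
        return "west"
    if dy > ry:
        return "north"  # assumed convention: north is increasing y
    if dy < ry:
        return "south"
    return "local"      # destination reached: deliver to the attached module


class RoundRobinArbiter:
    """Grant one of n requesters, rotating priority after every grant."""

    def __init__(self, n):
        self.n = n
        self.last = n - 1  # so that input 0 has the highest priority first

    def grant(self, requests):
        """requests is a list of n booleans; return the granted index or None."""
        for offset in range(1, self.n + 1):
            i = (self.last + offset) % self.n
            if requests[i]:
                self.last = i
                return i
        return None


arb = RoundRobinArbiter(5)
grants = [arb.grant([True, False, True, False, False]) for _ in range(3)]
```

With inputs 0 and 2 both requesting continuously, the grants alternate between them, so neither input can starve the other.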

14

Circuit Switched Router Structure

[Figure: block diagram of the circuit switched router. Blocks: In & Out Ports, Crossbar, FSM, Arbiter, Address to Port Decoder. Signals: Request In, Request Out, Grant In, Grant Out, Data In, Data Out.]

15

Packet Switched Router Structure

[Figure: block diagram of the packet switched router. Blocks: In & Out Ports, Crossbar, Control, Arbiter, Address to Port Decoder, 5 Queue Modules (FIFO and FSM). Signals: Request From FIFOs, Request In, Write Out, Full In, Grant Out, Data From FIFOs, Data In, Data Out. Queue module signals: Full, Write, Grant, Req, Data.]

16

Router Implementation and Testing

Both routers were coded using VHDL

Simulation and testing used a combination of ModelSim and Xilinx ISE 9.1

Ad-hoc tests used for individual modules

VHDL testbench used for system verification

17

Testbench Structure

[Figure: testbench block diagram. Blocks: Command File, ReadInput, Input Tables (TestTable, OutputTable), Source, Mesh Network, Sink, Compare, Output File, Clock Gen, Reset Gen, Cycle Count.]

Example command file:

#START  SOURCE  DEST  SIZE  ID
# ------------------------------
2       3 0     0 1   8     1
3       2 0     0 1   2     2
3       2 3     1 1   2     3
4       3 1     1 0   8     4
5       0 3     1 3   7     5

Example output file:

Success: ID: 1 Source : (0,3) Dest : (1,0) Hops : 4 Latency: 34
Success: ID: 2 Source : (0,2) Dest : (1,0) Hops : 3 Latency: 27
Success: ID: 3 Source : (3,2) Dest : (1,1) Hops : 3 Latency: 22
Success: ID: 4 Source : (1,3) Dest : (0,1) Hops : 3 Latency: 22
Success: ID: 5 Source : (3,0) Dest : (3,1) Hops : 1 Latency: 12
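A quick sanity check on the example output is possible in software: on a 2D mesh with minimal routing, the hop count should equal the Manhattan distance between source and destination. The sketch below applies that check to the five reported packets. The dictionary layout and the helper name `manhattan` are hypothetical, and the check assumes minimal routing, which the slides do not state explicitly.

```python
# Values copied from the example output file: id -> (source, dest, hops, latency).
results = {
    1: ((0, 3), (1, 0), 4, 34),
    2: ((0, 2), (1, 0), 3, 27),
    3: ((3, 2), (1, 1), 3, 22),
    4: ((1, 3), (0, 1), 3, 22),
    5: ((3, 0), (3, 1), 1, 12),
}


def manhattan(a, b):
    """Number of mesh links on any minimal path between nodes a and b."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# IDs whose reported hop count disagrees with the minimal-path distance.
mismatches = [
    pid for pid, (src, dst, hops, _latency) in results.items()
    if manhattan(src, dst) != hops
]
```

All five reported hop counts agree with the Manhattan distance, so `mismatches` is empty for this data.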

18

Synthesis

Each router was synthesised for a Virtex-4 LX platform

Post-synthesis verification

Resource usage

Timing

19

Circuit Switched Resource Usage

[Charts: LUT and flip-flop usage by module (Crossbar, Address to Port Decoder, Arbiter, FSM)]

Total of 586 4-input LUTs

~0.1% of a Virtex 5

Total of 202 flip-flops

20

Packet Switched Resource Usage

[Charts: LUT and flip-flop usage by module (Crossbar, Address to Port Decoder, Arbiter, Control, Queue)]

Total of 786 4-input LUTs

+34% compared to circuit switched

Total of 237 flip-flops
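The +34% figure follows directly from the two LUT totals, and the same arithmetic gives the flip-flop overhead, which the slides do not quote. A minimal sketch, with hypothetical variable names:

```python
# Totals quoted on the two resource usage slides.
circuit = {"luts": 586, "flip_flops": 202}
packet = {"luts": 786, "flip_flops": 237}

# Percentage increase of the packet switched router over the circuit switched one.
overhead = {
    key: 100.0 * (packet[key] - circuit[key]) / circuit[key]
    for key in circuit
}
```

This gives roughly 34% more LUTs and roughly 17% more flip-flops for the packet switched router.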

21

Timing Results

              Circuit Switched    Packet Switched
Max Freq      126.330 MHz         144.533 MHz
Setup time    5.308 ns            6.125 ns
Hold time     0.272 ns            0.272 ns

Critical path is through the Arbiter in both designs
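For comparison with the setup and hold times, each maximum frequency can be converted into a minimum clock period using period = 1 / frequency. A small sketch of that conversion, with hypothetical variable names:

```python
# Maximum clock frequencies from the timing results, in MHz.
max_freq_mhz = [126.330, 144.533]

# 1 / (f MHz) = (1000 / f) ns, so each entry is the minimum clock period in ns.
min_period_ns = [1000.0 / f for f in max_freq_mhz]
```

The two periods come out at roughly 7.92 ns and 6.92 ns respectively.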

22

Project Appraisal

Maintaining an accurate software simulation proved difficult

A great deal was learnt during the implementation of the circuit switched network

HDL implementations are only prototypes

Testbench provides a good framework but more time is needed to gather performance data

23

Conclusions

It is possible to build low-complexity network-on-chip systems suitable for FPGAs

Latency has to be traded for throughput

Hard to collect performance data without application driven benchmarks

Both networks are viable so why not use both?

24

Future Work

Cycle accurate software simulations

Application driven benchmarking

Serial transmission

Power efficiency

Industry standard solution
