gordon brebner xilinx research labs ee 392 i field programmable technology for mainstream processing...

Gordon BrebnerXilinx Research Labs

EE 392 I

Field Programmable Technology for Mainstream Processing

27 April 2010

© Copyright 2009 Xilinx

What this talk is about

Overview of the Field Programmable Gate Array (FPGA)

Processing opportunities for FPGAs

The relationship between FPGAs and CPUs

Programming models for FPGAs

2


Field programmable technology

“Field programmable” means “programmable in the field”

In other words, it is a “soft hardware” technology

Lingering historical stereotype:

More advanced stereotype: the sea of programmable logic gates

Today’s focus: the Field Programmable Gate Array (FPGA)

The Programmable Logic Array (PLA)

3


A contemporary FPGA: the Xilinx Virtex-6

Example: XC6VHX565T– 566,784 programmable logic cells

– 32.832 Mb block memory

– 864 DSP blocks

– 720 parallel input/outputs

– 24 GTH transceivers (to 11.2G)

– 48 GTX transceivers (to 6.6G)

– 4 PCI Express blocks– 4 Tri-mode MAC blocks

Importantly, the components are interconnected by a very large amount of programmable wiring

Programmable logic

DSP block

Block memory

High-speed transceiver

Parallel input/output

System Monitor

Tri-mode Ethernet MAC

PCI Express interface

High performance clockingHigh performance clocking

4


Original FPGA use situations

Prototyping ASICs– Classic style of chip design flow

– Generate FPGA programming information instead of mask data

Providing glue logic for systems– Implement random logic to connect together other chips

– Typically small-scale functions

Implementing input/output interfaces– Do heavy lifting for getting bits in and out quickly

– Exploit the support for numerous physical i/o standards

5


Positioning FPGAs for mainstream processing

Moving the FPGA to the heart of the system– As a first-order processing component

Background: FPGAs have the raw processing capacity– Approaching millions of programmable logic cells

– Other hardened processing components

– Embedded memory blocks

Question: how do you organize this low-level capacity?– What soft processing architectures might be configured?

– How can a high level programming experience be provided?

6


FPGAs for Digital Signal Processing (DSP)

Best known FPGA processing domain: streaming data flow– Low level: filters, DFT/FFT, coders/decoders, etc.

– High level: wireless (e.g. 802.16, LTE), video (e.g., H.264, MPEG)

Example of benefit:

7


FPGAs for High Performance Computing (HPC)

Increasing area of application– Enabled by the increasing capabilities of FPGA

– Scale down supercomputing techniques to FPGA-style parallelism

Example areas:– Finance: automated stock trading, etc.

– Bioinformatics: gene sequencing, etc.

– Cryptography

– Meteorology

Lack of native support for floating point arithmetic has retarded adoption in certain areas

8

SAN JOSE, Calif., April 6, 2010

Xilinx Helps University of Regensburg Launch the World's Most Power-Efficient Supercomputer

… QPACE; a bespoke supercomputer developed to unlock the mysteries of Quantum Chromodynamics.


Network processing contexts

9

NIC card

Standalone switch/router

Telecom line card


The NetFPGA platform

• Stanford/Xilinx collaboration• Worldwide teaching and research ecosystem• Four gigabit Ethernet ports; PCI interface to host PC• Four x 10G version imminent

netfpga.org

10


NetFPGA projects

11


Processing on a telecom line card

12

Bac

kpla

ne

Framer /MAC /

Mapper

Framer /MAC /

MapperPHYPHY Packet

ProcessorPacket

ProcessorTraffic

ManagerTraffic

Manager

Bridging /BackplaneInterface

Bridging /BackplaneInterface

OpticsOptics

Frame data in physical blocksCompute error check codes

Police and shape flow ratesQueue and schedule packets

Classify packets into flows

Parse and edit packets


FPGAs for Network Processing

Physical line interface and backplane interface– Standard input/output interfacing use case for FPGAs

Framing and error checking– Moving bits to the right places

– Computing standard coding functions

Packet processing– Classification: computing hashing functions, lookup in binary/ternary CAM

– Inspection and surgery: programmable parsing and editing of packets

Traffic management– Moving packets to the right places, and storing them

– Computing policing and scheduling functions

Functions can be carried out at line rates of 100 Gb/s and above

13


Example of FPGA in virtual networking node

Dynamically managed physical Hardware

High throughput data plane

Triply redundant datapath

Security algorithm

acceleration

Virtual node 1

Virtual node 2

Virtual node 3

Highly secure networkHigh reliability sensor network

High speed e-science network

14


Relationship between FPGAs and CPUs

FPGA processing and CPU processing are complementary– FPGA better for bit-level, high-speed streaming, regular, settings

– CPU better for word-level, random access, irregular, settings

Processors can be embedded into FPGAs– Hardened in the silicon

• Example: PowerPC included in several Xilinx Virtex generations

• Example: ARM – just announced by Xilinx today

– Soft• Built out of programmable logic

• Can now fit in multi-core architectures with 100s of processors

In principle, programmable logic can be embedded into CPUs

15


Cohabitation model #1: Instruction acceleration

CPU instruction set augmented by custom instructions

When a custom instruction is executed:

1. Instruction operands are presented over a CPU data bus

2. Become inputs to a processing block on the FPGA

3. Processing operation is carried out

4. Outputs directed from processing block to data bus

5. Instruction result(s) are obtained from CPU data bus

Model applicable whether CPU is internal or external to the FPGA– Latency is higher in the latter case of course

16


Cohabitation model #2: Function acceleration

More loosely-coupled hardware acceleration

CPU requests function computation by FPGA– Library function call; or

– Operating system function; or

– Input/output operation

FPGA receives function request, with function arguments

FPGA returns function result(s)

Latency may be hidden by multi-threading or multi-tasking

17


Cohabitation model #3: Peer processing

Example: Xilinx/Intel

Use Intel Front Side Bus (FSB)

Cache to cache data transfer– Bypass slow system memory

Heterogeneous Multi-core– x86 cores, FPGA processing elements

Performance– 8.5 GB/s bandwidth– 105ns latency for 64 bytes

Programming Model– Global shared memory space– Always consistent (HW coherency)

18


Cohabitation model #4: Software deceleration

Inverse of conventional hardware acceleration model

Mainstream high-speed processing functions are in the FPGA

CPU helps out the FPGA– Infrequent processing corner cases

– Back office data processing after initial data filtering

– Acceptable to have a slower (decelerated) processing rate

Notable use case is in network processing– FPGA does generic processing at high speed line rates

– CPU gets involved with low probability, or at lower speed data rate

19


Cohabitation model #5: Data plane, control plane

Loosely coupled relationship

FPGA is carrying out processing functions

CPU performs control and configuration functions– Initial configuration of the FPGA

– Updates to FPGA configuration, or to shared memories

CPU handles events– Error situations detected by FPGA

– Status updates from the FPGA

20


Programming models for FPGAs

Programming an FPGA is typically a hardware design experience– Reflecting the FPGA seen as an ASIC substitute

As FPGAs get more complex, so this gets more complex– More modular approaches are now used, to encourage design re-use

Deterrent if FPGA is to be used for mainstream processing– Many more designers have a software background

Programming challenges are not restricted to FPGAs– New models are needed to exploit multi-core and GPUs, for example

21


Standard FPGA programming model

22


High-level synthesis: “C to gates”

Idea is to compile C/C++ to FPGA as alternative to CPU

Usually only involves ‘easier’ subsets of full C

Best for loop unrolling, to yield pipelined parallel hardware architectures– For example, in DSP

– Can match hand hardware designs for quality of results

23

Example: Mentor Catapult C


Xilinx Labs packet processing design environment

Overall system design:– Click (invented at MIT) is used to

describe a system in terms of inter-connected programmable components and subsystems

Packet processing functions:– G (invented at Xilinx) is a packet-centric

language used for describing and generating high speed packet processing components for Xilinx FPGAs

// Packet processing for M-SHIM

handle PBpacket on packetin {

var result : MAPresult;

var B_TAG : VLANtag;

var I_TAG : Itag;

forward /* PBBpacket */ on packetout;

remove S_TAG;

insert result.B_DA, B_SA, B_TAG, I_TAG;

read result from MAPtable [S_TAG.VID];

set B_TAG.type = 0xAAAA;

set B_TAG.pcfi = 0x0;

set B_TAG.VID = result.B_VID;

set I_TAG.type = 0xBBBB;

set I_TAG.pcfi = 0x0;

set I_TAG.VID = result.I_SID;

}

24


G description of a packet format

Source Address

Type

Control

IP Header

Destination Address

Payload

format ethernet_packet = ( dmac : 48, smac : 48, type : 16, control : 24, ip: ip_header, : * );

• Format describes fields within the packet• No specific standard protocol knowledge is built into G• Can have alternative choices within packet formats

25


Basic G packet parsing and editing rules

Payload

Source AddressType

Control

IP Header

Packet

Destination Address

New Address

Type

Control

IP Header

Payload

Packet

Destination Address

Test Values Change Values

Source AddressType

Control

IP Header

Payload

Packet Destination Address

Type

Control

Insert New Fields

Source Address

IP Header

Payload

Packet

Destination Address

Remove Fields

[packet.type == 0x8100] { // conditional operations

}

set packet.SA = 0x010203040506 set TC.Type= 0x8100;set TC.Control = 0x123456;insert TC after TC1;

remove TC1;remove TC2;

• G is a declarative language• Rules can have conditional guards, and can be organized into blocks• Concurrent execution of rules is the default semantics

26


Other candidate programming models

From the HPC community– Message Passing Interface (MPI) – message passing model

– Open Multi-Processing (OpenMP) – shared memory model

From the General Purpose GPU (GPGPU) community– Open Computing Language (OpenCL) – originally from Apple

– Compute Unified Device Architecture (CUDA) – from Nvidia

From the streaming programming community– StreamC and Brook – from Stanford

– StreamIt – from MIT

FPGA could emulate the presumed underlying architectures– Or maybe FPGAs deserve their own distinctive programming model

27


Conclusions

FPGAs have advanced far beyond the early programmable logic technologies – in terms of functions and complexity

FPGAs have attractive fine-grain parallel computing capabilities that have moved them beyond their traditional niche applications into mainstream processing

FPGAs can cohabit naturally with CPUs to bring together the strengths of the respective technologies

FPGAs are typically programmed today using a hardware design flow, but more software-like approaches are feasible

28

gordon brebner xilinx research labs ee 392 i field programmable technology for mainstream processing...

Documents

programmable logic cells32

programmable wiring

fpga processing domain

ifield programmable

soft processing architectures

systemsimplement random

xilinxa contemporary

xilinxpositioning fpgas