
Scalable Framework for Heterogeneous Clustering of Commodity FPGAs

Jeremy Espenshade, May 7, 2009

Motivation

High Performance Computing
▪ Performing difficult computations in an acceptable time period
▪ Example areas of interest:
  ▪ Cryptanalysis
  ▪ Bioinformatics
  ▪ Molecular Dynamics
  ▪ Image Processing

Specialized Hardware
▪ Architectural differences can provide orders of magnitude speedup on suitable applications
  ▪ GPGPUs: visualization, linear algebra, etc. through data parallelism
  ▪ Cell Processor: video encoding, linear algebra, etc. through data parallelism
  ▪ FPGAs: DSP, image processing, cryptography, etc. through bit-level parallelism

Outline

Background
▪ Cluster Computing
▪ FPGA Technology
▪ Commercial FPGA Supercomputers

Proposed Framework
▪ Requirements and Motivation
▪ Hardware and Software Organization
▪ HW/SW Interaction

Application Case Studies
▪ DES Cryptanalysis
  ▪ Step-by-Step Design Flow
  ▪ Demonstration FPGA Cluster
  ▪ Performance Comparison
▪ Matrix Multiplication
  ▪ Platform Performance Characterization

Conclusion and Future Work

Cluster Computing

Historical monolithic supercomputers have given way to networks of smaller computers
▪ “If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” - Seymour Cray

Middleware technologies have made cluster construction and programming easy and efficient
▪ Message Passing: MPI, PVM
▪ Shared Memory: OpenMP
▪ Remote Procedure Call: Java RMI, CORBA
▪ Grid Organization: Condor, Globus

Message Passing Interface

De facto standard API for inter-process communication over distributed memory

Language-independent library with point-to-point and collective operations
▪ MPI_Send/MPI_Recv
▪ MPI_Bcast/MPI_Reduce
▪ MPI_Scatter/MPI_Gather
▪ MPI_Barrier

OpenMPI
▪ Open source implementation in native C

Message Passing Interface

Creates a “Virtual Topology” of the computation environment with process ranks

[Figure: ranks R0 through R3 arranged as a process tree, a process ring, and a master/slave group]

▪ Process Tree: MPI_Send(data, child), MPI_Recv(data, parent)
▪ Process Ring: MPI_Send(data, (myrank + 1) % size), MPI_Recv(data, (myrank - 1 + size) % size)
▪ Master/Slave: MPI_Bcast(data, to slaves), MPI_Reduce(data, to master)
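The ring topology's neighbor arithmetic can be sketched in plain C without an MPI runtime. In C, `%` keeps the sign of the dividend, so `(myrank - 1) % size` evaluates to -1 for rank 0; adding `size` before reducing keeps the result in range. The helper names `ring_next` and `ring_prev` are illustrative and are not part of MPI.

```c
/* Illustrative helpers (not part of MPI): neighbor ranks in a process ring. */
int ring_next(int myrank, int size)
{
    return (myrank + 1) % size;
}

int ring_prev(int myrank, int size)
{
    /* Add size before reducing so rank 0 wraps to size - 1 instead of -1. */
    return (myrank - 1 + size) % size;
}
```

In an MPI program these values would be passed as the destination and source arguments of MPI_Send and MPI_Recv.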

FPGAs

Field-Programmable Gate Arrays
▪ Devices in which function can be specified through a hardware description language (VHDL/Verilog)
▪ Slower than custom ASICs but much more flexible
▪ Large degree of fine-grained parallelism

[Figure: FPGA fabric showing basic logic blocks, configurable I/O blocks, and interconnects]

Configurable Logic Blocks

FPGAs realize computation over a network of CLBs

Each CLB contains:
▪ Eight 6-input Look-Up Tables (LUTs)
▪ Eight flip-flops
▪ Control multiplexers
▪ Arithmetic and carry logic

LUTs are preconfigured to implement any 6-input logical function
▪ ‘A = B ⊕ C ⊕ D ⊕ E ⊕ F ⊕ G’ can be calculated in a single cycle even over large operand lengths
▪ Bit-level parallelism
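A 6-input LUT can be modeled in software as a 64-entry truth table: the six inputs form an index, and the configuration word supplies the output bit. The sketch below is a software model only, not vendor code, and the function names are hypothetical. It builds the configuration for the six-input XOR from the slide and evaluates it.

```c
#include <stdint.h>

/* Evaluate a 6-input LUT: the 64-bit configuration word is the truth table,
   and the six inputs (packed into the low bits of `inputs`) select one bit. */
int lut6_eval(uint64_t config, unsigned inputs)
{
    return (int)((config >> (inputs & 0x3Fu)) & 1u);
}

/* Build the configuration word for a 6-input XOR (parity of the inputs). */
uint64_t lut6_xor_config(void)
{
    uint64_t config = 0;
    for (unsigned in = 0; in < 64; in++) {
        unsigned parity = 0;
        for (unsigned bit = 0; bit < 6; bit++)
            parity ^= (in >> bit) & 1u;
        if (parity)
            config |= (uint64_t)1 << in;
    }
    return config;
}
```

In hardware, one such LUT exists per result bit, so a wide XOR over six operands completes for every bit position at once, which is the bit-level parallelism the slide refers to.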

Xilinx Virtex-5 FXT

Hybrid FPGA
▪ Hardwired PowerPC cores
▪ 2-15k CLBs
▪ Arithmetic DSP slices
▪ Ethernet MAC units

FPGA Supercomputers

Cray XT5h
▪ AMD Opteron and Xilinx Virtex-4 FPGAs on a single blade
▪ Cray SeaStar2+™ Interconnect
▪ Custom API for RPUs
▪ http://www.cray.com/Assets/PDF/products/xt/CrayXT5hBrochure.pdf

SRC-6/7
▪ Altera Stratix II FPGAs and Intel Xeon or AMD Opteron processors
▪ SRC HI-BAR® Interface
▪ Carte® Programming Environment
▪ http://www.srccomp.com/products/src7_hardware.asp

Framework Motivation

Motivating Concepts
▪ FPGAs have great performance potential
  ▪ Especially for applications with high bit, instruction, and data level parallelism
▪ Many FPGAs working together would allow even better parallelism exploitation
  ▪ Increased data parallelism through multi-node partitioning
  ▪ Task parallelism through independent heterogeneous nodes
▪ FPGA cluster frameworks are currently limited to proprietary supercomputers
  ▪ Commodity clusters will reduce the barrier to entry and promote FPGA integration and use
  ▪ Occam’s Razor applies to programming environments: simple is good.

Framework Requirements

Easily Programmable
▪ Common API for both parallel programming and hardware access
▪ Modular hardware supported without modification

Hardware/Software Design Independent
▪ Interface access independent of application and implementation

Minimal Framework Overhead
▪ System should exhibit acceptable performance

Commodity Technologies
▪ Ethernet networking, Linux OS, open software

Scalable and Flexible
▪ Additional FPGAs easily integrated
▪ Heterogeneous nodes seamlessly supported

Extensible
▪ Future improvements possible without harsh restrictions

Physical Organization

[Figure: FPGA boards, each hosting hardware units (HW), Compact Flash, and a hard-wired on-chip Ethernet MAC, connected over a 10/100/1000T Ethernet network]

Single-Node Software Environment

Embedded Linux OS
▪ Root file system on Compact Flash (256 MB)
▪ BusyBox utilities
▪ Minimal libraries

OpenMPI
▪ OpenSSH: certificate-based security for shell access
▪ OpenSSL: TCP/IP security
▪ Various support libraries (zlib, etc.)

Special Device Files in /dev
▪ Hardware devices mapped to character devices
▪ Accessed with the fopen, fwrite, fread, and fclose commands
▪ Dynamic major number reported in /proc/devices
▪ FILE * AllocateResource(char * base_name)
  ▪ Lock-based arbitration
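Because the driver's major number is assigned dynamically, software that creates or locates the device file has to look it up in /proc/devices. The sketch below shows one way that lookup might be done on the text of a /proc/devices-style listing. The function name `find_char_major` is hypothetical, and the assumption that the driver registers under the name `des_unit` is for illustration only; the framework's actual fpga_util code is not shown in the slides.

```c
#include <stdio.h>
#include <string.h>

/* Return the major number registered under `name` in a /proc/devices-style
   listing, or -1 if absent. Only the "Character devices:" section is searched. */
int find_char_major(const char *listing, const char *name)
{
    const char *line = listing;
    while (line && *line) {
        int major;
        char dev[64];
        if (strncmp(line, "Block devices:", 14) == 0)
            break;                      /* character section is over */
        if (sscanf(line, "%d %63s", &major, dev) == 2 && strcmp(dev, name) == 0)
            return major;
        line = strchr(line, '\n');      /* advance to the next line */
        if (line)
            line++;
    }
    return -1;
}
```

A caller would read /proc/devices into a buffer, pass it to this function, and use the returned major number when creating the character device node.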

Interaction Stack

[Figure: interaction stack, labeled with a constant interface between software design and hardware design]
▪ User Application: MPI application using MPI_Init, MPI_Send, MPI_Recv, and MPI_Finalize together with fopen, fwrite, fread, and fclose
▪ Kernel Driver: open (reset FIFOs, enable interrupts), write, read, release
▪ HW FIFO: interrupt controller, write FIFO, and read FIFO on the PLB
▪ HW Unit: state machines and application-specific hardware

Driver Details

Xilinx provides a minimal set of drivers
▪ SysAce CF, Ethernet MAC (MII/GMII), etc.

Custom FIFO driver for HW abstraction
▪ Boot time
  ▪ Registers platform device and character driver
  ▪ Maps physical memory address and IRQ to virtual space
  ▪ Constructs address offsets for registers
▪ Device open
  ▪ Resets hardware accelerator and FIFOs
  ▪ Enables interrupts
▪ Interrupt handling
  ▪ Read waits on data in Read FIFO
  ▪ Hardware generates interrupt

Virtual Organization

[Figure: FPGAs, each with an Ethernet MAC and Compact Flash (CF), hosting processes labeled P0 through P9]

▪ Each hardware unit can behave as an independent node
▪ Hardware units can perform different functions, and similar functions at different speeds
▪ Each FPGA can host as many units as fit in the device
▪ Configurations with multiple hardware units per process or other exotic setups are also inherently supported

Application Case Study

The Data Encryption Standard (DES) is a widely used and longstanding block cipher

Due to insufficient cryptographic strength resulting from a 56-bit key, DES has been successfully broken by brute-force exhaustive key searches in the past decade

Past approaches:
▪ Distributed computing (à la Folding@Home)
▪ Custom DES ASICs in a cluster
▪ A hybrid of the above two approaches
▪ Custom system of 120 Spartan-IIe FPGAs

DES Algorithm

System Development Process

[Figure: design flow with stages Embedded Platform, Hardware Accelerator, MPI Application, System Integration, Cross Compilation, and Deployment]

Independent HW/SW Design

Hardware Design
▪ Algorithm implementation
▪ Search keys as fast as possible

Software Design
▪ Partition key space
▪ Coordinate results

Embedded Platform

Xilinx Base System Builder
▪ Generates PowerPC, DDR, Compact Flash, and Ethernet interfaces

Hardware Device Creation

Xilinx Peripheral Creation Wizard
▪ PLB slave interface
▪ Read/write FIFOs
▪ Interrupt controller
▪ Software reset
▪ Software-accessible registers

State machine centric design
▪ FIFO access
▪ Interface/accelerator interaction
▪ Interrupt generation

DES Hardware Implementation

Key Guesser top level
▪ Contains 2^X DES encryption engines
  ▪ 18-stage pipeline (16 rounds plus 1 input, 1 output)
▪ Initialized with known plaintext and ciphertext
▪ High 24 bits of key expected as input
▪ Each DES engine checks the lowest (32 - X) bits, with its middle X bits assigned based on component number
▪ If the key is found, return it; otherwise return zero after the key space is checked

Key layout: High 24 | Mid X | Low 32 - X
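The key layout above can be sketched as a small C function that assembles one 56-bit candidate key from its three fields. This is a software illustration of the partitioning, not the VHDL design; the function name is hypothetical, and treating the key as a packed 56-bit value without DES parity bits is an assumption.

```c
#include <stdint.h>

/* Assemble a 56-bit candidate key from the three fields on the slide:
   high 24 bits supplied by software, X middle bits fixed per engine, and
   the low (32 - X) bits swept by that engine. Assumes 0 <= x_bits < 32. */
uint64_t des_candidate_key(uint32_t high24, unsigned x_bits,
                           uint32_t engine, uint32_t counter)
{
    unsigned low_bits = 32u - x_bits;
    uint64_t low_mask = ((uint64_t)1 << low_bits) - 1;
    uint64_t mid_mask = ((uint64_t)1 << x_bits) - 1;

    return ((uint64_t)(high24 & 0xFFFFFFu) << 32)
         | (((uint64_t)engine & mid_mask) << low_bits)
         | ((uint64_t)counter & low_mask);
}
```

With X = 3 there are eight engines, and engine 5 with counter value 1 under high bits 0xABCDEF checks key 0x00ABCDEFA0000001. Each engine therefore covers a disjoint slice of the 2^32 keys that share the same high 24 bits.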

[Figure: FIFO interface and state machine feeding eight DES Encrypt engines (000 through 111). Inputs: Plain Text, High 24 bits, Cypher Text. Outputs: Correct Guess, Search Complete, Result Key]

State Machine Design

Xilinx ISE
▪ VHDL model of interface and DES guessing unit interaction
▪ Simulate and synthesize
▪ Timing and resource utilization

[Figure: state machine]
▪ Receive Configuration: for each configuration parameter, Read Req / Read Ack
▪ Start Searching and Wait for Guessing Results: while key not found and still searching
▪ Return Result or Failure Notification: Write Req / Write Ack, repeated for second key half
▪ Generate Interrupt

System Integration

▪ Processor Local Bus (PLB) connection
▪ XPS Interrupt Controller connection
▪ Physical address assignment

Integrated Bus Structure

[Figure: PowerPC, DDR RAM, Ethernet MAC, Compact Flash, and hardware units HAU_0 through HAU_N (each containing user logic) attached to the arbitrated Processor Local Bus (PLB)]

Device Tree Structure

▪ Linux kernel build targets a DTS file created as a Xilinx library
▪ Driver extracts memory addresses, IRQs, and name information

Example unit description:

    plb_des_1: plb-des@c9c20000 {
        compatible = "xlnx,plb-des-1.00.a";
        interrupt-parent = <&xps_intc_0>;
        interrupts = < 1 2 >;
        reg = < 0xc9c20000 0x10000 >;
        xlnx,family = "virtex5";
        xlnx,include-dphase-timer = <0x1>;
    };

Deployment

Linux kernel build targeting specific platform and DTS
▪ Include device driver for hardware accelerators
  ▪ make ARCH=powerpc CONFIG_FIFO_DRIVER=y
▪ Generates .ELF programming file

Bit-stream generation
▪ Synthesize, map, place, and route design
▪ Generates .BIT configuration file

Create a System ACE file
▪ Merges .BIT and .ELF into .ACE file
▪ Place .ACE file onto Compact Flash and boot

Application Development

Master-Slave Structure
▪ Master coordinates dynamic work queue
▪ Slaves wait for work or stop condition

Program Flow
▪ Master:
  ▪ Send 24-bit key space indicator (KSI) to each slave
  ▪ Wait for a response:
    ▪ If key found, break out, report results, and distribute stop conditions to all slaves
    ▪ If key not found, send next KSI to slave
▪ Slave:
  ▪ Allocate and initialize hardware unit
  ▪ Wait for work or stop condition
  ▪ If new work arrives, send KSI to hardware and send back the result

[Figure: master M connected to slaves S1, S2, ..., Sn]

Pseudo-Code Structure

    #include <stdio.h>
    #include "mpi.h"
    #include "fpga_util.h"

    int main(int argc, char *argv[])
    {
        FILE *my_dev;
        int rank, slave_rank, KSI;
        int key_space_indicator, new_key_space_indicator, result[3];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {  /* Master process */
            /* For each slave */
            MPI_Send(&key_space_indicator, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD);
            /* While work remains in queue and key not found */
            MPI_Recv(result, 3, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&new_key_space_indicator, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD);
            /* Once key found (prints the first word of the result message) */
            printf("The answer is %d!\n", result[0]);
        } else {  /* Slave process (any nonzero rank) */
            my_dev = AllocateResource("des_unit");
            setvbuf(my_dev, NULL, _IONBF, sizeof(int));
            /* Until stop condition received */
            MPI_Recv(&KSI, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            fwrite(&KSI, sizeof(int), 1, my_dev);
            fread(result, sizeof(int), 2, my_dev);
            MPI_Send(result, 3, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }

Testbed

Four Virtex-5 devices
▪ PowerPC 440 processor @ 400 MHz
▪ Two ML507 and two ML510 boards
▪ 256/512 MB DDR2

One Virtex-4 device
▪ PowerPC 405 processor @ 300 MHz
▪ ML410 board
▪ 256 MB DDR2

100 MHz Processor Local Bus (PLB)

RIT Ethernet network

DES Resource Usage
▪ ML510: 3 units each, ML507: 2 units each, ML410: 1 unit
▪ 11 units total over 5 FPGAs

DES Application Scalability

[Figure: total keys guessed per second (millions) versus number of hardware accelerators, comparing ideal scaling with actual scaling]

▪ A hardware DES unit can guess 8 keys/cycle × 100 MHz = 800 million keys/sec
▪ 11 distributed hardware units => 8800 M keys/sec ideally
▪ Actual performance is 8548.55 M keys/sec = 97.14% of ideal
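The scaling arithmetic on this slide can be reproduced with a few lines of C. The function names are illustrative. The last helper is a derived estimate, not a figure from the slides: at the measured rate, sweeping the full 2^56 key space would take roughly 97.6 days in the worst case.

```c
/* Ideal aggregate rate in millions of keys per second. */
double ideal_mkeys_per_sec(int units, int keys_per_cycle, double clock_mhz)
{
    return (double)units * keys_per_cycle * clock_mhz;
}

/* Parallel efficiency as a percentage of the ideal rate. */
double efficiency_percent(double measured, double ideal)
{
    return 100.0 * measured / ideal;
}

/* Worst-case days to sweep all 2^56 DES keys at a rate given in M keys/sec. */
double exhaustive_search_days(double mkeys_per_sec)
{
    const double keyspace = 72057594037927936.0;   /* 2^56 */
    return keyspace / (mkeys_per_sec * 1e6) / 86400.0;
}
```

Calling `ideal_mkeys_per_sec(11, 8, 100.0)` gives 8800, and `efficiency_percent(8548.55, 8800.0)` gives approximately 97.14, matching the slide.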

Performance Comparison

DES search application developed for a cluster of 2.33 GHz Xeon processors with the same program structure
▪ Single-node performance = 0.735 M keys/sec
▪ Scales to 7.722 M keys/sec across 11 cores @ 95.4% efficiency

System            Performance      Speedup   Cost (approx.)   Price/Performance
Commodity FPGAs   8548.552 MK/s    1107x     ~$11K            371x
Cray XD1          7200 MK/s        930x      ~$100K           42x
SRC-6             4000 MK/s        518x      N/A              N/A
Xeon 2.33 GHz     7.722 MK/s       1x        ~$4K             1x

Matrix Multiplication

A standard test of computational capability is matrix multiplication, A*B = C

Highly data parallel
▪ Each index of the result matrix can be computed independently: C[i][j] = A[i][] * B[][j] (row i of A times column j of B)

Hardware Design
▪ Store multiple rows of A and compute a running dot product for each row when receiving a column of B

Software Design
▪ Statically partition work across available FPGAs and aggregate results
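The hardware scheme described for matrix multiplication, in which rows of A are held locally and one running dot product per row is updated as a column of B streams in, can be modeled in software as follows. This is a sketch of the dataflow only; the function name and the use of `int` elements are assumptions, since the slides do not state the element type.

```c
/* Software model of the hardware scheme: `rows` rows of A are held locally,
   and each streamed column of B updates one running dot product per row.
   a is rows x n in row-major order; b_col has n elements; c_col gets rows results. */
void stream_column(const int *a, int rows, int n, const int *b_col, int *c_col)
{
    for (int r = 0; r < rows; r++)
        c_col[r] = 0;
    for (int k = 0; k < n; k++)            /* one element of B arrives per step */
        for (int r = 0; r < rows; r++)     /* in hardware, all rows accumulate at once */
            c_col[r] += a[r * n + k] * b_col[k];
}
```

Each call produces one column of C for the stored rows. In hardware the inner loop is unrolled across multiply-accumulate units, so the number of stored rows sets the available concurrency.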

Single Node Results

▪ FPGAs fare poorly in comparison to the Xeon GPP
▪ The largest FPGA with the most available hardware is comparable to a single core, but with poor price/performance
▪ With greater concurrency, the FPGA should have performed better. Why didn’t it?

[Figure: speedup over Xeon versus matrix dimensions for ML410 (32 MACs), ML507 (32 MACs), ML510 (32 MACs), ML510 (64 MACs), and the 2.33 GHz Xeon]

Single Node Execution Time (s)

Board        256     512     1024    2048
ML410        0.162   0.957   6.562   47.139
ML507        0.071   0.470   3.427   26.015
ML510 (32)   0.071   0.471   3.410   26.403
ML510 (64)   0.043   0.269   1.875   14.269
Xeon 2.33    0.031   0.258   1.956   13.405

Analysis

Bandwidth is the limiting factor
▪ Data is communicated word by word over a 32-bit arbitrated bus (PLB)
▪ Processor-independent DMA required for improved performance

Board        256x256 (Mbps)   512x512 (Mbps)   1024x1024 (Mbps)   2048x2048 (Mbps)
ML410        16.15            19.73            21.73              23.49
ML507        37.03            40.18            41.61              42.56
ML510 (32)   36.92            40.08            41.82              41.94
ML510 (64)   36.46            38.99            40.27              39.98

Scalability Results

[Figure: execution time versus number of FPGAs for 256 x 256, 512 x 512, 1024 x 1024, and 2048 x 2048 matrices, with configurations ML507; 2xML507; 2xML507, ML510; 2xML507, 2xML510; and 2xML507, 2xML510, ML410]

[Figure: scaling factor versus number of FPGAs for the same matrix sizes]

▪ Problem scales well across Virtex-5 devices
  ▪ 83% across 4 FPGAs
▪ Virtex-4 addition causes worse performance
  ▪ Static partitioning
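Why static partitioning hurts a heterogeneous cluster can be illustrated with an idealized model: if every node receives an equal share of the work, the job finishes only when the slowest node does. The sketch below is such a model under stated assumptions (equal shares, no communication or aggregation cost); it is not the thesis's measurement method, and the function name is hypothetical.

```c
/* Idealized model: with equal static shares, each node needs its single-node
   time divided by the node count, and the job finishes when the slowest node
   does. Communication and aggregation costs are ignored. */
double static_partition_time(const double *single_node_secs, int nodes)
{
    double worst = 0.0;
    for (int i = 0; i < nodes; i++) {
        double t = single_node_secs[i] / nodes;
        if (t > worst)
            worst = t;
    }
    return worst;
}
```

Using the 2048 x 2048 single-node times listed earlier (ML507 26.015 s, ML510 with 32 MACs 26.403 s, ML410 47.139 s), the model gives about 6.6 s for the four Virtex-5 boards and about 9.4 s once the ML410 is added, because the slower board's one-fifth share takes longer than a Virtex-5 board's one-quarter share. Which ML510 configuration was used in the cluster runs is an assumption here.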

Conclusions

Scalable framework for clustering of FPGAs developed and demonstrated
▪ Flexible application development with decoupled HW/SW co-design
▪ Standard MPI programming interface and simple hardware interaction abstractions/APIs
▪ Commodity hardware technologies and open software allow a low barrier to entry for further research and development

Application Case Studies
▪ Well-suited applications like cryptanalysis perform admirably
  ▪ Performance improvements of > 1100x
  ▪ Price/performance improvements of > 370x
▪ Current bandwidth limitations are holding back other applications

Future Work

Framework Infrastructure
▪ Hardware communication
  ▪ DMA data transfer from PowerPC to HW
  ▪ Fast HW->HW interconnection on a single FPGA
  ▪ Dynamic reconfiguration of hardware accelerators
▪ Cluster management
  ▪ Job submission and resource matching with Condor or similar
  ▪ Monitoring with Ganglia or similar
▪ Robustness
  ▪ Fault tolerance
  ▪ Correct HW usage enforcement

New Applications
▪ Performance study with more inter-process communication
  ▪ Image processing could be a good place to start
▪ Neural simulation platform (Dmitri Yudanov, CE Dept)
▪ Expressed interest from EE and Astrophysics departments

Design Flow Improvements
▪ Integrated tool chain and improved deployment procedure

Questions?

Thank you for listening

Contact: Jeremy Espenshade, [email protected]
Computer Engineering, Hardware Design Lab
Primary Advisor: Dr. Marcin Lukowiak

Espenshade, Jeremy. Scalable Framework for Heterogeneous Clustering of Commodity FPGAs. Master’s thesis, Rochester Institute of Technology, 2009.

Cray Inc. Cray XD1 Supercomputer Outscores Competition in HPC Challenge Benchmark Tests. Business Wire, Feb 15, 2005. http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=674199&highlight=

Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang, Kris Gaj, Volodymyr Kindratenko, and Duncan Buell. The Promise of High-Performance Reconfigurable Computing. IEEE Computer Magazine, 41(2):69–76, 2008.

Xilinx Corp. Virtex-5 Multi-Platform FPGAs, 2009. http://www.xilinx.com/products/virtex5/
