Scalable Framework for Heterogeneous Clustering of Commodity FPGAs
Jeremy Espenshade, May 7, 2009
Motivation
High Performance Computing: performing difficult computations in an acceptable time period
Example areas of interest: ▪ Cryptanalysis ▪ Bioinformatics ▪ Molecular Dynamics ▪ Image Processing
Specialized hardware: architectural differences can provide orders-of-magnitude speedup on suitable applications
▪ GPGPUs: visualization, linear algebra, etc., through data parallelism
▪ Cell Processor: video encoding, linear algebra, etc., through data parallelism
▪ FPGAs: DSP, image processing, cryptography, etc., through bit-level parallelism
Outline
Background ▪ Cluster Computing ▪ FPGA Technology ▪ Commercial FPGA Supercomputers
Proposed Framework ▪ Requirements and Motivation ▪ Hardware and Software Organization ▪ HW/SW Interaction
Application Case Studies ▪ DES Cryptanalysis (step-by-step design flow, demonstration FPGA cluster, performance comparison) ▪ Matrix Multiplication (platform performance characterization)
Conclusion and Future Work
Cluster Computing
Historical monolithic supercomputers have given way to networks of smaller computers
"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" - Seymour Cray
Middleware technologies have made cluster construction and programming easy and efficient
▪ Message Passing: MPI, PVM
▪ Shared Memory: OpenMP
▪ Remote Procedure Call: Java RMI, CORBA
▪ Grid Organization: Condor, Globus
Message Passing Interface
De facto standard API for inter-process communication over distributed memory
Language-independent library with point-to-point and collective operations
▪ MPI_Send / MPI_Recv
▪ MPI_Bcast / MPI_Reduce
▪ MPI_Scatter / MPI_Gather
▪ MPI_Barrier
OpenMPI: open-source implementation in native C
Message Passing Interface
Creates a "virtual topology" of the computation environment with process ranks (figure: ranks R0-R3 arranged as a tree, a ring, and a master/slave star); a minimal ring sketch follows below.
▪ Process Tree: MPI_Send(data, child) / MPI_Recv(data, parent)
▪ Process Ring: MPI_Send(data, (myrank + 1) % size) / MPI_Recv(data, (myrank - 1) % size)
▪ Master/Slave: MPI_Bcast(data, to slaves) / MPI_Reduce(data, to master)
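As a concrete illustration of the ring topology above, here is a minimal C/MPI sketch in which every rank passes its id to its neighbor; the buffer names and the even/odd send ordering are illustrative choices, not part of the framework itself.

/* Minimal sketch of the process-ring pattern: each rank sends to
 * (rank + 1) % size and receives from (rank - 1 + size) % size.
 * Even ranks send first and odd ranks receive first so the pair of
 * blocking calls cannot deadlock even for large messages. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, sendbuf, recvbuf = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;
    sendbuf = rank;                       /* payload: this rank's id */

    if (rank % 2 == 0) {
        MPI_Send(&sendbuf, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        MPI_Recv(&recvbuf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&recvbuf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&sendbuf, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }

    printf("Rank %d received %d from rank %d\n", rank, recvbuf, prev);
    MPI_Finalize();
    return 0;
}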
FPGAs
Field-Programmable Gate Arrays: devices whose function can be specified through a hardware description language (VHDL/Verilog)
Slower than custom ASICs, but much more flexible
Large degree of fine-grained parallelism
Device fabric (figure): basic logic blocks, configurable I/O blocks, and programmable interconnects
Configurable Logic Blocks
FPGAs realize computation over a network of CLBs. Each CLB contains:
▪ Eight 6-input Look-Up Tables (LUTs)
▪ Eight flip-flops
▪ Control multiplexors
▪ Arithmetic and carry logic
LUTs are preconfigured to implement any 6-input logical function, so
'A = B ⊕ C ⊕ D ⊕ E ⊕ F ⊕ G'
can be calculated in a single cycle, even over large operand lengths: bit-level parallelism.
Xilinx Virtex-5 FXT
Hybrid FPGA:
▪ Hardwired PowerPC cores
▪ 2-15k CLBs
▪ Arithmetic DSP slices
▪ Ethernet MAC units
FPGA Supercomputers
Cray XT5h
▪ AMD Opteron and Xilinx Virtex-4 FPGAs on a single blade
▪ Cray SeaStar2+™ interconnect
▪ Custom API for RPUs
▪ http://www.cray.com/Assets/PDF/products/xt/CrayXT5hBrochure.pdf
SRC-6/7
▪ Altera Stratix II FPGAs and Intel Xeon or AMD Opteron processors
▪ SRC HI-BAR® interface
▪ Carte® programming environment
▪ http://www.srccomp.com/products/src7_hardware.asp
Framework Motivation
Motivating concepts:
▪ FPGAs have great performance potential, especially for applications with high bit-, instruction-, and data-level parallelism
▪ Many FPGAs working together would allow even better exploitation of parallelism: increased data parallelism through multi-node partitioning, and task parallelism through independent heterogeneous nodes
▪ FPGA cluster frameworks are currently limited to proprietary supercomputers; commodity clusters will reduce the barrier to entry and promote FPGA integration and use
▪ Occam's Razor applies to programming environments: simple is good.
Framework Requirements
Easily programmable
▪ Common API for both parallel programming and hardware access
▪ Modular hardware supported without modification
Hardware/software design independence
▪ Interface access independent of application and implementation
Minimal framework overhead
▪ System should exhibit acceptable performance
Commodity technologies
▪ Ethernet networking, Linux OS, open software
Scalable and flexible
▪ Additional FPGAs easily integrated; heterogeneous nodes seamlessly supported
Extensible
▪ Future improvements possible without harsh restrictions
Physical Organization
(Figure) FPGAs are connected to a 10/100/1000T Ethernet network through the hard-wired Ethernet MAC on each chip; every FPGA hosts one or more hardware (HW) units and boots from Compact Flash.
Single-Node Software Environment
Embedded Linux OS
▪ Root file system on Compact Flash (256 MB)
▪ BusyBox utilities
▪ Minimal libraries: OpenMPI; OpenSSH (certificate-based security for shell access); OpenSSL (TCP/IP security); various support libraries (zlib, etc.)
Special device files in /dev
▪ Hardware devices mapped to character devices, accessed with fopen, fwrite, fread, and fclose
▪ Dynamic major number reported in /proc/devices
▪ FILE * AllocateResource(char * base_name) with lock-based arbitration (a sketch follows below)
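The thesis does not list AllocateResource's internals on this slide, so the following is only a hedged sketch of one way it could work: scan the /dev/<base_name>N character devices and claim the first unit that can be locked exclusively, with flock() providing the lock-based arbitration. The device-name pattern and unit limit are assumptions.

/* Hedged sketch, not the framework's actual implementation. */
#include <stdio.h>
#include <sys/file.h>

#define MAX_UNITS 16   /* assumed upper bound on units per FPGA */

FILE *AllocateResource(const char *base_name)
{
    char path[64];
    for (int i = 0; i < MAX_UNITS; i++) {
        snprintf(path, sizeof(path), "/dev/%s%d", base_name, i);
        FILE *dev = fopen(path, "r+");
        if (!dev)
            continue;                           /* unit does not exist */
        if (flock(fileno(dev), LOCK_EX | LOCK_NB) == 0)
            return dev;                         /* claimed this unit */
        fclose(dev);                            /* busy: try the next one */
    }
    return NULL;                                /* no free unit found */
}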
Interaction Stack
(Figure) The MPI application (MPI_Init/MPI_Finalize, MPI_Send/MPI_Recv) sits on top of the user application, which reaches hardware through fopen, fwrite, fread, and fclose. The kernel driver translates these into open (reset FIFOs, enable interrupts), write, read, and release operations on the HW FIFO, interrupt controller, and state machines attached to the PLB, which in turn drive the application-specific hardware unit. These interface layers stay constant; only the software design above and the hardware design below change per application.
Driver Details
Xilinx provides a minimal set of drivers: SysACE CF, Ethernet MAC (MII/GMII), etc.
Custom FIFO driver for HW abstraction (a sketch of the read path follows below):
▪ Boot time: registers the platform device and character driver; maps the physical memory address and IRQ into virtual space; constructs address offsets for registers
▪ Device open: resets the hardware accelerator and FIFOs; enables interrupts
▪ Interrupt handling: read waits on data in the Read FIFO; the hardware generates an interrupt when data is available
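For illustration, here is a hedged sketch of the interrupt-driven read path described above, using standard Linux kernel primitives; it is not the thesis driver verbatim, and the register offsets (RDFO, RDFD) and fifo_dev layout are assumptions. A real handler would also acknowledge the interrupt in the device and interrupt controller, which is omitted here.

#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/uaccess.h>
#include <linux/wait.h>
#include <linux/fs.h>

#define RDFO 0x1C   /* read FIFO occupancy register (assumed offset) */
#define RDFD 0x20   /* read FIFO data register      (assumed offset) */

struct fifo_dev {
    void __iomem *regs;           /* mapped PLB peripheral registers */
    wait_queue_head_t readq;      /* readers sleep here */
};

static irqreturn_t fifo_irq(int irq, void *arg)
{
    struct fifo_dev *fd = arg;
    /* (Interrupt acknowledge to the hardware omitted in this sketch.) */
    wake_up_interruptible(&fd->readq);   /* data arrived: wake readers */
    return IRQ_HANDLED;
}

static ssize_t fifo_read(struct file *filp, char __user *buf,
                         size_t count, loff_t *ppos)
{
    struct fifo_dev *fd = filp->private_data;
    u32 word;

    /* Sleep until the Read FIFO reports at least one word. */
    if (wait_event_interruptible(fd->readq, ioread32(fd->regs + RDFO) > 0))
        return -ERESTARTSYS;

    word = ioread32(fd->regs + RDFD);
    if (copy_to_user(buf, &word, sizeof(word)))
        return -EFAULT;
    return sizeof(word);
}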
Virtual Organization
(Figure) MPI processes (P0, P1, ...) map onto hardware units spread across the FPGAs, each FPGA with its own MAC and Compact Flash.
▪ Each hardware unit can behave as an independent node
▪ Hardware units can perform different functions, or similar functions at different speeds
▪ Each FPGA can host as many units as fit in the device
▪ Configurations with multiple hardware units per process, or other exotic setups, are also inherently supported
Application Case Study
The Data Encryption Standard (DES) is a widely used and longstanding block cipher.
Due to insufficient cryptographic strength resulting from its 56-bit key, DES has been broken by brute-force exhaustive key searches in the past decade.
Past approaches:
▪ Distributed computing (a la Folding@Home)
▪ Custom DES ASICs in a cluster
▪ A hybrid of the above two approaches
▪ Custom system of 120 Spartan-IIe FPGAs
DES Algorithm
System Development Process
(Figure) Independent HW/SW design of the embedded platform, hardware accelerator, and MPI application, followed by system integration, cross compilation, and deployment.

Independent HW/SW Design
Hardware design: implement the algorithm; search keys as fast as possible
Software design: partition the key space; coordinate results
Embedded Platform
Xilinx Base System Builder generates the PowerPC, DDR, Compact Flash, and Ethernet interfaces
Hardware Device Creation
Xilinx Peripheral Creation Wizard provides:
▪ PLB slave interface
▪ Read/Write FIFOs
▪ Interrupt controller
▪ Software reset
▪ Software-accessible registers
State-machine-centric design: FIFO access, interface/accelerator interaction, interrupt generation
DES Hardware Implementation
Key Guesser top level contains 2^X DES encryption engines
▪ 18-stage pipeline (16 rounds plus 1 input stage and 1 output stage)
▪ Initialized with a known plaintext and ciphertext
▪ The high 24 bits of the key are expected as input
▪ Each DES engine checks the lowest (32 - X) bits, with its assigned middle bits taken from its component number
▪ If the key is found, it is returned; otherwise zero is returned after the key space is checked
Key layout: high 24 bits | mid X bits | low (32 - X) bits (a partitioning sketch follows below)
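To make the key-space partition concrete, here is a hedged C sketch of how a 56-bit candidate key is assembled from the three fields, with X = 3 to match the eight engines in the following figure; the helper name and field packing order are illustrative assumptions consistent with the layout above.

#include <stdint.h>

#define X 3                              /* log2(number of DES engines) */

/* Assemble one 56-bit candidate key from its three fields. */
static uint64_t make_key(uint32_t high24, uint32_t engine, uint64_t low)
{
    return ((uint64_t)(high24 & 0xFFFFFF) << 32)              /* bits 55..32 */
         | ((uint64_t)(engine & ((1u << X) - 1)) << (32 - X)) /* bits 31..29 */
         | (low & ((1ull << (32 - X)) - 1));                  /* bits 28..0  */
}

/* Keys covered by one key-space indicator (KSI):
 *   2^X engines * 2^(32-X) low keys = 2^32 keys per KSI,
 * so at most 2^24 indicators cover the full 56-bit key space. */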
FIFO Interface and State Machine
(Figure) The FIFO interface and state machine feed the plaintext, ciphertext, and high 24 key bits to eight DES encrypt engines (component numbers 000 through 111); the engines report a correct guess or a search-complete signal along with the result key.
State Machine Design
Xilinx ISE: VHDL model of the interface and DES guessing unit interaction; simulate and synthesize; timing and resource utilization
(Figure) State machine: receive configuration (read request/acknowledge for each configuration parameter, repeated for the second key half), start searching and wait for guessing results (while the key is not found and the search continues), then return the result or a failure notification (write request/acknowledge) and generate an interrupt.
System Integration
Processor Local Bus Connection (PLB)
XPS Interrupt Controller Connection
Physical Address Assignment
Integrated Bus Structure
(Figure) The PowerPC, Ethernet MAC, Compact Flash controller, DDR RAM, and hardware accelerator units HAU_0 through HAU_N (each wrapping application user logic) share the arbitrated Processor Local Bus (PLB).
Device Tree Structure
The Linux kernel build targets a DTS file created with the Xilinx device-tree library
The driver extracts memory addresses, IRQs, and name information (a probe sketch follows after the example)
Example unit description:
plb_des_1: plb-des@c9c20000 {
    compatible = "xlnx,plb-des-1.00.a";
    interrupt-parent = <&xps_intc_0>;
    interrupts = < 1 2 >;
    reg = < 0xc9c20000 0x10000 >;
    xlnx,family = "virtex5";
    xlnx,include-dphase-timer = <0x1>;
};
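As a hedged illustration of how a driver can pull the reg and interrupts properties out of the node above, here is a sketch using the current mainline platform-driver helpers; these helpers postdate the 2009-era kernel used in the thesis, so this shows the idea rather than the original code. The compatible string is taken from the DTS example.

#include <linux/platform_device.h>
#include <linux/of.h>
#include <linux/io.h>
#include <linux/module.h>

static int plb_des_probe(struct platform_device *pdev)
{
    struct resource *res;
    void __iomem *regs;
    int irq;

    /* reg = < 0xc9c20000 0x10000 > becomes an IORESOURCE_MEM entry. */
    res  = platform_get_resource(pdev, IORESOURCE_MEM, 0);
    regs = devm_ioremap_resource(&pdev->dev, res);
    if (IS_ERR(regs))
        return PTR_ERR(regs);

    /* interrupts = < 1 2 > is resolved through the xps_intc_0 parent. */
    irq = platform_get_irq(pdev, 0);
    if (irq < 0)
        return irq;

    dev_info(&pdev->dev, "plb-des at %pR, irq %d\n", res, irq);
    return 0;
}

static const struct of_device_id plb_des_of_match[] = {
    { .compatible = "xlnx,plb-des-1.00.a" },
    { }
};

static struct platform_driver plb_des_driver = {
    .probe  = plb_des_probe,
    .driver = {
        .name           = "plb-des",
        .of_match_table = plb_des_of_match,
    },
};
module_platform_driver(plb_des_driver);
MODULE_LICENSE("GPL");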
Deployment
Linux kernel
▪ Build targeting the specific platform and DTS
▪ Include the device driver for hardware accelerators: make ARCH=powerpc CONFIG_FIFO_DRIVER=y
▪ Generates the .ELF programming file
Bit-stream generation
▪ Synthesize, map, place, and route the design
▪ Generates the .BIT configuration file
Create a System ACE file
▪ Merges the .BIT and .ELF into an .ACE file
▪ Place the .ACE file onto the Compact Flash and boot
Application Development
Master-slave structure (figure: master M with slaves S1, S2, ..., Sn)
▪ Master coordinates a dynamic work queue
▪ Slaves wait for work or a stop condition
Program flow
▪ Master: send a 24-bit key space indicator (KSI) to each slave, then wait for responses; if the key is found, break out, report results, and distribute stop conditions to all slaves; if not, send the next KSI to that slave
▪ Slave: allocate and initialize a hardware unit; wait for work or a stop condition; when new work arrives, send the KSI to the hardware and send back the result
Pseudo-Code Structure

#include <stdio.h>
#include "mpi.h"
#include "fpga_util.h"

int main(int argc, char *argv[])
{
    FILE *my_dev;
    int rank, slave_rank;
    int key_space_indicator, new_key_space_indicator, KSI;
    int result[3];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* Master process */
        /* For each slave */
        MPI_Send(&key_space_indicator, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD);
        /* While work remains in the queue and the key is not found */
        MPI_Recv(result, 3, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&new_key_space_indicator, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD);
        /* Once the key is found */
        printf("The answer is %d!\n", result[0]);
    } else {                    /* Slave processes */
        my_dev = AllocateResource("des_unit");
        setvbuf(my_dev, NULL, _IONBF, sizeof(int));
        /* Until a stop condition is received */
        MPI_Recv(&KSI, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        fwrite(&KSI, sizeof(int), 1, my_dev);
        fread(result, sizeof(int), 2, my_dev);
        MPI_Send(result, 3, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
}
Testbed
Four Virtex-5 devices
▪ PowerPC 440 processor @ 400 MHz
▪ Two ML507 and two ML510 boards
▪ 256/512 MB DDR2
One Virtex-4 device
▪ PowerPC 405 processor @ 300 MHz
▪ ML410 board, 256 MB DDR2
100 MHz Processor Local Bus (PLB); RIT Ethernet network
DES resource usage: ML510: 3 units each, ML507: 2 units each, ML410: 1 unit; 11 units total over 5 FPGAs
DES Application Scalability
(Chart) Total keys guessed per second (millions) versus number of hardware accelerators (up to 11), ideal scaling versus actual scaling.
Each hardware DES unit can guess 8 keys/cycle × 100 MHz = 800 million keys/sec
11 distributed hardware units => 8,800 M keys/sec ideally
Actual performance is 8,548.55 M keys/sec, or 97.14% of ideal
Performance Comparison
A DES search application was developed for a cluster of 2.33 GHz Xeon processors with the same program structure.
▪ Single-node performance: 0.735 M keys/sec
▪ Scales to 7.722 M keys/sec across 11 cores at 95.4% efficiency

System Performance Comparison
System            Performance      Speedup   Cost (approx.)   Price/Performance
Commodity FPGAs   8548.552 MK/s    1107x     ~$11K            371x
Cray XD1          7200 MK/s        930x      ~$100K           42x
SRC-6             4000 MK/s        518x      N/A              N/A
Xeon 2.33 GHz     7.722 MK/s       1x        ~$4K             1x
Matrix Multiplication
A standard test of computational capability is matrix multiplication, A*B = C
▪ Highly data parallel: each index of the result matrix can be computed independently: C[i][j] = A[i][] * B[][j]
Hardware design: store multiple rows of A and compute a running dot product for each row as a column of B is received
Software design: statically partition the work across the available FPGAs and aggregate the results (a partitioning sketch follows below)
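For illustration, here is a hedged C/MPI sketch of the static partitioning described above: rows of A are split evenly across ranks, B is broadcast, and the partial C blocks are gathered back. The function name is hypothetical, N is assumed divisible by the rank count, and the per-rank compute loop stands in for a write/read exchange with the FPGA matrix-multiply unit.

#include <stdlib.h>
#include "mpi.h"

/* A, B, C are only significant at rank 0. */
void mat_mul_static(int N, float *A, float *B, float *C)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                          /* rows of A per rank */
    float *Ablk  = malloc((size_t)rows * N * sizeof(float));
    float *Cblk  = malloc((size_t)rows * N * sizeof(float));
    float *Bfull = (rank == 0) ? B : malloc((size_t)N * N * sizeof(float));

    MPI_Scatter(A, rows * N, MPI_FLOAT, Ablk, rows * N, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Bcast(Bfull, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* Running dot products, as in the hardware design above
       (here in software for brevity). */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += Ablk[i * N + k] * Bfull[k * N + j];
            Cblk[i * N + j] = acc;
        }

    MPI_Gather(Cblk, rows * N, MPI_FLOAT, C, rows * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(Ablk);
    free(Cblk);
    if (rank != 0)
        free(Bfull);
}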
Single-Node Results
▪ The FPGAs fare poorly in comparison to the Xeon general-purpose processor
▪ The largest FPGA, with the most available hardware, is comparable to a single core, but with poor price/performance
▪ With greater concurrency, the FPGA should have performed better. Why didn't it?
(Chart) Speedup over the Xeon versus matrix dimensions (up to 2048) for the ML410 (32 MACs), ML507 (32 MACs), ML510 (32 MACs), ML510 (64 MACs), and the 2.33 GHz Xeon.
Single-Node Execution Time (s)
Board         256      512      1024     2048
ML410         0.162    0.957    6.562    47.139
ML507         0.071    0.470    3.427    26.015
ML510 (32)    0.071    0.471    3.410    26.403
ML510 (64)    0.043    0.269    1.875    14.269
Xeon 2.33     0.031    0.258    1.956    13.405
Analysis
▪ Bandwidth is the limiting factor
▪ Data is communicated word by word over a 32-bit arbitrated bus (PLB)
▪ Processor-independent DMA is required for improved performance
Board         256x256 (Mbps)   512x512 (Mbps)   1024x1024 (Mbps)   2048x2048 (Mbps)
ML410         16.15            19.73            21.73              23.49
ML507         37.03            40.18            41.61              42.56
ML510 (32)    36.92            40.08            41.82              41.94
ML510 (64)    36.46            38.99            40.27              39.98
Scalability Results
(Charts) Execution time and scaling factor versus number of FPGAs (configurations: ML507; 2xML507; 2xML507 + ML510; 2xML507 + 2xML510; 2xML507 + 2xML510 + ML410) for matrix sizes 256x256, 512x512, 1024x1024, and 2048x2048.
▪ The problem scales well across the Virtex-5 devices: 83% efficiency across 4 FPGAs
▪ Adding the Virtex-4 causes worse performance because of the static partitioning
Conclusions
Scalable framework for clustering of FPGAs developed and demonstrated
▪ Flexible application development with decoupled HW/SW co-design
▪ Standard MPI programming interface and simple hardware-interaction abstractions/APIs
▪ Commodity hardware technologies and open software allow a low barrier to entry for further research and development
Application case studies
▪ Well-suited applications like cryptanalysis perform admirably: performance improvements of > 1100x and price/performance improvements of > 370x
▪ Current bandwidth limitations are holding back other applications
Future Work
Framework infrastructure
▪ Hardware communication: DMA data transfer from the PowerPC to HW; fast HW-to-HW interconnection on a single FPGA; dynamic reconfiguration of hardware accelerators
▪ Cluster management: job submission and resource matching with Condor or similar; monitoring with Ganglia or similar
▪ Robustness: fault tolerance; correct HW usage enforcement
New applications
▪ Performance study with more inter-process communication: image processing could be a good place to start
▪ Neural simulation platform (Dmitri Yudanov, CE Dept.)
▪ Expressed interest from the EE and Astrophysics departments
Design flow improvements
▪ Integrated tool chain and improved deployment procedure
Questions?
Thank you for listening
Contact: Jeremy Espenshade, [email protected]
Computer Engineering, Hardware Design Lab
Primary Advisor: Dr. Marcin Lukowiak

References
Espenshade, Jeremy. Scalable Framework for Heterogeneous Clustering of Commodity FPGAs. Master's thesis, Rochester Institute of Technology, 2009.
Cray Inc. Cray XD1 Supercomputer Outscores Competition in HPC Challenge Benchmark Tests. Business Wire, Feb 15, 2005. http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=674199&highlight=.
Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang, Kris Gaj, Volodymyr Kindratenko, and Duncan Buell. The Promise of High-Performance Reconfigurable Computing. IEEE Computer, 41(2):69-76, 2008.
Xilinx Corp. Virtex-5 Multi-Platform FPGAs, 2009. http://www.xilinx.com/products/virtex5/.