TRANSCRIPT
Automatic generation of platform architectures using OpenCL
FPGA roadmap
Department of Electrical and Computer Engineering
University of Thessaly
Volos, Greece
Nikolaos Bellas
What is an FPGA?
• Field Programmable Gate Array (FPGA) is the best-known example of Reconfigurable Logic
• Hardware can be modified post chip fabrication
• Tailor the hardware to the application
– Fixed-logic processors (CPUs/GPUs) only modify their software (via programming)
• FPGAs can offer superior performance, performance/power, or performance/cost compared to CPUs and GPUs.
FPGA architecture
• A generic island-style FPGA fabric
• Configurable Logic Blocks (CLB) and Programmable Switch Matrices (PSM)
• Bitstream configures the functionality of each CLB and the interconnection between logic blocks
The Xilinx Slice
• Xilinx slice features
– LUTs
– MUXF5, MUXF6, MUXF7, MUXF8 (only the F5 and F6 MUX are shown in this diagram)
– Carry Logic
– MULT_ANDs
– Sequential Elements
• Detailed structure
Example 2-input LUT
• Lookup table: the configuration input programs the stored bits, which define the logic function of inputs a and b.

  a b | out (config 0 0 0 1 = AND) | out (config 1 0 0 1 = XNOR)
  0 0 |             0              |              1
  0 1 |             0              |              0
  1 0 |             0              |              0
  1 1 |             1              |              1
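The table above can be emulated in software: a k-input LUT is just a 2^k-bit memory indexed by the concatenated inputs. A minimal C sketch, assuming configuration bit i holds the output for input pattern i (function name and bit ordering are illustrative):

```c
#include <stdint.h>

/* Emulate a 2-input LUT: the 4 configuration bits are the truth table.
   Bit i of config is the output for input pattern i, where the index
   is formed as (a << 1) | b. */
static int lut2(uint8_t config, int a, int b) {
    int index = (a << 1) | b;       /* inputs select one stored bit */
    return (config >> index) & 1;
}
```

With this encoding, config 0x8 (only bit 3 set) implements AND, and config 0x9 (bits 0 and 3 set) implements XNOR, matching the table.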
Modern FPGA architecture: Xilinx Virtex family
• Columns of on-chip SRAMs, hard IP cores (PPC 405), and DSP slices (multiply-accumulate units)
FPGA discussion
• Advantages
– Potential for (near) optimal performance for a given application
– Various forms of parallelism can be exploited
• Disadvantages
– Programmable mainly at the hardware level using Hardware Description Languages (BUT, this can change)
– Lower clock frequency (200-300 MHz) compared to CPUs (~3 GHz) and GPUs (~1.5 GHz)
MATENVMED
Silicon OpenCL: Automatic generation of platform
architectures using OpenCL
18-19/7/2013 MATENVMED Plenary Meeting
Introduction
• Automatic generation of hardware has been at the research forefront for the last 10 years.
• Variety of high-level programming models: C/C++, C-like languages, MATLAB
• Obstacles:
– Parallelism extraction for larger applications
– Extensive compiler transformations & optimizations
• Parallel programming models to the rescue: CUDA, OpenCL.
Motivation
• Parallel programming models are a good match for reconfigurable platforms.
• A major shift of the computing industry toward many-core computing systems.
• Reconfigurable fabrics bear a strong resemblance to many-core systems.
Vision
• Provide the tools and methodology to enable the large pool of software developers and domain experts, who do not necessarily have expertise in hardware design, to architect whole accelerator-based systems
– Borrowed from advances in massively parallel programming models

[Diagram: a CPU connected to an FPGA and a GPU over PCI Express]
Silicon OpenCL
• Silicon-OpenCL “SOpenCL”.
• A tool flow to convert an unmodified OpenCL application into a SoC design with HW/SW components.
[Diagram: System on Chip (SoC) tool flow — an OpenCL kernel passes through the OpenCL-to-C frontend to a C function, then through the architectural synthesis backend to HW accelerators on an on-chip bus, alongside an on-chip CPU and off-chip memory, with simulation & verification and drivers & runtime support]
Contribution
• Architectural Synthesis methodology:
– Code transformations.
– Architectural template.
[Diagram: SOpenCL architectural synthesis flow — OpenCL kernel → OpenCL-to-C frontend → C kernel → LLVM compilation → optimized LLVM-IR → transformations (bitwidth optimization, predication, code slicing, instruction clustering) → hardware generation (scheduling, Verilog generation), guided by user performance requirements and an accelerator template; the flow emits synthesizable Verilog plus a simulation testbench, and synthesis with P&R produces the FPGA bitstream. The generated accelerator consists of a streaming unit and a datapath handling input and output data.]
OpenCL for Heterogeneous Systems
• OpenCL (Open Computing Language): a unified programming model that aims to let a programmer write a portable program once and deploy it on any heterogeneous system with CPUs and GPUs.
• Became an important industry standard after its release due to substantial industry support.
OpenCL Platform Model
• One host and one or more Compute Devices (CD)
• Each CD consists of one or more Compute Units (CU)
• Each CU is further divided into one or more Processing Elements (PE)

[Diagram: the main program on the host offloads computation kernels to the devices]
OpenCL Kernel Execution Geometry
• OpenCL defines a geometric partitioning of the grid of computations
• The grid consists of an N-dimensional space of work-groups
• Each work-group consists of an N-dimensional space of work-items.

[Diagram: grid → work-groups → work-items]
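In this geometry, a work-item's global index along one dimension follows from its work-group id and local id: global_id = group_id × local_size + local_id. A minimal sketch of that relation (the helper name is illustrative):

```c
/* Compute a work-item's 1-D global id from its work-group id,
   the work-group size, and its local id within the group. */
static int global_id(int group_id, int local_size, int local_id) {
    return group_id * local_size + local_id;
}
```

For example, local id 5 in work-group 2 of size 64 has global id 133.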
OpenCL Simple Example
• OpenCL kernel describes the computation of a single work-item
• Finest parallelism granularity
• e.g. add two integer vectors (N=1)

C code:

    void add(int* a,
             int* b,
             int* c,
             int n) {
      for (int idx = 0; idx < n; idx++)
        c[idx] = a[idx] + b[idx];
    }

OpenCL kernel code:

    __kernel void vadd(
        __global int* a,
        __global int* b,
        __global int* c) {
      int idx = get_global_id(0);  /* run-time call, used to differentiate
                                      execution for each work-item */
      c[idx] = a[idx] + b[idx];
    }
Why OpenCL as an HDL?
• OpenCL exposes parallelism at the finest granularity
– Allows easy hardware generation at different levels of granularity
– One accelerator per work-item, one accelerator per work-group, one accelerator per multiple work-groups, etc.
• OpenCL exposes data communication
– Critical to transfer and stage data across platforms
• We target unmodified OpenCL to open hardware design to software engineers
– No need for hardware/architectural expertise
SOpenCL Tool Flow
[Diagram: the SoC tool flow again — OpenCL kernel → OpenCL-to-C frontend → C function → architectural synthesis backend → HW accelerators, on-chip CPU, on-chip bus, off-chip memory, with simulation & verification and drivers & runtime]
Granularity Management
• Optimal thread granularity depends on the hardware platform (CPU, GPU, FPGA)
• We select a hardware accelerator to process one work-group per invocation: smaller invocation overhead
Granularity Coarsening
[Diagram: OpenCL kernel (work-item thread) → OpenCL-to-C frontend → C function (work-group thread)]
Serialization of Work Items

OpenCL code:

    __kernel void vadd(…) {
      int idx = get_global_id(0);
      c[idx] = a[idx] + b[idx];
    }

C code:

    void vadd(…) {
      int idx;
      for (i = 0; i < get_local_size(2); i++)
        for (j = 0; j < get_local_size(1); j++)
          for (k = 0; k < get_local_size(0); k++) {
            idx = (global_id2(0) + i) * Grid_Width * Grid_Height +
                  (global_id1(0) + j) * Grid_Width +
                  (global_id0(0) + k);
            c[idx] = a[idx] + b[idx];
          }
    }
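The transformation above can be sketched in plain, compilable C. This is a simplified stand-in for the slide's version: local_size[] replaces the get_local_size() calls, and the work-group's global offsets are taken as zero, so the flattened index uses the local sizes directly:

```c
/* Serialize all work-items of one work-group into nested loops over
   the group's local 3-D space (z, y, x order), as SOpenCL does when
   coarsening one work-group into a single thread. */
static void vadd_workgroup(const int *a, const int *b, int *c,
                           const int local_size[3]) {
    for (int i = 0; i < local_size[2]; i++)
        for (int j = 0; j < local_size[1]; j++)
            for (int k = 0; k < local_size[0]; k++) {
                /* flatten (i, j, k) into a linear index; group offsets
                   are assumed zero in this sketch */
                int idx = (i * local_size[1] + j) * local_size[0] + k;
                c[idx] = a[idx] + b[idx];
            }
}
```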
Architectural Synthesis
• Exploit available parallelism and application-specific features.
• Apply a series of transformations to generate customized hardware accelerators.
[Diagram: the architectural synthesis flow again — LLVM compilation and transformations (bitwidth optimization, predication, code slicing, instruction clustering), followed by scheduling and Verilog generation down to synthesis, P&R, and the FPGA bitstream]
• Uses the LLVM compiler infrastructure.
• Generates synthesizable Verilog & a test bench.
[Diagram: accelerator template — a streaming unit (Sin/Sout address generation units (AGUs), request generators, align units, cache unit, arbiter) connects the system interconnect to the datapath (functional units (FUs), multiplexers, tunnels, named registers, memory-mapped registers) over the Sin0/Sin1/Sout0 data streams]
Sin kernel (feeds data in order):

    ind = phi [0, preh], [i2, body]
    i0 = add a0, ind
    i2 = add ind, 1
    i3 = add a0, i2
    gep0 = getelementptr i8* x0, i0
    gep1 = getelementptr i8* x0, i3
    i7 = load i8* gep0
    i10 = load i8* gep1

Computational kernel:

    i46 = phi [true, preh], [i41, body]
    ind = phi [0, preh], [i2, body]
    i2 = add ind, 1
    i7 = pop i8* gep0
    i9 = mul i7, a3
    i10 = pop i8* gep1
    i12 = mul i10, a4
    i19 = add i9, 32
    i20 = add i19, i12
    i23 = ashr i22, 6
    push i23, i8* gep4
    i40 = icmpeq i2, 8
    i41 = xor i40, true
    br i40, exit, body

Sout kernel (writes data in order):

    ind = phi [0, preh], [i2, body]
    i2 = add ind, 1
    i6 = add a2, ind
    gep4 = getelementptr i8* x1, i6
    store i23, i8* gep4
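The three slices above decouple memory access from computation: loads become pops from an input stream, stores become pushes to an output stream. The idea can be sketched in C with the FIFOs modeled as plain arrays; the names and the scaling computation are illustrative, not the slide's exact kernel:

```c
#define N 8

/* Input slice: only address generation and loads, feeding a FIFO. */
static void sin_slice(const int *x, int *fifo_in) {
    for (int i = 0; i < N; i++)
        fifo_in[i] = x[i];
}

/* Compute slice: pure computation, no memory addresses at all. */
static void compute_slice(const int *fifo_in, int *fifo_out, int scale) {
    for (int i = 0; i < N; i++)
        fifo_out[i] = fifo_in[i] * scale;
}

/* Output slice: drains the FIFO and stores results in order. */
static void sout_slice(const int *fifo_out, int *y) {
    for (int i = 0; i < N; i++)
        y[i] = fifo_out[i];
}
```

In hardware the three slices run concurrently, with the FIFOs providing backpressure between the streaming unit and the datapath.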
• FU types
• Bitwidths
• I/O bandwidth
Verilog Generation: PE Architecture
• Predication
• Code slicing
• SMS modulo scheduling
• Verilog generation
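Of these steps, predication (if-conversion) replaces control flow with unconditionally computed values selected by a predicate, which maps directly onto a datapath multiplexer. A minimal illustrative C sketch (the absolute-value example is mine, not from the slides):

```c
/* Branching form: control flow decides which value is produced. */
static int abs_branching(int x) {
    if (x < 0)
        return -x;
    return x;
}

/* Predicated form: both sides are computed unconditionally and a
   predicate selects the result -- in hardware, a multiplexer. */
static int abs_predicated(int x) {
    int p = (x < 0);   /* predicate */
    int t = -x;        /* "then" value */
    int f = x;         /* "else" value */
    return p ? t : f;  /* select = MUX */
}
```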
MATENVMED Kick-Off Meeting
Roadmap for FPGA implementation
FPGA Implementation
• Our plan is to use the same code base (e.g. OpenCL) to explore different architectures
– OpenCL used for multicore CPU, GPU, FPGA (SOpenCL)
• Fast exploration based on area, performance and power requirements
FPGA Implementation
• Monte-Carlo simulations can exploit the multi-level parallelism of FPGAs
– Multiple MC simulations per point
– Multiple points simultaneously
– Double-precision trigonometric, log, addition and multiplication functions for each walk
– Double-precision FP operations are not the FPGA's strong point, but SOpenCL can still handle them.
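The multi-level parallelism can be illustrated in C: within one point the walks are independent, and points are independent of each other, so both levels can be unrolled in hardware. This sketch uses a tiny LCG in place of a real random source and estimates π rather than the project's actual MC computation; all names are illustrative:

```c
/* Deterministic linear congruential generator (Numerical Recipes
   constants) standing in for a hardware random-number source. */
static unsigned lcg(unsigned *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* One point's Monte-Carlo estimate: every iteration of the walk loop
   is independent, so an FPGA can run many walks (and many points)
   concurrently. */
static double mc_point(unsigned seed, int walks) {
    unsigned s = seed;
    long inside = 0;
    for (int w = 0; w < walks; w++) {
        double x = (lcg(&s) >> 8) / 16777216.0;  /* uniform in [0,1) */
        double y = (lcg(&s) >> 8) / 16777216.0;
        if (x * x + y * y <= 1.0)
            inside++;
    }
    return 4.0 * inside / walks;  /* pi estimate for this point */
}
```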