TRANSCRIPT
Automatic generation of platform architectures using OpenCL
FPGA roadmap
Department of Electrical and Computer Engineering
University of Thessaly
Volos, Greece
Nikolaos Bellas
What is an FPGA?
• Field Programmable Gate Array (FPGA) is the best-known example of Reconfigurable Logic
• Hardware can be modified post chip fabrication
• Tailor the hardware to the application
– Fixed-logic processors (CPUs/GPUs) only modify their software (via programming)
• FPGAs can offer superior performance, performance/power, or performance/cost compared to CPUs and GPUs.
FPGA architecture
• A generic island-style FPGA fabric
• Configurable Logic Blocks (CLB) and Programmable Switch Matrices (PSM)
• Bitstream configures the functionality of each CLB and the interconnection between logic blocks
The Xilinx Slice
• Xilinx slice features
– LUTs
– MUXF5, MUXF6, MUXF7, MUXF8 (only the F5 and F6 MUX are shown in this diagram)
– Carry Logic
– MULT_ANDs
– Sequential Elements
• Detailed structure
Example 2-input LUT
• Lookup table: the configuration input programs the stored bits, which define the logic function of inputs a and b.

  a b | out (config 0 0 0 1 = AND) | out (config 1 0 0 1 = XNOR)
  0 0 |             0              |              1
  0 1 |             0              |              0
  1 0 |             0              |              0
  1 1 |             1              |              1
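The table above can be emulated in software: a k-input LUT is just a 2^k-bit memory indexed by the concatenated inputs. A minimal C sketch, assuming configuration bit i holds the output for input pattern i (function name and bit ordering are illustrative):

```c
#include <stdint.h>

/* Emulate a 2-input LUT: the 4 configuration bits are the truth table.
   Bit i of config is the output for input pattern i, where the index
   is formed as (a << 1) | b. */
static int lut2(uint8_t config, int a, int b) {
    int index = (a << 1) | b;       /* inputs select one stored bit */
    return (config >> index) & 1;
}
```

With this encoding, config 0x8 (only bit 3 set) implements AND, and config 0x9 (bits 0 and 3 set) implements XNOR, matching the table.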
Modern FPGA architecture: Xilinx Virtex family
• Columns of on-chip SRAMs, hard IP cores (PPC 405), and DSP slices (multiply-accumulate units)
FPGA discussion
• Advantages
– Potential for (near) optimal performance for a given application
– Various forms of parallelism can be exploited
• Disadvantages
– Programmable mainly at the hardware level using Hardware Description Languages (BUT, this can change)
– Lower clock frequency (200-300 MHz) compared to CPUs (~3 GHz) and GPUs (~1.5 GHz)
MATENVMED
Silicon OpenCL: Automatic generation of platform
architectures using OpenCL
18-19/7/2013 MATENVMED Plenary Meeting
Introduction
• Automatic generation of hardware has been at the research forefront for the last 10 years.
• Variety of high-level programming models: C/C++, C-like languages, MATLAB
• Obstacles:
– Parallelism extraction for larger applications
– Extensive compiler transformations & optimizations
• Parallel programming models to the rescue: CUDA, OpenCL.
Motivation
• Parallel programming models are a good match for reconfigurable platforms.
• A major shift of the computing industry toward many-core computing systems.
• Reconfigurable fabrics bear a strong resemblance to many-core systems.
Vision
• Provide the tools and methodology to enable the large pool of software developers and domain experts, who do not necessarily have expertise in hardware design, to architect whole accelerator-based systems
– Borrowed from advances in massively parallel programming models

[Diagram: a CPU connected to an FPGA and a GPU over PCI Express]
Silicon OpenCL
• Silicon-OpenCL “SOpenCL”.
• A tool flow to convert an unmodified OpenCL application into a SoC design with HW/SW components.
[Diagram: System on Chip (SoC) tool flow — an OpenCL kernel passes through the OpenCL-to-C frontend to a C function, then through the architectural synthesis backend to HW accelerators on an on-chip bus, alongside an on-chip CPU and off-chip memory, with simulation & verification and drivers & runtime support]
Contribution
• Architectural Synthesis methodology:
– Code transformations.
– Architectural template.
[Diagram: SOpenCL architectural synthesis flow — OpenCL kernel → OpenCL-to-C frontend → C kernel → LLVM compilation → optimized LLVM-IR → transformations (bitwidth optimization, predication, code slicing, instruction clustering) → hardware generation (scheduling, Verilog generation), guided by user performance requirements and an accelerator template; the flow emits synthesizable Verilog plus a simulation testbench, and synthesis with P&R produces the FPGA bitstream. The generated accelerator consists of a streaming unit and a datapath handling input and output data.]
OpenCL for Heterogeneous Systems
• OpenCL (Open Computing Language): a unified programming model that aims to let a programmer write a portable program once and deploy it on any heterogeneous system with CPUs and GPUs.
• Became an important industry standard after its release due to substantial industry support.
OpenCL Platform Model
• One host and one or more Compute Devices (CD)
• Each CD consists of one or more Compute Units (CU)
• Each CU is further divided into one or more Processing Elements (PE)

[Diagram: the main program on the host offloads computation kernels to the devices]
OpenCL Kernel Execution Geometry
• OpenCL defines a geometric partitioning of the grid of computations
• The grid consists of an N-dimensional space of work-groups
• Each work-group consists of an N-dimensional space of work-items.

[Diagram: grid → work-groups → work-items]
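In this geometry, a work-item's global index along one dimension follows from its work-group id and local id: global_id = group_id × local_size + local_id. A minimal sketch of that relation (the helper name is illustrative):

```c
/* Compute a work-item's 1-D global id from its work-group id,
   the work-group size, and its local id within the group. */
static int global_id(int group_id, int local_size, int local_id) {
    return group_id * local_size + local_id;
}
```

For example, local id 5 in work-group 2 of size 64 has global id 133.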
OpenCL Simple Example
• OpenCL kernel describes the computation of a single work-item
• Finest parallelism granularity
• e.g. add two integer vectors (N=1)

C code:

    void add(int* a,
             int* b,
             int* c,
             int n) {
      for (int idx = 0; idx < n; idx++)
        c[idx] = a[idx] + b[idx];
    }

OpenCL kernel code:

    __kernel void vadd(
        __global int* a,
        __global int* b,
        __global int* c) {
      int idx = get_global_id(0);  /* run-time call, used to differentiate
                                      execution for each work-item */
      c[idx] = a[idx] + b[idx];
    }
Why OpenCL as an HDL?
• OpenCL exposes parallelism at the finest granularity
– Allows easy hardware generation at different levels of granularity
– One accelerator per work-item, one accelerator per work-group, one accelerator per multiple work-groups, etc.
• OpenCL exposes data communication
– Critical to transfer and stage data across platforms
• We target unmodified OpenCL to open hardware design to software engineers
– No need for hardware/architectural expertise
SOpenCL Tool Flow
[Diagram: the SoC tool flow again — OpenCL kernel → OpenCL-to-C frontend → C function → architectural synthesis backend → HW accelerators, on-chip CPU, on-chip bus, off-chip memory, with simulation & verification and drivers & runtime]
Granularity Management
• Optimal thread granularity depends on the hardware platform (CPU, GPU, FPGA)
• We select a hardware accelerator to process one work-group per invocation: smaller invocation overhead
Granularity Coarsening
[Diagram: OpenCL kernel (work-item thread) → OpenCL-to-C frontend → C function (work-group thread)]
Serialization of Work Items

OpenCL code:

    __kernel void vadd(…) {
      int idx = get_global_id(0);
      c[idx] = a[idx] + b[idx];
    }

C code:

    void vadd(…) {
      int idx;
      for (i = 0; i < get_local_size(2); i++)
        for (j = 0; j < get_local_size(1); j++)
          for (k = 0; k < get_local_size(0); k++) {
            idx = (global_id2(0) + i) * Grid_Width * Grid_Height +
                  (global_id1(0) + j) * Grid_Width +
                  (global_id0(0) + k);
            c[idx] = a[idx] + b[idx];
          }
    }
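The transformation above can be sketched in plain, compilable C. This is a simplified stand-in for the slide's version: local_size[] replaces the get_local_size() calls, and the work-group's global offsets are taken as zero, so the flattened index uses the local sizes directly:

```c
/* Serialize all work-items of one work-group into nested loops over
   the group's local 3-D space (z, y, x order), as SOpenCL does when
   coarsening one work-group into a single thread. */
static void vadd_workgroup(const int *a, const int *b, int *c,
                           const int local_size[3]) {
    for (int i = 0; i < local_size[2]; i++)
        for (int j = 0; j < local_size[1]; j++)
            for (int k = 0; k < local_size[0]; k++) {
                /* flatten (i, j, k) into a linear index; group offsets
                   are assumed zero in this sketch */
                int idx = (i * local_size[1] + j) * local_size[0] + k;
                c[idx] = a[idx] + b[idx];
            }
}
```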
Architectural Synthesis
• Exploit available parallelism and application-specific features.
• Apply a series of transformations to generate customized hardware accelerators.
[Diagram: the architectural synthesis flow again — LLVM compilation and transformations (bitwidth optimization, predication, code slicing, instruction clustering), followed by scheduling and Verilog generation down to synthesis, P&R, and the FPGA bitstream]
• Uses the LLVM compiler infrastructure.
• Generates synthesizable Verilog & a test bench.
[Diagram: accelerator template — a streaming unit (Sin/Sout address generation units (AGUs), request generators, align units, cache unit, arbiter) connects the system interconnect to the datapath (functional units (FUs), multiplexers, tunnels, named registers, memory-mapped registers) over the Sin0/Sin1/Sout0 data streams]
Sin kernel (feeds data in order):

    ind = phi [0, preh], [i2, body]
    i0 = add a0, ind
    i2 = add ind, 1
    i3 = add a0, i2
    gep0 = getelementptr i8* x0, i0
    gep1 = getelementptr i8* x0, i3
    i7 = load i8* gep0
    i10 = load i8* gep1

Computational kernel:

    i46 = phi [true, preh], [i41, body]
    ind = phi [0, preh], [i2, body]
    i2 = add ind, 1
    i7 = pop i8* gep0
    i9 = mul i7, a3
    i10 = pop i8* gep1
    i12 = mul i10, a4
    i19 = add i9, 32
    i20 = add i19, i12
    i23 = ashr i22, 6
    push i23, i8* gep4
    i40 = icmpeq i2, 8
    i41 = xor i40, true
    br i40, exit, body

Sout kernel (writes data in order):

    ind = phi [0, preh], [i2, body]
    i2 = add ind, 1
    i6 = add a2, ind
    gep4 = getelementptr i8* x1, i6
    store i23, i8* gep4
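The three slices above decouple memory access from computation: loads become pops from an input stream, stores become pushes to an output stream. The idea can be sketched in C with the FIFOs modeled as plain arrays; the names and the scaling computation are illustrative, not the slide's exact kernel:

```c
#define N 8

/* Input slice: only address generation and loads, feeding a FIFO. */
static void sin_slice(const int *x, int *fifo_in) {
    for (int i = 0; i < N; i++)
        fifo_in[i] = x[i];
}

/* Compute slice: pure computation, no memory addresses at all. */
static void compute_slice(const int *fifo_in, int *fifo_out, int scale) {
    for (int i = 0; i < N; i++)
        fifo_out[i] = fifo_in[i] * scale;
}

/* Output slice: drains the FIFO and stores results in order. */
static void sout_slice(const int *fifo_out, int *y) {
    for (int i = 0; i < N; i++)
        y[i] = fifo_out[i];
}
```

In hardware the three slices run concurrently, with the FIFOs providing backpressure between the streaming unit and the datapath.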
• FU types
• Bitwidths
• I/O bandwidth
Verilog Generation: PE Architecture
• Predication
• Code slicing
• SMS modulo scheduling
• Verilog generation
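Of these steps, predication (if-conversion) replaces control flow with unconditionally computed values selected by a predicate, which maps directly onto a datapath multiplexer. A minimal illustrative C sketch (the absolute-value example is mine, not from the slides):

```c
/* Branching form: control flow decides which value is produced. */
static int abs_branching(int x) {
    if (x < 0)
        return -x;
    return x;
}

/* Predicated form: both sides are computed unconditionally and a
   predicate selects the result -- in hardware, a multiplexer. */
static int abs_predicated(int x) {
    int p = (x < 0);   /* predicate */
    int t = -x;        /* "then" value */
    int f = x;         /* "else" value */
    return p ? t : f;  /* select = MUX */
}
```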
MATENVMED Kick-Off Meeting
Roadmap for FPGA implementation
FPGA Implementation
• Our plan is to use the same code base (e.g. OpenCL) to explore different architectures
– OpenCL used for multicore CPU, GPU, FPGA (SOpenCL)
• Fast exploration based on area, performance and power requirements
FPGA Implementation
• Monte-Carlo simulations can exploit the multi-level parallelism of FPGAs
– Multiple MC simulations per point
– Multiple points simultaneously
– Double-precision trigonometric, log, addition and multiplication functions for each walk
– Double-precision FP operations are not the FPGA's strong point, but SOpenCL can still handle them.
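The multi-level parallelism can be illustrated in C: within one point the walks are independent, and points are independent of each other, so both levels can be unrolled in hardware. This sketch uses a tiny LCG in place of a real random source and estimates π rather than the project's actual MC computation; all names are illustrative:

```c
/* Deterministic linear congruential generator (Numerical Recipes
   constants) standing in for a hardware random-number source. */
static unsigned lcg(unsigned *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* One point's Monte-Carlo estimate: every iteration of the walk loop
   is independent, so an FPGA can run many walks (and many points)
   concurrently. */
static double mc_point(unsigned seed, int walks) {
    unsigned s = seed;
    long inside = 0;
    for (int w = 0; w < walks; w++) {
        double x = (lcg(&s) >> 8) / 16777216.0;  /* uniform in [0,1) */
        double y = (lcg(&s) >> 8) / 16777216.0;
        if (x * x + y * y <= 1.0)
            inside++;
    }
    return 4.0 * inside / walks;  /* pi estimate for this point */
}
```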