7/31/2019 Performance and Programming Environment of a Combined GPU

1/17

Performance and Programming Environment of a Combined GPU/FPGA Architecture


2/17

Supercomputing 1969-2018

1969: MFlops

1985: GFlops

1997: TFlops

2008: PFlops

2018: EFlops?

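[Figure: peak performance on a log scale (1.E+03 to 1.E+18) versus year. Machines plotted, in the order shown: CDC 7600, CDC STAR, Cray X-MP, Cray-2, Fujitsu NWT, Hitachi SR2201, Intel ASCI, NEC Earth Simulator, IBM BlueGene, IBM Roadrunner, Tianhe-I, K. Years on the axis: 1969, 1974, 1982, 1985, 1990, 1996, 1997, 2004, 2005, 2008, 2010, 2011.]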

MFLOPS(y) = 1.72^(y-1969)
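The trendline can be evaluated directly. The sketch below is a minimal C helper, assuming the fit is the exponential MFLOPS(y) = 1.72^(y-1969) (the superscript is not visible in the extracted slide text). The function name `mflops_trend` is hypothetical.

```c
#include <assert.h>

/* Trendline from the slide, read as MFLOPS(y) = 1.72^(y - 1969).
 * The exponent form is an assumption: the extracted text lost the
 * superscript. Returns the predicted performance in MFlops. */
double mflops_trend(int year)
{
    double mflops = 1.0;           /* 1 MFlops in the base year 1969 */
    assert(year >= 1969);          /* the fit starts in 1969 */
    for (int y = 1969; y < year; y++)
        mflops *= 1.72;            /* growth factor per year */
    return mflops;
}
```

Under this reading, `mflops_trend(1985)` falls in the GFlops range and `mflops_trend(2018)` lands between 1e11 and 1e12 MFlops, that is, a few tenths of an EFlops, which matches the question mark on the 2018 milestone.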


3/17

Trendlines

Supercomputing FLOPS growth > Moore's law

Memory speed increase


4/17

Super desktop with GPU

2009: NVIDIA: personal supercomputer

Tesla C2050

515 DP GFlops, 1 TFlop (MADD)

144 GB/s mem. bandwidth

384 bit bus (3GB memory)

448 thread processors

Limitations: regular SIMD applications

PC-accelerator transfer: PCIe x16
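The peak figures above imply how much arithmetic a kernel must perform per byte of memory traffic before the card becomes compute bound. A minimal sketch of that arithmetic, using only the slide's numbers (515 DP GFlops, 144 GB/s); the macro and function names are hypothetical.

```c
#include <assert.h>

/* Peak double-precision rate and memory bandwidth quoted on the slide. */
#define C2050_DP_GFLOPS 515.0
#define C2050_MEM_GBS   144.0

/* Flops that must be performed per byte loaded to keep the ALUs busy. */
double flops_per_byte(double gflops, double gbytes_per_s)
{
    assert(gbytes_per_s > 0.0);
    return gflops / gbytes_per_s;
}

/* Same ratio expressed per 8-byte double-precision operand. */
double flops_per_double(double gflops, double gbytes_per_s)
{
    return 8.0 * flops_per_byte(gflops, gbytes_per_s);
}
```

With the C2050 numbers this gives about 3.6 flops per byte, or roughly 29 flops per double loaded. A kernel that performs one add per element loaded is therefore limited by memory bandwidth rather than by arithmetic.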


5/17

Extended with FPGA processing

Field programmable gate array

Up to 6 FPGA modules

512 MB DDR3/module

PCIe x8 per module

PCIe x16 per board


6/17

Combining GPU and FPGA strengths

Image processing + Bio-informatics

Face recognition + Security

Audio processing + HMM speech recognition

Traffic analysis + Neural network control


7/17

Super desktop architecture


8/17

Programming language: C

GPU: CUDA, OpenCL

C -> PTX (Parallel Thread Execution)

FPGA: HLS (High Level Synthesis)

C -> VHDL (VHSIC Hardware Description Language)

History:

AutoESL (Xilinx) -> Vivado HLS

Catapult C tool from Mentor Graphics

C-to-HDL tool from Politecnico di Milano (Italy)

C-to-Verilog tool from www.c-to-verilog.com

DIME-C from Nallatech

Handel-C from Celoxica (defunct)

HercuLeS (C/assembly-to-VHDL) tool

Impulse C from Impulse Accelerated Technologies

Nios II C-to-Hardware Acceleration Compiler from Altera

ROCCC 2.0 (free and open source C to HDL tool) from Jacquard Computing Inc.

SPARK (a C-to-VHDL) from University Of California, San Diego

SystemC from Celoxica (defunct)


9/17

FPGA programming environment

ROCCC: target:

platform-dependent modules (IP cores) into library

platform-independent systems: use modules as functions; replicate, parallelize and pipeline

optimizations

low level: arithmetic balancing

high level: loop unrolling, fusion, wavefront, mul/div elimination, subexpression elimination

data optimizations: stream with smart buffer

output: VHDL design + testbench

PCore (Xilinx)


10/17

FPGA programming environment

AutoESL: target:

Xilinx FPGAs

optimizations

code: loop unroll, fusion, pipeline, inline

data: remap, partition, arrays, reshape, resource, stream

interface selection: handshake, fifo, bus, register,

output: VHDL design

performance report: timing, design and loop latency, utilization, area, power, interface

design viewer with timeline, registers and interfaces, with feedback to source code


11/17

AutoESL ROCCC

Compiler optimizations

Optimization AutoESL ROCCC

Software pipelining x x

Arithmetic balancing x x

Loop unrolling x x

Loop flatten hierarchy x

Loop fusion (merge) x

Function inlining x

Array map (combine arrays H or V) x

Array partition (into smaller, // arrays) x

Array reshape (cyclic, block) x

Array resource (e.g. single or DP RAM) x

Array streaming (FIFOs instead of RAMs) x

Smart Buffer x

Interface (handshake, none, stream, ) x
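As an illustration of one entry in the list above, loop fusion (merge) combines two loops over the same index range into one. The sketch below is plain C written for this note, not output of either tool; the function names and the scale/offset operation are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Before fusion: two passes over the data, two loop controllers. */
void scale_then_offset(int *x, size_t n, int k, int c)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
    for (size_t i = 0; i < n; i++)
        x[i] += c;
}

/* After fusion: one pass, one load and one store per element,
 * same result as the two-loop version. */
void scale_offset_fused(int *x, size_t n, int k, int c)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] * k + c;
}
```

Both functions leave the same values in `x`; the fused form halves the memory traffic, which matters when memory ports are the limiting resource.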


12/17

Tuning design for performance

Simple example: sum of array (N=1.e8)

for(i=0; i<N; i++) sum += A[i];
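The summation loop above, written out as a complete C function. This is a minimal sketch: the slide text is cut off, so the function name `sum_array`, the `int` element type and the wide accumulator are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one load and one add per iteration.
 * Element type and accumulator width are assumptions; the slide
 * does not show the declarations. */
long long sum_array(const int *a, size_t n)
{
    long long sum = 0;
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Every iteration depends on the previous value of `sum`, so without restructuring the hardware performs the adds strictly one after another.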


13/17


Unroll 8 times + arithmetic balancing

87e6 cycles

gain = 2.3

only 2 // adds?
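What unrolling by 8 with arithmetic balancing means at source level can be sketched in plain C. The HLS tool applies this transformation itself; the hand-written version below only illustrates the resulting structure, and the name `sum_unroll8` and the `int` element type are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Sum with the loop body unrolled 8 times and the adds arranged as a
 * balanced tree (3 levels) instead of a chain of 8 dependent adds. */
long long sum_unroll8(const int *a, size_t n)
{
    long long sum = 0;
    size_t i = 0;
    assert(a != NULL || n == 0);

    for (; i + 8 <= n; i += 8) {
        /* Level 1: four independent adds */
        long long s01 = (long long)a[i]     + a[i + 1];
        long long s23 = (long long)a[i + 2] + a[i + 3];
        long long s45 = (long long)a[i + 4] + a[i + 5];
        long long s67 = (long long)a[i + 6] + a[i + 7];
        /* Level 2: two independent adds */
        long long s0123 = s01 + s23;
        long long s4567 = s45 + s67;
        /* Level 3: combine and accumulate */
        sum += s0123 + s4567;
    }
    for (; i < n; i++)             /* remainder when n is not a multiple of 8 */
        sum += a[i];
    return sum;
}
```

The four level-1 adds are independent, but they can only start in the same cycle if all eight operands are available in that cycle.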


14/17


Dual-port memory: only 2 loads at a time!

I/O bottleneck


15/17


Partition A over 4 memories (=8 ports, 256 bits)

8 loads, 4 // adds

63e6 cycles

gain = 4.8
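The partitioning step can be sketched by hand in plain C. In Vivado HLS this kind of split is typically requested with an ARRAY_PARTITION directive; the version below does it manually so that it runs with any compiler. It assumes a cyclic distribution of A over four banks, each standing for one dual-port memory; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Cyclic partition: element i of A goes to bank (i % 4), word (i / 4). */
void partition_cyclic(const int *a, size_t n,
                      int *b0, int *b1, int *b2, int *b3)
{
    int *bank[4] = { b0, b1, b2, b3 };
    for (size_t i = 0; i < n; i++)
        bank[i % 4][i / 4] = a[i];
}

/* Sum over the four banks. Each iteration reads two words per bank
 * (one per port), i.e. 8 loads feeding 4 independent adds.
 * words_per_bank must be even in this sketch. */
long long sum_partitioned(const int *b0, const int *b1,
                          const int *b2, const int *b3,
                          size_t words_per_bank)
{
    long long sum = 0;
    assert(words_per_bank % 2 == 0);
    for (size_t j = 0; j < words_per_bank; j += 2) {
        /* 4 parallel adds, one per memory */
        long long p0 = (long long)b0[j] + b0[j + 1];
        long long p1 = (long long)b1[j] + b1[j + 1];
        long long p2 = (long long)b2[j] + b2[j + 1];
        long long p3 = (long long)b3[j] + b3[j + 1];
        /* balanced reduction of the four partial sums */
        sum += (p0 + p1) + (p2 + p3);
    }
    return sum;
}
```

With this layout the eight words read in one iteration are eight consecutive elements of A, so the access pattern matches the 8-times unrolled loop while no single memory has to serve more than two loads.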


16/17


Balancing Unrolling and Partitioning

[Figure: # cycles (0.E+00 to 3.E+08) versus unroll factor (1, 8, 64, 512; log axis from 1 to 1000). Curves: unrolling with 2 ports only; Partition=2, 4 // streams (DP); Partition=16, 32 // streams (DP). Regions marked: I/O bound, resource bound.]


17/17

[Figure: # cycles (0.E+00 to 3.E+08) versus unroll factor (1, 8, 64, 512; log axis from 1 to 1000). Curves: unrolling with Partition=16, 32 // streams (DP), for Spartan3e and for Virtex 6.]

Resource saturation at 512 times unroll
