7/31/2019 Performance and Programming Environment of a Combined GPU
1/17
Performance and Programming
environment of a combined
GPU/FPGA architecture
-
Supercomputing 1969-2018
1969: MFlops
1985: GFlops
1997: TFlops
2008: PFlops
2018: EFlops?
[Chart: peak performance (log scale, 1.E+03 to 1.E+18) versus year of introduction]
CDC 7600 (1969), CDC STAR (1974), Cray X-MP (1982), Cray-2 (1985), Fujitsu NWT (1990), Hitachi SR2201 (1996), Intel ASCI (1997), NEC Earth Simulator (2004), IBM BlueGene (2005), IBM Roadrunner (2008), Tianhe-I (2010), K (2011)
Trend line: MFLOPS(y) = 1.72^(y-1969)
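The trend line can be checked numerically. The sketch below is only an illustration of the formula on this slide; the function names `trend_mflops` and `years_to_gain` are hypothetical and not from the presentation.

```c
#include <assert.h>

/* Trend line from the slide: MFLOPS(y) = 1.72^(y - 1969).
   Returns the predicted peak performance in MFlops for a given year. */
double trend_mflops(int year)
{
    assert(year >= 1969);              /* the fit starts in 1969 */
    double mflops = 1.0;
    for (int y = 1969; y < year; y++)
        mflops *= 1.72;                /* one year of growth */
    return mflops;
}

/* Whole years needed before performance has grown by at least `factor`
   at a rate of 1.72x per year. */
int years_to_gain(double factor)
{
    assert(factor >= 1.0);
    int years = 0;
    for (double gain = 1.0; gain < factor; gain *= 1.72)
        years++;
    return years;
}
```

With these definitions, `trend_mflops(2018)` comes out at roughly 3.5e11 MFlops, which is a few tenths of an EFlops, and `years_to_gain(1000.0)` gives 13 years per factor of 1000, in line with the MFlops, GFlops, TFlops, PFlops milestones listed on the slide.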
-
Trendlines
Supercomputing FLOPS > Moore's law
Memory speed increase
-
Super desktop with GPU
2009: NVIDIA: personal supercomputer
Tesla C2050
515 DP GFlops, 1 TFlops SP (MADD)
144 GB/s mem. bandwidth
384 bit bus (3GB memory)
448 thread processors
Limitations: regular SIMD applications
PC-accelerator transfer: PCIe x16
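The peak figures on this slide follow from simple arithmetic. A minimal sketch, assuming a 1.15 GHz processor clock and a 3.0 GT/s effective memory transfer rate for the Tesla C2050 (neither value is on the slide; both are assumptions taken from the card's published specifications); the function names are hypothetical.

```c
#include <assert.h>

/* Peak GFlops = cores * clock in GHz * floating-point ops per core per cycle.
   A multiply-add (MADD) counts as 2 operations per cycle. */
double peak_gflops(int cores, double clock_ghz, double flops_per_cycle)
{
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_cycle > 0.0);
    return cores * clock_ghz * flops_per_cycle;
}

/* Peak memory bandwidth in GB/s = bus width in bytes * transfers in GT/s. */
double peak_bandwidth_gbs(int bus_bits, double gtransfers_per_s)
{
    assert(bus_bits > 0 && gtransfers_per_s > 0.0);
    return (bus_bits / 8.0) * gtransfers_per_s;
}
```

Under these assumptions, `peak_gflops(448, 1.15, 2.0)` gives about 1030 GFlops single precision, half of that rate (`peak_gflops(448, 1.15, 1.0)`) gives about 515 GFlops double precision, and `peak_bandwidth_gbs(384, 3.0)` gives 144 GB/s.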
-
Extended with FPGA processing
Field programmable gate array
Up to 6 FPGA modules
512 MB DDR3/module
PCIe x8 per module
PCIe x16 per board
-
Combining GPU and FPGA strengths
Image processing + Bio-informatics
Face recognition + Security
Audio processing + HMM speech recognition
Traffic analysis + Neural network control
-
Super desktop architecture
-
Programming language: C
GPU: CUDA, OpenCL
C -> PTX (Parallel Thread Execution)
FPGA: HLS (High Level Synthesis)
C -> VHDL (VHSIC Hardware Description Language)
History: AutoESL (Xilinx) -> Vivado HLS
Catapult C tool from Mentor Graphics
C-to-HDL tool from Politecnico di Milano (Italy)
C-to-Verilog tool from www.c-to-verilog.com
DIME-C from Nallatech
Handel-C from Celoxica (defunct)
HercuLeS (C/assembly-to-VHDL) tool
Impulse C from Impulse Accelerated Technologies
Nios II C-to-Hardware Acceleration Compiler from Altera
ROCCC 2.0 (free and open source C to HDL tool) from Jacquard Computing Inc.
SPARK (a C-to-VHDL) from University Of California, San Diego
SystemC from Celoxica (defunct)
-
FPGA programming environment
ROCCC:
target:
platform dependent modules (IP cores) into library
platform independent systems use modules as functions: replicate, parallelize and pipeline
optimizations:
low level: arithmetic balancing
high level: loop unrolling, fusion, wavefront, mul/div elimination, subexpression elimination
data optimizations: stream with smart buffer
output:
VHDL design + testbench
PCore (Xilinx)
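A software sketch of the idea behind "stream with smart buffer", assuming the term refers to keeping already-fetched stream elements in a small local window so that overlapping accesses in successive iterations do not reload them. This is an illustration in plain C, not ROCCC output; `TAPS` and `window_sum` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 3  /* window width, chosen only for illustration */

/* Sliding-window sum: out[j] = in[j] + in[j+1] + in[j+2].
   Each input element is read from `in` exactly once; the overlapping
   elements are reused from the local window `win`.
   `out` must hold n - TAPS + 1 elements. */
void window_sum(const int *in, size_t n, long *out)
{
    assert(in != NULL && out != NULL);
    if (n < TAPS)
        return;                       /* no complete window */

    int win[TAPS];
    for (size_t k = 0; k + 1 < TAPS; k++)
        win[k] = in[k];               /* preload the first TAPS-1 elements */

    for (size_t i = TAPS - 1; i < n; i++) {
        win[TAPS - 1] = in[i];        /* the only memory read this iteration */

        long s = 0;
        for (size_t k = 0; k < TAPS; k++)
            s += win[k];
        out[i - (TAPS - 1)] = s;

        for (size_t k = 0; k + 1 < TAPS; k++)
            win[k] = win[k + 1];      /* shift the window by one element */
    }
}
```

Without the window, each output would need three loads; with it, one load per output is enough, which matters when memory ports are the limiting resource.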
-
FPGA programming environment
AutoESL:
target:
Xilinx FPGAs
optimizations:
code: loop unroll, fusion, pipeline, inline
data: remap, partition, arrays, reshape, resource, stream
interface selection: handshake, fifo, bus, register, ...
output:
VHDL design
performance report: timing, design and loops latency, utilization, area, power, interface
design viewer with timeline, regs and interfaces, with feedback to source code
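Loop fusion, one of the code optimizations listed above, can be illustrated at source level. A minimal sketch in plain C; the function names and the arithmetic are hypothetical and only show the transformation, not what either tool generates.

```c
#include <assert.h>
#include <stddef.h>

/* Before fusion: two loops over the same range, with an intermediate
   array `tmp` written by the first loop and read by the second. */
void scale_then_offset(const int *in, int *tmp, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tmp[i] = in[i] * 3;
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[i] + 1;
}

/* After fusion: one loop, no intermediate array, so one pass over the
   data and no storage for `tmp`. */
void scale_offset_fused(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 3 + 1;
}
```

Both functions produce the same `out`; the fused version removes one memory round trip per element.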
-
AutoESL ROCCC
Compiler optimizations
Optimization AutoESL ROCCC
Software pipelining x x
Arithmetic balancing x x
Loop unrolling x x
Loop flatten hierarchy x
Loop fusion (merge) x
Function inlining x
Array map (combine arrays H or V) x
Array partition (into smaller, // arrays) x
Array reshape (cyclic, block) x
Array resource (e.g. single or DP RAM) x
Array streaming (FIFOs instead of RAMs) x
Smart Buffer x
Interface (handshake, none, stream, ...) x
-
Tuning design for performance
Simple example: sum of array (N=1.e8)
for(i=0; i
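The loop on this slide is cut off in the transcript. A minimal sketch of a sum-of-array loop of this kind, assuming 32-bit integer elements (the element type is not given on the slide) and a hypothetical function name `sum_array`:

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one load and one add per iteration. Every add depends on
   the previous one through `sum`, so the additions form a serial chain. */
long sum_array(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

This serial dependency is the starting point that unrolling and arithmetic balancing try to break up.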
-
Tuning design for performance
Unroll 8 times + arithmetic balancing
87e6 cycles
gain = 2.3
only 2 // adds?
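What "unroll 8 times" with arithmetic balancing means can be written out by hand at source level. The sketch below is an assumption about the shape of the transformed loop, not the tool's actual output; integer elements and the name `sum_unroll8_balanced` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum with the loop body unrolled 8 times and the additions arranged
   as a balanced tree (depth 3) instead of a chain of 8 dependent adds. */
long sum_unroll8_balanced(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        /* level 1: four independent adds */
        long s01 = (long)a[i]     + a[i + 1];
        long s23 = (long)a[i + 2] + a[i + 3];
        long s45 = (long)a[i + 4] + a[i + 5];
        long s67 = (long)a[i + 6] + a[i + 7];
        /* level 2: two independent adds */
        long s0123 = s01 + s23;
        long s4567 = s45 + s67;
        /* level 3: combine and accumulate */
        sum += s0123 + s4567;
    }

    for (; i < n; i++)                /* remainder when n is not a multiple of 8 */
        sum += a[i];
    return sum;
}
```

The four level-1 adds are independent, so in hardware they could run in parallel, but only if eight operands are available per cycle.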
-
Tuning design for performance
Dual-port memory: only 2 loads at a time!
I/O bottleneck
-
Tuning design for performance
Partition A over 4 memories (=8 ports, 256 bits)
8 loads, 4 // adds
63e6 cycles
gain = 4.8
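A software model of partitioning A over 4 memories, assuming a cyclic layout and 32-bit elements (8 ports x 32 bits matches the 256 bits on the slide, but the layout is an assumption). Sequential C only models the data placement; the parallelism itself exists in the generated hardware. `BANKS`, `partition_cyclic` and `sum_partitioned` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define BANKS 4  /* number of separate dual-port memories */

/* Cyclic partition: element i goes to bank i % BANKS at offset i / BANKS.
   Each bank must hold at least (n + BANKS - 1) / BANKS elements. */
void partition_cyclic(const int *a, size_t n, int *bank[BANKS])
{
    for (size_t i = 0; i < n; i++)
        bank[i % BANKS][i / BANKS] = a[i];
}

/* Sum over the partitioned array. Each bank is read two words at a time
   (one per port); the four banks are independent of each other, which
   gives 8 loads per step when they operate in parallel. */
long sum_partitioned(int *bank[BANKS], size_t n)
{
    long partial[BANKS] = {0};

    for (int b = 0; b < BANKS; b++) {
        size_t len = (n + BANKS - 1 - b) / BANKS;  /* elements stored in bank b */
        size_t j = 0;
        for (; j + 2 <= len; j += 2)               /* two ports: two loads per step */
            partial[b] += (long)bank[b][j] + bank[b][j + 1];
        if (j < len)                               /* odd element left over */
            partial[b] += bank[b][j];
    }
    /* balanced final reduction of the four partial sums */
    return (partial[0] + partial[1]) + (partial[2] + partial[3]);
}
```

The design choice is that no two accesses in the same step hit the same port, so the load limit of a single dual-port memory no longer applies.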
-
Tuning design for performance
Balancing Unrolling and Partitioning
[Chart: # cycles (0.E+00 to 3.E+08) versus unroll factor (1, 8, 64, 512; log axis 1 to 1000)]
Series:
Unrolling (2 ports only)
Partition=2, 4 // streams (DP)
Partition=4, 8 // streams (DP)
Partition=8, 16 // streams (DP)
Partition=16, 32 // streams (DP)
Regions: I/O bound, resource bound
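The I/O-bound region of the chart can be described with an idealized model: per cycle the design consumes at most as many elements as it has memory ports, and at most as many as the unroll factor. This is an assumption for illustration only. It ignores loop overhead, pipeline latency and the resource bound, so it does not reproduce the measured cycle counts on these slides. `model_cycles` is a hypothetical name.

```c
#include <assert.h>

/* Idealized lower bound on the cycle count for summing n elements.
   Each partition is a dual-port memory, so ports = 2 * partitions. */
unsigned long model_cycles(unsigned long n, unsigned unroll, unsigned partitions)
{
    assert(unroll >= 1 && partitions >= 1);
    unsigned ports = 2 * partitions;
    unsigned per_cycle = unroll < ports ? unroll : ports;  /* the tighter limit wins */
    return (n + per_cycle - 1) / per_cycle;                /* ceiling division */
}
```

In this model, raising the unroll factor beyond the number of ports gives no further gain: `model_cycles(100000000UL, 8, 1)` and `model_cycles(100000000UL, 64, 1)` both give 5e7 cycles, while partitioning over 4 memories lowers the unroll-8 case to 1.25e7.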
-
Tuning design for performance
Resource saturation at 512 times unroll
[Chart: # cycles (0.E+00 to 3.E+08) versus unroll factor (1, 8, 64, 512; log axis 1 to 1000)]
Series: Partition=16, 32 // streams (DP); Virtex 6 Partition=16, 32 // streams (DP)
Labels: Spartan3e, Virtex 6