Performance and Programming Environment of a Combined GPU/FPGA Architecture

Upload: scribd2004

Post on 05-Apr-2018


  • 7/31/2019 Performance and Programming Environment of a Combined GPU


    Performance and programming environment of a combined GPU/FPGA architecture


    Supercomputing 1969-2018

    1969: MFlops

    1985: GFlops

    1997: TFlops

    2008: PFlops

    2018: EFlops?

    [Figure: peak performance on a log scale (1.E+03 to 1.E+18) versus year. Systems plotted: CDC 7600, CDC STAR, Cray X-MP, Cray-2, Fujitsu NWT, Hitachi SR2201, Intel ASCI, NEC Earth Simulator, IBM BlueGene, IBM Roadrunner, Tianhe-I, K. Years on the axis: 1969, 1974, 1982, 1985, 1990, 1996, 1997, 2004, 2005, 2008, 2010, 2011.]

    Trend line: $\mathrm{MFLOPS}(y) = 1.72^{\,y-1969}$
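The trend line can be evaluated with a few lines of C. This is a minimal sketch, assuming the formula means growth by a factor of 1.72 per year starting from 1 MFlops in 1969. The helper name `trend_mflops` is hypothetical and not part of the slides.

```c
#include <assert.h>

/* Trend line from the slide: MFLOPS(y) = 1.72^(y - 1969).
 * Repeated multiplication is used instead of pow() so that no math
 * library is needed. `trend_mflops` is a hypothetical helper name. */
double trend_mflops(int year)
{
    assert(year >= 1969);          /* the fit starts in 1969 */
    double mflops = 1.0;           /* 1 MFlops in 1969 */
    for (int y = 1969; y < year; y++)
        mflops *= 1.72;            /* growth factor per year */
    return mflops;
}
```

Under this reading, 2008 comes out at roughly 1.5e9 MFlops, which is on the PFlops scale. 2018 comes out at a few times 1e11 MFlops, which is a fraction of an EFlops.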


    Trendlines

    Supercomputing FLOPS > Moore's law

    Memory speed increase


    Super desktop with GPU

    2009: NVIDIA: personal supercomputer

    Tesla C2050

    515 DP GFlops, 1 TFlop (MADD)

    144 GB/s mem. bandwidth

    384 bit bus (3GB memory)

    448 thread processors

    Limitations: regular SIMD applications

    PC <-> accelerator transfer: PCIe x16


    Extended with FPGA processing

    Field programmable gate array

    Up to 6 FPGA modules

    512 MB DDR3 per module

    PCIe x8 per module

    PCIe x16 per board


    Combining GPU and FPGA strengths

    Image processing + Bio-informatics

    Face recognition + Security

    Audio processing + HMM speech recognition

    Traffic analysis + Neural network control


    Super desktop architecture


    Programming language: C

    GPU: CUDA, OpenCL

    C -> PTX (Parallel Thread Execution)

    FPGA: HLS (High Level Synthesis)

    C -> VHDL (VHSIC Hardware Description Language)

    History: AutoESL (Xilinx) -> Vivado HLS

    Catapult C tool from Mentor Graphics

    C-to-HDL tool from Politecnico di Milano (Italy)

    C-to-Verilog tool from www.c-to-verilog.com

    DIME-C from Nallatech

    Handel-C from Celoxica (defunct)

    HercuLeS (C/assembly-to-VHDL) tool

    Impulse C from Impulse Accelerated Technologies

    Nios II C-to-Hardware Acceleration Compiler from Altera

    ROCCC 2.0 (free and open source C to HDL tool) from Jacquard Computing Inc.

    SPARK (a C-to-VHDL tool) from University of California, San Diego

    SystemC from Celoxica (defunct)


    FPGA programming environment

    ROCCC:

    target:

      platform dependent modules (IP cores) into library

      platform independent systems use modules as functions: replicate, parallelize and pipeline

    optimizations:

      low level: arithmetic balancing

      high level: loop unrolling, fusion, wavefront, mul/div elimination, subexpression elimination

      data optimizations: stream with smart buffer

    output: VHDL design + testbench

      PCore (Xilinx)
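Loop fusion, one of the high-level optimizations listed above, can be illustrated in plain C. This is a sketch of the general transformation and not ROCCC output. The function names, the length `LEN` and the arithmetic are illustrative assumptions.

```c
#include <assert.h>

#define LEN 8   /* illustrative length, not from the slides */

/* Before fusion: two passes and an intermediate array. */
void scale_then_offset(const int in[LEN], int out[LEN])
{
    int tmp[LEN];
    for (int i = 0; i < LEN; i++)
        tmp[i] = 3 * in[i];
    for (int i = 0; i < LEN; i++)
        out[i] = tmp[i] + 1;
}

/* After fusion: one pass, no intermediate array. */
void scale_offset_fused(const int in[LEN], int out[LEN])
{
    for (int i = 0; i < LEN; i++)
        out[i] = 3 * in[i] + 1;
}

/* Both versions must produce identical results. */
void check_fusion(void)
{
    int in[LEN] = {0, 1, 2, 3, 4, 5, 6, 7};
    int a[LEN], b[LEN];
    scale_then_offset(in, a);
    scale_offset_fused(in, b);
    for (int i = 0; i < LEN; i++)
        assert(a[i] == b[i]);
}
```

In hardware terms, the fused version needs one loop controller instead of two and no buffer for the intermediate array.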


    FPGA programming environment

    AutoESL:

    target: Xilinx FPGAs

    optimizations:

      code: loop unroll, fusion, pipeline, inline

      data: remap, partition, arrays, reshape, resource, stream

    interface selection: handshake, fifo, bus, register, ...

    output: VHDL design

    performance report: timing, design and loops latency, utilization, area, power, interface

    design viewer with timeline, regs and interfaces, with feedback to source code


    AutoESL ROCCC

    Compiler optimizations

    Optimization    AutoESL    ROCCC

    Software pipelining x x

    Arithmetic balancing x x

    Loop unrolling x x

    Loop flatten hierarchy x

    Loop fusion (merge) x

    Function inlining x

    Array map (combine arrays H or V) x

    Array partition (into smaller, // arrays) x

    Array reshape (cyclic, block) x

    Array resource (e.g. single or DP RAM) x

    Array streaming (FIFOs instead of RAMs) x

    Smart Buffer x

    Interface (handshake, none, stream, ...) x


    Tuning design for performance

    Simple example: sum of array (N=1.e8)

    for (i = 0; i < N; i++) sum += A[i];


    Tuning design for performance

    Unroll 8 times + arithmetic balancing

    87e6 cycles

    gain = 2.3

    only 2 // adds?
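Unrolling by 8 with arithmetic balancing can be written out by hand in plain C. This is only a sketch of what the HLS tool derives automatically; it is not the code from the slides. The function names, the 32-bit `int` element type and the requirement that the length be a multiple of 8 are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one load and one dependent add per iteration. */
long sum_simple(const int *A, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += A[i];
    return sum;
}

/* Unrolled 8 times with a balanced adder tree: the eight elements are
 * reduced in 3 levels (4 + 2 + 1 adds) instead of a chain of dependent adds. */
long sum_unroll8_balanced(const int *A, size_t n)
{
    assert(n % 8 == 0);   /* simplification: no remainder loop */
    long sum = 0;
    for (size_t i = 0; i < n; i += 8) {
        long s01 = (long)A[i]     + A[i + 1];   /* level 1: 4 independent adds */
        long s23 = (long)A[i + 2] + A[i + 3];
        long s45 = (long)A[i + 4] + A[i + 5];
        long s67 = (long)A[i + 6] + A[i + 7];
        long s0123 = s01 + s23;                 /* level 2: 2 independent adds */
        long s4567 = s45 + s67;
        sum += s0123 + s4567;                   /* level 3 plus accumulation */
    }
    return sum;
}
```

The four level-1 adds are independent, so hardware could run them in parallel, but only if all eight operands are available in the same cycle.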


    Tuning design for performance

    Dual-port memory: only 2 loads at a time!

    I/O bottleneck
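The I/O bottleneck can be expressed as a simple lower bound. The sketch below assumes each memory port delivers at most one element per cycle. The helper name `min_load_cycles` is hypothetical, and real designs add pipeline and control overhead on top of this bound.

```c
#include <assert.h>

/* Lower bound on cycles when each memory port delivers at most one
 * element per cycle: ceil(n / ports). Hypothetical helper; it ignores
 * pipeline fill, control overhead and the adds themselves. */
unsigned long min_load_cycles(unsigned long n, unsigned long ports)
{
    assert(ports > 0);
    return (n + ports - 1) / ports;   /* integer ceiling division */
}
```

For N = 1.e8 and one dual-port memory (2 ports), the bound is 5e7 cycles whatever the unroll factor. Only adding ports lowers it.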


    Tuning design for performance

    Partition A over 4 memories (=8 ports, 256 bits)

    8 loads, 4 // adds

    63e6 cycles

    gain = 4.8
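In Vivado HLS style, partitioning and unrolling are requested with pragmas. The sketch below is an assumption about how such a design could be annotated and is not the code behind the slides. The function name, the small `N` and the choice of cyclic partitioning are illustrative. An ordinary C compiler ignores the pragmas, so the function still runs as plain C.

```c
#include <assert.h>

#define N 64   /* illustrative size; the slides use N = 1.e8 */

/* Sum with the input array split cyclically over 4 memories, so that
 * 8 elements (4 dual-port memories x 2 ports) can be read per cycle.
 * The pragmas follow Vivado HLS syntax; a normal C compiler skips them. */
int sum_partitioned(const int A[N])
{
#pragma HLS ARRAY_PARTITION variable=A cyclic factor=4 dim=1
    assert(N % 8 == 0);   /* unroll factor divides the trip count */
    int sum = 0;
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=8
        sum += A[i];
    }
    return sum;
}
```

With cyclic partitioning by 4, any 8 consecutive elements fall two per memory. That matches the two ports of each dual-port memory.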


    Tuning design for performance

    Balancing Unrolling and Partitioning

    [Figure: # cycles (0.E+00 to 3.E+08) versus unroll factor (1, 8, 64, 512; log axis from 1 to 1000). Series: Unrolling (2 ports only); Partition=2, 4 // streams (DP); Partition=4, 8 // streams (DP); Partition=8, 16 // streams (DP); Partition=16, 32 // streams (DP). Annotated regions: I/O bound, Resource bound.]


    Tuning design for performance

    Resource saturation at 512 times unroll

    [Figure: # cycles (0.E+00 to 3.E+08) versus unroll factor (1, 8, 64, 512; log axis from 1 to 1000). Series: Unrolling; Partition=16, 32 // streams (DP); Virtex 6 Partition=16, 32 // streams (DP). Devices labelled: Spartan3e, Virtex 6.]