taming gpu threads with f# and alea gpu · taming gpu threads with f# and alea gpu ... alea...

Post on 05-Apr-2018

300 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Taming GPU Threads with F# and Alea GPU CodeMesh 2014 London November 5th 2014 Dr. Daniel Egloff daniel.egloff@quantalea.net +41 44 520 0117 +41 79 430 036141

About Us

Technology driven financial services company

Software and solution provider for high performance and GPU computing

!   GPGPU !   The Alea Platform !   Taming GPU Kernels with F#

!   Alea Reactive Dataflow

Topics

!   Increasing importance !  More data !  More compute intestine applications !  Mobile devices !   Embedded systems

!   GPGPU programming APIs CUDA and OpenCL

!   Simplify general purpose GPU programming !   Very popular

!  Simple !  Scalable !  Easy to learn !  Good tooling !  Good platform support

GPGPU

!   GPGPU !   The Alea Platform !   Taming GPU Kernels with F#

!   Alea Reactive Dataflow

Topics

!   CUDA / OpenCL development environment provided by vendors based on C/C++

!   Anatomy of a GPU accelerated code base !   Data preparation !   Delayed computations !   Dispatching for different problem size and hardware architectures !   Small fraction of code is highly optimized GPU code

!   Modern runtimes and languages such as F# are a better fit !   Functional patterns for delayed computation !   Pattern matching !   Containers, LINQ !   GC

!   Developer benefit !   Productivity !   Cross platform

Why Modern Languages for GPUs?

Alea GPU Platform Language

C#,  F#,  VB.NET   Alea  CUDA  language  extension  

Tooling

Visual  Studio  Xamarin  Studio  

Alea  compilers  CUDA  tools  (Nsight,  profilers,  etc.)  

Framew

ork

.NET  /  Mono  Alea  runGme  CUDA  driver  

Hardw

are

CPU   CUDA  enabled  GPU  

Library

Cross  language  .NET  libraries  

Alea  reacGve  dataflow  framework  Alea  performance  primiGves  Binding  for  NVIDIA  libraries  (cuBLAS,  cuFFT,  …)  

!   Different compilation schemes !   Dynamic compilation !   Ahead of time compilation

!   Same performance as CUDA C/C++

!   Relying on LLVM technology

Alea Compilers

CUDA  Driver

Alea.GPU  Compilers

SSAS

PTX

LLVM  IR

Quotations

F# F#  C#VB.NET

IL

GPU

!   GPGPU !   The Alea Platform !   Taming GPU Kernels with F#

!   Alea Reactive Dataflow

Topics

!   Kernel = function executing sequential code in parallel with many threads !   Threads grouped in blocks, multiple blocks form grid !   Thread in a block identified by (threadIdx.x, threadIdx.y, threadIdx.z) !   Block size (blockDim.x, blockDim.y, blockDim.z) !   Block in grid identified by (blockIdx.x, blockIdx.y, blockInx.z) !   Grid size (gridDim.x, gridDim.y, gridDim.z) !   Threads are always scheduled in groups, so called warps

CUDA Crash Course

Thread Thread Block Grid

!   Transform a large array of data with an unary function in parallel !   Simple data parallel problem !   Standard approach

!   Fix block and grid size based on GPU hardware capability !   Each thread does multiple array elements to process all elements

!   Example: 3 blocks with 4 threads each

Parallel Transform Example

Quotation based approach <@ … @> or [<ReflectedDefinition>] Benefits !   Discriminated unions !   Records !   Higher order functions !   Interoperability with IL based approach

Alea GPU with F# - Kernel

let  kernel  transform  =          <@  fun  (n:int)  (input:deviceptr<float>)  (output:deviceptr<float>)  -­‐>                  let  start  =  blockIdx.x  *  blockDim.x  +  threadIdx.x                  let  stride  =  gridDim.x  *  blockDim.x                  let  mutable  i  =  start                  while  i  <  n  do                          output.[i]  <-­‐  (%transform)  input.[i]                          i  <-­‐  i  +  stride            @>  

Alea GPU with F# - Workflow

let  transform  transform  =  cuda  {          let!  kernel  =  transform  |>  kernel  |>  Compiler.DefineKernel              return  Entry(fun  (program:Program)  -­‐>                  let  worker  =  program.Worker                  let  kernel  =  program.Apply(kernel)                    fun  (input:float[])  -­‐>                          let  n  =  input.Length                          use  input  =  worker.Malloc(input)                          use  output  =  worker.Malloc(n)                            let  blockSize  =  256                          let  numSm  =  program.Worker.Device.Attributes.MULTIPROCESSOR_COUNT                          let  gridSize  =  min  (16*numSm)  (divup  n  blockSize)                          let  lp  =  LaunchParam(gridSize,  blockSize)                                              kernel.Launch  lp  n  input.Ptr  output.Ptr                                            output.Gather()  )  }      

         use  transform  =  Worker.Default.LoadProgram(           transform  (kernel  <@  fun  x  -­‐>  x*x  @>))      let  input  =  Array.init  10  (fun  i  -­‐>  float(i))  let  output  =  transform.Run  input    printfn  "%A"  output        

Alea GPU with F# - Compilation / Exec

!   Functional style !   GPU resource definition !   GPU algorithm composition !   Data preparation

!   Imperative style !   GPU kernel algorithms !   Fine grained control about GPU data, memory !   Index calculation for difficult numerical calculations !   Some functional patterns are also possible in kernels, like pattern matching

Programming Style

!   Warp scan performance primitive !   Advanced GPU programming at warp level !   Optimized version for compute architecture 3 with warp shuffle !   Dispatching logic

Example II

__shfl_up  output  1  1

__shfl_up  output  2  1

__shfl_up  output  4  1

__shfl_up  output  8  1

!   Special Matrix Multiplication X = A .* v * B !   Very long and slim matrices !   Weighting with a probability vector v !   Implementation relies on low level block reduce performance primitive !   Used in large scale optimization problem for asset management

Example III

A v B X

!   Two step reduction ! Num elements after 1st reduction = divUp (num Cols, 8 * 256)

Example III

A

v

       

B

X’

!   GPGPU !   The Alea Platform !   Taming GPU Kernels with F#

!   Alea Reactive Dataflow

Topics

GPU parallel programming for (almost) everyone !   Radical simplification

!   No GPU experience required !   Fast development !   High performance comes automatically !   Guaranteed memory safety

!   Cross platform !   Multiple targets

!   Single GPU !   Multiple GPUs !   Clusters and grids of GPUs

!   Based on the Alea GPU platform

Alea Reactive Dataflow – Goals

!   Dataflow !   Graph of operations !   Data propagated through graph

!   Reactive !   Feed input in arbitrary intervals !   Listen for asynchronous output

!   Program is purely descriptive !   What, not how

!   Efficient execution behind the scenes !   Vector-parallel operations !   Stream operations on GPU !   Minimize memory copying !   Hybrid multi-platform scheduling

Alea Reactive Dataflow – Model

   

   

   

   

   

   

!   Unit of calculation (typically vector-parallel) !   Input and output ports !   Port = stream of typed data !   Consumes input, produces output

Alea Reactive Dataflow – Operations

Map  

Input:  T[]  

Output:  U[]  

MatrixProduct  

LeW:  T[,]   Right:  T[,]  

Output:  T[,]  

SpliYer  

Input:  Tuple<T,  U>  

First:    T   Second:  U  

Alea Reactive Dataflow – Graph

Random  

Pairing  

Map  

Average  

float[]  

int[]  

Pair<float>[]  

let  randoms  =  Random<_>(0.0,  1.0)  let  coordinates  =  Pairing<_>()  let  inUnitCircle  =  Map<_,  _>        (fun  p  -­‐>  if  p.Left  *  p.Left  +                                    p.Right  *  p.Right  <=  1.0                              then  1.0  else  0.0)  let  average  =  new  Average<_>()    randoms.Output.ConnectTo(coordinates.Input)  coordinates.Output.ConnectTo(inUnitCircle.Input)  inUnitCircle.Output.ConnectTo(average.Input)  

!   Send data to input port !   Receive from output port !   All asynchronous

Alea Reactive Dataflow – Dataflow

Random  

Pairing  

Map  

Average  

1000  

Write(…)  

1000000  

Write(…)  

average.Output.OnReceive(fun  x  -­‐>                        printfn  "Pi  =  %f"  (4.0*x))    random.Input.Send(1000)  random.Input.Send(1000000)  

!   Operation implement GPU and/or CPU

!   GPU operations combined to stream

!   Memory copy only when needed

!   Host scheduling with .NET TPL !   Automatic memory

management

Alea Reactive Dataflow – Scheduling

   

   

   

   

   

GPU  

    CPU  

COPY  

COPY  

COPY  

COPY  

!   Alea GPU platform is a professional GPU development stack for .NET / Mono !   Productivity with modern languages and first class tooling !   Simplifies GPU cross platform and portability !   Same performance as CUDA C/C++ !   Unique features such as GPU scripting !   Full flexibility of CUDA for advanced GPU coding on .NET

!   Alea reactive dataflow !   Makes GPU acceleration accessible to programmers without GPU knowledge !   Fully independent of GPU programming model CUDA

!   Alea GPU V 2 is scheduled for February 2015

Conclusion

Thank you

top related