taming gpu threads with f# and alea gpu · taming gpu threads with f# and alea gpu ... alea...

Taming GPU Threads with F# and Alea GPU CodeMesh 2014 London November 5th 2014 Dr. Daniel Egloff daniel.egloff@quantalea.net +41 44 520 0117 +41 79 430 036141

About Us

Technology driven financial services company

Software and solution provider for high performance and GPU computing

!   GPGPU !   The Alea Platform !   Taming GPU Kernels with F#

!   Alea Reactive Dataflow

Topics

!   Increasing importance !  More data !  More compute intestine applications !  Mobile devices !   Embedded systems

!   GPGPU programming APIs CUDA and OpenCL

!   Simplify general purpose GPU programming !   Very popular

!  Simple !  Scalable !  Easy to learn !  Good tooling !  Good platform support

Topics

!   CUDA / OpenCL development environment provided by vendors based on C/C++

!   Anatomy of a GPU accelerated code base !   Data preparation !   Delayed computations !   Dispatching for different problem size and hardware architectures !   Small fraction of code is highly optimized GPU code

!   Modern runtimes and languages such as F# are a better fit !   Functional patterns for delayed computation !   Pattern matching !   Containers, LINQ !   GC

!   Developer benefit !   Productivity !   Cross platform

Why Modern Languages for GPUs?

Alea GPU Platform Language

C#, F#, VB.NET Alea CUDA language extension

Tooling

Visual Studio Xamarin Studio

Alea compilers CUDA tools (Nsight, profilers, etc.)

Framew

.NET / Mono Alea runGme CUDA driver

CPU CUDA enabled GPU

Library

Cross language .NET libraries

Alea reacGve dataflow framework Alea performance primiGves Binding for NVIDIA libraries (cuBLAS, cuFFT, …)

!   Different compilation schemes !   Dynamic compilation !   Ahead of time compilation

!   Same performance as CUDA C/C++

!   Relying on LLVM technology

Alea Compilers

CUDA Driver

Alea.GPU Compilers

LLVM IR

Quotations

F# F# C#VB.NET

Topics

!   Kernel = function executing sequential code in parallel with many threads !   Threads grouped in blocks, multiple blocks form grid !   Thread in a block identified by (threadIdx.x, threadIdx.y, threadIdx.z) !   Block size (blockDim.x, blockDim.y, blockDim.z) !   Block in grid identified by (blockIdx.x, blockIdx.y, blockInx.z) !   Grid size (gridDim.x, gridDim.y, gridDim.z) !   Threads are always scheduled in groups, so called warps

CUDA Crash Course

Thread Thread Block Grid

!   Transform a large array of data with an unary function in parallel !   Simple data parallel problem !   Standard approach

!   Fix block and grid size based on GPU hardware capability !   Each thread does multiple array elements to process all elements

!   Example: 3 blocks with 4 threads each

Parallel Transform Example

Quotation based approach <@ … @> or [<ReflectedDefinition>] Benefits !   Discriminated unions !   Records !   Higher order functions !   Interoperability with IL based approach

Alea GPU with F# - Kernel

let kernel transform = <@ fun (n:int) (input:deviceptr<float>) (output:deviceptr<float>) -‐> let start = blockIdx.x * blockDim.x + threadIdx.x let stride = gridDim.x * blockDim.x let mutable i = start while i < n do output.[i] <-‐ (%transform) input.[i] i <-‐ i + stride @>

Alea GPU with F# - Workflow

let transform transform = cuda { let! kernel = transform |> kernel |> Compiler.DefineKernel return Entry(fun (program:Program) -‐> let worker = program.Worker let kernel = program.Apply(kernel) fun (input:float[]) -‐> let n = input.Length use input = worker.Malloc(input) use output = worker.Malloc(n) let blockSize = 256 let numSm = program.Worker.Device.Attributes.MULTIPROCESSOR_COUNT let gridSize = min (16*numSm) (divup n blockSize) let lp = LaunchParam(gridSize, blockSize) kernel.Launch lp n input.Ptr output.Ptr output.Gather() ) }

use transform = Worker.Default.LoadProgram( transform (kernel <@ fun x -‐> x*x @>)) let input = Array.init 10 (fun i -‐> float(i)) let output = transform.Run input printfn "%A" output

Alea GPU with F# - Compilation / Exec

!   Functional style !   GPU resource definition !   GPU algorithm composition !   Data preparation

!   Imperative style !   GPU kernel algorithms !   Fine grained control about GPU data, memory !   Index calculation for difficult numerical calculations !   Some functional patterns are also possible in kernels, like pattern matching

Programming Style

!   Warp scan performance primitive !   Advanced GPU programming at warp level !   Optimized version for compute architecture 3 with warp shuffle !   Dispatching logic

Example II

__shfl_up output 1 1

!   Special Matrix Multiplication X = A .* v * B !   Very long and slim matrices !   Weighting with a probability vector v !   Implementation relies on low level block reduce performance primitive !   Used in large scale optimization problem for asset management

Example III

A v B X

!   Two step reduction ! Num elements after 1st reduction = divUp (num Cols, 8 * 256)

Example III

Topics

GPU parallel programming for (almost) everyone !   Radical simplification

!   No GPU experience required !   Fast development !   High performance comes automatically !   Guaranteed memory safety

!   Cross platform !   Multiple targets

!   Single GPU !   Multiple GPUs !   Clusters and grids of GPUs

!   Based on the Alea GPU platform

Alea Reactive Dataflow – Goals

!   Dataflow !   Graph of operations !   Data propagated through graph

!   Reactive !   Feed input in arbitrary intervals !   Listen for asynchronous output

!   Program is purely descriptive !   What, not how

!   Efficient execution behind the scenes !   Vector-parallel operations !   Stream operations on GPU !   Minimize memory copying !   Hybrid multi-platform scheduling

Alea Reactive Dataflow – Model

!   Unit of calculation (typically vector-parallel) !   Input and output ports !   Port = stream of typed data !   Consumes input, produces output

Alea Reactive Dataflow – Operations

Input: T[]

Output: U[]

MatrixProduct

LeW: T[,] Right: T[,]

Output: T[,]

SpliYer

Input: Tuple<T, U>

First: T Second: U

Alea Reactive Dataflow – Graph

Random

Pairing

Average

float[]

Pair<float>[]

let randoms = Random<_>(0.0, 1.0) let coordinates = Pairing<_>() let inUnitCircle = Map<_, _> (fun p -‐> if p.Left * p.Left + p.Right * p.Right <= 1.0 then 1.0 else 0.0) let average = new Average<_>() randoms.Output.ConnectTo(coordinates.Input) coordinates.Output.ConnectTo(inUnitCircle.Input) inUnitCircle.Output.ConnectTo(average.Input)

!   Send data to input port !   Receive from output port !   All asynchronous

Alea Reactive Dataflow – Dataflow

Random

Pairing

Average

Write(…)

1000000

Write(…)

average.Output.OnReceive(fun x -‐> printfn "Pi = %f" (4.0*x)) random.Input.Send(1000) random.Input.Send(1000000)

!   Operation implement GPU and/or CPU

!   GPU operations combined to stream

!   Memory copy only when needed

!   Host scheduling with .NET TPL !   Automatic memory

management

Alea Reactive Dataflow – Scheduling

!   Alea GPU platform is a professional GPU development stack for .NET / Mono !   Productivity with modern languages and first class tooling !   Simplifies GPU cross platform and portability !   Same performance as CUDA C/C++ !   Unique features such as GPU scripting !   Full flexibility of CUDA for advanced GPU coding on .NET

!   Alea reactive dataflow !   Makes GPU acceleration accessible to programmers without GPU knowledge !   Fully independent of GPU programming model CUDA

!   Alea GPU V 2 is scheduled for February 2015

Conclusion

Thank you

taming gpu threads with f# and alea gpu · taming gpu threads with f# and alea gpu ... alea...

Documents

lecture 12 vector machines/gpus overview topicsreadings:...

suso33 josé ramón bañales urkullu - bilbao art...

gpu computing overview - sea computing.pdfa gpu is a...

parallel processing of ray tracing on gpu with dynamic ......

technical report: spidal summer reu 2016: distance array and...

the alea reactive dataflow system for gpu...

multi-gpunael/217-f19/lectures/multi-gpu.pdf · gpu...

process & alea

generating a fast gpu median filter with haskell ·...

gpgpu - cvg · 4 cuda “compute unified...

agenda cpu threads flip queue cpu queues gpu hardware queue

advances in the use of gpu acceleration for financial...

you might also like: a multi-gpu recommendation...

jaietako alea

lecture 3 control flow and synchronisation · lecture 3 9...

study: persistent threads style programming model for gpu...

uztaileko alea

cse 591/cse 392: gpu programming threads - stony...

egunsentia 3. alea

2007cleau alea