Efficient Abstractions for GPGPU Programming
Mathias Bourgoin & Emmanuel Chailloux
16.12.2015
Efficient abstractions for GPGPU programming
PhD work (LIP6/UPMC)

GPGPU programming → general-purpose computations on the GPU
Abstractions → languages and algorithmic constructs
Efficient → High Performance Computing
Applications → computational science and numerical simulation
OpenGPU project (Systematic cluster)
Academic and industrial partners
Goal: provide open-source solutions for GPGPU programming
Success: develop real-size numerical applications
Mathias Bourgoin & Emmanuel Chailloux Efficient Abstractions for GPGPU Programming 16.12.15 2 / 25
Graphics card

Properties of a dedicated graphics card:
Several multiprocessors
Dedicated memory
Connected to a host (CPU) via a PCI-Express bus, which implies data transfers between host and graphics-card memories
Complex and specific programming
Current hardware
            CPU        GPU
# cores     4-18       300-3000
Max memory  8GB-1.5TB  1GB-12GB
GFLOPS SP   200-720    1000-6000
GFLOPS DP   100-360    100-2000
GPGPU Programming
Two main frameworks
Cuda (NVidia)
OpenCL (OpenCL consortium)

Different languages
To write kernels: assembly (PTX, SPIR, IL, …) or subsets of C/C++
To manage kernels: C/C++/Objective-C; bindings for Fortran, Python, Java, …

Stream processing
From a data set (stream), a series of computations (kernel) is applied to each element of the stream.
GPGPU programming in practice 1

A kernel executes over a grid of blocks, each block grouping threads.
Each thread has its own registers and local memory; the threads of a block share a per-block shared memory; all blocks access the global memory of the device:

Grid
  Global memory
  Block 0: shared memory; threads (0,0) and (1,0), each with registers and local memory
  Block 1: shared memory; threads (0,1) and (1,1), each with registers and local memory

Do not forget transfers between the host and its guests:

                  CPU-x86     Mobile GPU  Gamer GPU   Gamer GPU         HPC
                  Xeon E7 v3  GTX 980M    GTX 980 Ti  Radeon R9 Fury X  Tesla K40
Memory bandwidth  102GB/s     160GB/s     336.5GB/s   512GB/s           288GB/s

PCI-Express 3.0 maximum bandwidth is 16GB/s
PCI-Express 4.0 (not before 2017): 32GB/s
GPGPU programming in practice 2
Kernel: a small example using OpenCL
Vector addition
__kernel void vec_add(__global const double *a,
                      __global const double *b,
                      __global double *c, int N)
{
  int nIndex = get_global_id(0);
  if (nIndex >= N)
    return;
  c[nIndex] = a[nIndex] + b[nIndex];
}
GPGPU programming in practice 2
Host: a small example using C
// create OpenCL device & context
cl_context hContext;
hContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, 0, 0, 0);
// query all devices available to the context
size_t nContextDescriptorSize;
clGetContextInfo(hContext, CL_CONTEXT_DEVICES, 0, 0, &nContextDescriptorSize);
cl_device_id *aDevices = malloc(nContextDescriptorSize);
clGetContextInfo(hContext, CL_CONTEXT_DEVICES,
                 nContextDescriptorSize, aDevices, 0);
// create a command queue for the first device the context reported
cl_command_queue hCmdQueue;
hCmdQueue = clCreateCommandQueue(hContext, aDevices[0], 0, 0);
// create & compile program
cl_program hProgram;
hProgram = clCreateProgramWithSource(hContext, 1,
                                     (const char **)&sProgramSource, 0, 0);
clBuildProgram(hProgram, 0, 0, 0, 0, 0);
// create kernel
cl_kernel hKernel;
hKernel = clCreateKernel(hProgram, "vec_add", 0);
// allocate device memory
cl_mem hDeviceMemA, hDeviceMemB, hDeviceMemC;
hDeviceMemA = clCreateBuffer(hContext,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             cnDimension * sizeof(cl_double), pA, 0);
hDeviceMemB = clCreateBuffer(hContext,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             cnDimension * sizeof(cl_double), pB, 0);
hDeviceMemC = clCreateBuffer(hContext, CL_MEM_WRITE_ONLY,
                             cnDimension * sizeof(cl_double), 0, 0);
// set up parameter values
clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void *)&hDeviceMemA);
clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void *)&hDeviceMemB);
clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void *)&hDeviceMemC);
cl_int n = (cl_int)cnDimension;
clSetKernelArg(hKernel, 3, sizeof(cl_int), (void *)&n);
// execute kernel
clEnqueueNDRangeKernel(hCmdQueue, hKernel, 1, 0,
                       &cnDimension, 0, 0, 0, 0);
// copy results from device back to host
clEnqueueReadBuffer(hCmdQueue, hDeviceMemC, CL_TRUE, 0,
                    cnDimension * sizeof(cl_double), pC, 0, 0, 0);
clReleaseMemObject(hDeviceMemA);
clReleaseMemObject(hDeviceMemB);
clReleaseMemObject(hDeviceMemC);
GPGPU Programming with OCaml
OCaml
High-level general-purpose programming language
Efficient sequential computations
Statically typed, with type inference
Multiparadigm (imperative, object-oriented, functional, modular)
Compiles to bytecode or native code
Automatic memory management (with a very efficient garbage collector)
Interactive toplevel (to learn, test and debug)
Interoperable with C

Portable
OS: Windows, Unix (OS X, Linux…)
Architectures: x86, x86-64, PowerPC, ARM…
OCaml - Usage
DomainsTeaching
Research
Industry
The OCaml consortium
Software examples:
The Coq proof assistant (Inria)
LexiFi’s Modeling Language for Finance
The FFTW library (MIT)
Facebook’s Hack and Flow compilers
XenServer (Citrix)
Main Goals
Target Cuda/OpenCL frameworks with OCaml
Unify these two frameworks
Abstract memory transfers
Use static type checking to verify kernels
Propose abstractions for GPGPU programming
Keep the high performance
Host-side solution : an OCaml library
SPOC overview
Abstract frameworks
Unify both APIs (Cuda/OpenCL), with dynamic linking
Portable solution: multi-GPGPU, heterogeneous

Abstract transfers
Vectors move automatically between CPU and GPGPUs
On-demand (lazy) transfers
Automatic allocation/deallocation of the memory space used by vectors (on the host as well as on GPGPU devices)
Failure during allocation on a GPGPU triggers a garbage collection
A small example
[Diagram: vectors v1, v2, v3 move between CPU RAM, GPU0 RAM and GPU1 RAM]

Example:

let dev = Devices.init ()
let n = 1_000_000
let v1 = Vector.create Vector.float64 n
let v2 = Vector.create Vector.float64 n
let v3 = Vector.create Vector.float64 n

let k = vec_add (v1, v2, v3, n)
let block = { blockX = 1024; blockY = 1; blockZ = 1 }
let grid  = { gridX = (n + 1024 - 1) / 1024; gridY = 1; gridZ = 1 }

let main () =
  random_fill v1;
  random_fill v2;
  Kernel.run k (block, grid) dev.(0);
  for i = 0 to Vector.length v3 - 1 do
    Printf.printf "res[%d] = %f; " i v3.[<i>]
  done;
Sarek: Stream ARchitecture using Extensible Kernels
Vector addition with Sarek
let vec_add = kern a b c n ->
  let open Std in
  let open Math.Float64 in
  let idx = global_thread_id in
  if idx < n then
    c.[<idx>] <- add a.[<idx>] b.[<idx>]
Vector addition with OpenCL
__kernel void vec_add(__global const double *a,
                      __global const double *b,
                      __global double *c, int N)
{
  int nIndex = get_global_id(0);
  if (nIndex >= N)
    return;
  c[nIndex] = a[nIndex] + b[nIndex];
}
Sarek
Vector addition with Sarek
let vec_add = kern a b c n ->
  let open Std in
  let open Math.Float64 in
  let idx = global_thread_id in
  if idx < n then
    c.[<idx>] <- add a.[<idx>] b.[<idx>]
Sarek features
ML-like syntax
type inference
static type checking
static compilation to OCaml code
dynamic compilation to Cuda/OpenCL
Sarek static compilation
Sarek code:
  kern a -> let idx = Std.global_thread_id () in a.[<idx>] <- 0

is first parsed into an IR:
  Bind((Id 0), (ModuleAccess((Std), (global_thread_id)), (VecSet(VecAcc ...))))

then typed. From the typed IR, two artifacts are generated:

OCaml code:
  fun a -> let idx = Std.global_thread_id () in a.[<idx>] <- 0

Kir:
  KernParams (VecVar 0, VecVar 1, ...)

Both are wrapped into a spoc_kernel:
  class spoc_class1
    method run = ...
    method compile = ...
  end
  new spoc_class1
Sarek dynamic compilation
let my_kernel = kern ... -> ...
... ;;
Kirc.gen my_kernel ;
Kirc.run my_kernel dev (block, grid) ;

Kirc.gen compiles the kernel either to a Cuda C source file, which nvcc (-O3 -ptx) then turns into Cuda PTX assembly, or to OpenCL C99 source.
Kirc.run hands the kernel source (PTX for Cuda, C99 for OpenCL) to the device, which compiles and runs it; execution then returns to OCaml code.
Vector addition
SPOC + Sarek

open Spoc

let vec_add = kern a b c n ->
  let open Std in
  let open Math.Float64 in
  let idx = global_thread_id in
  if idx < n then
    c.[<idx>] <- add a.[<idx>] b.[<idx>]

let dev = Devices.init ()
let n = 1_000_000
let v1 = Vector.create Vector.float64 n
let v2 = Vector.create Vector.float64 n
let v3 = Vector.create Vector.float64 n

let block = { blockX = 1024; blockY = 1; blockZ = 1 }
let grid  = { gridX = (n + 1024 - 1) / 1024; gridY = 1; gridZ = 1 }

let main () =
  random_fill v1;
  random_fill v2;
  Kirc.gen vec_add;
  Kirc.run vec_add (v1, v2, v3, n) (block, grid) dev.(0);
  for i = 0 to Vector.length v3 - 1 do
    Printf.printf "res[%d] = %f; " i v3.[<i>]
  done;
OCaml
No explicit transfers
Type inference
Static type checking
Portable
Heterogeneous
Sarek transformations
Using Sarek
Transformations are OCaml functions modifying the Sarek AST.
Example: map (kern a -> b)
Scalar computations ('a -> 'b) are transformed into vector ones ('a vector -> 'b vector).
Vector addition
let v1 = Vector.create Vector.float64 10_000
and v2 = Vector.create Vector.float64 10_000 in
let v3 = map2 (kern a b -> a + b) v1 v2

val map2 :
  ('a -> 'b -> 'c) sarek_kernel ->
  ?dev:Spoc.Devices.device ->
  'a Spoc.Vector.vector ->
  'b Spoc.Vector.vector -> 'c Spoc.Vector.vector
Real size example
PROP
Awarded by the UK Research Councils' HEC Strategy Committee
Simulates the scattering of e− in H-like ions at intermediate energies
Programmed in FORTRAN
Compatible with sequential architectures, HPC clusters and supercomputers
Version                      Time (s)
FORTRAN CPU                  4271
FORTRAN GPU                  951
OCaml GPU                    1195
OCaml GPU (native kernels)   1018
SPOC+Sarek achieves 80% of hand-tuned Fortran performance.
SPOC + external kernels is on par with Fortran (93%).
Type-safe: 30% code reduction.
Memory manager + GC: no more explicit transfers.
Tools - 1
OCaml Toplevel + SPOC + HPC libraries
Tools - 2
SPOC for the web
Access GPGPU from web browsers
Using the js_of_ocaml compiler
Translation of the low-level part of SPOC + development of a dedicated memory manager

Source and web demos/tutorials: http://www.algo-prog.info/spoc/
SPOC can be installed via OPAM (OCaml Package Manager)

Accessibility and teaching
Simpler than classic tools: no more explicit transfers
Web = instantly accessible
Perfect playground for GPGPU/HPC courses, focused on kernel optimization but mostly on algorithm composition
SPOC for the web
Conclusion
SPOC
Unifies Cuda/OpenCL
Automatic transfers
Compatible with existing optimized libraries

Sarek
OCaml-like syntax
Type inference and static type checking
Easily extensible

Skeletons/Transformations
Simplify programming
Offer additional automatic optimizations
Conclusion
Benchmarks
Same performance as other solutions
Heterogeneous
Efficient on GPGPUs as well as on multicore CPUs

Application: PROP
More safety (memory/types)
Keeps the level of performance
Validates our solution
Current (and future) work
Extend implementation
Extend Sarek: types, functions, recursion, polymorphism…
Optimize code generation
Dynamic and automatic optimizations for multiple architectures
Target new architectures

Extend skeletons
Cost model for Sarek
More skeletons based on Sarek
Skeletons dedicated to very heterogeneous architectures (supercomputers)
Thanks
SPOC: http://www.algo-prog.info/spoc/
SPOC is compatible with x86_64 Unix (Linux, Mac OS X) and Windows

For more information: [email protected]