GMAC: Global Memory for Accelerators
Isaac Gelado, PUMPS Summer School - Barcelona



TRANSCRIPT

Page 1: GMAC Global Memory for Accelerators

GMAC: Global Memory for Accelerators

Isaac Gelado, PUMPS Summer School - Barcelona

Page 2: GMAC Global Memory for Accelerators

Vector Addition CUDA code
• Vector addition
  – Really simple kernel code
  – But, what about the CPU code?

• GMAC is a complement to the CUDA run-time
  – Simplifies the CPU code
  – Exploits advanced CUDA features for free

__global__ void vector(float *c, float *a, float *b, size_t size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < size) c[idx] = a[idx] + b[idx];
}


Page 3: GMAC Global Memory for Accelerators

Some easy CUDA code (I)
• Read from disk, transfer to GPU and compute

int main(int argc, char *argv[])
{
    float *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
    size_t size = LENGTH * sizeof(float);

    assert((h_a = (float *)malloc(size)) != NULL);
    assert((h_b = (float *)malloc(size)) != NULL);
    assert((h_c = (float *)malloc(size)) != NULL);

    assert(cudaMalloc((void **)&d_a, size) == cudaSuccess);
    assert(cudaMalloc((void **)&d_b, size) == cudaSuccess);
    assert(cudaMalloc((void **)&d_c, size) == cudaSuccess);

    read_file(argv[A], h_a);
    read_file(argv[B], h_b);

    assert(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice) == cudaSuccess);
    assert(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice) == cudaSuccess);


Page 4: GMAC Global Memory for Accelerators

Some easy CUDA code (and II)
• Read from disk, transfer to GPU and compute

    dim3 Db(BLOCK_SIZE);
    dim3 Dg(LENGTH / BLOCK_SIZE);
    if(LENGTH % BLOCK_SIZE) Dg.x++;
    vector<<<Dg, Db>>>(d_c, d_a, d_b, LENGTH);
    assert(cudaThreadSynchronize() == cudaSuccess);

    assert(cudaMemcpy(h_c, d_c, LENGTH * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess);

    save_file(argv[C], h_c);

    free(h_a); cudaFree(d_a);
    free(h_b); cudaFree(d_b);
    free(h_c); cudaFree(d_c);

    return 0;
}


Page 5: GMAC Global Memory for Accelerators

Some really easy GMAC code


int main(int argc, char *argv[])
{
    float *a, *b, *c;
    size_t size = LENGTH * sizeof(float);

    assert(gmacMalloc((void **)&a, size) == gmacSuccess);
    assert(gmacMalloc((void **)&b, size) == gmacSuccess);
    assert(gmacMalloc((void **)&c, size) == gmacSuccess);

    read_file(argv[A], a);
    read_file(argv[B], b);

    dim3 Db(BLOCK_SIZE);
    dim3 Dg(LENGTH / BLOCK_SIZE);
    if(LENGTH % BLOCK_SIZE) Dg.x++;
    vector<<<Dg, Db>>>(c, a, b, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);

    save_file(argv[C], c);

    gmacFree(a); gmacFree(b); gmacFree(c);

    return 0;
}

There is no memory copy.

Page 6: GMAC Global Memory for Accelerators

Getting GMAC
• GMAC is at http://adsm.googlecode.com/

• Debian / Ubuntu binary and development .deb files

• UNIX (also MacOS X) source code package

• Experimental versions from mercurial repository


Page 7: GMAC Global Memory for Accelerators

Outline
• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication

• Conclusions


Page 8: GMAC Global Memory for Accelerators

GMAC Memory Model
• Unified CPU / GPU virtual address space
• Asymmetric address space visibility (sketched below)

[Diagram: CPU and GPU attached to one memory; shared data is visible to both, CPU data only to the CPU]
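As a hedged illustration of the asymmetric visibility (the kernel name and sizes are made up for this sketch, not from the slides): a pointer returned by gmacMalloc is shared data, usable through the same value both on the CPU and inside kernels, while ordinary malloc'd memory stays CPU data.

#include <assert.h>
#include <stdlib.h>
#include <gmac.h>

__global__ void scale(float *v, float f, size_t n)
{
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < n) v[i] *= f;
}

int main(void)
{
    const size_t n = 1024;
    float *shared = NULL;
    float *cpu_only = (float *)malloc(n * sizeof(float));   /* CPU data: the GPU cannot see it */
    assert(gmacMalloc((void **)&shared, n * sizeof(float)) == gmacSuccess);  /* shared data */

    shared[0] = 1.0f;                                   /* the CPU uses the pointer directly */
    scale<<<(n + 255) / 256, 256>>>(shared, 2.0f, n);   /* the GPU uses the very same pointer */
    assert(gmacThreadSynchronize() == gmacSuccess);

    /* scale<<<...>>>(cpu_only, ...) would be invalid: cpu_only is not visible to the GPU */
    free(cpu_only);
    gmacFree(shared);
    return 0;
}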


Page 9: GMAC Global Memory for Accelerators

GMAC Consistency Model
• Implicit acquire / release primitives at accelerator call / return boundaries (see the sketch below)


[Diagram: CPU and accelerator (ACC) timelines with the implicit acquire/release at the call and return boundaries]
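A hedged sketch of what the implicit acquire/release means for application code (the kernel and the data values are illustrative; placing the release at the call and the acquire at the synchronization point is an interpretation of the slide):

#include <assert.h>
#include <stdio.h>
#include <gmac.h>

__global__ void doubler(float *v, size_t n)
{
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < n) v[i] *= 2.0f;
}

int main(void)
{
    const size_t n = 1024;
    float *data = NULL;
    assert(gmacMalloc((void **)&data, n * sizeof(float)) == gmacSuccess);

    for(size_t i = 0; i < n; i++)                    /* CPU owns the data: no transfers yet */
        data[i] = (float)i;

    doubler<<<(n + 255) / 256, 256>>>(data, n);      /* implicit release at the call boundary */

    assert(gmacThreadSynchronize() == gmacSuccess);  /* implicit acquire at the return boundary */
    printf("%f\n", data[0]);                         /* results are visible to the CPU again */

    gmacFree(data);
    return 0;
}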

Page 10: GMAC Global Memory for Accelerators

GMAC Memory API
• Memory allocation
  gmacError_t gmacMalloc(void **ptr, size_t size)
  – Allocated memory address (returned by reference)
  – Gets the size of the data to be allocated
  – Error code, gmacSuccess if no error

• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    /* ... */
}


Page 11: GMAC Global Memory for Accelerators

GMAC Memory API
• Memory release
  gmacError_t gmacFree(void *ptr)
  – Memory address to be released
  – Error code, gmacSuccess if no error

• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    /* ... */
    gmacFree(foo);
}


Page 12: GMAC Global Memory for Accelerators

GMAC Memory API
• Memory translation (workaround for Fermi)
  void *gmacPtr(void *ptr)
  template<typename T> T *gmacPtr(T *ptr)
  – CPU memory address (parameter)
  – GPU memory address (return value)

• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    /* ... */
    kernel<<<Dg, Db>>>(gmacPtr(buffer), size);
    /* ... */
}


Page 13: GMAC Global Memory for Accelerators

GMAC Execution Example
• Get advanced CUDA features for free (see the sketch below)
  – Asynchronous data transfers
  – Pinned memory
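For comparison, a rough, hedged sketch of what "asynchronous transfers plus pinned memory" looks like when written by hand in plain CUDA (buffer names, the kernel, and the sizes are invented for this sketch); GMAC applies this kind of machinery behind a gmacMalloc'd pointer without any of it appearing in user code.

#include <assert.h>
#include <cuda_runtime.h>

__global__ void twice(float *v, size_t n)
{
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < n) v[i] *= 2.0f;
}

int main(void)
{
    const size_t n = 1 << 20;
    const size_t size = n * sizeof(float);
    float *h_buf, *d_buf;
    cudaStream_t stream;

    assert(cudaStreamCreate(&stream) == cudaSuccess);
    assert(cudaMallocHost((void **)&h_buf, size) == cudaSuccess);  /* pinned host memory */
    assert(cudaMalloc((void **)&d_buf, size) == cudaSuccess);

    for(size_t i = 0; i < n; i++) h_buf[i] = (float)i;

    /* asynchronous copies and the kernel are queued on one stream and can
       overlap with other CPU work */
    assert(cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream) == cudaSuccess);
    twice<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    assert(cudaMemcpyAsync(h_buf, d_buf, size, cudaMemcpyDeviceToHost, stream) == cudaSuccess);
    assert(cudaStreamSynchronize(stream) == cudaSuccess);

    cudaFreeHost(h_buf); cudaFree(d_buf); cudaStreamDestroy(stream);
    return 0;
}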


Page 14: GMAC Global Memory for Accelerators

Outline
• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication

• Conclusions


Page 15: GMAC Global Memory for Accelerators

GMAC Global Memory
• Data accessible by all accelerators, but owned by the CPU

[Diagram: one CPU and multiple GPUs sharing the same memory]


Page 16: GMAC Global Memory for Accelerators

GMAC Global Memory API
• Memory allocation
  gmacError_t gmacGlobalMalloc(void **ptr, size_t size)
  – Allocated memory address (returned by reference)
  – Gets the size of the data to be allocated
  – Error code, gmacSuccess if no error

• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacGlobalMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    /* ... */
}

Page 17: GMAC Global Memory for Accelerators

Outline
• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication

• Conclusions


Page 18: GMAC Global Memory for Accelerators

GMAC and Multi-threading
• In the past: one thread, one CPU
• In GMAC, one thread:
  – One CPU
  – One GPU

• A thread runs on either the GPU or the CPU, but not on both at the same time

• Create threads using what you already know (see the sketch below)
  – pthread_create(...)
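A hedged sketch of the one-thread/one-GPU model (the kernel, the element count, and the CPU-side reduction are illustrative): each pthread allocates through gmacMalloc, launches kernels on its own virtual GPU, and then reads the results back through the same pointer.

#include <assert.h>
#include <pthread.h>
#include <gmac.h>

#define N (1 << 20)

__global__ void fill(float *v, size_t n)
{
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < n) v[i] = (float)i;
}

static void *worker(void *arg)
{
    float *buf = NULL;                        /* visible to this thread's CPU code and its GPU */
    assert(gmacMalloc((void **)&buf, N * sizeof(float)) == gmacSuccess);

    fill<<<(N + 255) / 256, 256>>>(buf, N);   /* runs on this thread's GPU */
    assert(gmacThreadSynchronize() == gmacSuccess);

    float sum = 0.0f;                         /* back on the CPU: same pointer, no explicit copy */
    for(size_t i = 0; i < N; i++) sum += buf[i];

    gmacFree(buf);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for(int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
    for(int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    return 0;
}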


Page 19: GMAC Global Memory for Accelerators

GMAC and Multi-threading
• Virtual memory accessibility:
  – Complete address space in CPU code
  – Partial address space in GPU code

[Diagram: two CPU threads, each with its own GPU, sharing one memory]


Page 20: GMAC Global Memory for Accelerators

Outline
• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication

• Conclusions


Page 21: GMAC Global Memory for Accelerators

GPU Passing and Copying
• GPU passing:
  – Send the thread's virtual GPU to another thread
  – Do not move data, move computation

• API calls (a sketch follows)
  – Virtual GPU sending
    gmacError_t gmacSend(thread_id dest)
  – Virtual GPU receiving
    gmacError_t gmacReceive()
  – Virtual GPU copying
    gmacError_t gmacCopy(thread_id dest)
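A hedged sketch of GPU passing between two threads (the kernel is invented, and the assumption that a pthread_t can be used as the thread_id expected by gmacSend is not stated on the slides): the producer fills a buffer on its virtual GPU and then hands the whole virtual GPU, not the data, to the consumer.

#include <assert.h>
#include <pthread.h>
#include <gmac.h>

static float *data;                       /* allocated on the producer's virtual GPU */

__global__ void produce(float *v, size_t n)
{
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < n) v[i] = (float)i;
}

static void *consumer(void *arg)
{
    assert(gmacReceive() == gmacSuccess); /* wait for a virtual GPU (and the data it holds) */
    /* 'data' is now usable from this thread: launch further kernels on it here */
    return NULL;
}

int main(void)
{
    const size_t n = 1 << 20;
    pthread_t tid;
    pthread_create(&tid, NULL, consumer, NULL);

    assert(gmacMalloc((void **)&data, n * sizeof(float)) == gmacSuccess);
    produce<<<(n + 255) / 256, 256>>>(data, n);
    assert(gmacThreadSynchronize() == gmacSuccess);

    assert(gmacSend(tid) == gmacSuccess); /* move the computation, not the data */

    pthread_join(tid, NULL);
    return 0;
}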


Page 22: GMAC Global Memory for Accelerators

Outline
• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication

• Conclusions


Page 23: GMAC Global Memory for Accelerators

Conclusions
• Single virtual address space for CPUs and GPUs
• Use CUDA advanced features
  – Automatic overlap of data communication and computation
  – Get access to any GPU from any CPU thread

• Get more performance from your application more easily

• Go: http://adsm.googlecode.com


Page 24: GMAC Global Memory for Accelerators

GMAC: Global Memory for Accelerators

Isaac Gelado, PUMPS Summer School - Barcelona

Page 25: GMAC Global Memory for Accelerators

Backup Slides

Page 26: GMAC Global Memory for Accelerators

GMAC Unified Address Space
• When allocating memory (see the sketch below):
  1. Allocate accelerator memory
  2. Allocate CPU memory at the same virtual address

[Diagram: CPU with system memory and an accelerator with accelerator memory; the allocation occupies the same virtual address on both sides]
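A hedged sketch of the two-step idea, not GMAC's actual implementation (GMAC's runtime uses fixed-size segments, as the next slide notes; calling mmap with MAP_FIXED at an arbitrary device address is only an illustration and assumes the address is page aligned and free):

#include <assert.h>
#include <sys/mman.h>
#include <cuda_runtime.h>

/* Illustration only: back the virtual address returned by the accelerator
   allocation with host memory, so a single pointer value is valid on both sides. */
void *unified_alloc(size_t size)
{
    void *dev = NULL;
    assert(cudaMalloc(&dev, size) == cudaSuccess);        /* 1. allocate accelerator memory */

    void *host = mmap(dev, size, PROT_READ | PROT_WRITE,  /* 2. host memory at the same address */
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    return (host == dev) ? host : NULL;   /* on failure a real runtime would remap elsewhere */
}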


Page 27: GMAC Global Memory for Accelerators

GMAC Unified Address Space
• Use fixed-size segments to map accelerator memory
• Implement and export Accelerator Virtual Memory

[Diagram: system memory and two accelerators' memories mapped through fixed-size segments; addresses shown: 0x100100000 and 0x200100000 on the CPU side, 0x00100000 on the accelerators]


Page 28: GMAC Global Memory for Accelerators

GMAC Data Transfers
• Avoid unnecessary data copies
• Lazy-update (see the sketch below):
  – Call: transfer modified data
  – Return: transfer when needed


[Diagram: CPU, system memory, accelerator and accelerator memory, with transfers at the call/return boundaries]
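A hedged sketch of the lazy-update bookkeeping (the struct, the flag names, and the use of plain cudaMemcpy are illustrative assumptions, not GMAC's code): data modified by the CPU is copied in at the accelerator call, and data produced by the accelerator is copied back only when the CPU actually needs it.

#include <stddef.h>
#include <cuda_runtime.h>

typedef struct {
    void  *host;        /* CPU copy of the object */
    void  *dev;         /* accelerator copy of the object */
    size_t size;
    int    cpu_dirty;   /* CPU wrote the data since the last accelerator call */
    int    acc_dirty;   /* accelerator may have written the data since the last return */
} block_t;

static void on_accelerator_call(block_t *b)   /* "Call: transfer modified data" */
{
    if(b->cpu_dirty) {
        cudaMemcpy(b->dev, b->host, b->size, cudaMemcpyHostToDevice);
        b->cpu_dirty = 0;
    }
    b->acc_dirty = 1;                         /* assume the kernel may modify it */
}

static void on_cpu_access(block_t *b)         /* "Return: transfer when needed" */
{
    if(b->acc_dirty) {
        cudaMemcpy(b->host, b->dev, b->size, cudaMemcpyDeviceToHost);
        b->acc_dirty = 0;
    }
}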

Page 29: GMAC Global Memory for Accelerators

GMAC Data Transfers
• Overlap CPU execution and data transfers
• Minimal transfer on-demand
• Rolling-update:
  – Memory-block size granularity


[Diagram: CPU, system memory, accelerator and accelerator memory, with block-sized transfers overlapping CPU execution]

Page 30: GMAC Global Memory for Accelerators

GMAC: Global Memory for Accelerators

Isaac Gelado, PUMPS Summer School - Barcelona