GMAC: Global Memory for Accelerators
Isaac Gelado, PUMPS Summer School - Barcelona



TRANSCRIPT

Page 1: GMAC Global Memory for Accelerators

GMAC: Global Memory for Accelerators

Isaac Gelado, PUMPS Summer School - Barcelona

Page 2: GMAC Global Memory for Accelerators

Vector Addition CUDA code
• Vector addition
  – Really simple kernel code
  – But, what about the CPU code?

• GMAC is a complement to the CUDA run-time
  – Simplifies the CPU code
  – Exploits advanced CUDA features for free

__global__ void vector(float *c, float *a, float *b, size_t size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < size) c[idx] = a[idx] + b[idx];
}


Page 3: GMAC Global Memory for Accelerators

Some easy CUDA code (I)
• Read from disk, transfer to GPU and compute

int main(int argc, char *argv[])
{
    float *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
    size_t size = LENGTH * sizeof(float);

    assert((h_a = (float *)malloc(size)) != NULL);
    assert((h_b = (float *)malloc(size)) != NULL);
    assert((h_c = (float *)malloc(size)) != NULL);

    assert(cudaMalloc((void **)&d_a, size) == cudaSuccess);
    assert(cudaMalloc((void **)&d_b, size) == cudaSuccess);
    assert(cudaMalloc((void **)&d_c, size) == cudaSuccess);

    read_file(argv[A], h_a);
    read_file(argv[B], h_b);

    assert(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice) == cudaSuccess);
    assert(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice) == cudaSuccess);


Page 4: GMAC Global Memory for Accelerators

Some easy CUDA code (and II)
• Read from disk, transfer to GPU and compute

    dim3 Db(BLOCK_SIZE);
    dim3 Dg(LENGTH / BLOCK_SIZE);
    if(LENGTH % BLOCK_SIZE) Dg.x++;
    vector<<<Dg, Db>>>(d_c, d_a, d_b, LENGTH);
    assert(cudaThreadSynchronize() == cudaSuccess);

    assert(cudaMemcpy(h_c, d_c, LENGTH * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess);

    save_file(argv[C], h_c);

    free(h_a); cudaFree(d_a);
    free(h_b); cudaFree(d_b);
    free(h_c); cudaFree(d_c);

    return 0;
}


Page 5: GMAC Global Memory for Accelerators

Some really easy GMAC code


int main(int argc, char *argv[])
{
    float *a, *b, *c;
    size_t size = LENGTH * sizeof(float);

    assert(gmacMalloc((void **)&a, size) == gmacSuccess);
    assert(gmacMalloc((void **)&b, size) == gmacSuccess);
    assert(gmacMalloc((void **)&c, size) == gmacSuccess);

    read_file(argv[A], a);
    read_file(argv[B], b);

    dim3 Db(BLOCK_SIZE);
    dim3 Dg(LENGTH / BLOCK_SIZE);
    if(LENGTH % BLOCK_SIZE) Dg.x++;
    vector<<<Dg, Db>>>(c, a, b, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);

    save_file(argv[C], c);

    gmacFree(a); gmacFree(b); gmacFree(c);

    return 0;
}

There is no memory copy.

Page 6: GMAC Global Memory for Accelerators

Getting GMAC
• GMAC is at http://adsm.googlecode.com/

• Debian / Ubuntu binary and development .deb files

• UNIX (also MacOS X) source code package

• Experimental versions from mercurial repository


Page 7: GMAC Global Memory for Accelerators

Outline
• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication

• Conclusions


Page 8: GMAC Global Memory for Accelerators

GMAC Memory Model
• Unified CPU / GPU virtual address space
• Asymmetric address space visibility (sketched below)

[Diagram: CPU and GPU attached to one memory; shared data is visible to both, CPU data only to the CPU]
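As a hedged illustration of the asymmetric visibility (the kernel name and sizes are made up for this sketch, not from the slides): a pointer returned by gmacMalloc is shared data, usable through the same value both on the CPU and inside kernels, while ordinary malloc'd memory stays CPU data.

#include <assert.h>
#include <stdlib.h>
#include <gmac.h>

__global__ void scale(float *v, float f, size_t n)
{
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < n) v[i] *= f;
}

int main(void)
{
    const size_t n = 1024;
    float *shared = NULL;
    float *cpu_only = (float *)malloc(n * sizeof(float));   /* CPU data: the GPU cannot see it */
    assert(gmacMalloc((void **)&shared, n * sizeof(float)) == gmacSuccess);  /* shared data */

    shared[0] = 1.0f;                                   /* the CPU uses the pointer directly */
    scale<<<(n + 255) / 256, 256>>>(shared, 2.0f, n);   /* the GPU uses the very same pointer */
    assert(gmacThreadSynchronize() == gmacSuccess);

    /* scale<<<...>>>(cpu_only, ...) would be invalid: cpu_only is not visible to the GPU */
    free(cpu_only);
    gmacFree(shared);
    return 0;
}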


Page 9: GMAC Global Memory for Accelerators

GMAC Consistency Model
• Implicit acquire / release primitives at accelerator call / return boundaries (see the sketch below)


[Diagram: CPU and accelerator (ACC) timelines with the implicit acquire/release at the call and return boundaries]
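A hedged sketch of what the implicit acquire/release means for application code (the kernel and the data values are illustrative; placing the release at the call and the acquire at the synchronization point is an interpretation of the slide):

#include <assert.h>
#include <stdio.h>
#include <gmac.h>

__global__ void doubler(float *v, size_t n)
{
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < n) v[i] *= 2.0f;
}

int main(void)
{
    const size_t n = 1024;
    float *data = NULL;
    assert(gmacMalloc((void **)&data, n * sizeof(float)) == gmacSuccess);

    for(size_t i = 0; i < n; i++)                    /* CPU owns the data: no transfers yet */
        data[i] = (float)i;

    doubler<<<(n + 255) / 256, 256>>>(data, n);      /* implicit release at the call boundary */

    assert(gmacThreadSynchronize() == gmacSuccess);  /* implicit acquire at the return boundary */
    printf("%f\n", data[0]);                         /* results are visible to the CPU again */

    gmacFree(data);
    return 0;
}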

Page 10: GMAC Global Memory for Accelerators

GMAC Memory API
• Memory allocation
  gmacError_t gmacMalloc(void **ptr, size_t size)
  – Allocated memory address (returned by reference)
  – Gets the size of the data to be allocated
  – Error code, gmacSuccess if no error

• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    /* ... */
}


Page 11: GMAC Global Memory for Accelerators

GMAC Memory API
• Memory release
  gmacError_t gmacFree(void *ptr)
  – Memory address to be released
  – Error code, gmacSuccess if no error

• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    /* ... */
    gmacFree(foo);
}


Page 12: GMAC Global Memory for Accelerators

GMAC Memory API
• Memory translation (workaround for Fermi)
  void *gmacPtr(void *ptr)
  template<typename T> T *gmacPtr(T *ptr)
  – CPU memory address (parameter)
  – GPU memory address (return value)

• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    /* ... */
    kernel<<<Dg, Db>>>(gmacPtr(buffer), size);
    /* ... */
}


Page 13: GMAC Global Memory for Accelerators

GMAC Execution Example
• Get advanced CUDA features for free (see the sketch below)
  – Asynchronous data transfers
  – Pinned memory
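For comparison, a rough, hedged sketch of what "asynchronous transfers plus pinned memory" looks like when written by hand in plain CUDA (buffer names, the kernel, and the sizes are invented for this sketch); GMAC applies this kind of machinery behind a gmacMalloc'd pointer without any of it appearing in user code.

#include <assert.h>
#include <cuda_runtime.h>

__global__ void twice(float *v, size_t n)
{
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < n) v[i] *= 2.0f;
}

int main(void)
{
    const size_t n = 1 << 20;
    const size_t size = n * sizeof(float);
    float *h_buf, *d_buf;
    cudaStream_t stream;

    assert(cudaStreamCreate(&stream) == cudaSuccess);
    assert(cudaMallocHost((void **)&h_buf, size) == cudaSuccess);  /* pinned host memory */
    assert(cudaMalloc((void **)&d_buf, size) == cudaSuccess);

    for(size_t i = 0; i < n; i++) h_buf[i] = (float)i;

    /* asynchronous copies and the kernel are queued on one stream and can
       overlap with other CPU work */
    assert(cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream) == cudaSuccess);
    twice<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    assert(cudaMemcpyAsync(h_buf, d_buf, size, cudaMemcpyDeviceToHost, stream) == cudaSuccess);
    assert(cudaStreamSynchronize(stream) == cudaSuccess);

    cudaFreeHost(h_buf); cudaFree(d_buf); cudaStreamDestroy(stream);
    return 0;
}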


Page 14: GMAC Global Memory for Accelerators

Outline
• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication

• Conclusions


Page 15: GMAC Global Memory for Accelerators

GMAC Global Memory
• Data accessible by all accelerators, but owned by the CPU

[Diagram: one CPU and multiple GPUs sharing the same memory]


Page 16: GMAC Global Memory for Accelerators

GMAC Global Memory API
• Memory allocation
  gmacError_t gmacGlobalMalloc(void **ptr, size_t size)
  – Allocated memory address (returned by reference)
  – Gets the size of the data to be allocated
  – Error code, gmacSuccess if no error

• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacGlobalMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    /* ... */
}

Page 17: GMAC Global Memory for Accelerators

Outline
• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication

• Conclusions


Page 18: GMAC Global Memory for Accelerators

GMAC and Multi-threading
• In the past: one thread, one CPU
• In GMAC, one thread:
  – One CPU
  – One GPU

• A thread runs on either the GPU or the CPU, but not on both at the same time

• Create threads using what you already know (see the sketch below)
  – pthread_create(...)
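A hedged sketch of the one-thread/one-GPU model (the kernel, the element count, and the CPU-side reduction are illustrative): each pthread allocates through gmacMalloc, launches kernels on its own virtual GPU, and then reads the results back through the same pointer.

#include <assert.h>
#include <pthread.h>
#include <gmac.h>

#define N (1 << 20)

__global__ void fill(float *v, size_t n)
{
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < n) v[i] = (float)i;
}

static void *worker(void *arg)
{
    float *buf = NULL;                        /* visible to this thread's CPU code and its GPU */
    assert(gmacMalloc((void **)&buf, N * sizeof(float)) == gmacSuccess);

    fill<<<(N + 255) / 256, 256>>>(buf, N);   /* runs on this thread's GPU */
    assert(gmacThreadSynchronize() == gmacSuccess);

    float sum = 0.0f;                         /* back on the CPU: same pointer, no explicit copy */
    for(size_t i = 0; i < N; i++) sum += buf[i];

    gmacFree(buf);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for(int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
    for(int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    return 0;
}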


Page 19: GMAC Global Memory for Accelerators

GMAC and Multi-threading
• Virtual memory accessibility:
  – Complete address space in CPU code
  – Partial address space in GPU code

[Diagram: two CPU threads, each with its own GPU, sharing one memory]


Page 20: GMAC Global Memory for Accelerators

Outline
• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication

• Conclusions


Page 21: GMAC Global Memory for Accelerators

GPU Passing and Copying
• GPU passing:
  – Send the thread's virtual GPU to another thread
  – Do not move data, move computation

• API calls (a sketch follows)
  – Virtual GPU sending
    gmacError_t gmacSend(thread_id dest)
  – Virtual GPU receiving
    gmacError_t gmacReceive()
  – Virtual GPU copying
    gmacError_t gmacCopy(thread_id dest)
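A hedged sketch of GPU passing between two threads (the kernel is invented, and the assumption that a pthread_t can be used as the thread_id expected by gmacSend is not stated on the slides): the producer fills a buffer on its virtual GPU and then hands the whole virtual GPU, not the data, to the consumer.

#include <assert.h>
#include <pthread.h>
#include <gmac.h>

static float *data;                       /* allocated on the producer's virtual GPU */

__global__ void produce(float *v, size_t n)
{
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < n) v[i] = (float)i;
}

static void *consumer(void *arg)
{
    assert(gmacReceive() == gmacSuccess); /* wait for a virtual GPU (and the data it holds) */
    /* 'data' is now usable from this thread: launch further kernels on it here */
    return NULL;
}

int main(void)
{
    const size_t n = 1 << 20;
    pthread_t tid;
    pthread_create(&tid, NULL, consumer, NULL);

    assert(gmacMalloc((void **)&data, n * sizeof(float)) == gmacSuccess);
    produce<<<(n + 255) / 256, 256>>>(data, n);
    assert(gmacThreadSynchronize() == gmacSuccess);

    assert(gmacSend(tid) == gmacSuccess); /* move the computation, not the data */

    pthread_join(tid, NULL);
    return 0;
}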


Page 22: GMAC Global Memory for Accelerators

Outline
• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication

• Conclusions


Page 23: GMAC Global Memory for Accelerators

Conclusions
• Single virtual address space for CPUs and GPUs
• Use CUDA advanced features
  – Automatic overlap of data communication and computation
  – Get access to any GPU from any CPU thread

• Get more performance from your application more easily

• Go: http://adsm.googlecode.com


Page 24: GMAC Global Memory for Accelerators

GMAC: Global Memory for Accelerators

Isaac Gelado, PUMPS Summer School - Barcelona

Page 25: GMAC Global Memory for Accelerators

Backup Slides

Page 26: GMAC Global Memory for Accelerators

GMAC Unified Address Space
• When allocating memory (see the sketch below):
  1. Allocate accelerator memory
  2. Allocate CPU memory at the same virtual address

[Diagram: CPU with system memory and an accelerator with accelerator memory; the allocation occupies the same virtual address on both sides]
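A hedged sketch of the two-step idea, not GMAC's actual implementation (GMAC's runtime uses fixed-size segments, as the next slide notes; calling mmap with MAP_FIXED at an arbitrary device address is only an illustration and assumes the address is page aligned and free):

#include <assert.h>
#include <sys/mman.h>
#include <cuda_runtime.h>

/* Illustration only: back the virtual address returned by the accelerator
   allocation with host memory, so a single pointer value is valid on both sides. */
void *unified_alloc(size_t size)
{
    void *dev = NULL;
    assert(cudaMalloc(&dev, size) == cudaSuccess);        /* 1. allocate accelerator memory */

    void *host = mmap(dev, size, PROT_READ | PROT_WRITE,  /* 2. host memory at the same address */
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    return (host == dev) ? host : NULL;   /* on failure a real runtime would remap elsewhere */
}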


Page 27: GMAC Global Memory for Accelerators

GMAC Unified Address Space
• Use fixed-size segments to map accelerator memory
• Implement and export Accelerator Virtual Memory

[Diagram: system memory and two accelerators' memories mapped through fixed-size segments; addresses shown: 0x100100000 and 0x200100000 on the CPU side, 0x00100000 on the accelerators]


Page 28: GMAC Global Memory for Accelerators

GMAC Data Transfers
• Avoid unnecessary data copies
• Lazy-update (see the sketch below):
  – Call: transfer modified data
  – Return: transfer when needed


[Diagram: CPU, system memory, accelerator and accelerator memory, with transfers at the call/return boundaries]
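A hedged sketch of the lazy-update bookkeeping (the struct, the flag names, and the use of plain cudaMemcpy are illustrative assumptions, not GMAC's code): data modified by the CPU is copied in at the accelerator call, and data produced by the accelerator is copied back only when the CPU actually needs it.

#include <stddef.h>
#include <cuda_runtime.h>

typedef struct {
    void  *host;        /* CPU copy of the object */
    void  *dev;         /* accelerator copy of the object */
    size_t size;
    int    cpu_dirty;   /* CPU wrote the data since the last accelerator call */
    int    acc_dirty;   /* accelerator may have written the data since the last return */
} block_t;

static void on_accelerator_call(block_t *b)   /* "Call: transfer modified data" */
{
    if(b->cpu_dirty) {
        cudaMemcpy(b->dev, b->host, b->size, cudaMemcpyHostToDevice);
        b->cpu_dirty = 0;
    }
    b->acc_dirty = 1;                         /* assume the kernel may modify it */
}

static void on_cpu_access(block_t *b)         /* "Return: transfer when needed" */
{
    if(b->acc_dirty) {
        cudaMemcpy(b->host, b->dev, b->size, cudaMemcpyDeviceToHost);
        b->acc_dirty = 0;
    }
}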

Page 29: GMAC Global Memory for Accelerators

GMAC Data Transfers
• Overlap CPU execution and data transfers
• Minimal transfer on-demand
• Rolling-update:
  – Memory-block size granularity


[Diagram: CPU, system memory, accelerator and accelerator memory, with block-sized transfers overlapping CPU execution]

Page 30: GMAC Global Memory for Accelerators

GMAC: Global Memory for Accelerators

Isaac Gelado, PUMPS Summer School - Barcelona