Big Kernel: High Performance CPU-GPU Communication Pipelining for Big Data Style Applications
Sajitha Naduvil-Vadukootu
CSC 8530 (Parallel Algorithms), Instructor: Dr. Sushil Prasad
Outline
• Background on CPU-GPU communication
• Problem Statement
• What is Big Kernel
• How does Big Kernel help?
• Implementation Details
• Results & Improvements
Review: GPU Programming
• Programming Model:
  o GPU = Device, CPU = Host, Kernel = Program
  o A GPU/CUDA program copies the data to the GPU, triggers the Kernel, and copies the data back after execution (see the sketch after this list).
  o Threads are grouped into warps and thread blocks
• Memory Model:
  o Registers
    • Per thread
    • Data lifetime = thread lifetime
  o Local memory
    • Per-thread off-chip memory (physically in device DRAM)
    • Data lifetime = thread lifetime
  o Shared memory
    • Per-thread-block on-chip memory
    • Data lifetime = block lifetime
  o Global (device) memory
    • Accessible by all threads as well as the host (CPU)
    • Data lifetime = from allocation to deallocation
  o Host (CPU) memory
    • Not directly accessible by CUDA threads
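A minimal CUDA sketch of this copy-in / launch / copy-out pattern; the kernel (scaleKernel) and its trivial computation are illustrative, not from the paper:

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void scaleKernel(float *d_data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            d_data[i] *= 2.0f;                           // per-thread work
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h_data = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) h_data[i] = 1.0f;

        float *d_data;
        cudaMalloc(&d_data, bytes);
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // copy data to GPU
        scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);           // trigger the Kernel
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // copy data back
        cudaFree(d_data);

        printf("h_data[0] = %.1f\n", h_data[0]);   // prints 2.0
        free(h_data);
        return 0;
    }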
Problem Statement
• Scope: streaming algorithms that process data that does not fit inside GPU memory.
• Problem: suboptimal execution due to issues stemming from the transfer of data between CPU and GPU.
• Traditional solution: partition the data on the CPU side and call the Kernel iteratively for each partition.
• Double-buffering scheme: the CPU fills one buffer while the GPU consumes data from a second buffer (a sketch follows this list).
• Dynamic Stream Graph API [2]: the programmer specifies high-level communication hints for optimization.
• Issues: these approaches put a heavy burden on the programmer and are error prone.
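A minimal sketch of the double-buffering scheme using two CUDA streams and pinned host buffers; the names (processChunk, CHUNK, runDoubleBuffered) are illustrative and this is not the paper's implementation:

    #include <cuda_runtime.h>
    #include <string.h>

    #define CHUNK (1 << 20)

    __global__ void processChunk(float *d_buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_buf[i] += 1.0f;    // stand-in for the real computation
    }

    void runDoubleBuffered(const float *h_src, long total) {
        float *h_pin[2], *d_buf[2];
        cudaStream_t s[2];
        for (int b = 0; b < 2; b++) {
            cudaMallocHost(&h_pin[b], CHUNK * sizeof(float)); // pinned => async copies
            cudaMalloc(&d_buf[b], CHUNK * sizeof(float));
            cudaStreamCreate(&s[b]);
        }
        for (long off = 0; off < total; off += CHUNK) {
            int b = (int)((off / CHUNK) & 1);                 // alternate the buffers
            int n = (int)(total - off < CHUNK ? total - off : CHUNK);
            cudaStreamSynchronize(s[b]);                      // is buffer b free again?
            memcpy(h_pin[b], h_src + off, n * sizeof(float)); // CPU fills buffer b...
            cudaMemcpyAsync(d_buf[b], h_pin[b], n * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            processChunk<<<(n + 255) / 256, 256, 0, s[b]>>>(d_buf[b], n);
            // ...while the GPU still consumes the other buffer on the other stream
        }
        cudaDeviceSynchronize();
        for (int b = 0; b < 2; b++) {
            cudaStreamDestroy(s[b]);
            cudaFree(d_buf[b]);
            cudaFreeHost(h_pin[b]);
        }
    }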
Introducing Big Kernel
• Analyzing the problem statement further:
  o Tendency for coding errors
  o Efficiency of partitioning the data
  o Physical bandwidth of the PCIe link between the memories
  o The GPU needs threads to access memory in adjacent locations (coalesced accesses).
• Solution: a 4-stage pipeline with data prefetching
• Acts like virtual memory for the GPU threads
• The programmer can work with arbitrarily large data structures
• A static (compile-time) transformation turns the Kernel into a 'Big Kernel'; i.e., data partitioning, data transfer, and CPU-GPU communication are managed underneath.
Big Kernel: Pipeline
• 1) Prefetch Address Generation: GPU threads calculate the addresses of the data needed for later computation and record them in a CPU-side address buffer.
• 2) Data Assembly: the CPU assembles the data in the prefetch buffer based on the addresses from (1).
• 3) Data Transfer: the GPU DMA engine transfers the contents of the prefetch buffer to the data buffer on the GPU.
• 4) Kernel Computation: GPU threads execute the actual computation using the data in the data buffer.
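A skeletal sketch of the stage ordering per chunk, with empty placeholder functions (all names are illustrative); note that Big Kernel actually overlaps the four stages across chunks in a pipeline, rather than running them strictly in sequence as this simplified loop does:

    // Placeholder stubs; the individual stages are sketched on the following slides.
    static void generatePrefetchAddresses(int c) { /* stage 1: GPU records addresses */ }
    static void assembleData(int c)              { /* stage 2: CPU gathers the data  */ }
    static void transferData(int c)              { /* stage 3: DMA to GPU data buffer */ }
    static void computeKernel(int c)             { /* stage 4: GPU computation       */ }

    int main(void) {
        int numChunks = 8;
        // Simplified sequential view of the 4-stage pipeline per chunk.
        for (int c = 0; c < numChunks; c++) {
            generatePrefetchAddresses(c);
            assembleData(c);
            transferData(c);
            computeKernel(c);
        }
        return 0;
    }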
Stage 1: Prefetch Address Generation
• Compile time: from the GPU Kernel code, remove all instructions other than
  o (1) control flow statements
  o (2) statements contributing to memory address computation
  o (3) memory access instructions
• Memory access instructions are changed to write the accessed address into an address buffer on the CPU side (see the sketch after this slide).
• Optimization: apply patterns to encode the addresses on the GPU side, transfer the pattern to the CPU, and decode it on the CPU side.
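A minimal sketch, assuming a toy kernel, of what the compile-time stripping might produce: the control flow and address arithmetic are kept, and the load is replaced by a store of the accessed byte offset into an address buffer (origKernel, addrKernel, and addrBuf are illustrative names, not the paper's):

    __global__ void origKernel(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                       // (1) control flow is kept
            out[i] = in[i] * 2.0f;       // (3) memory accesses: load in[i], store out[i]
    }

    // Stage-1 version: same control flow and address arithmetic, but the load
    // of in[i] is replaced by recording its byte offset into a CPU-side
    // (mapped, pinned) address buffer.
    __global__ void addrKernel(unsigned long long *addrBuf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            addrBuf[i] = (unsigned long long)i * sizeof(float);  // offset of in[i]
    }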
Stage 2: Data Assembly
• The CPU fetches the address pattern / addresses from the address buffer, decodes the pattern, and determines the addresses of the data items to be fetched.
• The CPU fetches the data and assembles it in contiguous locations, in the order the addresses were received (a sketch follows).
• This lets the GPU threads in one warp access the memory in one coalesced transaction (once the data gets to GPU memory).
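A minimal CPU-side sketch of this gather step, assuming the stage-1 buffer holds byte offsets; all names are illustrative:

    #include <stddef.h>

    void assemble(const float *src,                  // large CPU-resident data
                  const unsigned long long *addrBuf, // byte offsets from stage 1
                  float *prefetchBuf,                // contiguous staging buffer
                  size_t count) {
        for (size_t k = 0; k < count; k++) {
            size_t idx = addrBuf[k] / sizeof(float); // decode offset -> element index
            prefetchBuf[k] = src[idx];               // gather in the order received
        }
    }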
Stage 3: Data Transfer & Stage 4: Execution
• The DMA (Direct Memory Access) engine transfers the data from the CPU prefetch buffer to GPU memory via the PCIe link.
• One advantage is that only the minimal amount of data (the data that is needed next) is transferred over the PCIe link.
• Synchronization is required at stage 3 so that data still being used by threads is not overwritten: stage 3 can proceed only when all the data in GPU memory has been consumed.
• Synchronization is also required at stage 4: threads must wait for notification that the data transfer is complete before they can start consuming the data (both sync points are sketched below).
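A minimal sketch of these two synchronization points, expressed here with an ordinary CUDA stream and event rather than the paper's actual mechanism; buffer and kernel names are illustrative:

    #include <cuda_runtime.h>

    __global__ void kernelCompute(float *d_buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_buf[i] *= 2.0f;      // stand-in for the real computation
    }

    int main(void) {
        const int n = 1 << 16;
        size_t bytes = n * sizeof(float);
        float *h_prefetch, *d_buf;
        cudaMallocHost(&h_prefetch, bytes);   // CPU prefetch buffer (pinned)
        cudaMalloc(&d_buf, bytes);
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaEvent_t consumed;
        cudaEventCreate(&consumed);
        cudaEventRecord(consumed, s);         // buffer starts out "consumed"

        for (int iter = 0; iter < 4; iter++) {
            // Stage-3 sync: wait until the previous kernel has consumed d_buf
            // before overwriting it with the next chunk.
            cudaEventSynchronize(consumed);
            for (int i = 0; i < n; i++) h_prefetch[i] = (float)iter;
            cudaMemcpyAsync(d_buf, h_prefetch, bytes, cudaMemcpyHostToDevice, s);
            // Stage-4 sync: the launch is queued after the copy on the same
            // stream, so threads start only once the transfer is complete.
            kernelCompute<<<(n + 255) / 256, 256, 0, s>>>(d_buf, n);
            cudaEventRecord(consumed, s);     // buffer may be overwritten after this
        }
        cudaStreamSynchronize(s);
        cudaEventDestroy(consumed);
        cudaStreamDestroy(s);
        cudaFree(d_buf);
        cudaFreeHost(h_prefetch);
        return 0;
    }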
Big Kernel: Data flow in Buffers
Big Kernel Pipeline
Example: K-means Computation
• Takes in a set of data points in the numP array
• Compares each point with the existing cluster centers
• FindClosestCluster returns the ID of the cluster that is closest to the data point (x, y, z) in the numP array (a sketch follows).
• The particles array collects the returned cluster IDs.
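A minimal sketch of such a kernel, with FindClosestCluster inlined as a loop over the centers; the numP and particles names follow the slide, while kmeansKernel, centers, and the x,y,z-triple layout are assumptions for illustration:

    __global__ void kmeansKernel(const float *numP,    // x,y,z triples, n points
                                 const float *centers, // x,y,z triples, k centers
                                 int *particles,       // output: closest-cluster IDs
                                 int n, int k) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = numP[3*i], y = numP[3*i+1], z = numP[3*i+2];
        int best = 0;
        float bestDist = 1e30f;
        for (int c = 0; c < k; c++) {          // FindClosestCluster, inlined
            float dx = x - centers[3*c];
            float dy = y - centers[3*c+1];
            float dz = z - centers[3*c+2];
            float d = dx*dx + dy*dy + dz*dz;   // squared distance suffices
            if (d < bestDist) { bestDist = d; best = c; }
        }
        particles[i] = best;                   // collect the returned ID
    }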
Example: K-means Computation
• Assume the numP array won't fit into GPU memory.
• Partition the numP array into chunks; the Kernel executes one chunk at a time.
Example: K-means Computation
• Big Kernel method:
  o The CPU code uses StreamingMalloc() and StreamingMap(), provided by Big Kernel (a sketch follows).
  o Address prefetching is computed from the GPU code.
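A sketch of how the CPU-side setup might look. The slide names StreamingMalloc() and StreamingMap() but does not show their signatures, so the declarations below are assumptions for illustration only:

    #include <stddef.h>

    extern void *StreamingMalloc(size_t bytes);   // hypothetical signature
    extern void  StreamingMap(void *ptr);         // hypothetical signature

    void setup(int N) {
        // numP holds x,y,z triples; it may be far larger than GPU memory.
        float *numP = (float *)StreamingMalloc(3 * (size_t)N * sizeof(float));
        int *particles = (int *)StreamingMalloc((size_t)N * sizeof(int));
        StreamingMap(numP);        // Big Kernel streams this array underneath
        StreamingMap(particles);
        // The Kernel is then launched as usual; partitioning, prefetching,
        // and transfer are handled by the generated Big Kernel pipeline.
    }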
Additional Optimizations
• Pattern recognition in prefetch address generation: look for patterns in the address locations and encode the addresses so they can be decoded on the CPU side (a sketch follows this list).
• Data locality in assembling data: read all the data needed by one GPU thread at one time.
• Synchronization: the first 3 stages produce data, the 4th stage consumes it; production and consumption must be synchronized.
• Buffer allocation (active vs. inactive thread blocks): only allocate buffer space to active blocks.
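A minimal sketch of the pattern-encoding idea for the common case of strided accesses: if the recorded addresses follow base + i * stride, only (base, stride, count) needs to cross the PCIe link; the encoding and all names are illustrative:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { unsigned long long base, stride; size_t count; } StridePattern;

    // GPU/encoder side: try to compress an address list into one stride pattern.
    bool encodeStride(const unsigned long long *addr, size_t n, StridePattern *p) {
        if (n < 2) return false;
        unsigned long long stride = addr[1] - addr[0];
        for (size_t i = 2; i < n; i++)
            if (addr[i] - addr[i-1] != stride) return false;  // not a pure stride
        p->base = addr[0]; p->stride = stride; p->count = n;
        return true;
    }

    // CPU/decoder side: expand the pattern back into concrete addresses.
    unsigned long long decodeAt(const StridePattern *p, size_t i) {
        return p->base + i * p->stride;
    }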
Experimental Results
• Big-data / streaming type application scenarios
• Speed-up comparison between (1) multi-threaded CPU, (2) GPU with a single buffer, (3) GPU with double buffering, and (4) GPU with Big Kernel.
Improvements
• Consider applying Big Kernel to complex algorithms that include pointers / complex control instructions in the Kernel.
• Integration with MapReduce
References
[1] R. Mokhtari and M. Stumm. BigKernel -- High Performance CPU-GPU Communication Pipelining for Big Data-Style Applications. In Proc. 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS), p. 819, 2014.
[2] T. Komoda, S. Miwa, and H. Nakamura. Communication Library to Overlap Computation and Communication for OpenCL Application. In Proc. 26th IEEE Intl. Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), pages 567-573, 2012.