Big Kernel: High Performance CPU-GPU Communication Pipelining for Big Data style Applications Sajitha Naduvil-Vadukootu CSC 8530 (Parallel Algorithms) Instructor: Dr. Sushil Prasad


Page 1:

Big Kernel: High Performance CPU-GPU Communication Pipelining for Big Data style Applications

Sajitha Naduvil-Vadukootu

CSC 8530 (Parallel Algorithms)
Instructor: Dr. Sushil Prasad

Page 2:

Outline
• Background on CPU-GPU communication
• Problem Statement
• What is Big Kernel?
• How does Big Kernel help?
• Implementation Details
• Results & Improvements

Page 3:

Review: GPU Programming

• Programming Model:
  o GPU = Device, CPU = Host, Kernel = Program
  o A GPU/CUDA program copies input data to the GPU, launches the Kernel, and copies the results back after execution.
  o Threads are grouped into warps and thread blocks.
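The copy-in / launch / copy-out flow above can be sketched with a plain-Python stand-in (no real GPU is involved; `device_mem`, `memcpy_host_to_device`, and the other helpers are hypothetical simulations of the corresponding CUDA calls):

```python
# Illustrative stand-in for the CUDA host-side flow: copy data to the
# device, launch the kernel, copy the results back. "Device memory" is
# simulated with a dict keyed by buffer name.
device_mem = {}

def memcpy_host_to_device(name, host_data):
    device_mem[name] = list(host_data)      # cudaMemcpy(..., HostToDevice)

def launch_kernel(kernel, in_name, out_name, n_threads):
    # Each simulated "thread" processes one element of the input buffer.
    device_mem[out_name] = [kernel(device_mem[in_name][i])
                            for i in range(n_threads)]

def memcpy_device_to_host(name):
    return list(device_mem[name])           # cudaMemcpy(..., DeviceToHost)

data = [1, 2, 3, 4]
memcpy_host_to_device("in", data)
launch_kernel(lambda x: x * x, "in", "out", len(data))
result = memcpy_device_to_host("out")       # [1, 4, 9, 16]
```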

• Memory Model:
  o Registers
    • Per thread
    • Data lifetime = thread lifetime
  o Local memory
    • Per-thread off-chip memory (physically in device DRAM)
    • Data lifetime = thread lifetime
  o Shared memory
    • Per-thread-block on-chip memory
    • Data lifetime = block lifetime
  o Global (device) memory
    • Accessible by all threads as well as the host (CPU)
    • Data lifetime = from allocation to deallocation
  o Host (CPU) memory
    • Not directly accessible by CUDA threads

Page 4:

Problem Statement
• Scope: streaming algorithms that process data that does not fit inside GPU memory.
• Problem: suboptimal execution due to issues stemming from the transfer of data between CPU and GPU.
• Traditional Solution: partition the data on the CPU side and call the Kernel iteratively for each partition.
• Double Buffering Scheme: the CPU fills one buffer while the GPU consumes data from a second buffer.
• Dynamic Stream Graph [2] API: the programmer specifies high-level communication hints for optimization.
• Issues: heavy burden on the programmer, error prone.
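As an illustration of the double-buffering scheme, here is a minimal single-threaded Python simulation (`double_buffered_sum` and its buffer-swapping logic are invented for this sketch; a real implementation overlaps the fill and the kernel in time):

```python
# Simulated double buffering: while the "GPU" consumes one buffer, the
# "CPU" fills the other, so that in a real system transfer and compute
# overlap. Here the two roles simply alternate.
def double_buffered_sum(stream, chunk_size):
    buffers = [[], []]                    # two staging buffers
    total, fill, it = 0, 0, iter(stream)
    # Prime the first buffer.
    buffers[fill] = [x for _, x in zip(range(chunk_size), it)]
    while buffers[fill]:
        compute = fill                    # the "GPU" consumes this buffer...
        fill = 1 - fill
        # ...while the "CPU" fills the other one.
        buffers[fill] = [x for _, x in zip(range(chunk_size), it)]
        total += sum(buffers[compute])    # stand-in for the kernel
    return total

print(double_buffered_sum(range(10), 3))  # 45
```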

Page 5:

Introducing Big Kernel
• Analyzing the problem statement further:
  o Tendency for coding errors
  o Efficiency of partitioning the data
  o Physical bandwidth of the PCIe link between the memories
  o The GPU needs threads to access memory in adjacent locations (coalescing).
• Solution: a 4-stage pipeline with data prefetching
• Acts like virtual memory for GPU threads
• The programmer writes code against arbitrarily large data structures
• A static (compile-time) transformation turns the Kernel into a 'Big Kernel'.
• i.e., data partitioning, data transfer, and CPU-GPU communication are managed underneath.

Page 6:

Big Kernel: Pipeline
• 1) Prefetch Address Generation: GPU threads calculate the addresses of the data needed for later computation and record them in a CPU-side address buffer.
• 2) Data Assembly: the CPU assembles the data in the prefetch buffer based on the addresses from (1).
• 3) Data Transfer: the GPU DMA engine transfers the contents of the prefetch buffer to the data buffer on the GPU.
• 4) Kernel Computation: GPU threads execute the actual computation using the data in the data buffer.
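A minimal sequential simulation of the four stages for a single chunk (`run_chunk` and the `2 * tid` access pattern are made up for illustration; the real pipeline runs the stages concurrently on different chunks):

```python
# Sequential sketch of the four BigKernel pipeline stages for one chunk.
def run_chunk(big_array, thread_ids, compute):
    # Stage 1: each "thread" records the address it will need later.
    address_buffer = [2 * tid for tid in thread_ids]     # example access pattern
    # Stage 2: the CPU gathers those elements into a contiguous prefetch buffer.
    prefetch_buffer = [big_array[a] for a in address_buffer]
    # Stage 3: DMA transfer to the GPU-side data buffer (simulated by a copy).
    data_buffer = list(prefetch_buffer)
    # Stage 4: the kernel computes on the contiguous data buffer.
    return [compute(x) for x in data_buffer]

big_array = list(range(100))
print(run_chunk(big_array, thread_ids=[0, 1, 2, 3], compute=lambda x: x + 1))
# [1, 3, 5, 7]
```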

Page 7:

Stage 1: Prefetch Address Generation
• Compile time: from the GPU Kernel code, remove all instructions other than
  o (1) control flow statements
  o (2) statements contributing to memory address computation
  o (3) memory access instructions
• Memory access instructions are changed to write the accessed address into an address buffer on the CPU side.
• Optimization: patterns are applied to encode the addresses on the GPU side; the pattern is transferred to the CPU and decoded on the CPU side.
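For instance, a regular stride can be encoded as (base, stride, count) instead of shipping every address over PCIe. A toy encoder/decoder along those lines (this compact format is a hypothetical illustration, not BigKernel's actual encoding):

```python
# Toy stride encoding of an address list: send (base, stride, count)
# triples instead of every address, and expand them on the CPU side.
def encode(addresses):
    runs, i = [], 0
    while i < len(addresses):
        base, stride, count = addresses[i], None, 1
        while i + count < len(addresses):
            step = addresses[i + count] - addresses[i + count - 1]
            if stride is None:
                stride = step
            if step != stride:
                break                      # stride changed: start a new run
            count += 1
        runs.append((base, stride or 0, count))
        i += count
    return runs

def decode(runs):
    return [base + stride * k
            for base, stride, count in runs
            for k in range(count)]

addrs = [100, 104, 108, 112, 500, 504, 508]
print(encode(addrs))                       # [(100, 4, 4), (500, 4, 3)]
assert decode(encode(addrs)) == addrs      # round-trips losslessly
```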

Page 8:

Stage 2: Data Assembly
• The CPU fetches the address pattern / addresses from the address buffer, decodes the pattern, and determines the addresses of the data items to be fetched.
• The CPU fetches the data and assembles it in contiguous locations, in the order of the addresses received.
• This lets the GPU threads in one warp access the memory in one coalesced transaction (once the data reaches GPU memory).
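The assembly step is essentially a gather into a contiguous buffer that preserves the requested order, so that consecutive threads later read consecutive elements (`assemble` is an assumed helper name, not BigKernel code):

```python
# Gather scattered host data into a contiguous prefetch buffer, in the
# exact order the GPU threads will consume it, so a warp's accesses
# land in adjacent locations and coalesce.
def assemble(host_memory, decoded_addresses):
    return [host_memory[a] for a in decoded_addresses]

host = {0: 'a', 8: 'b', 16: 'c', 24: 'd'}   # sparse host-side data
print(assemble(host, [24, 0, 16, 8]))       # ['d', 'a', 'c', 'b']
```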

Page 9:

Stage 3: Data Transfer & Stage 4: Execution
• The DMA (Direct Memory Access) engine transfers the contents of the CPU prefetch buffer to GPU memory via the PCIe link.
• One advantage is that only the minimal amount of data (what is needed next) is transferred over the PCIe link.
• Synchronization is required at stage 3 so that data still being used by threads is not overwritten: stage 3 can only proceed when all the data in the GPU buffer has been consumed.
• Synchronization is also required at stage 4: threads must wait for notification that the data transfer is complete before they start consuming the data.
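The two synchronization points behave like a classic producer/consumer handshake; a minimal sketch with Python threading events (`transfer_stage`, `kernel_stage`, and the two events are invented names for this simulation):

```python
import threading

# Two-event handshake between the transfer stage (producer) and the
# kernel stage (consumer): the producer may not overwrite the buffer
# until it has been consumed, and the consumer may not read the buffer
# until the transfer is complete.
buffer_free = threading.Event(); buffer_free.set()   # buffer starts empty
buffer_full = threading.Event()
data_buffer, results = [], []

def transfer_stage(chunks):
    for chunk in chunks:
        buffer_free.wait(); buffer_free.clear()      # stage-3 sync: wait until consumed
        data_buffer[:] = chunk                       # DMA transfer (simulated)
        buffer_full.set()                            # notify the kernel stage

def kernel_stage(n_chunks):
    for _ in range(n_chunks):
        buffer_full.wait(); buffer_full.clear()      # stage-4 sync: wait for transfer
        results.append(sum(data_buffer))             # kernel computation (simulated)
        buffer_free.set()                            # allow the next transfer

chunks = [[1, 2], [3, 4], [5, 6]]
t = threading.Thread(target=transfer_stage, args=(chunks,)); t.start()
kernel_stage(len(chunks)); t.join()
print(results)   # [3, 7, 11]
```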

Page 10:

Big Kernel: Data flow in Buffers

Page 11:

Big Kernel Pipeline

Page 12:

Example: K-means computation
• Takes in a set of data points in the numP array
• Compares each point with the existing cluster centers
• FindClosestCluster returns the ID of the cluster closest to the data point (x, y, z) in the numP array.
• The Particles array collects the returned info.
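A plain-Python version of what FindClosestCluster computes (the function body is assumed from the description above; the slides do not show it):

```python
import math

# Return the ID (index) of the cluster center closest to point (x, y, z),
# as described for FindClosestCluster. centers is a list of (x, y, z).
def find_closest_cluster(x, y, z, centers):
    return min(range(len(centers)),
               key=lambda c: math.dist((x, y, z), centers[c]))

centers = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
print(find_closest_cluster(9.0, 9.5, 10.0, centers))   # 1
```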

Page 13:

Example: K-means computation
• Assume the numP array won't fit into GPU memory.
• Partition the numP array into chunks; the Kernel processes one chunk at a time.
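The traditional chunked invocation looks roughly like this (`run_in_chunks` and its helpers are a hypothetical stand-in for the cudaMemcpy-and-launch loop):

```python
# Traditional partitioned execution: split numP into GPU-sized chunks
# and invoke the kernel once per chunk. kernel() is any per-chunk
# function; the copy-in/copy-out is implicit in this simulation.
def run_in_chunks(numP, gpu_capacity, kernel):
    results = []
    for start in range(0, len(numP), gpu_capacity):
        chunk = numP[start:start + gpu_capacity]    # copy this chunk to the GPU
        results.extend(kernel(chunk))               # launch the kernel on it
    return results

numP = list(range(10))
print(run_in_chunks(numP, 4, lambda chunk: [x * 2 for x in chunk]))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```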

Page 14:

Example: K-means computation
• Big Kernel method:
  o The CPU code uses StreamingMalloc() and StreamingMap() provided by Big Kernel.
  o Address prefetching is computed from the GPU Kernel code.

Page 15:

Additional Optimizations

• Pattern Recognition in prefetch address generation: looks for patterns in address locations and encodes the addresses, to be decoded on the CPU side.
• Data Locality in assembling data: read all the data needed by one GPU thread at one time.
• Synchronization: the first 3 stages produce data and the 4th stage consumes it; production and consumption must be synchronized.
• Buffer Allocation (active vs. inactive thread blocks): only allocate buffer space to active blocks.

Page 16:

Experimental Results
• Big-data / streaming application scenarios.
• Speed-up comparison between (1) multi-threaded CPU, (2) GPU with a single buffer, (3) GPU with double buffering, and (4) GPU with Big Kernel.

Page 17:

Improvements
• Consider applying to complex algorithms that include pointers / complex control instructions in the Kernel.
• Integration with MapReduce.

Page 18:

References
• [1] R. Mokhtari and M. Stumm. BigKernel -- High Performance CPU-GPU Communication Pipelining for Big Data-Style Applications. In Proc. 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS), page 819, 2014. ISBN: 9781479937998.
• [2] T. Komoda, S. Miwa, and H. Nakamura. Communication Library to Overlap Computation and Communication for OpenCL Application. In Proc. 26th IEEE Intl. Parallel and Distributed Processing Symp. Workshops & PhD Forum (IPDPSW), pages 567-573, 2012.