
RECITATION 2
GPU Memory
Synchronization
Instruction-level parallelism
Latency hiding
Matrix Transpose

CS179: GPU PROGRAMMING


MAIN REQUIREMENTS FOR GPU PERFORMANCE

• Sufficient parallelism
  • Latency hiding and occupancy
  • Instruction-level parallelism
  • Coherent execution within warps of threads

• Efficient memory usage
  • Coalesced memory access for global memory
  • Shared memory and bank conflicts


LATENCY HIDING

Idea: keep enough warps resident that, while some warps wait on long-latency operations (e.g., global memory reads), the GPU always has other warps ready to execute.
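A minimal sketch of what this looks like in practice (the kernel and launch configuration are illustrative, not from the lab code): give each SM many resident warps, so the scheduler can switch to another warp whenever one stalls on a global memory load.

__global__ void saxpy(float a, const float *x, float *y, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // while this warp waits on the loads,
                                  // other resident warps keep the SM busy
}

// Launch with many blocks of several warps each (256 threads = 8 warps):
// saxpy<<<(n + 255) / 256, 256>>>(2.0f, d_x, d_y, n);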


LOOP UNROLLING AND ILP

for (int i = 0; i < 10; i++) {
    output[i] = a[i] + b[i];
}

output[0] = a[0] + b[0];
output[1] = a[1] + b[1];
output[2] = a[2] + b[2];
…

• Reduce loop overhead
• Increase parallelism when each iteration of the loop is independent
• Can increase register usage
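A minimal sketch of the second point, using a hypothetical dot-product loop (a and b are float arrays of length n): with four independent accumulators, the multiply-adds in each iteration do not depend on one another, so the hardware can overlap them instead of waiting on a single running sum.

float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
for (int i = 0; i < n; i += 4) {          // assumes n is a multiple of 4
    sum0 += a[i + 0] * b[i + 0];          // the four updates are independent,
    sum1 += a[i + 1] * b[i + 1];          // so they can be issued back to back
    sum2 += a[i + 2] * b[i + 2];
    sum3 += a[i + 3] * b[i + 3];
}
float sum = sum0 + sum1 + sum2 + sum3;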


SYNCHRONIZATION

__syncthreads()
• Synchronizes all threads in a block
• Warps are already synchronized! (Can reduce __syncthreads() calls)

Atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}

• Works in global and shared memory
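A minimal usage sketch (hypothetical kernel, not from the lab code): every thread adds its element to a single accumulator in global memory with atomicAdd. This is correct but slow, because all threads serialize on one address; the next slide shows how to do better.

__global__ void atomicSumKernel(const float *input, float *accumulator, int n) {
    // *accumulator must be zeroed before the launch
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(accumulator, input[i]);   // one global atomic per thread
}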


SYNCHRONIZATION ADVICE

Do more cheap things and fewer expensive things!

Example: computing the sum of a list of numbers

Naive: each thread atomically adds its number to an accumulator in global memory

Smarter solution:
● Each thread computes its own sum in a register
● Use shared memory to sum across a block (Next week: Reduction)
● Each block does a single atomic add in global memory
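A minimal sketch of this plan, assuming a block size of 256 and a simple tree reduction in shared memory (reduction is covered properly next week); the kernel and variable names here are illustrative, not from the lab code:

__global__ void blockSumKernel(const float *input, float *accumulator, int n) {
    __shared__ float partial[256];               // one slot per thread in the block
    const int tid = threadIdx.x;

    // 1. each thread accumulates its own elements in a register
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        local += input[i];
    partial[tid] = local;
    __syncthreads();

    // 2. sum across the block in shared memory (tree reduction)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // 3. a single atomic add per block in global memory
    if (tid == 0)
        atomicAdd(accumulator, partial[0]);
}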


LAB 2

Part 1: Conceptual questions
1. Latency hiding
2. Thread divergence
3. Coalesced memory access
4. Bank conflicts and instruction dependencies

Part 2: Matrix Transpose Optimization
1. Naïve matrix transpose (given to you)
2. Shared memory matrix transpose
3. Optimal matrix transpose

You need to comment on all non-coalesced memory accesses and bank conflicts in the provided kernel code.


MATRIX TRANSPOSE

An interesting I/O problem, because every element is involved in both a stride-1 access and a stride-n access (you read along rows and write along columns, or vice versa). Not a trivial access pattern like “blur_v” from Lab 1.

The example output compares the performance of the CPU implementation against the different GPU implementations.


MATRIX TRANSPOSE

__global__
void naiveTransposeKernel(const float *input, float *output, int n) {
    // launched with (64, 16) block size and (n / 64, n / 64) grid size
    // each block transposes a 64x64 block
    const int i = threadIdx.x + 64 * blockIdx.x;
    int j = 4 * threadIdx.y + 64 * blockIdx.y;
    const int end_j = j + 4;

    for (; j < end_j; j++) {
        // read is coalesced: consecutive threadIdx.x -> consecutive addresses
        // write is non-coalesced: consecutive threadIdx.x -> addresses n floats apart
        output[j + n * i] = input[i + n * j];
    }
}
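For reference, a launch matching the comment in the kernel might look like this (d_input and d_output are hypothetical device pointers, and n is assumed to be a multiple of 64):

dim3 blockSize(64, 16);
dim3 gridSize(n / 64, n / 64);
naiveTransposeKernel<<<gridSize, blockSize>>>(d_input, d_output, n);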


SHARED MEMORY MATRIX TRANSPOSE

Idea to avoid non-coalesced accesses:
• Load from global memory with stride 1
• Store into shared memory with stride x
• __syncthreads()
• Load from shared memory with stride y
• Store to global memory with stride 1

Need to choose values of x and y to perform the transpose
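One possible way to fill in x and y (a sketch, not necessarily the layout the lab expects), keeping the (64, 16) block and 64x64 tile of the naive kernel: read the input with stride 1, stage the tile in shared memory, then write the output with stride 1. The extra padding column (65 instead of 64) anticipates the rule on the AVOIDING BANK CONFLICTS slide below.

__global__ void shmemTransposeKernel(const float *input, float *output, int n) {
    __shared__ float tile[64][65];                // 64x64 tile + 1 padding column

    const int i = threadIdx.x + 64 * blockIdx.x;
    const int j = 4 * threadIdx.y + 64 * blockIdx.y;

    // coalesced reads: consecutive threadIdx.x -> consecutive global addresses
    for (int k = 0; k < 4; k++)
        tile[threadIdx.x][4 * threadIdx.y + k] = input[i + n * (j + k)];
    __syncthreads();

    // coalesced writes: the tile is read back transposed
    const int ti = threadIdx.x + 64 * blockIdx.y;
    const int tj = 4 * threadIdx.y + 64 * blockIdx.x;
    for (int k = 0; k < 4; k++)
        output[ti + n * (tj + k)] = tile[4 * threadIdx.y + k][threadIdx.x];
}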


EXAMPLE OF A SHARED MEMORY CACHE

Let’s populate shared memory with random integers. Here’s what the first 8 of 32 banks look like:

[Figure not reproduced: the contents of the first 8 shared memory banks.]

AVOIDING BANK CONFLICTS

You can choose x and y to avoid bank conflicts.

Remember that there are 32 banks and the GPU runs threads in batches of 32 (called warps).

A stride-n access to shared memory avoids bank conflicts iff gcd(n, 32) == 1.
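For example, accessing a column of a 64-wide float tile is a stride-64 access, and gcd(64, 32) == 32 means every thread in the warp hits the same bank (a 32-way conflict). Padding each row with one extra float makes the column access stride 65, and gcd(65, 32) == 1 makes it conflict-free; this is why the shared-memory transpose sketch above declares its tile as 64x65.

// inside a kernel body:
__shared__ float tile[64][65];   // 64 data columns + 1 padding column
// fixing a column and varying the row is a stride-65 access:
// gcd(65, 32) == 1, so the 32 threads of a warp hit 32 different banks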


TA_UTILS.CPP

DO NOT DELETE THIS CODE!

● Included in the UNIX version of this set

● Should minimize lag or infinite waits on GPU function calls.

● Please leave these functions in the code if you are using Titan/Haru/Maki

● Namespace TA_Utilities