Graphics Processing Unit
TU/e 5kk73
Zhenyu Ye, Henk Corporaal
2013
A Typical GPU Card
GPU Is In Your Pocket Too
Source: http://www.chipworks.com/blog/recentteardowns/2012/09/21/apple-iphone-5-the-a6-application-processor/
GPU Architecture
NVIDIA Fermi, 512 Processing Elements (PEs)
Latest Architecture
Kepler Architecture (GTX 680)
ref: http://www.nvidia.com/object/nvidia-kepler.html
Transistor Count
• Manycore & SoC: 3~5 billion
• GPU: 7 billion
• FPGA: 7 billion
ref: http://en.wikipedia.org/wiki/Transistor_count
What Can 7bn Transistors Do?
Render triangles. Billions of triangles per second!
ref: "How GPUs Work", http://dx.doi.org/10.1109/MC.2007.59
And Scientific Computing
TOP500 supercomputer list in June 2013. (http://www.top500.org)
The GPU-CPU Gap (by NVIDIA)
ref: Tesla GPU Computing Brochure
The GPU-CPU Gap (by Intel)
ref: "Debunking the 100X GPU vs. CPU myth", http://dx.doi.org/10.1145/1815961.1816021
In This Lecture, We Will Find Out:
• What is the architecture of GPUs?
• How to program GPUs?
Let's Start with Examples
Don't worry, we will start from C and RISC!
Let's Start with C and RISC

```c
int A[2][4];
for(i=0; i<2; i++){
  for(j=0; j<4; j++){
    A[i][j]++;
  }
}
```

Assembly code of the inner loop:

```asm
lw   r0, 4(r1)
addi r0, r0, 1
sw   r0, 4(r1)
```

Programmer's view of RISC.
Most CPUs Have Vector SIMD Units
Programmer's view of a vector SIMD, e.g. SSE.
Let's Program the Vector SIMD

The scalar version, with its `lw / addi / sw` inner loop:

```c
int A[2][4];
for(i=0; i<2; i++){
  for(j=0; j<4; j++){
    A[i][j]++;
  }
}
```

Unroll the inner loop into one vector operation:

```c
int A[2][4];
for(i=0; i<2; i++){
  movups xmm0, [ &A[i][0] ] // load
  addps  xmm0, xmm1         // add 1
  movups [ &A[i][0] ], xmm0 // store
}
```

Looks like the previous example, but the SSE instructions execute on 4 ALUs.
How Do Vector Programs Run?

```c
int A[2][4];
for(i=0; i<2; i++){
  movups xmm0, [ &A[i][0] ] // load
  addps  xmm0, xmm1         // add 1
  movups [ &A[i][0] ], xmm0 // store
}
```
CUDA Programmer's View of GPUs
A GPU contains multiple SIMD Units.
CUDA Programmer's View of GPUs
A GPU contains multiple SIMD Units. All of them can access global memory.
What Are the Differences? (SSE vs. GPU)

Let's start with two important differences:
1. GPUs use threads instead of vectors
2. GPUs have "shared memory" spaces
Thread Hierarchy in CUDA
Grid contains Thread Blocks
Thread Block contains Threads
Let's Start Again from C

```c
int A[2][4];
for(i=0; i<2; i++){
  for(j=0; j<4; j++){
    A[i][j]++;
  }
}
```

Convert into CUDA:

```c
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);  // define threads; all threads run the same kernel

__global__ kernelF(A){
  i = blockIdx.x;             // each thread block has its id
  j = threadIdx.x;            // each thread has its id
  A[i][j]++;                  // each thread has a different i and j
}
```
What Is the Thread Hierarchy?

```c
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);  // define threads; all threads run the same kernel

__global__ kernelF(A){
  i = blockIdx.x;             // each thread block has its id
  j = threadIdx.x;            // each thread has its id
  A[i][j]++;                  // each thread has a different i and j
}
```

Thread 3 of block 1 operates on element A[1][3].
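The launch above can be emulated serially: every (blockIdx, threadIdx) pair runs the same kernel body. The `launch` helper and loop structure are illustrative, not CUDA API; a real GPU runs these bodies in parallel.

```cpp
#include <vector>

// The kernel body: each "thread" picks its element from its block/thread id.
void kernelF(std::vector<std::vector<int>>& A, int blockIdxX, int threadIdxX) {
  int i = blockIdxX;   // each thread block has its id
  int j = threadIdxX;  // each thread has its id
  A[i][j]++;           // each thread touches a different element
}

// Serial stand-in for kernelF<<<(2,1),(4,1)>>>(A)
std::vector<std::vector<int>> launch(std::vector<std::vector<int>> A) {
  for (int b = 0; b < 2; b++)    // grid: 2 thread blocks
    for (int t = 0; t < 4; t++)  // block: 4 threads each
      kernelF(A, b, t);
  return A;
}
```

Thread 3 of block 1 maps to A[1][3], exactly as the slide states.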
How Are Threads Scheduled?
Blocks Are Dynamically Scheduled
How Are Threads Executed?

```c
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);

__global__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}
```

The kernel body compiles to PTX like:

```asm
mov.u32 %r0, %ctaid.x        // r0 = i = blockIdx.x
mov.u32 %r1, %ntid.x         // r1 = "threads-per-block"
mov.u32 %r2, %tid.x          // r2 = j = threadIdx.x
mad.u32 %r3, %r0, %r1, %r2   // r3 = i * "threads-per-block" + j
ld.global.s32 %r4, [%r3]     // r4 = A[i][j]
add.s32 %r4, %r4, 1          // r4 = r4 + 1
st.global.s32 [%r3], %r4     // A[i][j] = r4
```
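The `mad.u32` line computes a flat element index from the block and thread ids. The same arithmetic as a one-line helper (the name `flatIndex` is mine):

```cpp
// r3 = blockIdx.x * "threads-per-block" + threadIdx.x,
// i.e. the mad.u32 (multiply-add) in the PTX above.
int flatIndex(int blockIdxX, int threadsPerBlock, int threadIdxX) {
  return blockIdxX * threadsPerBlock + threadIdxX;
}
```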
Utilizing Memory Hierarchy
Example: Average Filters

Average over a 3x3 window for a 16x16 array.

```c
kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
  i = threadIdx.y;
  j = threadIdx.x;
  tmp = ( A[i-1][j-1] + A[i-1][j]
          ...
        + A[i+1][j+1] ) / 9;
  A[i][j] = tmp;
}
```
Utilizing the Shared Memory

Average over a 3x3 window for a 16x16 array.

```c
kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
  __shared__ int smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];  // load to smem
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ...
            + smem[i+1][j+1] ) / 9;
}
```
Utilizing the Shared Memory

```c
kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
  __shared__ int smem[16][16];  // allocate shared mem
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];         // load to smem
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ...
            + smem[i+1][j+1] ) / 9;
}
```
However, the Program Is Incorrect

```c
kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
  __shared__ int smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];  // load to smem
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ...
            + smem[i+1][j+1] ) / 9;
}
```
Let's See What's Wrong

```c
kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
  __shared__ int smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];  // load to smem
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ...
            + smem[i+1][j+1] ) / 9;
}
```

Assume 256 threads are scheduled on 8 PEs.
Before the load instruction.
Let's See What's Wrong

```c
kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
  __shared__ int smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];  // load to smem
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ...
            + smem[i+1][j+1] ) / 9;
}
```

Assume 256 threads are scheduled on 8 PEs.
Some threads finish the load earlier than others.
Each thread starts the window operation as soon as it has loaded its own data element.
Let's See What's Wrong

```c
kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
  __shared__ int smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];  // load to smem
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ...
            + smem[i+1][j+1] ) / 9;
}
```

Assume 256 threads are scheduled on 8 PEs.
Some threads finish the load earlier than others.
Each thread starts the window operation as soon as it has loaded its own data element.
Some elements in the window are not yet loaded by other threads. Error!
How To Solve It?

```c
kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
  __shared__ int smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];  // load to smem
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ...
            + smem[i+1][j+1] ) / 9;
}
```

Assume 256 threads are scheduled on 8 PEs.
Some threads finish the load earlier than others.
Use a "SYNC" barrier!

```c
kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
  __shared__ int smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];  // load to smem
  __syncthreads();       // threads wait at barrier
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ...
            + smem[i+1][j+1] ) / 9;
}
```

Assume 256 threads are scheduled on 8 PEs.
Some threads finish the load earlier than others.
Use a "SYNC" barrier!

```c
kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
  __shared__ int smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];  // load to smem
  __syncthreads();       // threads wait at barrier
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ...
            + smem[i+1][j+1] ) / 9;
}
```

Assume 256 threads are scheduled on 8 PEs.
Some threads finish the load earlier than others.
Wait until all threads hit the barrier.
Use a "SYNC" barrier!

```c
kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
  __shared__ int smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];  // load to smem
  __syncthreads();       // threads wait at barrier
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ...
            + smem[i+1][j+1] ) / 9;
}
```

Assume 256 threads are scheduled on 8 PEs.
All elements in the window are loaded when each thread starts averaging.
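The barrier's effect can be sketched serially: phase 1 performs every load into smem, and only after all loads complete (the barrier point) does phase 2 read the 3x3 window. The `averageFilter` helper is mine and handles interior elements only, since the slide kernel omits boundary handling.

```cpp
#include <vector>

// Serial emulation of "load all, sync, then compute".
std::vector<std::vector<int>> averageFilter(
    const std::vector<std::vector<int>>& A) {
  int n = (int)A.size();
  std::vector<std::vector<int>> smem = A;  // phase 1: ALL loads finish first
  // ---- a real GPU kernel would place __syncthreads() here ----
  std::vector<std::vector<int>> out = A;
  for (int i = 1; i + 1 < n; i++)          // phase 2: 3x3 window average
    for (int j = 1; j + 1 < n; j++) {
      int sum = 0;
      for (int di = -1; di <= 1; di++)
        for (int dj = -1; dj <= 1; dj++)
          sum += smem[i + di][j + dj];
      out[i][j] = sum / 9;
    }
  return out;
}
```

Because phase 2 only starts after phase 1 is entirely done, no window read can see an unloaded element, which is exactly what the barrier guarantees.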
Review What We Have Learned
1. Single Instruction Multiple Thread (SIMT)
2. Shared memory

Q: What are the pros and cons of explicitly managed memory?
Q: What are the fundamental differences between the SIMT and vector SIMD programming models?
Take the Same Example Again
Average over a 3x3 window for a 16x16 array.

Assume vector SIMD and SIMT both have shared memory. What is the difference?
Vector SIMD vs. SIMT

SIMT (CUDA):

```c
kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
  __shared__ int smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];  // load to smem
  __syncthreads();       // threads wait at barrier
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ...
            + smem[i+1][j+1] ) / 9;
}
```

Vector SIMD:

```c
int A[16][16];            // global memory
__shared__ int B[16][16]; // shared mem
for(i=0; i<16; i++){
  for(j=0; j<16; j+=4){
    movups xmm0, [ &A[i][j] ]   // load
    movups [ &B[i][j] ], xmm0   // store to shared mem
  }
}
for(i=0; i<16; i++){
  for(j=0; j<16; j+=4){
    addps xmm1, [ &B[i-1][j-1] ]
    addps xmm1, [ &B[i-1][j] ]
    ...
    divps xmm1, 9
  }
}
for(i=0; i<16; i++){
  for(j=0; j<16; j+=4){
    movups [ &A[i][j] ], xmm1   // store result
  }
}
```
Vector SIMD vs. SIMT
(same code as on the previous slide)

Vector SIMD:
• Programmers need to know there are 4 PEs.
• Each instruction is executed by all PEs in lock step.

SIMT:
• The number of PEs in HW is transparent to programmers.
• Programmers give up execution ordering to HW.
Review What We Have Learned
Programmers convert data level parallelism (DLP) into thread level parallelism (TLP).
HW Groups Threads Into Warps
Example: 32 threads per warp.
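With 32 threads per warp, a thread's warp and lane within the warp follow from integer division and remainder; `warpId`/`laneId` are illustrative names for this grouping rule:

```cpp
// Consecutive thread ids are packed into warps of 32.
constexpr int kWarpSize = 32;

int warpId(int tid) { return tid / kWarpSize; }  // which warp the thread joins
int laneId(int tid) { return tid % kWarpSize; }  // position within that warp
```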
Why Make Things Complicated?
Remember the hazards in MIPS?
• Structural hazards
• Data hazards
• Control hazards
More About How GPUs Work
Let's start with an example.
Example of Implementation
Note: NVIDIA may use a more complicated implementation.
Example of Register Allocation

Assumptions:
• the register file has 32 lanes
• each warp has 32 threads
• each thread uses 8 registers
Example

Program Address: Inst
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Assume warp 0 and warp 1 are scheduled for execution.
Read Src Op

Program Address: Inst
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Read source operands:
r1 for warp 0
r4 for warp 1
Buffer Src Op

Program Address: Inst
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Push ops to the operand collector:
r1 for warp 0
r4 for warp 1
Read Src Op

Program Address: Inst
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Read source operands:
r2 for warp 0
r5 for warp 1
Buffer Src Op

Program Address: Inst
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Push ops to the operand collector:
r2 for warp 0
r5 for warp 1
Execute

Program Address: Inst
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Compute the first 16 threads in the warp.
Execute

Program Address: Inst
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Compute the last 16 threads in the warp.
Write Back

Program Address: Inst
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Write back:
r0 for warp 0
r3 for warp 1
Recap
What we have learned so far:
• Pros and cons of massive threading
• How threads are executed on GPUs
What Are the Possible Hazards?
Three types of hazards we have learned earlier:
• Structural hazards
• Data hazards
• Control hazards
Structural Hazards
In practice, structural hazards need to be handled dynamically.
Data Hazards
In practice, we have much longer pipeline!
Control Hazards
We cannot travel back in time; what should we do?
How Do GPUs Solve Hazards?
However, GPUs (and SIMD machines in general) face an additional challenge: branches!
Let's Start Again from a CPU
Let's start from a MIPS.
A Naive SIMD Processor
Handling Branch in GPU
• Threads within a warp are free to branch.

```c
if( $r17 > $r19 ){
  $r16 = $r20 + $r31
} else {
  $r16 = $r21 - $r32
}
$r18 = $r15 + $r16
```

The assembly code is disassembled from the CUDA binary (cubin) using "decuda".
If No Branch Divergence in a Warp
• If threads within a warp take the same path, the other path will not be executed.
Branch Divergence within a Warp
• If threads within a warp diverge, both paths have to be executed.
• Masks are set to filter out threads not executing on the current path.
Dynamic Warp Formation
Example: merge two divergent warps into a new warp if possible.

Without dynamic warp formation vs. with dynamic warp formation:
• Create warp 3 from warps 0 and 1
• Create warp 4 from warps 0 and 1
ref: Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware. http://dx.doi.org/10.1145/1543753.1543756
Performance Modeling
What is the bottleneck?
Roofline Model
Identify the performance bottleneck: computation bound vs. bandwidth bound.
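The roofline bound can be written as a single min(): attainable throughput is the lesser of peak compute and operational intensity times memory bandwidth. The function name and any numbers passed to it are illustrative, not a specific GPU's specifications:

```cpp
#include <algorithm>

// Roofline model: a kernel with operational intensity opIntensity
// (FLOP per byte moved) is either bandwidth bound (left of the ridge)
// or computation bound (right of the ridge).
double attainableGflops(double peakGflops, double bandwidthGBs,
                        double opIntensity) {
  return std::min(peakGflops, opIntensity * bandwidthGBs);
}
```

For example, at 2 FLOP/byte a hypothetical 1000 GFLOP/s, 100 GB/s device is bandwidth bound, while at 50 FLOP/byte it hits its compute peak.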
Optimization Is Key for Attainable Gflops/s
Computation, Bandwidth, Latency
Illustrating three bottlenecks in the Roofline model.
Program Optimization
If I know the bottleneck, how do I optimize?

Optimization techniques for NVIDIA GPUs:
"CUDA C Best Practices Guide"
http://docs.nvidia.com/cuda/
Recap: What We Have Learned
● GPU architecture (what is the difference?)
● GPU programming (why threads?)
● Performance analysis (bottlenecks)
Further Reading
GPU architecture and programming:
• NVIDIA Tesla: A unified graphics and computing architecture, in IEEE Micro 2008. (http://dx.doi.org/10.1109/MM.2008.31)
• Scalable Parallel Programming with CUDA, in ACM Queue 2008. (http://dx.doi.org/10.1145/1365490.1365500)
• Understanding throughput-oriented architectures, in Communications of the ACM 2010. (http://dx.doi.org/10.1145/1839676.1839694)
Recommended Reading
Performance modeling:
• Roofline: an insightful visual performance model for multicore architectures, in Communications of the ACM 2009. (http://dx.doi.org/10.1145/1498765.1498785)
Optional (if you are interested)
How branches are executed in GPUs:
• Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware, in ACM Transactions on Architecture and Code Optimization 2009. (http://dx.doi.org/10.1145/1543753.1543756)
Backup Slides
Common Optimization Techniques

Optimizations on memory latency tolerance:
• Reduce register pressure
• Reduce shared memory pressure

Optimizations on memory bandwidth:
• Global memory coalescing
• Avoid shared memory bank conflicts
• Group byte accesses
• Avoid partition camping

Optimizations on computation efficiency:
• Mul/Add balancing
• Increase floating point proportion

Optimizations on operational intensity:
• Use tiled algorithms
• Tune thread granularity
Multiple Levels of Memory Hierarchy

| Name     | Cache? | Latency (cycle)      | Read-only? |
|----------|--------|----------------------|------------|
| Global   | L1/L2  | 200~400 (cache miss) | R/W        |
| Shared   | No     | 1~3                  | R/W        |
| Constant | Yes    | 1~3                  | Read-only  |
| Texture  | Yes    | ~100                 | Read-only  |
| Local    | L1/L2  | 200~400 (cache miss) | R/W        |
Shared Mem Contains Multiple Banks
Compute Capability
Need arch info to perform optimization.
ref: NVIDIA, "CUDA C Programming Guide", (link)
Shared Memory (compute capability 2.x)

Without bank conflict:

With bank conflict:
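On compute capability 2.x devices, shared memory has 32 banks and successive 32-bit words map to successive banks. A sketch (helper names are mine) that flags a conflict when two accesses touch different words in the same bank; accessing the same word is a broadcast, not a conflict:

```cpp
#include <map>
#include <set>
#include <vector>

// Successive 4-byte words map to successive banks among 32.
int bankOf(int byteAddr) { return (byteAddr / 4) % 32; }

// A conflict occurs when two threads touch *different* words in one bank;
// such accesses must be serialized into multiple passes.
bool hasBankConflict(const std::vector<int>& byteAddrs) {
  std::map<int, std::set<int>> wordsPerBank;
  for (int a : byteAddrs) wordsPerBank[bankOf(a)].insert(a / 4);
  for (const auto& kv : wordsPerBank)
    if (kv.second.size() > 1) return true;  // bank reused for distinct words
  return false;
}
```

Stride-1 word accesses hit 32 distinct banks (conflict-free); a 32-word stride lands every access in bank 0 (worst case).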
Common Optimization Techniques

Optimizations on memory latency tolerance:
• Reduce register pressure
• Reduce shared memory pressure

Optimizations on memory bandwidth:
• Global memory alignment and coalescing
• Avoid shared memory bank conflicts
• Group byte accesses
• Avoid partition camping

Optimizations on computation efficiency:
• Mul/Add balancing
• Increase floating point proportion

Optimizations on operational intensity:
• Use tiled algorithms
• Tune thread granularity
Global Memory in Off-Chip DRAM
Address space is interleaved among multiple channels.
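Channel interleaving can be sketched as round-robin over fixed-size chunks of the address space. The granularity and channel count in any call are illustrative, not a specific memory controller's values:

```cpp
// Consecutive chunks of `granularity` bytes rotate across `numChannels`
// DRAM channels, so a contiguous stream spreads load over all channels.
int channelOf(int byteAddr, int granularity, int numChannels) {
  return (byteAddr / granularity) % numChannels;
}
```

A strided access pattern whose stride is a multiple of granularity times numChannels would hit a single channel and forfeit this bandwidth.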