Fall 2013, Innovation Lab Course: CUDA High-Performance Parallel Programming
Lec 2 GPU Hardware Architecture
Tonghua Su, School of Software
Harbin Institute of Technology
Tonghua Su, School of Software, Harbin Institute of Technology, China
Outline
1. Top-level View
2. Kepler Architecture
3. Fermi Architecture
Top-level View
CPU/GPU architecture simplified
[Figure labels: DDR3 (CPU memory), GDDR5 (GPU memory)]
Top-level View
CPU/GPU architecture—northbridge
Top-level View
Multiple CPUs (SMP configuration)
Top-level View
Multiple CPUs (NUMA)
Top-level View
Multi-CPU (NUMA configuration), multiple buses
Top-level View
Multi-CPU with integrated PCI Express
Top-level View
Integrated GPU
Top-level View
Integrated GPU with discrete GPU(s)
Top-level View
GPUs in multiple slots
Top-level View
Multi-GPU board
Kepler
There are multiple products in the latest Kepler generation.
Consumer graphics cards (GeForce):
    GTX 650 Ti: 768 cores, 1/2 GB (£100/130)
    GTX 660 Ti: 1344 cores, 2 GB (£240)
    GTX 680: 1536 cores, 2/4 GB (£360/440)
    GTX 690: 2×1536 cores, 2×2 GB (£800)
    GTX 780: 2304 cores, 3 GB
    GTX TITAN: 2688 cores, 6 GB
HPC cards (Tesla):
    K10 module: 2×1536 cores, 2×4 GB
    K20 card: 2496 cores, 5 GB
    K20X module: 2688 cores, 6 GB
Kepler
The building block is a “streaming multiprocessor” (SMX):
    192 cores and 64k registers
    64 KB of shared memory / L1 cache
    8 KB cache for constants
    48 KB texture cache for read-only arrays
    up to 2K threads per SMX
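As a rough illustration of what these per-SMX limits imply, the sketch below splits the register file evenly among the threads resident on one SMX. It assumes "64k registers" means 65,536 registers and "2K threads" means 2,048 threads; the function name `registers_per_thread` is hypothetical and not part of any CUDA API.

```python
def registers_per_thread(registers_per_sm: int, resident_threads: int) -> int:
    """Registers available to each thread if the register file is split evenly."""
    return registers_per_sm // resident_threads


KEPLER_SMX_REGISTERS = 64 * 1024    # assumption: "64k" read as 65,536
KEPLER_SMX_MAX_THREADS = 2 * 1024   # assumption: "2K" read as 2,048

# Register budget per thread when the SMX holds its maximum number of threads
print(registers_per_thread(KEPLER_SMX_REGISTERS, KEPLER_SMX_MAX_THREADS))  # prints 32
```

Under these assumptions, a kernel that uses more than 32 registers per thread cannot keep all 2K threads resident on one SMX at the same time.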
Different chips have different numbers of these SMXs:

product       SMXs   bandwidth   memory   power
GTX 650 Ti    4      86 GB/s     1/2 GB   110 W
GTX 680       8      190 GB/s    2/4 GB   195 W
K10 (2×)      8      160 GB/s    4 GB     110 W
K20X          14     250 GB/s    6 GB     235 W
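The SMX counts can be cross-checked against the core counts on the product slide, since each SMX has 192 cores. The following is a minimal sketch of that arithmetic; the dictionary `smx_counts` and the helper `total_cores` are hypothetical names, and the per-product SMX counts are read off the flattened table (4, 8, 8, 14), with the K10 value taken as per GPU.

```python
CORES_PER_SMX = 192  # from the SMX description on the previous slide

# SMX counts as read from the table (K10 is a dual-GPU module, so per GPU)
smx_counts = {"GTX 650 Ti": 4, "GTX 680": 8, "K10 (per GPU)": 8, "K20X": 14}


def total_cores(smx_count: int, cores_per_smx: int = CORES_PER_SMX) -> int:
    """Total CUDA cores on one GPU chip with the given number of SMXs."""
    return smx_count * cores_per_smx


for product, smxs in smx_counts.items():
    print(f"{product}: {smxs} SMXs -> {total_cores(smxs)} cores")
```

The results (768, 1536, 1536, 2688) agree with the core counts listed for these products. By the same arithmetic, the 2496 cores of the K20 card correspond to 13 SMXs.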
Kepler
[Figure: Kepler GPU block diagram. Multiple SMXs share one L2 cache; each SMX has its own L1 cache / shared memory.]
Fermi
The older Fermi GPU has an SM (“streaming multiprocessor”):
    32 cores and 32k registers
    64 KB of shared memory / L1 cache
    8 KB cache for constants
    up to 1536 threads per SM
Different chips have different numbers of these SMs:

product       SMs    bandwidth   memory
GTX 560       14     130 GB/s    1/2 GB
GTX 580       16     190 GB/s    1.5 GB
M2050/2070    14     140 GB/s    3/6 GB
M2075/2090    16     140 GB/s    3/6 GB
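One way to compare the two generations is the number of resident threads each core has to hide latency with. The sketch below uses only the per-multiprocessor figures quoted on the Fermi and Kepler slides; it assumes "2K threads" means 2,048, and `threads_per_core` is a hypothetical helper name.

```python
def threads_per_core(max_threads_per_sm: int, cores_per_sm: int) -> float:
    """Maximum resident threads per core on one multiprocessor."""
    return max_threads_per_sm / cores_per_sm


fermi = threads_per_core(1536, 32)     # Fermi SM: 32 cores, up to 1536 threads
kepler = threads_per_core(2048, 192)   # Kepler SMX: 192 cores, "2K" read as 2,048

print(f"Fermi SM:   {fermi:.1f} resident threads per core")
print(f"Kepler SMX: {kepler:.1f} resident threads per core")

# Total cores for the two Fermi SM counts in the table (14 and 16 SMs)
for sms in (14, 16):
    print(f"{sms} SMs -> {sms * 32} cores")
```

On these figures a Fermi SM can hold 48 threads per core, while a Kepler SMX holds roughly 10.7, so Kepler has far more cores per multiprocessor but fewer resident threads behind each one.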
Fermi
[Figure: Fermi GPU block diagram. Multiple SMs share one L2 cache; each SM contains 32 cores (C) and its own L1 cache / shared memory.]