
Fall 2013 Innovation Lab Course: CUDA High-Performance Parallel Programming

Lec 2 GPU Hardware Architecture

Tonghua Su

School of Software

Harbin Institute of Technology


Outline

1. Top-level View
2. Kepler Architecture
3. Fermi Architecture




Top-level View

CPU/GPU architecture simplified

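[Figure: simplified CPU/GPU architecture. The CPU is attached to DDR3 system memory and the GPU to GDDR5 device memory.]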




Top-level View

CPU/GPU architecture: northbridge




Top-level View

Multiple CPUs (SMP configuration)




Top-level View

Multiple CPUs (NUMA)




Top-level View

Multi-CPU (NUMA configuration), multiple buses




Top-level View

Multi-CPU with integrated PCI Express




Top-level View

Integrated GPU




Top-level View

Integrated GPU with discrete GPU(s)




Top-level View

GPUs in multiple slots




Top-level View

Multi-GPU board




Kepler

There are multiple products in the latest Kepler generation.

Consumer graphics cards (GeForce):
  GTX 650 Ti: 768 cores, 1/2 GB (£100/130)
  GTX 660 Ti: 1344 cores, 2 GB (£240)
  GTX 680: 1536 cores, 2/4 GB (£360/440)
  GTX 690: 2×1536 cores, 2×2 GB (£800)
  GTX 780: 2304 cores, 3 GB
  GTX TITAN: 2688 cores, 6 GB

HPC cards (Tesla):
  K10 module: 2×1536 cores, 2×4 GB
  K20 card: 2496 cores, 5 GB
  K20X module: 2688 cores, 6 GB




Kepler

The building block is a "streaming multiprocessor" (SMX):
  192 cores and 64K registers
  64 KB of shared memory / L1 cache
  8 KB cache for constants
  48 KB texture cache for read-only arrays
  up to 2K threads per SMX
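To get a feel for what these per-SMX limits mean for a single thread, the sketch below divides the register file evenly across a fully populated SMX. It is only an illustration: it assumes "64k registers" means 65,536 registers and "2K threads" means 2,048 threads, and the function names are made up for this example.

```python
# Per-SMX figures quoted on the slide (Kepler).
REGISTERS_PER_SMX = 64 * 1024    # assumption: "64k" = 65,536 registers
MAX_THREADS_PER_SMX = 2 * 1024   # assumption: "2K" = 2,048 threads
CORES_PER_SMX = 192


def registers_per_thread(registers, threads):
    """Registers each thread gets if all resident threads share evenly."""
    return registers // threads


def threads_per_core(threads, cores):
    """Average number of resident threads per core at the thread limit."""
    return threads / cores


print(registers_per_thread(REGISTERS_PER_SMX, MAX_THREADS_PER_SMX))     # 32
print(round(threads_per_core(MAX_THREADS_PER_SMX, CORES_PER_SMX), 2))   # 10.67
```

Under these assumptions, a kernel that needs more than 32 registers per thread cannot keep the SMX at its full 2K-thread limit.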

Different chips have different numbers of these SMXs:

product       SMXs   bandwidth   memory   power
GTX 650 Ti    4      86 GB/s     1/2 GB   110 W
GTX 680       8      190 GB/s    2/4 GB   195 W
K10 (2×)      8      160 GB/s    4 GB     110 W
K20X          14     250 GB/s    6 GB     235 W
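The SMX counts in this table can be cross-checked against the core counts in the product list, since each SMX has 192 cores. The short sketch below does that arithmetic. The SMX counts (4, 8, 8, 14) are read from the table, and treating the K10 as two GPUs per board follows its "(2×)" annotation; the names used are illustrative only.

```python
CORES_PER_SMX = 192  # from the SMX description on the previous slide

# (product, SMXs per GPU, GPUs per board), as read from the table
KEPLER_PRODUCTS = [
    ("GTX 650 Ti", 4, 1),
    ("GTX 680", 8, 1),
    ("K10", 8, 2),
    ("K20X", 14, 1),
]


def cores_per_gpu(smx_count, cores_per_smx=CORES_PER_SMX):
    """Total cores on one GPU chip given its SMX count."""
    return smx_count * cores_per_smx


for name, smx_count, gpus in KEPLER_PRODUCTS:
    print(f"{name}: {gpus} x {cores_per_gpu(smx_count)} cores")
```

The results (768, 1536, 2×1536, and 2688 cores) agree with the figures quoted for the same products in the product list.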




Kepler

[Figure: Kepler GPU block diagram. Several SMX units share one L2 cache; the inset shows a single SMX with its L1 cache / shared memory.]




Fermi

The older Fermi GPU has the SM ("streaming multiprocessor"):
  32 cores and 32K registers
  64 KB of shared memory / L1 cache
  8 KB cache for constants
  up to 1536 threads per SM

Different chips have different numbers of these SMs:

product       SMs   bandwidth   memory
GTX 560       14    130 GB/s    1/2 GB
GTX 580       16    190 GB/s    1.5 GB
M2050/2070    14    140 GB/s    3/6 GB
M2075/2090    16    140 GB/s    3/6 GB
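For comparison with Kepler, the sketch below works out the core counts implied by 14 and 16 Fermi SMs, and the registers available per thread at the thread limit for both generations. It is a rough illustration that assumes "32k" and "64k" registers mean 32,768 and 65,536, and that "2K threads" means 2,048; the dictionary layout and function names are hypothetical.

```python
# Per-multiprocessor figures quoted on the slides.
FERMI_SM = {"cores": 32, "registers": 32 * 1024, "max_threads": 1536}
KEPLER_SMX = {"cores": 192, "registers": 64 * 1024, "max_threads": 2048}


def total_cores(unit, count):
    """Cores on a chip built from `count` multiprocessors of this type."""
    return unit["cores"] * count


def registers_per_thread(unit):
    """Whole registers per thread when the multiprocessor is fully populated."""
    return unit["registers"] // unit["max_threads"]


print(total_cores(FERMI_SM, 14))         # 448
print(total_cores(FERMI_SM, 16))         # 512
print(registers_per_thread(FERMI_SM))    # 21
print(registers_per_thread(KEPLER_SMX))  # 32
```

Under these assumptions, a Kepler SMX offers six times the cores of a Fermi SM but only twice the registers, which still works out to more registers per thread at the thread limit (32 versus 21).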




Fermi

[Figure: Fermi GPU block diagram. Several SMs share one L2 cache; the inset shows a single SM with its 32 cores (C) and its L1 cache / shared memory.]
