lec 2 gpu hardware architecturejwc.hit.edu.cn/.../4a6489df-c6d3-4a14-8376-1b9b2f814559.pdflec 2 3...

2013秋-创新实验课 CUDA高性能并行程序设计

Lec 2 GPU Hardware Architecture

Tonghua SuSchool of Software

Harbin Institute of Technology

Lec 2 2

Tonghua Su, School of Software, Harbin Institute of Technology, China

CUDA高性能并行程序设计

Outline

Top-level View

Kepler Architecture

Fermi Architecture

Lec 2 3

Top-level View

CPU/GPU architecture simplified

DDR3 GDDR5

Lec 2 4

Top-level View

CPU/GPU architecture—northbridge

Lec 2 5

Top-level View

Multiple CPUs (SMP configuration)

Lec 2 6

Top-level View

Multiple CPUs (NUMA)

Lec 2 7

Top-level View

Multi-CPU (NUMA configuration), multiple buses

Lec 2 8

Top-level View

Multi-CPU with integrated PCI Express

Lec 2 9

Top-level View

Integrated GPU

Lec 2 10

Top-level View

Integrated GPU with discrete GPU(s)

Lec 2 11

Top-level View

GPUs in multiple slots

Lec 2 12

Top-level View

Multi-GPU board

Lec 2 13

Kepler

There are multiple products in the latest Kepler generation Consumer graphics cards (GeForce): GTX650 Ti: 768 cores, 1/2GB (£100/130)

GTX660 Ti: 1344 cores, 2GB (£240)

GTX680: 1536 cores, 2/4GB (£360/440)

GTX690: 2×1536 cores, 2×2GB (£800)

GTX 780: 2304 cores, 3GB

GTX TITAN: 2688 cores, 6GB

HPC cards (Tesla): K10 module: 2×1536 cores, 2×4GB

K20 card: 2496 cores, 5GB

K20X module: 2688 cores, 6GB

Lec 2 14

Kepler

building block is a “streaming multiprocessor” (SMX): 192 cores and 64k registers 64KB of shared memory / L1 cache 8KB cache for constants 48KB texture cache for read-only arrays up to 2K threads per SMX

different chips have different numbers of these SMXs:

product SMXs bandwidth memory powerGTX 650 Ti

GTX 680K10 (2×)

86 GB/s190 GB/s160 GB/s250 GB/s

1/2 GB2/4 GB4 GB6 GB

110W195W110W235W

Lec 2 15

Kepler

Kepler GPU

L2 cache

SMX SMX

SMX SMX SMX SMX

L1 cache /shared memory

Lec 2 16

older Fermi GPU has SM “streaming multiprocessor”: 32 cores and 32k registers 64KB of shared memory / L1 cache 8KB cache for constants up to 1536 threads per SM

different chips have different numbers of these SMs:

product SMs bandwidth memoryGTX 560GTX 580

M2050/2070M2075/2090

14161416

130 GB/s190 GB/s140 GB/s140 GB/s

1/2 GB1.5 GB3/6 GB3/6 GB

Lec 2 17

Fermi GPU

SM SM SM

L2 cache

SM SM SM

SM SM SM SM SM SM SM

C C C C

L1 cache /shared memory

lec 2 gpu hardware architecturejwc.hit.edu.cn/.../4a6489df-c6d3-4a14-8376-1b9b2f814559.pdflec 2 3...

Documents

journal de bord caravane biciletta 4 a 7 juillet ·...

shawn m. lamothe, jun guo, wentao li, tonghua yang, and...

peering into the hermit kingdom · tonghua fusong linjiang...

encore’s holiday concert groups 2016 piece:...

the power of the parallel: using culturally appropriate...

workshop practice ta...

women's leadership development...

lec 07 transforms and quantization ii - university of...

china - salary checks -world wide wage …2010 simping the...

絶対わかる超対称性 -...

p ii ti i ioti i jr hei gainevjlle...

bcsc september 2017 digital newsletter volume 21,...

biotech agora confÉrenceadocia · adocia and tonghua...

molecular cytogenetic analysis of early spontaneous ... ·...

propuesta de acuerdos de la junta general ordinaria de...

the connected...

no to jazda! innowacyjny zut rekordowe barzkowice...

environmental management plan of inspection center ... ·...

evaluation des formations - université lille...

e microsoft...