
Fall 2013, Innovation Lab Course: CUDA High-Performance Parallel Programming

Lec 2 GPU Hardware Architecture

Tonghua Su
School of Software
Harbin Institute of Technology

Outline

1 Top-level View

2 Kepler Architecture

3 Fermi Architecture

Top-level View

CPU/GPU architecture, simplified

[Figure: simplified CPU/GPU block diagram; memory labels: DDR3 (CPU side), GDDR5 (GPU side)]

Top-level View

CPU/GPU architecture: northbridge

Top-level View

Multiple CPUs (SMP configuration)

Top-level View

Multiple CPUs (NUMA)

Top-level View

Multi-CPU (NUMA configuration), multiple buses

Top-level View

Multi-CPU with integrated PCI Express

Top-level View

Integrated GPU

Top-level View

Integrated GPU with discrete GPU(s)

Top-level View

GPUs in multiple slots

Top-level View

Multi-GPU board

Kepler

There are multiple products in the latest Kepler generation.

Consumer graphics cards (GeForce):
  GTX 650 Ti: 768 cores, 1/2 GB (£100/130)
  GTX 660 Ti: 1344 cores, 2 GB (£240)
  GTX 680: 1536 cores, 2/4 GB (£360/440)
  GTX 690: 2×1536 cores, 2×2 GB (£800)
  GTX 780: 2304 cores, 3 GB
  GTX TITAN: 2688 cores, 6 GB

HPC cards (Tesla):
  K10 module: 2×1536 cores, 2×4 GB
  K20 card: 2496 cores, 5 GB
  K20X module: 2688 cores, 6 GB

Kepler

The building block is a “streaming multiprocessor” (SMX):
  192 cores and 64K registers
  64 KB of shared memory / L1 cache
  8 KB cache for constants
  48 KB texture cache for read-only arrays
  up to 2K (2048) threads per SMX

different chips have different numbers of these SMXs:

product       SMXs   bandwidth   memory   power
GTX 650 Ti    4      86 GB/s     1/2 GB   110 W
GTX 680       8      190 GB/s    2/4 GB   195 W
K10 (2×)      8      160 GB/s    4 GB     110 W
K20X          14     250 GB/s    6 GB     235 W
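
The SMX counts in the table can be cross-checked against the core counts in the product list, since each SMX contributes 192 cores. The following is a minimal Python sketch of that arithmetic, not part of the original slides. The names `KEPLER_CHIPS` and `kepler_summary` are hypothetical, "2K threads" is taken to mean 2048, and the K10 row is assumed to describe one of the module's two GPUs.

```python
# Figures copied from the Kepler slides; all names here are illustrative only.
CORES_PER_SMX = 192
MAX_THREADS_PER_SMX = 2048  # "up to 2K threads per SMX", assuming 2K = 2048

# product -> (SMXs per GPU, memory bandwidth in GB/s)
KEPLER_CHIPS = {
    "GTX 650 Ti": (4, 86),
    "GTX 680": (8, 190),
    "K10 (per GPU)": (8, 160),  # assumption: the table row is per GPU
    "K20X": (14, 250),
}


def kepler_summary(smx_count, bandwidth_gbs):
    """Derive per-chip totals from the SMX count and the memory bandwidth."""
    cores = smx_count * CORES_PER_SMX
    return {
        "cores": cores,
        "max_resident_threads": smx_count * MAX_THREADS_PER_SMX,
        "bandwidth_per_core_gbs": bandwidth_gbs / cores,
    }


for name, (smx, bw) in KEPLER_CHIPS.items():
    s = kepler_summary(smx, bw)
    print(f"{name}: {s['cores']} cores, "
          f"{s['max_resident_threads']} resident threads, "
          f"{s['bandwidth_per_core_gbs']:.3f} GB/s per core")
```

The derived core counts (768, 1536, 1536 and 2688) agree with the figures quoted for the GTX 650 Ti, GTX 680, K10 (per GPU) and K20X in the product list.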

Kepler

Kepler GPU

[Figure: Kepler GPU block diagram. Several SMX units share one L2 cache; each SMX has its own L1 cache / shared memory.]

Fermi

The older Fermi GPU is built from a “streaming multiprocessor” (SM):
  32 cores and 32K registers
  64 KB of shared memory / L1 cache
  8 KB cache for constants
  up to 1536 threads per SM

different chips have different numbers of these SMs:

product       SMs   bandwidth   memory
GTX 560       14    130 GB/s    1/2 GB
GTX 580       16    190 GB/s    1.5 GB
M2050/2070    14    140 GB/s    3/6 GB
M2075/2090    16    140 GB/s    3/6 GB
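
The per-multiprocessor figures quoted for Kepler and Fermi imply a register budget per thread when a multiprocessor holds its maximum number of threads. The sketch below, in Python and not part of the original slides, works that out for both generations. The dictionary layout and the function name `per_thread_budget` are hypothetical, and "64k" and "32k" registers are taken to mean 64 × 1024 and 32 × 1024.

```python
# Per-multiprocessor figures copied from the Kepler and Fermi slides.
# Assumption: "k" means 1024. The data layout is illustrative only.
MULTIPROCESSORS = {
    "Kepler SMX": {"cores": 192, "registers": 64 * 1024, "max_threads": 2048},
    "Fermi SM": {"cores": 32, "registers": 32 * 1024, "max_threads": 1536},
}


def per_thread_budget(spec):
    """Return (whole registers per thread, threads per core) at the thread limit."""
    regs_per_thread = spec["registers"] // spec["max_threads"]
    threads_per_core = spec["max_threads"] / spec["cores"]
    return regs_per_thread, threads_per_core


for name, spec in MULTIPROCESSORS.items():
    regs, tpc = per_thread_budget(spec)
    print(f"{name}: {regs} registers per thread, "
          f"{tpc:.1f} threads per core at the thread limit")
```

With these figures, a fully populated Kepler SMX leaves 32 registers per thread, while a fully populated Fermi SM leaves 21. A kernel that needs more registers per thread than that cannot reach the maximum thread count on the multiprocessor.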

Fermi

Fermi GPU

[Figure: Fermi GPU block diagram. Several SMs share one L2 cache; the expanded SM shows its 32 cores (C) and its L1 cache / shared memory.]