gpu technology: past, present, future · past, present, future . 2. 3 5 4 3 2 1 0 2004 2006 2008...

35
Marc Hamilton, Vice President, Solution Architecture and Engineering GPU TECHNOLOGY: PAST, PRESENT, FUTURE

Upload: others

Post on 15-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

Marc Hamilton, Vice President,

Solution Architecture and Engineering

GPU TECHNOLOGY:

PAST, PRESENT, FUTURE

Page 2: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

2

Page 3: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

3

5

4

3

2

1

0

2004 2006 2008 2010 2012 2014

Tera

FLO

PS

GPU

CPU

Page 4: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

4

A Decade Of GPU Computing

- From Scientific Computing To Machine Learning - Mobile Is More Than Just Phones - GPU Architecture & CUDA Roadmap - Grid & The Last Mile of Virtualization

Page 5: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

5

From Scientific Computing To Machine Learning

Page 6: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

6

Giving Drones the Vision to Help Fight Fires

ResQU A Breakthrough in HIV

Research

UNIVERSITY OF ILLINOIS

Early, Accurate Detection of Breast Cancer

HOLOGIC

Page 7: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

7

TSUBAME KFC #1 OF “TOP 15” GREEN

SUPERCOMPUTERS POWERED

BY CUDA GPUS

Page 8: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

8

Branch of Artificial Intelligence

Computers that learn from data

person

car

helmet

motorcycle

bird

frog

person

dog

chair

person

hammer

flower pot

power drill

MACHINE LEARNING

Page 9: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

9

THREE TRENDS CONVERGING

Torrent of Data

2010 2015

Exabyte

s of

unst

ructu

red d

ata

Deep Neural Networks GPU Computing

SOURCE: : IDC

Page 10: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

10

Deep Learning with COTS HPC Systems

A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro

Stanford / NVIDIA • ICML 2013

STANFORD AI LAB

3 GPU-Accelerated Servers

12 GPUs • 18,432 cores

4 kWatts

$33,000

Now You Can Build Google’s

$1M Artificial Brain on the

Cheap

-Wired

1,000 CPU Servers 2,000 CPUs • 16,000

cores

600 kWatts

$5,000,00

0

GOOGLE BRAIN

Page 11: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

11

CUDA FOR MACHINE LEARNING

Prominent Research

Image Detection

Face Recognition

Gesture Recognition

Video Search & Analytics

Speech Recognition & Translation

Recommendation Engines

Indexing & Search

Use Cases Early Adopters

Image Analytics for Creative Cloud

Image Classification

Speech/Image

Recognition

Recommendation

Hadoop

Search Rankings

Page 12: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

12

Mobile - More Than Just Phones

Page 13: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

13

MOBILE

ARCHITECTURE

Maxwell

Kepler

Tesla

Fermi

Tegra 3

Tegra 4

Tegra

K1

GPU

ARCHITECTURE

UNIFIED ARCHITECTURE TEGRA K1 – MOBILE SUPER

CHIP

BREAKTHROUGH EXPERIENCES

TEGRA TK1

Page 14: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

14

192 CUDA cores

326 GFLOPS

VisionWorks SDK

JETSON TK1 DEV KIT 1ST MOBILE SUPERCOMPUTER FOR EMBEDDED SYSTEMS

Page 15: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

15

DIGITAL COCKPIT

EVOLUTION OF COMPUTING IN THE CAR

Tegra 4 Tegra 3 Tegra K1

Virtual Cockpit Autonomous Driving Infotainment

Page 16: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

16

TEGRA TK1 SUPERCOMPUTER FOR DRIVER ASSISTANCE

Optical Flow Histogram Feature Detection

Pedestrian Detection

Blind Spot Monitoring

Lane Departure Warning

Collision Avoidance

Traffic Sign Recognition

Adaptive Cruise Control

Page 17: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

17

COMPUTER VISION ON CUDA

Feature Detection / Tracking ~30 GFLOPS @ 30 Hz

Object Recognition / Tracking ~180 GFLOPS @ 30 Hz

3D Scene Interpretation ~280 GFLOPS @ 30 Hz

Page 18: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

18

GPU Architecture & CUDA Roadmap

Page 19: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

19

FAST PACED CUDA GPU ROADMAP

2012 2014 2008 2010

GFLO

PS p

er

Watt

Kepler

Tesla

Fermi

Maxwell

Higher Perf/Watt

Dynamic Parallelism

FP64

CUDA

32

16

8

4

2

1

0.5

2016 2018

Pascal

Unified Memory Stacked DRAM

NVLINK

Page 20: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

20

BANDWIDTH BOTTLENECKS CPU GPU

PCIe

PCI Express

CPU Memory

GPU Memory

16GB/sec

60GB/sec

288GB/sec

Page 21: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

21

INTRODUCING NVLINK

CPU GPU

PCIe

Differential with embedded clock

PCIe programming model (w/ DMA+)

Unified Memory

Cache coherency in Gen 2.0

5 to 12X PCIe

Page 22: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

22

5X MORE BANDWIDTH

FOR MULTI-GPU SCALING

GPU

PCIe SWITCH

CPU GPU GPU GPU

Page 23: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

23

3D MEMORY

3D Chip-on-Wafer integration

Many X bandwidth

2.5X capacity

4X energy efficiency

0

200

400

600

800

1000

1200

2008 2010 2012 2014 2016

Memory Bandwidth

Page 24: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

24

PASCAL

NVLink

3D Memory

Module

5 to 12X PCIe 3.0

2 to 4X memory BW & size

1/3 size of PCIe card

Power Regulation HBM Stacks

GPU Chip

Page 25: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

25

522M

CUDA-ENABLED GPUS

CUDA EVERWHERE

2.5M 58K 770

CUDA DOWNLOADS ACADEMIC PAPERS UNIVERSITY COURSES

Page 26: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

26

GOALS FOR THE CUDA PLATFORM

• Learn, adopt, & use parallelism with ease Simplicity

• Quickly achieve feature & performance goals Productivity

• Write code that can execute on all targets Portability

• High absolute performance and scalability Performance

Page 27: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

27

UNIFIED MEMORY DRAMATICALLY LOWER DEVELOPER EFFORT

Developer View Today Developer View With Unified Memory

Unified Memory System Memory

GPU Memory

Page 28: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

28

REMOTE DEVELOPMENT TOOLS

Local IDE, remote application

Edit locally, build & run remotely

Automatic sync via ssh

Cross-compilation to ARM

Full debugging & profiling via remote connection

Build

Run

Debug

Profile

Edit sync

Page 29: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

29

EXTENDED (XT) LIBRARY INTERFACES

Automatic Scaling to multiple GPUs per node

cuFFT 2D/3D & cuBLAS level 3

Operate directly on large datasets that reside in CPU memory

2.2 TFLOPS

4.2 TFLOPS

6.0 TFLOPS

7.9 TFLOPS

0

1

2

3

4

5

6

7

8

1 x K10 2 x K10 3 x K10 4 x K10

16K x 16K SGEMM on Tesla K10

developer.nvidia.com/cublasxt

Page 30: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

30

GRID Graphics Accelerated VDI The Original Graphics GPU Returns To The Data Center

Page 31: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

31

NVIDIA Grid GPUs Power Enterprise Virtualization 2.0

Page 32: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

32

IMPORTANCE OF A GPU

MUST HAVE

3D Engineering & Design Apps

PLM & Volume Design

Media Rich Web

INCREASINGLY NICE TO HAVE

Office Productivity Medical Records

COMMERCIAL MARKETS

DESIGNERS

POWER USERS

KNOWLEDGE WORKERS

TASK WORKERS

25M

200M

400M

100M

MARKET

SERVED

TO

DAY

NEW

OPPO

RTU

NIT

Y

Page 33: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

33

Without GPU With GPU

NIGHT AND DAY DIFFERENCE

Page 34: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

34

GRID ACCELERATED GRAPHICS

Page 35: GPU TECHNOLOGY: PAST, PRESENT, FUTURE · PAST, PRESENT, FUTURE . 2. 3 5 4 3 2 1 0 2004 2006 2008 2010 2012 2014 LOPS GPU CPU . 4- ... MACHINE LEARNING . 9 THREE TRENDS CONVERGING

35

Thank You