gpu technology: past, present, future · past, present, future . 2. 3 5 4 3 2 1 0 2004 2006 2008...

Marc Hamilton, Vice President,

Solution Architecture and Engineering

GPU TECHNOLOGY:

PAST, PRESENT, FUTURE

3

5

4

3

2

1

0

2004 2006 2008 2010 2012 2014

Tera

FLO

PS

GPU

CPU

4

A Decade Of GPU Computing

- From Scientific Computing To Machine Learning - Mobile Is More Than Just Phones - GPU Architecture & CUDA Roadmap - Grid & The Last Mile of Virtualization

5

From Scientific Computing To Machine Learning

6

Giving Drones the Vision to Help Fight Fires

ResQU A Breakthrough in HIV

Research

UNIVERSITY OF ILLINOIS

Early, Accurate Detection of Breast Cancer

HOLOGIC

7

TSUBAME KFC #1 OF “TOP 15” GREEN

SUPERCOMPUTERS POWERED

BY CUDA GPUS

8

Branch of Artificial Intelligence

Computers that learn from data

person

car

helmet

motorcycle

bird

frog

person

dog

chair

person

hammer

flower pot

power drill

MACHINE LEARNING

9

THREE TRENDS CONVERGING

Torrent of Data

2010 2015

Exabyte

s of

unst

ructu

red d

ata

Deep Neural Networks GPU Computing

SOURCE: : IDC

10

Deep Learning with COTS HPC Systems

A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro

Stanford / NVIDIA • ICML 2013

STANFORD AI LAB

3 GPU-Accelerated Servers

12 GPUs • 18,432 cores

4 kWatts

$33,000

Now You Can Build Google’s

$1M Artificial Brain on the

Cheap

“

“

-Wired

1,000 CPU Servers 2,000 CPUs • 16,000

cores

600 kWatts

$5,000,00

0

GOOGLE BRAIN

11

CUDA FOR MACHINE LEARNING

Prominent Research

Image Detection

Face Recognition

Gesture Recognition

Video Search & Analytics

Speech Recognition & Translation

Recommendation Engines

Indexing & Search

Use Cases Early Adopters

Image Analytics for Creative Cloud

Image Classification

Speech/Image

Recognition

Recommendation

Hadoop

Search Rankings

12

Mobile - More Than Just Phones

13

MOBILE

ARCHITECTURE

Maxwell

Kepler

Tesla

Fermi

Tegra 3

Tegra 4

Tegra

K1

GPU

ARCHITECTURE

UNIFIED ARCHITECTURE TEGRA K1 – MOBILE SUPER

CHIP

BREAKTHROUGH EXPERIENCES

TEGRA TK1

14

192 CUDA cores

326 GFLOPS

VisionWorks SDK

JETSON TK1 DEV KIT 1ST MOBILE SUPERCOMPUTER FOR EMBEDDED SYSTEMS

15

DIGITAL COCKPIT

EVOLUTION OF COMPUTING IN THE CAR

Tegra 4 Tegra 3 Tegra K1

Virtual Cockpit Autonomous Driving Infotainment

16

TEGRA TK1 SUPERCOMPUTER FOR DRIVER ASSISTANCE

Optical Flow Histogram Feature Detection

Pedestrian Detection

Blind Spot Monitoring

Lane Departure Warning

Collision Avoidance

Traffic Sign Recognition

Adaptive Cruise Control

17

COMPUTER VISION ON CUDA

Feature Detection / Tracking ~30 GFLOPS @ 30 Hz

Object Recognition / Tracking ~180 GFLOPS @ 30 Hz

3D Scene Interpretation ~280 GFLOPS @ 30 Hz

18

GPU Architecture & CUDA Roadmap

19

FAST PACED CUDA GPU ROADMAP

2012 2014 2008 2010

GFLO

PS p

er

Watt

Kepler

Tesla

Fermi

Maxwell

Higher Perf/Watt

Dynamic Parallelism

FP64

CUDA

32

16

8

4

2

1

0.5

2016 2018

Pascal

Unified Memory Stacked DRAM

NVLINK

20

BANDWIDTH BOTTLENECKS CPU GPU

PCIe

PCI Express

CPU Memory

GPU Memory

16GB/sec

60GB/sec

288GB/sec

21

INTRODUCING NVLINK

CPU GPU

PCIe

Differential with embedded clock

PCIe programming model (w/ DMA+)

Unified Memory

Cache coherency in Gen 2.0

5 to 12X PCIe

22

5X MORE BANDWIDTH

FOR MULTI-GPU SCALING

GPU

PCIe SWITCH

CPU GPU GPU GPU

23

3D MEMORY

3D Chip-on-Wafer integration

Many X bandwidth

2.5X capacity

4X energy efficiency

0

200

400

600

800

1000

1200

2008 2010 2012 2014 2016

Memory Bandwidth

24

PASCAL

NVLink

3D Memory

Module

5 to 12X PCIe 3.0

2 to 4X memory BW & size

1/3 size of PCIe card

Power Regulation HBM Stacks

GPU Chip

25

522M

CUDA-ENABLED GPUS

CUDA EVERWHERE

2.5M 58K 770

CUDA DOWNLOADS ACADEMIC PAPERS UNIVERSITY COURSES

26

GOALS FOR THE CUDA PLATFORM

• Learn, adopt, & use parallelism with ease Simplicity

• Quickly achieve feature & performance goals Productivity

• Write code that can execute on all targets Portability

• High absolute performance and scalability Performance

27

UNIFIED MEMORY DRAMATICALLY LOWER DEVELOPER EFFORT

Developer View Today Developer View With Unified Memory

Unified Memory System Memory

GPU Memory

28

REMOTE DEVELOPMENT TOOLS

Local IDE, remote application

Edit locally, build & run remotely

Automatic sync via ssh

Cross-compilation to ARM

Full debugging & profiling via remote connection

Build

Run

Debug

Profile

Edit sync

29

EXTENDED (XT) LIBRARY INTERFACES

Automatic Scaling to multiple GPUs per node

cuFFT 2D/3D & cuBLAS level 3

Operate directly on large datasets that reside in CPU memory

2.2 TFLOPS

4.2 TFLOPS

6.0 TFLOPS

7.9 TFLOPS

0

1

2

3

4

5

6

7

8

1 x K10 2 x K10 3 x K10 4 x K10

16K x 16K SGEMM on Tesla K10

developer.nvidia.com/cublasxt

30

GRID Graphics Accelerated VDI The Original Graphics GPU Returns To The Data Center

31

NVIDIA Grid GPUs Power Enterprise Virtualization 2.0

32

IMPORTANCE OF A GPU

MUST HAVE

3D Engineering & Design Apps

PLM & Volume Design

Media Rich Web

INCREASINGLY NICE TO HAVE

Office Productivity Medical Records

COMMERCIAL MARKETS

DESIGNERS

POWER USERS

KNOWLEDGE WORKERS

TASK WORKERS

25M

200M

400M

100M

MARKET

SERVED

TO

DAY

NEW

OPPO

RTU

NIT

Y

33

Without GPU With GPU

NIGHT AND DAY DIFFERENCE

34

GRID ACCELERATED GRAPHICS

35

Thank You

gpu technology: past, present, future · past, present, future . 2. 3 5 4 3 2 1 0 2004 2006 2008...

Documents