gpu technology: past, present, future · past, present, future . 2. 3 5 4 3 2 1 0 2004 2006 2008...
TRANSCRIPT
Marc Hamilton, Vice President,
Solution Architecture and Engineering
GPU TECHNOLOGY:
PAST, PRESENT, FUTURE
2
3
5
4
3
2
1
0
2004 2006 2008 2010 2012 2014
Tera
FLO
PS
GPU
CPU
4
A Decade Of GPU Computing
- From Scientific Computing To Machine Learning - Mobile Is More Than Just Phones - GPU Architecture & CUDA Roadmap - Grid & The Last Mile of Virtualization
5
From Scientific Computing To Machine Learning
6
Giving Drones the Vision to Help Fight Fires
ResQU A Breakthrough in HIV
Research
UNIVERSITY OF ILLINOIS
Early, Accurate Detection of Breast Cancer
HOLOGIC
7
TSUBAME KFC #1 OF “TOP 15” GREEN
SUPERCOMPUTERS POWERED
BY CUDA GPUS
8
Branch of Artificial Intelligence
Computers that learn from data
person
car
helmet
motorcycle
bird
frog
person
dog
chair
person
hammer
flower pot
power drill
MACHINE LEARNING
9
THREE TRENDS CONVERGING
Torrent of Data
2010 2015
Exabyte
s of
unst
ructu
red d
ata
Deep Neural Networks GPU Computing
SOURCE: : IDC
10
Deep Learning with COTS HPC Systems
A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro
Stanford / NVIDIA • ICML 2013
STANFORD AI LAB
3 GPU-Accelerated Servers
12 GPUs • 18,432 cores
4 kWatts
$33,000
Now You Can Build Google’s
$1M Artificial Brain on the
Cheap
“
“
-Wired
1,000 CPU Servers 2,000 CPUs • 16,000
cores
600 kWatts
$5,000,00
0
GOOGLE BRAIN
11
CUDA FOR MACHINE LEARNING
Prominent Research
Image Detection
Face Recognition
Gesture Recognition
Video Search & Analytics
Speech Recognition & Translation
Recommendation Engines
Indexing & Search
Use Cases Early Adopters
Image Analytics for Creative Cloud
Image Classification
Speech/Image
Recognition
Recommendation
Hadoop
Search Rankings
12
Mobile - More Than Just Phones
13
MOBILE
ARCHITECTURE
Maxwell
Kepler
Tesla
Fermi
Tegra 3
Tegra 4
Tegra
K1
GPU
ARCHITECTURE
UNIFIED ARCHITECTURE TEGRA K1 – MOBILE SUPER
CHIP
BREAKTHROUGH EXPERIENCES
TEGRA TK1
14
192 CUDA cores
326 GFLOPS
VisionWorks SDK
JETSON TK1 DEV KIT 1ST MOBILE SUPERCOMPUTER FOR EMBEDDED SYSTEMS
15
DIGITAL COCKPIT
EVOLUTION OF COMPUTING IN THE CAR
Tegra 4 Tegra 3 Tegra K1
Virtual Cockpit Autonomous Driving Infotainment
16
TEGRA TK1 SUPERCOMPUTER FOR DRIVER ASSISTANCE
Optical Flow Histogram Feature Detection
Pedestrian Detection
Blind Spot Monitoring
Lane Departure Warning
Collision Avoidance
Traffic Sign Recognition
Adaptive Cruise Control
17
COMPUTER VISION ON CUDA
Feature Detection / Tracking ~30 GFLOPS @ 30 Hz
Object Recognition / Tracking ~180 GFLOPS @ 30 Hz
3D Scene Interpretation ~280 GFLOPS @ 30 Hz
18
GPU Architecture & CUDA Roadmap
19
FAST PACED CUDA GPU ROADMAP
2012 2014 2008 2010
GFLO
PS p
er
Watt
Kepler
Tesla
Fermi
Maxwell
Higher Perf/Watt
Dynamic Parallelism
FP64
CUDA
32
16
8
4
2
1
0.5
2016 2018
Pascal
Unified Memory Stacked DRAM
NVLINK
20
BANDWIDTH BOTTLENECKS CPU GPU
PCIe
PCI Express
CPU Memory
GPU Memory
16GB/sec
60GB/sec
288GB/sec
21
INTRODUCING NVLINK
CPU GPU
PCIe
Differential with embedded clock
PCIe programming model (w/ DMA+)
Unified Memory
Cache coherency in Gen 2.0
5 to 12X PCIe
22
5X MORE BANDWIDTH
FOR MULTI-GPU SCALING
GPU
PCIe SWITCH
CPU GPU GPU GPU
23
3D MEMORY
3D Chip-on-Wafer integration
Many X bandwidth
2.5X capacity
4X energy efficiency
0
200
400
600
800
1000
1200
2008 2010 2012 2014 2016
Memory Bandwidth
24
PASCAL
NVLink
3D Memory
Module
5 to 12X PCIe 3.0
2 to 4X memory BW & size
1/3 size of PCIe card
Power Regulation HBM Stacks
GPU Chip
25
522M
CUDA-ENABLED GPUS
CUDA EVERWHERE
2.5M 58K 770
CUDA DOWNLOADS ACADEMIC PAPERS UNIVERSITY COURSES
26
GOALS FOR THE CUDA PLATFORM
• Learn, adopt, & use parallelism with ease Simplicity
• Quickly achieve feature & performance goals Productivity
• Write code that can execute on all targets Portability
• High absolute performance and scalability Performance
27
UNIFIED MEMORY DRAMATICALLY LOWER DEVELOPER EFFORT
Developer View Today Developer View With Unified Memory
Unified Memory System Memory
GPU Memory
28
REMOTE DEVELOPMENT TOOLS
Local IDE, remote application
Edit locally, build & run remotely
Automatic sync via ssh
Cross-compilation to ARM
Full debugging & profiling via remote connection
Build
Run
Debug
Profile
Edit sync
29
EXTENDED (XT) LIBRARY INTERFACES
Automatic Scaling to multiple GPUs per node
cuFFT 2D/3D & cuBLAS level 3
Operate directly on large datasets that reside in CPU memory
2.2 TFLOPS
4.2 TFLOPS
6.0 TFLOPS
7.9 TFLOPS
0
1
2
3
4
5
6
7
8
1 x K10 2 x K10 3 x K10 4 x K10
16K x 16K SGEMM on Tesla K10
developer.nvidia.com/cublasxt
30
GRID Graphics Accelerated VDI The Original Graphics GPU Returns To The Data Center
31
NVIDIA Grid GPUs Power Enterprise Virtualization 2.0
32
IMPORTANCE OF A GPU
MUST HAVE
3D Engineering & Design Apps
PLM & Volume Design
Media Rich Web
INCREASINGLY NICE TO HAVE
Office Productivity Medical Records
COMMERCIAL MARKETS
DESIGNERS
POWER USERS
KNOWLEDGE WORKERS
TASK WORKERS
25M
200M
400M
100M
MARKET
SERVED
TO
DAY
NEW
OPPO
RTU
NIT
Y
33
Without GPU With GPU
NIGHT AND DAY DIFFERENCE
34
GRID ACCELERATED GRAPHICS
35
Thank You