HPC Technologies Driving Advances in Machine Learning

Asaf Wachtel, Sr. Director BD, Mellanox; Bob Keating, Solutions Architect, NVIDIA

HPC for Wall Street, September 2016


Page 1: HPC Technologies Driving Advances in Machine Learning · 2016-09-26

Asaf Wachtel, Sr. Director BD, Mellanox

Bob Keating, Solutions Architect, NVIDIA

HPC for Wall Street, September 2016

HPC Technologies Driving Advances in Machine Learning

Page 2

© 2016 Mellanox Technologies

Agenda

Introduction to Machine Learning & Deep Learning

GPUs in Machine Learning – Use Cases & Benefits

High Performance Interconnect in Machine Learning – Use Cases & Benefits

Roadmap – Where do we go from here?

Page 3

ENTERPRISE | AUTO | GAMING | DATA CENTER | PRO VISUALIZATION

THE WORLD LEADER IN VISUAL COMPUTING

Page 4

THE AI RACE IS ON

[Timeline figure, 2010 to 2015. Labels shown: IBM Watson Jeopardy, Google Brain, ImageNet, NVIDIA cuDNN, Theano, Caffe, Torch, Microsoft, Google, ML Beats Humans, Google Car 1M Miles, Toyota $1B AI Lab, Facebook Big Sur, MS AzureML, CNTK, Google TensorFlow, Amazon ML, IBM Watson, OpenAI, Microsoft ImageNet]

Page 5

Page 6

DEEP LEARNING DEMANDS NEW CLASS OF HPC

TRAINING (Data): Scalable Performance

• Billions of TFLOPS per training run
• Years of compute-days on Xeon CPU
• GPU turns years to days

INFERENCING (Users): Throughput + Efficiency

• Billions of FLOPS per inference
• Seconds for response on Xeon CPU
• GPU for instant response

Page 7

TESLA P100

New GPU Architecture to Enable the World’s Fastest Compute Node

• Pascal Architecture: Highest Compute Performance
• NVLink: GPU Interconnect for Maximum Scalability
• CoWoS HBM2: Unifying Compute & Memory in Single Package
• Page Migration Engine: Simple Parallel Programming with Virtually Unlimited Memory Space

[Figure: Unified Memory spanning CPU and Tesla P100]

Page 8

NVIDIA DGX-1

AI Supercomputer-in-a-Box

170 TFLOPS | 8x Tesla P100 16GB | NVLink Hybrid Cube Mesh

2x Xeon | 8 TB RAID 0 | Quad IB 100Gbps, Dual 10GbE | 3U — 3200W
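The headline figures on this slide can be broken down with simple arithmetic. The sketch below is only an illustration: it takes the slide's numbers at face value (170 TFLOPS, 8 GPUs, four 100 Gb/s InfiniBand ports, 3200 W) and assumes the aggregate figures divide evenly. The variable names are made up for this example.

```python
# Figures quoted on the DGX-1 slide (taken at face value).
DGX1_PEAK_TFLOPS = 170.0   # whole-system peak as stated on the slide
GPU_COUNT = 8              # 8x Tesla P100
IB_PORTS = 4               # "Quad IB"
IB_GBPS_PER_PORT = 100.0   # 100 Gb/s per port
POWER_WATTS = 3200.0

# Assumption: the system peak is split evenly across the GPUs.
tflops_per_gpu = DGX1_PEAK_TFLOPS / GPU_COUNT

# Aggregate InfiniBand bandwidth, one direction, in Gb/s and GB/s.
ib_total_gbps = IB_PORTS * IB_GBPS_PER_PORT
ib_total_gigabytes_per_s = ib_total_gbps / 8  # 8 bits per byte

# Peak compute efficiency in GFLOPS per watt.
gflops_per_watt = DGX1_PEAK_TFLOPS * 1000 / POWER_WATTS

print(f"{tflops_per_gpu} TFLOPS per GPU")
print(f"{ib_total_gbps} Gb/s = {ib_total_gigabytes_per_s} GB/s InfiniBand")
print(f"{gflops_per_watt} GFLOPS/W at peak")
```

These are peak, not sustained, numbers; real workloads will land below them.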

Page 9

DGX STACK

Fully integrated Deep Learning platform

• Instant productivity: plug-and-play, supports every AI framework
• Performance optimized across the entire stack
• Always up-to-date via the cloud
• Mixed framework environments: containerized
• Direct access to NVIDIA experts

Page 10

NVIDIA DIGITS

Interactive Deep Learning GPU Training System

Process Data | Configure DNN | Monitor Progress | Visualize Layers | Test Image

developer.nvidia.com/digits

Page 11

Page 12

Page 13

TESLA END-TO-END DEEP LEARNING

TRAINING: Tesla P100, 65X in 3 years

INFERENCING: Tesla P4, 40X in 2 years

Training: compared to Kepler GPU in 2013 using Caffe. Inference: img/sec/watt compared to CPU (Intel E5-2697v4) using AlexNet.

Page 14

NVIDIA DEEP LEARNING INSTITUTE

Access to self-study and instructor-led on-line courses and training materials.

Now with links to on-line training from Coursera, Microsoft, and Udacity.

On-line interactive courses provide complete coding environment – ask for tokens (free).

developer.nvidia.com/deep-learning

Page 15

High Performance Interconnect Usage for Machine Learning

Page 16

Mellanox InfiniBand Proven and Most Scalable HPC Interconnect

“Summit” System “Sierra” System

Paving the Road to Exascale

Page 17

Deep Learning – Natural Fit for HPC Technology

“Training deep neural networks is very computationally intensive: training one of our models takes tens of exaflops of work, and so HPC techniques are key to creating these models.

Because the neural network training problem is so arithmetically intense, we rely on computationally dense processors like GPUs, and because we need to scale the training process over multiple nodes, we rely on fast interconnect technologies such as Infiniband. Along with HPC hardware, we also use HPC software such as MPI and BLAS libraries. Perhaps most importantly, we approach problems from an HPC point of view: we examine the fundamental limits to our computation, and then push to see how close we can get to those limits.”

Andrew Ng, Chief Scientist, Baidu @ International Supercomputing Conference June 2016
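To get a feel for what "tens of exaflops of work" means in wall-clock time, a back-of-the-envelope estimate divides total work by sustained throughput. The sketch below is hypothetical: the 20 exaFLOP workload and the 50% utilization are assumptions chosen for illustration, and the 170 TFLOPS peak is borrowed from the DGX-1 slide rather than from the quote.

```python
def training_time_hours(total_exaflop, peak_tflops, utilization):
    """Estimate hours needed to complete a fixed amount of arithmetic work.

    total_exaflop: total work in exa floating-point operations (1e18 FLOP).
    peak_tflops:   peak throughput of the machine in TFLOPS (1e12 FLOP/s).
    utilization:   assumed fraction of peak actually sustained, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    total_flop = total_exaflop * 1e18
    sustained_flop_per_s = peak_tflops * 1e12 * utilization
    return total_flop / sustained_flop_per_s / 3600


# Assumed inputs: 20 exaFLOP of work, one 170 TFLOPS node, 50% of peak sustained.
hours = training_time_hours(total_exaflop=20, peak_tflops=170, utilization=0.5)
print(f"about {hours:.1f} hours")
```

The same function makes the scaling argument concrete: halving utilization doubles the time, which is why interconnect efficiency matters once training spans multiple nodes.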

Page 18

Evolution of GPUDirect RDMA

Before GPUDirect

Network and third-party device drivers did not share buffers, and needed to make a redundant copy in host memory.

With GPUDirect Shared Host Memory Pages

The network and GPU can share “pinned” (page-locked) buffers, eliminating the need to make a redundant copy in host memory.

[Figures: Pre-GPUDirect model; GPUDirect Shared Host Memory Pages model]

Page 19

GPUDirect™ RDMA

• Eliminates CPU bandwidth and latency bottlenecks
• Uses remote direct memory access (RDMA) transfers between GPUs
• Results in significantly improved MPI_Sendrecv efficiency between GPUs in remote nodes

[Figure: With GPUDirect™ RDMA, using PeerDirect™]
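The three generations described on these slides differ in how many host-memory staging buffers a GPU payload passes through before it reaches the network adapter. The sketch below is a conceptual model only, not driver code: the buffer names and the `SendPath` class are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SendPath:
    """Ordered list of places a payload sits on its way from GPU to the wire."""
    name: str
    hops: tuple

    @property
    def host_copies(self):
        # Count staging buffers that live in host (CPU) memory.
        return sum(1 for hop in self.hops if hop.startswith("host"))


PATHS = [
    # Separate GPU-driver and network-driver buffers: one redundant host copy.
    SendPath("Pre-GPUDirect",
             ("gpu_memory", "host_gpu_driver_buffer", "host_network_buffer", "nic")),
    # GPU and network share one pinned (page-locked) host buffer.
    SendPath("GPUDirect shared host memory pages",
             ("gpu_memory", "host_shared_pinned_buffer", "nic")),
    # The adapter reads GPU memory directly; host memory is bypassed.
    SendPath("GPUDirect RDMA",
             ("gpu_memory", "nic")),
]

for path in PATHS:
    print(f"{path.name}: {path.host_copies} host staging buffer(s)")
```

Each removed staging buffer also removes CPU involvement from the data path, which is where the latency and bandwidth gains come from.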

Page 21

Hadoop goes “Deep” with RDMA

Yahoo has 600PB of data spread across 40,000 Hadoop Nodes

Enhancing deep learning nodes with multiple GPUs & InfiniBand interconnect

Yahoo has open-sourced CaffeOnSpark (github.com/yahoo/CaffeOnSpark)

• Multi-GPU support; MPI + RDMA support

Multiple Applications using the solution

• Image recognition, search, advertisement, fraud detection etc.

[Architecture figure: a Spark Driver and three Spark Executors. Each executor is labeled "Data Feeding & Control", "Enhanced Caffe w/ Multi-GPU in a node" and "Model Synchronizer across Nodes". Other labels: "Dataset from HDFS", "Model on HDFS", "RDMA with Mellanox InfiniBand".]

Large Scale Distributed Deep Learning on Hadoop Clusters - Yahoo Big ML Team [link]
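The slide does not say which algorithm the "Model Synchronizer across Nodes" uses. As a generic illustration of what synchronizing a model across workers can look like, the sketch below shows synchronous data-parallel averaging: each worker computes gradients on its own data shard, the gradients are averaged, and every worker applies the same update. This is an assumed technique for illustration, not CaffeOnSpark's actual implementation, and all function names are hypothetical.

```python
def average_gradients(per_worker_grads):
    """Element-wise mean of equally sized gradient lists, one list per worker."""
    if not per_worker_grads:
        raise ValueError("need gradients from at least one worker")
    length = len(per_worker_grads[0])
    if any(len(g) != length for g in per_worker_grads):
        raise ValueError("all workers must report gradients of the same size")
    n = len(per_worker_grads)
    return [sum(values) / n for values in zip(*per_worker_grads)]


def sgd_step(weights, gradient, learning_rate):
    """Plain gradient-descent update applied identically on every worker."""
    return [w - learning_rate * g for w, g in zip(weights, gradient)]


def synchronized_step(weights, per_worker_grads, learning_rate=0.5):
    """One synchronous step: average across workers, then update the model."""
    return sgd_step(weights, average_gradients(per_worker_grads), learning_rate)


# Two workers, two parameters; the gradient values are made up for illustration.
new_weights = synchronized_step([1.0, 1.0], [[1.0, 2.0], [3.0, 4.0]])
print(new_weights)
```

In a real cluster the averaging step is the part that crosses the network on every iteration, which is why the slide pairs it with MPI and RDMA.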

Page 22

Mellanox Interconnect Enables Baidu’s Deep Image Supercomputer - Minwa


• 4x higher resolution images

• Less than 6% application learning error rate

Future use cases include driver-less cars

Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang and Gang Sun from Baidu Research, Deep Image: Scaling up Image Recognition [link]

Page 23

Big Sur – An Open AI Platform from Facebook

An OCP based, GP-GPU AI Platform

Open Rack v2 compatible, 4OU chassis

Flexible Architecture supporting up to 8 GPUs

High Speed Interconnect for scale

Use Cases:

• Text Processing

• Language Modeling

• Artificial Intelligence

• Computer Vision

https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/

Page 24

NVIDIA DGX-1

WORLD’S FIRST DEEP LEARNING SUPERCOMPUTER

170 TFLOPS

8x Tesla P100 16GB in NVLink Cube Mesh

Optimized Deep Learning Software

Dual Xeon

7 TB SSD Deep Learning Cache

Dual 10GbE

Quad Infiniband 100Gb

3RU – 3200W

Mellanox ConnectX-4 + GPUDirect RDMA Inside

Page 25

Come Visit our Booth @ HPC on Wall Street:

Highest-Performance 100Gb/s Interconnect Solutions

• Transceivers, Active Optical and Copper Cables (10 / 25 / 40 / 50 / 56 / 100Gb/s): VCSELs, Silicon Photonics and Copper
• 36 EDR (100Gb/s) Ports, <90ns Latency; Throughput of 7.2Tb/s; 7.02 Billion msg/sec (195M msg/sec/port)
• 100Gb/s Adapter, 0.6us latency; 200 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)
• 32 100GbE Ports, 64 25/50GbE Ports (10 / 25 / 40 / 50 / 100GbE); Throughput of 6.4Tb/s
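The switch figures above are internally consistent if throughput is counted in both directions, which is the usual convention for switch datasheets but is an assumption here since the slide does not say so. A minimal check, with hypothetical helper names:

```python
def switch_throughput_tbps(ports, gbps_per_port, directions=2):
    """Aggregate switching capacity in Tb/s.

    directions=2 assumes full-duplex ports counted in both directions.
    """
    return ports * gbps_per_port * directions / 1000


def aggregate_msg_rate_billion(ports, million_msgs_per_port):
    """Total message rate in billions of messages per second."""
    return ports * million_msgs_per_port / 1000


# 36-port EDR switch: 36 x 100 Gb/s, both directions.
print(switch_throughput_tbps(36, 100), "Tb/s")
# 36 ports x 195 million messages per second per port.
print(aggregate_msg_rate_billion(36, 195), "billion msg/sec")
# 32-port 100GbE switch: 32 x 100 Gb/s, both directions.
print(switch_throughput_tbps(32, 100), "Tb/s")
```

Counting a single direction instead would give 3.6 Tb/s and 3.2 Tb/s, half of the quoted values.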

Page 26

SEE THE FUTURE OF AI IN DC

Location: Ronald Reagan Building & International Trade Center, Washington D.C.

Event Date: October 26-27, 2016

The GTC DC is a regional extension of the GTC event held annually in Silicon Valley. GTC DC attendees can train and connect with the brightest minds in computing on the hottest topics – including artificial intelligence and deep learning, virtual reality, and autonomous machines.

Registration opens August 18, 2016 at http://dc.gputechconf.com

Page 27

Thank You