
Post on 22-Jul-2019


Facebook Flexible GPU Expander: Big Basin Refresh

Whitney Zhao/HW Eng/Facebook Inc.
Xiaodong Wang/SW Eng/Facebook Inc.

3

Agenda

• Introduction
• Architecture
• Performance
• Questions

4

Introduction

Agenda

5

Impact

Facebook’s commitment to developing AI & advancing ML

• LANGUAGE TRANSLATION
• FACE RECOGNITION
• SEARCH
• ADS
• NEWS FEED
• SIGMA
• LUMOS

6

Goal
• Open, full contribution to OCP
• Disaggregation/Modularity
• Serviceability

2016: Big Sur
2017: Leopard + Big Basin
2018: Tioga Pass + Big Basin V2

7

Big Basin V2 Overview

• 3 OU chassis
• Open Rack v2 compatible
• 8x Nvidia Tesla V100 GPUs; NVLink capable
• 300W TDP for each Tesla V100 GPU
• Facebook 2S Server Tioga Pass as Head node
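As a rough illustration of what the overview implies for power planning, the sketch below multiplies the GPU count by the per-GPU TDP from the slide. The function name is hypothetical, and the result covers GPU TDP only; it leaves out fans, PCIe switches, and the Tioga Pass head node, for which the slides give no figures.

```python
# Figures taken from the Big Basin V2 overview slide.
GPU_COUNT = 8        # Tesla V100 GPUs per chassis
GPU_TDP_WATTS = 300  # TDP of each Tesla V100 GPU


def gpu_power_budget_watts(gpu_count: int = GPU_COUNT,
                           tdp_watts: int = GPU_TDP_WATTS) -> int:
    """Aggregate GPU TDP only (hypothetical helper, not from the slides).

    Excludes fans, switches, and the head node.
    """
    return gpu_count * tdp_watts


print(f"Aggregate GPU TDP: {gpu_power_budget_watts()} W")
```

With the slide's numbers this comes to 2400 W of GPU TDP per chassis.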

8

A deeper look into Big Basin

Baseboard on sliding tray

Midplane board
Baseboard
IO board

9

Serviceability

• Quick repairs at data center

• Telemetry accessible from head node

• Provisioning Big Basin with its head node differs little from provisioning existing servers; the system behaves like a server that comes with additional GPUs.

Thumb screws

10

Agenda

Architecture

11

Architecture (Headnode to Big Basin)

Leopard + Big Basin (Tesla P100)
Tioga Pass + Big Basin V2 (Tesla V100)

• MiniSAS HD cable (2 for each x16)
  o Standard PCIe x16
  o Present Pin
  o USB2.0
  o IPMB/I2C

12

Architecture (PCIe)

[Figure: PCIe topology diagrams for Leopard + Big Basin and for Tioga Pass + Big Basin V2; surviving diagram labels include "Tioga Pass" and "50G"]

13

Architecture (NVLINK)

[Figure: NVLink topology diagrams for Big Basin w/ Nvidia Tesla P100 and for Big Basin V2 w/ Nvidia Tesla V100]

14

Architecture (IPMB/I2C/PMBUS)

15

Agenda

Performance

16

Performance

• Hardware Spec Improvement
• Application performance
  • Computer vision
    • Single-GPU
    • Multi-GPU scalability
    • TensorCore
  • Neural machine translation

17

Performance

• Comparisons of GPU Hardware

Metrics            NVIDIA V100    NVIDIA P100    Improvement
Performance
  FP-32            15 TFLOPS      10.6 TFLOPS    1.42x
  FP-16            30 TFLOPS      21.2 TFLOPS
  TensorCore       125 TFLOPS     NA             Up to 5x
Mem Bandwidth      900 GB/s       720 GB/s       1.25x
NVLink             300 GB/s       160 GB/s       1.88x
Power              300 W          300 W
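The Improvement column is the V100 figure divided by the P100 figure. The short sketch below recomputes those ratios from the numbers on the slide; the dictionary and function names are hypothetical. The FP-16 ratio is included even though the transcript shows no improvement value for that row, so treat it as derived here rather than as a figure from the deck.

```python
# (V100, P100) pairs copied from the hardware comparison slide.
SPECS = {
    "FP-32 (TFLOPS)": (15.0, 10.6),
    "FP-16 (TFLOPS)": (30.0, 21.2),   # improvement not listed in the transcript
    "Mem bandwidth (GB/s)": (900.0, 720.0),
    "NVLink (GB/s)": (300.0, 160.0),
}


def improvement(v100: float, p100: float) -> float:
    """Ratio of the V100 figure to the P100 figure."""
    return v100 / p100


ratios = {name: improvement(v100, p100) for name, (v100, p100) in SPECS.items()}

for name, ratio in ratios.items():
    print(f"{name:<22} {ratio:.2f}x")
```

The computed ratios agree with the slide's 1.42x, 1.25x, and 1.88x entries. Power is omitted because both GPUs are listed at 300 W.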

18

Performance

• Comparisons of GPU Hardware

• Head-node upgrade: Tioga Pass
  • New CPU architecture: Broadwell to Skylake
  • Double PCIe bandwidth
  • Upgraded 100G NIC

• CUDA 9 + cuDNN 7: faster libraries, etc.

19

Impact - Computer Vision

20

Performance metrics in Computer Vision

• Computer Vision: ResNet-50
⎻ 1-GPU training speed: use P100 + CUDA 8 as baseline

[Bar chart: training speed for P100, V100 + CUDA 8, V100 + CUDA 9, and V100 + CUDA 9 + TensorCore; surviving data label: 1.66x]

21

Computer Vision Performance

• Computer Vision
⎻ Multi-GPU speedup vs. 1 P100

[Chart: speedup vs. 1 P100 for 1-V100, 2-V100, 4-V100, and 8-V100; series: FP-32; surviving data label: 1.66x]

22

Computer Vision Performance

• Computer Vision
⎻ High-bandwidth FP-16 TensorCore (WIP)

[Chart: speedup vs. 1 P100 for 1-V100, 2-V100, 4-V100, and 8-V100; series: FP-32 and FP-16 TensorCore]

23

Machine Translation

Better Translation Quality

Phrase-based statistical approach
Neural network approach

24

Machine Translation Performance

• Neural machine translation: training throughput

P100 + CUDA 8: baseline
V100 + CUDA 8: 1.45x
V100 + CUDA 9: 2.2x
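The throughput figures on this slide can be split into a hardware step and a software step. The sketch below does that arithmetic under two assumptions that the slide does not state outright: that 1.45x belongs to V100 + CUDA 8 and 2.2x to V100 + CUDA 9 (both relative to P100 + CUDA 8), and that the two gains compose multiplicatively. The variable names are hypothetical.

```python
# Training throughput relative to P100 + CUDA 8, as read from the slide.
BASELINE = 1.0     # P100 + CUDA 8
V100_CUDA8 = 1.45  # assumed mapping: V100 + CUDA 8
V100_CUDA9 = 2.2   # assumed mapping: V100 + CUDA 9

# Gain from swapping the GPU while keeping CUDA 8.
hardware_gain = V100_CUDA8 / BASELINE

# Remaining gain from moving the V100 from CUDA 8 to CUDA 9,
# assuming the two effects multiply.
software_gain = V100_CUDA9 / V100_CUDA8

print(f"Hardware (P100 -> V100):   {hardware_gain:.2f}x")
print(f"Software (CUDA 8 -> 9):    {software_gain:.2f}x")
print(f"Combined:                  {hardware_gain * software_gain:.2f}x")
```

Under these assumptions, roughly 1.52x of the overall 2.2x would come from the CUDA 9 software stack on the same V100 hardware.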

25

Questions?

27

OCP Marketplace

• http://www.opencompute.org/products/specsanddesign?keyword=Big+basin
