Advanced Spark and TensorFlow Meetup 2017-05-06

REDUCED PRECISION (FP16, INT8) INFERENCE ON CONVOLUTIONAL NEURAL NETWORKS WITH TENSORRT AND NVIDIA PASCAL

Apr 2017 – Chris Gottbrath, NVIDIA

TRANSCRIPT

Page 1:

Apr 2017 – Chris Gottbrath

REDUCED PRECISION (FP16, INT8) INFERENCE ON CONVOLUTIONAL NEURAL NETWORKS WITH TENSORRT AND NVIDIA PASCAL

Page 2:

AGENDA

Deep Learning

TensorRT

Reduced Precision

GPU REST Engine

Conclusion

Page 3:

NEW AI SERVICES POSSIBLE WITH GPU CLOUD

SPOTIFY SONG RECOMMENDATIONS

NETFLIX VIDEO RECOMMENDATIONS

YELP SELECTING COVER PHOTOS

Page 4:

TESLA REVOLUTIONIZES DEEP LEARNING

NEURAL NETWORK APPLICATION

                 BEFORE TESLA      AFTER TESLA
Cost             $5,000K           $200K
Servers          1,000 servers     16 Tesla servers
Energy           600 KW            4 KW
Performance      1x                6x

Page 5:

NVIDIA DEEP LEARNING SDK

Powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications

High performance building blocks for training and deploying deep neural networks on NVIDIA GPUs

Industry vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks

Multi-GPU scaling that accelerates training on up to eight GPUs

High performance GPU-acceleration for deep learning

“We are amazed by the steady stream of improvements made to the NVIDIA Deep Learning SDK and the speedups that they deliver.”

— Frédéric Bastien, Team Lead (Theano) MILA

developer.nvidia.com/deep-learning-software

Page 6:

POWERING THE DEEP LEARNING ECOSYSTEM

NVIDIA SDK accelerates every major framework

COMPUTER VISION: OBJECT DETECTION, IMAGE CLASSIFICATION

SPEECH & AUDIO: VOICE RECOGNITION, LANGUAGE TRANSLATION

NATURAL LANGUAGE PROCESSING: RECOMMENDATION ENGINES, SENTIMENT ANALYSIS

DEEP LEARNING FRAMEWORKS

[Framework logos; Mocha.jl among them]

NVIDIA DEEP LEARNING SDK

developer.nvidia.com/deep-learning-software

Page 7:

TensorRT

Page 8:

NVIDIA DEEP LEARNING SOFTWARE PLATFORM

[Diagram: a TRAINING FRAMEWORK (training data, training, data management, model assessment), built on the NVIDIA Deep Learning SDK, produces a trained neural network; TensorRT then deploys it to embedded, automotive, and data center platforms.]

developer.nvidia.com/deep-learning-software

Page 9:

NVIDIA TensorRT

High-performance deep learning inference for production deployment

developer.nvidia.com/tensorrt

High performance neural network inference engine for production deployment

Generate optimized and deployment-ready models for datacenter, embedded and automotive platforms

Deliver high-performance, low-latency inference demanded by real-time services

Deploy faster, more responsive and memory efficient deep learning applications with INT8 and FP16 optimized precision support

[Chart: GoogLeNet inference throughput (images/sec, 0-7,000) vs. batch size (2, 8, 128) for CPU-only, Tesla P40 + TensorRT (FP32), and Tesla P40 + TensorRT (INT8); up to 36x more images/sec. CPU: 1-socket E5-2690 v4 @ 2.6 GHz with HT on. GPU host: 2-socket E5-2698 v3 @ 2.3 GHz with HT off, one P40 card in the box.]

Page 10:

WORKFLOW – GETTING A TRAINED MODEL INTO TensorRT

Page 11:

TensorRT Development Workflow

[Diagram: a training framework produces a trained NEURAL NETWORK; OPTIMIZATION USING TensorRT (configured with batch size and precision) yields a PLAN, which is validated using TensorRT and serialized to disk.]

developer.nvidia.com/tensorrt
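As a concrete, hedged illustration of the serialize-to-disk step: in TensorRT 2.x-era C++ APIs the built engine can be serialized to a host buffer and written out roughly as below. The exact serialize() signature varies across releases, so treat this as a sketch.

#include <fstream>

// Sketch: write the optimized engine ("PLAN") to disk.
// Assumes a built ICudaEngine* engine, as produced on Page 13.
void save_plan(ICudaEngine* engine, const char* path)
{
    IHostMemory* plan = engine->serialize();   // deployment-ready model bytes
    std::ofstream out(path, std::ios::binary);
    out.write(static_cast<const char*>(plan->data()), plan->size());
    plan->destroy();                           // release the host buffer
}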

Page 12:

TensorRT Production Workflow

[Diagram: a serialized PLAN is loaded by the RUNTIME USING TensorRT.]

developer.nvidia.com/tensorrt
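And the matching runtime side, again a sketch against TensorRT 2.x-era calls, assuming planData/planSize hold the PLAN file contents read back from disk:

// Sketch: deserialize a PLAN and prepare for inference.
IRuntime* runtime = createInferRuntime(gLogger);
ICudaEngine* engine = runtime->deserializeCudaEngine(planData, planSize, nullptr);
IExecutionContext* context = engine->createExecutionContext();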

Page 13:

TO IMPORT A TRAINED MODEL TO TensorRT

Key function calls (this assumes you have a Caffe model file):

IBuilder* builder = createInferBuilder(gLogger);           // create the builder
INetworkDefinition* network = builder->createNetwork();    // empty network definition
CaffeParser parser;                                        // Caffe model importer
auto blob_name_to_tensor = parser.parse(<network definition>, <weights>, *network, <datatype>);
network->markOutput(*blob_name_to_tensor->find(<output layer name>));  // mark the output tensor
builder->setMaxBatchSize(<size>);                          // largest batch the engine must support
builder->setMaxWorkspaceSize(<size>);                      // scratch memory available during optimization
ICudaEngine* engine = builder->buildCudaEngine(*network);  // build the optimized engine

developer.nvidia.com/tensorrt

Page 14:

IMPORTING USING THE GRAPH DEFINITION API

If you are using another framework, such as TensorFlow, you can call our network builder API from any framework:

ITensor* in = network->addInput("input", DataType::kFloat, Dims3{…});
IPoolingLayer* pool = network->addPooling(in, PoolingType::kMAX, …);
etc.

We are looking at a streamlined graph input for TensorFlow, like our Caffe parser.

developer.nvidia.com/tensorrt
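To make the shape of the builder API a little more concrete, here is a slightly fuller sketch of defining a small network by hand. The weight objects (convWeights, convBias) and all dimensions are placeholders chosen for illustration, and the exact enum/struct names (kFLOAT vs. kFloat, DimsCHW, DimsHW) should be checked against your TensorRT version's headers:

// Hedged sketch: a tiny conv -> ReLU -> max-pool network built by hand.
ITensor* in = network->addInput("input", DataType::kFLOAT, DimsCHW{3, 224, 224});
IConvolutionLayer* conv = network->addConvolution(*in, 64, DimsHW{3, 3},
                                                  convWeights, convBias);  // placeholder Weights
IActivationLayer* relu = network->addActivation(*conv->getOutput(0),
                                                ActivationType::kRELU);
IPoolingLayer* pool = network->addPooling(*relu->getOutput(0),
                                          PoolingType::kMAX, DimsHW{2, 2});
network->markOutput(*pool->getOutput(0));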

Page 15:

EXECUTE THE NEURAL NETWORK

IExecutionContext* context = engine->createExecutionContext();

<handle> = engine->getBindingIndex(<binding layer name>);

<malloc and cudaMalloc calls>            // allocate host and device buffers for data moving in and out

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(<args>);                 // copy input data to the GPU
context->enqueue(<args>);                // launch inference asynchronously on the stream
cudaMemcpyAsync(<args>);                 // copy output data back to the host
cudaStreamSynchronize(stream);           // wait for the stream to finish

Running inference using the API

Page 16:

THROUGHPUT

[Chart: ResNet-50 throughput (images/sec, 0-2,500) vs. batch size (1-128) for Caffe FP32 on CPU, Caffe FP32 on P100, TensorFlow FP32 on P100, TensorRT FP32 on P100, and TensorRT FP16 on P100. TensorRT is the 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is the NV version with cuDNN 5; Caffe on CPU uses MKL on an E5-2690 v4 with 14 cores.]

Page 17:

LATENCY

[Chart: ResNet-50 latency (ms to execute batch, log scale 1-10,000) vs. batch size (1-128) for the same five configurations. TensorRT is the 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is the NV version with cuDNN 5; Caffe on CPU uses MKL on an E5-2690 v4 with 14 cores.]

Page 18:

REDUCED PRECISION

Page 19:

SMALLER AND FASTER

[Charts: ResNet-50 model, batch size = 128, TensorRT 2.1 RC pre-release. Left: performance (images/sec scaled to FP32, 0-3.5x) for FP32, FP16 on P100, and INT8 on P40. Right: memory usage (% scaled to FP32, 0-120%) for the same three configurations.]

developer.nvidia.com/tensorrt

Page 20:

INT8 INFERENCE

• Main challenge: INT8 has significantly lower precision and dynamic range compared to FP32

• Requires “smart” quantization and calibration from FP32 to INT8

        Dynamic Range               Min Positive Value
FP32    -3.4×10^38 ~ +3.4×10^38    1.4×10^-45
FP16    -65504 ~ +65504            5.96×10^-8
INT8    -128 ~ +127                1

developer.nvidia.com/tensorrt

Page 21:

QUANTIZATION OF WEIGHTS

Symmetric, linear quantization onto the range [-127, 127]:

I8_weight = Round_to_nearest_int( scaling_factor * F32_weight )

scaling_factor = 127.0f / max( abs( all_F32_weights_in_the_filter ) )
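A minimal host-side sketch of those two formulas applied to one filter; the function name and use of std::vector are illustrative, not from the slide:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric, linear quantization of one filter's weights onto [-127, 127].
std::vector<int8_t> quantize_filter(const std::vector<float>& f32_weights,
                                    float& scaling_factor)
{
    float max_abs = 0.0f;
    for (float w : f32_weights) max_abs = std::max(max_abs, std::fabs(w));
    // scaling_factor = 127.0f / max(abs(all F32 weights in the filter))
    scaling_factor = (max_abs > 0.0f) ? 127.0f / max_abs : 1.0f;
    std::vector<int8_t> i8_weights(f32_weights.size());
    for (size_t i = 0; i < f32_weights.size(); ++i)
        i8_weights[i] = static_cast<int8_t>(std::lround(scaling_factor * f32_weights[i]));
    return i8_weights;
}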

Page 22:

QUANTIZATION OF ACTIVATIONS

I8_value = (value > threshold) ? threshold : scale * F32_value

How do you decide the optimal ‘threshold’?

The activation range is unknown offline; it depends on the input. The answer is calibration using a ‘representative’ dataset.
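A sketch of this saturating quantization with the calibrated threshold passed in; the calibration procedure itself (choosing the threshold from a representative dataset, covered in depth in GTC session S7310 listed at the end of this deck) is not reproduced here:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Saturating activation quantization: values beyond the calibrated
// threshold clamp to +/-127; everything inside is scaled linearly.
int8_t quantize_activation(float f32_value, float threshold)
{
    const float scale = 127.0f / threshold;
    const float clamped = std::max(-threshold, std::min(threshold, f32_value));
    return static_cast<int8_t>(std::lround(scale * clamped));
}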

Page 23:

TENSORRT INT8 WORKFLOW

[Diagram: an FP32 training framework produces an FP32 NEURAL NETWORK; INT8 OPTIMIZATION USING TensorRT (configured with batch size, precision, and a calibration dataset) produces an INT8 PLAN, which runs under the INT8 RUNTIME USING TensorRT.]

developer.nvidia.com/tensorrt

Page 24:

TURNING ON INT8 AND CALLING THE CALIBRATOR

API calls:

builder->setInt8Mode(true);                 // enable INT8 optimization
IInt8Calibrator* calibrator;                // your calibrator implementation
builder->setInt8Calibrator(calibrator);     // hand it to the builder

bool getBatch(<args>) override              // your calibrator supplies calibration batches via this override

developer.nvidia.com/tensorrt
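For orientation, a heavily hedged sketch of what a calibrator can look like around those calls. The real IInt8Calibrator interface has several more overrides and differs across TensorRT releases; BatchSource below is a hypothetical helper that stages preprocessed calibration images in GPU memory:

// Sketch only: the override follows the slide's getBatch() pseudocode.
class MyCalibrator : public IInt8Calibrator
{
public:
    // TensorRT calls this repeatedly during the INT8 build; fill the
    // device bindings with the next batch, return false when done.
    bool getBatch(void* bindings[], const char* names[], int nbBindings) override
    {
        if (!source_.next(&device_batch_))   // hypothetical data feeder
            return false;
        bindings[0] = device_batch_;
        return true;
    }
    // ... other overrides (batch size, calibration cache) omitted.
private:
    BatchSource source_;                     // hypothetical helper
    void* device_batch_ = nullptr;
};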

Page 25:

8-BIT INFERENCE

Top-1 Accuracy

Network    FP32 Top-1    INT8 Top-1    Difference    Perf Gain

developer.nvidia.com/tensorrt

Page 26:

DEPLOYING ACCELERATED FUNCTIONS SUCH AS TensorRT AS A MICROSERVICE WITH GPU REST ENGINE (GRE)

Page 27:

GPU REST ENGINE (GRE) SDK

Accelerated microservices for web and mobile

Supercomputer performance for hyperscale datacenters

Up to 50 teraflops per node, min ~250μs response time

Easy to develop new microservices

Open source, integrates with existing infrastructure

Easy to deploy & scale

Ready-to-run Dockerfile

[Diagram: clients reach the GPU REST Engine over HTTP (~250μs); behind it sit accelerated services such as image classification, speech recognition, and image scaling.]

developer.nvidia.com/gre

Page 28:

WEB ARCHITECTURE WITH GRE

Create accelerated microservices with REST interfaces

Provide your own GPU kernel; GRE plugs in easily

[Diagram: a web presentation layer backed by services (content, ident svc, ads, ICE, img data, analytics), with GRE instances plugged in alongside to serve accelerated functions such as image classification.]

developer.nvidia.com/gre

Page 29:

HELLO WORLD MICROSERVICE

[Diagram: a client calls the microservice's REST API. The HTTP layer (Go, host CPU) runs func EmptyKernel_Handler; the app layer (C++, host CPU) runs benchmark_execute(); the CPU-side layer (CUDA host, host CPU) runs kernel_wrapper() inside a ScopedContext<>; the device layer launches empty_kernel<<<>>> on the GPU.]

developer.nvidia.com/gre

Page 30:

CONTEXT POOL

[Diagram: a resource pool of Contexts spread across GPU1 and GPU2; each incoming Request acquires a ScopedContext, which checks a Context out of the pool and returns it when the request completes.]

developer.nvidia.com/gre
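The deck names ContextPool and ScopedContext but does not show their implementation. Below is a minimal sketch of the pattern, assuming only the two class names from the slides: a blocking pool of per-GPU contexts plus an RAII wrapper that checks one out per request. It is an illustration, not GRE's actual code.

#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>

// Blocking pool: Push() adds a context, Pop() waits until one is free.
template <typename Context>
class ContextPool
{
public:
    void Push(std::unique_ptr<Context> ctx)
    {
        std::lock_guard<std::mutex> lock(mu_);
        pool_.push(std::move(ctx));
        cv_.notify_one();
    }
    std::unique_ptr<Context> Pop()
    {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !pool_.empty(); });
        auto ctx = std::move(pool_.front());
        pool_.pop();
        return ctx;
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::unique_ptr<Context>> pool_;
};

// RAII wrapper: holds a context for one request, returns it on scope exit.
template <typename Context>
class ScopedContext
{
public:
    explicit ScopedContext(ContextPool<Context>& pool)
        : pool_(pool), ctx_(pool.Pop()) {}
    ~ScopedContext() { pool_.Push(std::move(ctx_)); }
    Context* operator->() const { return ctx_.get(); }
private:
    ContextPool<Context>& pool_;
    std::unique_ptr<Context> ctx_;
};

Because Pop() blocks when the pool is empty, concurrent HTTP handlers naturally throttle to the number of contexts created per GPU.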

Page 31:

CLASSIFICATION MICROSERVICE

[Diagram: a client calls the microservice's REST API. The HTTP layer (Go, host CPU) runs func classify; the app layer (C++, host CPU) runs classifier_classify() inside a ScopedContext<>; classify() then executes on the GPU through the CUDA device layer.]

developer.nvidia.com/gre

Page 32:

CLASSIFICATION.CPP (1/2)

Call chain: func classify → classifier_classify() → classify()

constexpr static int kContextsPerDevice = 2;   // two CaffeContexts per GPU, to allow latency hiding

classifier_ctx* classifier_initialize(char* model_file, char* trained_file,
                                      char* mean_file, char* label_file)
{
  try {
    int device_count;
    cudaError_t st = cudaGetDeviceCount(&device_count);
    ContextPool<CaffeContext> pool;
    for (int dev = 0; dev < device_count; ++dev) {
      for (int i = 0; i < kContextsPerDevice; ++i) {
        std::unique_ptr<CaffeContext> context(
            new CaffeContext(model_file, trained_file, mean_file, label_file, dev));
        pool.Push(std::move(context));
      }
    }
  } catch (...) { /* ... */ }
}

developer.nvidia.com/gre

Page 33:

CLASSIFICATION.CPP (2/2)

Call chain: func classify → classifier_classify() → classify()

const char* classifier_classify(classifier_ctx* ctx, char* buffer, size_t length)
{
  try {
    ScopedContext<CaffeContext> context(ctx->pool);   // check a context out of the pool
    auto classifier = context->CaffeClassifier();
    predictions = classifier->Classify(img);          // lower-level classify routine
    /* Write the top N predictions in JSON format. */
  }
}

developer.nvidia.com/gre

Page 34:

CONCLUSION

Inference is going to power an increasing number of features and capabilities.

Latency is important for responsive services.

Throughput is important for controlling costs and scaling out.

GPUs can deliver high throughput and low latency.

Reduced precision can be used for an extra boost.

There is a template to follow for creating accelerated microservices.

developer.nvidia.com/gre

Page 35:

WANT TO LEARN MORE?

GPU Technology Conference

May 8-11 in San Jose

S7310 - 8-Bit Inference with TensorRT (Szymon Migacz)

S7458 - Deploying Unique DL Networks as Micro-Services with TensorRT, User-Extensible Layers, and GPU REST Engine (Chris Gottbrath)

9 Spark and 17 TensorFlow sessions

20% off discount code: NVCGOTT

developer.nvidia.com/tensorrt

developer.nvidia.com/gre

devblogs.nvidia.com/parallelforall/

NVIDIA Jetson TX2 Delivers Twice …

Production Deep Learning …

www.nvidia.com/en-us/deep-learning-ai/education/

github.com/dusty-nv/jetson-inference

Resources to check out


Page 36:

[email protected]

THANKS

Page 37:

RESOURCE SLIDES

Page 38:

main.go

Call chain: func EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>

func EmptyKernel_Handler(w http.ResponseWriter, r *http.Request) {
    // Calls the C function via cgo
    C.benchmark_execute(benchmark_ctx, (*C.char)(unsafe.Pointer(&message[0])))
    io.WriteString(w, string(message[:]))
}

func main() {
    http.HandleFunc("/EmptyKernel/", EmptyKernel_Handler)  // set the API URL
    http.ListenAndServe(":8000", nil)                      // execute the server
}
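Once the server is running, the endpoint can be exercised with an ordinary HTTP GET, for example curl http://localhost:8000/EmptyKernel/ (the path and port registered in main above).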

Page 39:

benchmark.cpp (1/2)

Call chain: func EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>

constexpr static int kContextsPerDevice = 4;   // 4 contexts per GPU

benchmark_ctx* benchmark_initialize()
{
  int device_count;
  cudaGetDeviceCount(&device_count);           // get # GPUs
  ContextPool<BenchmarkContext> pool;          // create the pool
  for (int dev = 0; dev < device_count; ++dev) {
    for (int i = 0; i < kContextsPerDevice; ++i) {
      std::unique_ptr<BenchmarkContext> context(new BenchmarkContext(dev));
      pool.Push(std::move(context));
    }
  }
}

Page 40:

benchmark.cpp (2/2)

Call chain: func EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>

void benchmark_execute(benchmark_ctx* ctx, char* message)
{
  ScopedContext<BenchmarkContext> context(ctx->pool);   // scoped context from the pool
  cudaStream_t stream = context->CUDAStream();
  kernel_wrapper(stream, message);                      // run the wrapper
}

Page 41:

kernel.cu

Call chain: func EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>

// GPU code: the (almost) empty kernel copies a greeting into device memory
__global__ void empty_kernel(char* device_message)
{
    const char message[50] = "Hello world from an (almost) empty CUDA kernel :)";
    for (int i = 0; i < 50; i++) {
        device_message[i] = message[i];
        if (message[i] == '\0') break;
    }
}

// Host-side wrapper: launches the device call and copies the result back
void kernel_wrapper(cudaStream_t stream, char* message)
{
    cudaHostAlloc((void**)&device_message, message_size, cudaHostAllocDefault);
    host_message = (char*)malloc(message_size);
    empty_kernel<<<1, 1, 0, stream>>>(device_message);
    cudaMemcpy(host_message, device_message, message_size, cudaMemcpyDeviceToHost);
    strncpy(message, host_message, message_size);
}

Page 42:

TensorRT: Layer Types Supported

• Convolution: Currently only 2D convolutions

• Activation: ReLU, tanh and sigmoid

• Pooling: max and average

• Scale: similar to Caffe Power layer (shift+scale*x)^p

• ElementWise: sum, product or max of two tensors

• LRN: cross-channel only

• Fully-connected: with or without bias

• SoftMax: cross-channel only

• Deconvolution


Page 43:

TENSORRT Optimizations

• Fuse network layers

• Eliminate concatenation layers

• Kernel specialization

• Auto-tuning for target platform

• Tuned for given batch size

[Diagram: TRAINED NEURAL NETWORK → OPTIMIZED INFERENCE RUNTIME]

developer.nvidia.com/tensorrt

Page 44:

GRAPH OPTIMIZATION: Unoptimized network

[Diagram: an unoptimized inception-style subgraph. The input feeds parallel branches of 1x1, 3x3, and 5x5 convolutions, each followed by separate bias and ReLU layers, plus a max-pool branch; the branch outputs are concatenated and passed on as the next input.]

Page 45:

GRAPH OPTIMIZATION: Vertical fusion

[Diagram: each convolution + bias + ReLU chain is fused vertically into a single "CBR" kernel, leaving 1x1, 3x3, and 5x5 CBR blocks (and the max-pool branch) feeding the concat.]

Page 46:

GRAPH OPTIMIZATION: Horizontal fusion

[Diagram: the parallel 1x1 CBR blocks that read the same input are fused horizontally into one wider 1x1 CBR, alongside the 3x3 CBR, 5x5 CBR, and max-pool branches feeding the concat.]

Page 47:

GRAPH OPTIMIZATION: Concat elision

[Diagram: the concatenation layer is eliminated; the 1x1, 3x3, and 5x5 CBR blocks and the max-pool branch write directly into the next input's buffer.]

Page 48:

INT8 PRECISION: New in TensorRT

ACCURACY, EFFICIENCY, PERFORMANCE

[Charts: GoogLeNet, FP32 vs. INT8 precision with TensorRT on a Tesla P40 GPU; host is a 2-socket Haswell E5-2698 v3 @ 2.3 GHz with HT off. Left: throughput (images/sec, 0-7,000) vs. batch size (2, 4, 128): up to 3x more images/sec with INT8. Middle: memory (MB, 0-1,400) vs. batch size: deploy 2x larger models with INT8. Right: Top-1 and Top-5 accuracy (%): deliver full accuracy with INT8 precision.]

Page 49:

IDP.4A – 8-BIT INSTRUCTION

[Diagram: four pairs of 8-bit integers are multiplied and their products accumulated into a 32-bit integer in a single instruction:

i32 += (i8 × i8) + (i8 × i8) + (i8 × i8) + (i8 × i8)]
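On the CUDA side this capability surfaces as the __dp4a intrinsic (compute capability 6.1, e.g. Tesla P4/P40; compile with nvcc -arch=sm_61). A small self-contained check, with the byte packing chosen here purely for illustration:

#include <cstdio>
#include <cuda_runtime.h>

// Each int packs four signed 8-bit lanes; __dp4a multiplies the lanes
// pairwise and accumulates the four products, plus c, into a 32-bit int.
__global__ void dp4a_demo(int a, int b, int c, int* out)
{
    *out = __dp4a(a, b, c);
}

int main()
{
    // Lanes (1,2,3,4)·(5,6,7,8) + 10 = 5 + 12 + 21 + 32 + 10 = 80
    int a = (4 << 24) | (3 << 16) | (2 << 8) | 1;
    int b = (8 << 24) | (7 << 16) | (6 << 8) | 5;
    int h_out = 0, *d_out;
    cudaMalloc(&d_out, sizeof(int));
    dp4a_demo<<<1, 1>>>(a, b, 10, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dp4a result: %d\n", h_out);   // expect 80
    cudaFree(d_out);
    return 0;
}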