Advanced Spark and TensorFlow Meetup 2017-05-06: Reduced Precision (FP16, INT8) Inference on...
TRANSCRIPT
Apr 2017 – Chris Gottbrath
REDUCED PRECISION (FP16, INT8) INFERENCE ON CONVOLUTIONAL NEURAL NETWORKS WITH TENSORRT AND NVIDIA PASCAL
AGENDA
Deep Learning
TensorRT
Reduced Precision
GPU REST Engine
Conclusion
NEW AI SERVICES POSSIBLE WITH GPU CLOUD
SPOTIFY SONG RECOMMENDATIONS
NETFLIX VIDEO RECOMMENDATIONS
YELP SELECTING COVER PHOTOS
TESLA REVOLUTIONIZES DEEP LEARNING
NEURAL NETWORK APPLICATION
BEFORE TESLA → AFTER TESLA
Cost: $5,000K → $200K
Servers: 1,000 Servers → 16 Tesla Servers
Energy: 600 KW → 4 KW
Performance: 1x → 6x
NVIDIA DEEP LEARNING SDK
Powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications
High performance building blocks for training and deploying deep neural networks on NVIDIA GPUs
Industry vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks
Multi-GPU scaling that accelerates training on up to eight GPUs
High performance GPU-acceleration for deep learning
“We are amazed by the steady stream of improvements made to the NVIDIA Deep Learning SDK and the speedups that they deliver.”
— Frédéric Bastien, Team Lead (Theano), MILA
developer.nvidia.com/deep-learning-software
POWERING THE DEEP LEARNING ECOSYSTEM: NVIDIA SDK accelerates every major framework
[Diagram: computer vision (object detection, image classification), speech & audio (voice recognition, language translation), and natural language processing (recommendation engines, sentiment analysis) applications sit on top of deep learning frameworks (Mocha.jl among those shown), which in turn run on the NVIDIA Deep Learning SDK.]
TensorRT
NVIDIA DEEP LEARNING SOFTWARE PLATFORM
[Diagram: the NVIDIA Deep Learning SDK pairs a training framework (training data, data management, training, model assessment) that produces a trained neural network with TensorRT, which deploys it to data center, embedded, and automotive platforms.]
NVIDIA TensorRT: high-performance deep learning inference for production deployment
developer.nvidia.com/tensorrt
High performance neural network inference engine for production deployment
Generate optimized and deployment-ready models for datacenter, embedded and automotive platforms
Deliver high-performance, low-latency inference demanded by real-time services
Deploy faster, more responsive and memory efficient deep learning applications with INT8 and FP16 optimized precision support
[Chart: GoogLeNet images/sec vs batch size (2, 8, 128) for CPU-only, Tesla P40 + TensorRT (FP32), and Tesla P40 + TensorRT (INT8); up to 36x more images/sec. CPU server: 1-socket E5-2690 v4 @ 2.6 GHz, HT on. GPU server: 2-socket E5-2698 v3 @ 2.3 GHz, HT off, one P40 card in the box.]
WORKFLOW – GETTING A TRAINED MODEL INTO TensorRT
TensorRT Development Workflow
[Diagram: development workflow. A training framework produces a trained neural network; TensorRT optimizes it for a chosen batch size and precision, the result is validated using TensorRT, and the optimized PLAN is serialized to disk.]
TensorRT Production Workflow
[Diagram: production workflow. The serialized PLAN produced by the development workflow is loaded and executed by the TensorRT runtime.]
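The hand-off between the two workflows is the serialized PLAN. A minimal sketch of writing and reloading it (TensorRT 2.x-era API; the file name model.plan is hypothetical, and gLogger is an ILogger instance like the one sketched under the import slide below):

#include "NvInfer.h"
#include <fstream>
#include <iterator>
#include <vector>

// Development side: serialize the optimized engine to a PLAN file.
void save_plan(nvinfer1::ICudaEngine* engine)
{
    nvinfer1::IHostMemory* plan = engine->serialize();
    std::ofstream out("model.plan", std::ios::binary);
    out.write(static_cast<const char*>(plan->data()), plan->size());
    plan->destroy();
}

// Production side: reload the PLAN and rebuild the engine without the parser.
nvinfer1::ICudaEngine* load_plan()
{
    std::ifstream in("model.plan", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}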
TO IMPORT A TRAINED MODEL TO TensorRT
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
CaffeParser parser;
auto blob_name_to_tensor = parser.parse(<network definition>,<weights>,*network,<datatype>);
network->markOutput(*blob_name_to_tensor->find(<output layer name>));
builder->setMaxBatchSize(<size>);
builder->setMaxWorkspaceSize(<size>);
ICudaEngine* engine = builder->buildCudaEngine(*network);
Key function calls
This assumes you have a Caffe model file
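Filled in, a minimal end-to-end import might look like this sketch (TensorRT 2.x-era API; the file names and the output blob name "prob" are hypothetical):

#include "NvInfer.h"
#include "NvCaffeParser.h"
#include <iostream>

using namespace nvinfer1;
using namespace nvcaffeparser1;

// Minimal ILogger implementation; TensorRT reports build/runtime messages here.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO) std::cerr << msg << std::endl;
    }
} gLogger;

ICudaEngine* build_engine()
{
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    // Parse the Caffe deploy file and weights into the network definition.
    ICaffeParser* parser = createCaffeParser();
    auto blob_name_to_tensor = parser->parse("deploy.prototxt", "net.caffemodel",
                                             *network, DataType::kFLOAT);

    // Tell TensorRT which blob is the network output.
    network->markOutput(*blob_name_to_tensor->find("prob"));

    builder->setMaxBatchSize(16);           // largest batch this engine will see
    builder->setMaxWorkspaceSize(1 << 20);  // scratch memory for layer tactics
    return builder->buildCudaEngine(*network);
}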
IMPORTING USING THE GRAPH DEFINITION API
If using other frameworks, such as TensorFlow, you can call our network builder API:
ITensor* in = network->addInput("input", DataType::kFLOAT, Dims3{…});
IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, …);
etc.
From any framework: we are looking at a streamlined graph importer for TensorFlow, similar to our Caffe parser.
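A self-contained sketch of the same idea (TensorRT 2.x-era signatures from memory, so treat the details as assumptions; the dimensions and pooling parameters are hypothetical, and gLogger is the ILogger sketched under the import example above):

#include "NvInfer.h"
using namespace nvinfer1;

ICudaEngine* build_by_hand()
{
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    // A 3-channel 224x224 input tensor.
    ITensor* in = network->addInput("input", DataType::kFLOAT, Dims3{3, 224, 224});

    // 2x2 max pooling with stride 2.
    IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, DimsHW{2, 2});
    pool->setStride(DimsHW{2, 2});

    network->markOutput(*pool->getOutput(0));
    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(1 << 20);
    return builder->buildCudaEngine(*network);
}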
EXECUTE THE NEURAL NETWORK
IExecutionContext* context = engine->createExecutionContext();
<handle> = engine->getBindingIndex(<binding layer name>);
<malloc and cudaMalloc calls>   // allocate buffers for data moving in and out
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(<args>);        // copy input data to the GPU
context->enqueue(<args>);
cudaMemcpyAsync(<args>);        // copy output data to the host
cudaStreamSynchronize(stream);
Running inference using the API
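Filled in for a single input binding named "input" and a single output named "prob" (both hypothetical names carried over from the import sketch), the whole sequence might look like:

#include "NvInfer.h"
#include <cuda_runtime_api.h>

using namespace nvinfer1;

void infer(ICudaEngine* engine, const float* host_input, float* host_output,
           size_t in_bytes, size_t out_bytes, int batch_size)
{
    IExecutionContext* context = engine->createExecutionContext();

    // Binding indices are looked up by the blob names used at build time.
    int in_idx  = engine->getBindingIndex("input");
    int out_idx = engine->getBindingIndex("prob");

    void* buffers[2];
    cudaMalloc(&buffers[in_idx],  in_bytes);
    cudaMalloc(&buffers[out_idx], out_bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(buffers[in_idx], host_input, in_bytes,
                    cudaMemcpyHostToDevice, stream);        // input to the GPU
    context->enqueue(batch_size, buffers, stream, nullptr); // run the network
    cudaMemcpyAsync(host_output, buffers[out_idx], out_bytes,
                    cudaMemcpyDeviceToHost, stream);        // output to the host
    cudaStreamSynchronize(stream);                          // wait for completion

    cudaFree(buffers[in_idx]);
    cudaFree(buffers[out_idx]);
    cudaStreamDestroy(stream);
    context->destroy();
}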
THROUGHPUT
[Chart: ResNet-50 throughput (images/sec) vs batch size (1–128) for Caffe FP32 on CPU, Caffe FP32 on P100, TensorFlow FP32 on P100, TensorRT FP32 on P100, and TensorRT FP16 on P100. TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is the NV version with cuDNN 5; Caffe on CPU uses MKL on an E5-2690 v4 with 14 cores.]
LATENCY
[Chart: ResNet-50 latency (ms to execute a batch, log scale) vs batch size (1–128) for the same five configurations and software versions as the throughput chart.]
REDUCED PRECISION
SMALLER AND FASTER
[Charts: ResNet-50, batch size 128, TensorRT 2.1 RC pre-release. Performance (images/sec, scaled to FP32) and memory usage (scaled to FP32) for FP32, FP16 on P100, and INT8 on P40: the reduced-precision engines are both faster and smaller.]
INT8 INFERENCE
Challenge:
• INT8 has significantly lower precision and dynamic range than FP32
• Requires "smart" quantization and calibration from FP32 to INT8

Dynamic range and smallest positive value:
FP32: -3.4x10^38 ~ +3.4x10^38, min positive 1.4x10^-45
FP16: -65504 ~ +65504, min positive 5.96x10^-8
INT8: -128 ~ +127, min positive 1
QUANTIZATION OF WEIGHTS
I8_weight = Round_to_nearest_int( scaling_factor * F32_weight )
scaling_factor = 127.0f / max( abs( all_F32_weights_in_the_filter ) )
Symmetric, Linear Quantization
[-127, 127]
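A minimal C++ sketch of this scheme (a hypothetical helper, not TensorRT code): one scale factor per filter, symmetric around zero, no offset.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize one filter's FP32 weights to INT8 with a single scale factor.
std::vector<int8_t> quantize_weights(const std::vector<float>& w, float& scaling_factor)
{
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    if (max_abs == 0.0f) max_abs = 1.0f;   // all-zero filter: avoid divide by zero

    scaling_factor = 127.0f / max_abs;     // scaling_factor from the slide
    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        float r = std::round(scaling_factor * w[i]);   // round to nearest int
        q[i] = static_cast<int8_t>(std::min(127.0f, std::max(-127.0f, r)));
    }
    return q;
}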
QUANTIZATION OF ACTIVATIONS
I8_value = (value > threshold) ? threshold : scale * F32_value
How do you decide the optimal 'threshold'?
The activation range is unknown offline; it is input dependent.
Calibration using a 'representative' dataset.
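A sketch of the saturating version in the same style (a hypothetical helper; in TensorRT the threshold comes out of calibration rather than being hand-picked):

#include <algorithm>
#include <cmath>
#include <cstdint>

// Saturating quantization: values beyond +/-threshold clamp to +/-127;
// values inside the range are scaled linearly, as on the slide.
int8_t quantize_activation(float value, float threshold)
{
    float scale = 127.0f / threshold;
    float r = std::round(scale * value);
    return static_cast<int8_t>(std::min(127.0f, std::max(-127.0f, r)));
}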
TENSORRT INT8 Workflow
[Diagram: INT8 workflow. The FP32 neural network from the training framework goes through INT8 optimization using TensorRT, configured with a calibration dataset, batch size, and precision; the resulting INT8 PLAN is executed by the INT8 runtime.]
TURNING ON INT8 AND CALLING THE CALIBRATOR
builder->setInt8Mode(true);
IInt8Calibrator* calibrator;            // your calibrator implementation
builder->setInt8Calibrator(calibrator);
bool getBatch(<args>) override          // implemented by your calibrator
API calls
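A minimal sketch of what such a calibrator can look like (TensorRT 2.1-era interface, abbreviated from memory, so check the NvInfer.h of your version for the exact overrides; the batch size, buffer handling, and loadNextBatchToGPU helper are hypothetical). TensorRT calls getBatch() repeatedly, runs the network in FP32 over the calibration set, and uses the observed activation distributions to pick per-tensor thresholds:

#include "NvInfer.h"
#include <cstddef>

class MyCalibrator : public nvinfer1::IInt8Calibrator
{
public:
    int getBatchSize() const override { return kBatch; }

    // Fill bindings[] with device pointers for the next calibration batch;
    // return false when the calibration set is exhausted.
    bool getBatch(void* bindings[], const char* names[], int nbBindings) override
    {
        if (!loadNextBatchToGPU(deviceInput_))   // hypothetical data loader
            return false;
        bindings[0] = deviceInput_;              // single input binding assumed
        return true;
    }

    // No cache in this sketch: calibrate from scratch on every build.
    const void* readCalibrationCache(size_t& length) override { length = 0; return nullptr; }
    void writeCalibrationCache(const void* cache, size_t length) override {}
    // (depending on the TensorRT version, further overrides may be required)

private:
    static const int kBatch = 32;
    void* deviceInput_ = nullptr;                // cudaMalloc'd input buffer
    bool loadNextBatchToGPU(void* dst);          // fills dst with one batch
};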
8-BIT INFERENCE
Top-1 Accuracy
[Table: per-network Top-1 accuracy, FP32 vs INT8, with the difference and the performance gain.]
DEPLOYING ACCELERATED FUNCTIONS SUCH AS TensorRT
AS A MICROSERVICE WITH GPU REST ENGINE (GRE)
GPU REST ENGINE (GRE) SDK
Accelerated microservices for web and mobile
Supercomputer performance for hyperscale datacenters
Up to 50 teraflops per node, min ~250μs response time
Easy to develop new microservices
Open source, integrates with existing infrastructure
Easy to deploy & scale
Ready-to-run Dockerfile
[Diagram: clients reach the GPU REST Engine over HTTP (~250μs); example microservices include image classification, speech recognition, and image scaling.]
developer.nvidia.com/gre
WEB ARCHITECTURE WITH GRE
Create accelerated microservices
REST interfaces
Provide your own GPU kernel
GRE plugs in easily.
[Diagram: a web presentation layer backed by services (content, ident svc, ads, ICE, img data, analytics), with GRE instances alongside them, e.g. an image classification microservice.]
REST API
Hello World Microservice
[Diagram: a client calls the REST API; the Go HTTP layer routes the request to func EmptyKernel_Handler; the C++ app layer's benchmark_execute() acquires a ScopedContext<> and calls kernel_wrapper() on the CUDA host side, which launches empty_kernel<<<>>> on the GPU device. The Go, C++, and CUDA host layers all run on the host CPU.]
developer.nvidia.com/gre
Context Pool
[Diagram: a resource pool holds several Contexts per GPU (GPU1, GPU2); each incoming request takes a ScopedContext from the pool for the duration of its work and returns it afterwards.]
Classification Microservice
[Diagram: same layering as the Hello World service. A client calls the REST API; the Go HTTP layer routes to func classify; the C++ app layer's classifier_classify() acquires a ScopedContext<> and runs classify() on the GPU device layer.]
CLASSIFICATION.CPP (1/2)
constexpr static int kContextsPerDevice = 2;

classifier_ctx* classifier_initialize(char* model_file, char* trained_file,
                                      char* mean_file, char* label_file)
{
    try {
        int device_count;
        cudaError_t st = cudaGetDeviceCount(&device_count);
        ContextPool<CaffeContext> pool;
        for (int dev = 0; dev < device_count; ++dev) {
            for (int i = 0; i < kContextsPerDevice; ++i) {
                std::unique_ptr<CaffeContext> context(
                    new CaffeContext(model_file, trained_file,
                                     mean_file, label_file, dev));
                pool.Push(std::move(context));
            }
        }
    } catch (...) { /* ... */ }
    // (excerpt: wrapping the pool in a classifier_ctx and returning it is elided)
}
Two CaffeContexts per GPU, to allow latency hiding
CLASSIFICATION.CPP (2/2)
const char* classifier_classify(classifier_ctx* ctx, char* buffer, size_t length)
{
    try {
        // Take a CaffeContext from the pool for the duration of this request.
        ScopedContext<CaffeContext> context(ctx->pool);
        auto classifier = context->CaffeClassifier();
        predictions = classifier->Classify(img);
        /* Write the top N predictions in JSON format. */
    }
    // (excerpt: image decoding and error handling elided)
}
Uses a scoped context
Lower level classify routine
CONCLUSION
Inference is going to power an increasing number of features and capabilities.
Latency is important for responsive services
Throughput is important for controlling costs and scaling out
GPUs can deliver throughput and low latency
Reduced precision can be used for an extra boost
There is a template to follow for creating accelerated microservices
WANT TO LEARN MORE?
GPU Technology Conference
May 8-11 in San Jose
S7310 - 8-Bit Inference with TensorRT - Szymon Migacz
S7458 - Deploying unique DL Networks as Micro-Services with TensorRT, user-extensible layers, and GPU REST Engine - Chris Gottbrath
9 Spark and 17 TensorFlow sessions
20% off discount code: NVCGOTT
developer.nvidia.com/tensorrt
developer.nvidia.com/gre
devblogs.nvidia.com/parallelforall/
NVIDIA Jetson TX2 Delivers Twice …
Production Deep Learning …
www.nvidia.com/en-us/deep-learning-ai/education/
github.com/dusty-nv/jetson-inference
Resources to check out
THANKS
RESOURCE SLIDES
main.go
func EmptyKernel_Handler(w http.ResponseWriter, r *http.Request) {
	C.benchmark_execute(benchmark_ctx, (*C.char)(unsafe.Pointer(&message[0])))
	io.WriteString(w, string(message[:]))
}

func main() {
	http.HandleFunc("/EmptyKernel/", EmptyKernel_Handler)
	http.ListenAndServe(":8000", nil)
}
Calls the C func
Execute server
Set API URL
benchmark.cpp (1/2)
constexpr static int kContextsPerDevice = 4;

benchmark_ctx* benchmark_initialize()
{
    int device_count;
    cudaGetDeviceCount(&device_count);
    ContextPool<BenchmarkContext> pool;
    for (int dev = 0; dev < device_count; ++dev) {
        for (int i = 0; i < kContextsPerDevice; ++i) {
            std::unique_ptr<BenchmarkContext> context(new BenchmarkContext(dev));
            pool.Push(std::move(context));
        }
    }
    // (excerpt: wrapping the pool in a benchmark_ctx and returning it is elided)
}
4 per GPU
Get # GPUs
Create pool
benchmark.cpp (2/2)
void benchmark_execute(benchmark_ctx* ctx, char* message)
{
    // Take a context (and its CUDA stream) from the pool for this request.
    ScopedContext<BenchmarkContext> context(ctx->pool);
    cudaStream_t stream = context->CUDAStream();
    kernel_wrapper(stream, message);
}
Scoped Context
Run the wrapper
kernel.cu

__global__ void empty_kernel(char* device_message)
{
    // GPU code: copy a greeting into the output buffer.
    const char message[50] = "Hello world from an (almost) empty CUDA kernel :)";
    for (int i = 0; i < 50; i++) {
        device_message[i] = message[i];
        if (message[i] == '\0') break;
    }
}

void kernel_wrapper(cudaStream_t stream, char* message)
{
    // Host-side wrapper: allocate buffers, launch on the request's stream,
    // then copy the result back for the Go layer.
    char* device_message;
    char* host_message;
    cudaHostAlloc((void**)&device_message, message_size, cudaHostAllocDefault);
    host_message = (char*)malloc(message_size);
    empty_kernel<<<1, 1, 0, stream>>>(device_message);
    cudaMemcpy(host_message, device_message, message_size, cudaMemcpyDeviceToHost);
    strncpy(message, host_message, message_size);
}
GPU code
Device call
Host side wrapper
TensorRT
• Convolution: Currently only 2D convolutions
• Activation: ReLU, tanh and sigmoid
• Pooling: max and average
• Scale: similar to Caffe Power layer (shift+scale*x)^p
• ElementWise: sum, product or max of two tensors
• LRN: cross-channel only
• Fully-connected: with or without bias
• SoftMax: cross-channel only
• Deconvolution
Layer Types Supported
TENSORRT Optimizations
• Fuse network layers
• Eliminate concatenation layers
• Kernel specialization
• Auto-tuning for target platform
• Tuned for given batch size
[Diagram: trained neural network in, optimized inference runtime out]
GRAPH OPTIMIZATION Unoptimized network
[Diagram: an unoptimized Inception-style graph. The input feeds 1x1, 3x3, and 5x5 convolutions (each followed by bias and ReLU) plus a max pool branch with its own 1x1 convolution; concat layers join the branch outputs before the next input.]
GRAPH OPTIMIZATION Vertical fusion
[Diagram: after vertical fusion, each convolution + bias + ReLU sequence collapses into a single fused "CBR" kernel (1x1, 3x3, and 5x5 CBR), with the max pool branch unchanged.]
GRAPH OPTIMIZATION Horizontal fusion
[Diagram: after horizontal fusion, the parallel 1x1 CBR kernels that read the same input merge into a single 1x1 CBR.]
GRAPH OPTIMIZATION Concat elision
[Diagram: after concat elision, the concat layer disappears; each kernel writes directly into the buffer the next layer consumes.]
INT8 precision: new in TensorRT
ACCURACY, EFFICIENCY, PERFORMANCE
[Charts: GoogLeNet, FP32 vs INT8 precision with TensorRT on a Tesla P40; 2-socket Haswell E5-2698 v3 @ 2.3 GHz, HT off. Panels: "Up To 3x More Images/sec with INT8" (images/sec vs batch size 2, 4, 128), "Deploy 2x Larger Models with INT8" (memory in MB vs batch size), and "Deliver full accuracy with INT8 precision" (Top-1 and Top-5 accuracy, FP32 vs INT8).]
IDP.4A – 8-BIT INSTRUCTION
[Diagram: four-way INT8 dot product with 32-bit accumulate: i32 += i8*i8 + i8*i8 + i8*i8 + i8*i8]
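On Pascal (sm_61) this operation is exposed in CUDA 8 as the __dp4a intrinsic. A minimal sketch of using it (a hypothetical kernel, not from the talk; compile with nvcc -arch=sm_61):

#include <cuda_runtime.h>

// Each int packs four signed 8-bit values; __dp4a multiplies the four byte
// pairs and adds the products to a 32-bit accumulator in one instruction.
__global__ void int8_dot(const int* a, const int* b, int* out, int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc = __dp4a(a[i], b[i], acc);   // acc += sum of the 4 byte products
    *out = acc;
}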