Advanced Spark and TensorFlow Meetup 2017-05-06: Reduced Precision (FP16, INT8) Inference on...
TRANSCRIPT
Apr 2017 – Chris Gottbrath
REDUCED PRECISION (FP16, INT8) INFERENCE ON CONVOLUTIONAL NEURAL NETWORKS WITH TENSORRT AND NVIDIA PASCAL
AGENDA
Deep Learning
TensorRT
Reduced Precision
GPU REST Engine
Conclusion
NEW AI SERVICES POSSIBLE WITH GPU CLOUD
SPOTIFY SONG RECOMMENDATIONS
NETFLIX VIDEO RECOMMENDATIONS
YELP SELECTING COVER PHOTOS
TESLA REVOLUTIONIZES DEEP LEARNING
NEURAL NETWORK APPLICATION
BEFORE TESLA → AFTER TESLA
Cost: $5,000K → $200K
Servers: 1,000 Servers → 16 Tesla Servers
Energy: 600 KW → 4 KW
Performance: 1x → 6x
NVIDIA DEEP LEARNING SDK
Powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications
High performance building blocks for training and deploying deep neural networks on NVIDIA GPUs
Industry vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks
Multi-GPU scaling that accelerates training on up to eight GPUs
High performance GPU-acceleration for deep learning
“We are amazed by the steady stream of improvements made to the NVIDIA Deep Learning SDK and the speedups that they deliver.”
— Frédéric Bastien, Team Lead (Theano), MILA
developer.nvidia.com/deep-learning-software
POWERING THE DEEP LEARNING ECOSYSTEM: NVIDIA SDK accelerates every major framework
[Diagram: computer vision (object detection, image classification), speech & audio (voice recognition, language translation), and natural language processing (recommendation engines, sentiment analysis) applications sit on top of deep learning frameworks (Mocha.jl among those shown), which in turn run on the NVIDIA Deep Learning SDK.]
TensorRT
NVIDIA DEEP LEARNING SOFTWARE PLATFORM
[Diagram: the NVIDIA Deep Learning SDK pairs a training framework (training data, data management, training, model assessment) that produces a trained neural network with TensorRT, which deploys it to data center, embedded, and automotive platforms.]
NVIDIA TensorRT: high-performance deep learning inference for production deployment
developer.nvidia.com/tensorrt
High performance neural network inference engine for production deployment
Generate optimized and deployment-ready models for datacenter, embedded and automotive platforms
Deliver high-performance, low-latency inference demanded by real-time services
Deploy faster, more responsive and memory efficient deep learning applications with INT8 and FP16 optimized precision support
[Chart: GoogLeNet images/sec vs batch size (2, 8, 128) for CPU-only, Tesla P40 + TensorRT (FP32), and Tesla P40 + TensorRT (INT8); up to 36x more images/sec. CPU server: 1-socket E5-2690 v4 @ 2.6 GHz, HT on. GPU server: 2-socket E5-2698 v3 @ 2.3 GHz, HT off, one P40 card in the box.]
WORKFLOW – GETTING A TRAINED MODEL INTO TensorRT
TensorRT Development Workflow
[Diagram: development workflow. A training framework produces a trained neural network; TensorRT optimizes it for a chosen batch size and precision, the result is validated using TensorRT, and the optimized PLAN is serialized to disk.]
TensorRT Production Workflow
[Diagram: production workflow. The serialized PLAN produced by the development workflow is loaded and executed by the TensorRT runtime.]
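The hand-off between the two workflows is the serialized PLAN. A minimal sketch of writing and reloading it (TensorRT 2.x-era API; the file name model.plan is hypothetical, and gLogger is an ILogger instance like the one sketched under the import slide below):

#include "NvInfer.h"
#include <fstream>
#include <iterator>
#include <vector>

// Development side: serialize the optimized engine to a PLAN file.
void save_plan(nvinfer1::ICudaEngine* engine)
{
    nvinfer1::IHostMemory* plan = engine->serialize();
    std::ofstream out("model.plan", std::ios::binary);
    out.write(static_cast<const char*>(plan->data()), plan->size());
    plan->destroy();
}

// Production side: reload the PLAN and rebuild the engine without the parser.
nvinfer1::ICudaEngine* load_plan()
{
    std::ifstream in("model.plan", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}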
TO IMPORT A TRAINED MODEL TO TensorRT
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
CaffeParser parser;
auto blob_name_to_tensor = parser.parse(<network definition>,<weights>,*network,<datatype>);
network->markOutput(*blob_name_to_tensor->find(<output layer name>));
builder->setMaxBatchSize(<size>);
builder->setMaxWorkspaceSize(<size>);
ICudaEngine* engine = builder->buildCudaEngine(*network);
Key function calls
This assumes you have a Caffe model file
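Filled in, a minimal end-to-end import might look like this sketch (TensorRT 2.x-era API; the file names and the output blob name "prob" are hypothetical):

#include "NvInfer.h"
#include "NvCaffeParser.h"
#include <iostream>

using namespace nvinfer1;
using namespace nvcaffeparser1;

// Minimal ILogger implementation; TensorRT reports build/runtime messages here.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO) std::cerr << msg << std::endl;
    }
} gLogger;

ICudaEngine* build_engine()
{
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    // Parse the Caffe deploy file and weights into the network definition.
    ICaffeParser* parser = createCaffeParser();
    auto blob_name_to_tensor = parser->parse("deploy.prototxt", "net.caffemodel",
                                             *network, DataType::kFLOAT);

    // Tell TensorRT which blob is the network output.
    network->markOutput(*blob_name_to_tensor->find("prob"));

    builder->setMaxBatchSize(16);           // largest batch this engine will see
    builder->setMaxWorkspaceSize(1 << 20);  // scratch memory for layer tactics
    return builder->buildCudaEngine(*network);
}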
IMPORTING USING THE GRAPH DEFINITION API
If using other frameworks, such as TensorFlow, you can call our network builder API:
ITensor* in = network->addInput("input", DataType::kFLOAT, Dims3{…});
IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, …);
etc.
From any framework: we are looking at a streamlined graph importer for TensorFlow, similar to our Caffe parser.
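A self-contained sketch of the same idea (TensorRT 2.x-era signatures from memory, so treat the details as assumptions; the dimensions and pooling parameters are hypothetical, and gLogger is the ILogger sketched under the import example above):

#include "NvInfer.h"
using namespace nvinfer1;

ICudaEngine* build_by_hand()
{
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    // A 3-channel 224x224 input tensor.
    ITensor* in = network->addInput("input", DataType::kFLOAT, Dims3{3, 224, 224});

    // 2x2 max pooling with stride 2.
    IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, DimsHW{2, 2});
    pool->setStride(DimsHW{2, 2});

    network->markOutput(*pool->getOutput(0));
    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(1 << 20);
    return builder->buildCudaEngine(*network);
}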
EXECUTE THE NEURAL NETWORK
IExecutionContext* context = engine->createExecutionContext();
<handle> = engine->getBindingIndex(<binding layer name>);
<malloc and cudaMalloc calls>   // allocate buffers for data moving in and out
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(<args>);        // copy input data to the GPU
context->enqueue(<args>);
cudaMemcpyAsync(<args>);        // copy output data to the host
cudaStreamSynchronize(stream);
Running inference using the API
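Filled in for a single input binding named "input" and a single output named "prob" (both hypothetical names carried over from the import sketch), the whole sequence might look like:

#include "NvInfer.h"
#include <cuda_runtime_api.h>

using namespace nvinfer1;

void infer(ICudaEngine* engine, const float* host_input, float* host_output,
           size_t in_bytes, size_t out_bytes, int batch_size)
{
    IExecutionContext* context = engine->createExecutionContext();

    // Binding indices are looked up by the blob names used at build time.
    int in_idx  = engine->getBindingIndex("input");
    int out_idx = engine->getBindingIndex("prob");

    void* buffers[2];
    cudaMalloc(&buffers[in_idx],  in_bytes);
    cudaMalloc(&buffers[out_idx], out_bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(buffers[in_idx], host_input, in_bytes,
                    cudaMemcpyHostToDevice, stream);        // input to the GPU
    context->enqueue(batch_size, buffers, stream, nullptr); // run the network
    cudaMemcpyAsync(host_output, buffers[out_idx], out_bytes,
                    cudaMemcpyDeviceToHost, stream);        // output to the host
    cudaStreamSynchronize(stream);                          // wait for completion

    cudaFree(buffers[in_idx]);
    cudaFree(buffers[out_idx]);
    cudaStreamDestroy(stream);
    context->destroy();
}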
THROUGHPUT
[Chart: ResNet-50 throughput (images/sec) vs batch size (1–128) for Caffe FP32 on CPU, Caffe FP32 on P100, TensorFlow FP32 on P100, TensorRT FP32 on P100, and TensorRT FP16 on P100. TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is the NV version with cuDNN 5; Caffe on CPU uses MKL on an E5-2690 v4 with 14 cores.]
LATENCY
[Chart: ResNet-50 latency (ms to execute a batch, log scale) vs batch size (1–128) for the same five configurations and software versions as the throughput chart.]
REDUCED PRECISION
SMALLER AND FASTER
[Charts: ResNet-50, batch size 128, TensorRT 2.1 RC pre-release. Performance (images/sec, scaled to FP32) and memory usage (scaled to FP32) for FP32, FP16 on P100, and INT8 on P40: the reduced-precision engines are both faster and smaller.]
INT8 INFERENCE
Challenge:
• INT8 has significantly lower precision and dynamic range than FP32
• Requires "smart" quantization and calibration from FP32 to INT8

Dynamic range and smallest positive value:
FP32: -3.4x10^38 ~ +3.4x10^38, min positive 1.4x10^-45
FP16: -65504 ~ +65504, min positive 5.96x10^-8
INT8: -128 ~ +127, min positive 1
QUANTIZATION OF WEIGHTS
I8_weight = Round_to_nearest_int( scaling_factor * F32_weight )
scaling_factor = 127.0f / max( abs( all_F32_weights_in_the_filter ) )
Symmetric, Linear Quantization
[-127, 127]
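A minimal C++ sketch of this scheme (a hypothetical helper, not TensorRT code): one scale factor per filter, symmetric around zero, no offset.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize one filter's FP32 weights to INT8 with a single scale factor.
std::vector<int8_t> quantize_weights(const std::vector<float>& w, float& scaling_factor)
{
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    if (max_abs == 0.0f) max_abs = 1.0f;   // all-zero filter: avoid divide by zero

    scaling_factor = 127.0f / max_abs;     // scaling_factor from the slide
    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        float r = std::round(scaling_factor * w[i]);   // round to nearest int
        q[i] = static_cast<int8_t>(std::min(127.0f, std::max(-127.0f, r)));
    }
    return q;
}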
QUANTIZATION OF ACTIVATIONS
I8_value = (value > threshold) ? threshold : scale * F32_value
How do you decide the optimal 'threshold'?
The activation range is unknown offline; it is input dependent.
Calibration using a 'representative' dataset.
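A sketch of the saturating version in the same style (a hypothetical helper; in TensorRT the threshold comes out of calibration rather than being hand-picked):

#include <algorithm>
#include <cmath>
#include <cstdint>

// Saturating quantization: values beyond +/-threshold clamp to +/-127;
// values inside the range are scaled linearly, as on the slide.
int8_t quantize_activation(float value, float threshold)
{
    float scale = 127.0f / threshold;
    float r = std::round(scale * value);
    return static_cast<int8_t>(std::min(127.0f, std::max(-127.0f, r)));
}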
TENSORRT INT8 Workflow
[Diagram: INT8 workflow. The FP32 neural network from the training framework goes through INT8 optimization using TensorRT, configured with a calibration dataset, batch size, and precision; the resulting INT8 PLAN is executed by the INT8 runtime.]
TURNING ON INT8 AND CALLING THE CALIBRATOR
builder->setInt8Mode(true);
IInt8Calibrator* calibrator;            // your calibrator implementation
builder->setInt8Calibrator(calibrator);
bool getBatch(<args>) override          // implemented by your calibrator
API calls
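A minimal sketch of what such a calibrator can look like (TensorRT 2.1-era interface, abbreviated from memory, so check the NvInfer.h of your version for the exact overrides; the batch size, buffer handling, and loadNextBatchToGPU helper are hypothetical). TensorRT calls getBatch() repeatedly, runs the network in FP32 over the calibration set, and uses the observed activation distributions to pick per-tensor thresholds:

#include "NvInfer.h"
#include <cstddef>

class MyCalibrator : public nvinfer1::IInt8Calibrator
{
public:
    int getBatchSize() const override { return kBatch; }

    // Fill bindings[] with device pointers for the next calibration batch;
    // return false when the calibration set is exhausted.
    bool getBatch(void* bindings[], const char* names[], int nbBindings) override
    {
        if (!loadNextBatchToGPU(deviceInput_))   // hypothetical data loader
            return false;
        bindings[0] = deviceInput_;              // single input binding assumed
        return true;
    }

    // No cache in this sketch: calibrate from scratch on every build.
    const void* readCalibrationCache(size_t& length) override { length = 0; return nullptr; }
    void writeCalibrationCache(const void* cache, size_t length) override {}
    // (depending on the TensorRT version, further overrides may be required)

private:
    static const int kBatch = 32;
    void* deviceInput_ = nullptr;                // cudaMalloc'd input buffer
    bool loadNextBatchToGPU(void* dst);          // fills dst with one batch
};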
8-BIT INFERENCE
Top-1 Accuracy
[Table: per-network Top-1 accuracy, FP32 vs INT8, with the difference and the performance gain.]
DEPLOYING ACCELERATED FUNCTIONS SUCH AS TensorRT
AS A MICROSERVICE WITH GPU REST ENGINE (GRE)
GPU REST ENGINE (GRE) SDK
Accelerated microservices for web and mobile
Supercomputer performance for hyperscale datacenters
Up to 50 teraflops per node, min ~250μs response time
Easy to develop new microservices
Open source, integrates with existing infrastructure
Easy to deploy & scale
Ready-to-run Dockerfile
[Diagram: clients reach the GPU REST Engine over HTTP (~250μs); example microservices include image classification, speech recognition, and image scaling.]
developer.nvidia.com/gre
WEB ARCHITECTURE WITH GRE
Create accelerated microservices
REST interfaces
Provide your own GPU kernel
GRE plugs in easily.
[Diagram: a web presentation layer backed by services (content, ident svc, ads, ICE, img data, analytics), with GRE instances alongside them, e.g. an image classification microservice.]
REST API
Hello World Microservice
[Diagram: a client calls the REST API; the Go HTTP layer routes the request to func EmptyKernel_Handler; the C++ app layer's benchmark_execute() acquires a ScopedContext<> and calls kernel_wrapper() on the CUDA host side, which launches empty_kernel<<<>>> on the GPU device. The Go, C++, and CUDA host layers all run on the host CPU.]
developer.nvidia.com/gre
Context Pool
[Diagram: a resource pool holds several Contexts per GPU (GPU1, GPU2); each incoming request takes a ScopedContext from the pool for the duration of its work and returns it afterwards.]
Classification Microservice
[Diagram: same layering as the Hello World service. A client calls the REST API; the Go HTTP layer routes to func classify; the C++ app layer's classifier_classify() acquires a ScopedContext<> and runs classify() on the GPU device layer.]
CLASSIFICATION.CPP (1/2)
constexpr static int kContextsPerDevice = 2;

classifier_ctx* classifier_initialize(char* model_file, char* trained_file,
                                      char* mean_file, char* label_file)
{
    try {
        int device_count;
        cudaError_t st = cudaGetDeviceCount(&device_count);
        ContextPool<CaffeContext> pool;
        for (int dev = 0; dev < device_count; ++dev) {
            for (int i = 0; i < kContextsPerDevice; ++i) {
                std::unique_ptr<CaffeContext> context(
                    new CaffeContext(model_file, trained_file,
                                     mean_file, label_file, dev));
                pool.Push(std::move(context));
            }
        }
    } catch (...) { /* ... */ }
    // (excerpt: wrapping the pool in a classifier_ctx and returning it is elided)
}
Two CaffeContexts per GPU, to allow latency hiding
CLASSIFICATION.CPP (2/2)
const char* classifier_classify(classifier_ctx* ctx, char* buffer, size_t length)
{
    try {
        // Take a CaffeContext from the pool for the duration of this request.
        ScopedContext<CaffeContext> context(ctx->pool);
        auto classifier = context->CaffeClassifier();
        predictions = classifier->Classify(img);
        /* Write the top N predictions in JSON format. */
    }
    // (excerpt: image decoding and error handling elided)
}
Uses a scoped context
Lower level classify routine
CONCLUSION
Inference is going to power an increasing number of features and capabilities.
Latency is important for responsive services
Throughput is important for controlling costs and scaling out
GPUs can deliver throughput and low latency
Reduced precision can be used for an extra boost
There is a template to follow for creating accelerated microservices
WANT TO LEARN MORE?
GPU Technology Conference
May 8-11 in San Jose
S7310 - 8-Bit Inference with TensorRT - Szymon Migacz
S7458 - Deploying unique DL Networks as Micro-Services with TensorRT, user-extensible layers, and GPU REST Engine - Chris Gottbrath
9 Spark and 17 TensorFlow sessions
20% off discount code: NVCGOTT
developer.nvidia.com/tensorrt
developer.nvidia.com/gre
devblogs.nvidia.com/parallelforall/
NVIDIA Jetson TX2 Delivers Twice …
Production Deep Learning …
www.nvidia.com/en-us/deep-learning-ai/education/
github.com/dusty-nv/jetson-inference
Resources to check out
THANKS
RESOURCE SLIDES
main.go
func EmptyKernel_Handler(w http.ResponseWriter, r *http.Request) {
	C.benchmark_execute(benchmark_ctx, (*C.char)(unsafe.Pointer(&message[0])))
	io.WriteString(w, string(message[:]))
}

func main() {
	http.HandleFunc("/EmptyKernel/", EmptyKernel_Handler)
	http.ListenAndServe(":8000", nil)
}
Calls the C func
Execute server
Set API URL
benchmark.cpp (1/2)
constexpr static int kContextsPerDevice = 4;

benchmark_ctx* benchmark_initialize()
{
    int device_count;
    cudaGetDeviceCount(&device_count);
    ContextPool<BenchmarkContext> pool;
    for (int dev = 0; dev < device_count; ++dev) {
        for (int i = 0; i < kContextsPerDevice; ++i) {
            std::unique_ptr<BenchmarkContext> context(new BenchmarkContext(dev));
            pool.Push(std::move(context));
        }
    }
    // (excerpt: wrapping the pool in a benchmark_ctx and returning it is elided)
}
4 per GPU
Get # GPUs
Create pool
benchmark.cpp (2/2)
void benchmark_execute(benchmark_ctx* ctx, char* message)
{
    // Take a context (and its CUDA stream) from the pool for this request.
    ScopedContext<BenchmarkContext> context(ctx->pool);
    cudaStream_t stream = context->CUDAStream();
    kernel_wrapper(stream, message);
}
Scoped Context
Run the wrapper
kernel.cu

__global__ void empty_kernel(char* device_message)
{
    // GPU code: copy a greeting into the output buffer.
    const char message[50] = "Hello world from an (almost) empty CUDA kernel :)";
    for (int i = 0; i < 50; i++) {
        device_message[i] = message[i];
        if (message[i] == '\0') break;
    }
}

void kernel_wrapper(cudaStream_t stream, char* message)
{
    // Host-side wrapper: allocate buffers, launch on the request's stream,
    // then copy the result back for the Go layer.
    char* device_message;
    char* host_message;
    cudaHostAlloc((void**)&device_message, message_size, cudaHostAllocDefault);
    host_message = (char*)malloc(message_size);
    empty_kernel<<<1, 1, 0, stream>>>(device_message);
    cudaMemcpy(host_message, device_message, message_size, cudaMemcpyDeviceToHost);
    strncpy(message, host_message, message_size);
}
GPU code
Device call
Host side wrapper
TensorRT
• Convolution: Currently only 2D convolutions
• Activation: ReLU, tanh and sigmoid
• Pooling: max and average
• Scale: similar to Caffe Power layer (shift+scale*x)^p
• ElementWise: sum, product or max of two tensors
• LRN: cross-channel only
• Fully-connected: with or without bias
• SoftMax: cross-channel only
• Deconvolution
Layer Types Supported
TENSORRT Optimizations
• Fuse network layers
• Eliminate concatenation layers
• Kernel specialization
• Auto-tuning for target platform
• Tuned for given batch size
[Diagram: trained neural network in, optimized inference runtime out]
GRAPH OPTIMIZATION Unoptimized network
[Diagram: an unoptimized Inception-style graph. The input feeds 1x1, 3x3, and 5x5 convolutions (each followed by bias and ReLU) plus a max pool branch with its own 1x1 convolution; concat layers join the branch outputs before the next input.]
GRAPH OPTIMIZATION Vertical fusion
[Diagram: after vertical fusion, each convolution + bias + ReLU sequence collapses into a single fused "CBR" kernel (1x1, 3x3, and 5x5 CBR), with the max pool branch unchanged.]
GRAPH OPTIMIZATION Horizontal fusion
[Diagram: after horizontal fusion, the parallel 1x1 CBR kernels that read the same input merge into a single 1x1 CBR.]
GRAPH OPTIMIZATION Concat elision
[Diagram: after concat elision, the concat layer disappears; each kernel writes directly into the buffer the next layer consumes.]
INT8 precision: new in TensorRT
ACCURACY, EFFICIENCY, PERFORMANCE
[Charts: GoogLeNet, FP32 vs INT8 precision with TensorRT on a Tesla P40; 2-socket Haswell E5-2698 v3 @ 2.3 GHz, HT off. Panels: "Up To 3x More Images/sec with INT8" (images/sec vs batch size 2, 4, 128), "Deploy 2x Larger Models with INT8" (memory in MB vs batch size), and "Deliver full accuracy with INT8 precision" (Top-1 and Top-5 accuracy, FP32 vs INT8).]
IDP.4A – 8-BIT INSTRUCTION
[Diagram: four-way INT8 dot product with 32-bit accumulate: i32 += i8*i8 + i8*i8 + i8*i8 + i8*i8]
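On Pascal (sm_61) this operation is exposed in CUDA 8 as the __dp4a intrinsic. A minimal sketch of using it (a hypothetical kernel, not from the talk; compile with nvcc -arch=sm_61):

#include <cuda_runtime.h>

// Each int packs four signed 8-bit values; __dp4a multiplies the four byte
// pairs and adds the products to a 32-bit accumulator in one instruction.
__global__ void int8_dot(const int* a, const int* b, int* out, int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc = __dp4a(a[i], b[i], acc);   // acc += sum of the 4 byte products
    *out = acc;
}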