inference optimization using tensorrt with use cases · 2018. 11. 26. · mixture of experts neural...
TRANSCRIPT
![Page 1: Inference Optimization Using TensorRT with Use Cases · 2018. 11. 26. · Mixture of Experts Neural Collaborative Filtering Block Sparse LSTM Capsule Nets. Inefficiency Limits Innovation](https://reader034.vdocuments.site/reader034/viewer/2022052100/603a1b3f71a04a310142efaf/html5/thumbnails/1.jpg)
Jack Han / 한재근 | Solutions Architect | NVIDIA
Inference Optimization Using TensorRT
with Use Cases
TensorRT 4 Adoption
Video
Maps
Image
NLP
Speech
Search
Use Cases
AI Inference is exploding
SPEECH: 1 Billion Voice Searches Per Day (Google, Bing, etc.)
LIVE VIDEO: 1 Billion Videos Watched Per Day
RECOMMENDATIONS: 1 Trillion Ads/Rankings Per Day (Impressions)
Bigger and More Compute Intensive in quest for accuracy
[Chart: Image (GOP * Bandwidth), 2011 to 2017. Models: ResNet-50, Inception-v2, Inception-v4, AlexNet, GoogleNet. 350X]
[Chart: Speech (GOP * Bandwidth), 2013 to 2018. Models: DeepSpeech, DeepSpeech 2, DeepSpeech 3. 30X]
[Chart: Translation (GOP * Bandwidth), 2014 to 2018. Models: MoE, OpenNMT, GNMT. 10X]
Cambrian Explosion
Convolutional Networks: ReLU, Encoder/Decoder, BatchNorm, Dropout, Pooling, Concat
Recurrent Networks: GRU, LSTM, CTC, Beam Search, WaveNet, Attention
Generative Adversarial Networks: Speech Enhancement GAN, Coupled GAN, Conditional GAN, MedGAN, 3D-GAN
Reinforcement Learning: DQN, Simulation, DDPG
New Species: Neural Collaborative Filtering, Mixture of Experts, Block Sparse LSTM, Capsule Nets
Inefficiency Limits Innovation
Custom Development: Developers need to reinvent the plumbing for every application
Single Model Only: Some systems are overused while others are underutilized (ASR, NLP, Recommender)
Single Framework Only: Solutions can only support models from one framework
More Accuracy = Time
Speed/accuracy trade-offs for modern convolutional object detectors, Google Research
Inference Performance
Latency, Efficiency, Throughput
Easy Deploy
Inference Optimization
Lightweight Model
Small Computation
Suitable for Lightweight Hardware
E.g.: MobileNet, SqueezeNet, SEP-Nets
Deep Compression
No Network Modification
Continuous Innovation
E.g.: Quantization and Binarization, Network Pruning and Sharing, Distillation / Factorization
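As one illustration of the compression techniques listed above, network pruning can be sketched as magnitude pruning: the weights with the smallest absolute values are set to zero. This is a minimal sketch and not a method taken from the slides; the function name `magnitude_prune` and the weight values are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    # Magnitude of the k-th smallest weight acts as the cut-off.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)
            removed += 1
        else:
            pruned.append(w)
    return pruned


# Hypothetical weights of one small layer.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, 0.5)
```

In practice pruning is usually followed by fine-tuning to recover accuracy, which this sketch leaves out.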
TensorRT
From framework to target
DRIVE PX 2
JETSON TX2
NVIDIA DLA
TESLA T4
TESLA V100
FRAMEWORKS GPU PLATFORMS
TensorRT
Optimizer Runtime
Speed-up
[Chart: ResNet-50 on V100. X axis: Batch Response time (ms), 0 to 60. Y axis: Throughput (images/sec), 0 to 5000. Series: TensorRT INT8, TensorRT FP16, TensorRT FP32, GPU Native FP32, CPU Native FP32]
TensorRT 5
More Layers / Plugin / APIs
Inference Server Integration
TensorRT Workflow
Step 1. Optimize trained model
[Diagram: Model → Optimize → Plan]
TensorRT Workflow
Step 2. Deploy optimized plan
[Diagram: Plan → Runtime Engine → Embedded, Automotive, Data center]
Step-by-Step TensorRT Plan Build
[Diagram: Network C++/Python API and Model Parser → Network Definitions → TensorRT Builder → Engine]
https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#support_op
7 Steps to Deployment with TensorRT
Step 1: Convert trained model into TensorRT format
Step 2: Create a model parser
Step 3: Register inputs and outputs
Step 4: Optimize model and create a runtime engine
Step 5: Serialize optimized engine
Step 6: De-serialize engine
Step 7: Perform inference
Custom Layer Support using Plugin Layer
[Diagram: Network C++/Python API and Model Parser → Network Definitions → TensorRT Builder → Engine, with a Plugin Factory (Plugin A, Plugin B)]
Plugin Layer Procedure
Participants: Plugin Factory, iCudaEngine, Plugin Layer
Phase 1. Initialize: Layer?, Initialize (missing)
Phase 2. Inference: enqueue
Phase 3. Store/Load: Serialize, Deserialize
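The three phases can be illustrated with a toy layer in plain Python. This is not the TensorRT plugin interface: the class `ScalePlugin` and its byte layout are hypothetical, and the method names only mirror the phase names on the slide (initialize, enqueue, serialize, deserialize).

```python
import struct


class ScalePlugin:
    """Toy custom layer that multiplies its input by a constant (hypothetical)."""

    def __init__(self, scale):
        self.scale = scale
        self.ready = False

    def initialize(self):
        # Phase 1: prepare resources before the first inference call.
        self.ready = True

    def enqueue(self, inputs):
        # Phase 2: run the layer on one batch of inputs.
        if not self.ready:
            raise RuntimeError("initialize() must be called before enqueue()")
        return [x * self.scale for x in inputs]

    def serialize(self):
        # Phase 3 (store): pack the layer parameters into bytes.
        return struct.pack("<f", self.scale)

    @classmethod
    def deserialize(cls, blob):
        # Phase 3 (load): rebuild the layer from the stored bytes.
        (scale,) = struct.unpack("<f", blob)
        return cls(scale)


plugin = ScalePlugin(0.5)
plugin.initialize()
stored = plugin.serialize()

# A restored layer must be initialized again before it can run.
restored = ScalePlugin.deserialize(stored)
restored.initialize()
outputs = restored.enqueue([2.0, 4.0])
```

The point of the store/load phase is that a layer rebuilt from bytes behaves like the original, so custom layers can travel inside a serialized plan.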
Supporting Features

| Feature | C++ | Python (x86 only) | NvCaffeParser | NvUffParser |
| --- | --- | --- | --- | --- |
| CNNs | Yes | Yes | Yes | Yes |
| RNNs | Yes | Yes | No | No |
| INT8 Calibration | Yes | Yes | N/A | N/A |
| Asymmetric Padding | Yes | Yes | No | No |
TensorRT API
Define your network using C/C++ or Python
INetworkDefinition* network = builder->createNetwork();
convertRNNWeights, convertRNNBias
setWeightsForGate, setBiasForGate
TensorRT RNN Programming
[Diagram: Dump Weights → Load Weights → Convert Weights → Set Weights. Labels: ckpt, wts, TF format, TRT format, TensorRT RNN Layer]
/usr/src/tensorrt/samples/common/dumpTFWts.py
Get started with sampleCharRNN in documentation
NMT with Transformer Inference
Running 3 bound engines
(Encoder, Generator, Shuffle Engine)
[Diagram blocks: Input, Input Setup, Encoder (Encoder RNN), Decoder (Decoder RNN, Attention Model, Projection, TopK), Beam Search (Beam Shuffle, Batch Reduction, Beam Scoring), Output]
* CPU-Only: TensorFlow on SKL 6140 18 core, FP32
GPU: V100, TensorRT 5, FP16; Sorted data, Batch=128, English to German
Runs on CPU
GPU-Accelerated
Support NMT layers such as Gather, Softmax, Batch GEMM and Top K
Modular Network Merge
Deploy highly-optimized language translation apps in production environments
Get started with NMT sample in documentation
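To show how Softmax, TopK and beam scoring fit together, here is a minimal sketch of one beam search step in plain Python. It is not the code of the NMT sample: the function `beam_step`, the three-token vocabulary and the logit values are hypothetical, and a real decoder would produce the logits.

```python
import heapq
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def beam_step(beams, logits_per_beam, beam_width):
    """Extend every beam by every token, then keep the best `beam_width`.

    beams: list of (tokens, log_prob) pairs
    logits_per_beam: one list of vocabulary logits per beam
    """
    candidates = []
    for (tokens, score), logits in zip(beams, logits_per_beam):
        for token_id, p in enumerate(softmax(logits)):
            # Beam scoring: accumulate log-probabilities along the sequence.
            candidates.append((score + math.log(p), tokens + [token_id]))
    # TopK over all candidates from all beams.
    best = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return [(tokens, score) for score, tokens in best]


# Hypothetical logits over a 3-token vocabulary.
beams = [([0], 0.0)]
step1 = beam_step(beams, [[2.0, 1.0, 0.0]], beam_width=2)
step2 = beam_step(step1, [[0.0, 3.0, 0.0], [1.0, 0.0, 0.0]], beam_width=2)
```

After each step the surviving beams may come from different parents, which is why the decoder state has to be reordered (the beam shuffle in the diagram) before the next step.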
Recommendation Inference
[Diagram: inputs (User, Context, Item 1, Item 2, ..., Item N) → Embedding → Concat → Fused MLP Kernel (Fully Connected, Bias, Activation; xN) → TopK → Output Top Scores. Legend: Runs on CPU, GPU-Accelerated]
* CPU-Only: TensorFlow on Intel Xeon E5-2698 v4 CPU at 2.20 GHz; GPU: V100 with custom networks
Deploy multi-layer perceptron (MLP) based recommendation apps in production
Predict events (click, purchase, interest) accurately based on input items (user, query, observed activity)
Get started with examples on MNIST character recognition and movie recommender system in the documentation
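The pipeline in the diagram (embedding lookup, concat, fully connected layers with bias and activation, then TopK) can be sketched in plain Python. This is an illustration only, not the recommender sample: all embeddings, weights and names below are hypothetical, and ReLU is assumed as the activation.

```python
import heapq


def relu(values):
    return [max(0.0, v) for v in values]


def dense(x, weights, bias):
    """Fully connected layer: one dot product per weight row, plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def score_items(user_vec, item_vecs, layers):
    """Score every item for one user with a small MLP."""
    scores = []
    for item_id, item_vec in item_vecs.items():
        h = user_vec + item_vec  # Concat of user and item embeddings
        for i, (weights, bias) in enumerate(layers):
            h = dense(h, weights, bias)
            if i < len(layers) - 1:
                h = relu(h)  # activation on hidden layers only
        scores.append((h[0], item_id))
    return scores


# Hypothetical embedding tables and MLP parameters.
user_embeddings = {"u1": [1.0, 0.0]}
item_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
layers = [
    ([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]

scores = score_items(user_embeddings["u1"], item_embeddings, layers)
top = heapq.nlargest(2, scores)  # TopK: the two highest scores
```

Scoring items one at a time keeps the sketch readable; a GPU implementation would batch all items into a single matrix multiply per layer.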
Inferencing with Low precision
[Chart: Peak Performance (TFLOPS / TOPS). P4: Float 5.5, INT8 22. T4: Float 65, INT8 130, INT4 260]
Post Training Quantization
Minimize information loss between FP32 and INT8 inference on calibration dataset
100 mini-batches are sufficient to determine quantization parameters (~500 samples)
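A minimal sketch of what a quantization parameter is, assuming symmetric INT8 quantization with a single scale per tensor. The scale here comes from the largest magnitude in the calibration values, which is the simplest possible choice and not the information-loss criterion named on the slide. The function names and calibration values are hypothetical.

```python
def int8_scale(calibration_values):
    """Symmetric scale: the largest calibration magnitude maps to 127."""
    return max(abs(v) for v in calibration_values) / 127.0


def quantize(x, scale):
    """FP32 value to INT8, saturating outside [-127, 127]."""
    return max(-127, min(127, round(x / scale)))


def dequantize(q, scale):
    """INT8 value back to an approximate FP32 value."""
    return q * scale


# Hypothetical activations collected on a calibration dataset.
calibration = [-2.54, -1.0, 0.0, 0.5, 1.26, 2.54]
scale = int8_scale(calibration)

quantized = [quantize(v, scale) for v in calibration]
restored = [dequantize(q, scale) for q in quantized]
max_error = max(abs(a - b) for a, b in zip(calibration, restored))

# A value outside the calibrated range saturates at the INT8 limit.
saturated = quantize(5.0, scale)
```

A calibrator that minimizes information loss may pick a smaller range than the maximum, trading saturation of rare outliers for finer resolution on common values.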
New INT8 APIs and Optimizations
Auto Calibration: FP32 Training → FP32 weights + Calibration Data O(1000) Images → Optimize to INT8
Custom Calibration: FP32 Training → FP32 weights → Custom Calibration → FP32 or INT8 weights + Custom Activation ranges → Optimize to INT8
Quantization Aware Training: Quantization Aware Training in FP32 → FP32 or accurate INT8 weights + Custom Activation Ranges → Optimize to INT8
Maximize throughput at low latency with mixed precision compute in production
Apply INT8 quantization aware training or custom calibration algorithms with new APIs
Control precision per-layer with new APIs
Optimizations for depthwise convolution operation
Get started with INT8 sample in the documentation
Inference Performance
[Charts: Throughput and Performance/Watt for P4 - INT8 and V100 - Mixed, at x-axis values 1, 2, 4, 8. Higher is better.]
WIDELY ADOPTED
NVIDIA INFERENCE MOMENTUM
Image Tagging, Video Analysis, Advertising Impact, Video Captioning, Visual Search, Finding Music, Sports Performance, Customer Service, Visual Search, Industrial Inspection, Voice Recognition, Cybersecurity
“Using GPUs made it possible to enable media understanding of the Twitter platform, by drastically reducing media deep learning model training time and by allowing us to derive real-time understanding of live videos at inference time.”
-Nicolas Koumchatzky, Head of Twitter Cortex
“At Microsoft, we are working hard to deliver the most advanced AI-powered services to our customers. Using NVIDIA GPUs in real-time inference workloads has improved Bing’s advanced search offerings, enabling us to reduce object detection latency for images. We look forward to working with NVIDIA’s next generation inference hardware and software to expand the way people benefit from AI products and services.”
-Jordi Ribas, CVP, Bing and AI Products, Microsoft
“AI is becoming increasingly pervasive, and inference is a critical capability customers need to successfully deploy their AI models, so we’re excited to support NVIDIA’s Turing Tesla T4 GPUs on Google Cloud Platform soon.”
-Chris Kleban, product manager at Google Cloud
SEOUL | NOVEMBER 7 - 8, 2018
www.nvidia.com/ko-kr/ai-conference/