
DEEP INTO TRTIS: BERT PRACTICAL DEPLOYMENT ON NVIDIA GPU
Xu Tianhao, Deep Learning Solution Architect, NVIDIA

2

AGENDA

• TensorRT Hyperscale Inference Platform overview

• TensorRT Inference Server

• Overview and Deep Dive: Key features

• Deployment possibilities: Generic deployment ecosystem

• Hands-on

• NVIDIA BERT Overview

• FasterTransformer and TRT optimized BERT inference

• Deploy BERT TensorFlow model with custom op

• Deploy BERT TensorRT model with plugins

• Benchmarking

• Open Discussion

DEEP INTO TRTIS

3

WORLD’S MOST ADVANCED SCALE-OUT GPU

INTEGRATED INTO TENSORFLOW & ONNX SUPPORT

TENSORRT HYPERSCALE INFERENCE PLATFORM

TENSORRT INFERENCE SERVER

4

Universal Inference Acceleration

320 Turing Tensor cores

2,560 CUDA cores

65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS

16GB | 320GB/s

ANNOUNCING TESLA T4
WORLD’S MOST ADVANCED INFERENCE GPU

5

NEW TURING TENSOR CORE

MULTI-PRECISION FOR AI INFERENCE

65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4

6

Up To 36X Faster Than CPUs | Accelerates All AI Workloads

WORLD’S MOST PERFORMANT INFERENCE PLATFORM

[Bar charts: speedup vs. CPU server for CPU Server, Tesla P4, and Tesla T4]

• Natural Language Processing Inference (GNMT): CPU Server 1.0x, Tesla P4 10x, Tesla T4 36x faster
• Speech Inference (DeepSpeech 2): CPU Server 1.0x, Tesla P4 4x, Tesla T4 21x faster
• Video Inference (ResNet-50, 7 ms latency limit): CPU Server 1.0x, Tesla P4 10x, Tesla T4 27x faster

[Peak performance chart (TFLOPS / TOPS): P4: 5.5 FP32, 22 INT8; T4: 65 FP16, 130 INT8, 260 INT4]

7

NVIDIA TENSORRT OVERVIEW
From Every Framework, Optimized For Each Target Platform

TESLA V100

DRIVE PX 2

NVIDIA T4

JETSON TX2

NVIDIA DLA

TensorRT

8

NVIDIA TENSORRT OVERVIEW
From Every Framework, Optimized For Each Target Platform

Quantized INT8 (Precision Optimization): significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss (see the engine-building sketch after this list)

Layer Fusion (Graph Optimization): improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single kernel execution

Kernel Auto-Tuning: optimizes execution time by choosing the best data layout and best parallel algorithms for the target Jetson, Tesla, or DRIVE PX GPU platform

Dynamic Tensor Memory (Memory Optimization): reduces memory footprint and improves memory reuse by allocating memory for each tensor only for the duration of its usage

Multi-Stream Execution: scales to multiple input streams by processing them in parallel using the same model and weights
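To make the precision and graph optimizations above concrete, here is a minimal sketch of building an engine with the TensorRT Python API, assuming an ONNX model file. The path, workspace size, and batch size are placeholders, and the builder API differs slightly between TensorRT releases, so treat this as a sketch rather than a definitive recipe.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path="model.onnx", max_batch_size=8):
    # Parse the ONNX graph into a TensorRT network definition.
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    # Precision optimization: allow FP16 kernels (INT8 would also need a calibrator).
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30          # 1 GB of scratch space
    config.set_flag(trt.BuilderFlag.FP16)

    builder.max_batch_size = max_batch_size
    # Layer fusion and kernel auto-tuning happen inside build_engine().
    return builder.build_engine(network, config)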

9

GRAPH OPTIMIZATION

[Diagram: un-optimized Inception-style network with separate 1x1 / 3x3 / 5x5 conv, bias, and relu nodes feeding concat and max pool layers]

• Vertical Fusion
• Horizontal Fusion
• Layer Elimination

Network        Layers before   Layers after
VGG19          43              27
Inception V3   309             113
ResNet-152     670             159

10

GRAPH OPTIMIZATION

[Diagram: the same un-optimized network alongside the TensorRT-optimized network, where each conv + bias + relu branch is fused into a single CBR node (1x1 CBR, 3x3 CBR, 5x5 CBR)]

• Vertical Fusion
• Horizontal Fusion
• Layer Elimination

Network        Layers before   Layers after
VGG19          43              27
Inception V3   309             113
ResNet-152     670             159

11

TENSORRT PERFORMANCE
developer.nvidia.com/tensorrt

40x Faster CNNs on V100 vs. CPU-Only, Under 7ms Latency (ResNet50)

[Chart: ResNet50 inference throughput (images/sec) and latency. CPU-Only: 140 images/sec at 14 ms; V100 + TensorFlow: 305 images/sec at 6.67 ms; V100 + TensorRT: 5,700 images/sec at 6.83 ms]

Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to reflect Intel's stated claim of a 2x performance improvement on Skylake with AVX512.

140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)

[Chart: OpenNMT 692M inference throughput (sentences/sec) and latency. CPU-Only + Torch: 280 ms; V100 + Torch: 425 sentences/sec at 153 ms; V100 + TensorRT: 550 sentences/sec at 117 ms]

Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on.

12

AGENDA

• TensorRT Hyperscale Inference Platform overview

• TensorRT Inference Server

• Overview and Deep Dive: Key features

• Deployment possibilities: Generic deployment ecosystem

• Hands-on

• NVIDIA BERT Overview

• FasterTransformer and TRT optimized BERT inference

• Deploy BERT TensorFlow model with custom op

• Deploy BERT TensorRT model with plugins

• Benchmarking

• Open Discussion

DEEP INTO TRTIS

13

INEFFICIENCY LIMITS INNOVATION
Difficulties with Deploying Data Center Inference

Single Framework Only: solutions can only support models from one framework

Single Model Only: some systems are overused while others are underutilized

Custom Development: developers need to reinvent the plumbing for every application

[Diagram: ASR, NLP, and recommender workloads each served by separate, siloed inference stacks]

14

NVIDIA TENSORRT INFERENCE SERVER
Architected for Maximum Datacenter Utilization

• Maximize real-time inference performance of GPUs

• Quickly deploy and manage multiple models per GPU per node

• Easily scale to heterogeneous GPUs and multi-GPU nodes

• Integrates with orchestration systems and auto scalers via latency and health metrics

• Now open source for thorough customization and integration

[Diagram: TensorRT Inference Server instances serving models on heterogeneous nodes with NVIDIA T4, Tesla V100, and Tesla P4 GPUs]

15

FEATURES
Utilization | Usability | Performance | Customization

Dynamic Batching: inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA

Concurrent Model Execution: multiple models (or multiple instances of the same model) may execute on the GPU simultaneously

CPU Model Inference Execution: framework-native models can execute inference requests on the CPU

Multiple Model Format Support: PyTorch JIT (.pt), TensorFlow GraphDef/SavedModel, TensorFlow+TensorRT GraphDef, ONNX graph (ONNX Runtime), TensorRT Plans, Caffe2 NetDef (ONNX import path)

Metrics: utilization, count, memory, and latency

Model Control API: explicitly load/unload models into and out of TRTIS based on changes made in the model-control configuration

System/CUDA Shared Memory: inputs/outputs that need to be passed to/from TRTIS are stored in system/CUDA shared memory, which reduces HTTP/gRPC overhead

Library Version: link against libtrtserver.so so that you can include all the inference server functionality directly in your application

Custom Backend: a custom backend gives the user more flexibility by providing their own implementation of an execution engine through the use of a shared library

Model Ensemble: a pipeline of one or more models and the connection of input and output tensors between those models (can be used with a custom backend)

Streaming API: built-in support for audio streaming input, e.g. for speech recognition

16

INFERENCE SERVER ARCHITECTURE

Models supported:
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT Plans
● Caffe2 NetDef (ONNX import)
● ONNX graph
● PyTorch JIT (.pt)

Multi-GPU support

Concurrent model execution

Server HTTP REST API/gRPC

Python/C++ client libraries

Python/C++ Client Library

Available with Monthly Updates
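As a small usability illustration, the sketch below probes the HTTP endpoint from Python. The /api/health and /api/status paths follow the v1 TRTIS REST API as documented in the quickstart of that era; treat them as assumptions if you are on a different release.

import requests

TRTIS_URL = "http://localhost:8000"   # default HTTP port of trtserver

def server_ready():
    # Assumed v1 endpoints: /api/health/live and /api/health/ready.
    live = requests.get(TRTIS_URL + "/api/health/live")
    ready = requests.get(TRTIS_URL + "/api/health/ready")
    return live.status_code == 200 and ready.status_code == 200

def server_status():
    # Assumed v1 endpoint: /api/status returns server and model status.
    resp = requests.get(TRTIS_URL + "/api/status")
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    print("server ready:", server_ready())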

17

COMMON WAYS TO FULLY UTILIZE GPU

1. Increase computation intensity: increase the batch size

2. Execute multiple tasks concurrently with multiple streams or MPS (Multi-Process Service)

18

DYNAMIC BATCHING SCHEDULER

[Diagram: batch-1 and batch-4 requests entering the TensorRT Inference Server, grouped by the dynamic batcher before being dispatched to runtime contexts in the framework backend]

19

DYNAMIC BATCHING SCHEDULER

[Diagram: requests queued in the TensorRT Inference Server's dynamic batcher before dispatch to the ModelY backend runtime contexts]

Preferred batch size and wait time are configuration options. Assume a batch size of 4 gives the best utilization in this example.

Grouping requests into a single "batch" increases overall GPU throughput.

20

DYNAMIC BATCHING

TensorRT Inference Server groups inference requests based on customer-defined metrics for optimal performance.

The customer defines 1) batch size (required) and 2) latency requirements (optional).

Example: no dynamic batching (batch size 1 & 8) vs. dynamic batching

2.5x Faster Inferences/Second at a 50 ms End-to-End Server Latency Threshold
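The policy is easy to picture with a toy simulation. The sketch below is not TRTIS code, just an illustration of the trade-off described above: wait briefly to fill a preferred batch, but never hold a request past the configured queue delay.

import time
from queue import Queue, Empty

PREFERRED_BATCH_SIZE = 4        # batch size that best utilizes the GPU (example value)
MAX_QUEUE_DELAY_S = 100e-6      # 100 microseconds, as in the config example later

def dynamic_batcher(request_queue, execute_batch):
    # Toy dynamic batcher: group single requests into a preferred-size batch
    # unless the latency budget would be exceeded.
    while True:
        batch = [request_queue.get()]                     # block for the first request
        deadline = time.monotonic() + MAX_QUEUE_DELAY_S
        while len(batch) < PREFERRED_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                                     # latency budget exhausted
            try:
                batch.append(request_queue.get(timeout=remaining))
            except Empty:
                break
        execute_batch(batch)                              # one GPU execution per batch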

21

MPS VS CUDA STREAMS IN TRTIS


TRTIS CUDA Streams are 1-4% slower than MPS but provide

some usability advantages and other methods to maximize

performance over MPS limitations

MPS

• Multiple processes on a single GPU (no interconnect/intercommunication between processes)

• Shares GPU memory between multiple processes; if one process oversubscribes the memory, the others are starved, so it is harder to coordinate memory usage

• Experimental in nv-docker

• Experimental in nv-docker

CUDA Streams

• One process on a single GPU with multiple streams/execution contexts

• More holistic view of memory, so it is easier to coordinate memory usage

• Maximize GPU utilization by using batching vs. having several processes executing at batch size 1

22

Concurrent Model Execution

max_batch_size: 8instance_group [

{count: 4kind: KIND_GPUgpus: [0, 1]

},{count: 4kind: KIND_CPUgpus: [3, 4]

}]

Create one execution context for each instance of a group of a certain model

23

CONCURRENT MODEL EXECUTION - RESNET 50

4x Better Performance and Improved GPU Utilization Through Multiple Model Concurrency

[Diagram: 14 concurrent inference requests queued per model in the TensorRT Inference Server and dispatched to 12 ResNet-50 instances, each on its own CUDA stream, on a single V100 16GB GPU]

Common Scenario: one API using multiple copies of the same model on a GPU.

Example: 12 instances of TRT FP16 ResNet-50 (each model takes 1.33 GB of GPU memory) are loaded onto the GPU and can run concurrently on a 16 GB V100 GPU.

14 concurrent inference requests arrive: each model instance fulfills one request simultaneously, and 2 are queued in the per-model scheduler queues in TensorRT Inference Server to execute after the 12 requests finish.

With this configuration, 2,832 inferences per second at 33.94 ms with batch size 8 on each inference server instance is achieved.
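A load pattern like the 14 concurrent requests above can be reproduced from a client with a simple thread pool. The infer_once() helper below is hypothetical (a wrapper around whichever HTTP/gRPC client you use); the sketch only shows the concurrency structure.

from concurrent.futures import ThreadPoolExecutor

def infer_once(request_id):
    # Hypothetical helper: send one inference request to TRTIS and return the result.
    raise NotImplementedError("wrap your HTTP/gRPC client call here")

def run_concurrent(num_requests=14, concurrency=14):
    # With 12 model instances loaded, 12 requests execute immediately and the
    # remaining 2 wait in the per-model scheduler queue.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(infer_once, range(num_requests)))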


26

Concurrent Model Execution

max_batch_size: 8
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  },
  {
    count: 4
    kind: KIND_CPU
    gpus: [ 3, 4 ]
  }
]

Create one execution context for each instance of a group of a certain model

Scheduling threads / multiple streams

Priority: MAX, DEFAULT, MIN

27

Model Control and Model Configuration

An HTTP POST to /api/modelcontrol/<load|unload>/<model name> loads or unloads a model from the inference server.
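A minimal Python sketch of driving that endpoint (the path comes straight from the slide above; error handling is omitted, and the call only has an effect in the EXPLICIT mode described below):

import requests

TRTIS_URL = "http://localhost:8000"

def load_model(name):
    # POST /api/modelcontrol/load/<model name>
    requests.post(TRTIS_URL + "/api/modelcontrol/load/" + name).raise_for_status()

def unload_model(name):
    # POST /api/modelcontrol/unload/<model name>
    requests.post(TRTIS_URL + "/api/modelcontrol/unload/" + name).raise_for_status()

# Example: swap models without restarting the server.
# load_model("bert_trt"); unload_model("bert_fastertransformer")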

Model Control Modes

1) NONE

• Server attempts to load all models at runtime.

• Changes to the model repo will be ignored

• Model control API requests will have no effect

2) POLL

• Server attempts to load all models at runtime

• Changes to model repo will be detected and server will

attempt to load and unload models based on changes

• Model control requests will have no effect

3) EXPLICIT

• Server does not load any models in the model repo at

runtime

• All model loading and unloading must be initiated using the

Model Control API

Local model repository

28

Model Control and Model Configuration

name: "mymodel"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input0"
    data_type: TYPE_FP32
    dims: [ 16 ]
    reshape: { shape: [ ] }
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]
version_policy: { all { } }
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
optimization {
  graph { level: 1 }
  cuda { graphs: 1 }
  priority: PRIORITY_MAX
}

• dims: use -1 for dynamic dimensions
• reshape: for the dims the model actually accepts

• Supports multiple backends (platform)

• Version control: serve selected versions

• Instances for concurrent execution
• Select multiple GPUs
• Select CPU or GPU for execution
• There can be multiple groups

• Preferred batch size is configurable
• Set max queue delay for SLA control

• Multiple optimizations
• Set graph level to 1 to trigger XLA for TF
• Set cuda graphs to 1 to use CUDA Graphs for small-batch-size inference
• Set priority to MAX to raise the scheduler thread priority and CUDA stream priority (only for TRT now)

• ExecutionAccelerators: enable onnx-tensorrt or tensorflow-tensorrt to automatically benefit from TensorRT integration

29

MODEL ENSEMBLING

• Pipeline of one or more models and the connection of input and output tensors between those models

• Use for model stitching or data flow of multiple models, such as data preprocessing → inference → data post-processing

• Collects the output tensors in each step and provides them as input tensors for other steps according to the specification

• Ensemble models will inherit the characteristics of the models involved, so the metadata in the request header must comply with the models within the ensemble

ensemble_scheduling {
  step [
    {
      model_name: "image_preprocess_model"
      model_version: -1
      input_map {
        key: "RAW_IMAGE"
        value: "IMAGE"
      }
      output_map {
        key: "PREPROCESSED_OUTPUT"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "classification_model"
      model_version: -1
      input_map {
        key: "FORMATTED_IMAGE"
        value: "preprocessed_image"
      }
      output_map {
        key: "CLASSIFICATION_OUTPUT"
        value: "CLASSIFICATION"
      }
    },
    {
      model_name: "segmentation_model"
      model_version: -1
      input_map {
        key: "FORMATTED_IMAGE"
        value: "preprocessed_image"
      }
      output_map {
        key: "SEGMENTATION_OUTPUT"
        value: "SEGMENTATION"
      }
    }
  ]
}

Custom Backend

30

CUSTOM BACKEND
Integrate custom, non-framework code into TRTIS

It is not uncommon for a model to have some non-ML parts.

BERT: tokenizer, feature extractor

Custom backend allows these parts to be integrated into TRTIS

Implement code as shared library using backwards compatible C API

Benefit from the full TRTIS feature set (same as framework backends)

• Dynamic batcher, sequence batcher, concurrent execution, multi-GPU, etc.

Provides deployment flexibility; TRTIS provides standard, consistent interface protocol

between models and custom components
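For BERT, the non-ML part is mainly the feature extractor that turns raw text into the input_ids / input_mask / segment_ids tensors the model consumes (the same input names used in the config.pbtxt examples later in this deck). A stripped-down sketch of that logic, assuming a pre-built vocab dictionary and whitespace tokenization instead of real WordPiece:

import numpy as np

def make_bert_features(question, paragraph, vocab, max_seq_len=128):
    # Toy feature extractor: [CLS] question [SEP] paragraph [SEP], padded to max_seq_len.
    # A real deployment would use WordPiece tokenization from the BERT vocab file.
    q_tokens = question.lower().split()
    p_tokens = paragraph.lower().split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + p_tokens + ["[SEP]"]
    tokens = tokens[:max_seq_len]

    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    segment_ids = ([0] * (len(q_tokens) + 2) + [1] * (len(p_tokens) + 1))[:len(tokens)]
    input_mask = [1] * len(tokens)

    pad = max_seq_len - len(tokens)
    input_ids += [0] * pad
    segment_ids += [0] * pad
    input_mask += [0] * pad

    return (np.array(input_ids, dtype=np.int32),
            np.array(input_mask, dtype=np.int32),
            np.array(segment_ids, dtype=np.int32))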

31

STREAMING INFERENCE REQUESTS

New Streaming API: based on the correlation ID, audio requests are sent to the appropriate batch slot in the sequence batcher.*

[Diagram: inference requests with correlation IDs (Corr 1, Corr 2, Corr 3) routed from per-model request queues into the DeepSpeech2 and Wav2Letter sequence batchers of the framework inference backend]

*Correct order of requests is assumed at entry into the endpoint. Note: Corr = correlation ID

32

STREAMING API
FSM maintained in StreamInferContext

[State diagram: the streaming (bidirectional) and non-streaming request paths move through Initialized, Request Done, Response Done, and Finished Done states, resetting on unexpected input or finish; on completion the server starts writing all remaining data back and calls CompleteExecution() to write the result back]

33

TRTIS LIBRARY VERSION
Tightly couple TRTIS functionality into the controlling application via a shared library

Smaller binary: plug the TRTIS library into an existing application

Removes existing REST and gRPC endpoints

Still leverage GPU optimizations like dynamic batching and model concurrency

Very low communication overhead (same system and CUDA memory address space)

Backward compatible C interface

34

AVAILABLE METRICS

Category          Name                   Use Case                                                      Granularity   Frequency

GPU Utilization   Power usage            Proxy for load on the GPU                                     Per GPU       Per second
                  Power limit            Maximum GPU power limit                                       Per GPU       Per second
                  GPU utilization        GPU utilization rate [0.0 - 1.0)                              Per GPU       Per second

GPU Memory        GPU total memory       Total GPU memory, in bytes                                    Per GPU       Per second
                  GPU used memory        Used GPU memory, in bytes                                     Per GPU       Per second

Count             Request count          Number of inference requests                                  Per model     Per request
(GPU & CPU)       Execution count        Number of model inference executions; request count /        Per model     Per request
                                         execution count = average dynamic request batching
                  Inference count        Number of inferences performed (one request counts as         Per model     Per request
                                         "batch size" inferences)

Latency           Latency: request time  End-to-end inference request handling time                    Per model     Per request
(GPU & CPU)       Latency: compute time  Time a request spends executing the inference model           Per model     Per request
                                         (in the appropriate framework)
                  Latency: queue time    Time a request spends waiting in the queue before being       Per model     Per request
                                         executed
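These metrics are exported in Prometheus text format (on port 8002 in the standard container, if I remember the defaults correctly; both the port and the metric-name prefixes below are assumptions to verify against your release). A small scraper:

import requests

METRICS_URL = "http://localhost:8002/metrics"   # assumed default metrics port

def scrape(prefixes=("nv_gpu_utilization", "nv_inference")):
    # Print the TRTIS Prometheus metrics whose names start with the given prefixes.
    for line in requests.get(METRICS_URL).text.splitlines():
        if line.startswith("#"):
            continue                              # skip HELP/TYPE comment lines
        if line.startswith(prefixes):
            print(line)

if __name__ == "__main__":
    scrape()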

35

PERF_CLIENT TOOL

• Measures throughput (inf/s) and latency under varying client loads

• perf_client modes:

  1. Specify how many concurrent outstanding requests to maintain, and it will find a stable latency and throughput for that level

  2. Generate a throughput vs. latency curve by increasing the request concurrency until a specific latency or concurrency limit is reached

• Generates a file containing CSV output of the results

• Easy steps to help visualize the throughput vs. latency tradeoffs (see the plotting sketch below)
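One way to visualize that trade-off is to read the CSV perf_client writes and plot throughput against latency. The column names used below ('Inferences/Second', 'p99 latency') are assumptions from memory; check them against the header of your own CSV before relying on this sketch.

import csv
import matplotlib.pyplot as plt

def plot_perf(csv_path="perf.csv"):
    # Plot throughput vs. p99 latency from a perf_client concurrency sweep.
    throughput, latency_ms = [], []
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            throughput.append(float(row["Inferences/Second"]))
            latency_ms.append(float(row["p99 latency"]) / 1000.0)   # usec -> msec
    plt.plot(latency_ms, throughput, marker="o")
    plt.xlabel("p99 latency (ms)")
    plt.ylabel("Throughput (inferences/sec)")
    plt.title("perf_client concurrency sweep")
    plt.show()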

36

GENERIC INFERENCE SERVER DEPLOYMENT ARCHITECTURE

[Diagram: clients issue image, recommender, and ASR API requests through a load balancer to a cluster of containerized TRTIS inference services (CPU/GPU) with pre- and post-processing; multiple workloads (TensorRT, TensorFlow, Caffe2/ONNX) run per GPU node; a metrics service drives an auto scaler; the model repository sits on network storage / a persistent volume fed by your training/pruning/validation flow. Legend: already existing components vs. new components from NVIDIA]

37

For a more detailed explanation and step-by-step guidance for this collaboration, refer to this GitHub repo.

TENSORRT INFERENCE SERVER COLLABORATION WITH KUBEFLOW

What is Kubeflow?

• Open-source project to make ML workflows on Kubernetes simple, portable, and

scalable

• Customizable scripts and configuration files to deploy containers on their chosen

environment

Problems it solves

• Easily set up an ML stack/pipeline that can fit into the majority of enterprise

datacenter and multi-cloud environments

How it helps TensorRT Inference Server

• TensorRT Inference Server is deployed as a component inside of a production

workflow to

• Optimize GPU performance

• Enable auto-scaling, traffic load balancing, and redundancy/failover via

metrics

38

TRTIS Helm Chart

Helm: Most used “package manager” for Kubernetes

We built a simple chart ("package") for the TensorRT Inference Server.

You can use it to easily deploy an instance of the server. It can also be easily configured to point to a different image, model store, etc.

https://github.com/NVIDIA/tensorrt-inference-server/tree/b6b45ead074d57e3d18703b7c0273672c5e92893/deploy/single_server

Simple helm chart for installing a single instance of the NVIDIA TensorRT Inference Server

39

AGENDA

• TensorRT Hyperscale Inference Platform overview

• TensorRT Inference Server

• Overview and Deep Dive: Key features

• Deployment possibilities: Generic deployment ecosystem

• Hands-on

• NVIDIA BERT Overview

• FasterTransformer and TRT optimized BERT inference

• Deploy BERT TensorFlow model with custom op

• Deploy BERT TensorRT model with plugins

• Benchmarking

• Open Discussion

DEEP INTO TRTIS

40

WHAT IS BERT?

BERT: Bidirectional Encoder Representations from Transformers

Widely used in multiple NLP tasks due to its high accuracy.

41

WHAT IS BERT
Transformer Encoder Part

42

TENSORFLOW INFERENCE

Previous TF inference is not efficient:

1. TF ops are very small and kernel launch overhead costs significant time; e.g., GELU/LayerNorm each consist of several small ops (see the GELU sketch below)
2. Multi-head self-attention lacks an efficient GPU implementation
3. TF scheduling is not efficient
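As a concrete example of point 1, the GELU activation used in BERT is usually written in TensorFlow with the tanh approximation below; every elementwise op in it launches its own small kernel, which is exactly the kind of chain FasterTransformer fuses into a single kernel.

import numpy as np
import tensorflow as tf

def gelu(x):
    # Tanh approximation of GELU, as in the original BERT code:
    # pow, mul, add, tanh, add, mul, mul - each a separate TF op and kernel launch.
    cdf = 0.5 * (1.0 + tf.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))
    return x * cdf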

43

NVIDIA’S INFERENCE

Optimization ideas:

1. Optimize the calculations with CUDA and integrate the implementation into TF as a custom op (see the loading sketch below)

2. Optimize the inference with TensorRT

3. Algorithm Level Acceleration
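For idea 1, the fused CUDA kernels are compiled into a shared library and exposed to TensorFlow as a custom op. A hedged sketch of the loading step: the library name comes from the hands-on section below, while the exact op name and argument list exported by FasterTransformer are not shown here and should be taken from its sample code.

import tensorflow as tf

# Load the FasterTransformer custom op library built in the hands-on section.
ft_module = tf.load_op_library("./libtf_fastertransformer.so")

# The fused transformer op from ft_module can then replace the stock per-layer
# TF subgraph in the BERT model; see the FasterTransformer samples for the
# real op name and signature.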

44

NVIDIA’S INFERENCE
CUDA Optimization - Performance

<batch_size, layers, seq_len, head_num, size_per_head>   P4 FP32 (ms)   T4 FP32 (ms)   T4 FP16 (ms)
(1, 12, 32, 12, 64)                                       3.43           2.74           1.56
(1, 12, 64, 12, 64)                                       4.04           3.64           1.77
(1, 12, 128, 12, 64)                                      6.22           5.93           2.23

Performance over different seq_len on P4 and T4

45

NVIDIA’S INFERENCE
CUDA Optimization - Resources

Where you can find it:

FasterTransformer project (open-sourced): https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer

46

NVIDIA’S INFERENCE
TRT Optimization

47

NVIDIA’S INFERENCE
TRT Optimization

[Diagrams: the BERT graph before and after TRT optimization]

48

NVIDIA’S INFERENCE
TRT Optimization - Resources

Where you can find it:

BERT TRT demo (open-sourced): https://github.com/NVIDIA/TensorRT/tree/release/6.0/demo/BERT (to be relocated to DeepLearningExamples)
Blog: https://devblogs.nvidia.com/nlu-with-tensorrt-bert/

49

HANDS-ON
Deploy BERT TensorFlow Model with custom op

1. Follow the FasterTransformer README and generate the custom op lib: libtf_fastertransformer.so

2. Prepare gemm_config.in for the best GEMM algorithms by running the built binaries.

3. Modify sample/tensorflow_bert/profile_bert_inference.py to create the SQuAD model, and use the saved_model API to export the model (see the export sketch after these steps).

4. Arrange the exported model in a tree structure: ./bert_ft/1/model.savedmodel/<exported files>
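For step 3, the TF1 saved_model helper can write the directory layout TRTIS expects. In this sketch the session and the four tensor handles are placeholders that must come from the SQuAD graph actually built in profile_bert_inference.py.

import tensorflow as tf

def export_savedmodel(sess, input_ids, input_mask, segment_ids, prediction,
                      export_dir="./bert_ft/1/model.savedmodel"):
    # Export the BERT SQuAD graph as <model>/<version>/model.savedmodel,
    # with input/output names matching the config.pbtxt used later.
    tf.saved_model.simple_save(
        sess,
        export_dir,
        inputs={"input_ids": input_ids,
                "input_mask": input_mask,
                "segment_ids": segment_ids},
        outputs={"prediction": prediction})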

50

HANDS-ON
Deploy BERT TensorRT Model with plugins

1. Follow the TensorRT/demo/BERT README and generate the plugin libs: libbert_plugins.so and libcommon.so

2. Follow the README and run sample_bert with the additional arg '--saveEngine=model.plan' (a verification sketch follows these steps)

3. Arrange the model dir in a tree structure: bert_trt/1/model.plan
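Before handing the plan file to TRTIS it can be worth checking that it deserializes with the BERT plugins loaded. A hedged sketch with the TensorRT Python runtime; the paths refer to the libraries and plan built in the steps above.

import ctypes
import tensorrt as trt

# Load the plugin libraries from TensorRT/demo/BERT so the custom BERT layers
# can be resolved during deserialization.
ctypes.CDLL("./libcommon.so", mode=ctypes.RTLD_GLOBAL)
ctypes.CDLL("./libbert_plugins.so", mode=ctypes.RTLD_GLOBAL)

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

with open("bert_trt/1/model.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    print("bindings:", [engine.get_binding_name(i) for i in range(engine.num_bindings)])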

51

HANDS-ON

Prepare model_repository, run trtserver and perf_client

1. Prepare the model_repository

Model directory:

model_repository/
|-- bert_fastertransformer
|   |-- 1
|   |   `-- model.savedmodel
|   |       |-- saved_model.pb
|   |       `-- variables
|   |           |-- variables.data-00000-of-00001
|   |           `-- variables.index
|   `-- config.pbtxt
`-- bert_trt
    |-- 1
    |   `-- model.plan
    `-- config.pbtxt

config.pbtxt (bert_fastertransformer):

name: "bert_fastertransformer"
platform: "tensorflow_savedmodel"
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ 1, 128 ]
  },
  {
    name: "input_mask"
    data_type: TYPE_INT32
    dims: [ 1, 128 ]
  },
  {
    name: "segment_ids"
    data_type: TYPE_INT32
    dims: [ 1, 128 ]
  }
]
output [
  {
    name: "prediction"
    data_type: TYPE_FP32
    dims: [ 2, 1, 128 ]
  }
]
instance_group {
  kind: KIND_GPU
  count: 1
}
version_policy: { specific { versions: [1] } }

config.pbtxt (bert_trt):

name: "bert_trt"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "segment_ids"
    data_type: TYPE_INT32
    dims: 128
  },
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: 128
  },
  {
    name: "input_mask"
    data_type: TYPE_INT32
    dims: 128
  }
]
output [
  {
    name: "cls_squad_logits"
    data_type: TYPE_FP32
    dims: [ 128, 2, 1, 1 ]
  }
]
instance_group {
  kind: KIND_GPU
  count: 1
}
version_policy: { specific { versions: [32] } }

52

HANDS-ON

1. Launch trtserver over HTTP/gRPC

   1. NV_GPU=x nvidia-docker run --rm -it --name=trtis_bert -p8000:8000 -p8001:8001 -v/path/to/model_repository:/models nvcr.io/nvidia/tensorrtserver:19.11-py3

   2. export LD_PRELOAD=/path/to/{libcommon.so; libbert_plugins.so; libtf_fastertransformer.so}

   3. trtserver --model-store=/models --log-verbose=1 --strict-model-config=True

Prepare model_repository, run trtserver and perf_client

53

HANDS-ON

1. Run perf_client to infer over gRPC

   1. Launch docker: docker run --net=host --rm -it nvcr.io/nvidia/tensorrtserver-clients

   2. ./install/bin/perf_client -m bert_trt -d -c8 -l200 -p2000 -b1 -i grpc -u localhost:8001 -t1 --max-threads=8

Prepare model_repository, run trtserver and perf_client

Result reported by perf_client:

Request concurrency: 1
  Client:
    Request count: 59
    Throughput: 944 infer/sec
    Avg latency: 34422 usec (standard deviation 288 usec)
    p50 latency: 34457 usec
    p90 latency: 34667 usec
    p95 latency: 34877 usec
    p99 latency: 35130 usec
    Avg gRPC time: 34452 usec ((un)marshal request/response 27 usec + response wait 34425 usec)
  Server:
    Request count: 70
    Avg request latency: 33473 usec (overhead 26 usec + queue 58 usec + compute 33389 usec)

54

HANDS-ON
Benchmarking - FasterTransformer

SQuAD task inference (FasterTransformer): batch size = 1, TensorFlow backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor Precision QPS AvgL(ms) TP99(ms) Concurrent

CPU FP32 7.7 131 203 1

CPU FP32 10.4 289 339 4

GPU FP32 104.5 9.5 11.8 1

GPU FP32 137 21.9 23.7 4

GPU FP16 267.5 3.7 3.9 1

GPU FP16 461.5 8.7 10.3 4

Highlights: CPU (7.7 QPS, 131 ms) -> GPU FP32 (104.5 QPS, 9.5 ms) -> concurrent GPU FP16 (461.5 QPS, 8.7 ms)

Virtual GPU feature in TensorFlow to enable multi-stream: --tf-add-vgpu="0;4;3000"

55

HANDS-ON
Benchmarking - FasterTransformer

SQuAD task inference (FasterTransformer): batch size = 32, TensorFlow backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor Precision QPS AvgL(ms) TP99(ms) Concurrent

CPU FP32 0.4 2491 2810 1

GPU FP32 5.5 182 184 1

GPU FP32 5 186 186 4

GPU FP16 21.8 46 48.8 1

GPU FP16 21.6 46.1 48.1 4

Highlights: CPU (0.4 QPS, 2491 ms) -> GPU FP32 (5.5 QPS, 182 ms) -> GPU FP16 (21.8 QPS, 46 ms)

56

HANDS-ON
Benchmarking - TensorRT

SQuAD task inference (TensorRT): batch size = 1, TensorRT backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor Precision QPS AvgL(ms) TP99(ms) Concurrent

GPU FP32 163 12.3 12.4 1

GPU FP32 156 12.8 14 4

GPU FP16 438.5 4.6 4.6 1

GPU FP16 473.5 4.2 5.1 4

Highlights: GPU FP32 (163 QPS, 12.3 ms) -> concurrent GPU FP16 (473.5 QPS, 4.2 ms)

57

HANDS-ON
Benchmarking - TensorRT

SQuAD task inference (TensorRT): batch size = 32, TensorRT backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor Precision QPS AvgL(ms) TP99(ms) Concurrent

GPU FP32 6.5 157 159 1

GPU FP32 6.5 316 356 4

GPU FP16 29.5 34.2 34.8 1

GPU FP16 30.5 134 151 4

Highlights: GPU FP32 (6.5 QPS, 157 ms) -> concurrent GPU FP16 (30.5 QPS, 134 ms)

58

Learn more here: https://nvidia.com/data-center-inference

https://docs.nvidia.com/deeplearning/sdk/inference-release-notes/index.html

https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/quickstart.html

Get the ready-to-deploy container with monthly updates from the NGC container registry: https://ngc.nvidia.com/catalog/containers/nvidia%2Ftensorrtserver

Open source GitHub repository: https://github.com/NVIDIA/tensorrt-inference-server

LEARN MORE AND DOWNLOAD TO USE

59

Engineering developer blog (benchmarks, model concurrency, etc.): https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/

Kubeflow guest blog: https://www.kubeflow.org/blog/nvidia_tensorrt/

Open source announcement: https://news.developer.nvidia.com/nvidia-tensorrt-inference-server-now-open-source

More: Data center inference page & TensorRT page

DevTalk Forum for Support

TensorRT Hyperscale Inference Platform infographic

NVIDIA AI Inference Platform technical overview

NVIDIA TensorRT Inference Server and Kubeflow

NVIDIA TensorRT Inference Server Now Available

ADDITIONAL RESOURCES

60

NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.

NVIDIA Deep Learning Institute (DLI)

www.nvidia.cn/dli

DLI full-day deep learning training @ GTC CHINA 2019

Global developer training certificates | Fully configured GPU lab environments | 5 new courses debuting | Once-a-year 40%-off special

View courses and register | Training consultation

• Training Neural Networks with Multiple GPUs: tackle the algorithmic and engineering challenges of large-scale neural network training
• CUDA Python: easily accelerate Python applications on the GPU
• Introduction to Computer Vision: deep learning methods and practice
• Natural Language Processing (NLP): essential theory and applied skills
• Advanced applications fusing multi-data-type machine vision and NLP techniques
• Perception Systems for Autonomous Vehicles (2019 edition): learn to build autonomous vehicles with NVIDIA DRIVE AGX
• Industrial Inspection: build automated industrial inspection models with deep learning
• Getting Started with AI on Jetson Nano: robotics fundamentals, with a Jetson Nano kit included
