DEEP INTO TRTIS: BERT PRACTICAL DEPLOYMENT ON NVIDIA GPU
Xu Tianhao, Deep Learning Solution Architect, NVIDIA


Page 1

DEEP INTO TRTIS: BERT PRACTICAL DEPLOYMENT ON NVIDIA GPU
Xu Tianhao, Deep Learning Solution Architect, NVIDIA

Page 2

AGENDA

• TensorRT Hyperscale Inference Platform overview

• TensorRT Inference Server

• Overview and Deep Dive: Key features

• Deployment possibilities: Generic deployment ecosystem

• Hands-on

• NVIDIA BERT Overview

• FasterTransformer and TRT optimized BERT inference

• Deploy BERT TensorFlow model with custom op

• Deploy BERT TensorRT model with plugins

• Benchmarking

• Open Discussion

DEEP INTO TRTIS

Page 3

WORLD’S MOST ADVANCED SCALE-OUT GPU

INTEGRATED INTO TENSORFLOW & ONNX SUPPORT

TENSORRT HYPERSCALE INFERENCE PLATFORM

TENSORRT INFERENCE SERVER

Page 4

Universal Inference Acceleration

320 Turing Tensor cores

2,560 CUDA cores

65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS

16GB | 320GB/s

ANNOUNCING TESLA T4
WORLD’S MOST ADVANCED INFERENCE GPU

Page 5

NEW TURING TENSOR CORE

MULTI-PRECISION FOR AI INFERENCE

65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4

Page 6

Up To 36X Faster Than CPUs | Accelerates All AI Workloads

WORLD’S MOST PERFORMANT INFERENCE PLATFORM

Speedup vs. CPU server (CPU Server / Tesla P4 / Tesla T4):
• Natural Language Processing Inference (GNMT): 1.0 / 10X / 36X — 36x faster
• Speech Inference (DeepSpeech 2): 1.0 / 4X / 21X — 21x faster
• Video Inference (ResNet-50, 7ms latency limit): 1.0 / 10X / 27X — 27x faster

Peak Performance (TFLOPS / TOPS): P4: 5.5 FP32 | 22 INT8; T4: 65 FP16 | 130 INT8 | 260 INT4

Page 7

NVIDIA TENSORRT OVERVIEW
From Every Framework, Optimized For Each Target Platform

TESLA V100

DRIVE PX 2

NVIDIA T4

JETSON TX2

NVIDIA DLA

TensorRT

Page 8

NVIDIA TENSORRT OVERVIEW
From Every Framework, Optimized For Each Target Platform

Quantized INT8 (Precision Optimization): significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss

Layer Fusion (Graph Optimization): improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single kernel execution

Kernel Auto-Tuning: optimizes execution time by choosing the best data layout and the best parallel algorithms for the target Jetson, Tesla or DrivePX GPU platform

Dynamic Tensor Memory (Memory Optimization): reduces memory footprint and improves memory re-use by allocating memory for each tensor only for the duration of its usage

Multi-Stream Execution: scales to multiple input streams by processing them in parallel using the same model and weights

Page 9

GRAPH OPTIMIZATION
• Vertical Fusion
• Horizontal Fusion
• Layer Elimination

[Figure: un-optimized Inception-style network — the input feeds 1x1, 3x3 and 5x5 convolution branches (each with bias and ReLU) plus max pool, joined by concat layers]

Network | Layers before | Layers after
VGG19 | 43 | 27
Inception V3 | 309 | 113
ResNet-152 | 670 | 159

Page 10

GRAPH OPTIMIZATION
• Vertical Fusion
• Horizontal Fusion
• Layer Elimination

[Figure: the same un-optimized network next to the TensorRT-optimized network, where each conv + bias + ReLU branch is fused into a single 1x1/3x3/5x5 CBR block and the redundant concat layer is eliminated]

Network | Layers before | Layers after
VGG19 | 43 | 27
Inception V3 | 309 | 113
ResNet-152 | 670 | 159

Page 11

ResNet-50 inference (images/sec | latency): CPU-Only: 140 | 14 ms; V100 + TensorFlow: 305 | 6.67 ms; V100 + TensorRT: 5,700 | 6.83 ms

Inference throughput (images/sec) on ResNet-50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT On. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to comprehend Intel's stated claim of 2x performance improvement on Skylake with AVX512.

OpenNMT inference (sentences/sec | latency): CPU-Only + Torch: 280 ms; V100 + Torch: 425 | 153 ms; V100 + TensorRT: 550 | 117 ms

Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT On. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT On.

TENSORRT PERFORMANCE

developer.nvidia.com/tensorrt

40x Faster CNNs on V100 vs. CPU-Only

Under 7ms Latency (ResNet50)

140x Faster Language Translation RNNs on

V100 vs. CPU-Only Inference (OpenNMT)

Page 12

AGENDA

• TensorRT Hyperscale Inference Platform overview

• TensorRT Inference Server

• Overview and Deep Dive: Key features

• Deployment possibilities: Generic deployment ecosystem

• Hands-on

• NVIDIA BERT Overview

• FasterTransformer and TRT optimized BERT inference

• Deploy BERT TensorFlow model with custom op

• Deploy BERT TensorRT model with plugins

• Benchmarking

• Open Discussion

DEEP INTO TRTIS

Page 13

INEFFICIENCY LIMITS INNOVATION
Difficulties with Deploying Data Center Inference

• Single Model Only: some systems are overused while others are underutilized
• Single Framework Only: solutions can only support models from one framework
• Custom Development: developers need to reinvent the plumbing for every application (e.g., ASR, NLP, recommender)

Page 14

NVIDIA TENSORRT INFERENCE SERVER
Architected for Maximum Datacenter Utilization

• Maximize real-time inference performance of GPUs
• Quickly deploy and manage multiple models per GPU per node
• Easily scale to heterogeneous GPUs and multi-GPU nodes
• Integrates with orchestration systems and auto scalers via latency and health metrics
• Now open source for thorough customization and integration

[Diagram: TensorRT Inference Server instances running across NVIDIA T4, Tesla V100, and Tesla P4 GPUs]

Page 15

FEATURES
Utilization, Usability, Performance, Customization

• Dynamic Batching: inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA
• Concurrent Model Execution: multiple models (or multiple instances of the same model) may execute on the GPU simultaneously
• CPU Model Inference Execution: framework-native models can execute inference requests on the CPU
• Multiple Model Format Support: PyTorch JIT (.pt), TensorFlow GraphDef/SavedModel, TensorFlow+TensorRT GraphDef, ONNX graph (ONNX Runtime), TensorRT Plans, Caffe2 NetDef (ONNX import path)
• Metrics: utilization, count, memory, and latency
• Model Control API: explicitly load/unload models into and out of TRTIS based on changes made in the model-control configuration
• System/CUDA Shared Memory: inputs/outputs that need to be passed to/from TRTIS are stored in system/CUDA shared memory, reducing HTTP/gRPC overhead
• Library Version: link against libtrtserver.so so that you can include all the inference server functionality directly in your application
• Custom Backend: gives the user more flexibility by allowing their own implementation of an execution engine through the use of a shared library
• Model Ensemble: pipeline of one or more models and the connection of input and output tensors between those models (can be used with a custom backend)
• Streaming API: built-in support for audio streaming input, e.g. for speech recognition

Page 16

INFERENCE SERVER ARCHITECTURE

Models supported:
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT Plans
● Caffe2 NetDef (ONNX import)
● ONNX graph
● PyTorch JIT (.pt)

Multi-GPU support

Concurrent model execution

Server HTTP REST API/gRPC (example requests below)

Python/C++ client libraries

Available with Monthly Updates
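The HTTP endpoints referenced above can be exercised with plain curl once the server is running. A minimal sketch, assuming the default HTTP port 8000 and the TRTIS 1.x (19.xx) REST paths; verify the exact paths against the release documentation:

   # liveness/readiness probes (also used by orchestrators)
   curl localhost:8000/api/health/live
   curl localhost:8000/api/health/ready

   # server-wide and per-model status
   curl localhost:8000/api/status
   curl localhost:8000/api/status/bert_trt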

Page 17

COMMON WAYS TO FULLY UTILIZE GPU

1. Increase computation intensity – increase the batch size
2. Execute multiple tasks concurrently with multiple streams or MPS (Multi-Process Service)

Page 18

DYNAMIC BATCHING SCHEDULER

[Diagram: batch-1 and batch-4 requests arrive at the TensorRT Inference Server; the dynamic batcher groups them and dispatches the batches to runtime contexts in the framework backend]

Page 19

DYNAMIC BATCHING SCHEDULER

[Diagram: the dynamic batcher for the ModelY backend groups incoming requests before dispatching them to the runtime contexts]

Preferred batch size and wait time are configuration options. Assume 4 gives the best utilization in this example.

Grouping requests into a single “batch” increases overall GPU throughput.

Page 20

DYNAMIC BATCHING

TensorRT Inference Server groups inference requests based on customer-defined metrics for optimal performance.

The customer defines 1) batch size (required) and 2) latency requirements (optional).

Example: No dynamic batching (batch size 1 & 8) vs dynamic batching

2.5X Faster Inferences/Second at a 50ms End-to-End Server Latency Threshold

Page 21

MPS VS CUDA STREAMS IN TRTIS

TRTIS CUDA streams are 1-4% slower than MPS but provide some usability advantages and other methods to maximize performance over MPS limitations.

MPS
• Multiple processes on a single GPU (no interconnect/intercommunication between processes)
• Shares GPU memory between multiple processes; if one process oversubscribes the memory, the others are starved - harder to coordinate memory usage
• Experimental in nv-docker

CUDA Streams
• One process on a single GPU with multiple streams/execution contexts
• More holistic view of memory - easier to coordinate memory usage
• Maximize GPU utilization by using batching vs. having several processes executing at batch size 1

Page 22

Concurrent Model Execution

max_batch_size: 8
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  },
  {
    count: 4
    kind: KIND_CPU
    gpus: [ 3, 4 ]
  }
]

Create one execution context for each instance of a group of a certain model.

Page 23

CONCURRENT MODEL EXECUTION - RESNET 50

[Diagram: inference requests enter the ResNet-50 request queue in the TensorRT Inference Server and are dispatched over time to 12 RN50 model instances, each on its own CUDA stream, on a single V100 16GB GPU]

4x Better Performance and Improved GPU Utilization Through Multiple Model Concurrency


Common Scenario: one API using multiple copies of the same model on a GPU.

Example: 12 instances of TRT FP16 ResNet-50 (each model takes 1.33GB of GPU memory) are loaded onto the GPU and can run concurrently on a 16GB V100 GPU. 14 concurrent inference requests arrive: each model instance fulfills one request simultaneously and 2 are queued in the per-model scheduler queues in TensorRT Inference Server to execute after the 12 requests finish.

With this configuration, 2,832 inferences per second at 33.94 ms with batch size 8 on each inference server instance is achieved.
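As a rough cross-check of those numbers: 12 instances x 8 inferences per batched request / 0.03394 s per batch ≈ 2,830 inferences per second, which lines up with the reported 2,832.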


Page 26

Concurrent Model Execution

max_batch_size: 8
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  },
  {
    count: 4
    kind: KIND_CPU
    gpus: [ 3, 4 ]
  }
]

Create one execution context for each instance of a group of a certain model.

Scheduling threads
Multiple streams
Priority: MAX, DEFAULT, MIN

Page 27

Model Control and Model Configuration

An HTTP POST to /api/modelcontrol/<load|unload>/<model name> loads or unloads a model from the inference server, as in the example below.
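A minimal sketch of the call, assuming the server's HTTP endpoint is on localhost:8000 and the model name is bert_trt (from the hands-on section later):

   curl -X POST localhost:8000/api/modelcontrol/load/bert_trt
   curl -X POST localhost:8000/api/modelcontrol/unload/bert_trt

These requests only take effect in the model control modes described next.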

Model Control Modes

1) NONE
• Server attempts to load all models at runtime
• Changes to the model repo will be ignored
• Model control API requests will have no effect

2) POLL
• Server attempts to load all models at runtime
• Changes to the model repo will be detected and the server will attempt to load and unload models based on those changes
• Model control API requests will have no effect

3) EXPLICIT
• Server does not load any models in the model repo at runtime
• All model loading and unloading must be initiated using the Model Control API

Local model repository

Page 28

Model Control and Model Configuration

name: "mymodel"platform: "tensorrt_plan"max_batch_size: 8input [

{name: "input0"data_type: TYPE_FP32dims: [ 16 ]reshape: { shape: [ ] } }

]output [

{name: "output0"data_type: TYPE_FP32dims: [ 16 ]

}]version_policy: { all { }}

instance_group [{count: 2kind: KIND_GPUgpus: [0, 1]

}]dynamic_batching {

preferred_batch_size: [ 4, 8 ]max_queue_delay_microseconds: 100

}optimization {

graph {level: 1

},cuda {graphs: 1

},priority: PRIORITY_MAX

}

• dims: -1 for dynamic; reshape for model-accepted dims
• Supports multiple backends (platform)
• Version control: serve selected versions
• Instances for concurrent execution: select multiple GPUs, select CPU or GPU for execution; there can be multiple groups
• Preferred batch size is configurable; set max queue delay for SLA control
• Multiple optimizations: set graph level to 1 to trigger TF XLA; set cuda graphs to 1 to use CUDA graphs for small-batch-size inference; set priority to max to set scheduler thread priority and CUDA stream priority (TRT only for now)
• ExecutionAccelerators: enable onnx-tensorrt or tensorflow-tensorrt to automatically benefit from the TensorRT integration

Page 29

MODEL ENSEMBLING

• Pipeline of one or more models and the connection of input and output tensors between those models
• Use for model stitching or data flow of multiple models, such as data preprocessing → inference → data post-processing
• Collects the output tensors of each step and provides them as input tensors for other steps according to the specification
• Ensemble models inherit the characteristics of the models involved, so the meta-data in the request header must comply with the models within the ensemble

ensemble_scheduling {
  step [
    {
      model_name: "image_preprocess_model"
      model_version: -1
      input_map {
        key: "RAW_IMAGE"
        value: "IMAGE"
      }
      output_map {
        key: "PREPROCESSED_OUTPUT"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "classification_model"
      model_version: -1
      input_map {
        key: "FORMATTED_IMAGE"
        value: "preprocessed_image"
      }
      output_map {
        key: "CLASSIFICATION_OUTPUT"
        value: "CLASSIFICATION"
      }
    },
    {
      model_name: "segmentation_model"
      model_version: -1
      input_map {
        key: "FORMATTED_IMAGE"
        value: "preprocessed_image"
      }
      output_map {
        key: "SEGMENTATION_OUTPUT"
        value: "SEGMENTATION"
      }
    }
  ]
}

(The preprocessing step can be implemented as a custom backend.)

Page 30

CUSTOM BACKEND
Integrate custom, non-framework code into TRTIS

It is not uncommon for a model to have some non-ML-model parts; BERT, for example, has a tokenizer and feature extractor.

A custom backend allows these parts to be integrated into TRTIS.

Implement the code as a shared library using the backwards-compatible C API.

Benefit from the full TRTIS feature set (same as framework backends):
• Dynamic batcher, sequence batcher, concurrent execution, multi-GPU, etc.

Provides deployment flexibility; TRTIS provides a standard, consistent interface protocol between models and custom components.

Page 31

STREAMING INFERENCE REQUESTS

New Streaming API: based on the correlation ID, the audio requests are sent to the appropriate batch slot in the sequence batcher.*

[Diagram: inference requests carrying correlation IDs (Corr 1-3) flow through per-model request queues into the DeepSpeech2 and Wav2Letter sequence batchers of the framework inference backend]

*Correct order of requests is assumed at entry into the endpoint. Note: Corr = Correlation ID.

Page 32

Streaming API
FSM maintained in StreamInferContext

[Diagram: state machine with Request Done, Initialized Done, Response Done, and Finished Done states; "Next" advances a streaming request, "Reset" returns to the start, and an unexpected input or a finish signal starts writing all remaining data back. The non-streaming path goes directly from Request Done to Finished Done and calls CompleteExecution() to write the result back.]

Streaming is bidirectional.

Page 33

TRTIS LIBRARY VERSION
Tightly couple TRTIS functionality into the controlling application via a shared library

• Smaller binary: plug the TRTIS library into an existing application
• Removes the existing REST and gRPC endpoints
• Still leverages GPU optimizations like dynamic batching and model concurrency
• Very low communication overhead (same system and CUDA memory address space)
• Backward-compatible C interface
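A minimal sketch of building against the library, assuming the headers and libtrtserver.so from a TRTIS release are installed under /opt/tensorrtserver (a hypothetical install path):

   # compile and link the host application against libtrtserver.so
   g++ -std=c++11 my_app.cc \
       -I/opt/tensorrtserver/include \
       -L/opt/tensorrtserver/lib -ltrtserver \
       -o my_app

   # make the shared library visible at run time
   export LD_LIBRARY_PATH=/opt/tensorrtserver/lib:$LD_LIBRARY_PATH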

Page 34

AVAILABLE METRICS

Category: GPU Utilization (per GPU, per second)
• Power usage: proxy for load on the GPU
• Power limit: maximum GPU power limit
• GPU utilization: GPU utilization rate [0.0 - 1.0)

Category: GPU Memory (per GPU, per second)
• GPU Total Memory: total GPU memory, in bytes
• GPU Used Memory: used GPU memory, in bytes

Category: Count, GPU & CPU (per model, per request)
• Request count: number of inference requests
• Execution count: number of model inference executions (request count / execution count = average dynamic request batching)
• Inference count: number of inferences performed (one request counts as “batch size” inferences)

Category: Latency, GPU & CPU (per model, per request)
• Latency: request time: end-to-end inference request handling time
• Latency: compute time: time a request spends executing the inference model (in the appropriate framework)
• Latency: queue time: time a request spends waiting in the queue before being executed

Page 35

PERF_CLIENT TOOL

• Measures throughput (inf/s) and latency under varying client loads
• perf_client modes:
  1. Specify how many concurrent outstanding requests, and it will find a stable latency and throughput for that level
  2. Generate a throughput vs. latency curve by increasing the request concurrency until a specified latency or concurrency limit is reached
• Generates a file containing CSV output of the results
• Easy steps to help visualize the throughput vs. latency tradeoffs
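A minimal sketch of the two modes, reusing the flags shown in the hands-on section (Page 53); the -f flag for the CSV output file is our recollection of the 19.xx client and should be checked with perf_client --help:

   # mode 1: fixed concurrency of 4 outstanding requests
   ./install/bin/perf_client -m bert_trt -i grpc -u localhost:8001 -b1 -p2000 -t4

   # mode 2: sweep concurrency from 1 up to 8, stop past a 200 ms latency limit, write CSV
   ./install/bin/perf_client -m bert_trt -i grpc -u localhost:8001 -b1 -p2000 -d -t1 -c8 -l200 -f bert_trt.csv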

Page 36

GENERIC INFERENCE SERVER DEPLOYMENT ARCHITECTURE

[Diagram: REC, IMG, and ASR client APIs call a load balancer in front of a containerized inference service (CPU/GPU) that performs pre-processing, sends requests to TRTIS (CPU/GPU) running TensorRT, TensorFlow, and C2/ONNX models for multiple workloads, and performs post-processing. A metrics service drives an auto scaler over the GPU cluster. Your training/pruning/validation flow dumps models into a model repository (network storage location) backed by a persistent volume. Legend: already existing components vs. new components from NVIDIA.]

Page 37

TENSORRT INFERENCE SERVER COLLABORATION WITH KUBEFLOW

For a more detailed explanation and step-by-step guidance for this collaboration, refer to this GitHub repo.

What is Kubeflow?
• Open-source project to make ML workflows on Kubernetes simple, portable, and scalable
• Customizable scripts and configuration files to deploy containers in the chosen environment

Problems it solves
• Easily set up an ML stack/pipeline that can fit into the majority of enterprise datacenter and multi-cloud environments

How it helps TensorRT Inference Server
• TensorRT Inference Server is deployed as a component inside of a production workflow to
  • Optimize GPU performance
  • Enable auto-scaling, traffic load balancing, and redundancy/failover via metrics

Page 38

TRTIS Helm Chart

Helm: Most used “package manager” for Kubernetes

We built a simple chart (“package”) for the TensorRT Inference Server.

You can use it to easily deploy an instance of the server. It can also be easily configured to point to a different image, model store, …

https://github.com/NVIDIA/tensorrt-inference-server/tree/b6b45ead074d57e3d18703b7c0273672c5e92893/deploy/single_server

Simple helm chart for installing a single instance of the NVIDIA TensorRT Inference Server
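A hedged sketch of using the chart (Helm 2 syntax, which was current for these releases); the release name is a placeholder and any value overrides should follow the chart's values.yaml:

   git clone https://github.com/NVIDIA/tensorrt-inference-server.git
   cd tensorrt-inference-server
   helm install --name trtis deploy/single_server
   # override the image or model store via --set key=value (see the chart's values.yaml)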

Page 39

AGENDA

• TensorRT Hyperscale Inference Platform overview

• TensorRT Inference Server

• Overview and Deep Dive: Key features

• Deployment possibilities: Generic deployment ecosystem

• Hands-on

• NVIDIA BERT Overview

• FasterTransformer and TRT optimized BERT inference

• Deploy BERT TensorFlow model with custom op

• Deploy BERT TensorRT model with plugins

• Benchmarking

• Open Discussion

DEEP INTO TRTIS

Page 40

WHAT IS BERT?

BERT: Bidirectional Encoder Representations from Transformers

Widely used in multiple NLP tasks due to its high accuracy.

Page 41

WHAT IS BERT
Transformer Encoder Part

Page 42

TENSORFLOW INFERENCE

Plain TF inference is not efficient:

1. TF ops are very small, so kernel launches cost a lot of time; e.g., GELU/LayerNorm each consist of several small ops
2. Multi-head self-attention lacks an efficient GPU implementation
3. TF scheduling is not good

Page 43

NVIDIA’S INFERENCE

Optimization ideas:

1. Optimize the calculations with CUDA and integrate the implementation into TF with a custom op

2. Optimize the inference with TensorRT

3. Algorithm Level Acceleration

Page 44

NVIDIA’S INFERENCE
CUDA Optimization - Performance

<batch_size, layers, seq_len, head_num, size_per_head> | P4 FP32 (ms) | T4 FP32 (ms) | T4 FP16 (ms)
(1, 12, 32, 12, 64)  | 3.43 | 2.74 | 1.56
(1, 12, 64, 12, 64)  | 4.04 | 3.64 | 1.77
(1, 12, 128, 12, 64) | 6.22 | 5.93 | 2.23

Performance over different seq_len on P4, T4

Page 45

NVIDIA’S INFERENCE
CUDA Optimization - Resources

Where you can find it:

FasterTransformer project (open-sourced): https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer

Page 46

NVIDIA’S INFERENCE
TRT Optimization

Page 47

NVIDIA’S INFERENCE
TRT Optimization

Before

After

Page 48

NVIDIA’S INFERENCE
TRT Optimization - Resources

Where you can find it:

BERT TRT demo (open-sourced): https://github.com/NVIDIA/TensorRT/tree/release/6.0/demo/BERT (to be re-located to DeepLearningExamples)
Blog: https://devblogs.nvidia.com/nlu-with-tensorrt-bert/

Page 49

HANDS-ON
Deploy BERT TensorFlow model with custom op

1. Follow the FasterTransformer README and generate the custom op lib: libtf_fastertransformer.so
2. Prepare gemm_config.in for the best GEMM algorithms by running the built binaries
3. Modify sample/tensorflow_bert/profile_bert_inference.py to create the SQuAD model, and use the saved_model API to export the model
4. Arrange the exported model in a tree structure: ./bert_ft/1/model.savedmodel/<exported files> (see the sketch below)
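A minimal sketch of step 4, assuming the SavedModel was exported to ./exported_squad_savedmodel (a hypothetical name) and using the repository layout shown later on Page 51:

   mkdir -p model_repository/bert_fastertransformer/1
   cp -r ./exported_squad_savedmodel model_repository/bert_fastertransformer/1/model.savedmodel
   cp config.pbtxt model_repository/bert_fastertransformer/   # the config shown on Page 51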

Page 50

HANDS-ON
Deploy BERT TensorRT model with plugins

1. Follow the TensorRT/demo/BERT README and generate the plugin libs: libbert_plugins.so and libcommon.so
2. Follow the README and run sample_bert with the additional arg --saveEngine=model.plan
3. Arrange the model dir in a tree structure: bert_trt/1/model.plan (see the sketch below)
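A minimal sketch of steps 2-3; everything except the --saveEngine flag is a placeholder for whatever the demo README specifies:

   # serialize the engine while running the demo
   ./sample_bert <args from the demo README> --saveEngine=model.plan

   # arrange it for TRTIS using the layout from Page 51
   mkdir -p model_repository/bert_trt/1
   cp model.plan model_repository/bert_trt/1/model.plan
   cp config.pbtxt model_repository/bert_trt/   # the config shown on Page 51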

Page 51

HANDS-ON
Prepare model_repository, run trtserver and perf_client

1. Prepare the model_repository

Model directory:

model_repository/
|-- bert_fastertransformer
|   |-- 1
|   |   `-- model.savedmodel
|   |       |-- saved_model.pb
|   |       `-- variables
|   |           |-- variables.data-00000-of-00001
|   |           `-- variables.index
|   `-- config.pbtxt
`-- bert_trt
    |-- 1
    |   `-- model.plan
    `-- config.pbtxt

config.pbtxt (bert_fastertransformer):

name: "bert_fastertransformer"
platform: "tensorflow_savedmodel"
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ 1, 128 ]
  },
  {
    name: "input_mask"
    data_type: TYPE_INT32
    dims: [ 1, 128 ]
  },
  {
    name: "segment_ids"
    data_type: TYPE_INT32
    dims: [ 1, 128 ]
  }
]
output [
  {
    name: "prediction"
    data_type: TYPE_FP32
    dims: [ 2, 1, 128 ]
  }
]
instance_group {
  kind: KIND_GPU
  count: 1
}
version_policy: { specific { versions: [1] } }

name: "bert_trt"platform: "tensorrt_plan"max_batch_size: 1input [

{name: "segment_ids"data_type: TYPE_INT32dims: 128

},{

name: "input_ids"data_type: TYPE_INT32dims: 128

},{

name: "input_mask"data_type: TYPE_INT32dims: 128

}]output [

{name: "cls_squad_logits"data_type: TYPE_FP32dims: [128,2,1,1]

}]

instance_group {kind: KIND_GPUcount: 1

}

version_policy: {specific { versions: [32] }}

config.pbtxt config.pbtxt

Model directory

Page 52

HANDS-ON
Prepare model_repository, run trtserver and perf_client

1. Launch trtserver over HTTP/gRPC
   1. NV_GPU=x nvidia-docker run --rm -it --name=trtis_bert -p8000:8000 -p8001:8001 -v /path/to/model_repository:/models nvcr.io/nvidia/tensorrtserver:19.11-py3
   2. export LD_PRELOAD=/path/to/{libcommon.so; libbert_plugins.so; libtf_fastertransformer.so}
   3. trtserver --model-store=/models --log-verbose=1 --strict-model-config=True

Page 53

HANDS-ON
Prepare model_repository, run trtserver and perf_client

1. Run perf_client to infer over gRPC
   1. Launch docker: docker run --net=host --rm -it nvcr.io/nvidia/tensorrtserver-clients
   2. ./install/bin/perf_client -m bert_trt -d -c8 -l200 -p2000 -b1 -i grpc -u localhost:8001 -t1 --max-threads=8

Result reported by perf_client:

Request concurrency: 1
Client:
  Request count: 59
  Throughput: 944 infer/sec
  Avg latency: 34422 usec (standard deviation 288 usec)
  p50 latency: 34457 usec
  p90 latency: 34667 usec
  p95 latency: 34877 usec
  p99 latency: 35130 usec
  Avg gRPC time: 34452 usec ((un)marshal request/response 27 usec + response wait 34425 usec)
Server:
  Request count: 70
  Avg request latency: 33473 usec (overhead 26 usec + queue 58 usec + compute 33389 usec)

Page 54

HANDS-ON
Benchmarking - FasterTransformer

SQuAD task inference (FasterTransformer): batch size = 1, TensorFlow backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS   | AvgL (ms) | TP99 (ms) | Concurrent
CPU       | FP32      | 7.7   | 131       | 203       | 1
CPU       | FP32      | 10.4  | 289       | 339       | 4
GPU       | FP32      | 104.5 | 9.5       | 11.8      | 1
GPU       | FP32      | 137   | 21.9      | 23.7      | 4
GPU       | FP16      | 267.5 | 3.7       | 3.9       | 1
GPU       | FP16      | 461.5 | 8.7       | 10.3      | 4

CPU -> multi-thread CPU -> GPU FP32 -> concurrent GPU FP32 -> GPU FP16 -> concurrent GPU FP16: QPS 7.7 -> 104.5 -> 461.5, average latency 131 ms -> 9.5 ms -> 8.7 ms

Virtual GPU feature in TensorFlow to enable multi-stream:
--tf-add-vgpu="0;4;3000"
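For example, the flag can be appended to the trtserver launch from Page 52; our reading of "0;4;3000" as GPU 0, 4 virtual GPUs, 3000 MB each is an assumption to verify against the release notes:

   trtserver --model-store=/models --log-verbose=1 --tf-add-vgpu="0;4;3000"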

Page 55

HANDS-ON
Benchmarking - FasterTransformer

SQuAD task inference (FasterTransformer): batch size = 32, TensorFlow backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS  | AvgL (ms) | TP99 (ms) | Concurrent
CPU       | FP32      | 0.4  | 2491      | 2810      | 1
GPU       | FP32      | 5.5  | 182       | 184       | 1
GPU       | FP32      | 5    | 186       | 186       | 4
GPU       | FP16      | 21.8 | 46        | 48.8      | 1
GPU       | FP16      | 21.6 | 46.1      | 48.1      | 4

CPU -> GPU FP32 -> GPU FP16: QPS 0.4 -> 5.5 -> 21.8, average latency 2491 ms -> 182 ms -> 46 ms

Page 56

HANDS-ON
Benchmarking - TensorRT

SQuAD task inference (TensorRT): batch size = 1, TensorRT backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS   | AvgL (ms) | TP99 (ms) | Concurrent
GPU       | FP32      | 163   | 12.3      | 12.4      | 1
GPU       | FP32      | 156   | 12.8      | 14        | 4
GPU       | FP16      | 438.5 | 4.6       | 4.6       | 1
GPU       | FP16      | 473.5 | 4.2       | 5.1       | 4

GPU FP32 -> GPU FP16: QPS 163 -> 473.5, average latency 12.3 ms -> 4.2 ms

Page 57

HANDS-ON
Benchmarking - TensorRT

SQuAD task inference (TensorRT): batch size = 32, TensorRT backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS  | AvgL (ms) | TP99 (ms) | Concurrent
GPU       | FP32      | 6.5  | 157       | 159       | 1
GPU       | FP32      | 6.5  | 316       | 356       | 4
GPU       | FP16      | 29.5 | 34.2      | 34.8      | 1
GPU       | FP16      | 30.5 | 134       | 151       | 4

GPU FP32 -> GPU FP16: QPS 6.5 -> 30.5, average latency 157 ms -> 134 ms

Page 58

LEARN MORE AND DOWNLOAD TO USE

Learn more here:
https://nvidia.com/data-center-inference
https://docs.nvidia.com/deeplearning/sdk/inference-release-notes/index.html
https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/quickstart.html

Get the ready-to-deploy container with monthly updates from the NGC container registry:
https://ngc.nvidia.com/catalog/containers/nvidia%2Ftensorrtserver

Open source GitHub repository:
https://github.com/NVIDIA/tensorrt-inference-server

Page 59

ADDITIONAL RESOURCES

Engineering developer blog (benchmarks, model concurrency, etc.):
https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/

Kubeflow guest blog:
https://www.kubeflow.org/blog/nvidia_tensorrt/

Open source announcement:
https://news.developer.nvidia.com/nvidia-tensorrt-inference-server-now-open-source

More:
• Data center inference page & TensorRT page
• DevTalk Forum for Support
• TensorRT Hyperscale Inference Platform infographic
• NVIDIA AI Inference Platform technical overview
• NVIDIA TensorRT Inference Server and Kubeflow
• NVIDIA TensorRT Inference Server Now Available

Page 60

NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.

NVIDIA (DLI)

• GPU

• DLI

www.nvidia.cn/dli

Page 61

DLI Full-Day Deep Learning Training @ GTC CHINA 2019

Global developer training certificates | Fully configured GPU lab environments | 5 new courses debuting | Annual 40%-off special

View courses and register | Training inquiries

Courses:
• Training Neural Networks with Multiple GPUs: tackle the algorithmic and engineering challenges of large-scale neural network training
• CUDA Python: easily accelerate Python applications on the GPU
• Computer Vision from Zero: deep learning methods and practice
• Natural Language Processing (NLP): essential theory and applied skills
• Advanced applications fusing multi-data-type machine vision and NLP techniques
• Perception Systems for Autonomous Vehicles (new 2019 edition): learn to build autonomous vehicles with NVIDIA DRIVE AGX
• Deep Learning for Industrial Inspection: build automated industrial inspection models
• Developing AI Applications with Jetson Nano: robotics fundamentals; receive your own Jetson Nano kit

Page 62