DEEP INTO TRTIS: BERT PRACTICAL DEPLOYMENT ON NVIDIA GPU
Xu Tianhao, Deep Learning Solution Architect, NVIDIA


Page 1

DEEP INTO TRTIS: BERT PRACTICAL DEPLOYMENT ON NVIDIA GPU
Xu Tianhao, Deep Learning Solution Architect, NVIDIA

Page 2

AGENDA

• TensorRT Hyperscale Inference Platform overview

• TensorRT Inference Server

• Overview and Deep Dive: Key features

• Deployment possibilities: Generic deployment ecosystem

• Hands-on

• NVIDIA BERT Overview

• FasterTransformer and TRT optimized BERT inference

• Deploy BERT TensorFlow model with custom op

• Deploy BERT TensorRT model with plugins

• Benchmarking

• Open Discussion

DEEP INTO TRTIS

Page 3

WORLD’S MOST ADVANCED SCALE-OUT GPU

INTEGRATED INTO TENSORFLOW & ONNX SUPPORT

TENSORRT HYPERSCALE INFERENCE PLATFORM

TENSORRT INFERENCE SERVER

Page 4

Universal Inference Acceleration

320 Turing Tensor cores

2,560 CUDA cores

65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS

16GB | 320GB/s

ANNOUNCING TESLA T4
WORLD’S MOST ADVANCED INFERENCE GPU

Page 5

NEW TURING TENSOR CORE

MULTI-PRECISION FOR AI INFERENCE

65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4

Page 6

Up To 36X Faster Than CPUs | Accelerates All AI Workloads

WORLD’S MOST PERFORMANT INFERENCE PLATFORM

Speedup vs. CPU server (CPU Server / Tesla P4 / Tesla T4):
• Natural Language Processing Inference (GNMT): 1.0 / 10X / 36X — 36x faster
• Speech Inference (DeepSpeech 2): 1.0 / 4X / 21X — 21x faster
• Video Inference (ResNet-50, 7ms latency limit): 1.0 / 10X / 27X — 27x faster

Peak Performance (TFLOPS / TOPS): P4: 5.5 FP32 | 22 INT8; T4: 65 FP16 | 130 INT8 | 260 INT4

Page 7

NVIDIA TENSORRT OVERVIEW
From Every Framework, Optimized For Each Target Platform

TESLA V100

DRIVE PX 2

NVIDIA T4

JETSON TX2

NVIDIA DLA

TensorRT

Page 8

NVIDIA TENSORRT OVERVIEW
From Every Framework, Optimized For Each Target Platform

Quantized INT8 (Precision Optimization): significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss

Layer Fusion (Graph Optimization): improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single kernel execution

Kernel Auto-Tuning: optimizes execution time by choosing the best data layout and the best parallel algorithms for the target Jetson, Tesla or DrivePX GPU platform

Dynamic Tensor Memory (Memory Optimization): reduces memory footprint and improves memory re-use by allocating memory for each tensor only for the duration of its usage

Multi-Stream Execution: scales to multiple input streams by processing them in parallel using the same model and weights

Page 9

GRAPH OPTIMIZATION
• Vertical Fusion
• Horizontal Fusion
• Layer Elimination

[Figure: un-optimized Inception-style network — the input feeds 1x1, 3x3 and 5x5 convolution branches (each with bias and ReLU) plus max pool, joined by concat layers]

Network | Layers before | Layers after
VGG19 | 43 | 27
Inception V3 | 309 | 113
ResNet-152 | 670 | 159

Page 10

GRAPH OPTIMIZATION
• Vertical Fusion
• Horizontal Fusion
• Layer Elimination

[Figure: the same un-optimized network next to the TensorRT-optimized network, where each conv + bias + ReLU branch is fused into a single 1x1/3x3/5x5 CBR block and the redundant concat layer is eliminated]

Network | Layers before | Layers after
VGG19 | 43 | 27
Inception V3 | 309 | 113
ResNet-152 | 670 | 159

Page 11

ResNet-50 inference (images/sec | latency): CPU-Only: 140 | 14 ms; V100 + TensorFlow: 305 | 6.67 ms; V100 + TensorRT: 5,700 | 6.83 ms

Inference throughput (images/sec) on ResNet-50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT On. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to comprehend Intel's stated claim of 2x performance improvement on Skylake with AVX512.

OpenNMT inference (sentences/sec | latency): CPU-Only + Torch: 280 ms; V100 + Torch: 425 | 153 ms; V100 + TensorRT: 550 | 117 ms

Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT On. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT On.

TENSORRT PERFORMANCE

developer.nvidia.com/tensorrt

40x Faster CNNs on V100 vs. CPU-Only

Under 7ms Latency (ResNet50)

140x Faster Language Translation RNNs on

V100 vs. CPU-Only Inference (OpenNMT)

Page 12

AGENDA

• TensorRT Hyperscale Inference Platform overview

• TensorRT Inference Server

• Overview and Deep Dive: Key features

• Deployment possibilities: Generic deployment ecosystem

• Hands-on

• NVIDIA BERT Overview

• FasterTransformer and TRT optimized BERT inference

• Deploy BERT TensorFlow model with custom op

• Deploy BERT TensorRT model with plugins

• Benchmarking

• Open Discussion

DEEP INTO TRTIS

Page 13

INEFFICIENCY LIMITS INNOVATION
Difficulties with Deploying Data Center Inference

• Single Model Only: some systems are overused while others are underutilized
• Single Framework Only: solutions can only support models from one framework
• Custom Development: developers need to reinvent the plumbing for every application (e.g., ASR, NLP, recommender)

Page 14

NVIDIA TENSORRT INFERENCE SERVER
Architected for Maximum Datacenter Utilization

• Maximize real-time inference performance of GPUs
• Quickly deploy and manage multiple models per GPU per node
• Easily scale to heterogeneous GPUs and multi-GPU nodes
• Integrates with orchestration systems and auto scalers via latency and health metrics
• Now open source for thorough customization and integration

[Diagram: TensorRT Inference Server instances running across NVIDIA T4, Tesla V100, and Tesla P4 GPUs]

Page 15

FEATURES
Utilization, Usability, Performance, Customization

• Dynamic Batching: inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA
• Concurrent Model Execution: multiple models (or multiple instances of the same model) may execute on the GPU simultaneously
• CPU Model Inference Execution: framework-native models can execute inference requests on the CPU
• Multiple Model Format Support: PyTorch JIT (.pt), TensorFlow GraphDef/SavedModel, TensorFlow+TensorRT GraphDef, ONNX graph (ONNX Runtime), TensorRT Plans, Caffe2 NetDef (ONNX import path)
• Metrics: utilization, count, memory, and latency
• Model Control API: explicitly load/unload models into and out of TRTIS based on changes made in the model-control configuration
• System/CUDA Shared Memory: inputs/outputs that need to be passed to/from TRTIS are stored in system/CUDA shared memory, reducing HTTP/gRPC overhead
• Library Version: link against libtrtserver.so so that you can include all the inference server functionality directly in your application
• Custom Backend: gives the user more flexibility by allowing their own implementation of an execution engine through the use of a shared library
• Model Ensemble: pipeline of one or more models and the connection of input and output tensors between those models (can be used with a custom backend)
• Streaming API: built-in support for audio streaming input, e.g. for speech recognition

Page 16

INFERENCE SERVER ARCHITECTURE

Models supported:
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT Plans
● Caffe2 NetDef (ONNX import)
● ONNX graph
● PyTorch JIT (.pt)

Multi-GPU support

Concurrent model execution

Server HTTP REST API/gRPC (example requests below)

Python/C++ client libraries

Available with Monthly Updates
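The HTTP endpoints referenced above can be exercised with plain curl once the server is running. A minimal sketch, assuming the default HTTP port 8000 and the TRTIS 1.x (19.xx) REST paths; verify the exact paths against the release documentation:

   # liveness/readiness probes (also used by orchestrators)
   curl localhost:8000/api/health/live
   curl localhost:8000/api/health/ready

   # server-wide and per-model status
   curl localhost:8000/api/status
   curl localhost:8000/api/status/bert_trt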

Page 17

COMMON WAYS TO FULLY UTILIZE GPU

1. Increase computation intensity – increase the batch size
2. Execute multiple tasks concurrently with multiple streams or MPS (Multi-Process Service)

Page 18

DYNAMIC BATCHING SCHEDULER

[Diagram: batch-1 and batch-4 requests arrive at the TensorRT Inference Server; the dynamic batcher groups them and dispatches the batches to runtime contexts in the framework backend]

Page 19

DYNAMIC BATCHING SCHEDULER

[Diagram: the dynamic batcher for the ModelY backend groups incoming requests before dispatching them to the runtime contexts]

Preferred batch size and wait time are configuration options. Assume 4 gives the best utilization in this example.

Grouping requests into a single “batch” increases overall GPU throughput.

Page 20

DYNAMIC BATCHING

TensorRT Inference Server groups inference requests based on customer-defined metrics for optimal performance.

The customer defines 1) batch size (required) and 2) latency requirements (optional).

Example: No dynamic batching (batch size 1 & 8) vs dynamic batching

2.5X Faster Inferences/Second at a 50ms End-to-End Server Latency Threshold

Page 21

MPS VS CUDA STREAMS IN TRTIS

TRTIS CUDA streams are 1-4% slower than MPS but provide some usability advantages and other methods to maximize performance over MPS limitations.

MPS
• Multiple processes on a single GPU (no interconnect/intercommunication between processes)
• Shares GPU memory between multiple processes; if one process oversubscribes the memory, the others are starved - harder to coordinate memory usage
• Experimental in nv-docker

CUDA Streams
• One process on a single GPU with multiple streams/execution contexts
• More holistic view of memory - easier to coordinate memory usage
• Maximize GPU utilization by using batching vs. having several processes executing at batch size 1

Page 22

Concurrent Model Execution

max_batch_size: 8
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  },
  {
    count: 4
    kind: KIND_CPU
    gpus: [ 3, 4 ]
  }
]

Create one execution context for each instance of a group of a certain model.

Page 23

CONCURRENT MODEL EXECUTION - RESNET 50

[Diagram: inference requests enter the ResNet-50 request queue in the TensorRT Inference Server and are dispatched over time to 12 RN50 model instances, each on its own CUDA stream, on a single V100 16GB GPU]

4x Better Performance and Improved GPU Utilization Through Multiple Model Concurrency


Common Scenario: one API using multiple copies of the same model on a GPU.

Example: 12 instances of TRT FP16 ResNet-50 (each model takes 1.33GB of GPU memory) are loaded onto the GPU and can run concurrently on a 16GB V100 GPU. 14 concurrent inference requests arrive: each model instance fulfills one request simultaneously and 2 are queued in the per-model scheduler queues in TensorRT Inference Server to execute after the 12 requests finish.

With this configuration, 2,832 inferences per second at 33.94 ms with batch size 8 on each inference server instance is achieved.
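As a rough cross-check of those numbers: 12 instances x 8 inferences per batched request / 0.03394 s per batch ≈ 2,830 inferences per second, which lines up with the reported 2,832.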


Page 26

Concurrent Model Execution

max_batch_size: 8
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  },
  {
    count: 4
    kind: KIND_CPU
    gpus: [ 3, 4 ]
  }
]

Create one execution context for each instance of a group of a certain model.

Scheduling threads
Multiple streams
Priority: MAX, DEFAULT, MIN

Page 27

Model Control and Model Configuration

An HTTP POST to /api/modelcontrol/<load|unload>/<model name> loads or unloads a model from the inference server, as in the example below.
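A minimal sketch of the call, assuming the server's HTTP endpoint is on localhost:8000 and the model name is bert_trt (from the hands-on section later):

   curl -X POST localhost:8000/api/modelcontrol/load/bert_trt
   curl -X POST localhost:8000/api/modelcontrol/unload/bert_trt

These requests only take effect in the model control modes described next.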

Model Control Modes

1) NONE
• Server attempts to load all models at runtime
• Changes to the model repo will be ignored
• Model control API requests will have no effect

2) POLL
• Server attempts to load all models at runtime
• Changes to the model repo will be detected and the server will attempt to load and unload models based on those changes
• Model control API requests will have no effect

3) EXPLICIT
• Server does not load any models in the model repo at runtime
• All model loading and unloading must be initiated using the Model Control API

Local model repository

Page 28

Model Control and Model Configuration

name: "mymodel"platform: "tensorrt_plan"max_batch_size: 8input [

{name: "input0"data_type: TYPE_FP32dims: [ 16 ]reshape: { shape: [ ] } }

]output [

{name: "output0"data_type: TYPE_FP32dims: [ 16 ]

}]version_policy: { all { }}

instance_group [{count: 2kind: KIND_GPUgpus: [0, 1]

}]dynamic_batching {

preferred_batch_size: [ 4, 8 ]max_queue_delay_microseconds: 100

}optimization {

graph {level: 1

},cuda {graphs: 1

},priority: PRIORITY_MAX

}

• dims: -1 for dynamic; reshape for model-accepted dims
• Supports multiple backends (platform)
• Version control: serve selected versions
• Instances for concurrent execution: select multiple GPUs, select CPU or GPU for execution; there can be multiple groups
• Preferred batch size is configurable; set max queue delay for SLA control
• Multiple optimizations: set graph level to 1 to trigger TF XLA; set cuda graphs to 1 to use CUDA graphs for small-batch-size inference; set priority to max to set scheduler thread priority and CUDA stream priority (TRT only for now)
• ExecutionAccelerators: enable onnx-tensorrt or tensorflow-tensorrt to automatically benefit from the TensorRT integration

Page 29

MODEL ENSEMBLING

• Pipeline of one or more models and the connection of input and output tensors between those models
• Use for model stitching or data flow of multiple models, such as data preprocessing → inference → data post-processing
• Collects the output tensors of each step and provides them as input tensors for other steps according to the specification
• Ensemble models inherit the characteristics of the models involved, so the meta-data in the request header must comply with the models within the ensemble

ensemble_scheduling {
  step [
    {
      model_name: "image_preprocess_model"
      model_version: -1
      input_map {
        key: "RAW_IMAGE"
        value: "IMAGE"
      }
      output_map {
        key: "PREPROCESSED_OUTPUT"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "classification_model"
      model_version: -1
      input_map {
        key: "FORMATTED_IMAGE"
        value: "preprocessed_image"
      }
      output_map {
        key: "CLASSIFICATION_OUTPUT"
        value: "CLASSIFICATION"
      }
    },
    {
      model_name: "segmentation_model"
      model_version: -1
      input_map {
        key: "FORMATTED_IMAGE"
        value: "preprocessed_image"
      }
      output_map {
        key: "SEGMENTATION_OUTPUT"
        value: "SEGMENTATION"
      }
    }
  ]
}

(The preprocessing step can be implemented as a custom backend.)

Page 30

CUSTOM BACKEND
Integrate custom, non-framework code into TRTIS

It is not uncommon for a model to have some non-ML-model parts; BERT, for example, has a tokenizer and feature extractor.

A custom backend allows these parts to be integrated into TRTIS.

Implement the code as a shared library using the backwards-compatible C API.

Benefit from the full TRTIS feature set (same as framework backends):
• Dynamic batcher, sequence batcher, concurrent execution, multi-GPU, etc.

Provides deployment flexibility; TRTIS provides a standard, consistent interface protocol between models and custom components.

Page 31

STREAMING INFERENCE REQUESTS

New Streaming API: based on the correlation ID, the audio requests are sent to the appropriate batch slot in the sequence batcher.*

[Diagram: inference requests carrying correlation IDs (Corr 1-3) flow through per-model request queues into the DeepSpeech2 and Wav2Letter sequence batchers of the framework inference backend]

*Correct order of requests is assumed at entry into the endpoint. Note: Corr = Correlation ID.

Page 32

Streaming API
FSM maintained in StreamInferContext

[Diagram: state machine with Request Done, Initialized Done, Response Done, and Finished Done states; "Next" advances a streaming request, "Reset" returns to the start, and an unexpected input or a finish signal starts writing all remaining data back. The non-streaming path goes directly from Request Done to Finished Done and calls CompleteExecution() to write the result back.]

Streaming is bidirectional.

Page 33

TRTIS LIBRARY VERSION
Tightly couple TRTIS functionality into the controlling application via a shared library

• Smaller binary: plug the TRTIS library into an existing application
• Removes the existing REST and gRPC endpoints
• Still leverages GPU optimizations like dynamic batching and model concurrency
• Very low communication overhead (same system and CUDA memory address space)
• Backward-compatible C interface
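A minimal sketch of building against the library, assuming the headers and libtrtserver.so from a TRTIS release are installed under /opt/tensorrtserver (a hypothetical install path):

   # compile and link the host application against libtrtserver.so
   g++ -std=c++11 my_app.cc \
       -I/opt/tensorrtserver/include \
       -L/opt/tensorrtserver/lib -ltrtserver \
       -o my_app

   # make the shared library visible at run time
   export LD_LIBRARY_PATH=/opt/tensorrtserver/lib:$LD_LIBRARY_PATH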

Page 34

AVAILABLE METRICS

Category: GPU Utilization (per GPU, per second)
• Power usage: proxy for load on the GPU
• Power limit: maximum GPU power limit
• GPU utilization: GPU utilization rate [0.0 - 1.0)

Category: GPU Memory (per GPU, per second)
• GPU Total Memory: total GPU memory, in bytes
• GPU Used Memory: used GPU memory, in bytes

Category: Count, GPU & CPU (per model, per request)
• Request count: number of inference requests
• Execution count: number of model inference executions (request count / execution count = average dynamic request batching)
• Inference count: number of inferences performed (one request counts as “batch size” inferences)

Category: Latency, GPU & CPU (per model, per request)
• Latency: request time: end-to-end inference request handling time
• Latency: compute time: time a request spends executing the inference model (in the appropriate framework)
• Latency: queue time: time a request spends waiting in the queue before being executed

Page 35

PERF_CLIENT TOOL

• Measures throughput (inf/s) and latency under varying client loads
• perf_client modes:
  1. Specify how many concurrent outstanding requests, and it will find a stable latency and throughput for that level
  2. Generate a throughput vs. latency curve by increasing the request concurrency until a specified latency or concurrency limit is reached
• Generates a file containing CSV output of the results
• Easy steps to help visualize the throughput vs. latency tradeoffs
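A minimal sketch of the two modes, reusing the flags shown in the hands-on section (Page 53); the -f flag for the CSV output file is our recollection of the 19.xx client and should be checked with perf_client --help:

   # mode 1: fixed concurrency of 4 outstanding requests
   ./install/bin/perf_client -m bert_trt -i grpc -u localhost:8001 -b1 -p2000 -t4

   # mode 2: sweep concurrency from 1 up to 8, stop past a 200 ms latency limit, write CSV
   ./install/bin/perf_client -m bert_trt -i grpc -u localhost:8001 -b1 -p2000 -d -t1 -c8 -l200 -f bert_trt.csv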

Page 36

GENERIC INFERENCE SERVER DEPLOYMENT ARCHITECTURE

[Diagram: REC, IMG, and ASR client APIs call a load balancer in front of a containerized inference service (CPU/GPU) that performs pre-processing, sends requests to TRTIS (CPU/GPU) running TensorRT, TensorFlow, and C2/ONNX models for multiple workloads, and performs post-processing. A metrics service drives an auto scaler over the GPU cluster. Your training/pruning/validation flow dumps models into a model repository (network storage location) backed by a persistent volume. Legend: already existing components vs. new components from NVIDIA.]

Page 37

TENSORRT INFERENCE SERVER COLLABORATION WITH KUBEFLOW

For a more detailed explanation and step-by-step guidance for this collaboration, refer to this GitHub repo.

What is Kubeflow?
• Open-source project to make ML workflows on Kubernetes simple, portable, and scalable
• Customizable scripts and configuration files to deploy containers in the chosen environment

Problems it solves
• Easily set up an ML stack/pipeline that can fit into the majority of enterprise datacenter and multi-cloud environments

How it helps TensorRT Inference Server
• TensorRT Inference Server is deployed as a component inside of a production workflow to
  • Optimize GPU performance
  • Enable auto-scaling, traffic load balancing, and redundancy/failover via metrics

Page 38

TRTIS Helm Chart

Helm: Most used “package manager” for Kubernetes

We built a simple chart (“package”) for the TensorRT Inference Server.

You can use it to easily deploy an instance of the server. It can also be easily configured to point to a different image, model store, …

https://github.com/NVIDIA/tensorrt-inference-server/tree/b6b45ead074d57e3d18703b7c0273672c5e92893/deploy/single_server

Simple helm chart for installing a single instance of the NVIDIA TensorRT Inference Server
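A hedged sketch of using the chart (Helm 2 syntax, which was current for these releases); the release name is a placeholder and any value overrides should follow the chart's values.yaml:

   git clone https://github.com/NVIDIA/tensorrt-inference-server.git
   cd tensorrt-inference-server
   helm install --name trtis deploy/single_server
   # override the image or model store via --set key=value (see the chart's values.yaml)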

Page 39

AGENDA

• TensorRT Hyperscale Inference Platform overview

• TensorRT Inference Server

• Overview and Deep Dive: Key features

• Deployment possibilities: Generic deployment ecosystem

• Hands-on

• NVIDIA BERT Overview

• FasterTransformer and TRT optimized BERT inference

• Deploy BERT TensorFlow model with custom op

• Deploy BERT TensorRT model with plugins

• Benchmarking

• Open Discussion

DEEP INTO TRTIS

Page 40

WHAT IS BERT?

BERT: Bidirectional Encoder Representations from Transformers

Widely used in multiple NLP tasks due to its high accuracy.

Page 41

WHAT IS BERT
Transformer Encoder Part

Page 42

TENSORFLOW INFERENCE

Plain TF inference is not efficient:

1. TF ops are very small, so kernel launches cost a lot of time; e.g., GELU/LayerNorm each consist of several small ops
2. Multi-head self-attention lacks an efficient GPU implementation
3. TF scheduling is not good

Page 43

NVIDIA’S INFERENCE

Optimization ideas:

1. Optimize the calculations with CUDA and integrate the implementation into TF with a custom op

2. Optimize the inference with TensorRT

3. Algorithm Level Acceleration

Page 44

NVIDIA’S INFERENCE
CUDA Optimization - Performance

<batch_size, layers, seq_len, head_num, size_per_head> | P4 FP32 (ms) | T4 FP32 (ms) | T4 FP16 (ms)
(1, 12, 32, 12, 64)  | 3.43 | 2.74 | 1.56
(1, 12, 64, 12, 64)  | 4.04 | 3.64 | 1.77
(1, 12, 128, 12, 64) | 6.22 | 5.93 | 2.23

Performance over different seq_len on P4, T4

Page 45

NVIDIA’S INFERENCE
CUDA Optimization - Resources

Where you can find it:

FasterTransformer project (open-sourced): https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer

Page 46

NVIDIA’S INFERENCE
TRT Optimization

Page 47

NVIDIA’S INFERENCE
TRT Optimization

Before

After

Page 48

NVIDIA’S INFERENCE
TRT Optimization - Resources

Where you can find it:

BERT TRT demo (open-sourced): https://github.com/NVIDIA/TensorRT/tree/release/6.0/demo/BERT (to be re-located to DeepLearningExamples)
Blog: https://devblogs.nvidia.com/nlu-with-tensorrt-bert/

Page 49

HANDS-ON
Deploy BERT TensorFlow model with custom op

1. Follow the FasterTransformer README and generate the custom op lib: libtf_fastertransformer.so
2. Prepare gemm_config.in for the best GEMM algorithms by running the built binaries
3. Modify sample/tensorflow_bert/profile_bert_inference.py to create the SQuAD model, and use the saved_model API to export the model
4. Arrange the exported model in a tree structure: ./bert_ft/1/model.savedmodel/<exported files> (see the sketch below)
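A minimal sketch of step 4, assuming the SavedModel was exported to ./exported_squad_savedmodel (a hypothetical name) and using the repository layout shown later on Page 51:

   mkdir -p model_repository/bert_fastertransformer/1
   cp -r ./exported_squad_savedmodel model_repository/bert_fastertransformer/1/model.savedmodel
   cp config.pbtxt model_repository/bert_fastertransformer/   # the config shown on Page 51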

Page 50

HANDS-ON
Deploy BERT TensorRT model with plugins

1. Follow the TensorRT/demo/BERT README and generate the plugin libs: libbert_plugins.so and libcommon.so
2. Follow the README and run sample_bert with the additional arg --saveEngine=model.plan
3. Arrange the model dir in a tree structure: bert_trt/1/model.plan (see the sketch below)
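A minimal sketch of steps 2-3; everything except the --saveEngine flag is a placeholder for whatever the demo README specifies:

   # serialize the engine while running the demo
   ./sample_bert <args from the demo README> --saveEngine=model.plan

   # arrange it for TRTIS using the layout from Page 51
   mkdir -p model_repository/bert_trt/1
   cp model.plan model_repository/bert_trt/1/model.plan
   cp config.pbtxt model_repository/bert_trt/   # the config shown on Page 51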

Page 51

HANDS-ON
Prepare model_repository, run trtserver and perf_client

1. Prepare the model_repository

Model directory:

model_repository/
|-- bert_fastertransformer
|   |-- 1
|   |   `-- model.savedmodel
|   |       |-- saved_model.pb
|   |       `-- variables
|   |           |-- variables.data-00000-of-00001
|   |           `-- variables.index
|   `-- config.pbtxt
`-- bert_trt
    |-- 1
    |   `-- model.plan
    `-- config.pbtxt

config.pbtxt (bert_fastertransformer):

name: "bert_fastertransformer"
platform: "tensorflow_savedmodel"
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ 1, 128 ]
  },
  {
    name: "input_mask"
    data_type: TYPE_INT32
    dims: [ 1, 128 ]
  },
  {
    name: "segment_ids"
    data_type: TYPE_INT32
    dims: [ 1, 128 ]
  }
]
output [
  {
    name: "prediction"
    data_type: TYPE_FP32
    dims: [ 2, 1, 128 ]
  }
]
instance_group {
  kind: KIND_GPU
  count: 1
}
version_policy: { specific { versions: [1] } }

name: "bert_trt"platform: "tensorrt_plan"max_batch_size: 1input [

{name: "segment_ids"data_type: TYPE_INT32dims: 128

},{

name: "input_ids"data_type: TYPE_INT32dims: 128

},{

name: "input_mask"data_type: TYPE_INT32dims: 128

}]output [

{name: "cls_squad_logits"data_type: TYPE_FP32dims: [128,2,1,1]

}]

instance_group {kind: KIND_GPUcount: 1

}

version_policy: {specific { versions: [32] }}

config.pbtxt config.pbtxt

Model directory

Page 52

HANDS-ON
Prepare model_repository, run trtserver and perf_client

1. Launch trtserver over HTTP/gRPC
   1. NV_GPU=x nvidia-docker run --rm -it --name=trtis_bert -p8000:8000 -p8001:8001 -v /path/to/model_repository:/models nvcr.io/nvidia/tensorrtserver:19.11-py3
   2. export LD_PRELOAD=/path/to/{libcommon.so; libbert_plugins.so; libtf_fastertransformer.so}
   3. trtserver --model-store=/models --log-verbose=1 --strict-model-config=True

Page 53

HANDS-ON
Prepare model_repository, run trtserver and perf_client

1. Run perf_client to infer over gRPC
   1. Launch docker: docker run --net=host --rm -it nvcr.io/nvidia/tensorrtserver-clients
   2. ./install/bin/perf_client -m bert_trt -d -c8 -l200 -p2000 -b1 -i grpc -u localhost:8001 -t1 --max-threads=8

Result reported by perf_client:

Request concurrency: 1
Client:
  Request count: 59
  Throughput: 944 infer/sec
  Avg latency: 34422 usec (standard deviation 288 usec)
  p50 latency: 34457 usec
  p90 latency: 34667 usec
  p95 latency: 34877 usec
  p99 latency: 35130 usec
  Avg gRPC time: 34452 usec ((un)marshal request/response 27 usec + response wait 34425 usec)
Server:
  Request count: 70
  Avg request latency: 33473 usec (overhead 26 usec + queue 58 usec + compute 33389 usec)

Page 54

HANDS-ON
Benchmarking - FasterTransformer

SQuAD task inference (FasterTransformer): batch size = 1, TensorFlow backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS   | AvgL (ms) | TP99 (ms) | Concurrent
CPU       | FP32      | 7.7   | 131       | 203       | 1
CPU       | FP32      | 10.4  | 289       | 339       | 4
GPU       | FP32      | 104.5 | 9.5       | 11.8      | 1
GPU       | FP32      | 137   | 21.9      | 23.7      | 4
GPU       | FP16      | 267.5 | 3.7       | 3.9       | 1
GPU       | FP16      | 461.5 | 8.7       | 10.3      | 4

CPU -> multi-thread CPU -> GPU FP32 -> concurrent GPU FP32 -> GPU FP16 -> concurrent GPU FP16: QPS 7.7 -> 104.5 -> 461.5, average latency 131 ms -> 9.5 ms -> 8.7 ms

Virtual GPU feature in TensorFlow to enable multi-stream:
--tf-add-vgpu="0;4;3000"
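For example, the flag can be appended to the trtserver launch from Page 52; our reading of "0;4;3000" as GPU 0, 4 virtual GPUs, 3000 MB each is an assumption to verify against the release notes:

   trtserver --model-store=/models --log-verbose=1 --tf-add-vgpu="0;4;3000"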

Page 55

HANDS-ON
Benchmarking - FasterTransformer

SQuAD task inference (FasterTransformer): batch size = 32, TensorFlow backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS  | AvgL (ms) | TP99 (ms) | Concurrent
CPU       | FP32      | 0.4  | 2491      | 2810      | 1
GPU       | FP32      | 5.5  | 182       | 184       | 1
GPU       | FP32      | 5    | 186       | 186       | 4
GPU       | FP16      | 21.8 | 46        | 48.8      | 1
GPU       | FP16      | 21.6 | 46.1      | 48.1      | 4

CPU -> GPU FP32 -> GPU FP16: QPS 0.4 -> 5.5 -> 21.8, average latency 2491 ms -> 182 ms -> 46 ms

Page 56

HANDS-ON
Benchmarking - TensorRT

SQuAD task inference (TensorRT): batch size = 1, TensorRT backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS   | AvgL (ms) | TP99 (ms) | Concurrent
GPU       | FP32      | 163   | 12.3      | 12.4      | 1
GPU       | FP32      | 156   | 12.8      | 14        | 4
GPU       | FP16      | 438.5 | 4.6       | 4.6       | 1
GPU       | FP16      | 473.5 | 4.2       | 5.1       | 4

GPU FP32 -> GPU FP16: QPS 163 -> 473.5, average latency 12.3 ms -> 4.2 ms

Page 57

HANDS-ON
Benchmarking - TensorRT

SQuAD task inference (TensorRT): batch size = 32, TensorRT backend, max QPS
Tesla T4, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Processor | Precision | QPS  | AvgL (ms) | TP99 (ms) | Concurrent
GPU       | FP32      | 6.5  | 157       | 159       | 1
GPU       | FP32      | 6.5  | 316       | 356       | 4
GPU       | FP16      | 29.5 | 34.2      | 34.8      | 1
GPU       | FP16      | 30.5 | 134       | 151       | 4

GPU FP32 -> GPU FP16: QPS 6.5 -> 30.5, average latency 157 ms -> 134 ms

Page 58

LEARN MORE AND DOWNLOAD TO USE

Learn more here:
https://nvidia.com/data-center-inference
https://docs.nvidia.com/deeplearning/sdk/inference-release-notes/index.html
https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/quickstart.html

Get the ready-to-deploy container with monthly updates from the NGC container registry:
https://ngc.nvidia.com/catalog/containers/nvidia%2Ftensorrtserver

Open source GitHub repository:
https://github.com/NVIDIA/tensorrt-inference-server

Page 59

ADDITIONAL RESOURCES

Engineering developer blog (benchmarks, model concurrency, etc.):
https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/

Kubeflow guest blog:
https://www.kubeflow.org/blog/nvidia_tensorrt/

Open source announcement:
https://news.developer.nvidia.com/nvidia-tensorrt-inference-server-now-open-source

More:
• Data center inference page & TensorRT page
• DevTalk Forum for Support
• TensorRT Hyperscale Inference Platform infographic
• NVIDIA AI Inference Platform technical overview
• NVIDIA TensorRT Inference Server and Kubeflow
• NVIDIA TensorRT Inference Server Now Available

Page 60

NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.

NVIDIA (DLI)

• GPU

• DLI

www.nvidia.cn/dli

Page 61

DLI Full-Day Deep Learning Training @ GTC CHINA 2019

Global developer training certificates | Fully configured GPU lab environments | 5 new courses debuting | Annual 40%-off special

View courses and register | Training inquiries

Courses:
• Training Neural Networks with Multiple GPUs: tackle the algorithmic and engineering challenges of large-scale neural network training
• CUDA Python: easily accelerate Python applications on the GPU
• Computer Vision from Zero: deep learning methods and practice
• Natural Language Processing (NLP): essential theory and applied skills
• Advanced applications fusing multi-data-type machine vision and NLP techniques
• Perception Systems for Autonomous Vehicles (new 2019 edition): learn to build autonomous vehicles with NVIDIA DRIVE AGX
• Deep Learning for Industrial Inspection: build automated industrial inspection models
• Developing AI Applications with Jetson Nano: robotics fundamentals; receive your own Jetson Nano kit

Page 62