GPU Inference Platform - NVIDIA · 2018-11-26

Building an Efficient GPU Inference Platform
정소영 (Executive Director, [email protected]) / NVIDIA


TRANSCRIPT

Page 1:

정소영 (Executive Director, [email protected]) / NVIDIA

Building an Efficient GPU Inference Platform

Page 2:

This session was prepared for:

Those thinking about an HW/SW service architecture for efficient inference

Those thinking about building an inference platform that can serve models trained in various frameworks such as TensorFlow, Caffe, and PyTorch

Those thinking about building an inference platform that makes the most efficient use of GPU performance and QoS when deploying a service

Page 3:

SUMMARY

TESLA T4

TENSORRT INFERENCE SERVER

KUBERNETES ON NVIDIA GPUs (KONG)

TENSORRT 5

Page 4:

TESLA T4

Page 5:

INFERENCE GPU – TESLA T4: WORLD'S MOST ADVANCED INFERENCE GPU

Page 6:

SPECIFICATION

75 watts

Page 7:

NEW TURING TENSOR CORE: 4 x 4 Matrix Processing Array

Page 8:

T4 INFERENCE SPEEDUP

ResNet-50 (27x) · DeepSpeech 2 (21x) · GNMT (36x)

Page 9:

INFERENCE EFFICIENCY (images/sec/watt)

CPU-only Server: 1
Tesla P4: 25
Tesla V100: 21
Tesla T4: 56

https://www.nvidia.com/en-us/data-center/resources/inference-technical-overview/

Page 10:

TURING MPS (MULTI-PROCESS SERVICE)

Page 11:

HW TRANSCODING ENGINE: enables efficient integration of deep learning and video applications

Decode: 2x vs. P4 (supports decoding 38 Full-HD video streams)

Encode: Performance Mode / Efficiency Mode

Page 12:

TENSORRT INFERENCE SERVER (TRTIS)

Page 13:

WHAT IS THE TRT INFERENCE SERVER?

Software stack: DNN Models · TensorRT Inference Server · NV DL SDK · NV Docker · Kubernetes

A mature, production-grade module for datacenter services

A container-based, high-performance inference server

Maximizes CPU / GPU resource utilization

Page 14:

docker pull nvcr.io/nvidia/tensorrtserver:18.10-py3

http://ngc.nvidia.com/
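The slide shows only the pull command; a typical way to launch the container afterwards is sketched below. The model repository path (/models), port mappings, and the status check are illustrative assumptions based on TRTIS defaults (8000 HTTP, 8001 gRPC, 8002 metrics), not part of the slide:

nvidia-docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v /path/to/model/repository:/models \
    nvcr.io/nvidia/tensorrtserver:18.10-py3 \
    trtserver --model-store=/models

# quick smoke test: server and per-model status over the HTTP endpoint
curl localhost:8000/api/status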

Page 15:

FEATURES

Supports a variety of models: TensorRT, TensorFlow, Caffe2, ONNX

Multi-GPU support: distributes inference requests across all GPUs

Multi-tenancy support: multiple models, multiple instances per model, and multiple versions per instance can run concurrently

Batch request support: improves throughput

Provides monitoring metrics: usable for service orchestration, HA, LB, QoS, etc.

Page 16:

BASIC ARCHITECTURE

Page 17:

SAMPLE CONFIGURATION

https://github.com/NVIDIA/dl-inference-server/blob/18.10/src/core/model_config.proto
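As a concrete reference point for that proto, a minimal per-model config.pbtxt might look like the sketch below; the model name, platform, tensor names, and shapes are made-up examples rather than anything from the slide:

name: "resnet50_plan"            # illustrative model name
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "input"                # illustrative input tensor
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "prob"                 # illustrative output tensor
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]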

Page 18:

CLIENT SDK

https://github.com/NVIDIA/dl-inference-server/

C++ / Python client libraries for the TRT Inference Server: HTTP / gRPC

Client sample code provided: image_client, perf_client

Branch versions match those of the TRT Inference Server
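For orientation, the bundled image classification sample can be invoked roughly as below; the model name, image path, and flag values are illustrative assumptions (see the repository README for the exact options):

image_client -u localhost:8000 -m resnet50_plan -s INCEPTION -c 3 images/mug.jpg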

Page 19:

HOW TO USE

- Scalable architecture by separating front-end / back-end servers

- Online model updates using model versioning

- Throughput / latency optimization using the instance-group option

Page 20:

HOW TO USE: scalable architecture by separating front-end / back-end servers

Client requests → LB → Front-end servers → Back-end inference servers with TRTIS

Page 21:

HOW TO USE: online model updates using model versioning

- Multiple versions supported per model

- Model versions can be switched dynamically even while TRTIS is running: atomic inference is supported

- Effective for rolling updates, A/B tests, etc.

- Version Policy: All / Latest / Specific (see the sketch after this list)
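In config.pbtxt terms, the three policies correspond to the version_policy field; a sketch under that assumption (version numbers are illustrative):

version_policy: { all { } }                          # serve every available version
version_policy: { latest { num_versions: 1 } }       # serve only the newest version
version_policy: { specific { versions: [ 1, 3 ] } }  # pin particular versions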

Page 22:

HOW TO USE: throughput / latency optimization using the instance-group option
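The instance-group option controls how many execution instances of a model run and where; more instances per GPU generally raise throughput at some cost in per-request latency. A config.pbtxt sketch (counts and GPU indices are illustrative):

instance_group [
  {
    count: 2          # two parallel instances of this model
    kind: KIND_GPU
    gpus: [ 0, 1 ]    # placed on GPUs 0 and 1
  }
]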

Page 23:

REPORT METRICS FOR QOS
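The metrics are exported in Prometheus text format; assuming the default metrics port 8002, they can be inspected with a plain HTTP request. The server reports gauges and counters such as GPU utilization, GPU power usage, and per-model request counts and latencies:

curl localhost:8002/metrics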

Page 24:

KUBERNETES ON NVIDIA GPUS (KONG)

Page 25:

KUBERNETES ON NVIDIA GPUS

Scale up to thousands of GPUs

Self-healing cluster orchestration

GPU optimized out-of-the-box

Powered by NVIDIA Container Runtime

Upstream all diffs

Page 26:

FEATURES

GPU resources in Kubernetes are managed using the NVIDIA Device Plugin (see the pod-spec sketch after this list)

In heterogeneous GPU environments, GPUs can be managed effectively using attributes such as GPU type and memory requirements

Using NVIDIA DCGM (https://developer.nvidia.com/data-center-gpu-manager-dcgm), Prometheus, and Grafana, a wide range of GPU metrics and health checks can be fed into a monitoring system
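As a concrete illustration of the Device Plugin resource model, a pod requests GPUs through the nvidia.com/gpu resource; a minimal sketch in which the pod name and image choice are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: trtis-example                              # illustrative name
spec:
  containers:
  - name: trtis
    image: nvcr.io/nvidia/tensorrtserver:18.10-py3
    resources:
      limits:
        nvidia.com/gpu: 1                          # one GPU via the NVIDIA Device Plugin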

Page 27:

Page 28:

DCGM EXPORTER FOR PROMETHEUS

https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm
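Once the exporter is deployed, Prometheus scrapes it like any other target; a minimal scrape-config sketch, where the job name and the target address/port are assumptions (check the exporter's manifests for the actual endpoint):

scrape_configs:
  - job_name: 'dcgm'                  # illustrative job name
    static_configs:
      - targets: ['localhost:9100']   # illustrative exporter address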

Page 29:

REFERENCES

NVIDIA GPU Cloud (NGC) container registry: https://ngc.nvidia.com/

T4: https://www.nvidia.com/en-us/data-center/tesla-t4/

TRTIS Client SDK: https://github.com/NVIDIA/dl-inference-server

TRTIS User Guide: https://docs.nvidia.com/deeplearning/sdk/inference-user-guide/index.html

Kubernetes on NVIDIA GPUs: https://developer.nvidia.com/kubernetes-gpu

Page 30:

SEOUL | NOVEMBER 7-8, 2018

www.nvidia.com/ko-kr/ai-conference/