dcgm overview and hands on · 5 data center gpu manager (dcgm) pre-configured policies job level...

June 2019 DCGM OVERVIEW AND HANDS ON


Post on 03-Oct-2020


Page 1: DCGM OVERVIEW AND HANDS ON · 5 DATA CENTER GPU MANAGER (DCGM) Pre-configured Policies Job Level Statistics Stateful Configuration POLICY AND ALERTING Software Deployment Tests Stress

June 2019

DCGM OVERVIEW AND HANDS ON


TOOLS FOR MANAGING GPUs

Out-of-Band

GPU Metrics and Monitoring via BMC (SMBPBI)

Provide metrics (thermals, power, etc.) without the NVIDIA driver

Typically used at public CSPs (i.e. multi-tenant environments)

In-Band

Tools use the NVIDIA driver to provide GPU and NVSwitch metrics

DCGM and NVML (nvidia-smi) are in-band tools

Typically used in single-tenant environments


NVIDIA IN-BAND TOOLS ECOSYSTEM

NVML
▶ Customers building their own GPU metrics/monitoring stack using NVML

DCGM
▶ Customers integrating DCGM; CSPs for system validation

3rd Party Tools
▶ Cluster managers, Job schedulers, TSDBs, Visualization tools


HOW SHOULD I MANAGE MY GPUS?

NVML
▶ Stateless queries. Can only query current data
▶ Low overhead while running, high overhead to develop
▶ Low-level control of GPUs
▶ Management app must run on same box as GPUs

3RD PARTY TOOLS
▶ Provide database, graphs, and a nice UI
▶ Need management node(s)
▶ Development already done. You just have to configure the tools.

DCGM
▶ Can query a few hours of metrics
▶ Provides health checks and diagnostics
▶ Can batch queries/operations to groups of GPUs
▶ Can be remote or local


DATA CENTER GPU MANAGER (DCGM)

POLICY AND ALERTING
▶ Pre-configured Policies
▶ Job Level Statistics
▶ Stateful Configuration

GPU DIAGNOSTICS
▶ Software Deployment Tests
▶ Stress Tests
▶ Hardware Issues and Interface Tests (PCIe, NVLink)

CONFIGURATION MANAGEMENT
▶ Dynamic Power Capping
▶ Synchronous Clock Boost
▶ Fixed Clocks

ACTIVE HEALTH MONITORING
▶ Runtime Health Checks
▶ Prologue Checks
▶ Epilogue Checks


https://developer.nvidia.com/data-center-gpu-manager-dcgm

GPU Management in the Accelerated Data Center

DCGM OVERVIEW

Supported NVIDIA Hardware

● Fully supported on Tesla GPUs (Kepler+)

● Supported on Quadro, GeForce, and Titan GPUs (Maxwell+, since v1.3)

● Supports NVSwitch and DGX-2

● Driver R384 or Later (Linux only)

SDK Installer Packages

● .deb and .rpm Packages

● Includes Binaries: CLI (dcgmi) and daemon (nv-hostengine)

● Libraries and Headers (includes NVML)

● C and Python Bindings and Code samples

● Documentation: User Guides and API docs

Latest Release: v1.5.6


AVAILABLE NVIDIA MANAGEMENT TOOLS

Software Stack

Data Center GPU Manager (DCGM)
▶ Additional diagnostics (aka NVVS) and active health monitoring
▶ Policy management and more

NVIDIA Management Library (NVML)
▶ Low-level control of GPUs
▶ Included as part of driver
▶ Header is part of CUDA Toolkit / DCGM

[Diagram: software stack. Labels: GPU, NVIDIA Driver, CUDA, NVML, DCGM Daemon, Client Lib, DCGMI, DCGM-Based 3rd Party Tools, Diagnostics (NVVS)]


Two ways to use DCGM

APIs
▶ Accessible programmatically through C and Python APIs
▶ Can be embedded as a library within third-party management tools
▶ Management scripts can also be automated using the Python bindings

CLI
▶ Command-line interface: dcgmi
▶ Simple and interactive
▶ Favored by regular users and system admins

Equivalent functionality is provided through the APIs and the CLI


Two modes of running DCGM

Embedded mode
▶ The DCGM agent is loaded as a shared library by a third-party agent
▶ The third-party agent periodically triggers it to gather data and perform management activities

Standalone mode
▶ The DCGM agent is embedded in a daemon called the NVIDIA Host Engine
▶ Suited to cases where:
  - DCGM clients prefer to interact with a daemon
  - Multiple clients wish to interact with DCGM, not just one node agent
  - Users want to use the command-line tool, DCGMI

Standalone mode is widely used for its flexibility and its lowest maintenance cost for users.


Package Selection
▪ Linux: x64 and POWER
▪ Ubuntu (deb) and CentOS/RHEL (rpm)
▪ Normal version: for servers without NVSwitch
▪ FM version: for servers with NVSwitch, such as DGX-2/HGX-2


Install and Quick Start

▪ $ dpkg -i *.deb   (or: $ rpm -ivh *.rpm)
▪ $ nv-hostengine
▪ Parameter: -b allows other hosts to access this daemon


Device for management

❑ Host: --host

❑ Group: -g

❑ GPU: --gpuid

Usage: dcgmi discovery

dcgmi discovery [--host <IP/FQDN>] -l

dcgmi discovery [--host <IP/FQDN>] -i <flags> [-g <groupId>] [-v]


DCGM CLI Usage

▪ Group
▪ Dmon
▪ Policy
▪ Job Stats
▪ Health & Diagnostics
▪ Topology
▪ NVLink


Group


Groups in DCGM

Groups: Almost all DCGM operations take place on groups. Users can create, modify, and destroy collections of GPUs on the local node, and use these constructs to control all subsequent DCGM activities.

Partitioned groups, consisting of only a subset of GPUs, are useful for job-level concepts such as job stats and health.


Creating, listing, and deleting groups

Groups are managed entirely through the “dcgmi group” subcommand.

“dcgmi group -d GroupID” deletes a group.

“dcgmi group -h” lists more detailed usage.


Add GPUs to a group
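The screenshot for this slide did not survive, so here is a minimal sketch of how the group operations might be scripted. It only builds command lines and does not run them. The helper name `dcgmi_group_cmd` is hypothetical; `-d` and `--host` come from the slides, while `-c`, `-l` and `-a` are assumptions that should be confirmed with `dcgmi group -h` on the installed version.

```python
from typing import List, Optional, Sequence


def dcgmi_group_cmd(action: str,
                    group_id: Optional[int] = None,
                    name: Optional[str] = None,
                    gpu_ids: Sequence[int] = (),
                    host: Optional[str] = None) -> List[str]:
    """Build (without running) the argv list for a `dcgmi group` call."""
    cmd = ["dcgmi", "group"]
    if host:
        # Remote host engine, per the "Device for management" slide.
        cmd += ["--host", host]
    if action == "create":
        if not name:
            raise ValueError("create needs a group name")
        cmd += ["-c", name]            # assumed flag: create a named group
    elif action == "list":
        cmd += ["-l"]                  # assumed flag: list existing groups
    elif action == "add":
        if group_id is None or not gpu_ids:
            raise ValueError("add needs a group ID and at least one GPU ID")
        gpu_list = ",".join(str(g) for g in gpu_ids)
        cmd += ["-g", str(group_id), "-a", gpu_list]   # assumed flag: -a adds GPUs
    elif action == "delete":
        if group_id is None:
            raise ValueError("delete needs a group ID")
        cmd += ["-d", str(group_id)]   # from the slide: dcgmi group -d GroupID
    else:
        raise ValueError(f"unknown action: {action}")
    return cmd
```

On a machine with DCGM installed, the returned list could be passed to `subprocess.run(cmd, check=True)`.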


Health & Diagnostics


Health: check the health status of selected items


ACTIVE HEALTH MONITORING & ANALYSIS

NON INVASIVE CHECKS

Real-time monitoring & aggregated health indicator

Checks health of all GPUs and NVSwitch subsystems: PCIe, ECC, InfoROM, Power, Thermal, NVLink

dcgmi health --check -g 1
Health Monitor Report
+------------------+---------------------------------------------------------+
| Overall Health: Healthy                                                    |
+==================+=========================================================+

Run Health Check: Healthy System

dcgmi health -g 1 -c
Health Monitor Report
+----------------------------------------------------------------------------+
| Group 1          | Overall Health: Warning                                 |
+==================+=========================================================+
| GPU ID: 0        | Warning                                                 |
|                  | PCIe system: Warning - Detected more than 8 PCIe        |
|                  | replays per minute for GPU 0: 13                        |
+------------------+---------------------------------------------------------+
| GPU ID: 1        | Warning                                                 |
|                  | InfoROM system: Warning - A corrupt InfoROM has been    |
|                  | detected in GPU 1.                                      |
+------------------+---------------------------------------------------------+

Run Health Check: System with problems
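A report like the one above is easy to read by eye but awkward to act on from a scheduler prologue. The following is a minimal sketch that pulls the overall status and the per-GPU messages out of the text. The function name `parse_health_report` and the embedded sample are illustrative, and the parser assumes the two-column layout shown on this slide, which may differ between DCGM versions.

```python
import re
from typing import Dict, List, Optional, Tuple

# Sample copied from the slide (whitespace collapsed, as in the transcript).
SAMPLE_HEALTH = """\
+----------------------------------------------------------------------------+
| Group 1 | Overall Health: Warning |
+==================+=========================================================+
| GPU ID: 0 | Warning |
| | PCIe system: Warning - Detected more than 8 PCIe |
| | replays per minute for GPU 0: 13 |
+------------------+---------------------------------------------------------+
| GPU ID: 1 | Warning |
| | InfoROM system: Warning - A corrupt InfoROM has been |
| | detected in GPU 1. |
+------------------+---------------------------------------------------------+
"""


def parse_health_report(text: str) -> Tuple[Optional[str], Dict[int, str]]:
    """Return (overall status, {gpu_id: joined warning message})."""
    overall: Optional[str] = None
    parts: Dict[int, List[str]] = {}
    current: Optional[int] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue                      # skip borders and blank lines
        found = re.search(r"Overall Health:\s*(\w+)", line)
        if found:
            overall = found.group(1)
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 2:
            continue
        left, right = cells
        gpu = re.match(r"GPU ID:\s*(\d+)", left)
        if gpu:
            current = int(gpu.group(1))   # start of a new GPU block
            parts[current] = []
        elif left == "" and current is not None and right:
            parts[current].append(right)  # continuation line of the message
    return overall, {gpu_id: " ".join(msg) for gpu_id, msg in parts.items()}
```

A prologue script could, for example, refuse to start a job whenever the overall status is not `Healthy`.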


COMPREHENSIVE DIAGNOSTICS

ACTIVE HEALTH CHECKS

Identification, recovery & isolation of failed GPUs and NVSwitches.

Diagnostics to root-cause failures

Pre- and post-job GPU health checks

System sanity checks that stress performance, bandwidth, power, and thermal characteristics

Multi-level diagnostic options, taking from a few seconds to minutes

dcgmi diag -r 3
+---------------------------+-------------+
| Diagnostic                | Result      |
+===========================+=============+
|-----  Deployment  --------+-------------|
| Blacklist                 | Pass        |
| NVML Library              | Pass        |
| CUDA Main Library         | Pass        |
| CUDA Toolkit Library      | Pass        |
| Permissions and OS Blocks | Pass        |
| Persistence Mode          | Pass        |
| Environment Variables     | Pass        |
| Page Retirement           | Pass        |
| Graphics Processes        | Pass        |
| Inforom                   | Pass        |
+-----  Hardware  ----------+-------------+
| GPU Memory                | Pass - All  |
| Diagnostic                | Pass - All  |
+-----  Integration  -------+-------------+
| PCIe                      | Pass - All  |
+-----  Stress  ------------+-------------+
| SM Stress                 | Pass - All  |
| Targeted Stress           | Pass - All  |
| Targeted Power            | Warn - All  |
| Memory Bandwidth          | Pass - All  |
+---------------------------+-------------+
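When the diagnostic runs as a pre- or post-job check, usually only the non-passing rows matter. Below is a minimal sketch that lists them together with their section. The name `non_passing_tests` is hypothetical, and the parser assumes the table layout shown on this slide.

```python
import re
from typing import List, Optional, Tuple

# Excerpt of the slide's output (whitespace collapsed, as in the transcript).
SAMPLE_DIAG = """\
+---------------------------+-------------+
| Diagnostic | Result |
+===========================+=============+
|----- Deployment --------+-------------|
| Blacklist | Pass |
| NVML Library | Pass |
+----- Hardware ----------+-------------+
| GPU Memory | Pass - All |
| Diagnostic | Pass - All |
+----- Stress ------------+-------------+
| SM Stress | Pass - All |
| Targeted Power | Warn - All |
| Memory Bandwidth | Pass - All |
+---------------------------+-------------+
"""

# A section line looks like "+----- Stress ------------+-------------+".
SECTION = re.compile(r"^[|+]-+\s*([A-Za-z ]+?)\s*-+\+")


def non_passing_tests(text: str) -> List[Tuple[Optional[str], str, str]]:
    """Return (section, test name, result) for every row that is not a Pass."""
    section: Optional[str] = None
    problems = []
    for raw in text.splitlines():
        line = raw.strip()
        header = SECTION.match(line)
        if header:
            section = header.group(1)
            continue
        if not line.startswith("|"):
            continue                       # plain border line
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 2:
            continue
        name, result = cells
        if (name, result) == ("Diagnostic", "Result"):
            continue                       # column header row
        if not result.startswith("Pass"):
            problems.append((section, name, result))
    return problems
```

For the run shown on the slide, the only entry returned is the Targeted Power warning in the Stress section.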


Policy


Policy: what to do when a -T/-P/-e/-n/-x condition is triggered (max temperature, max power, ECC errors, NVLink errors, XID errors)


FLEXIBLE GPU GOVERNANCE POLICIES

With Existing Tools
▶ Continuous monitoring by the user
▶ Identify GPUs with double-bit errors
▶ Manually perform GPU reset to correct problems

Using DCGM
▶ Auto-detects double-bit errors, performs a GPU reset, and notifies the user

Condition / Action / Notification
Condition: Watch for DBE
Action: Reset GPU
Notification: Callback
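The condition / action / notification triple can be illustrated in plain Python. This is a conceptual sketch only: it does not call the DCGM policy API, and the names `Policy`, `evaluate` and the `dbe_count` metric key are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, int]


@dataclass
class Policy:
    condition: Callable[[Metrics], bool]      # e.g. "a double-bit error was seen"
    action: Callable[[int], None]             # e.g. reset the GPU
    notify: Callable[[int, Metrics], None]    # e.g. invoke a user callback


def evaluate(policy: Policy, samples: Dict[int, Metrics]) -> List[int]:
    """Apply the policy to one sample per GPU; return the GPU IDs that violated it."""
    violated = []
    for gpu_id, metrics in sorted(samples.items()):
        if policy.condition(metrics):
            policy.action(gpu_id)             # act first ...
            policy.notify(gpu_id, metrics)    # ... then tell the user
            violated.append(gpu_id)
    return violated


def demo() -> List[int]:
    """Watch for DBEs, record a (simulated) reset, and print a notification."""
    resets: List[int] = []
    policy = Policy(
        condition=lambda m: m.get("dbe_count", 0) > 0,
        action=resets.append,
        notify=lambda gpu, m: print(f"GPU {gpu}: {m['dbe_count']} DBE(s), reset issued"),
    )
    return evaluate(policy, {0: {"dbe_count": 0}, 1: {"dbe_count": 2}})
```

With DCGM, the watching is done by the host engine rather than by a user-side loop like this one; the sketch only shows how the three parts relate.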


Job stats


MANAGING JOB LIFECYCLE

Which GPUs did my job run on?

How much of the GPUs did my job use?

Any error or warning conditions during my job (ECC errors, clock throttling, etc.)?

Are the GPUs healthy and ready for the next job?

Workflow:
1. Create GPU group and check health
2. Start Job Stats
3. Run Job
4. Stop Job Stats
5. Display Job Stats
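The workflow above might be wrapped around a job roughly as follows. This is a sketch that only assembles the command sequence for an already-created group. The function names are hypothetical; `dcgmi health -g <id> -c` and `dcgmi stats --job <name> -v -g <id>` appear elsewhere in this deck, while `--enable`, `-s` and `-x` are assumptions to be checked against `dcgmi stats -h`.

```python
import shlex
from typing import List


def job_stats_plan(group_id: int, job_name: str, job_cmd: List[str]) -> List[List[str]]:
    """Return the ordered commands for one job on an existing GPU group."""
    g = str(group_id)
    return [
        ["dcgmi", "health", "-g", g, "-c"],                     # check health before the job
        ["dcgmi", "stats", "-g", g, "--enable"],                # assumed flag: enable stats recording
        ["dcgmi", "stats", "-g", g, "-s", job_name],            # assumed flag: start job stats
        list(job_cmd),                                          # run the job itself
        ["dcgmi", "stats", "-x", job_name],                     # assumed flag: stop job stats
        ["dcgmi", "stats", "--job", job_name, "-v", "-g", g],   # display job stats (from the slides)
    ]


def as_shell_lines(plan: List[List[str]]) -> List[str]:
    """Render the plan as shell-quoted lines, e.g. for a dry run or a log."""
    return [shlex.join(cmd) for cmd in plan]
```

Printing `as_shell_lines(job_stats_plan(2, "demojob", ["./my_app"]))` gives a dry run that can be reviewed before anything is executed; `./my_app` is a placeholder for the real job.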


JOB STATISTICS

dcgmi stats --job demojob -v -g 2
Successfully retrieved statistics for job: demojob.
+----------------------------------------------------------------------------+
| GPU ID: 0                                                                  |
+==================================+=========================================+
|-----  Execution Stats  ----------+-----------------------------------------|
| Start Time                       | Wed Mar 7 10:02:34 2018                 |
| End Time                         | Wed Mar 7 10:10:00 2018                 |
| Total Execution Time (sec)       | 445.48                                  |
| No. of Processes                 | 1                                       |
| Compute PID                      | 23112                                   |
+-----  Performance Stats  --------+-----------------------------------------+
| Energy Consumed (Joules)         | 1437                                    |
| Max GPU Memory Used (bytes)      | 120324096                               |
| SM Clock (MHz)                   | Avg: 998, Max: 1177, Min: 405           |
| Memory Clock (MHz)               | Avg: 2068, Max: 2505, Min: 324          |
| SM Utilization (%)               | Avg: 76, Max: 100, Min: 0               |
| Memory Utilization (%)           | Avg: 0, Max: 1, Min: 0                  |
| PCIe Rx Bandwidth (megabytes)    | Avg: 0, Max: 0, Min: 0                  |
| PCIe Tx Bandwidth (megabytes)    | Avg: 0, Max: 0, Min: 0                  |
+-----  Event Stats  --------------+-----------------------------------------+
| Single Bit ECC Errors            | 5                                       |
| Double Bit ECC Errors            | 0                                       |
| PCIe Replay Warnings             | 0                                       |
| Critical XID Errors              | 0                                       |
+-----  Slowdown Stats  -----------+-----------------------------------------+
| Due to - Power (%)               | 0                                       |
|        - Thermal (%)             | Not Supported                           |
|        - Reliability (%)         | Not Supported                           |
|        - Board Limit (%)         | Not Supported                           |
|        - Low Utilization (%)     | Not Supported                           |
|        - Sync Boost (%)          | 0                                       |
+----------------------------------+-----------------------------------------+

Detailed stats show utilization, performance and more…


Config


GPU CONFIGURATION MANAGEMENT

Initialization: Configure all GPUs (global group)

Per-job basis: Individual partitioned group settings

Maintains settings across driver restarts, GPU resets or at job start

Supports SET, GET and ENFORCE

MAINTAINS CONFIGURATION

SUPPORTED SETTINGS

dcgmi config -g 1 --set -P 200
Configuration successfully set.

Disable ECC mode

dcgmi config -g 1 --get
+--------------------------+------------------------+------------------------+
| all_gpu_group            |                        |                        |
| Group of 2 GPUs          | TARGET CONFIGURATION   | CURRENT CONFIGURATION  |
+==========================+========================+========================+
| Sync Boost               | Not Specified          | Disabled               |
| SM Application Clock     | Not Specified          | 705                    |
| Memory Application Clock | Not Specified          | 2600                   |
| ECC Mode                 | Disabled               | Disabled               |
| Power Limit              | 200                    | 225                    |
| Compute Mode             | Not Specified          | E. Process             |
+--------------------------+------------------------+------------------------+

Get Group config [Note DCGM performed reset]
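The table separates the target configuration (what was requested) from the current configuration (what the GPUs report). A minimal sketch for spotting settings where the two disagree is shown below. The name `config_drift` is hypothetical, and the parser assumes the three-column layout from this slide.

```python
from typing import Dict, Tuple

# Rows copied from the slide (whitespace collapsed, as in the transcript).
SAMPLE_CONFIG = """\
| all_gpu_group | | |
| Group of 2 GPUs | TARGET CONFIGURATION | CURRENT CONFIGURATION |
+==========================+========================+========================+
| Sync Boost | Not Specified | Disabled |
| SM Application Clock | Not Specified | 705 |
| Memory Application Clock | Not Specified | 2600 |
| ECC Mode | Disabled | Disabled |
| Power Limit | 200 | 225 |
| Compute Mode | Not Specified | E. Process |
"""

# Target cells that do not express a requested value.
IGNORED_TARGETS = {"", "Not Specified", "TARGET CONFIGURATION"}


def config_drift(text: str) -> Dict[str, Tuple[str, str]]:
    """Return {setting: (target, current)} for settings whose target is not in effect."""
    drift = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue                       # border line
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 3:
            continue
        setting, target, current = cells
        if target in IGNORED_TARGETS:
            continue                       # nothing was requested for this setting
        if target != current:
            drift[setting] = (target, current)
    return drift
```

In the slide's output the power limit target is 200 while the current value is 225, which is the kind of mismatch the ENFORCE operation is meant to address.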


dmon


Dmon: monitor GPU stats


List the items available for monitoring

$ dcgmi dmon -l


Create a field group for monitoring


Monitor fields or a field group

$ dcgmi dmon -e <fieldIds>   (or: $ dcgmi dmon -f <fieldGroupId>)
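As a small illustration of the two forms, the sketch below builds a `dcgmi dmon` command line from either a list of field IDs or a field group ID, but never both. The helper name `dmon_cmd` is hypothetical, and the field IDs in the usage note (150 for GPU temperature, 155 for power usage) are assumptions that should be confirmed against the output of `dcgmi dmon -l`.

```python
from typing import List, Optional, Sequence


def dmon_cmd(field_ids: Sequence[int] = (),
             field_group_id: Optional[int] = None) -> List[str]:
    """Build (without running) a `dcgmi dmon` command for fields or a field group."""
    # Exactly one of the two selectors must be given.
    if bool(field_ids) == (field_group_id is not None):
        raise ValueError("pass either field_ids or field_group_id, not both or neither")
    cmd = ["dcgmi", "dmon"]
    if field_ids:
        cmd += ["-e", ",".join(str(int(f)) for f in field_ids)]   # -e: explicit field IDs
    else:
        cmd += ["-f", str(field_group_id)]                        # -f: existing field group
    return cmd
```

For example, `dmon_cmd(field_ids=[150, 155])` yields `dcgmi dmon -e 150,155`, and `dmon_cmd(field_group_id=3)` yields `dcgmi dmon -f 3`.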


Integrating Prometheus & Grafana


Prometheus

❑ https://prometheus.io/

❑ Open Source systems monitoring and alerting toolkit

❑ Third-party exporters


DCGM exporter

https://github.com/NVIDIA/gpu-monitoring-tools

$ docker run -d --rm --net="host" --pid="host" --volumes-from nvidia-dcgm-exporter:ro \
    quay.io/prometheus/node-exporter --collector.textfile.directory="/run/prometheus"

$ curl localhost:9100/metrics
# HELP dcgm_dec_utilization Decoder utilization (in %).
# TYPE dcgm_dec_utilization gauge
dcgm_dec_utilization{gpu="0",uuid="GPU-c09da0a8-361d-4a42-0e58-9fe098e4d6d4"} 0
dcgm_dec_utilization{gpu="1",uuid="GPU-3d598278-88a6-6979-f600-5b7c33341502"} 0
dcgm_dec_utilization{gpu="2",uuid="GPU-e0725273-7885-235f-be70-5a333ac6fd63"} 0
dcgm_dec_utilization{gpu="3",uuid="GPU-eea85835-2787-da88-cf6f-a8b790f6ec2c"} 0
dcgm_dec_utilization{gpu="4",uuid="GPU-fc9dd930-84df-d270-dc93-dc3122bb901f"} 0
dcgm_dec_utilization{gpu="5",uuid="GPU-95320c1d-4f08-66a8-e834-c731136a7822"} 0
# HELP dcgm_ecc_dbe_aggregate_total Total number of double-bit persistent ECC errors.
# TYPE dcgm_ecc_dbe_aggregate_total counter
dcgm_ecc_dbe_aggregate_total{gpu="0",uuid="GPU-c09da0a8-361d-4a42-0e58-9fe098e4d6d4"} 0
dcgm_ecc_dbe_aggregate_total{gpu="1",uuid="GPU-3d598278-88a6-6979-f600-5b7c33341502"} 0
dcgm_ecc_dbe_aggregate_total{gpu="2",uuid="GPU-e0725273-7885-235f-be70-5a333ac6fd63"} 0
dcgm_ecc_dbe_aggregate_total{gpu="3",uuid="GPU-eea85835-2787-da88-cf6f-a8b790f6ec2c"} 0
dcgm_ecc_dbe_aggregate_total{gpu="4",uuid="GPU-fc9dd930-84df-d270-dc93-dc3122bb901f"} 0
dcgm_ecc_dbe_aggregate_total{gpu="5",uuid="GPU-95320c1d-4f08-66a8-e834-c731136a7822"} 0
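The exporter output is in the Prometheus text exposition format, so it can also be inspected without a Prometheus server. Below is a minimal sketch that groups samples by metric name and GPU index. The name `parse_metrics` is hypothetical, and the parser only handles simple labelled lines like the ones shown here; it is not a complete implementation of the format.

```python
import re
from typing import Dict

# Two metrics copied from the slide, trimmed to two GPUs each.
SAMPLE_METRICS = """\
# HELP dcgm_dec_utilization Decoder utilization (in %).
# TYPE dcgm_dec_utilization gauge
dcgm_dec_utilization{gpu="0",uuid="GPU-c09da0a8-361d-4a42-0e58-9fe098e4d6d4"} 0
dcgm_dec_utilization{gpu="1",uuid="GPU-3d598278-88a6-6979-f600-5b7c33341502"} 0
# HELP dcgm_ecc_dbe_aggregate_total Total number of double-bit persistent ECC errors.
# TYPE dcgm_ecc_dbe_aggregate_total counter
dcgm_ecc_dbe_aggregate_total{gpu="0",uuid="GPU-c09da0a8-361d-4a42-0e58-9fe098e4d6d4"} 0
dcgm_ecc_dbe_aggregate_total{gpu="1",uuid="GPU-3d598278-88a6-6979-f600-5b7c33341502"} 0
"""

SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> Dict[str, Dict[str, float]]:
    """Return {metric_name: {gpu_label: value}} for labelled sample lines."""
    result: Dict[str, Dict[str, float]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                        # skip HELP/TYPE comments and blanks
        match = SAMPLE_LINE.match(line)
        if not match:
            continue                        # unlabelled or unsupported line
        labels = dict(LABEL.findall(match.group("labels")))
        gpu = labels.get("gpu", "")
        result.setdefault(match.group("name"), {})[gpu] = float(match.group("value"))
    return result
```

A quick check such as `any(v > 0 for v in parse_metrics(text)["dcgm_ecc_dbe_aggregate_total"].values())` would flag double-bit errors on any GPU.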


Prometheus config file

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'szdgxstation'
    static_configs:
      - targets: ['127.0.0.1:9100']

  - job_name: 'bjdlserver'
    static_configs:
      - targets: ['10.19.203.85:9100']

./prometheus --config.file=***.yml


DCGM + Prometheus


Grafana and data source configuration

The open platform for beautiful analytics and monitoring


DCGM + Prometheus + Grafana
