Page 1
GPU Resource Pooling and the Benefit of Deploying CUPTI
Alibaba Group
Lingling Jin, Lingjie Xu
© 2019 Alibaba Group
Page 2
Alibaba is an AI company
• E-commerce
• Search
• Advertisement
• Financing & payment
• Video
• Logistics
All these services are backed by Alibaba Cloud
Page 3
GPU pool
IDC data center
• GPU accelerator powered data center
Page 4
GPU servers are managed elastically
Kubernetes
KVM
However, GPU utilization is still low
Page 5
Agenda
• GPU accelerated inference service pool: multi-tenant
• GPU resource pool over ethernet
Page 6
GPU inference service pool
Model Zoo
• GPUs are allocated in single-card granularity
• Capacity is designed for worst-case traffic
• Mixing different inference workloads possible?
[Diagram: GPU cards serving Model1, Model1, Model2]
Page 7
Experiment setup
• Single GPU V100 system
• Model: resnet50
• Start the inference server and capture NVML GPU utilization on the backend
• Client performance test
  • multiple concurrent requests
  • max allowed latency
  • desired batch size
• Calculate server throughput
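The measurement loop above can be sketched in a few lines; the helper names and latency figures below are invented for illustration, not taken from the talk.

```python
def best_batch(latencies: dict, max_latency_s: float) -> int:
    """Pick the largest batch size whose measured latency meets the bound."""
    ok = [b for b, lat in latencies.items() if lat <= max_latency_s]
    if not ok:
        raise ValueError("no batch size meets the latency bound")
    return max(ok)

def throughput(batch_size: int, batch_latency_s: float) -> float:
    """Images/sec if the server runs back-to-back batches of this size."""
    return batch_size / batch_latency_s

# Hypothetical per-batch latencies measured by the client (seconds).
measured = {1: 0.004, 4: 0.009, 8: 0.016}
b = best_batch(measured, max_latency_s=0.010)   # -> 4
print(b, round(throughput(b, measured[b])))     # -> 4 444
```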
Page 8
NVML GPU utilization
GPU_utilization: Percent of time over the past sample period during which one or more kernels was executing on the GPU.
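The definition can be made concrete: treat each kernel execution as a (start, end) interval, merge overlaps, and measure how much of the sample window is covered. The timestamps below are invented for illustration.

```python
def gpu_utilization(kernel_intervals, window_start, window_end):
    """Percent of the sample window during which one or more kernels was
    executing: clip each (start, end) interval to the window, merge
    overlaps, and divide covered time by window length."""
    merged = []
    for start, end in sorted(kernel_intervals):
        start = max(start, window_start)
        end = min(end, window_end)
        if start >= end:
            continue
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    busy = sum(end - start for start, end in merged)
    return 100.0 * busy / (window_end - window_start)

# Hypothetical kernel timestamps (seconds) within a 1 s sample window.
print(round(gpu_utilization([(0.0, 0.2), (0.1, 0.3), (0.5, 0.6)], 0.0, 1.0), 1))  # -> 40.0
```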
Page 9
CUPTI
• The CUDA Profiling Tools Interface (CUPTI) enables the creation of profiling and tracing tools that target CUDA applications.
• All events/metrics available in NVPROF:
  • sm_efficiency
  • achieved_occupancy
  • dram_read/write
  • single/double/half_precision_fu_utilization
• Collect GPU metrics during runtime
• 1%-10% overhead
Page 10
CUPTI SM Efficiency
SM efficiency: The percentage of time at least one warp is active on a multiprocessor, averaged over all multiprocessors on the GPU.
CUPTI is what we want.
Page 11
Explore two-process GPU sharing
[Chart: one AlexNet program running — SM_active_cycles and GPU_elapsed_cycle per time stamp (60 ms), y-axis 0 to 7.00E+07 cycles]
• We used CUPTI to capture GPU elapsed cycles and SM active cycles
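Given those two counters, the efficiency falls out as active cycles over elapsed cycles, averaged across SMs, mirroring the sm_efficiency definition above. A small sketch with invented numbers:

```python
def sm_efficiency(active_cycles_per_sm, elapsed_cycles):
    """Percentage of elapsed cycles during which each SM had at least one
    active warp, averaged over all SMs (the sm_efficiency definition)."""
    fractions = [active / elapsed_cycles for active in active_cycles_per_sm]
    return 100.0 * sum(fractions) / len(fractions)

# Hypothetical 60 ms sample: two SMs, one of them half idle.
print(sm_efficiency([5_000_000, 10_000_000], 10_000_000))  # -> 75.0
```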
Page 12
Explore multi-process GPU sharing
[Chart: 2 concurrent TensorFlow AlexNet inference applications sharing the GPU — Task1_elapsed_cycle and Task2_elapsed_cycle per time stamp (60 ms), y-axis 0 to 1.60E+08 cycles]
Two processes time-share the GPU resource, NOT concurrent
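The time-sharing claim can be checked mechanically: sum the pairwise overlap of the two tasks' active windows and confirm it is zero. A minimal sketch with invented timestamps:

```python
def overlap_seconds(intervals_a, intervals_b):
    """Total time during which an interval of task A overlaps one of task B.
    Assumes each task's own intervals do not overlap each other."""
    total = 0.0
    for s1, e1 in intervals_a:
        for s2, e2 in intervals_b:
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

task1 = [(0.00, 0.06), (0.12, 0.18)]   # active windows in seconds (invented)
task2 = [(0.06, 0.12), (0.18, 0.24)]
print(overlap_seconds(task1, task2))    # -> 0.0, i.e. pure time-sharing
```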
Page 13
[Chart: 2 concurrent TensorFlow AlexNet inference applications sharing the GPU — Task1/Task2 active and elapsed cycles per time stamp (60 ms), y-axis 0 to 1.60E+08 cycles]
Explore multi-process GPU sharing
GPU active cycles don't overlap, so no perf impact when running together
Page 14
Running two inference servers
• Single GPU V100 system
• Models:
  • resnet50
  • Inception
• Batch size 4
Page 15
Performance with two inference servers
No QoS control
Perf overhead
Page 16
Nvidia inference server
• Each model is implemented as a GPU stream and can run concurrently on the GPU
• Better control with the scheduler
Page 17
Nvidia inference server
Page 18
1 server vs. 2 server performance comparison
Page 19
Agenda
• GPU accelerated inference service pool: multi-tenant
• GPU resource pool over ethernet
Page 20
Why
DGX2 / DGX1 / 1U GPU server
• Each AI application has a different CPU/GPU ratio
• Many types of GPU servers
• A single GPU card or GPU server is becoming so powerful that it cannot be fully utilized
Page 21
Distributed storage system: a big success
• Separated CPU (client) and storage (chunk server)
• Stable
• High performance
• Easy to maintain
• Low cost
Page 22
Distributed acceleration
[Diagram: a distributed computing pool (physical machines, virtual machines, containers — a CPU pool without GPU) connects over Ethernet (http/grpc/rdma/ib) to an acceleration pool (GPU pool) and a distributed storage pool (HDD / SSD / NVMe)]
Page 23
GPU pool over ethernet
[Diagram: on the client, a CUDA application such as TensorFlow (GPU version) calls proxies for the CUDA Runtime, the CUDA Libraries (cuDNN/cuBLAS), and the CUDA Driver; a communicator forwards the calls over TCP/RDMA to the server-side communicator and API service, which invokes the real CUDA Runtime/Library/Driver on the GPU]
Components can be shared by GPU/FPGA/etc.
Components can be generated automatically with minimal manual work
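A minimal sketch of the proxy/communicator split, with an in-process loopback standing in for the TCP/RDMA transport; the payload format and the cudaMalloc handler are illustrative, not the actual implementation.

```python
import json

class ApiServer:
    """Server side: receives a serialized call and dispatches it to the real
    CUDA runtime/library/driver (faked here with a dict of handlers)."""
    def __init__(self, handlers):
        self.handlers = handlers

    def handle(self, payload: bytes) -> bytes:
        msg = json.loads(payload)
        result = self.handlers[msg["api"]](*msg["args"])
        return json.dumps({"result": result}).encode()

class ClientProxy:
    """Client side: every intercepted CUDA API call is serialized and sent
    through the transport (TCP/RDMA in the real system)."""
    def __init__(self, transport):
        self.transport = transport

    def call(self, api, *args):
        reply = self.transport(json.dumps({"api": api, "args": args}).encode())
        return json.loads(reply)["result"]

# Loopback wiring; the cudaMalloc handler is a stand-in, not the real call.
server = ApiServer({"cudaMalloc": lambda nbytes: {"devptr": 0x7F000000, "bytes": nbytes}})
proxy = ClientProxy(server.handle)
print(proxy.call("cudaMalloc", 1024))  # -> {'devptr': 2130706432, 'bytes': 1024}
```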
Page 24
Challenge 1: A lot of CUDA APIs
• An intermediate, auto-generated YAML file defines the APIs that need to be forwarded to the remote server.
• Need to add some information on pointer arguments manually
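A hypothetical entry in such a YAML file might look like the following; the schema shown is illustrative, with the manually added pointer directions reflecting the point above.

```yaml
# Hypothetical entry; field names are illustrative, not the real schema.
- api: cudaMemcpy
  forward: true
  params:
    - {name: dst,   type: "void *",       direction: out}  # added by hand
    - {name: src,   type: "const void *", direction: in}   # added by hand
    - {name: count, type: size_t}
    - {name: kind,  type: cudaMemcpyKind}
```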
Page 25
Challenge 2: Host memory
• With a remote GPU, host memory (e.g. allocated via cudaMallocHost) must be detected in CUDA kernel parameters to ensure functional correctness
Page 26
Challenge 3: Hide API forwarding latency
• Changing blocking API calls to asynchronous calls
• Benefit: 2x speedup
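The before/after difference can be sketched with a stand-in for a forwarded call; the 5 ms latency and the call count below are invented for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def forward(latency_s):
    """Stand-in for one forwarded API round trip (network + server time)."""
    time.sleep(latency_s)

calls = [0.005] * 4  # four hypothetical API calls, 5 ms each

# Before: blocking calls -- the client waits out every round trip in turn.
t0 = time.perf_counter()
for c in calls:
    forward(c)
blocking_s = time.perf_counter() - t0

# After: issue the calls asynchronously and synchronize once at the end.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(calls)) as pool:
    futures = [pool.submit(forward, c) for c in calls]
    for f in futures:
        f.result()
async_s = time.perf_counter() - t0

print(async_s < blocking_s)  # latency is hidden by overlapping the calls
```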
[Diagram: API run time in server vs. client, before (blocking) and after (asynchronous)]
Page 27
• Website: https://aimatrix.ai
• Code repo: https://github.com/alibaba/ai-matrix
Page 28
CNN Training Performance
Page 29
Summary
• GPU accelerated inference service pool: multi-tenant
• GPU resource pool over ethernet
• All results are preliminary and we are actively working on them
• Discussions and collaborations are welcome
• Email: [email protected]
Page 30
Thanks! Q&A