Zürcher Fachhochschule
General purpose processing using embedded GPUs: A study of latency and its variation
Matthias Rosenthal and Amin Mazloumian, May 2016

Agenda
• General Purpose GPU Computing
• Embedded CPU/GPU versus CPU/FPGA
• CPU–GPU Data Transfer
  – Unified Virtual Addressing (DMA)
  – Memory mapped (Zero Copy)
• Latency Results
• Kernel-Loop Solution avoiding GPU Kernel launch

GPU Computing
Originally used for 3D game rendering, GPUs are now heavily used in:
• High Performance Computing
• Financial modeling
• Robotics
• Gas and oil exploration
• Cutting-edge scientific research
→ What about embedded systems?

CPU vs. GPU
[Figure: CPU vs. GPU performance; SP = single precision, DP = double precision. Source: http://michaelgalloy.com/2013/06/11/cpu-vs-gpu-performance.html]

CPU vs. GPU
• CPUs: Huge cache, optimized for several threads: Sequential instructions
• GPUs: 100+ simple cores for huge parallelization: Intensive parallelization

Discrete vs. Integrated GPU
[Figure: block diagrams of a discrete GPU and an integrated GPU, each labelled with CPU, GPU, and Cache.]

CPU/GPU Computing vs. CPU/FPGA
[Table: CPU/GPU vs. CPU/FPGA (CPU/GPU/DSP/FPGA), rated High / Mid / Low on flexibility & maintenance, power consumption, and development cost.]
• Latency: microseconds (CPU/GPU) vs. nanoseconds (CPU/FPGA)
• Latency variation: ? (CPU/GPU) vs. no variation (CPU/FPGA)

Example: Nvidia TK1
- GPU: 192 CUDA cores
- CPU: ARM Cortex-A15 quad-core
- Video decode: Full-HD 60 Hz
- Video encode: Full-HD 30 Hz
- Networking: 1 Gbit/s Ethernet

GPU Programming: CUDA
[Figure: CUDA software stack; labels: Linux compilation model, additional libraries, standard CUDA program. Source: https://code.msdn.microsoft.com/vstudio/NVIDIA-GPU-Architecture-45c11e6d]

Nvidia TK1
[Figure: GPU block diagram with 192-core units (one marked TK1), 64 KByte configurable L1/SMEM/RO, and 128 KByte L2. Source: GPU Performance Analysis, Nvidia (2012)]
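
Data Transfer on TK1
[Figure: TK1 block diagram. Input video/audio/data arrives in an Input buffer in DRAM; CPU and GPU, each with its own cache, share the DRAM; output video/audio/data leaves from an Output buffer in DRAM. A "?" marks the path by which data reaches the GPU.]
2 options for data transfer to the GPU in CUDA:
• Unified Virtual Addressing (GPU DMA transfer)
• Memory mapped (Zero Copy)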

CUDA Data Transfer
Method 1: Unified Virtual Addressing (with CPU-GPU DMA)
• Allocation in GPU memory
• Local access for first GPU
• No direct CPU access
• DMA transfer CPU <-> GPU
[Figure: a CPU and two GPUs; data moves between CPU and GPU via cudaMemcpy.]

CUDA Data Transfer
GPU processing with Unified Virtual Addressing (DMA):
Step 1: Copy data to GPU memory
Step 2: Process data in GPU using 1000s of threads
Step 3: Copy results back to host memory

CUDA Data Transfer

    // Step 0: allocate device memory
    cudaMalloc( &dev_a, size );
    cudaMalloc( &dev_b, size );
    cudaMalloc( &dev_c, size );
    // Step 1: copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );  // GPU-DMA
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );  // GPU-DMA
    // Step 2: launch add() kernel on GPU (N blocks of M threads)
    add <<< N, M >>>( dev_a, dev_b, dev_c );
    // Step 3: copy device result back to host copy of c
    cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
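
The three steps of Method 1 can be put together into a complete program. The following is a minimal sketch, not the authors' code: the kernel body, the extra length argument `n`, the values chosen for `N` and `M`, and the helper `sums_match` are assumptions added for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Element-wise add, one thread per element (kernel body is an assumption).
__global__ void add(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host-side check: true if c[i] == a[i] + b[i] for all i.
bool sums_match(const int *a, const int *b, const int *c, int n) {
    for (int i = 0; i < n; i++)
        if (c[i] != a[i] + b[i]) return false;
    return true;
}

int main() {
    const int N = 4;                 // blocks (assumed value)
    const int M = 256;               // threads per block (assumed value)
    const int n = N * M;
    const size_t size = n * sizeof(int);

    int *a = (int *)malloc(size);
    int *b = (int *)malloc(size);
    int *c = (int *)malloc(size);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

    // Step 0: allocate device memory.
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, size);

    // Step 1: copy inputs to the device.
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    // Step 2: launch the kernel with N blocks of M threads.
    add<<<N, M>>>(dev_a, dev_b, dev_c, n);

    // Step 3: copy the result back; this call waits for the kernel to finish.
    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    printf("%s\n", sums_match(a, b, c, n) ? "results match" : "mismatch");

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    free(a); free(b); free(c);
    return 0;
}
```

Each buffer exists twice here, once in host memory and once in device memory, which is what the copies in steps 1 and 3 pay for.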

CUDA Data Transfer
Method 2: Memory mapped (Zero Copy)
• Allocation in CPU memory
• Local access for CPU
• Memory mapped for GPUs
[Figure: a CPU and two GPUs.]
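
CUDA Data Transfer
GPU processing with Memory Mapped (Zero Copy):
Step 1: Copy data to GPU memory (not needed: the GPU accesses the host buffers directly)
Step 2: Process data in GPU using 1000s of threads
Step 3: Copy results back to host memory (not needed)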

Typical GPU workflow: Memory-mapped

    // Step 0: allocate pinned host memory (replaces the cudaMalloc calls)
    cudaMallocHost( &dev_a, size );
    cudaMallocHost( &dev_b, size );
    cudaMallocHost( &dev_c, size );
    // Step 1: no copy to the device (the two host-to-device cudaMemcpy calls are dropped)
    // Step 2: launch add() kernel on GPU
    add <<< N, M >>>( dev_a, dev_b, dev_c );
    // Step 3: no copy back (the device-to-host cudaMemcpy call is dropped);
    // the CPU reads dev_c directly once the kernel has finished
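
A complete zero-copy version of the same workflow might look as follows. This is a sketch under assumptions rather than the authors' program: it uses the explicit mapped-memory calls `cudaHostAlloc` with `cudaHostAllocMapped` and `cudaHostGetDevicePointer` in place of `cudaMallocHost`, so that it does not depend on unified addressing being available; the kernel body, `n`, the values of `N` and `M`, and the helper `sums_match` are illustrative, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Element-wise add, one thread per element (kernel body is an assumption).
__global__ void add(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host-side check: true if c[i] == a[i] + b[i] for all i.
bool sums_match(const int *a, const int *b, const int *c, int n) {
    for (int i = 0; i < n; i++)
        if (c[i] != a[i] + b[i]) return false;
    return true;
}

int main() {
    const int N = 4;                 // blocks (assumed value)
    const int M = 256;               // threads per block (assumed value)
    const int n = N * M;
    const size_t size = n * sizeof(int);

    // Allow host allocations to be mapped into the GPU address space.
    // This has to happen before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Step 0: allocate pinned, mapped host memory (no device allocation).
    int *a, *b, *c;
    cudaHostAlloc((void **)&a, size, cudaHostAllocMapped);
    cudaHostAlloc((void **)&b, size, cudaHostAllocMapped);
    cudaHostAlloc((void **)&c, size, cudaHostAllocMapped);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

    // Device-side addresses of the very same buffers.
    int *dev_a, *dev_b, *dev_c;
    cudaHostGetDevicePointer((void **)&dev_a, a, 0);
    cudaHostGetDevicePointer((void **)&dev_b, b, 0);
    cudaHostGetDevicePointer((void **)&dev_c, c, 0);

    // Step 1 (copy in) is gone. Step 2: launch the kernel.
    add<<<N, M>>>(dev_a, dev_b, dev_c, n);

    // Step 3 (copy out) is gone, but the CPU must wait for the kernel
    // before reading c.
    cudaDeviceSynchronize();

    printf("%s\n", sums_match(a, b, c, n) ? "results match" : "mismatch");

    cudaFreeHost(a); cudaFreeHost(b); cudaFreeHost(c);
    return 0;
}
```

The only synchronization left is `cudaDeviceSynchronize()`; every kernel access to `dev_a`, `dev_b`, and `dev_c` goes to host DRAM instead of a separate device copy.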

DMA vs. Memory-mapped
[Figure: plot comparing DMA (cudaMemcpy) with memory-mapped (Zero Copy) transfers; an annotation marks a factor of 2 between the two.]
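
GPU Latency Variation: output = input

    // __global__ (not __device__) so that it can be launched as identity<<<1,1>>>
    __global__ void identity(float *input, float *output, int numElem) {
        for (int index = 0; index < numElem; index++) {
            output[index] = input[index];
        }
    }

Input size = 25
[Figure: distribution of the measured processing time, with the 90% and 0.01% groups of launches marked.]
Tested on a Linux kernel with PREEMPT_RT (Full Preempt)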
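
A latency distribution of this kind could be collected with a small timing harness. The sketch below is an assumption about the method, since the slides do not state exactly which interval was timed: it measures kernel launch plus completion with `std::chrono`, and the run count, the warm-up launch, and the helper `percentile` are illustrative choices. Error checking is omitted.

```cuda
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void identity(const float *input, float *output, int numElem) {
    for (int index = 0; index < numElem; index++) {
        output[index] = input[index];
    }
}

// Hypothetical helper: value at fraction q (0..1) of an ascending-sorted sample.
double percentile(const std::vector<double> &sorted, double q) {
    size_t idx = static_cast<size_t>(q * (sorted.size() - 1));
    return sorted[idx];
}

int main() {
    const int numElem = 25;          // 25 floats = 100 bytes
    const int runs = 100000;         // assumed number of timed launches

    float *in, *out;
    cudaMalloc((void **)&in, numElem * sizeof(float));
    cudaMalloc((void **)&out, numElem * sizeof(float));
    cudaMemset(in, 0, numElem * sizeof(float));

    // Warm-up launch so that one-time initialization is not timed.
    identity<<<1, 1>>>(in, out, numElem);
    cudaDeviceSynchronize();

    std::vector<double> us;
    us.reserve(runs);
    for (int r = 0; r < runs; r++) {
        auto t0 = std::chrono::steady_clock::now();
        identity<<<1, 1>>>(in, out, numElem);
        cudaDeviceSynchronize();     // wait until the kernel has completed
        auto t1 = std::chrono::steady_clock::now();
        us.push_back(std::chrono::duration<double, std::micro>(t1 - t0).count());
    }

    std::sort(us.begin(), us.end());
    printf("median  : %.1f us\n", percentile(us, 0.5));
    printf("90%%     : %.1f us\n", percentile(us, 0.9));
    printf("99.99%%  : %.1f us\n", percentile(us, 0.9999));
    printf("maximum : %.1f us\n", us.back());

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Reporting high percentiles and the maximum, rather than the mean, is what exposes rare slow launches.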

GPU Latency Variation
- There is a huge variation in processing time.
- For 100 bytes of data (25 float values) per thread:
  - 90% of the launches take less than 40 microseconds.
  - 0.01% of the launches take around 500 microseconds.
- Slow launches drop the update rate from 25 kHz to 2 kHz.
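
The two update rates follow from the launch times if one assumes that each launch produces one update, so that the rate $f$ is the reciprocal of the per-launch time $T$:

$$
f = \frac{1}{T}, \qquad
\frac{1}{40\,\mu\text{s}} = 25\,\text{kHz}, \qquad
\frac{1}{500\,\mu\text{s}} = 2\,\text{kHz}.
$$

Under that assumption a slow launch lowers the instantaneous rate by a factor of $25 / 2 = 12.5$, roughly one order of magnitude.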
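
GPU Latency Variation
[Figure: processing-time distributions of identity<<<1,1>>> on the Jetson TK1 with RT kernel for input sizes 25, 250, 2500, and 25000; the 90% group of launches is marked.]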

Our Solution for Latency Variation
[Figure: TK1 block diagram with CPU and GPU, their caches, and Input and Output buffers in the shared DRAM.]

GPU, kernel loop:

    while (true) {
        poll_CPU_flag();
        output_data = fct(input_data);
    }

CPU:

    ...
    wait_for_input_in_DRAM();
    flag_to_GPU();
    ...

• Implement kernel-loops in GPU cores
• Memory mapped (zero copy) data access
• Each GPU kernel-loop produces output from its input data (memory-mapped)
• The number of GPU cores limits the number of kernel-loops
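
The kernel-loop idea can be sketched as a kernel that is launched once and then polls a flag in mapped host memory. Everything named below is an assumption made for illustration and not the authors' implementation: the `Mailbox` layout, the `request` / `response` / `quit` protocol, the identity `fct`, and the use of a single GPU thread. Error checking is omitted.

```cuda
#include <atomic>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

const int kElems = 25;

// Hypothetical mailbox living in mapped (zero-copy) host memory.
struct Mailbox {
    int request;             // CPU writes the round number when input is ready
    int response;            // GPU writes the round number when output is ready
    int quit;                // CPU sets this to 1 to end the kernel loop
    float input[kElems];
    float output[kElems];
};

// Placeholder for the processing function (identity, as in the latency test).
__host__ __device__ float fct(float x) { return x; }

__global__ void kernel_loop(volatile Mailbox *box) {
    int seen = 0;
    while (true) {
        // poll_CPU_flag(): spin until a new request or a quit signal arrives.
        while (box->request == seen && box->quit == 0) { }
        if (box->quit != 0) break;
        seen = box->request;
        __threadfence_system();          // order the flag read before the data reads
        for (int i = 0; i < kElems; i++) box->output[i] = fct(box->input[i]);
        __threadfence_system();          // make the output visible before the flag
        box->response = seen;
    }
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);

    Mailbox *h_box = nullptr;
    cudaHostAlloc((void **)&h_box, sizeof(Mailbox), cudaHostAllocMapped);
    memset(h_box, 0, sizeof(Mailbox));

    Mailbox *d_box = nullptr;
    cudaHostGetDevicePointer((void **)&d_box, h_box, 0);

    kernel_loop<<<1, 1>>>(d_box);        // launched once, then kept running

    volatile Mailbox *box = h_box;
    for (int round = 1; round <= 3; round++) {
        // wait_for_input_in_DRAM(): here the CPU simply generates the input.
        for (int i = 0; i < kElems; i++) box->input[i] = round * 100.0f + i;
        std::atomic_thread_fence(std::memory_order_seq_cst);
        box->request = round;            // flag_to_GPU()
        while (box->response != round) { }   // wait for the GPU's answer
        std::atomic_thread_fence(std::memory_order_seq_cst);
        printf("round %d: output[0] = %.1f\n", round, box->output[0]);
    }

    box->quit = 1;                       // let the kernel loop terminate
    cudaDeviceSynchronize();
    cudaFreeHost(h_box);
    return 0;
}
```

After the single launch, each round costs only a flag write and a poll on each side, so the per-update latency no longer includes a kernel launch.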

SoCs with GPU as Industrial Modules
[Figure: photos of Nvidia TK1 modules, an Nvidia TX1 module, a Snapdragon 820 module, and an Allwinner A80 module.]
Sources: Nvidia, Avionic Design, Toradex, Intrinsyc, Theobroma Systems

SoCs with GPU as Industrial Modules
Mobile processor applications:
• Android TV
• Video conferencing
• Lecture recording and streaming
• Medical imaging
• Driving assistance
Source: Google / PMK

Conclusion
- Our results confirm that for small data chunks, memory-mapped transfers are more efficient
- We observe a huge but rare variation in GPU processing time
- The variation dramatically reduces the update rate, by an order of magnitude
- Our solution is to implement GPU kernel-loops and memory-mapped transfer