GPU INFERENCE IN THE DATACENTER
Drew Farris, Chief Technologist @ Booz Allen Hamilton
NVIDIA GPU Technology Conference, Washington DC
November 2017, Eglin AFB, FL
BOOZ ALLEN HAMILTON
"Microprocessors no longer scale at the level of performance they used to. The end of what you would call Moore's Law: semiconductor physics prevents us from taking Dennard scaling any further."
- Jen-Hsun Huang, CEO, NVIDIA
INTRODUCTION

THE DAYS OF EASY PERFORMANCE GAINS ARE GONE
We need alternatives to general-purpose CPU computation.

GRAPHICS PROCESSING UNITS PROVIDE AN ALTERNATIVE
- Algebraic strengths of GPUs
- Enable new algorithms
- Adaptable to a variety of tasks
- Readily available libraries (CUDA, CV/DL frameworks)

HOW CAN WE LEVERAGE OUR EXISTING INVESTMENTS?
- HPC vs. commodity hardware in the datacenter
- Evolution vs. revolution
- Scaling out vs. scaling up
THE REALITY

WE HAVE A PROBLEM
How do we apply complex algorithms as part of our ingest process?
- Computationally expensive algorithms
- Heterogeneous dataflow
- Horizontal/linear scalability at the datacenter level

How do we accommodate this within our existing compute fabric?
- Hadoop clusters: HDFS, YARN, etc.
- Commodity nodes
- Small numbers of special-purpose nodes
- 10G interconnects
- Cost, power, space and cooling

NO SUPERCOMPUTERS, NO MODEL TRAINING
- Power-efficient GPUs (50-75W)
- We could focus on model inference
  o Application of models to new data, e.g. classification
DATACENTER ARCHITECTURE
SINGLE NODE
- 128G RAM
- 12x drives
- 24 cores / 48 HT
- PCI Express slot?

SINGLE RACK
- 40 nodes
- 10G ToR switch
- 15-16 kW

DATACENTER
- Many racks
- Interconnect
- Rows?
DATA EXTRACTION PIPELINE

DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS
As part of the data ingest pipeline, this system must extract and analyze data in a wide variety of formats and perform normalization to prepare it for indexing.
- Unpacking, uncompressing archives
  o Zip, tar, etc.
- Converting binary formats to text
  o Word, PDF
- Extracting metadata
  o EXIF data from images
- Classifying images / segmenting images / detecting objects
  o Search images using text
- Optical character recognition
  o Extract text from scanned documents
- Detecting malware
  o Executables, PDFs, RTF

The heterogeneous nature of this data was a problem: complex data and analysis would disrupt latency across all data types.
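A pipeline like this is typically structured as a dispatch table from detected content type to a handler. A minimal Java sketch of that dispatch pattern; the handler names and canned results are hypothetical, since the deck does not show the real framework's interfaces:

```java
import java.util.Map;
import java.util.function.Function;

public class ExtractDispatch {
    // Hypothetical: map a detected MIME type to an extraction step.
    static final Map<String, Function<byte[], String>> HANDLERS = Map.of(
        "application/zip", b -> "unpacked " + b.length + " bytes",
        "application/pdf", b -> "converted PDF to text",
        "image/jpeg",      b -> "extracted EXIF, classified image"
    );

    static String process(String mimeType, byte[] data) {
        // Unknown types fall through to a pass-through handler.
        return HANDLERS.getOrDefault(mimeType, b -> "stored as-is")
                       .apply(data);
    }

    public static void main(String[] args) {
        System.out.println(process("application/zip", new byte[16]));
        System.out.println(process("text/plain", new byte[0]));
    }
}
```

Because every handler is a pure function of the input bytes, handlers for different data types can run side by side without coordinating, which is what makes the heterogeneous-latency problem tractable.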
GPU ACCELERATED DATA EXTRACTION PIPELINE

DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS
Some of these tasks are straightforward to accelerate using GPUs, so we decided to start with the following:
- Unpacking, uncompressing archives
  o Zip, tar, etc.
- Converting binary formats to text
  o Word, PDF
- Extracting metadata
  o EXIF data from images
- Classifying images / segmenting images / detecting objects
  o Search images using text
- Optical character recognition
  o Extract text from scanned documents
- Detecting malware
  o Executables, PDFs, RTF
DATA EXTRACTION REQUIREMENTS

CPU, MEMORY AND THROUGHPUT
To scale linearly as we add more resources, our system must have the following characteristics:
- Shared-nothing vs. share-little
- Stateless vs. minimally stateful
- CPU-bound (no disk I/O, network I/O, bus or memory bottlenecks)
- Uniformly fast: individual document processing in 10-100ms
- RAM-frugal: no large models in memory
- Something that plays well with Java
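The shared-nothing, stateless requirement maps naturally onto a fixed thread pool of independent workers. A minimal sketch in Java; the per-document work is simulated, since the real extraction steps are not shown here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StatelessWorkers {
    // Each task is self-contained: it takes a document ID and returns a
    // result without touching shared mutable state.
    static String processDocument(int docId) {
        return "doc-" + docId + ":ok";
    }

    public static List<String> run(int docs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = IntStream.range(0, docs)
                .mapToObj(i -> pool.submit(() -> processDocument(i)))
                .collect(Collectors.toList());
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) results.add(f.get());
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(100, 10).size() + " documents processed");
    }
}
```

Because workers share nothing, throughput should scale with the thread count until some resource (CPU, or later the GPU) saturates, which is exactly what the concurrency experiments later in the deck measure.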
INTEGRATION OPTIONS

What plays well with Java? Or: how the heck do we get it to talk to the CUDA libraries?
- Pure Java
  o No GPU acceleration
- Java Native Interface (or some derivative)
  o Hand-wrapped API calls (JNI, JNA)
  o JavaCPP (Java and native C++ bridge)
  o Deeplearning4j cuDNN integration (as of 0.9.1)
  o Tensorflow Java API
- External processes
  o Forked executable
  o Shared memory
  o Sockets (TCP, UDP, raw, etc.)
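Of these options, the forked executable is the simplest to sketch: the JVM launches a native process and reads its output over a pipe. A minimal example using a stand-in command (`echo` here; a real deployment would fork a CUDA-linked binary):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ForkedExe {
    // Launch an external process and capture its stdout.
    public static String runAndCapture(String... command) throws Exception {
        Process p = new ProcessBuilder(command)
            .redirectErrorStream(true)  // merge stderr into stdout
            .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) out.append(line);
        }
        p.waitFor();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAndCapture("echo", "classified"));
    }
}
```

The trade-off is fork-and-pipe overhead on every call, which is one reason the sockets option (and ultimately an HTTP service) becomes attractive for a long-running inference engine.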
(Diagram: a single node, CPU and memory hosting a Java VM with multiple threads and a JVM heap; the JVM reaches native code through a JNI lib wrapping a native library, a pure Java lib, or a forked executable; local storage attached.)
NOTIONAL INTEGRATION

(Diagram: the same single node with a GPU and GPU memory added; candidate integration paths from the JVM: JNI lib? -> TensorRT, wrapped lib? -> CUDA lib, forked exe? -> OpenCV lib, Java lib? -> Caffe.)

What do we want to be able to do?
- Multiple library or framework support
  o Caffe / Tensorflow / Torch / others
  o CUDA-accelerated OpenCV
  o TensorRT
  o Other CUDA libraries
SOLUTION

So, what components make up the solution?
NVIDIA TESLA P4 INFERENCE ACCELERATOR

"ULTRA-EFFICIENT DEEP LEARNING IN SCALE-OUT SERVERS"
- NVIDIA Pascal architecture
- 5.5 TeraFLOPS single-precision performance
- 22 Tera-operations per second INT8 performance
- 8GB GPU memory
- 192 GB/s GPU memory bandwidth
- Low-profile PCI Express
- 50W/75W max power
- http://www.nvidia.com/object/accelerate-inference.html
CAFFE

IMAGE CLASSIFICATION WITH ALEXNET USING CAFFE
We used CaffeNet, a pre-trained AlexNet model based on the ILSVRC 2012 dataset.
o A good stand-in for more complex image models
o Evaluated both CPU-only and GPU variants of Caffe to characterize performance differences
o One image per batch
o Lightly modified to properly handle multithreading and CUDA streams
OPEN CV

CUDA ACCELERATED COMPUTER VISION LIBRARY
Images were resized using GPU resources instead of CPU resources; as a result, it is not necessary to copy the resized image data to the input layer.
- AlexNet input layer size is 224px x 224px
- Produces a GpuMat object for image data allocated from GPU memory
- GpuMat wrapped to use as the input layer for the network, avoiding the need for an extra copy
- Custom GpuMat allocators introduced in OpenCV 3.2.0
NVIDIA TENSORRT

HIGH PERFORMANCE DEEP LEARNING INFERENCE OPTIMIZER
TensorRT loads Caffe or Tensorflow models and optimizes them for inference performance. In this case, we used it to host the same Caffe model used for the image classification task.
- FP32 to INT8 while minimizing accuracy loss
- Better GPU utilization
- Kernel autotuning
- Improved memory footprint
- Multi-stream execution
- Used unchanged Caffe model for image classification
PYTORCH

MALCONV: MALWARE DETECTION WITH DEEP LEARNING
A convolutional neural network that digests entire binaries for malware identification.
- A custom malware identification model
- Current ingest framework leverages an un-accelerated predecessor
  o Can't use MalConv because it's too computationally intense
- Integration with PyTorch will require some work
  o No great inference layer available for PyTorch
  o Model translation with ONNX to Caffe2 (or other?)
  o Currently a work in progress
- How do the ergonomics differ from the image classification task?
NVIDIA GPU REST ENGINE

DEEP LEARNING INFERENCE VIA REST
The GRE provided memory and process isolation, plus native libraries for hardware access.
- Multi-threaded HTTP server in Golang
- RESTful interface
- Multithreaded Caffe
- TensorRT, NVidia's inference engine
- CUDA-accelerated OpenCV
- Containerized in Docker
- Framework for other inference engines
- https://developer.nvidia.com/gre
- https://github.com/NVIDIA/gpu-rest-engine
NVIDIA DOCKER

SIMPLIFIED PACKAGING AND DEPLOYMENT VIA CONTAINERS
Packaging is performed in one environment and rapidly deployed to a large number of nodes.
- Docker image building/testing on Amazon Elastic Compute Cloud
- Test environment on an isolated network
- Install Docker, CUDA libraries/drivers and NVidia Docker, and go
- Portability across CentOS 7 nodes
- Supported laptop development of new analytics
- Worked out of the box for Caffe models in Caffe / TensorRT
- https://github.com/NVIDIA/nvidia-docker
INSTRUMENTATION

We collected telemetry during evaluation with a suite of components we use for tracking system performance on production systems.
- CollectD / StatsD API with various plugins for CPU, disk, memory, IO
- nvidia-smi for GPU information
- Timely for metric storage and analysis
- Grafana for visualization, dashboarding and analysis
- NVidia Data Center GPU Manager
  o Active health monitoring, early fault detection (SMART for GPUs)
  o Power management
  o Configuration & reporting
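nvidia-smi can emit machine-readable CSV (e.g. `nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw --format=csv,noheader,nounits`), which is easy to fold into a metrics pipeline alongside CollectD. A sketch of parsing one such line in Java, using a hard-coded sample line since no GPU is assumed here:

```java
public class GpuSample {
    final int utilizationPct;
    final int memoryUsedMb;
    final double powerWatts;

    GpuSample(int u, int m, double p) {
        utilizationPct = u; memoryUsedMb = m; powerWatts = p;
    }

    // Parse one CSV line in the field order requested from nvidia-smi:
    // utilization.gpu, memory.used, power.draw
    static GpuSample parse(String csvLine) {
        String[] f = csvLine.split(",");
        return new GpuSample(
            Integer.parseInt(f[0].trim()),
            Integer.parseInt(f[1].trim()),
            Double.parseDouble(f[2].trim()));
    }

    public static void main(String[] args) {
        // Sample line standing in for live nvidia-smi output.
        GpuSample s = parse("82, 1339, 44.02");
        System.out.println("util=" + s.utilizationPct + "% mem="
            + s.memoryUsedMb + "MB power=" + s.powerWatts + "W");
    }
}
```

In production the line would come from polling nvidia-smi (or DCGM directly) and the parsed fields would be shipped to the StatsD/Timely pipeline like any other host metric.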
(Diagram: final integration on a single node. The Java VM, with its threads and JVM heap, runs alongside a GPU REST Engine container under NVIDIA Docker; JVM threads make HTTP calls to the GRE, which hosts TensorRT, the Caffe lib and the OpenCV lib on the Tesla P4 GPU and its memory; local storage attached.)
FINAL INTEGRATION
- REST calls from Java to GRE
- Golang coordinator
  o Copy image to GpuMat
  o Resize GpuMat in OpenCV
  o Resized GpuMat becomes input layer
  o Calls to framework for inference
  o Caffe reference model hosted in Caffe or TensorRT
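On the Java side, this integration reduces to posting image bytes to an HTTP endpoint and reading back a label. A self-contained sketch: the JDK's built-in HttpServer stands in for the GRE (the `/api/classify` path and the canned "tabby cat" reply are invented for illustration), and the client posts bytes the way an ingest thread would:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestInference {
    // Spins up a local stub server, posts image bytes, returns the label.
    static String demo() throws Exception {
        // Stub standing in for the GPU REST Engine; a real GRE would run
        // classification on the posted bytes instead of a canned reply.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/api/classify", exchange -> {
            byte[] body = "tabby cat".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            byte[] imageBytes = {1, 2, 3};  // stand-in for JPEG data
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:" + port + "/api/classify"))
                .POST(HttpRequest.BodyPublishers.ofByteArray(imageBytes))
                .build();
            return client.send(request,
                HttpResponse.BodyHandlers.ofString()).body();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

The appeal of this pattern is isolation: the JVM never loads native CUDA libraries into its own process, so a crash in the inference stack cannot take down the ingest framework.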
EXPERIMENTS AND RESULTS

What did we evaluate and observe?
BASELINE CONCURRENCY TESTS WITH CAFFE CPU

What effect does concurrency have on the ability to classify images? How quickly can we classify images using only the CPU?

We processed 9000 images through the ETL framework, GRE and Caffe, CPU only.

Java Thread Count                 10      24      32
Total Elapsed Time (Seconds)      271.65  175.59  416
Minimum Processing Time (Msec)    239     465.8   619.2
Mean (Msec)                       300     100.49  149.87
Max. (Msec)                       483     880     1066
CPU Max User (%)                  83.0    99.8    100.00
GPU Max Utilization (%)           0       0       0

(Histogram: count of images vs. milliseconds per image, 200-1000ms, by thread count 10/24/32.)
CONCURRENCY TESTS WITH CAFFE GPU

What effect does concurrency have on the ability to classify images? Can the CPU provide enough work to keep the GPUs busy?

Java Thread Count                 10      24      32
Total Elapsed Time (Seconds)      37.425  38.451  38.153
Minimum Processing Time (Msec)    7       8       10
Mean (Msec)                       39.66   100.49  149.87
Max. (Msec)                       163     251     415
CPU Max User (%)                  56      45      55
GPU Max Utilization (%)           82      79      81

(Histogram: count of images vs. milliseconds per image, 0-400ms, by thread count 10/24/32.)
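The elapsed times translate directly into throughput: over the same 9000 images at 10 threads, Caffe GPU delivers roughly 9000/37.425, about 240 images/second, against roughly 9000/271.65, about 33 images/second, for the CPU-only baseline, a speedup of about 7x. A quick check of that arithmetic:

```java
public class Throughput {
    static double imagesPerSecond(int images, double elapsedSeconds) {
        return images / elapsedSeconds;
    }

    public static void main(String[] args) {
        double cpu = imagesPerSecond(9000, 271.65);  // Caffe CPU, 10 threads
        double gpu = imagesPerSecond(9000, 37.425);  // Caffe GPU, 10 threads
        System.out.printf("CPU: %.1f img/s, GPU: %.1f img/s, speedup: %.1fx%n",
            cpu, gpu, gpu / cpu);
        // prints: CPU: 33.1 img/s, GPU: 240.5 img/s, speedup: 7.3x
    }
}
```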
CONCURRENCY TESTS WITH TENSORRT

How does TensorRT performance differ from Caffe CPU / Caffe GPU?

Java Thread Count                 10      24      32
Total Elapsed Time (Seconds)      33.327  33.260  33.399
Minimum Processing Time (Msec)    5       5       8
Mean (Msec)                       35.01   86.30   116.08
Max (Msec)                        188     258     416
CPU Max User (%)                  47      54      57
GPU Max Utilization (%)           85      83      84

(Histogram: count of images vs. milliseconds per image, 0-400ms, by thread count 10/24/32.)
TENSORRT VS CAFFE

How does performance compare between TensorRT and Caffe?
o TensorRT is generally faster and uses more of the GPU than Caffe.
o The model loaded into TensorRT consumes 33% less GPU memory.

Framework / Thread Count          Caffe GPU, 10 Threads  TensorRT, 10 Threads
Total Elapsed Time (Seconds)      37.425                 33.327
Minimum Processing Time (Msec)    7                      5
Mean (Msec)                       39.66                  35.01
Max (Msec)                        163                    188
CPU Max User (%)                  56                     47
GPU Max Utilization (%)           82                     85
GPU Memory Utilization (MB)       1339                   895

(Histogram: count of images vs. milliseconds per image, 0-150ms, Caffe GPU vs. TensorRT.)
TENSORRT VS CAFFE

How does performance compare between TensorRT and Caffe?
o TensorRT seems to handle concurrency better than Caffe.

Framework / Thread Count          Caffe GPU, 32 Threads  TensorRT, 32 Threads
Total Elapsed Time (Seconds)      38.153                 33.399
Minimum Processing Time (Msec)    10                     8
Mean (Msec)                       39.66                  35.01
Max (Msec)                        149.9                  116
CPU Max User (%)                  58                     64
GPU Max Utilization (%)           81                     84
GPU Memory Utilization (MB)       1339                   895

(Histogram: count of images vs. milliseconds per image, 0-400ms, Caffe GPU vs. TensorRT.)
TENSORRT VS CAFFE

How does performance compare between TensorRT and Caffe?
o Both TensorRT and Caffe GPU are considerably more performant than Caffe CPU.

Framework / Thread Count          Caffe CPU, 10 Threads  TensorRT, 10 Threads
Total Elapsed Time (Seconds)      271.659                33.327
Minimum Processing Time (Msec)    239                    5
Mean (Msec)                       280                    35.01
Max (Msec)                        296                    188
CPU Max User (%)                  83                     47
GPU Max Utilization (%)           0                      85

(Histogram: count of images vs. milliseconds per image, 0-500ms, Caffe GPU vs. TensorRT vs. Caffe CPU.)
POWER UTILIZATION

How does power utilization compare across both TensorRT and Caffe?
o They are relatively the same, with the Tesla P4 consuming 24 watts at near idle
o 45 watts at 80% GPU utilization for the 50W model
o Low GPU memory utilization
o An additional 1.8 kW per rack == increased power consumption of ~10%

Framework                    Caffe                    TensorRT
Thread Count                 10     24     32         10     24     32
Min. Power Use (Watts)       24.26  24.55  25.13      24.36  23.78  24.36
Max. (Watts)                 44.02  46.3   45.88      45.32  45.01  45.36
CPU Max User (%)             56     45     55         47     54     57
GPU Utilization (%)          82     79     81         85     83     84
GPU Max Memory Used (MB)     1339   1339   1341       895    895    895
GPU Max Memory Used (%)      ~16%                     ~10%
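The ~10% figure follows from the rack numbers given earlier in the deck: 40 nodes per rack, one P4 per node at roughly 45W under load, against a 15-16 kW rack budget. A quick check of the arithmetic, taking the upper end of the budget:

```java
public class RackPower {
    public static void main(String[] args) {
        int nodesPerRack = 40;
        double gpuWattsUnderLoad = 45.0;   // P4 at ~80% utilization
        double rackBudgetWatts = 16000.0;  // 15-16 kW rack, upper bound

        double addedWatts = nodesPerRack * gpuWattsUnderLoad;
        double pctIncrease = 100.0 * addedWatts / rackBudgetWatts;
        System.out.printf("Added: %.0f W (~%.0f%% of rack budget)%n",
            addedWatts, pctIncrease);
        // 40 x 45 W = 1800 W, i.e. the "additional 1.8 kW per rack" above
    }
}
```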
WHAT'S NEXT?

There's much more to explore; what are some of the things we should tackle next?
- Develop a better understanding of the drivers of GPU usage
  o What's required to exceed 80% utilization?
  o Investigate relatively constant elapsed time
  o Where is the waste/overhead in this architecture?
- Explore additional use cases and models
  o Complete integration path for the Torch-based malware model
  o Evaluate multiple models in memory: how much GPU memory is needed?
- Scale it out
  o Operational management of GPUs at scale: NVidia Datacenter GPU Manager
THANK YOU

Ken Singer & Jake Gingrich @ HP Enterprise, SGI Federal Systems
Rob Zuppert, Larry Brown and Brad Rees @ NVidia
Felix Abecassis & the GPU REST Engine Team @ NVidia
Edward Raff, Jared Sylvester & other MalConv researchers @ UMD LPS
Sterling Foster & others @ US Department of Defense
Steven Mills, Data Solutions & Machine Intelligence Team @ Booz Allen Hamilton
QUESTIONS?

Find me at:
o LinkedIn: https://www.linkedin.com/in/drewfarris/
o Twitter: @drewfarris
o Web: https://www.boozallen.com/expertise/analytics.html