GPU INFERENCE IN THE DATACENTER
Drew Farris, Chief Technologist @ Booz Allen Hamilton
NVIDIA GPU Technology Conference, Washington DC
November 2017, Eglin AFB, FL
BOOZ ALLEN HAMILTON
"Microprocessors no longer scale at the level of performance they used to. The end of what you would call Moore's Law: semiconductor physics prevents us from taking Dennard scaling any further."
- Jen-Hsun Huang, CEO, NVIDIA
INTRODUCTION

THE DAYS OF EASY PERFORMANCE GAINS ARE GONE
We need alternatives to general-purpose CPU computation.

GRAPHICS PROCESSING UNITS PROVIDE AN ALTERNATIVE
- Algebraic strengths of GPUs
- Enable new algorithms
- Adaptable to a variety of tasks
- Readily available libraries (CUDA, CV/DL frameworks)

HOW CAN WE LEVERAGE OUR EXISTING INVESTMENTS?
- HPC vs. commodity hardware in the datacenter
- Evolution vs. revolution
- Scaling out vs. scaling up
THE REALITY

WE HAVE A PROBLEM
How do we apply complex algorithms as part of our ingest process?
- Computationally expensive algorithms
- Heterogeneous dataflow
- Horizontal/linear scalability at the datacenter level

How do we accommodate this within our existing compute fabric?
- Hadoop clusters: HDFS, YARN, etc.
- Commodity nodes
- Small numbers of special-purpose nodes
- 10G interconnects
- Cost, power, space and cooling

NO SUPERCOMPUTERS, NO MODEL TRAINING
- Power-efficient GPUs (50-75W)
- We could focus on model inference
  o Application of models to new data, e.g. classification
DATACENTER ARCHITECTURE
SINGLE NODE
- 128G RAM
- 12x drives
- 24 cores / 48 HT
- PCI Express slot?

SINGLE RACK
- 40 nodes
- 10G ToR switch
- 15-16 kW

DATACENTER
- Many racks
- Interconnect
- Rows?
DATA EXTRACTION PIPELINE

DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS
As part of the data ingest pipeline, this system must extract and analyze data in a wide variety of formats and perform normalization to prepare it for indexing.
- Unpacking, uncompressing archives
  o Zip, tar, etc.
- Converting binary formats to text
  o Word, PDF
- Extracting metadata
  o EXIF data from images
- Classifying images / segmenting images / detecting objects
  o Search images using text
- Optical character recognition
  o Extract text from scanned documents
- Detecting malware
  o Executables, PDFs, RTF

The heterogeneous nature of this data was a problem: complex data and analysis would disrupt latency across all data types.
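A pipeline like this is typically structured as a dispatch table from detected content type to a handler. A minimal Java sketch of that dispatch pattern; the handler names and canned results are hypothetical, since the deck does not show the real framework's interfaces:

```java
import java.util.Map;
import java.util.function.Function;

public class ExtractDispatch {
    // Hypothetical: map a detected MIME type to an extraction step.
    static final Map<String, Function<byte[], String>> HANDLERS = Map.of(
        "application/zip", b -> "unpacked " + b.length + " bytes",
        "application/pdf", b -> "converted PDF to text",
        "image/jpeg",      b -> "extracted EXIF, classified image"
    );

    static String process(String mimeType, byte[] data) {
        // Unknown types fall through to a pass-through handler.
        return HANDLERS.getOrDefault(mimeType, b -> "stored as-is")
                       .apply(data);
    }

    public static void main(String[] args) {
        System.out.println(process("application/zip", new byte[16]));
        System.out.println(process("text/plain", new byte[0]));
    }
}
```

Because every handler is a pure function of the input bytes, handlers for different data types can run side by side without coordinating, which is what makes the heterogeneous-latency problem tractable.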
GPU ACCELERATED DATA EXTRACTION PIPELINE

DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS
Some of these tasks are straightforward to accelerate using GPUs, so we decided to start with the following:
- Unpacking, uncompressing archives
  o Zip, tar, etc.
- Converting binary formats to text
  o Word, PDF
- Extracting metadata
  o EXIF data from images
- Classifying images / segmenting images / detecting objects
  o Search images using text
- Optical character recognition
  o Extract text from scanned documents
- Detecting malware
  o Executables, PDFs, RTF
DATA EXTRACTION REQUIREMENTS

CPU, MEMORY AND THROUGHPUT
To scale linearly as we add more resources, our system must have the following characteristics:
- Shared-nothing vs. share-little
- Stateless vs. minimally stateful
- CPU-bound (no disk I/O, network I/O, bus or memory bottlenecks)
- Uniformly fast: individual document processing in 10-100ms
- RAM-frugal: no large models in memory
- Something that plays well with Java
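The shared-nothing, stateless requirement maps naturally onto a fixed thread pool of independent workers. A minimal sketch in Java; the per-document work is simulated, since the real extraction steps are not shown here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StatelessWorkers {
    // Each task is self-contained: it takes a document ID and returns a
    // result without touching shared mutable state.
    static String processDocument(int docId) {
        return "doc-" + docId + ":ok";
    }

    public static List<String> run(int docs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = IntStream.range(0, docs)
                .mapToObj(i -> pool.submit(() -> processDocument(i)))
                .collect(Collectors.toList());
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) results.add(f.get());
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(100, 10).size() + " documents processed");
    }
}
```

Because workers share nothing, throughput should scale with the thread count until some resource (CPU, or later the GPU) saturates, which is exactly what the concurrency experiments later in the deck measure.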
INTEGRATION OPTIONS

What plays well with Java? Or: how the heck do we get it to talk to the CUDA libraries?
- Pure Java
  o No GPU acceleration
- Java Native Interface (or some derivative)
  o Hand-wrapped API calls (JNI, JNA)
  o JavaCPP (Java and native C++ bridge)
  o Deeplearning4j cuDNN integration (as of 0.9.1)
  o Tensorflow Java API
- External processes
  o Forked executable
  o Shared memory
  o Sockets (TCP, UDP, raw, etc.)
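Of these options, the forked executable is the simplest to sketch: the JVM launches a native process and reads its output over a pipe. A minimal example using a stand-in command (`echo` here; a real deployment would fork a CUDA-linked binary):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ForkedExe {
    // Launch an external process and capture its stdout.
    public static String runAndCapture(String... command) throws Exception {
        Process p = new ProcessBuilder(command)
            .redirectErrorStream(true)  // merge stderr into stdout
            .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) out.append(line);
        }
        p.waitFor();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAndCapture("echo", "classified"));
    }
}
```

The trade-off is fork-and-pipe overhead on every call, which is one reason the sockets option (and ultimately an HTTP service) becomes attractive for a long-running inference engine.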
(Diagram: a single node, CPU and memory hosting a Java VM with multiple threads and a JVM heap; the JVM reaches native code through a JNI lib wrapping a native library, a pure Java lib, or a forked executable; local storage attached.)
NOTIONAL INTEGRATION

(Diagram: the same single node with a GPU and GPU memory added; candidate integration paths from the JVM: JNI lib? -> TensorRT, wrapped lib? -> CUDA lib, forked exe? -> OpenCV lib, Java lib? -> Caffe.)

What do we want to be able to do?
- Multiple library or framework support
  o Caffe / Tensorflow / Torch / others
  o CUDA-accelerated OpenCV
  o TensorRT
  o Other CUDA libraries
SOLUTION

So, what components make up the solution?
NVIDIA TESLA P4 INFERENCE ACCELERATOR

"ULTRA-EFFICIENT DEEP LEARNING IN SCALE-OUT SERVERS"
- NVIDIA Pascal architecture
- 5.5 TeraFLOPS single-precision performance
- 22 Tera-operations per second INT8 performance
- 8GB GPU memory
- 192 GB/s GPU memory bandwidth
- Low-profile PCI Express
- 50W/75W max power
- http://www.nvidia.com/object/accelerate-inference.html
CAFFE

IMAGE CLASSIFICATION WITH ALEXNET USING CAFFE
We used CaffeNet, a pre-trained AlexNet model based on the ILSVRC 2012 dataset.
o A good stand-in for more complex image models
o Evaluated both CPU-only and GPU variants of Caffe to characterize performance differences
o One image per batch
o Lightly modified to properly handle multithreading and CUDA streams
OPEN CV

CUDA ACCELERATED COMPUTER VISION LIBRARY
Images were resized using GPU resources instead of CPU resources; as a result, it is not necessary to copy the resized image data to the input layer.
- AlexNet input layer size is 224px x 224px
- Produces a GpuMat object for image data allocated from GPU memory
- GpuMat wrapped to use as the input layer for the network, avoiding the need for an extra copy
- Custom GpuMat allocators introduced in OpenCV 3.2.0
NVIDIA TENSORRT

HIGH PERFORMANCE DEEP LEARNING INFERENCE OPTIMIZER
TensorRT loads Caffe or Tensorflow models and optimizes them for inference performance. In this case, we used it to host the same Caffe model used for the image classification task.
- FP32 to INT8 while minimizing accuracy loss
- Better GPU utilization
- Kernel autotuning
- Improved memory footprint
- Multi-stream execution
- Used unchanged Caffe model for image classification
PYTORCH

MALCONV: MALWARE DETECTION WITH DEEP LEARNING
A convolutional neural network that digests entire binaries for malware identification.
- A custom malware identification model
- Current ingest framework leverages an un-accelerated predecessor
  o Can't use MalConv because it's too computationally intense
- Integration with PyTorch will require some work
  o No great inference layer available for PyTorch
  o Model translation with ONNX to Caffe2 (or other?)
  o Currently a work in progress
- How do the ergonomics differ from the image classification task?
NVIDIA GPU REST ENGINE

DEEP LEARNING INFERENCE VIA REST
The GRE provided memory and process isolation, plus native libraries for hardware access.
- Multi-threaded HTTP server in Golang
- RESTful interface
- Multithreaded Caffe
- TensorRT, NVidia's inference engine
- CUDA-accelerated OpenCV
- Containerized in Docker
- Framework for other inference engines
- https://developer.nvidia.com/gre
- https://github.com/NVIDIA/gpu-rest-engine
NVIDIA DOCKER

SIMPLIFIED PACKAGING AND DEPLOYMENT VIA CONTAINERS
Packaging is performed in one environment and rapidly deployed to a large number of nodes.
- Docker image building/testing on Amazon Elastic Compute Cloud
- Test environment on an isolated network
- Install Docker, CUDA libraries/drivers and NVidia Docker, and go
- Portability across CentOS 7 nodes
- Supported laptop development of new analytics
- Worked out of the box for Caffe models in Caffe / TensorRT
- https://github.com/NVIDIA/nvidia-docker
INSTRUMENTATION

We collected telemetry during evaluation with a suite of components we use for tracking system performance on production systems.
- CollectD / StatsD API with various plugins for CPU, disk, memory, IO
- nvidia-smi for GPU information
- Timely for metric storage and analysis
- Grafana for visualization, dashboarding and analysis
- NVidia Data Center GPU Manager
  o Active health monitoring, early fault detection (SMART for GPUs)
  o Power management
  o Configuration & reporting
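nvidia-smi can emit machine-readable CSV (e.g. `nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw --format=csv,noheader,nounits`), which is easy to fold into a metrics pipeline alongside CollectD. A sketch of parsing one such line in Java, using a hard-coded sample line since no GPU is assumed here:

```java
public class GpuSample {
    final int utilizationPct;
    final int memoryUsedMb;
    final double powerWatts;

    GpuSample(int u, int m, double p) {
        utilizationPct = u; memoryUsedMb = m; powerWatts = p;
    }

    // Parse one CSV line in the field order requested from nvidia-smi:
    // utilization.gpu, memory.used, power.draw
    static GpuSample parse(String csvLine) {
        String[] f = csvLine.split(",");
        return new GpuSample(
            Integer.parseInt(f[0].trim()),
            Integer.parseInt(f[1].trim()),
            Double.parseDouble(f[2].trim()));
    }

    public static void main(String[] args) {
        // Sample line standing in for live nvidia-smi output.
        GpuSample s = parse("82, 1339, 44.02");
        System.out.println("util=" + s.utilizationPct + "% mem="
            + s.memoryUsedMb + "MB power=" + s.powerWatts + "W");
    }
}
```

In production the line would come from polling nvidia-smi (or DCGM directly) and the parsed fields would be shipped to the StatsD/Timely pipeline like any other host metric.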
(Diagram: final integration on a single node. The Java VM, with its threads and JVM heap, runs alongside a GPU REST Engine container under NVIDIA Docker; JVM threads make HTTP calls to the GRE, which hosts TensorRT, the Caffe lib and the OpenCV lib on the Tesla P4 GPU and its memory; local storage attached.)
FINAL INTEGRATION
- REST calls from Java to GRE
- Golang coordinator
  o Copy image to GpuMat
  o Resize GpuMat in OpenCV
  o Resized GpuMat becomes input layer
  o Calls to framework for inference
  o Caffe reference model hosted in Caffe or TensorRT
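On the Java side, this integration reduces to posting image bytes to an HTTP endpoint and reading back a label. A self-contained sketch: the JDK's built-in HttpServer stands in for the GRE (the `/api/classify` path and the canned "tabby cat" reply are invented for illustration), and the client posts bytes the way an ingest thread would:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestInference {
    // Spins up a local stub server, posts image bytes, returns the label.
    static String demo() throws Exception {
        // Stub standing in for the GPU REST Engine; a real GRE would run
        // classification on the posted bytes instead of a canned reply.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/api/classify", exchange -> {
            byte[] body = "tabby cat".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            byte[] imageBytes = {1, 2, 3};  // stand-in for JPEG data
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:" + port + "/api/classify"))
                .POST(HttpRequest.BodyPublishers.ofByteArray(imageBytes))
                .build();
            return client.send(request,
                HttpResponse.BodyHandlers.ofString()).body();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

The appeal of this pattern is isolation: the JVM never loads native CUDA libraries into its own process, so a crash in the inference stack cannot take down the ingest framework.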
EXPERIMENTS AND RESULTS

What did we evaluate and observe?
BASELINE CONCURRENCY TESTS WITH CAFFE CPU

What effect does concurrency have on the ability to classify images? How quickly can we classify images using only the CPU?

We processed 9000 images through the ETL framework, GRE and Caffe, CPU only.

Java Thread Count                 10      24      32
Total Elapsed Time (Seconds)      271.65  175.59  416
Minimum Processing Time (Msec)    239     465.8   619.2
Mean (Msec)                       300     100.49  149.87
Max. (Msec)                       483     880     1066
CPU Max User (%)                  83.0    99.8    100.00
GPU Max Utilization (%)           0       0       0

(Histogram: count of images vs. milliseconds per image, 200-1000ms, by thread count 10/24/32.)
CONCURRENCY TESTS WITH CAFFE GPU

What effect does concurrency have on the ability to classify images? Can the CPU provide enough work to keep the GPUs busy?

Java Thread Count                 10      24      32
Total Elapsed Time (Seconds)      37.425  38.451  38.153
Minimum Processing Time (Msec)    7       8       10
Mean (Msec)                       39.66   100.49  149.87
Max. (Msec)                       163     251     415
CPU Max User (%)                  56      45      55
GPU Max Utilization (%)           82      79      81

(Histogram: count of images vs. milliseconds per image, 0-400ms, by thread count 10/24/32.)
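The elapsed times translate directly into throughput: over the same 9000 images at 10 threads, Caffe GPU delivers roughly 9000/37.425, about 240 images/second, against roughly 9000/271.65, about 33 images/second, for the CPU-only baseline, a speedup of about 7x. A quick check of that arithmetic:

```java
public class Throughput {
    static double imagesPerSecond(int images, double elapsedSeconds) {
        return images / elapsedSeconds;
    }

    public static void main(String[] args) {
        double cpu = imagesPerSecond(9000, 271.65);  // Caffe CPU, 10 threads
        double gpu = imagesPerSecond(9000, 37.425);  // Caffe GPU, 10 threads
        System.out.printf("CPU: %.1f img/s, GPU: %.1f img/s, speedup: %.1fx%n",
            cpu, gpu, gpu / cpu);
        // prints: CPU: 33.1 img/s, GPU: 240.5 img/s, speedup: 7.3x
    }
}
```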
CONCURRENCY TESTS WITH TENSORRT

How does TensorRT performance differ from Caffe CPU / Caffe GPU?

Java Thread Count                 10      24      32
Total Elapsed Time (Seconds)      33.327  33.260  33.399
Minimum Processing Time (Msec)    5       5       8
Mean (Msec)                       35.01   86.30   116.08
Max (Msec)                        188     258     416
CPU Max User (%)                  47      54      57
GPU Max Utilization (%)           85      83      84

(Histogram: count of images vs. milliseconds per image, 0-400ms, by thread count 10/24/32.)
TENSORRT VS CAFFE

How does performance compare between TensorRT and Caffe?
o TensorRT is generally faster and uses more of the GPU than Caffe.
o The model loaded into TensorRT consumes 33% less GPU memory.

Framework / Thread Count          Caffe GPU, 10 Threads  TensorRT, 10 Threads
Total Elapsed Time (Seconds)      37.425                 33.327
Minimum Processing Time (Msec)    7                      5
Mean (Msec)                       39.66                  35.01
Max (Msec)                        163                    188
CPU Max User (%)                  56                     47
GPU Max Utilization (%)           82                     85
GPU Memory Utilization (MB)       1339                   895

(Histogram: count of images vs. milliseconds per image, 0-150ms, Caffe GPU vs. TensorRT.)
TENSORRT VS CAFFE

How does performance compare between TensorRT and Caffe?
o TensorRT seems to handle concurrency better than Caffe.

Framework / Thread Count          Caffe GPU, 32 Threads  TensorRT, 32 Threads
Total Elapsed Time (Seconds)      38.153                 33.399
Minimum Processing Time (Msec)    10                     8
Mean (Msec)                       39.66                  35.01
Max (Msec)                        149.9                  116
CPU Max User (%)                  58                     64
GPU Max Utilization (%)           81                     84
GPU Memory Utilization (MB)       1339                   895

(Histogram: count of images vs. milliseconds per image, 0-400ms, Caffe GPU vs. TensorRT.)
TENSORRT VS CAFFE

How does performance compare between TensorRT and Caffe?
o Both TensorRT and Caffe GPU are considerably more performant than Caffe CPU.

Framework / Thread Count          Caffe CPU, 10 Threads  TensorRT, 10 Threads
Total Elapsed Time (Seconds)      271.659                33.327
Minimum Processing Time (Msec)    239                    5
Mean (Msec)                       280                    35.01
Max (Msec)                        296                    188
CPU Max User (%)                  83                     47
GPU Max Utilization (%)           0                      85

(Histogram: count of images vs. milliseconds per image, 0-500ms, Caffe GPU vs. TensorRT vs. Caffe CPU.)
POWER UTILIZATION

How does power utilization compare across both TensorRT and Caffe?
o They are relatively the same, with the Tesla P4 consuming 24 watts at near idle
o 45 watts at 80% GPU utilization for the 50W model
o Low GPU memory utilization
o An additional 1.8 kW per rack == increased power consumption of ~10%

Framework                    Caffe                    TensorRT
Thread Count                 10     24     32         10     24     32
Min. Power Use (Watts)       24.26  24.55  25.13      24.36  23.78  24.36
Max. (Watts)                 44.02  46.3   45.88      45.32  45.01  45.36
CPU Max User (%)             56     45     55         47     54     57
GPU Utilization (%)          82     79     81         85     83     84
GPU Max Memory Used (MB)     1339   1339   1341       895    895    895
GPU Max Memory Used (%)      ~16%                     ~10%
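The ~10% figure follows from the rack numbers given earlier in the deck: 40 nodes per rack, one P4 per node at roughly 45W under load, against a 15-16 kW rack budget. A quick check of the arithmetic, taking the upper end of the budget:

```java
public class RackPower {
    public static void main(String[] args) {
        int nodesPerRack = 40;
        double gpuWattsUnderLoad = 45.0;   // P4 at ~80% utilization
        double rackBudgetWatts = 16000.0;  // 15-16 kW rack, upper bound

        double addedWatts = nodesPerRack * gpuWattsUnderLoad;
        double pctIncrease = 100.0 * addedWatts / rackBudgetWatts;
        System.out.printf("Added: %.0f W (~%.0f%% of rack budget)%n",
            addedWatts, pctIncrease);
        // 40 x 45 W = 1800 W, i.e. the "additional 1.8 kW per rack" above
    }
}
```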
WHAT'S NEXT?

There's much more to explore; what are some of the things we should tackle next?
- Develop a better understanding of the drivers of GPU usage
  o What's required to exceed 80% utilization?
  o Investigate relatively constant elapsed time
  o Where is the waste/overhead in this architecture?
- Explore additional use cases and models
  o Complete integration path for the Torch-based malware model
  o Evaluate multiple models in memory: how much GPU memory is needed?
- Scale it out
  o Operational management of GPUs at scale: NVidia Datacenter GPU Manager
THANK YOU

Ken Singer & Jake Gingrich @ HP Enterprise, SGI Federal Systems
Rob Zuppert, Larry Brown and Brad Rees @ NVidia
Felix Abecassis & the GPU REST Engine Team @ NVidia
Edward Raff, Jared Sylvester & other MalConv researchers @ UMD LPS
Sterling Foster & others @ US Department of Defense
Steven Mills, Data Solutions & Machine Intelligence Team @ Booz Allen Hamilton
QUESTIONS?

Find me at:
o LinkedIn: https://www.linkedin.com/in/drewfarris/
o Twitter: @drewfarris
o Web: https://www.boozallen.com/expertise/analytics.html