GPU INFERENCE IN THE DATACENTER
Drew Farris, Chief Technologist, Booz Allen Hamilton
NVIDIA GPU Technology Conference, Washington DC, November 2017


TRANSCRIPT

Page 1

GPU INFERENCE IN THE DATACENTER
Drew Farris, Chief Technologist, Booz Allen Hamilton
NVIDIA GPU Technology Conference, Washington DC

NOVEMBER 2017
Eglin AFB, FL

BOOZ ALLEN HAMILTON

Page 2

"MICROPROCESSORS NO LONGER SCALE AT THE LEVEL OF PERFORMANCE THEY USED TO. THAT IS THE END OF WHAT YOU WOULD CALL MOORE'S LAW. SEMICONDUCTOR PHYSICS PREVENTS US FROM TAKING DENNARD SCALING ANY FURTHER."
- Jen-Hsun Huang, CEO, NVIDIA


Page 3

THE DAYS OF EASY PERFORMANCE GAINS ARE GONE
We need alternatives to general-purpose CPU computation.

GRAPHICS PROCESSING UNITS PROVIDE AN ALTERNATIVE
- Algebraic strengths of the GPU
- Enable new algorithms
- Adaptable to a variety of tasks
- Readily available libraries (CUDA, CV/DL frameworks)

HOW CAN WE LEVERAGE OUR EXISTING INVESTMENTS?
- HPC vs. commodity hardware in the datacenter
- Evolution vs. revolution
- Scaling out vs. scaling up

INTRODUCTION


Page 4

WE HAVE A PROBLEM
How do we apply complex algorithms as part of our ingest process?
- Computationally expensive algorithms
- Heterogeneous dataflow
- Horizontal/linear scalability at the datacenter level

How do we accommodate this within our existing compute fabric?
- Hadoop clusters: HDFS, YARN, etc.
- Commodity nodes
- Small numbers of special-purpose nodes
- 10G interconnects
- Cost, power, space, and cooling

NO SUPERCOMPUTERS, NO MODEL TRAINING
- Power-efficient GPUs (50-75 W)
- We could focus on model inference
o Application of models to new data, e.g., classification

THE REALITY


Page 5

DATACENTER ARCHITECTURE


SINGLE NODE
- 128 GB RAM
- 12x drives
- 24 cores / 48 hyperthreads
- PCI Express slot?

SINGLE RACK
- 40 nodes
- 10G ToR switch
- 15-16 kW

DATACENTER
- Many racks
- Interconnect
- Rows?

Page 6

DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS
As part of the data ingest pipeline, this system must extract and analyze data in a wide variety of formats and perform normalization to prepare it for indexing.
- Unpacking, uncompressing archives
o Zip, Tar, etc.
- Converting binary formats to text
o Word, PDF
- Extracting metadata
o EXIF data from images
- Classifying images / segmenting images / detecting objects
o Search images using text
- Optical character recognition
o Extract text from scanned documents
- Detecting malware
o Executables, PDFs, RTF

The heterogeneous nature of this data was a problem: complex data and expensive analysis would disrupt latency across all datatypes.
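As a rough illustration of the dispatch step this pipeline implies, the sketch below routes a document by detected type and recurses into archives. Only java.util.zip is standard library; the type names and helper methods are hypothetical placeholders, not the project's actual code.

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class ExtractDispatcher {

        // Route a document to the appropriate extractor based on its detected type.
        public void process(String detectedType, byte[] payload) throws IOException {
            switch (detectedType) {
                case "application/zip":
                    unpackZip(payload);       // recurse into archive members
                    break;
                case "application/pdf":
                    extractText(payload);     // hypothetical binary-to-text step
                    break;
                case "image/jpeg":
                    extractExif(payload);     // hypothetical metadata step
                    break;
                default:
                    indexRaw(payload);        // pass through to indexing
            }
        }

        // Unpack a zip archive and re-dispatch each member document.
        private void unpackZip(byte[] bytes) throws IOException {
            try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(bytes))) {
                for (ZipEntry e; (e = zin.getNextEntry()) != null; ) {
                    // readAllBytes on a ZipInputStream stops at the end of the current entry.
                    process(detectType(e.getName()), zin.readAllBytes());
                }
            }
        }

        // Hypothetical stand-ins for the other stages named on this slide.
        private String detectType(String name) {
            return name.endsWith(".pdf") ? "application/pdf" : "application/octet-stream";
        }
        private void extractText(byte[] b) { /* Word/PDF to text */ }
        private void extractExif(byte[] b) { /* EXIF metadata */ }
        private void indexRaw(byte[] b)    { /* send to the indexer */ }
    }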

DATA EXTRACTION PIPELINE


Page 7

DATA IDENTIFICATION, TRANSFORMATION, ANALYSIS
Some of these tasks are straightforward to accelerate using GPUs. So we decided to start with the following:
- Unpacking, uncompressing archives
o Zip, Tar, etc.
- Converting binary formats to text
o Word, PDF
- Extracting metadata
o EXIF data from images
- Classifying images / segmenting images / detecting objects
o Search images using text
- Optical character recognition
o Extract text from scanned documents
- Detecting malware
o Executables, PDFs, RTF

GPU ACCELERATED DATA EXTRACTION PIPELINE


Page 8

CPU, MEMORY AND THROUGHPUT
To scale linearly as we add more resources, our system must have the following characteristics (a worker-pool sketch follows the list):

- Shared-nothing vs. share-little
- Stateless vs. minimally stateful
- CPU-bound (no disk I/O, network I/O, bus, or memory bottlenecks)
- Uniformly fast: individual document processing in 10-100 ms
- RAM-frugal: no large models in memory
- Something that plays well with Java
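A minimal sketch of what a shared-nothing, CPU-bound worker pool looks like in Java; fetchBatch and processDocument are assumed stand-ins for the real ingest hooks, not part of the original system.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class StatelessWorkerPool {
        public static void main(String[] args) throws InterruptedException {
            // One thread per core keeps the workload CPU-bound with no shared state.
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            List<byte[]> documents = fetchBatch(); // assumption: documents arrive in batches
            for (byte[] doc : documents) {
                pool.submit(() -> processDocument(doc)); // each task touches only its own document
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }

        private static List<byte[]> fetchBatch() { return List.of(); }
        private static void processDocument(byte[] doc) { /* 10-100 ms of pure CPU work */ }
    }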

DATA EXTRACTION REQUIREMENTS


Page 9

What plays well with Java? Or: how the heck do we get it to talk to the CUDA libraries?
- Pure Java
o No GPU acceleration

- Java Native Interface (or some derivative; sketched below)
o Hand-wrapped API calls (JNI, JNA)
o JavaCPP (Java and native C++ bridge)
o Deeplearning4j cuDNN integration (as of 0.9.1)
o TensorFlow Java API

- External processes
o Forked executable
o Shared memory
o Sockets (TCP, UDP, raw, etc.)
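For the hand-wrapped JNI option, the Java side reduces to a library load plus a native method declaration; the library and method names here are hypothetical, and the native half would be C++ that calls into the CUDA libraries.

    public class NativeClassifier {
        static {
            // Assumes a hypothetical libclassifier_jni.so is on java.library.path;
            // its implementation would bridge to CUDA/Caffe in C++.
            System.loadLibrary("classifier_jni");
        }

        // Hypothetical native entry point: image bytes in, per-class scores out.
        public native float[] classify(byte[] imageBytes);
    }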

INTEGRATION OPTIONS


[Diagram: a single node. Inside the Java VM, worker threads and the JVM heap sit alongside the node's CPU, memory, and local storage. Candidate native integration points are shown: a JNI library, a wrapped library, a Java library, and a forked executable.]

Page 10

[Diagram: the same single node with a GPU and GPU memory added. Each integration point is paired with a candidate GPU library: JNI lib? TensorRT; wrapped lib? CUDA libraries; forked exe? OpenCV; Java lib? Caffe.]

What do we want to be able to do?
- Multiple library or framework support (a framework-agnostic sketch follows)
o Caffe / TensorFlow / Torch / others
o CUDA-accelerated OpenCV
o TensorRT
o Other CUDA libraries
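One way to express that goal in code is a small backend abstraction the ingest pipeline programs against; this interface is a sketch of the idea, not the project's actual API.

    // Hypothetical abstraction over inference backends (Caffe, TensorRT, etc.).
    public interface InferenceBackend extends AutoCloseable {
        // Load a model once per process; implementations own any native resources.
        void loadModel(String modelPath);

        // Classify one image and return per-class scores.
        float[] classify(byte[] imageBytes);

        @Override
        void close(); // release GPU memory and native handles
    }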

NOTIONAL INTEGRATION


Page 11

So, what components make up the solution?

SOLUTION


Page 12

"ULTRA-EFFICIENT DEEP LEARNING IN SCALE-OUT SERVERS"
- NVIDIA Pascal architecture
- 5.5 TeraFLOPS single-precision performance
- 22 Tera-operations per second INT8 performance
- 8 GB GPU memory
- 192 GB/s GPU memory bandwidth
- Low-profile PCI Express
- 50 W / 75 W max power

- http://www.nvidia.com/object/accelerate-inference.html

NVIDIA TESLA P4 INFERENCE ACCELERATOR


Page 13

IMAGE CLASSIFICATION WITH ALEXNET USING CAFFE
We used CaffeNet, a pre-trained AlexNet model based on the ILSVRC 2012 dataset.

o A good stand-in for more complex image models
o Evaluated both CPU-only and GPU variants of Caffe to characterize the performance difference
o One image per batch
o Lightly modified to properly handle multithreading and CUDA streams

CAFFE


Page 14

CUDA-ACCELERATED COMPUTER VISION LIBRARY
Images were resized using GPU resources instead of CPU resources; as a result, it is not necessary to copy the resized image data to the input layer.
- AlexNet input layer size is 224 px x 224 px
- Produces a GpuMat object for image data allocated from GPU memory
- The GpuMat is wrapped for use as the network's input layer, avoiding the need for an extra copy
- Custom GpuMat allocators were introduced in OpenCV 3.2.0

OPEN CV


Page 15

HIGH-PERFORMANCE DEEP LEARNING INFERENCE OPTIMIZER
TensorRT can load Caffe or TensorFlow models and optimize them for inference performance. In this case, we used it to host the same Caffe model used for the image classification task.
- FP32 to INT8 quantization while minimizing accuracy loss
- Better GPU utilization
- Kernel autotuning
- Improved memory footprint
- Multi-stream execution

- Used the unchanged Caffe model for image classification

NVIDIA TENSORRT


Page 16

MALCONV: MALWARE DETECTION WITH DEEP LEARNING
A convolutional neural network that digests entire binaries for malware identification
- A custom malware identification model
- The current ingest framework leverages an un-accelerated predecessor
o Can't use MalConv because it's too computationally intense

- Integration with PyTorch will require some work
o No great inference layer available for PyTorch
o Model translation with ONNX to Caffe2 (or other?)
o Currently a work in progress

- How do the ergonomics differ from the image classification task?

PYTORCH


Page 17

DEEP LEARNING INFERENCE VIA REST
The GRE provided memory and process isolation, plus native libraries for hardware access (a Java client sketch follows the list):
- Multi-threaded HTTP server in Golang
- RESTful interface
- Multithreaded Caffe
- TensorRT, NVIDIA's inference engine
- CUDA-accelerated OpenCV
- Containerized in Docker
- Framework for other inference engines

- https://developer.nvidia.com/gre
- https://github.com/NVIDIA/gpu-rest-engine
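From the Java ETL side, calling the GRE reduces to an HTTP POST of raw image bytes. A minimal sketch using the JDK's built-in HttpClient; the port and /api/classify path follow the example in the gpu-rest-engine README, so verify them against your deployment.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class GreClient {
        public static void main(String[] args) throws Exception {
            byte[] image = Files.readAllBytes(Path.of("cat.jpg"));

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    // Endpoint taken from the gpu-rest-engine README example; adjust as needed.
                    .uri(URI.create("http://localhost:8000/api/classify"))
                    .POST(HttpRequest.BodyPublishers.ofByteArray(image))
                    .build();

            // The GRE returns classification results in the response body.
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }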

NVIDIA GPU REST ENGINE


Page 18

SIMPLIFIED PACKAGING AND DEPLOYMENT VIA CONTAINERS
Packaging is performed in one environment and rapidly deployed to a large number of nodes.
- Docker image building/testing on Amazon Elastic Compute Cloud
- Test environment on an isolated network
- Install Docker, CUDA libraries/drivers, and nvidia-docker, and go
- Portability across CentOS 7 nodes
- Supported laptop development of new analytics
- Worked out of the box for Caffe models in Caffe / TensorRT

- https://github.com/NVIDIA/nvidia-docker

NVIDIA DOCKER


Page 19

We collected telemetry during evaluation with a suite of components we use for tracking system performance on production systems (a polling sketch follows the list):
- CollectD / StatsD API with various plugins for CPU, disk, memory, and I/O
- nvidia-smi for GPU information
- Timely for metric storage and analysis
- Grafana for visualization, dashboarding, and analysis

- NVIDIA Data Center GPU Manager
o Active health monitoring, early fault detection (SMART for GPUs)
o Power management
o Configuration and reporting
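As an illustration of the nvidia-smi side of this, a Java process can poll GPU utilization, memory, and power with nvidia-smi's standard query flags; the polling wrapper itself is a sketch, not the production collector.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class GpuTelemetry {
        public static void main(String[] args) throws IOException {
            // Query utilization, memory, and power draw as headerless CSV.
            Process p = new ProcessBuilder(
                    "nvidia-smi",
                    "--query-gpu=utilization.gpu,memory.used,power.draw",
                    "--format=csv,noheader,nounits").start();

            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                // One line per GPU, e.g. "82, 1339, 44.02"; forward to StatsD or similar.
                r.lines().forEach(System.out::println);
            }
        }
    }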

INSTRUMENTATION


Page 20

[Diagram: the final single-node deployment. The Java VM's worker threads make HTTP calls to the GPU REST Engine, which runs in an NVIDIA Docker container and drives TensorRT, the Caffe library, and the OpenCV library on a Tesla P4 GPU with its own memory.]

FINAL INTEGRATION
- REST calls from Java to the GRE
- Golang coordinator
o Copy the image to a GpuMat
o Resize the GpuMat in OpenCV
o The resized GpuMat becomes the input layer
o Calls to the framework for inference
o Caffe reference model hosted in Caffe or TensorRT

Page 21

What did we evaluate and observe?

EXPERIMENTS AND RESULTS


Page 22

What effect does concurrency have on the ability to classify images? How quickly can we classify images using only the CPU?

We processed 9,000 images through the ETL framework, the GRE, and CPU-only Caffe.
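For reference, per-image latency statistics like those in the tables that follow can be gathered with simple wall-clock timing around each request; this harness is a sketch, not the project's actual benchmark code.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class LatencyStats {
        private final List<Long> samplesMs = Collections.synchronizedList(new ArrayList<>());

        // Time one classification call and record its latency in milliseconds.
        public void record(Runnable classifyCall) {
            long start = System.nanoTime();
            classifyCall.run();
            samplesMs.add((System.nanoTime() - start) / 1_000_000);
        }

        // Print min / mean / max, matching the columns reported in the tables.
        public void report() {
            long min = Collections.min(samplesMs);
            long max = Collections.max(samplesMs);
            double mean = samplesMs.stream().mapToLong(Long::longValue).average().orElse(0);
            System.out.printf("min=%d ms mean=%.2f ms max=%d ms%n", min, mean, max);
        }
    }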

BASELINE CONCURRENCY TESTS WITH CAFFE CPU


Java Thread Count                 10       24       32
Total Elapsed Time (s)            271.65   175.59   416
Minimum Processing Time (ms)      239      465.8    619.2
Mean (ms)                         300      100.49   149.87
Max (ms)                          483      880      1066
CPU Max User (%)                  83.0     99.8     100.0
GPU Max Utilization (%)           0        0        0

[Histogram: count of images vs. milliseconds per image (roughly 200-1000 ms), one series per thread count: 10, 24, 32.]

Page 23

What effect does concurrency have on the ability to classify images? Can the CPU provide enough work to keep the GPUs busy?

CONCURRENCY TESTS WITH CAFFE GPU


Java Thread Count                 10       24       32
Total Elapsed Time (s)            37.425   38.451   38.153
Minimum Processing Time (ms)      7        8        10
Mean (ms)                         39.66    100.49   149.87
Max (ms)                          163      251      415
CPU Max User (%)                  56       45       55
GPU Max Utilization (%)           82       79       81

[Histogram: count of images vs. milliseconds per image (0-400 ms), one series per thread count: 10, 24, 32.]

Page 24

How does TensorRT performance differ from Caffe CPU / Caffe GPU?

CONCURRENCY TESTS WITH TENSORRT


Java Thread Count                 10       24       32
Total Elapsed Time (s)            33.327   33.260   33.399
Minimum Processing Time (ms)      5        5        8
Mean (ms)                         35.01    86.30    116.08
Max (ms)                          188      258      416
CPU Max User (%)                  47       54       57
GPU Max Utilization (%)           85       83       84

[Histogram: count of images vs. milliseconds per image (0-400 ms), one series per thread count: 10, 24, 32.]

Page 25

How does performance compare between TensorRT and Caffe?
o TensorRT is generally faster and uses more of the GPU than Caffe.
o The model loaded into TensorRT consumes 33% less GPU memory.

TENSORRT VS CAFFE


Framework / Thread Count          Caffe GPU, 10 threads   TensorRT, 10 threads
Total Elapsed Time (s)            37.425                  33.327
Minimum Processing Time (ms)      7                       5
Mean (ms)                         39.66                   35.01
Max (ms)                          163                     188
CPU Max User (%)                  56                      47
GPU Max Utilization (%)           82                      85
GPU Memory Utilization (MB)       1339                    895

[Histogram: count of images vs. milliseconds per image (0-150 ms), one series per framework: Caffe GPU, TensorRT.]

Page 26

How does performance compare between TensorRT and Caffe?
o TensorRT seems to handle concurrency better than Caffe.

TENSORRT VS CAFFE


Framework / Thread Count          Caffe GPU, 32 threads   TensorRT, 32 threads
Total Elapsed Time (s)            38.153                  33.399
Minimum Processing Time (ms)      10                      8
Mean (ms)                         39.66                   35.01
Max (ms)                          149.9                   116
CPU Max User (%)                  58                      64
GPU Max Utilization (%)           81                      84
GPU Memory Utilization (MB)       1339                    895

[Histogram: count of images vs. milliseconds per image (0-400 ms), one series per framework: Caffe GPU, TensorRT.]

Page 27

How does performance compare between TensorRT and Caffe?
o Both TensorRT and Caffe GPU are considerably more performant than Caffe CPU.

TENSORRT VS CAFFE


Framework / Thread Count          Caffe CPU, 10 threads   TensorRT, 10 threads
Total Elapsed Time (s)            271.659                 33.327
Minimum Processing Time (ms)      239                     5
Mean (ms)                         280                     35.01
Max (ms)                          296                     188
CPU Max User (%)                  83                      47
GPU Max Utilization (%)           0                       85

[Histogram: count of images vs. milliseconds per image (0-500 ms), one series per framework: Caffe GPU, TensorRT, Caffe CPU.]

Page 28

How does power utilization compare across both TensorRT and Caffe?
o They are roughly the same, with the Tesla P4 consuming 24 watts at near idle
o 45 watts at 80% GPU utilization for the 50 W model
o Low GPU memory utilization
o An additional 1.8 kW per rack (40 nodes x ~45 W) against a 15-16 kW rack is an increase in power consumption of roughly 10%

POWER UTILIZATION


Framework                         Caffe                      TensorRT
Thread Count                      10      24      32         10      24      32
Min. Power Use (W)                24.26   24.55   25.13      24.36   23.78   24.36
Max. Power Use (W)                44.02   46.3    45.88      45.32   45.01   45.36
CPU Max User (%)                  56      45      55         47      54      57
GPU Utilization (%)               82      79      81         85      83      84
GPU Max Memory Used (MB)          1339    1339    1341       895     895     895
GPU Max Memory Used (%)           ~16%                       ~10%

Page 29

There's much more to explore. What are some of the things we should tackle next?
- Develop a better understanding of the drivers of GPU usage
o What's required to exceed 80% utilization?
o Investigate the relatively constant elapsed time
o Where is the waste/overhead in this architecture?

- Explore additional use cases and models
o Complete the integration path for the Torch-based malware model
o Evaluate multiple models in memory: how much GPU memory is needed?

- Scale it out
o Operational management of GPUs at scale: NVIDIA Data Center GPU Manager

WHAT’S NEXT?


Page 30

Ken Singer & Jake Gingrich @ HP Enterprise, SGI Federal Systems

Rob Zuppert, Larry Brown and Brad Rees @ NVidia

Felix Abecassis & the GPU REST Engine Team @ NVidia

Edward Raff, Jared Sylvester & other MalConv Researchers @ UMD LPS

Sterling Foster & others @ US Department of Defense

Steven Mills, Data Solutions & Machine Intelligence Team @ Booz Allen Hamilton

THANK YOU


Page 31

Find me at:
o LinkedIn: https://www.linkedin.com/in/drewfarris/
o Twitter: @drewfarris
o Web: https://www.boozallen.com/expertise/analytics.html

QUESTIONS?
