Effectively Scaling out Deep Learning Frameworks with GPUs
gpgpu10.athoura.com/SRA_scalingout.pdf
Hars Vardhan, on behalf of Dr. Steven Eliuk
Computer Science Innovation Center, Samsung Research America
Feb 5th, 2017
Outline
i. Why Scale?
   i. Motivation
ii. Current Open-Source Offerings
   i. Our sentiments
iii. Bottlenecks
   i. Matrix Multiplication
   ii. Data Loading & Augmentation
iv. Tips and Tricks
v. Real-World Results
   i. Strong & weak scaling, training time
vi. HPC Environment
   i. What makes for an effective environment?
vii. Q & A
Deep Learning @ Samsung:

Deep learning is utilized in an ever-growing number of projects throughout Samsung, from s/w to h/w:

• Speech translation, ASR, machine translation & search
• AIR, object & speech recognition, scene understanding
• Handwriting recognition
• Autonomous vehicles
• Ad targeting

[Diagram: these applications surround the SRA DL Platform, which provides model parallelism and large-scale, distributed & multi-GPU training on shared training data.]
Long Training Times Impact Time to Market
Effect of Experiment Time:

• Minutes, hours: interactive investigation! Instant gratification! Parameter exploration
• 1-4 days: tolerable; interactivity replaced by parallelization of experiments
• 1-4 weeks: high-value experiments only; progress stalls
• >1 month: don't even try

(Keynote at the 2015 NVIDIA GTC Conference, Jeff Dean)
Current Open-Source Offerings:
[Chart: open-source frameworks positioned by capability (single-GPU, multi-GPU, distributed), the landscape in the past]
Current Open-Source Offerings:
[Chart: open-source frameworks positioned by capability (single-GPU, multi-GPU, distributed), the landscape now]
Deep Learning Platform: Framework

Current approaches:
• Do not "scale out" effectively; accuracy can degrade; small models; tractability issues
• Stack: Application (Caffe, Theano, CNTK, TF, MxNet) on LAPACK / FFT / general-purpose math, on CUDA / cuBLAS / cuDNN, on a single node's GPUs

Samsung SRA approach:
• "Scales out" effectively; comprehensive library developed; larger models for frontier research; extended feature set
• Stack: Frameworks (DNN: Kaldi ASR, Caffe AIR, Theano, Torch, etc.) on dMath (LAPACK / FFT / general-purpose math), on CUDA / cuBLAS / cuDNN, spanning Node 1 … Node N with GPU #1 … GPU #16 each
dMath: A distributed math library

• Abstracts away complicated distributed & multi-GPU programming from the high-level machine learning user
• Uses an MPI-based client-server computing paradigm (a minimal sketch follows below)
• Stores data in-device (on the GPUs)
• Customized routines exploit PCIe switching and the underlying h/w
• Full support for the DNN pipeline
• Integrates with existing open-source frameworks
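A minimal sketch of what that client-server dispatch can look like, assuming a toy job descriptor; `JobDesc`, `JobType`, and the commented-out worker routine are invented for illustration and are not dMath's API:

```cpp
// rank 0 acts as the client, broadcasting a small job descriptor (metadata)
// to the worker ranks, which then execute the job on their local GPU blocks.
#include <mpi.h>

enum JobType { GEMM = 0, AXPBY = 1, SHUTDOWN = 2 };

struct JobDesc {          // metadata broadcast to every worker
  int type;               // which distributed routine to run
  int m, n, k;            // global matrix dimensions
};

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  JobDesc job;
  if (rank == 0) job = JobDesc{GEMM, 4096, 4096, 4096};  // client enqueues work

  // One small broadcast carries the metadata; the payload (matrix blocks)
  // already lives in GPU memory on each worker.
  MPI_Bcast(&job, sizeof(job), MPI_BYTE, 0, MPI_COMM_WORLD);

  // if (job.type == GEMM) run_local_gemm(job);  // hypothetical worker dispatch
  MPI_Finalize();
  return 0;
}
```

The point is that only the small metadata travels at dispatch time, which is why the per-job overhead stays in the tens of microseconds discussed later.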
Data Layout

[Figure: four examples, (a)-(d), of distributing the same matrix across four workers; the numbers represent the worker that stores each block]
• Matrix data is divided into multiple blocks and distributed across the workers
• Block size and location are important for efficiency
• Most algorithms require consistent block sizes; some require a consistent layout on the workers
• Arbitrary layouts are supported in dMath (two common owner mappings are sketched below)
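To make the owner-mapping idea concrete, here is a small sketch of how a block's owner can be derived from its indices under two typical layouts (column strips and a 2D block-cyclic grid); the function names are ours, and dMath's arbitrary-layout support would instead consult an explicit owner map:

```cpp
#include <cstdio>

// Column strips: worker w owns every block in block-column w.
int owner_column_strips(int block_row, int block_col) {
  (void)block_row;
  return block_col;
}

// 2D block-cyclic over a pr x pc worker grid (the classic ScaLAPACK layout).
int owner_block_cyclic(int block_row, int block_col, int pr, int pc) {
  return (block_row % pr) * pc + (block_col % pc);
}

int main() {
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      printf("block (%d,%d) -> worker %d\n", i, j,
             owner_block_cyclic(i, j, 2, 2));  // 4 workers in a 2x2 grid
  return 0;
}
```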
Data Layout and Algorithm Efficiency

[Figure: five distributed layouts for the same block-matrix operation, C = A * B (+ C), ranked by efficiency]
• Inefficient: off-diagonal blocks require communication between GPUs
• Efficient: no communication between GPUs
• Efficient: no communication between GPUs
• Less efficient, due to memory access patterns
• Very inefficient: requires temporary buffers and communication between GPUs
Matrix Multiplication

• One of the most important algorithms in DL
• Probably the most difficult to optimize
• Most implementations are built on cuBLAS routines for the individual blocks
• At the sizes typical of deep learning applications, communication between GPUs and servers is the bottleneck
• Multiple implementations are needed for different circumstances
• We have over 15 different distributed GEMM implementations, used for various matrix dimensions and h/w configurations (one simple variant is sketched below)
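As an illustration of the simplest kind of variant, here is a sketch of a no-communication distributed GEMM, assuming A is split into row blocks across workers and B is replicated on every worker (one of the efficient layouts from the previous slide); this is our sketch, not dMath's implementation:

```cpp
#include <cublas_v2.h>

// Each worker computes its own row block of C with one local cuBLAS call.
void local_row_block_gemm(cublasHandle_t handle,
                          int local_m, int n, int k,
                          const float* dA_block,  // local_m x k row block of A
                          const float* dB,        // full k x n B, replicated
                          float* dC_block) {      // local_m x n row block of C
  const float alpha = 1.0f, beta = 0.0f;
  // cuBLAS is column-major; treating our row-major blocks as transposed
  // column-major matrices computes C^T = B^T * A^T, i.e. C = A * B.
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              n, local_m, k,
              &alpha, dB, n, dA_block, k,
              &beta, dC_block, n);
}
```

The row-major/column-major swap is the standard cuBLAS trick; the interesting engineering is in the many other variants where blocks of A, B, or C must move between GPUs.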
Matrix Multiplication Scaling

• Smaller blocks don't scale, due to communication overhead and cuBLAS GEMM efficiency
• As matrices get larger, communication becomes less of an issue relative to computation, and scaling is less difficult
• Shown below is the relative performance for the same multiplication, distributed over different numbers of GPUs within our framework
[Chart: relative "badness" (scale 0-15) of the same multiplication on 1, 2, 4, and 8 GPUs]
Matrix Multiplication Performance

• Shown below are runtimes for square matrix multiplications of various sizes
• dMath performance on a variety of GPUs is compared with a cuBLAS 7.5 baseline and cuBLAS-XT
* Tesla K80 result
Replication Motivation

• It is often desirable for workers to have identical copies of a matrix (abundant memory and rarely changing data)
• Useful for the parameters in a network
• Equivalent to the distribution of parameters in a parameter server
• Caching (using replication) of a distributed weight matrix: the forward pass caches it, and the backward pass then requires no communication
Replication Motivation (AlexNet Example)

[Figure: the AlexNet network, ending in three fully connected layers]
• The matrix multiplications in the three fully connected layers shown above require the weight matrix to be shared between the workers in the forward pass
• We cache this matrix, allowing the backward data gradient to be computed without any communication between GPUs
Replication Implementation

• Each worker stores an entire copy of the matrix, in addition to the blocks it is responsible for
• When a matrix is changed, the workers automatically redistribute it
• Replication can be temporarily suspended while multiple updates are made to a matrix, then resumed (see the sketch below)
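A hypothetical sketch of that suspend/resume pattern; the class and method names are invented for illustration and are not dMath's interface:

```cpp
// While suspended, updates touch only the owned blocks; a single
// redistribution happens on resume instead of one per update.
class ReplicatedMatrix {
 public:
  void suspend_replication() { suspended_ = true; }
  void resume_replication() {
    suspended_ = false;
    redistribute();                    // one broadcast for the whole batch
  }
  void update_owned_blocks(/* ... */) {
    // ... apply the update to locally owned blocks ...
    if (!suspended_) redistribute();   // keep replicas consistent
  }

 private:
  void redistribute() { /* broadcast owned blocks to all workers */ }
  bool suspended_ = false;
};

// Usage: batch several parameter updates, paying the broadcast cost once.
// m.suspend_replication();
// m.update_owned_blocks();  // momentum scaling
// m.update_owned_blocks();  // gradient addition
// m.resume_replication();
```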
Asynchronous Replication

• In SGD training, parameters are updated all at once and then replicated
• The updated parameters are often not needed until much deeper into the network
• Replication is frequently followed by forward convolution layers (high computation, zero communication)
• Replace synchronous replication with asynchronous replication, overlapping parameter redistribution with computation (sketched below)
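A minimal sketch of the overlap, assuming MPI-3's non-blocking MPI_Ibcast is available; the commented-out layer functions are placeholders:

```cpp
#include <mpi.h>

// Start redistributing the freshly updated parameters, run the forward
// convolution layers (which don't need them yet), then wait before the
// first layer that does.
void forward_with_async_replication(float* fc_weights, int count, int root) {
  MPI_Request req;
  // Kick off the parameter broadcast without blocking.
  MPI_Ibcast(fc_weights, count, MPI_FLOAT, root, MPI_COMM_WORLD, &req);

  // run_forward_conv_layers();   // high computation, zero communication

  // Block only when the fully connected layers actually need the weights.
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  // run_forward_fc_layers(fc_weights);
}
```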
Tips and Tricks

• Any serial portion of the program will hurt scaling (Amdahl's law)
• Dispatching jobs to the workers has a small overhead (dozens of microseconds)
• Additional time is spent broadcasting the metadata and setting up the data structures required to perform a distributed job
• For large jobs this overhead is not a problem
• Scaling to more GPUs without increasing the batch size (strong scaling) presents significant issues (small convolutions in GoogLeNet)
• This can be alleviated by avoiding lazy solutions:

  Before:
    history_data->scale(momentum);
    history_data->addMatrix(local_rate, *param_diff);

  After:
    history_data->addMatrix(momentum, *history_data, local_rate, *param_diff);
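Written out element-wise, the fused call dispatches one job and makes one pass over the history buffer instead of two; a plain sketch of the math it performs (on the GPU, as a single job):

```cpp
// history = momentum * history + local_rate * param_diff, in one pass.
void fused_axpby(float momentum, float local_rate,
                 const float* param_diff, float* history, int n) {
  for (int i = 0; i < n; ++i)
    history[i] = momentum * history[i] + local_rate * param_diff[i];
}
```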
Tips and Tricks

• Combine common operations together into a single job
• Backward convolutions involve computing the data, filter, and bias gradients; these can be done all at once
• This allows multiple CUDA streams to be used
• Overlapping the filter and bias gradient reductions with the data gradient computation helps optimize performance
• Avoid multiple MPI calls when possible; instead, copy the data into a single buffer and make a single call (e.g., the filter and bias gradient reductions; see the sketch after this list)
• Implement batched versions of jobs (e.g., the matrix addition used in the parameter update)
• For the Nvidia Tesla K80, we have found much better performance by locking the clocks at 758 MHz and disabling auto-boost, e.g., ~5-10% for multi-GPU jobs
• Registering memory with the IB driver is costly; try to reuse your buffers to prevent this costly registration, i.e., use a memory manager of some sort
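For the MPI-coalescing point above, a minimal sketch, assuming the filter and bias gradients live in separate host buffers; names and sizes are illustrative:

```cpp
#include <mpi.h>
#include <cstring>
#include <vector>

// Pack both gradients into one contiguous buffer, make a single
// MPI_Allreduce call, then unpack. One call amortizes the per-message
// latency over both gradients.
void reduce_filter_and_bias(float* filter_grad, int filter_n,
                            float* bias_grad, int bias_n) {
  std::vector<float> packed(filter_n + bias_n);
  std::memcpy(packed.data(), filter_grad, filter_n * sizeof(float));
  std::memcpy(packed.data() + filter_n, bias_grad, bias_n * sizeof(float));

  MPI_Allreduce(MPI_IN_PLACE, packed.data(), filter_n + bias_n,
                MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

  std::memcpy(filter_grad, packed.data(), filter_n * sizeof(float));
  std::memcpy(bias_grad, packed.data() + filter_n, bias_n * sizeof(float));
}
```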
Real World: soumith/convnet-benchmarks -- 128 Batch, i.e. Strong Scaling
• Expresso: 121.8 ms average on one Titan
• Expresso: 64.5 ms average on two Titans
• Expresso: 48.6 ms average on four Titans
• Expresso: 39.1 ms average on eight Titans
Real World: Training AlexNet -- 256 Batch, i.e. Strong Scaling
[Chart: AlexNet training throughput in FPS (0-2500) for Expresso, CNTK, Caffe, and Nvidia-Caffe (0.14) at 1, 4, 8, 16, 32, and 64 GPUs]
Real World: Inception v3, 32 Batch per GPU
[Chart: Inception v3 training throughput in FPS (0-1400) for Expresso vs. TensorFlow at 2, 4, 8, 16, 32, and 64 GPUs]
Real World: Training Time
AlexNet, 1024 batch:
• Caffe -- 8 K80s -- 17h 20min
• NV-Caffe 0.14 -- 8 K80s -- 12h 15min
• Expresso 0.8 -- 8 K80s -- 05h 15min

GoogLeNet v1, 1024 batch:
• Caffe -- 8 K80s -- 03d 20hrs
• Expresso -- 8 K80s -- 02d 12hrs
• Expresso -- 32 K80s -- 01d 06hrs

Scaling the batch from a single GPU to 64 GPUs results in the same accuracy, e.g., AlexNet 256: 58.59 ±0.5% top-one.

Expresso results are for our version of hybrid parallelism [Krizhevsky14].
Real World: Training Time – continued…
GoogLeNet V3 (aka Inception v3), Tesla K40 GPUs:
• TensorFlow -- 16, 32, 64 GPUs -- 507h 58min
• TensorFlow Internal -- 100 GPUs -- 65h 00min
• Expresso v0.8 -- 64 GPUs -- 48h 22min

Tesla M40 GPUs:
• Expresso v0.8 -- 96 GPUs -- 13h 45min

ImageNet 2016 Scene Classification competition:
- Third place in the single-model category
- 7th place in the ensemble category
Data Loading & Augmentation
• Data loading quickly becomes a bottleneck when scaling to multiple GPUs
• It is not practical to load from a single process; queues, etc. don't work well
• An automatically tuned CPU/GPU multi-threaded data loader runs on each worker
• Periodically (on ~0.1% of iterations), a check is run to determine whether the current policy is adequate… remembering multi-tenancy (a minimal sketch of such a check follows below)
• No policy is good for all models, as the proportion of work differs from model to model
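A hypothetical sketch of such a periodic policy check; the structure, counters, and thresholds here are invented for illustration and are not the platform's actual tuner:

```cpp
#include <algorithm>

// On a small fraction of iterations, compare loader throughput against the
// model's consumption rate and adjust the worker thread count.
struct LoaderPolicy {
  int threads = 2;
  long long iteration = 0;

  void maybe_retune(double load_ms_per_batch, double compute_ms_per_batch,
                    int max_threads) {
    if (++iteration % 1000 != 0) return;   // check on ~0.1% of iterations
    if (load_ms_per_batch > compute_ms_per_batch)            // loader is the bottleneck
      threads = std::min(threads + 1, max_threads);
    else if (load_ms_per_batch < 0.5 * compute_ms_per_batch) // over-provisioned
      threads = std::max(threads - 1, 1);  // give cores back (multi-tenancy)
  }
};
```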
Data Loading & Augmentation
• At runtime, the master process samples performance for different combinations of thread counts and transfer indices
• This allows the system to accelerate data augmentation when it is the bottleneck

Memory Management:
• Page-locked host allocations and device allocations cause implicit synchronization and are expensive
• Host & device buffers are allocated by each thread on start-up and are only resized when a sample exceeds the current buffer size; the memory footprint quickly stabilizes (see the sketch below)
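A minimal sketch of that grow-only buffer strategy, assuming one host/device buffer pair per loader thread; this is illustrative, not the platform's memory manager:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Reallocate only when a sample exceeds the current capacity, so the
// expensive cudaMallocHost/cudaMalloc calls stop occurring once the
// footprint stabilizes.
struct ReusableBuffers {
  void* host = nullptr;
  void* device = nullptr;
  size_t capacity = 0;

  void ensure(size_t bytes) {
    if (bytes <= capacity) return;   // common case: no allocation at all
    release();
    cudaMallocHost(&host, bytes);    // page-locked, enables async copies
    cudaMalloc(&device, bytes);
    capacity = bytes;
  }
  void release() {
    if (host) cudaFreeHost(host);
    if (device) cudaFree(device);
    host = device = nullptr;
    capacity = 0;
  }
  ~ReusableBuffers() { release(); }
};
```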
Data Loading & Augmentation
[Chart: AlexNet performance on 8 M40s with single-threaded, multi-threaded, and auto-tuned data loading]
Low-Precision Training & Inference

Motivations:
• Lower data storage and communication costs
• Better performance with newer GPUs (Pascal)
• Proven to work well for inference: fast, accurate

Challenges:
• Instability in certain layers during training
• Only partial support in the software stack (yet)

Addressing accuracy loss:
• Perform certain operations in higher precision: softmax, inner product, weight updates (a sketch of the master-copy pattern follows below)
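A minimal sketch of the standard master-copy remedy this alludes to: keep the working weights in low precision, but accumulate updates into an fp32 master copy so tiny gradient steps are not rounded away. fp16 is simulated here by truncating mantissa bits, just to keep the sketch self-contained; real code would use __half on the GPU:

```cpp
#include <cstdio>
#include <cstring>
#include <cstdint>

float truncate_mantissa(float x) {       // crude stand-in for fp16 rounding
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFFE000u;                   // keep 10 mantissa bits, like fp16
  std::memcpy(&x, &bits, sizeof(bits));
  return x;
}

int main() {
  float master = 1.0f;                   // fp32 master weight
  float naive = 1.0f;                    // low-precision-only weight
  for (int step = 0; step < 1000; ++step) {
    float update = 1e-4f;                // small per-step gradient * lr
    master += update;                    // accumulates correctly
    naive = truncate_mantissa(naive + update);  // update rounds away to zero
  }
  printf("master: %f  low-precision only: %f\n", master, naive);
  return 0;
}
```

Near 1.0, the fp16-like ulp is about 0.001, so a 1e-4 step vanishes entirely without the master copy; the same effect is why weight updates are singled out above for higher precision.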
H/W: Samsung Advanced Learning
Conclusion

- dMath provides effective scaling to DL frameworks: Caffe, Kaldi
- DNN pipeline with data loader and an array of features
- Can also be used in other applications like medical imaging, finite element analysis, etc.

Future Work

- Half-precision training
- Layer implementations for RNNs, etc.
Acknowledgments
….
Q & A
?