
1

Using FPGAs to Accelerate Neural Network Inference
1st FPL Workshop on Reconfigurable Computing for Deep Learning (RC4DL)
8 September 2017, Ghent, Belgium

Associate Professor Magnus Jahre
Department of Computer Science
Norwegian University of Science and Technology

Many of the contributions discussed in this presentation have been developed in close collaboration with Xilinx Research Labs.

2

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion


4

Properties of Acceleration-friendly Workloads

• Lots of compute and parallelism
• High arithmetic intensity
• Few or predictable branches
• Regular memory accesses

[Figure: roofline model of a platform, plotting performance (operations/second) against arithmetic intensity (operations/byte), with applications A and B placed relative to the compute-bound region]

Williams et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009
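The roofline bound itself is a one-line formula; a minimal sketch follows (the peak and bandwidth figures are placeholder numbers, not measurements from the talk):

    # Attainable performance is capped by the compute peak and by bandwidth times intensity
    def roofline(peak_ops_per_sec, bandwidth_bytes_per_sec, arithmetic_intensity):
        return min(peak_ops_per_sec, bandwidth_bytes_per_sec * arithmetic_intensity)

    print(roofline(1e12, 10e9, 0.5))    # 5.0e9 ops/s: memory-bound region
    print(roofline(1e12, 10e9, 500.0))  # 1.0e12 ops/s: compute-bound region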

5

Convolutional Neural Networks (CNNs)

Heavy computation

[Figure: AlexNet classifying an input image as "Cat"]

Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", NIPS, 2012

6

CONV-Layer as Matrix-Matrix Multiplication

A convolutional layer convolves a set of filters over the input data.

Reorganize the data to create a matrix-matrix multiplication or a number of matrix-vector multiplications.

Potential problem: if the filters overlap, the input data is duplicated (in theory).
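A minimal numpy sketch of this data reorganization (often called im2col), assuming stride 1 and no padding; note how overlapping windows copy the same input pixels into several columns, which is the duplication mentioned above:

    import numpy as np

    def conv_as_matmul(x, w):
        """x: input of shape (C, H, W); w: filters of shape (K, C, R, S)."""
        C, H, W = x.shape
        K, _, R, S = w.shape
        out_h, out_w = H - R + 1, W - S + 1
        # im2col: one column per output position, duplicating overlapping input pixels
        cols = np.empty((C * R * S, out_h * out_w))
        for i in range(out_h):
            for j in range(out_w):
                cols[:, i * out_w + j] = x[:, i:i + R, j:j + S].ravel()
        # The convolution becomes a (K x CRS) by (CRS x out_h*out_w) matrix multiplication
        return (w.reshape(K, -1) @ cols).reshape(K, out_h, out_w)

    x = np.random.rand(3, 8, 8)      # a 3-channel 8x8 input (made-up sizes)
    w = np.random.rand(16, 3, 3, 3)  # 16 filters of size 3x3
    y = conv_as_matmul(x, w)         # shape (16, 6, 6)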

7

Choice of Matrix Multiplication Algorithm

The acceleration-friendly properties again: lots of compute and parallelism, high arithmetic intensity, few or predictable branches, regular memory accesses.

[Figure: dense matrix multiplication vs. sparse matrix multiplication rated against these properties; dense satisfies them, sparse only partially ("can have")]

Choice of algorithm is a trade-off that depends on platform and input characteristics.
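A toy illustration of the trade-off using scipy (a hypothetical pruned layer with 10% nonzeros; which variant wins depends on the platform and the actual sparsity):

    import numpy as np
    from scipy import sparse

    W = sparse.random(1024, 1024, density=0.1, format="csr")  # pruned weight matrix
    x = np.random.rand(1024)

    y_dense = W.toarray() @ x  # dense GEMV: regular accesses, but 90% of the multiplies hit zeros
    y_sparse = W @ x           # sparse SpMV: ~10x fewer multiplies, but irregular memory accesses

    assert np.allclose(y_dense, y_sparse)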

8

CNN Inference Platform Alternatives

[Table: CPU, GPU, ASIC and FPGA rated qualitatively against five metrics (high performance, sufficient performance, high energy efficiency, short development time, low cost), with a per-metric "winner" drawn from CPU, CPU/GPU, GPU, ASIC and FPGA]

Differentiator: FPGAs can customize the architecture to fit the current problem.

Overall: no platform stands out as a clear winner across all metrics.

9

CNN Customization Potential

Exploit the characteristics of neural networks
• Many different CNNs can provide similar accuracy
• Choose one that matches the strengths of the target platform
• Retraining may be necessary
• Potential synergies between CNN algorithm development and acceleration potential

Dimensions of customization
• Accelerator architecture (generic vs. network-specific)
• Data type (fixed point vs. floating point)
• Data precision (binary, 8-bit, 64-bit, etc.)
• FPGA-friendly network transformations


11

Redundancy and Quantization

Evidence of redundancy in trained networks
• Sparsification, low-rank approximations, fault tolerance, …

Reduced precision (quantization)
• Restrict weights and/or activations to Q-bit values (a small sketch follows below)
• HW benefits: low-bitwidth datapaths, regular compute

Sung et al.: quantization works well when:
• … the network is "big enough"
• … the network is aware of quantization during training

"(…) the performance gap between the floating-point and the retrain-based ternary (+1, 0, -1) weight neural networks (…) almost vanishes in fully complex networks (…)" (Sung et al., Resiliency of Deep Neural Networks Under Quantization)
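A minimal sketch of what restricting weights to Q-bit values can look like (plain uniform quantization; the cited works use more refined, retraining-aware schemes):

    import numpy as np

    def quantize_uniform(w, q_bits):
        # Map each float weight onto one of 2**q_bits evenly spaced levels over its range
        levels = 2 ** q_bits
        lo, hi = w.min(), w.max()
        scale = (hi - lo) / (levels - 1)
        codes = np.round((w - lo) / scale).astype(np.int32)  # integer codes in [0, levels-1]
        return lo + codes * scale, codes                      # dequantized weights, raw codes

    w = np.random.randn(4096)
    w_q, codes = quantize_uniform(w, q_bits=8)  # 8-bit weights
    print(np.abs(w - w_q).max())                # worst-case quantization error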

12

Binarized Neural Networks (BNNs): the extreme case of quantization
• Permit only two values: +1 and -1
• Binary weights, binary activations

By Courbariaux and Hubara et al. (NIPS 2016)
• Open-source training flow, based on Theano and Lasagne
• Layers: convolutional, fully-connected, batch norm and maxpool

Close to state-of-the-art accuracy on smaller image classification benchmarks
• And getting steadily better at the bigger benchmarks.

Quantization                          MNIST     SVHN      CIFAR-10
Binary weights, binary activations    0.96%     2.53%     10.15%
Binary weights, FP activations        1.29%     2.30%     9.90%
FP weights, FP activations            0.94%     1.69%     7.62%
BNN accuracy loss                     -0.02%    -0.84%    -2.53%

% classification error (lower is better)

13

BNN Performance Potential on FPGAs

Fewer LUTs/op: higher peak performance (66 TOPS vs. 1 TOPS)

Stay on-chip: achieve more of the peak (40 TOPS vs. 0.1 TOPS)
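With both weights and activations restricted to +1/-1, a dot product collapses into XNOR plus popcount, which is why so few LUTs per operation are needed. A minimal bit-packed sketch (plain Python, not FINN's actual hardware datapath):

    def binary_dot(a_bits, w_bits, n):
        # Length-n vectors of {+1, -1} packed as bits: bit value 1 means +1, bit value 0 means -1
        agree = ~(a_bits ^ w_bits) & ((1 << n) - 1)  # XNOR: 1 wherever the signs match
        pop = bin(agree).count("1")                  # popcount of matching positions
        return 2 * pop - n                           # (#matches) - (#mismatches)

    # Two 4-element vectors that agree in 2 of 4 positions -> dot product 0
    print(binary_dot(0b1011, 0b1101, 4))  # 0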


15

FINN’s Heterogeneous Streaming Architecture

[Figure: a BNN topology mapped onto the FPGA as a streaming pipeline of hardware layers (Layer 0, Layer 1, …, Layer N) from image to result; a 1 Mops layer is given 1 PE while a 10 Mops layer is given 10 PEs]

One hardware layer per BNN layer (PE allocation sketched below)

Heterogeneous: avoid "one-size-fits-all" penalties (computation is not uniform across layers)

Streaming: maximize throughput and minimize latency (overlap communication and computation)

Umuroglu et al., "FINN: A Framework for Fast, Scalable Binarized Neural Network Inference", FPGA, 2017
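A small sketch of the rate-balancing idea behind the PE counts in the figure (illustrative only; FINN's actual folding parameters also cover SIMD lanes and layer types):

    def allocate_pes(layer_ops, cycles_per_frame):
        # One hardware layer per BNN layer; give each enough PEs to finish a frame
        # within the cycle budget, so no pipeline stage becomes the bottleneck.
        return [max(1, -(-ops // cycles_per_frame)) for ops in layer_ops]  # ceiling division

    # The slide's example: a 1 Mops layer next to a 10 Mops layer
    print(allocate_pes([1_000_000, 10_000_000], cycles_per_frame=1_000_000))  # [1, 10]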

16

Outline

Part1: CNNaccelerationonFPGAsshouldexploitcustomization

Part2: CNNcustomizationpossibilities

Quantization

Acceleratorarchitecture

FPGA-friendlynetworktransformations

Memorysystemchallenges

Part3: Conclusion

17

Transformation Example: Streamlining

State-of-the-art quantized neural network methods still employ some floating-point computations to improve accuracy
• Batch normalization
• Alpha-scaling
• Non-integer quantization levels

Streamlining avoids floating-point computations by:
• Viewing quantization as successive thresholding
• Moving and collapsing linear transformations
• Absorbing linear operations into thresholds (sketched below)

Umuroglu and Jahre, "Streamlined Deployment for Quantized Neural Networks", to appear at HENND, 2017

Fewer layers reduce FPGA resource consumption in network-specific architectures

[Figure: floating-point layers streamlined into integer layers]
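A small sketch of the absorption step, assuming a binary output activation and a positive batch-norm scale (an illustrative derivation, not the exact transformation from the paper):

    import numpy as np

    def bn_to_threshold(gamma, beta, mu, sigma):
        # sign(gamma * (a - mu) / sigma + beta) == +1   <=>   a >= mu - beta * sigma / gamma
        # (for gamma, sigma > 0), so the float batch norm collapses into one integer threshold.
        return int(np.ceil(mu - beta * sigma / gamma))

    gamma, beta, mu, sigma = 0.5, 1.0, 10.0, 2.0  # made-up trained batch-norm parameters
    T = bn_to_threshold(gamma, beta, mu, sigma)   # threshold = 6
    a = 7                                         # integer accumulator from a binary conv
    out = +1 if a >= T else -1                    # same result as sign(batchnorm(a)), no floats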


19

Off-Chip Weight Storage

Large neural networks force weights to be stored off-chip
• Increases bandwidth needs
• Need to exploit Memory-Level Parallelism (MLP) to maximize bandwidth utilization

Neural network structures can be made sparse
• Potential for reducing compute and memory requirements
• Efficiently exploiting sparsity is tricky

// Sparse matrix-vector multiply over a CSC matrix (colptr, rowind, values): y += A * x
for (int j = 0; j < n; j++)                          // each column of A
    for (int i = colptr[j]; i < colptr[j + 1]; i++)  // each nonzero in column j
        y[rowind[i]] += values[i] * x[j];            // scattered (random-access) update of y

20

Accelerator Performance Tricks

Most of the random-access vector is unused at any given time
• Use light-weight preprocessing to determine the needed cache capacity (sketched below)
• Use spare on-chip memory for something else

Overlap memory accesses and computation
• Balanced system: accelerator compute capability should match memory subsystem performance
• Parallelism effectively hides the compute latency
• Exploit Memory-Level Parallelism (MLP) to further improve memory bus utilization and performance
• Solution: non-blocking caches

Umuroglu and Jahre, "A Vector Caching Scheme for Streaming FPGA SpMV Accelerators", ARC, 2015

Umuroglu and Jahre, "Random Access Schemes for Efficient FPGA SpMV Acceleration", Microprocessors and Microsystems, 2016
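A sketch of the kind of light-weight preprocessing the first bullet refers to: one pass over the column pointers and row indices of a CSC tile tells you how much of the randomly accessed result vector it can touch, and hence how large the on-chip buffer really needs to be (illustrative only; the cited papers describe the actual schemes):

    def result_vector_footprint(colptr, rowind, col_begin, col_end):
        # Row indices touched by the nonzeros of columns [col_begin, col_end)
        touched = rowind[colptr[col_begin]:colptr[col_end]]
        if not touched:
            return 0
        return max(touched) - min(touched) + 1  # span of y entries to buffer on-chip

    colptr = [0, 2, 3, 5]  # toy 3-column CSC matrix
    rowind = [1, 4, 4, 2, 3]
    print(result_vector_footprint(colptr, rowind, 0, 3))  # 4 (rows 1..4)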


22

Conclusion

Deep learning is well suited for acceleration
• No compute platform is a clear winner across all performance metrics
• FPGAs excel when we can leverage heavily customized accelerators
• Need to identify neural networks with computation and memory patterns that are suitable to FPGA platform characteristics

Possibilities for customization
• Examples: data type, precision, architecture and network transformations
• Significant potential for co-design of neural network algorithms and FPGA accelerators

23

The Telenor-NTNU AI-Lab
• To enable both basic and applied research
• To allow a wide variety of research areas
• To maintain research at the highest international level
• To enable cross-disciplinary collaboration

24

Thank You!

The following people have contributed to the ideas covered in this presentation:
• Yaman Umuroglu, NTNU
• Michaela Blott, Xilinx Research Labs
• Researchers at the NTNU Computer Architecture Lab
• Researchers at Xilinx Research Labs
