TRANSCRIPT
Using FPGAs to Accelerate Neural Network Inference
1st FPL Workshop on Reconfigurable Computing for Deep Learning (RC4DL), 8 September 2017, Ghent, Belgium
Associate Professor Magnus Jahre, Department of Computer Science, Norwegian University of Science and Technology
Many of the contributions discussed in this presentation have been developed in close collaboration with Xilinx Research Labs
Outline
Part 1: CNN acceleration on FPGAs should exploit customization
Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges
Part 3: Conclusion
Properties of Acceleration-friendly Workloads
• Lots of compute and parallelism
• High arithmetic intensity
• Few or predictable branches
• Regular memory accesses
[Figure: roofline model of a platform, plotting performance (operations/second) against arithmetic intensity (operations/byte), with the compute-bound region and two applications, A and B, marked]
Williams et al., “Roofline: An Insightful Visual Performance Model for Multicore Architectures”, CACM, 2009
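In formula form, the roofline bound is: attainable performance = min(peak compute, memory bandwidth × arithmetic intensity). A minimal sketch of that bound in C; both platform numbers are made up purely for illustration:

#include <stdio.h>

/* Roofline bound: attainable = min(peak, bandwidth * intensity).
 * The two platform parameters below are assumed values. */
int main(void) {
    const double peak_gops = 1000.0; /* compute ceiling, Gop/s (assumed) */
    const double bw_gbs    = 50.0;   /* memory bandwidth, GB/s (assumed) */
    for (double ai = 0.5; ai <= 64.0; ai *= 2.0) {    /* ops per byte    */
        double bound = bw_gbs * ai;                   /* bandwidth slope */
        if (bound > peak_gops) bound = peak_gops;     /* compute roof    */
        printf("AI %5.1f op/B -> %7.1f Gop/s attainable\n", ai, bound);
    }
    return 0;
}

With these assumed numbers the ridge point sits at 1000/50 = 20 op/B: an application to the right of it is compute-bound, one to the left is bandwidth-bound.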
Convolutional Neural Networks (CNNs)
Heavy computation
[Figure: AlexNet processing an input image to produce the classification “Cat”]
Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS, 2012
CONV-Layer as Matrix-Matrix Multiplication
A convolutional layer convolves a set of filters over the input data.
Reorganize the data to create a matrix-matrix multiplication or a number of matrix-vector multiplications, as sketched below.
Potential problem: if the filter windows overlap, the input data is duplicated (in theory).
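A minimal sketch of this data reorganization (often called im2col); the function name and the stride-1, no-padding, row-major assumptions are ours, purely for illustration:

/* Lower a C x H x W input into a (C*K*K) x (OH*OW) matrix so that a
 * convolutional layer with F filters becomes one matrix-matrix product
 * with the F x (C*K*K) filter matrix. Assumes stride 1, no padding. */
void im2col(const float *in, int C, int H, int W, int K, float *out) {
    int OH = H - K + 1, OW = W - K + 1;           /* output dimensions */
    for (int c = 0; c < C; c++)
        for (int ky = 0; ky < K; ky++)
            for (int kx = 0; kx < K; kx++) {
                int row = (c * K + ky) * K + kx;  /* one row per tap   */
                for (int oy = 0; oy < OH; oy++)
                    for (int ox = 0; ox < OW; ox++)
                        out[row * OH * OW + oy * OW + ox] =
                            in[(c * H + (oy + ky)) * W + (ox + kx)];
            }
}

The duplication mentioned above is visible here: overlapping windows copy the same input element into several columns of the lowered matrix.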
Choice of Matrix Multiplication Algorithm
Acceleration-friendly properties revisited:
• Lots of compute and parallelism
• High arithmetic intensity
• Few or predictable branches
• Regular memory accesses
Dense matrix multiplication has all of these properties; sparse matrix multiplication can have them, depending on the input.
The choice of algorithm is a trade-off that depends on platform and input characteristics.
CNN Inference Platform Alternatives
Metric                   CPU   GPU   ASIC   FPGA   “Winner”
High Performance         0     ++    +      +      GPU
Sufficient Performance   ++    ++    ?      ++     FPGA
High Energy Efficiency   0     +     ++     +      ASIC
Short Development Time   ++    ++    ?      +      CPU/GPU
Low Cost                 ++    +     ?      +      CPU
(++ = very good, + = good, 0 = neutral, ? = depends)
Overall: no platform stands out as a clear winner across all metrics.
Differentiator: FPGAs can customize the architecture to fit the current problem.
CNN Customization Potential
Exploit the characteristics of neural networks:
• Many different CNNs can provide similar accuracy
• Choose one that matches the strengths of the target platform
• Retraining may be necessary
• Potential synergies between CNN algorithm development and acceleration potential
Dimensions of customization:
• Accelerator architecture (generic vs. network-specific)
• Data type (fixed point vs. floating point)
• Data precision (binary, 8-bit, 64-bit, etc.)
• FPGA-friendly network transformations
Redundancy and Quantization
Evidence of redundancy in trained networks:
• Sparsification, low-rank approximations, fault tolerance, …
Reduced precision (quantization):
• Restrict weights and/or activations to Q-bit values
• HW benefits: low-bitwidth datapaths, regular compute
Sung et al.: quantization works well when…
• …the network is “big enough”
• …the network is aware of quantization during training
“(…) the performance gap between the floating-point and the retrain-based ternary (+1, 0, -1) weight neural networks (…) almost vanishes in fully complex networks (…)” (Sung et al., “Resiliency of Deep Neural Networks Under Quantization”)
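A minimal sketch of uniform Q-bit weight quantization with a single scale factor, to make the idea concrete; this is our illustration, not the exact scheme evaluated by Sung et al.:

#include <math.h>
#include <stdint.h>

/* Map a weight into the symmetric integer range
 * [-(2^(Q-1)-1), 2^(Q-1)-1]. For q_bits = 2 this is exactly the
 * ternary {-1, 0, +1} case quoted above; for q_bits = 8, [-127, 127].
 * Assumes q_bits <= 8 so the result fits in an int8_t. */
int8_t quantize(float w, float max_abs, int q_bits) {
    float levels = (float)((1 << (q_bits - 1)) - 1);
    float q = roundf(w / max_abs * levels);  /* scale, round to a level */
    if (q >  levels) q =  levels;            /* clip outliers           */
    if (q < -levels) q = -levels;
    return (int8_t)q;          /* dequantize: w ~ q * max_abs / levels */
}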
Binarized Neural Networks (BNNs)
The extreme case of quantization:
• Permit only two values: +1 and -1
• Binary weights, binary activations
By Courbariaux and Hubara et al. (NIPS 2016):
• Open-source training flow, based on Theano and Lasagne
• Layers: convolutional, fully-connected, batch norm and maxpool
Close to state-of-the-art accuracy on smaller image classification benchmarks:
• And getting steadily better at the bigger benchmarks

Quantization                          MNIST    SVHN     CIFAR-10
Binary weights, binary activations    0.96%    2.53%    10.15%
Binary weights, FP activations        1.29%    2.30%    9.90%
FP weights, FP activations            0.94%    1.69%    7.62%
BNN accuracy loss                     -0.02%   -0.84%   -2.53%
% classification error (lower is better)
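Why binarization is so hardware-friendly: with +1 encoded as bit 1 and -1 as bit 0, a dot product collapses to XNOR and popcount. A minimal sketch; the function name is ours, and __builtin_popcountll assumes GCC or Clang:

#include <stdint.h>

/* Dot product of two 64-element {-1,+1} vectors packed into one word.
 * Matching bits contribute +1 and differing bits contribute -1, so
 * dot = 64 - 2 * popcount(a XOR b). */
static inline int bnn_dot64(uint64_t a, uint64_t b) {
    return 64 - 2 * __builtin_popcountll(a ^ b);
}

On an FPGA this maps to a wide XOR plus a LUT-based adder tree, which is the source of the peak-performance gains shown next.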
BNN Performance Potential on FPGAs
• Fewer LUTs/op: higher peak performance (from 1 TOPS to 66 TOPS)
• Stay on-chip: achieve more of the peak (from 0.1 TOPS to 40 TOPS)
FINN’s Heterogeneous Streaming Architecture
[Figure: a BNN topology mapped onto the FPGA as a pipeline, image → Layer 0 → Layer 1 → … → Layer N → result; a 1 Mops layer gets 1 PE, a 10 Mops layer gets 10 PEs]
One hardware layer per BNN layer, as sketched below.
Heterogeneous: avoid “one-size-fits-all” penalties (computation is not uniform across layers)
Streaming: maximize throughput and minimize latency (overlap communication and computation)
Umuroglu et al., “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference”, FPGA, 2017
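A minimal sketch of the heterogeneous sizing idea; this is our illustration, not FINN's actual folding algorithm. Each hardware layer gets a PE count proportional to its workload, so all pipeline stages take roughly the same time and no single layer stalls the stream:

/* Assign PE counts proportional to per-layer work (ops per image). */
void size_pipeline(const long ops[], int n_layers, int pe[]) {
    long min_ops = ops[0];
    for (int i = 1; i < n_layers; i++)
        if (ops[i] < min_ops) min_ops = ops[i];
    for (int i = 0; i < n_layers; i++)
        pe[i] = (int)((ops[i] + min_ops - 1) / min_ops);  /* round up */
}
/* With ops = {1000000, 10000000} (the 1 Mops and 10 Mops layers in the
 * figure) this yields pe = {1, 10}, matching the 1xPE / 10xPE split. */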
Transformation Example: Streamlining
State-of-the-art quantized neural network methods still employ some floating-point computations to improve accuracy:
• Batch normalization
• Alpha-scaling
• Non-integer quantization levels
Streamlining avoids floating-point computations by:
• Viewing quantization as successive thresholding
• Moving and collapsing linear transformations
• Absorbing linear operations into thresholds (see the sketch below)
Umuroglu and Jahre, “Streamlined Deployment for Quantized Neural Networks”, to appear at HENND, 2017
Fewer layers reduce FPGA resource consumption in network-specific architectures.
[Figure: a floating-point layer collapsed into integer layers]
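To make “absorbing linear operations into thresholds” concrete, here is a minimal derivation sketch for a binary activation preceded by batch normalization; our own illustration, assuming a positive scale gamma:

/* sign(gamma * (a - mu) / sigma + beta) is +1 exactly when the
 * accumulator a crosses a fixed point, so the whole floating-point
 * batch norm collapses into one precomputed threshold:
 *   gamma * (a - mu) / sigma + beta >= 0
 *   <=>  a >= mu - beta * sigma / gamma     (for gamma > 0;
 *        if gamma < 0 the comparison direction flips). */
float bn_to_threshold(float gamma, float beta, float mu, float sigma) {
    return mu - beta * sigma / gamma;  /* a >= t  <=>  activation = +1 */
}

At inference time each neuron then needs only an integer comparison against a threshold computed once, offline, which is how the floating-point layer in the figure collapses into integer layers.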
Off-Chip Weight Storage
Large neural networks force weights to be stored off-chip:
• Increases bandwidth needs
• Need to exploit Memory-Level Parallelism (MLP) to maximize bandwidth utilization
Neural network structures can be made sparse:
• Potential for reducing compute and memory requirements
• Efficiently exploiting sparsity is tricky

Example: sparse matrix-vector multiply y += A·x with A in compressed sparse column (CSC) format:
for (int j = 0; j < n; j++)                         /* for each column      */
    for (int i = colptr[j]; i < colptr[j + 1]; i++) /* nonzeros in column j */
        y[rowind[i]] += values[i] * x[j];           /* random access into y */
Accelerator Performance Tricks
Most of the random-access vector is unused at any given time:
• Use light-weight preprocessing to determine the needed cache capacity (see the sketch below)
• Use spare on-chip memory for something else
Overlap memory accesses and computation:
• Balanced system: accelerator compute capability should match memory subsystem performance
• Parallelism effectively hides the compute latency
• Exploit Memory-Level Parallelism (MLP) to further improve memory bus utilization and performance
• Solution: non-blocking caches
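A sketch of what such light-weight preprocessing could look like for the CSC SpMV loop shown earlier; this is our illustration of the idea, not the exact scheme from the papers below:

#include <limits.h>

/* Before streaming columns [col_begin, col_end), find the window of the
 * result vector y they can touch; only (hi - lo + 1) elements of y need
 * on-chip caching, and the spare BRAM can be used for something else. */
void result_window(const int *colptr, const int *rowind,
                   int col_begin, int col_end, int *lo, int *hi) {
    *lo = INT_MAX;
    *hi = -1;
    for (int j = col_begin; j < col_end; j++)
        for (int i = colptr[j]; i < colptr[j + 1]; i++) {
            if (rowind[i] < *lo) *lo = rowind[i];
            if (rowind[i] > *hi) *hi = rowind[i];
        }
}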
Umuroglu and Jahre, “A Vector Caching Scheme for Streaming FPGA SpMV Accelerators”, ARC, 2015
Umuroglu and Jahre, “Random Access Schemes for Efficient FPGA SpMV Acceleration”, Microprocessors and Microsystems, 2016
Conclusion
Deep learning is well suited for acceleration:
• No compute platform is a clear winner across all performance metrics
• FPGAs excel when we can leverage heavily customized accelerators
• Need to identify neural networks with computation and memory patterns that are suited to FPGA platform characteristics
Possibilities for customization:
• Examples: data type, precision, architecture and network transformations
• Significant potential for co-design of neural network algorithms and FPGA accelerators
The Telenor-NTNU AI-Lab
• To enable both basic and applied research
• To allow a wide variety of research areas
• To maintain research at the highest international level
• To enable cross-disciplinary collaboration
Thank You!
The following people have contributed to the ideas covered in this presentation:
• Yaman Umuroglu, NTNU
• Michaela Blott, Xilinx Research Labs
• Researchers at the NTNU Computer Architecture Lab
• Researchers at Xilinx Research Labs