Page 1

Using FPGAs to Accelerate Neural Network Inference
1st FPL Workshop on Reconfigurable Computing for Deep Learning (RC4DL)
8 September 2017, Ghent, Belgium

Associate Professor Magnus Jahre
Department of Computer Science
Norwegian University of Science and Technology

Many of the contributions discussed in this presentation have been developed in close collaboration with Xilinx Research Labs.

Page 2

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 3

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 4

Properties of Acceleration-friendly Workloads

• Lots of compute and parallelism
• High arithmetic intensity
• Few or predictable branches
• Regular memory accesses

[Figure: roofline model of a platform; performance (operations/second) versus arithmetic intensity (operations/byte), with a compute-bound region and two example applications A and B.]

Williams et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009
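The roofline bound itself fits in a few lines of code. The following C sketch (with placeholder peak and bandwidth numbers, not figures from the talk) computes attainable performance as the minimum of the compute roof and the bandwidth-limited throughput:

#include <stdio.h>

/* Roofline model (Williams et al., CACM 2009): attainable performance is
 * capped either by the platform's peak compute rate or by memory bandwidth
 * times the application's arithmetic intensity. Numbers below are
 * illustrative placeholders only. */
double roofline(double peak_ops, double bandwidth_bytes, double intensity)
{
    double memory_roof = bandwidth_bytes * intensity;  /* ops/s if memory-bound */
    return memory_roof < peak_ops ? memory_roof : peak_ops;
}

int main(void)
{
    double peak = 1e12;   /* 1 TOPS peak compute (placeholder)      */
    double bw   = 10e9;   /* 10 GB/s memory bandwidth (placeholder) */
    printf("low intensity  (1 op/byte):    %.2e ops/s\n", roofline(peak, bw, 1.0));
    printf("high intensity (1000 op/byte): %.2e ops/s\n", roofline(peak, bw, 1000.0));
    return 0;
}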

Page 5

Convolutional Neural Networks (CNNs)

[Figure: the AlexNet convolutional neural network classifying an input image as "Cat"; heavy computation.]

Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", NIPS, 2012

Page 6

CONV-Layer as Matrix-Matrix Multiplication

• A convolutional layer convolves a set of filters over the input data
• Reorganize the data to create a matrix-matrix multiplication or a number of matrix-vector multiplications (an im2col sketch follows below)
• Potential problem: if the filters overlap, the input data is duplicated (in theory)
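As an illustration of the reorganization step, here is a minimal single-channel im2col-style lowering in C (stride 1, no padding; an illustrative sketch, not code from the talk). Each filter window becomes one column, so the convolution can then be computed as an ordinary matrix-matrix product with the filters laid out as rows; the duplication of overlapping input pixels is visible in the inner loops.

/* Lower a KxK convolution over an HxW single-channel input into a
 * (K*K) x (outH*outW) matrix stored in row-major order in 'out'.
 * Overlapping filter windows copy the same input values into several
 * columns, which is exactly the data duplication mentioned above. */
void im2col(const float *in, int H, int W, int K, float *out)
{
    int outH = H - K + 1, outW = W - K + 1;
    for (int oy = 0; oy < outH; oy++)
        for (int ox = 0; ox < outW; ox++)
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    /* row = filter element, column = output pixel */
                    out[(ky * K + kx) * (outH * outW) + (oy * outW + ox)] =
                        in[(oy + ky) * W + (ox + kx)];
}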

Page 7

Choice of Matrix Multiplication Algorithm

• Lots of compute and parallelism
• High arithmetic intensity
• Few or predictable branches
• Regular memory accesses

• Dense matrix multiplication
• Sparse matrix multiplication (can have the properties above)

Choice of algorithm is a trade-off that depends on platform and input characteristics.

Page 8

CNN Inference Platform Alternatives

[Table: qualitative ratings of CPU, GPU, ASIC and FPGA against five metrics (high performance, sufficient performance, high energy efficiency, short development time, low cost), with a per-metric "winner": CPU, CPU/GPU, ASIC, GPU or FPGA.]

Differentiator: FPGAs can customize the architecture to fit the current problem.

Overall: no platform stands out as a clear winner across all metrics.

Page 9

CNN Customization Potential

Exploit the characteristics of neural networks
• Many different CNNs can provide similar accuracy
• Choose one that matches the strengths of the target platform
• Retraining may be necessary
• Potential synergies between CNN algorithm development and acceleration potential

Dimensions of customization
• Accelerator architecture (generic vs. network-specific)
• Data type (fixed point vs. floating point)
• Data precision (binary, 8 bit, 64 bit, etc.)
• FPGA-friendly network transformations

Page 10

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 11

Redundancy and Quantization

Evidence of redundancy in trained networks
• Sparsification, low-rank approximations, fault tolerance, …

Reduced precision (quantization)
• Restrict weights and/or activations to Q-bit values (sketched below)
• HW benefits: low-bitwidth datapaths, regular compute

Sung et al.: quantization works well when…
• …the network is "big enough"
• …the network is aware of quantization during training

"(…) the performance gap between the floating-point and the retrain-based ternary (+1, 0, -1) weight neural networks (…) almost vanishes in fully complex networks (…)" (Sung et al., Resiliency of Deep Neural Networks Under Quantization)
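To make the idea concrete, the sketch below shows plain uniform, symmetric quantization of a single weight to a signed Q-bit code in C. The scale factor and the clipping are illustrative assumptions; as the Sung et al. quote indicates, in practice the scale is chosen per tensor and the network is retrained with quantization in the loop.

#include <math.h>

/* Uniform symmetric quantization of a weight to a signed Q-bit code.
 * 'scale' maps the real-valued range onto the integer grid; illustrative
 * sketch only, not the exact scheme of any particular paper. */
int quantize(float w, float scale, int Q)
{
    int qmax = (1 << (Q - 1)) - 1;       /* e.g. 127 for Q = 8 */
    int q = (int)lroundf(w / scale);
    if (q >  qmax) q =  qmax;            /* clip to the representable range */
    if (q < -qmax) q = -qmax;
    return q;
}

float dequantize(int q, float scale)     /* approximate real value back */
{
    return q * scale;
}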

Page 12

Binarized Neural Networks (BNNs)

The extreme case of quantization:
• Permit only two values: +1 and -1
• Binary weights, binary activations (an XNOR/popcount sketch follows after the table below)

By Courbariaux and Hubara et al. (NIPS 2016)
• Open-source training flow, based on Theano and Lasagne
• Layers: convolutional, fully-connected, batch norm and maxpool

Close to state-of-the-art accuracy on smaller image classification benchmarks
• And getting steadily better at the bigger benchmarks.

Quantization                          MNIST    SVHN    CIFAR-10
Binary weights, binary activations    0.96%    2.53%   10.15%
Binary weights, FP activations        1.29%    2.30%    9.90%
FP weights, FP activations            0.94%    1.69%    7.62%
BNN accuracy loss                    -0.02%   -0.84%   -2.53%

(% classification error; lower is better)
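With both weights and activations restricted to +1/-1, a dot product collapses to an XNOR followed by a popcount. The C sketch below assumes the values are packed one per bit (1 encodes +1, 0 encodes -1), that unused padding bits are zero in both operands, and that the GCC/Clang popcount builtin is available; it illustrates the arithmetic, not FINN's implementation.

#include <stdint.h>

/* Binarized dot product: matching bits contribute +1, mismatching bits -1,
 * so dot = matches - mismatches = 2 * matches - n_bits. */
int bnn_dot(const uint64_t *w, const uint64_t *a, int n_words, int n_bits)
{
    int matches = 0;
    for (int i = 0; i < n_words; i++)
        matches += __builtin_popcountll(~(w[i] ^ a[i]));  /* XNOR + popcount */
    matches -= 64 * n_words - n_bits;  /* zero padding bits count as spurious matches */
    return 2 * matches - n_bits;
}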

Page 13

BNN Performance Potential on FPGAs

[Figure: fewer LUTs per op gives higher peak performance (1 TOPS vs. 66 TOPS); staying on-chip achieves more of the peak (0.1 TOPS vs. 40 TOPS).]

Page 14

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 15

FINN’s Heterogeneous Streaming Architecture

[Figure: streaming dataflow on the FPGA: image → Layer 0 → Layer 1 → … → Layer N → result. A BNN layer with 1 Mops is assigned 1 PE; a layer with 10 Mops is assigned 10 PEs.]

One hardware layer per BNN layer (a toy PE-allocation sketch follows below)

Heterogeneous: avoid "one-size-fits-all" penalties (computation is not uniform across layers)

Streaming: maximize throughput and minimize latency (overlap communication and computation)

Umuroglu et al., "FINN: A Framework for Fast, Scalable Binarized Neural Network Inference", FPGA, 2017
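The 1 Mops → 1 PE, 10 Mops → 10 PE scaling on the slide amounts to giving each streaming layer parallelism in proportion to its work, so that no single hardware layer throttles the pipeline. The C sketch below is a toy illustration of that balancing idea only; FINN's actual folding also has to respect per-layer resource and interface constraints.

/* Distribute a PE budget across streaming hardware layers in proportion to
 * each layer's operation count, so all layers have similar ops-per-PE and
 * the pipeline stays balanced. Toy illustration, not FINN's folding code. */
void allocate_pes(const double ops[], int n_layers, int total_pes, int pes[])
{
    double total_ops = 0.0;
    for (int l = 0; l < n_layers; l++)
        total_ops += ops[l];
    for (int l = 0; l < n_layers; l++) {
        pes[l] = (int)(total_pes * ops[l] / total_ops + 0.5);  /* round to nearest */
        if (pes[l] < 1)
            pes[l] = 1;  /* every layer needs at least one PE */
    }
}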

Page 16

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 17

Transformation Example: Streamlining

State-of-the-art quantized neural network methods still employ some floating-point computations to improve accuracy:
• Batch normalization
• Alpha-scaling
• Non-integer quantization levels

Streamlining avoids floating-point computations by:
• Viewing quantization as successive thresholding
• Moving and collapsing linear transformations
• Absorbing linear operations into thresholds (a minimal sketch follows below)

Umuroglu and Jahre, "Streamlined Deployment for Quantized Neural Networks", to appear at HENND, 2017

Fewer layers reduce FPGA resource consumption in network-specific architectures.

[Figure: the network before and after streamlining; a floating-point layer is replaced by integer layers.]
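As one concrete instance of absorbing a linear operation into a threshold: a batch norm followed by a binarizing sign activation fires +1 exactly when its input crosses a fixed value, so the floating-point normalization can be precomputed away. The C sketch below derives that threshold from the usual batch-norm parameters (gamma, beta, mean, variance); it is a simplified illustration, not code from the streamlining paper.

#include <math.h>

/* sign(gamma * (x - mean) / sqrt(var + eps) + beta) is +1 iff x crosses a
 * fixed threshold, so batch norm + sign collapses into one comparison.
 * For integer inputs the threshold can additionally be rounded.
 * Assumes gamma != 0. */
double bn_sign_threshold(double gamma, double beta, double mean,
                         double var, double eps, int *flip_compare)
{
    double scale = gamma / sqrt(var + eps);
    *flip_compare = (scale < 0);  /* a negative scale flips >= into <= */
    return mean - beta / scale;   /* output is +1 when x >= (or <=) this value */
}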

Page 18

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 19

Off-Chip Weight Storage

Large neural networks force weights to be stored off-chip
• Increases bandwidth needs
• Need to exploit Memory Level Parallelism (MLP) to maximize bandwidth utilization

Neural network structures can be made sparse
• Potential for reducing compute and memory requirements
• Efficiently exploiting sparsity is tricky

// Sparse matrix-vector multiply, CSC format: y += A * x
for (int j = 0; j < n; j++)                      // each column j of A
  for (int i = colptr[j]; i < colptr[j+1]; i++)  // nonzeros stored in column j
    y[rowind[i]] += values[i] * x[j];

Page 20

Accelerator Performance Tricks

Most of the randomly accessed vector is unused at any given time
• Use light-weight preprocessing to determine the needed cache capacity (sketched below)
• Use spare on-chip memory for something else

Overlap memory accesses and computation
• Balanced system: accelerator compute capability should match memory subsystem performance
• Parallelism effectively hides the compute latency
• Exploit Memory Level Parallelism (MLP) to further improve memory bus utilization and performance
• Solution: non-blocking caches

Umuroglu and Jahre, "A Vector Caching Scheme for Streaming FPGA SpMV Accelerators", ARC, 2015
Umuroglu and Jahre, "Random Access Schemes for Efficient FPGA SpMV Acceleration", Microprocessors and Microsystems, 2016
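One way to read the "light-weight preprocessing" bullet is a single pass over the sparse matrix indices to see how much of the randomly accessed vector a given matrix actually touches, so the on-chip vector buffer can be sized to that footprint and the remaining BRAM reused. The C sketch below shows the idea in its simplest form (min/max of the row indices used by the CSC loop above); it is a hedged sketch, not the caching scheme from the cited papers.

/* Scan the row indices of a CSC matrix once to find the index range of the
 * result vector y that the multiplication will actually touch. The span can
 * guide how much on-chip memory to reserve for y; spare capacity can then
 * be used for something else. Illustrative sketch only; assumes nnz > 0. */
void vector_footprint(const int *rowind, int nnz, int *min_row, int *max_row)
{
    *min_row = rowind[0];
    *max_row = rowind[0];
    for (int i = 1; i < nnz; i++) {
        if (rowind[i] < *min_row) *min_row = rowind[i];
        if (rowind[i] > *max_row) *max_row = rowind[i];
    }
}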

Page 21

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 22

Conclusion

Deep learning is well suited for acceleration
• No compute platform is a clear winner across all performance metrics
• FPGAs excel when we can leverage heavily customized accelerators
• Need to identify neural networks with computation and memory patterns that are suitable to FPGA platform characteristics

Possibilities for customization
• Examples: data type, precision, architecture and network transformations
• Significant potential for co-design of neural network algorithms and FPGA accelerators

Page 23

The Telenor-NTNU AI-Lab
• To enable both basic and applied research
• To allow a wide variety of research areas
• Maintain research at the highest international level
• To enable cross-disciplinary collaboration

Page 24

Thank You!

The following people have contributed to the ideas covered in this presentation:
• Yaman Umuroglu, NTNU
• Michaela Blott, Xilinx Research Labs
• Researchers at the NTNU Computer Architecture Lab
• Researchers at Xilinx Research Labs