Page 1

Using FPGAs to Accelerate Neural Network Inference
1st FPL Workshop on Reconfigurable Computing for Deep Learning (RC4DL)
8 September 2017, Ghent, Belgium

Associate Professor Magnus Jahre
Department of Computer Science
Norwegian University of Science and Technology

Many of the contributions discussed in this presentation have been developed in close collaboration with Xilinx Research Labs.

Page 2

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 3

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 4

Properties of Acceleration-friendly Workloads

• Lots of compute and parallelism
• High arithmetic intensity
• Few or predictable branches
• Regular memory accesses

[Figure: roofline model of a platform; performance (operations/second) versus arithmetic intensity (operations/byte), with a compute-bound region and two example applications A and B.]

Williams et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009
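The roofline bound itself fits in a few lines of code. The following C sketch (with placeholder peak and bandwidth numbers, not figures from the talk) computes attainable performance as the minimum of the compute roof and the bandwidth-limited throughput:

#include <stdio.h>

/* Roofline model (Williams et al., CACM 2009): attainable performance is
 * capped either by the platform's peak compute rate or by memory bandwidth
 * times the application's arithmetic intensity. Numbers below are
 * illustrative placeholders only. */
double roofline(double peak_ops, double bandwidth_bytes, double intensity)
{
    double memory_roof = bandwidth_bytes * intensity;  /* ops/s if memory-bound */
    return memory_roof < peak_ops ? memory_roof : peak_ops;
}

int main(void)
{
    double peak = 1e12;   /* 1 TOPS peak compute (placeholder)      */
    double bw   = 10e9;   /* 10 GB/s memory bandwidth (placeholder) */
    printf("low intensity  (1 op/byte):    %.2e ops/s\n", roofline(peak, bw, 1.0));
    printf("high intensity (1000 op/byte): %.2e ops/s\n", roofline(peak, bw, 1000.0));
    return 0;
}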

Page 5

Convolutional Neural Networks (CNNs)

[Figure: the AlexNet convolutional neural network classifying an input image as "Cat"; heavy computation.]

Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", NIPS, 2012

Page 6

CONV-Layer as Matrix-Matrix Multiplication

• A convolutional layer convolves a set of filters over the input data
• Reorganize the data to create a matrix-matrix multiplication or a number of matrix-vector multiplications (an im2col sketch follows below)
• Potential problem: if the filters overlap, the input data is duplicated (in theory)
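As an illustration of the reorganization step, here is a minimal single-channel im2col-style lowering in C (stride 1, no padding; an illustrative sketch, not code from the talk). Each filter window becomes one column, so the convolution can then be computed as an ordinary matrix-matrix product with the filters laid out as rows; the duplication of overlapping input pixels is visible in the inner loops.

/* Lower a KxK convolution over an HxW single-channel input into a
 * (K*K) x (outH*outW) matrix stored in row-major order in 'out'.
 * Overlapping filter windows copy the same input values into several
 * columns, which is exactly the data duplication mentioned above. */
void im2col(const float *in, int H, int W, int K, float *out)
{
    int outH = H - K + 1, outW = W - K + 1;
    for (int oy = 0; oy < outH; oy++)
        for (int ox = 0; ox < outW; ox++)
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    /* row = filter element, column = output pixel */
                    out[(ky * K + kx) * (outH * outW) + (oy * outW + ox)] =
                        in[(oy + ky) * W + (ox + kx)];
}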

Page 7

Choice of Matrix Multiplication Algorithm

• Lots of compute and parallelism
• High arithmetic intensity
• Few or predictable branches
• Regular memory accesses

• Dense matrix multiplication
• Sparse matrix multiplication (can have the properties above)

Choice of algorithm is a trade-off that depends on platform and input characteristics.

Page 8

CNN Inference Platform Alternatives

[Table: qualitative ratings of CPU, GPU, ASIC and FPGA against five metrics (high performance, sufficient performance, high energy efficiency, short development time, low cost), with a per-metric "winner": CPU, CPU/GPU, ASIC, GPU or FPGA.]

Differentiator: FPGAs can customize the architecture to fit the current problem.

Overall: no platform stands out as a clear winner across all metrics.

Page 9

CNN Customization Potential

Exploit the characteristics of neural networks
• Many different CNNs can provide similar accuracy
• Choose one that matches the strengths of the target platform
• Retraining may be necessary
• Potential synergies between CNN algorithm development and acceleration potential

Dimensions of customization
• Accelerator architecture (generic vs. network-specific)
• Data type (fixed point vs. floating point)
• Data precision (binary, 8 bit, 64 bit, etc.)
• FPGA-friendly network transformations

Page 10

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 11

Redundancy and Quantization

Evidence of redundancy in trained networks
• Sparsification, low-rank approximations, fault tolerance, …

Reduced precision (quantization)
• Restrict weights and/or activations to Q-bit values (sketched below)
• HW benefits: low-bitwidth datapaths, regular compute

Sung et al.: quantization works well when…
• …the network is "big enough"
• …the network is aware of quantization during training

"(…) the performance gap between the floating-point and the retrain-based ternary (+1, 0, -1) weight neural networks (…) almost vanishes in fully complex networks (…)" (Sung et al., Resiliency of Deep Neural Networks Under Quantization)
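To make the idea concrete, the sketch below shows plain uniform, symmetric quantization of a single weight to a signed Q-bit code in C. The scale factor and the clipping are illustrative assumptions; as the Sung et al. quote indicates, in practice the scale is chosen per tensor and the network is retrained with quantization in the loop.

#include <math.h>

/* Uniform symmetric quantization of a weight to a signed Q-bit code.
 * 'scale' maps the real-valued range onto the integer grid; illustrative
 * sketch only, not the exact scheme of any particular paper. */
int quantize(float w, float scale, int Q)
{
    int qmax = (1 << (Q - 1)) - 1;       /* e.g. 127 for Q = 8 */
    int q = (int)lroundf(w / scale);
    if (q >  qmax) q =  qmax;            /* clip to the representable range */
    if (q < -qmax) q = -qmax;
    return q;
}

float dequantize(int q, float scale)     /* approximate real value back */
{
    return q * scale;
}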

Page 12

Binarized Neural Networks (BNNs)

The extreme case of quantization:
• Permit only two values: +1 and -1
• Binary weights, binary activations (an XNOR/popcount sketch follows after the table below)

By Courbariaux and Hubara et al. (NIPS 2016)
• Open-source training flow, based on Theano and Lasagne
• Layers: convolutional, fully-connected, batch norm and maxpool

Close to state-of-the-art accuracy on smaller image classification benchmarks
• And getting steadily better at the bigger benchmarks.

Quantization                          MNIST    SVHN    CIFAR-10
Binary weights, binary activations    0.96%    2.53%   10.15%
Binary weights, FP activations        1.29%    2.30%    9.90%
FP weights, FP activations            0.94%    1.69%    7.62%
BNN accuracy loss                    -0.02%   -0.84%   -2.53%

(% classification error; lower is better)
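With both weights and activations restricted to +1/-1, a dot product collapses to an XNOR followed by a popcount. The C sketch below assumes the values are packed one per bit (1 encodes +1, 0 encodes -1), that unused padding bits are zero in both operands, and that the GCC/Clang popcount builtin is available; it illustrates the arithmetic, not FINN's implementation.

#include <stdint.h>

/* Binarized dot product: matching bits contribute +1, mismatching bits -1,
 * so dot = matches - mismatches = 2 * matches - n_bits. */
int bnn_dot(const uint64_t *w, const uint64_t *a, int n_words, int n_bits)
{
    int matches = 0;
    for (int i = 0; i < n_words; i++)
        matches += __builtin_popcountll(~(w[i] ^ a[i]));  /* XNOR + popcount */
    matches -= 64 * n_words - n_bits;  /* zero padding bits count as spurious matches */
    return 2 * matches - n_bits;
}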

Page 13

BNN Performance Potential on FPGAs

[Figure: fewer LUTs per op gives higher peak performance (1 TOPS vs. 66 TOPS); staying on-chip achieves more of the peak (0.1 TOPS vs. 40 TOPS).]

Page 14

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 15

FINN’s Heterogeneous Streaming Architecture

[Figure: streaming dataflow on the FPGA: image → Layer 0 → Layer 1 → … → Layer N → result. A BNN layer with 1 Mops is assigned 1 PE; a layer with 10 Mops is assigned 10 PEs.]

One hardware layer per BNN layer (a toy PE-allocation sketch follows below)

Heterogeneous: avoid "one-size-fits-all" penalties (computation is not uniform across layers)

Streaming: maximize throughput and minimize latency (overlap communication and computation)

Umuroglu et al., "FINN: A Framework for Fast, Scalable Binarized Neural Network Inference", FPGA, 2017
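The 1 Mops → 1 PE, 10 Mops → 10 PE scaling on the slide amounts to giving each streaming layer parallelism in proportion to its work, so that no single hardware layer throttles the pipeline. The C sketch below is a toy illustration of that balancing idea only; FINN's actual folding also has to respect per-layer resource and interface constraints.

/* Distribute a PE budget across streaming hardware layers in proportion to
 * each layer's operation count, so all layers have similar ops-per-PE and
 * the pipeline stays balanced. Toy illustration, not FINN's folding code. */
void allocate_pes(const double ops[], int n_layers, int total_pes, int pes[])
{
    double total_ops = 0.0;
    for (int l = 0; l < n_layers; l++)
        total_ops += ops[l];
    for (int l = 0; l < n_layers; l++) {
        pes[l] = (int)(total_pes * ops[l] / total_ops + 0.5);  /* round to nearest */
        if (pes[l] < 1)
            pes[l] = 1;  /* every layer needs at least one PE */
    }
}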

Page 16

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 17

Transformation Example: Streamlining

State-of-the-art quantized neural network methods still employ some floating-point computations to improve accuracy:
• Batch normalization
• Alpha-scaling
• Non-integer quantization levels

Streamlining avoids floating-point computations by:
• Viewing quantization as successive thresholding
• Moving and collapsing linear transformations
• Absorbing linear operations into thresholds (a minimal sketch follows below)

Umuroglu and Jahre, "Streamlined Deployment for Quantized Neural Networks", to appear at HENND, 2017

Fewer layers reduce FPGA resource consumption in network-specific architectures.

[Figure: the network before and after streamlining; a floating-point layer is replaced by integer layers.]
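As one concrete instance of absorbing a linear operation into a threshold: a batch norm followed by a binarizing sign activation fires +1 exactly when its input crosses a fixed value, so the floating-point normalization can be precomputed away. The C sketch below derives that threshold from the usual batch-norm parameters (gamma, beta, mean, variance); it is a simplified illustration, not code from the streamlining paper.

#include <math.h>

/* sign(gamma * (x - mean) / sqrt(var + eps) + beta) is +1 iff x crosses a
 * fixed threshold, so batch norm + sign collapses into one comparison.
 * For integer inputs the threshold can additionally be rounded.
 * Assumes gamma != 0. */
double bn_sign_threshold(double gamma, double beta, double mean,
                         double var, double eps, int *flip_compare)
{
    double scale = gamma / sqrt(var + eps);
    *flip_compare = (scale < 0);  /* a negative scale flips >= into <= */
    return mean - beta / scale;   /* output is +1 when x >= (or <=) this value */
}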

Page 18

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 19

Off-Chip Weight Storage

Large neural networks force weights to be stored off-chip
• Increases bandwidth needs
• Need to exploit Memory Level Parallelism (MLP) to maximize bandwidth utilization

Neural network structures can be made sparse
• Potential for reducing compute and memory requirements
• Efficiently exploiting sparsity is tricky

// Sparse matrix-vector multiply, CSC format: y += A * x
for (int j = 0; j < n; j++)                      // each column j of A
  for (int i = colptr[j]; i < colptr[j+1]; i++)  // nonzeros stored in column j
    y[rowind[i]] += values[i] * x[j];

Page 20

Accelerator Performance Tricks

Most of the randomly accessed vector is unused at any given time
• Use light-weight preprocessing to determine the needed cache capacity (sketched below)
• Use spare on-chip memory for something else

Overlap memory accesses and computation
• Balanced system: accelerator compute capability should match memory subsystem performance
• Parallelism effectively hides the compute latency
• Exploit Memory Level Parallelism (MLP) to further improve memory bus utilization and performance
• Solution: non-blocking caches

Umuroglu and Jahre, "A Vector Caching Scheme for Streaming FPGA SpMV Accelerators", ARC, 2015
Umuroglu and Jahre, "Random Access Schemes for Efficient FPGA SpMV Acceleration", Microprocessors and Microsystems, 2016
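One way to read the "light-weight preprocessing" bullet is a single pass over the sparse matrix indices to see how much of the randomly accessed vector a given matrix actually touches, so the on-chip vector buffer can be sized to that footprint and the remaining BRAM reused. The C sketch below shows the idea in its simplest form (min/max of the row indices used by the CSC loop above); it is a hedged sketch, not the caching scheme from the cited papers.

/* Scan the row indices of a CSC matrix once to find the index range of the
 * result vector y that the multiplication will actually touch. The span can
 * guide how much on-chip memory to reserve for y; spare capacity can then
 * be used for something else. Illustrative sketch only; assumes nnz > 0. */
void vector_footprint(const int *rowind, int nnz, int *min_row, int *max_row)
{
    *min_row = rowind[0];
    *max_row = rowind[0];
    for (int i = 1; i < nnz; i++) {
        if (rowind[i] < *min_row) *min_row = rowind[i];
        if (rowind[i] > *max_row) *max_row = rowind[i];
    }
}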

Page 21

Outline

Part 1: CNN acceleration on FPGAs should exploit customization

Part 2: CNN customization possibilities
• Quantization
• Accelerator architecture
• FPGA-friendly network transformations
• Memory system challenges

Part 3: Conclusion

Page 22

Conclusion

Deep learning is well suited for acceleration
• No compute platform is a clear winner across all performance metrics
• FPGAs excel when we can leverage heavily customized accelerators
• Need to identify neural networks with computation and memory patterns that are suitable to FPGA platform characteristics

Possibilities for customization
• Examples: data type, precision, architecture and network transformations
• Significant potential for co-design of neural network algorithms and FPGA accelerators

Page 23

The Telenor-NTNU AI-Lab
• To enable both basic and applied research
• To allow a wide variety of research areas
• Maintain research at the highest international level
• To enable cross-disciplinary collaboration

Page 24

Thank You!

The following people have contributed to the ideas covered in this presentation:
• Yaman Umuroglu, NTNU
• Michaela Blott, Xilinx Research Labs
• Researchers at the NTNU Computer Architecture Lab
• Researchers at Xilinx Research Labs