TRANSCRIPT
Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network
Jialiang Zhang and Jing Li
Outline
• Background
• Motivation
• Balance analysis model
• Proposed design
• Performance
• Conclusion
Convolutional Neural Network

Macro Architecture of VGG
Image from: https://blog.heuritech.com/2016/02/29/a-brief-report-of-the-heuritech-deep-learning-meetup-5/l
OpenCL Framework

[Diagram: the application source code splits into (i) an OpenCL kernel, compiled by the offline FPGA OpenCL compiler (CLANG frontend → IR analysis & optimization → Verilog backend, drawing on a basic-block library), and (ii) OpenCL C host API code running on the OpenCL framework; (iii) the FPGA holds the infrastructure and kernel regions, served by the FPGA runtime and driver.]
• OpenCL provides a good FPGA abstraction
• Unlike OpenCL on a GPU or CPU, OpenCL for FPGA describes both hardware and software
• Use #pragma to guide hardware generation: loop unrolling, SIMD factor, compute unit replication
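The pragma-guided hardware generation above can be sketched as a hypothetical kernel for the offline compiler. The attribute names (`num_compute_units`, `num_simd_work_items`, `reqd_work_group_size`) follow the Intel FPGA SDK for OpenCL; the kernel body and its parameters are illustrative, not the paper's design.

```c
// Hypothetical OpenCL FPGA kernel: the attributes below describe hardware,
// not just software, guiding the offline compiler's Verilog generation.
__attribute__((num_compute_units(4)))       // replicate the whole compute unit 4x
__attribute__((num_simd_work_items(8)))     // vectorize the datapath 8-wide
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void vec_mac(__global const float *restrict a,
                      __global const float *restrict b,
                      __global float *restrict out)
{
    int i = get_global_id(0);
    float acc = 0.0f;
    #pragma unroll 4                        // instantiate 4 MAC units per cycle
    for (int k = 0; k < 4; k++)
        acc += a[4 * i + k] * b[4 * i + k];
    out[i] = acc;
}
```

Note that the unroll factor, SIMD width, and CU count multiply together into the replication counts shown in the framework figure.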
OpenCL FPGA Framework

[Diagram, three levels: (a) top level — the host connects over a PCIe controller and SoC bus to the infrastructure region (dispatcher, DDR controllers, global memory) and the kernel region; control flow and data flow are shown separately. (b) Compute unit level — 4x CU replication (controlled by the user), each CU with its own local memory. (c) Processing element level — 2x2 PE replication (controlled by the user), each PE backed by BRAMs, with 16x on-chip memory replication (compiler inferred).]
Related works
• Generic CNN accelerators: [1]-[3]
• Balancing computation and external memory access: [4]-[7]
• Hardware abstraction: [7][8]
• Contribution:
– Identify the performance bottleneck in large-scale FPGA CNN accelerators: on-chip memory bandwidth
– A CNN accelerator that achieves an optimized balance among computation, on-chip memory, and external memory access

[1] Farabet, et al. Hardware accelerated convolutional neural networks for synthetic vision systems. ISCAS 2010
[2] M. Peemen, et al. Memory-centric accelerator design for convolutional neural networks. ICCD 2013
[3] V. Gokhale, et al. A 240 G-ops/s mobile coprocessor for deep neural networks. CVPR Workshops, 2014
[4] C. Zhang, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. ISFPGA 2015
[5] N. Suda, et al. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. ISFPGA 2016
[6] J. Qiu, et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. ISFPGA 2016
[7] C. Zhang, et al. Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks. ICCAD 2016
[8] H. Sharma, et al. From High-Level Deep Neural Models to FPGAs. MICRO 2016
Motivation
• New FPGA devices have more and faster DSP resources
• We should use all DSP resources every cycle to reach the GFLOPS number in the datasheet
• Existing designs cannot utilize the DSP resources well
Balance Analysis Model
• Balance is the relationship between storage and computation resources
• Assumptions:
– Only one bottleneck in the system
– Bandwidth bounded
– Computation and memory access overlap perfectly
             Machine Balance    Code Balance
On-chip      B_m^On             B_C^On
Off-chip     B_m^Off            B_C^Off

Machine balance is hardware determined; code balance is algorithm determined. "On-chip" refers to on-chip memory bandwidth, "off-chip" to off-chip memory bandwidth.
Machine Balance

[Diagram: computational resources (PE array, throughput N_DSP * OPS/DSP * f_DSP) fed on one side by on-chip memory (buffers, bandwidth N_BRAM * WIDTH_BRAM * f_BRAM) and on the other by off-chip memory (DDR banks, bandwidth N_DDR * BW_DDR).]

Machine balance is the ratio of memory bandwidth to peak compute throughput:

B_m^On  = (N_BRAM * WIDTH_BRAM * f_BRAM) / (N_DSP * OPS/DSP * f_DSP)
B_m^Off = (N_DDR * BW_DDR) / (N_DSP * OPS/DSP * f_DSP)
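The ratios above can be computed directly from device parameters. A minimal sketch in plain C, using illustrative numbers (an Arria 10-like device with 1518 DSPs, 2 ops per DSP per cycle, 0.37 GHz, and 17 GB/s of DDR bandwidth); the exact units and parameters in the paper may differ.

```c
#include <assert.h>
#include <math.h>

/* Machine balance = attainable memory bandwidth / peak compute throughput.
 * Any consistent units work; here GB/s over GOP/s gives bytes per op. */
static double machine_balance(double mem_bw, double peak_ops)
{
    return mem_bw / peak_ops;
}

/* Peak compute throughput in GOP/s: N_DSP * OPS/DSP * f_DSP (GHz). */
static double peak_compute(double n_dsp, double ops_per_dsp, double f_ghz)
{
    return n_dsp * ops_per_dsp * f_ghz;
}
```

For the illustrative device, peak_compute(1518, 2, 0.37) ≈ 1123 GOP/s, and 17 GB/s of external bandwidth gives an off-chip machine balance of roughly 0.015 bytes/op.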
Code Balance
• On-chip code balance is determined by the input and output data size of individual instructions
– For example, a MAC operation reads two operands for two operations (multiply and add), giving an on-chip code balance of 1
• Off-chip code balance is determined by the input and output data size of the whole program
– Assume unlimited on-chip memory size
Break the Memory Wall

To achieve P_max, we need to satisfy B_m^Off > B_C^Off and B_m^On > B_C^On.
Off-chip Balance

Layer    B_C^Off
CONV1    0.0021
CONV2    0.0010
CONV3    0.00062
CONV4    0.00087
CONV5    0.00277
FC       0.5

Technology Node    B_m^Off
28nm               0.168
20nm               0.217
14/16nm            0.177
14/16nm w/ HBM     0.177

• B_C^Off < B_m^Off for convolution layers
• B_C^Off > B_m^Off for the FC layer, but the FC layer contributes only a small portion of the computation
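The off-chip code balance in the table can be derived from layer shapes, assuming each word is moved on or off chip exactly once (unlimited on-chip memory). A sketch with illustrative VGG-like layer sizes, not taken from the slides:

```c
#include <assert.h>
#include <math.h>

/* Off-chip code balance (words per op) of a k x k convolution layer,
 * counting weights plus one load of the input map and one store of the
 * output map, against 2 ops (multiply + add) per MAC. */
static double conv_code_balance(double k, double c_in, double c_out,
                                double h, double w)
{
    double words = k * k * c_in * c_out   /* weights */
                 + h * w * c_in           /* input feature map */
                 + h * w * c_out;         /* output feature map */
    double ops = 2.0 * k * k * c_in * c_out * h * w;
    return words / ops;
}

/* Fully connected layer: the weight matrix dominates the traffic and is
 * used once per MAC, so the balance approaches 0.5 words per op. */
static double fc_code_balance(double n_in, double n_out)
{
    double words = n_in * n_out + n_in + n_out;
    double ops = 2.0 * n_in * n_out;
    return words / ops;
}
```

A 3x3, 256-in/256-out convolution on a 56x56 map gives about 0.0006 words/op, in the same range as the CONV rows above, while a 4096x4096 FC layer gives about 0.5, matching the FC row: convolution reuses each weight across the whole feature map, FC does not.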
On-chip Balance

Technology Node    B_m^On
28nm               0.0033
20nm               0.0026
14/16nm            0.0010

Layer    B_C^On
CONV1    1
CONV2    1
CONV3    1
CONV4    1
CONV5    1
FC       1

• B_C^On > B_m^On for all layers
• On-chip memory bandwidth becomes the bottleneck
• We need to increase B_m^On
Data Reuse Requirement

[Bar chart: required data-reuse factor by technology node — 28nm, 20nm, 14nm w/ HMC, 14nm w/ DDR — with values 303, 384, 72, and 1000.]

• The balance analysis assumes unlimited on-chip memory size (data is loaded only once)
• Under limited on-chip memory capacity, we need to load data more than once
• To satisfy the external memory bandwidth requirement, we need to reuse each loaded data element a minimum number of times
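One way to read the reuse requirement: if every word brought on chip must serve enough operations to keep all DSPs busy, the minimum reuse factor is the reciprocal of the machine balance. A sketch under that assumption, with balance values taken from the on-chip table above:

```c
#include <assert.h>
#include <math.h>

/* Minimum reuse factor: each word fetched must participate in at least
 * 1/B_m operations, or the memory system cannot keep up with compute. */
static double min_reuse(double balance_words_per_op)
{
    return 1.0 / balance_words_per_op;
}
```

With B_m = 0.0033 (28nm) the requirement is about 303 uses per word, and with B_m = 0.0010 (14/16nm) about 1000, consistent with the magnitudes in the chart.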
Optimization Direction
• Increase B_m^On to utilize all DSP resources
• Satisfy the data reuse requirement with limited on-chip memory size
Optimization Direction
• Increase B_m^On to utilize all DSP resources
• Satisfy the data reuse requirement with limited on-chip memory size
Matrix Multiplication Kernel

[Diagram: a 2x2 array of PEs, each PE paired with its own BRAM.]
Enable Multicasting

[Diagram repeats the OpenCL FPGA framework figure: (a) top level, (b) compute unit level with 4x CU replication and per-CU local memory, (c) processing element level with 2x2 PE replication, per-PE BRAMs, and 16x compiler-inferred on-chip memory replication.]

• By multicasting, we can increase the on-chip machine balance by a factor of N_PE
• BRAM usage reduction — experiment on an Arria 10 GX1150 FPGA (with 1518 DSPs and 2713 M20K memory blocks):
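The N_PE factor can be seen by counting BRAM accesses in a toy model: without multicast each PE issues its own read of the shared operand every cycle, while with multicast a single read is broadcast to all PEs. The functions and counts below are illustrative, not the paper's implementation.

```c
#include <assert.h>

/* On-chip reads needed to feed n_pe PEs for a given number of cycles. */
static long reads_unicast(long n_pe, long cycles)
{
    return n_pe * cycles;       /* every PE reads its own copy each cycle */
}

static long reads_multicast(long n_pe, long cycles)
{
    (void)n_pe;                 /* one read is broadcast to all PEs */
    return cycles;
}
```

The demand on on-chip bandwidth drops by N_PE, which is the same as saying the effective on-chip machine balance B_m^On grows by N_PE.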
Optimization Direction
• Increase B_m^On to utilize all DSP resources
• Satisfy the data reuse requirement with limited on-chip memory size
Kernel Design

[Diagram: a 2D dispatcher feeds four compute units; a buffer, pointer parameters, a shift register, and a crossbar connect the compute units to two external memory banks.]
CU Design

[Diagram (a): a grid of MAC units fed from BRAMs (32-bit x 512 deep) along its rows and columns. A 512-bit INPUT A bus and a 512-bit INPUT B bus each fan out 32x across the grid; results pass through flip-flops and a MUX, selected by an output address, onto a 512-bit output bus, with 8x duplication of the whole structure.]
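The CU diagram, with A fanned out along one edge of the MAC grid and B along the other, suggests a rank-1-update formulation of matrix multiply: each cycle, one element of A per row and one of B per column are multicast, and every MAC accumulates one product. A toy software model under that interpretation, with a 4x4 grid standing in for the real array:

```c
#include <assert.h>

#define N 4

/* Each "cycle" k, a[i][k] is multicast along row i and b[k][j] along
 * column j; MAC (i,j) accumulates one product (a rank-1 update). */
static void cu_matmul(const int a[N][N], const int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0;
    for (int k = 0; k < N; k++)          /* one multicast step per cycle */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}
```

Every fetched operand is reused across an entire row or column of MACs, which is exactly the multicast-driven reuse described earlier.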
Work-item Scheduling

[Diagram: an m x n grid of work-items (0,0) … (m,n). (a) One-dimensional dispatching walks the grid row by row; (b) two-dimensional dispatching issues rectangular blocks of size x_1 x x_2.]
Minimizing External Memory Bandwidth
• Objective:
– Find the optimized scheduling policy
– Minimize external memory bandwidth
• Constraint:
– On-chip memory size
• Based on:
– CNN layer parameters
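Why two-dimensional dispatching helps can be seen from a simple traffic model for an n x n matrix product: dispatching work in bx x by output blocks loads bx rows of A and by columns of B per block, so total traffic is n^3 * (1/bx + 1/by), which for a fixed block area (the on-chip buffer budget) is minimized by square blocks. The model below is a sketch of that reasoning, not the paper's exact cost function.

```c
#include <assert.h>
#include <math.h>

/* External-memory words loaded for an n x n matrix product dispatched in
 * bx x by output blocks: (n/bx)*(n/by) blocks, each loading bx rows of A
 * and by columns of B (bx*n + by*n words). */
static double ext_traffic(double n, double bx, double by)
{
    return (n / bx) * (n / by) * (bx * n + by * n);
}
```

With the same block area of 1024 work-items, a square 32x32 block moves n^3/16 words, while a 1x1024 one-dimensional strip moves roughly n^3 — about 16x more external traffic.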
Optimization Result

The optimization effectively reduces the off-chip memory bandwidth requirement.
Experiment Setup
• FPGA: Arria 10 GX1150
– 1518 DSPs
– 2713 M20K memory blocks
– 1 GB DDR4 with 17 GB/s bandwidth
• Kernel implemented as an OpenCL IP library package using Verilog

[Photo: Arria 10 GX FPGA Development Kit]
Implementation Result

Resource    Total    Ours    Percentage
DSP         1518     1320    86%
BRAM        2713     1250    46%
Logic       1506k    437k    43%

Frequency: 370 MHz
VGG Performance

Layers    Number of Ops    Duration (ms)    Performance (GOP/s)
Conv      30.69 G          11.9             2568
FC        0.073 G          5.5              13
Total     30.76 G          17.4             1790

Overall VGG classification throughput: 57 images/s
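A quick back-of-envelope check of the table: total ops over total time gives the sustained throughput, and the per-image latency gives frames per second. The helpers below only restate that arithmetic.

```c
#include <assert.h>
#include <math.h>

/* Sustained throughput in GOP/s from total giga-ops and latency in ms. */
static double gops(double giga_ops, double ms)
{
    return giga_ops / (ms * 1e-3);
}

/* Whole images classified per second from per-image latency in ms. */
static int images_per_s(double ms)
{
    return (int)(1000.0 / ms);
}
```

30.76 G ops in 17.4 ms works out to about 1768 GOP/s, close to the reported 1790 GOP/s, and 1000/17.4 ms gives the reported 57 images/s.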
Performance Comparison

                                     Suda et al. [1]    Qiu et al. [2]    Ours
Platform                             Stratix V GSD8     Zynq XC7Z045      Arria 10 GX1150
Performance (GOP/s)                  117.8              136.9             1790
Performance density (OPs/DSP/cycle)  0.36               1.17              3.06
Power efficiency (GOP/J)             1.84               14.22             47.88

[1] N. Suda, et al. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. ISFPGA 2016
[2] J. Qiu, et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. ISFPGA 2016
Summary
• Motivation
• Balance analysis model
• Our work:
– Multicasting among PEs
– 2D scheduling
• Performance comparison

Thanks!
Q&A