![Page 1: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/1.jpg)
Automatic Generation of Efficient Accelerator Designs for Reconfigurable Hardware
Raghu Prabhakar, Stanford University
Pervasive Parallelism Lab
![Page 2: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/2.jpg)
The Team
Stefan Hadjis, Tian Zhao, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun, David Koeplinger, Yaqi Zhang, Matt Feldman
![Page 3: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/3.jpg)
FPGAs in Data Centers
- Increasing interest in the use of FPGAs as application accelerators in data centers

Key advantage: Performance/Watt
![Page 4: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/4.jpg)
Problem #1: Programmability
- Verilog and VHDL are too low level for software developers
- High-level synthesis (HLS) tools need user pragmas to help discover parallelism
- C-based input, pragmas requiring hardware knowledge
- Limited in exploiting data locality
- Difficult to synthesize complex data paths with nested parallelism
![Page 5: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/5.jpg)
Problem #2: Large Design Spaces
- Design spaces grow exponentially with the number of parameters
- Parameters can change runtime by orders of magnitude
- Parameters depend on each other
- Manual exploration is tedious and suboptimal
![Page 6: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/6.jpg)
Our Approach
[Flow diagram, stages as labeled on the slide:]
- Parallel Patterns
- Pattern Transformations: Tiling
- Tiled Parallel Patterns
- Hardware Generation: Metapipeline Analysis
- Spatial
- Design Space Exploration: Latency, Area Estimation
- MaxJ
- Bitstream Generation
- FPGA Configuration
![Page 7: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/7.jpg)
Our Approach
- Generating Configurable Hardware from Parallel Patterns, ASPLOS '16. Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Chris De Sa, Christos Kozyrakis, Kunle Olukotun
- Automatic Generation of Efficient Accelerators for Reconfigurable Hardware, ISCA '16. David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun
![Page 8: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/8.jpg)
![Page 9: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/9.jpg)
Parallel Patterns
- Constructs with special properties with respect to parallelism and memory access

[Figure: schematic of the map, zip, reduce, and groupBy patterns; groupBy shows bins key1, key2, key3.]
![Page 10: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/10.jpg)
Parallel Patterns: Map

    val vectorB = vectorA map { v => v + 1 }

Perform the given function on the i'th element of the input. The result forms the i'th element of the output.

[Figure: +1 applied to each element of 3 8 1 4 2 6 5 1.]
![Page 11: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/11.jpg)
Parallel Patterns: Zip

    val vectorC = vectorA zip (vectorB) { _ + _ }

Perform the given function on the i'th element of the inputs. The result forms the i'th element of the output.

[Figure: + applied pairwise to the elements of 3 8 1 4 2 and 5 9 4 3 2.]
![Page 12: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/12.jpg)
Parallel Patterns: Reduce

    val sum = vectorA reduce { (a,b) => a + b }

Combine the elements of the input using the given function. Note: the function must be associative.

[Figure: tree of + operators combining 3 8 1 4.]
![Page 13: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/13.jpg)
Parallel Patterns: GroupBy

    val bins = vectorA groupBy { v => v/3 }

Group the elements of the given collection based upon the given function.

[Figure: each element of 3 8 1 4 2 6 5 1 is divided by 3 and placed in the bin for key 0, 1, or 2.]
![Page 14: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/14.jpg)
Parallel Patterns: Filter

    val vectorB = vectorA filter {v => v%2 == 1}

Produce a collection for the i'th element of the input. The output is the concatenation of all produced collections.

[Figure: % applied to each element of 3 8 1 4 2 6 5 1; the output is 3 1 5 1.]
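
The five patterns above can be tried with the standard Scala collections. This is a minimal sketch, not the DSL used in the talk: the object name `ParallelPatternsSketch` is hypothetical, the zip operands are an assumption read off the figure residue, and filter is written as a `flatMap` to mirror the slide's "concatenation of produced collections" wording.

```scala
object ParallelPatternsSketch {
  val vectorA: Vector[Int] = Vector(3, 8, 1, 4, 2, 6, 5, 1)

  // map: apply the function to element i; the result is element i of the output.
  val mapped: Vector[Int] = vectorA.map(v => v + 1)

  // zip: combine element i of both inputs (operands assumed from the Zip figure).
  val zipped: Vector[Int] =
    Vector(3, 8, 1, 4, 2).zip(Vector(5, 9, 4, 3, 2)).map { case (a, b) => a + b }

  // reduce: combine all elements with an associative function.
  val sum: Int = Vector(3, 8, 1, 4).reduce((a, b) => a + b)

  // groupBy: bucket elements by the key the function returns (integer division).
  val bins: Map[Int, Vector[Int]] = vectorA.groupBy(v => v / 3)

  // filter as flatMap: each element yields a collection of zero or one items,
  // and the output is the concatenation of those collections.
  val odds: Vector[Int] =
    vectorA.flatMap(v => if (v % 2 == 1) Vector(v) else Vector.empty[Int])
}
```

Because each pattern states how output elements depend on input elements, a compiler can see which iterations are independent without user pragmas.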
![Page 15: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/15.jpg)
K-means: Parallel Patterns

    // Group points by closest centroid:
    val groups = points groupBy { point =>
      // Compute distance for each centroid
      val dists = guesses map { guess =>
        guess.zip(point){ (a,b) => (a - b)**2 } reduce { (a,b) => a + b }
      }
      // Find the index of the closest centroid
      (0 until dists.length) reduce { (i,j) =>
        if (dists(i) < dists(j)) i else j
      }
    }
    // Average each group
    val newKmeans = groups map { g =>
      val sum = g reduce { (v1,v2) => v1.zip(v2){ (a,b) => a + b } }
      sum map { a => a / g.size }
    }
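
The slide's code is written in a pattern DSL. A rough equivalent on plain Scala collections, assuming points are `Vector[Double]` and the centroid list is non-empty, might look like the sketch below. The names `KMeansStepSketch`, `sqDist`, `closest`, and `step` are hypothetical.

```scala
object KMeansStepSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance: zip coordinates, square the differences, reduce with +.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.reduce(_ + _)

  // Index of the closest centroid, found with a reduce over indices as on the slide.
  def closest(point: Point, guesses: Vector[Point]): Int = {
    val dists = guesses.map(guess => sqDist(guess, point))
    dists.indices.reduce((i, j) => if (dists(i) < dists(j)) i else j)
  }

  // One k-means step: group points by closest centroid, then average each group.
  // Centroids that attract no points are absent from the result.
  def step(points: Vector[Point], guesses: Vector[Point]): Map[Int, Point] = {
    val groups = points.groupBy(p => closest(p, guesses))
    groups.map { case (idx, g) =>
      val sum = g.reduce((v1, v2) => v1.zip(v2).map { case (a, b) => a + b })
      idx -> sum.map(_ / g.size)
    }
  }
}
```

For example, with points (0,0), (0,2), (10,10) and guesses (1,1), (9,9), the first two points average to (0,1) and the third stays at (10,10).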
![Page 16: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/16.jpg)
![Page 17: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/17.jpg)
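Design Space Example: Dot Product

Algorithm: Dot Product of Vectors A and B

[Diagram: vectors A and B reside in DRAM; on the FPGA, scratchpads Tile A and Tile B feed a × and + datapath that accumulates into register acc. Key: scratchpad, register, op.]

Small and simple, but slow!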
![Page 18: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/18.jpg)
Important Parameters: Tile Sizes

Algorithm: Dot Product of Vectors A and B

- Increases length of DRAM accesses (affects runtime)
- Increases exploited spatial locality (affects runtime)
- Increases local memory sizes (affects area)

[Diagram: same dot product datapath, with the scratchpads Tile A and Tile B highlighted. Key: scratchpad, register, op.]
![Page 19: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/19.jpg)
Important Parameters: Pipelining

Algorithm: Dot Product of Vectors A and B

- Overlaps memory and compute (affects runtime)
- Increases local memory sizes (affects area)
- Adds synchronization logic (affects area)

[Diagram: same dot product datapath split into Stage 1 and Stage 2, with Tile A and Tile B drawn as double buffers. Key: double buffer, register, op.]
![Page 20: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/20.jpg)
Important Parameters: Parallelization

Algorithm: Dot Product of Vectors A and B

- Improves element throughput (affects runtime)
- Duplicates compute resources (affects area)

[Diagram: Tile A and Tile B feed several × units whose products are summed by a tree of + units into acc. Key: scratchpad, register, op.]
![Page 21: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/21.jpg)
Hardware Language Requirements

Tools compared: VHDL, Verilog, LegUp, Vivado HLS, OpenCL SDK, Aladdin, Spatial

Requirements:
- Targets FPGAs
- Enables pipelining at arbitrary loop levels
- Exposes design parameters to the compiler
- Evaluates designs prior to synthesis
- Explores design space automatically
- Generates synthesizable code
![Page 22: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/22.jpg)
![Page 23: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/23.jpg)
![Page 24: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/24.jpg)
![Page 25: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/25.jpg)
![Page 26: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/26.jpg)
The Spatial Language
- Includes a variety of parameterized templates
  - Parallel patterns with implicit parallelization factors
  - Pipeline constructs for pipelining at arbitrary levels
  - Explicit size parameters for loop step size and buffer sizes
- All parameters are exposed to the compiler
- Compiler includes latency and area models for quick design evaluation
- Compiler automatically explores the design space
- Generates synthesizable MaxJ HGL after exploration
![Page 27: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/27.jpg)
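Dot Product in Spatial: Diagram

[Diagram: A and B in DRAM are loaded into Tile A and Tile B; an inner reduce (×, +) and an outer reduce (+) produce out. Annotated design parameters: parallelism factor #1, pipelining toggle, tile size (B), parallelism factor #2, parallelism factor #3.]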
![Page 28: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/28.jpg)
Dot Product in Spatial

    val output = vectorA * vectorB // User's code

    val output = Reg[Float]
    val vectorA = OffChipMem[Float](N)
    val vectorB = OffChipMem[Float](N)

    Reduce(N by B)(output){ i =>
      val tileA = Scratchpad[Float](B)
      val tileB = Scratchpad[Float](B)
      val acc = Reg[Float]
      tileA load vectorA(i :: i+B)
      tileB load vectorB(i :: i+B)
      Reduce(B by 1)(acc){ j =>
        tileA(j) * tileB(j)
      }{a, b => a + b}
    }{a, b => a + b}

Design parameters annotated on the slide: parallelism factor #1, pipelining toggle, tile size (B), parallelism factor #2, parallelism factor #3.
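
To see what the two nested `Reduce` controllers compute, the following plain Scala sketch emulates them in software. It is not Spatial code: the object name `TiledDotProductSketch` is hypothetical, arrays stand in for off-chip memory and scratchpads, and it assumes the tile size `B` divides the vector length.

```scala
object TiledDotProductSketch {
  // Software model of the two nested Reduce controllers; B is the tile size.
  def dot(a: Array[Float], b: Array[Float], B: Int): Float = {
    require(a.length == b.length && B > 0 && a.length % B == 0)
    var output = 0.0f                       // models the register `output`
    for (i <- 0 until a.length by B) {      // outer Reduce(N by B)
      val tileA = a.slice(i, i + B)         // models `tileA load vectorA(i :: i+B)`
      val tileB = b.slice(i, i + B)         // models `tileB load vectorB(i :: i+B)`
      var acc = 0.0f                        // models the register `acc`
      for (j <- 0 until B)                  // inner Reduce(B by 1)
        acc += tileA(j) * tileB(j)
      output += acc                         // outer combine function: a + b
    }
    output
  }
}
```

Changing `B` here changes only how the work is chunked, not the result, which is why tile size can be left to the compiler as a tunable design parameter.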
![Page 29: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/29.jpg)
Spatial Design Parameters

| Type | Example | Description | Parameter |
|---|---|---|---|
| Primitives | +, -, *, / ... | Basic math, logic, and control | Vector width |
| Primitives | Scratchpad Load, Scratchpad Store | Load/store from on-chip memories | Vector width, stride |
| Memories | OffChipMem | N-dimensional off-chip array | Dimensions |
| Memories | Scratchpad | On-chip scratchpad | Size, buffering, banking |
| Memories | Reg | Accumulator register | Buffering |
| Controllers | Counter | Loop indices | Parallelization, pattern |
| Controllers | Pipe | Pipelined inner-loop body | Parallelization, pattern |
| Controllers | MetaPipe | Coarse-grained pipeline | Parallelization, pattern |
| Data Transfer | TileLoad, TileStore | Load/store from off-chip arrays | Tile size, load rate |
![Page 30: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/30.jpg)
Spatial Enables Fast DSE

[Diagram relating the following boxes: Spatial Program; Parameterized Templates; Concise IR; Simple Linear Models; Easily Derived Space Constraints; Fast Estimation (no unrolling, no scheduling); Space Pruning; Smaller Spaces; Fast Design Space Exploration.]
![Page 31: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/31.jpg)
Latency Modeling
- Analytical model
  - Uses depth-first search to get the critical path of pipelines
  - Accurate estimation requires data size annotations
- Main-memory model
  - Mathematical model fit to observed runtimes
  - Parameterized by:
    - Number of contending readers/writers
    - Number of commands issued in sequence
    - Command length
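
The critical-path idea can be illustrated with a memoized depth-first search over a small dataflow graph. This is only a sketch under assumptions: the graph encoding, node names, and latencies below are hypothetical and are not taken from the talk's compiler.

```scala
object CriticalPathSketch {
  // Hypothetical dataflow graph: node -> (latency in cycles, successor nodes).
  type Graph = Map[String, (Int, List[String])]

  // Longest latency path starting at `node`, via memoized depth-first search.
  // Assumes the graph is acyclic.
  def longestFrom(g: Graph, node: String,
                  memo: scala.collection.mutable.Map[String, Int]): Int =
    memo.get(node) match {
      case Some(v) => v
      case None =>
        val (latency, succs) = g(node)
        val tail = succs.map(s => longestFrom(g, s, memo)).foldLeft(0)((x, y) => math.max(x, y))
        val best = latency + tail
        memo(node) = best
        best
    }

  // Critical path: the longest path starting from any node. Assumes a non-empty graph.
  def criticalPath(g: Graph): Int = {
    val memo = scala.collection.mutable.Map.empty[String, Int]
    g.keys.map(n => longestFrom(g, n, memo)).max
  }

  // Inner body of a dot product with made-up latencies.
  val example: Graph = Map(
    "ldA" -> (1, List("mul")),
    "ldB" -> (1, List("mul")),
    "mul" -> (3, List("add")),
    "add" -> (2, Nil)
  )
}
```

In this example the critical path is load, multiply, add, for 1 + 3 + 2 = 6 cycles.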
![Page 32: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/32.jpg)
Area Modeling
- Analytical model
  - Simple summation of the area of each template
  - Includes estimates for delay lines, banked memories
- Neural network models
  - Model routing costs and memory duplication
  - Simple, 3-layer networks suffice here (we use 11-6-1)
  - Trained on a set of about 200 characterization designs
- Total area = analytical area + neural net area
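
The structure "total area = analytical area + neural net area" can be sketched as a sum over templates plus the output of an 11-6-1 feed-forward network. Everything specific here is an assumption for illustration: the tanh hidden activation, the linear output, the feature vector, and the weights are hypothetical, and the names `AreaModelSketch`, `dense`, and `neuralArea` are not from the talk.

```scala
object AreaModelSketch {
  // Analytical part: simple summation of per-template area estimates.
  def analyticalArea(templateAreas: Seq[Double]): Double = templateAreas.sum

  // One dense layer: out(i) = act(sum_j w(i)(j) * in(j) + b(i)).
  def dense(in: Vector[Double], w: Vector[Vector[Double]], b: Vector[Double],
            act: Double => Double): Vector[Double] =
    w.zip(b).map { case (row, bias) =>
      act(row.zip(in).map { case (x, y) => x * y }.sum + bias)
    }

  // 11-6-1 network: 11 input features, 6 hidden units, 1 output.
  def neuralArea(features: Vector[Double],
                 w1: Vector[Vector[Double]], b1: Vector[Double],
                 w2: Vector[Vector[Double]], b2: Vector[Double]): Double = {
    require(features.length == 11 && w1.length == 6 && w2.length == 1)
    val hidden = dense(features, w1, b1, x => math.tanh(x)) // assumed activation
    dense(hidden, w2, b2, x => x).head                      // assumed linear output
  }

  // Total area = analytical area + neural net area.
  def totalArea(templateAreas: Seq[Double], nnArea: Double): Double =
    analyticalArea(templateAreas) + nnArea
}
```

The split keeps the easily summed part exact and leaves only the hard-to-predict residual (routing, duplication) to the learned model.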
![Page 33: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/33.jpg)
Evaluation
- Accuracy: How accurate are the models, compared to observations?
![Page 34: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/34.jpg)
Experimental Setup
- Board:
  - Altera Stratix V
  - 48 GB DDR3 DRAM, 6 memory channels
  - Board connected to host via PCI-e
- Execution time = FPGA execution time
  - Does not include CPU ↔ FPGA communication or configuration time
![Page 35: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/35.jpg)
Results: Model Accuracy (Area)

[Figure: bar chart of resource usage (%) for ALMs, BRAMs, and DSPs, model vs. synthesized, for dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm.]

Area models follow important trends and are accurate enough to drive automatic design space exploration.
![Page 36: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/36.jpg)
Results: Model Accuracy (Latency)

Latency models follow important trends and are accurate enough to drive automatic design space exploration.

| Benchmark | Average Error (%) |
|---|---|
| dotproduct | 2.8% |
| outerprod | 1.3% |
| tpchq6 | 3.1% |
| blackscholes | 3.4% |
| gda | 6.7% |
| kmeans | 7% |
| gemm | 18.4% |
![Page 37: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/37.jpg)
Evaluation
- Accuracy: How accurate are the models, compared to observations?
- Speed: How fast are the predictions, compared to commercial tools?
![Page 38: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/38.jpg)
Results: Prediction Speed

DHDL:

| Benchmark | Designs | Search Time |
|---|---|---|
| Dot Product | 5,426 | 5.3 ms/design |
| Outer Product | 1,702 | 30 ms/design |
| TPCHQ6 | 5,426 | 8.2 ms/design |
| Blackscholes | 572 | 27 ms/design |
| Matrix Multiply | 70,740 | 11 ms/design |
| K-Means | 75,200 | 20 ms/design |
| GDA | 42,800 | 17 ms/design |

Vivado HLS:

| Benchmark | Designs | Search Time |
|---|---|---|
| GDA | 250 | 1.85 min/design |
![Page 39: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/39.jpg)
Results: Prediction Speed

6533x speedup over HLS!
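
As a sanity check on the headline number, the per-design search times reported for GDA (1.85 min/design with Vivado HLS, 17 ms/design with DHDL) can be divided directly. The slide's figures are rounded, so this only reproduces the order of magnitude, not the exact 6533x:

$$
\frac{1.85\ \text{min}}{17\ \text{ms}}
= \frac{1.85 \times 60 \times 1000\ \text{ms}}{17\ \text{ms}}
= \frac{111{,}000}{17}
\approx 6.5 \times 10^{3}
$$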
![Page 40: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/40.jpg)
Evaluation
- Accuracy: How accurate are the models, compared to observations?
- Speed: How fast are the predictions, compared to commercial tools?
- Space: Do the design parameters help capture an interesting space?
![Page 41: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/41.jpg)
Results: GDA Design Space

[Figure: three scatter plots of cycles (log scale, 10^7 to 10^10) against resource usage (% of maximum) for ALMs, DSPs, and BRAMs. Legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point.]
![Page 42: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/42.jpg)
Results: GDA Design Space

- Performance limited by available BRAMs
- Space for GDA spans four orders of magnitude
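
The legend distinguishes valid, invalid, and Pareto-optimal design points. A minimal sketch of how such a front could be picked out of estimated (area, cycles) pairs is shown below. It simplifies the slide's three resources into a single `area` fraction, and the names `ParetoSketch`, `Design`, `dominates`, and `paretoFront` are hypothetical.

```scala
object ParetoSketch {
  // A design point: estimated resource usage (fraction of the device) and cycle count.
  final case class Design(name: String, area: Double, cycles: Long)

  // d1 dominates d2 if it is no worse in both metrics and strictly better in at least one.
  def dominates(d1: Design, d2: Design): Boolean =
    d1.area <= d2.area && d1.cycles <= d2.cycles &&
      (d1.area < d2.area || d1.cycles < d2.cycles)

  // Keep valid designs (those that fit on the device) that no other valid design dominates.
  def paretoFront(designs: Seq[Design]): Seq[Design] = {
    val valid = designs.filter(_.area <= 1.0)
    valid.filter(d => !valid.exists(o => dominates(o, d)))
  }
}
```

This pairwise check is quadratic in the number of designs, which is acceptable at the tens of thousands of points reported in the talk only because each estimate itself is cheap.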
![Page 43: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/43.jpg)
Evaluation
- Accuracy: How accurate are the models, compared to observations?
- Speed: How fast are the predictions, compared to commercial tools?
- Space: Do the design parameters help capture an interesting space?
- Performance: How good is the best generated design?
![Page 44: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/44.jpg)
Evaluation: Multi-Core Comparison
- FPGA
  - Altera Stratix V (28 nm)
  - 150 MHz clock
  - Peak main memory bandwidth of 37.5 GB/sec
- Multi-core CPU
  - Intel Xeon E5-2630 (32 nm)
  - 2.3 GHz
  - Peak main memory bandwidth of 42.6 GB/sec
  - 6 cores, 6 threads
  - Multi-threaded C++ code generated from Delite
- Execution time = FPGA execution time
  - Does not include CPU ↔ FPGA communication or configuration time
![Page 45: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/45.jpg)
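Results: Comparison with Multi-Core

| Benchmark | Speedup |
|---|---|
| dotproduct | 1.07 |
| outerprod | 2.42 |
| tpchq6 | 1.11 |
| blackscholes | 16.73 |
| gda | 4.55 |
| kmeans | 1.15 |
| gemm | 0.10 |

The chart groups the benchmarks into memory-bound and compute-bound. Gemm uses multi-threaded OpenBLAS on the CPU.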
![Page 46: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/46.jpg)
Summary
- Tiling and metapipelining capture locality and nested parallelism
- Spatial captures a large design space
- Fast, accurate estimators and DSE tools enable rapid design space exploration
- Up to 16.7x speedup over multi-core CPU benchmarks
- Up to 6533x faster DSE compared to Vivado HLS
![Page 47: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/47.jpg)
![Page 48: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/48.jpg)
![Page 49: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/49.jpg)
Tiling Transformation 1: Strip Mining
- Transforms a single pattern into a set of nested patterns
- Strip mined patterns enable computation reordering
- Insert copies for predictable accesses to enhance locality
  - Copies guide creation of on-chip buffers

| Parallel Patterns | Strip Mined Patterns |
|---|---|
| Map(D)(f) | Map(D/B) Map(B)(f') |
| GroupBy(D)(k)(v) | GroupBy(D/B)(k') GroupBy(B)(k')(v') |
| FlatMap(D)(f) | FlatMap(D/B) FlatMap(B)(f') |
| x(i) | x.copy(i to i+B); x(ii) |
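
The first table row, `Map(D)(f)` becoming a `Map(D/B)` over tiles with an inner `Map(B)`, can be checked in plain Scala. This is a sketch only: `StripMiningSketch` and its functions are hypothetical names, `slice` stands in for the slide's `x.copy(i to i+B)`, and `B` is assumed to divide the input length.

```scala
object StripMiningSketch {
  // Original pattern: Map(D)(f) over the whole input.
  def mapFlat(x: Vector[Int], f: Int => Int): Vector[Int] = x.map(f)

  // Strip-mined: outer pattern over D/B tiles, inner Map(B) over a copied tile.
  def mapStripMined(x: Vector[Int], f: Int => Int, B: Int): Vector[Int] = {
    require(B > 0 && x.length % B == 0)
    (0 until x.length by B).toVector.flatMap { i =>
      val tile = x.slice(i, i + B)                 // models x.copy(i to i+B), an on-chip buffer
      (0 until B).toVector.map(ii => f(tile(ii)))  // inner Map(B) reads x(ii) from the tile
    }
  }
}
```

Both versions return the same vector; the strip-mined form simply makes the tile an explicit value that hardware generation can map to a scratchpad.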
![Page 50: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/50.jpg)
Tiling Transformation 2: Interchange
- Reorder nested patterns
- Move 'copy' operations out toward outer pattern(s)
  - Improves locality and reuse of on-chip memory

Strip Mined Patterns:

    multiFold(m/b0,n/b1){ii,jj =>
      xTl = x.copy(b0+ii, b1+jj)
      ((ii,jj), map(b0,b1){i,j =>
        multiFold(p/b2){kk =>
          yTl = y.copy(b1+jj, b2+kk)
          (0, multiFold(b2){ k =>
            (0, xTl(i,j)* yTl(j,k))
          }{(a,b) => a + b})
        }{(a,b) => a + b}
      })
    }

Interchanged Patterns:

    multiFold(m/b0,n/b1){ii,jj =>
      xTl = x.copy(b0+ii, b1+jj)
      ((ii,jj), multiFold(p/b2){kk =>
        yTl = y.copy(b1+jj, b2+kk)
        (0, map(b0,b1){i,j =>
          (0, multiFold(b2){ k =>
            (0, xTl(i,j)* yTl(j,k))
          }{(a,b) => a + b})
        })
      }{(a,b) => map(b0,b1){i,j =>
        a(i,j) + b(i,j) }})
    }
![Page 51: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/51.jpg)
Metapipelining
- Coarse-grained pipelining: a "pipeline of pipelines"
  - Exploits nested parallelism
- Uses asynchronous handshaking signals between stages
  - Allows stages to have variable execution times
  - Does not require complete unrolling of inner patterns
  - No need to calculate initiation interval (II) statically
- Intermediate data between stages stored in double buffers
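
The benefit of overlapping stages can be estimated with a small timing model. The sketch below is an assumption-laden illustration, not the talk's latency model: stage latencies are fixed and hypothetical, and a stage is allowed to start an iteration as soon as it has finished its previous iteration and the preceding stage has produced its input (back-pressure from finite buffers is ignored).

```scala
object MetapipelineTimingSketch {
  // Sequential execution: every iteration runs all stages back to back.
  def sequentialCycles(stageCycles: Vector[Long], iterations: Int): Long =
    stageCycles.sum * iterations

  // Pipelined execution: stage s of iteration i starts once stage s has finished
  // iteration i-1 and stage s-1 has finished iteration i.
  def pipelinedCycles(stageCycles: Vector[Long], iterations: Int): Long = {
    val finish = Array.fill(stageCycles.length)(0L) // finish time of each stage's latest iteration
    for (_ <- 0 until iterations) {
      var prevStageDone = 0L
      for (s <- stageCycles.indices) {
        val start = math.max(finish(s), prevStageDone)
        finish(s) = start + stageCycles(s)
        prevStageDone = finish(s)
      }
    }
    finish.lastOption.getOrElse(0L)
  }
}
```

With four stages of 10, 20, 5, and 10 cycles and four iterations, sequential execution takes 180 cycles while the pipelined schedule takes 105, since the 20-cycle stage sets the steady-state rate.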
![Page 52: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/52.jpg)
Metapipelining – Intuition

    map(N) { r =>
      row = matrix.slice(r)
      diff = map(D) { i =>
        row(i) - sub(i)
      }
      vprod = map(D,D) { (i,j) =>
        diff(i) * diff(j)
      }
      vprod
    }
![Page 53: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/53.jpg)
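Metapipelining – Intuition

    map(N) { r =>
      row = matrix.slice(r)
      diff = map(D) { i =>
        row(i) - sub(i)
      }
      vprod = map(D,D) { (i,j) =>
        diff(i) * diff(j)
      }
      vprod
    }

[Diagram: the loop body drawn as four pipes for one value of r. Pipe 1: TileMemController, associated with row. Pipe 2: ld, ld, -, st, with inputs row and sub and output diff. Pipe 3: ld, ld, *, st, with output vprod. Pipe 4: TileMemController, associated with vprod.]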
![Page 54: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/54.jpg)
Metapipelining – Intuition

Metapipeline – 4 stages

    map(N) { r =>
      row = matrix.slice(r)
      diff = map(D) { i =>
        row(i) - sub(i)
      }
      vprod = map(D,D) { (i,j) =>
        diff(i) * diff(j)
      }
      vprod
    }

[Diagram: the same four pipes (Pipe 1 to Pipe 4) drawn twice, for two different values of r, with row, diff, and vprod passed between stages.]
![Page 55: Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin](https://reader030.vdocuments.site/reader030/viewer/2022040818/5e62aa06b1dccd60606718de/html5/thumbnails/55.jpg)
K-means: Generated Hardware

[Diagram: generated k-means datapath. Labeled blocks: samples Tile Load; kmeans Tile Load; Vector Dist (Norm) units; Scalar Dist (Tree +); (MinDist, Idx); +, Inc, and / operators; New kmeans Tile Store. Labeled buffers: kmeans block buffer, samples block double buffer, minIdx double buffer, sum buffer, count buffer, new kmeans double buffer.]

1. Load kmeans
2. Metapipeline: Calculate sum and count
3. Metapipeline: Calculate new kmeans, store results

Similar to (and more general than) hand-written designs [1]

[1] Hussain et al., "FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data", AHS 2011