![Page 1: Generating Configurable Hardware From Parallel Patternscsl.stanford.edu/~christos/publications/2016.delitehw.asplos.pdf · Problem: Programmability n Verilog and VHDL too low level](https://reader030.vdocuments.site/reader030/viewer/2022040915/5e8d37a23e9fc74883648126/html5/thumbnails/1.jpg)
Generating Configurable Hardware From Parallel Patterns
Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Chris De Sa, Christos Kozyrakis, Kunle Olukotun
Stanford University
ASPLOS 2016
Motivation
- Increasing interest in using FPGAs as accelerators
  - Key advantage: performance per watt
- Key domains:
  - Big data analytics, image processing, financial analytics, scientific computing, search
Problem: Programmability
- Verilog and VHDL too low level for software developers
- High-level synthesis (HLS) tools need user pragmas to help discover parallelism
  - C-based input, pragmas requiring hardware knowledge
  - Limited in exploiting data locality
  - Difficult to synthesize complex data paths with nested parallelism
Hardware Design - HLS
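Add 512 integers stored in external DRAM

```c
// Function name is missing in the source; "sum_module" is assumed
// from the "Sum Module" label in the slide's diagram.
void sum_module(int* mem) {
  mem[512] = 0;
  for (int i = 0; i < 512; i++) {
    mem[512] += mem[i];   // every iteration reads and writes DRAM
  }
}
```

[Figure: Sum Module connected to DRAM]

27,236 clock cycles for computation. Two orders of magnitude too long!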
Optimized Design - HLS
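```c
#include <string.h>

// MPort is a wide type matching the width of the DRAM controller interface.
#define CHUNKSIZE (sizeof(MPort)/sizeof(int))
#define LOOPCOUNT (512/CHUNKSIZE)

// Function name is missing in the source; "sum_module" is assumed.
void sum_module(MPort* mem) {
  MPort buff[LOOPCOUNT];
  memcpy(buff, mem, LOOPCOUNT * sizeof(MPort));   // burst access into a local buffer
  int sum = 0;                                    // local variable instead of DRAM
  for (int i = 0; i < LOOPCOUNT; i++) {
#pragma PIPELINE
    for (int j = 0; j < CHUNKSIZE; j++) {
#pragma UNROLL
      // bit shifting extracts the j-th int from the wide word
      sum += (int)(buff[i] >> (j * sizeof(int) * 8));
    }
  }
  ((int*)mem)[512] = sum;
}
```

302 clock cycles for computation

Callouts on the code:
- Width of DRAM controller interface
- Burst access
- Use local variable
- Special compiler directives
- Loop restructuring
- Bit shifting to extract individual elements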
So, we need to ...
- Use higher-level abstractions
  - Productivity: developer focuses on application
  - Performance:
    - Capture locality to reduce off-chip memory traffic
    - Exploit parallelism at multiple nesting levels
- Smart compiler generates efficient hardware
Parallel Patterns
- Constructs with special properties with respect to parallelism and memory access

[Figure: the four patterns map, zip, reduce, and groupBy; the groupBy panel shows buckets key1, key2, key3]
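
As a rough illustration of the four patterns named on this slide, the sketch below uses plain Python built-ins rather than the Delite-based language from the talk. The helper `group_by` and the sample data are assumptions made for this example.

```python
from functools import reduce

def group_by(xs, key):
    """Partition xs into buckets keyed by key(x), preserving input order."""
    buckets = {}
    for x in xs:
        buckets.setdefault(key(x), []).append(x)
    return buckets

xs = [1, 2, 3, 4]
ys = [10, 20, 30, 40]

doubled = list(map(lambda x: 2 * x, xs))      # map: one output per input element
sums = [a + b for a, b in zip(xs, ys)]        # zip: combine two collections element-wise
total = reduce(lambda a, b: a + b, xs)        # reduce: associative combine to one value
by_parity = group_by(xs, lambda x: x % 2)     # groupBy: bucket elements by a key
```

Each pattern fixes how iterations may run in parallel and how memory is accessed, which is the property the compiler relies on.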
Why Parallel Patterns?
- Concise
- Can express large class of workloads in the machine learning and data analytics domain
- Captures rich semantic information about parallelism and memory access patterns
- Enables powerful transformations using pattern matching and rewrite rules
- Enables generating efficient code for different architectures
Parallel Pattern Language
- A data-parallel language that supports parallel patterns
- Example application: k-means

```scala
// Compute closest mean for each 'sample'
// 1. Compute distance with each mean
// 2. Select the mean with shortest distance
val clusters = samples groupBy { sample =>
  val dists = kMeans map { mean =>
    mean.zip(sample){ (a,b) => sq(a - b) } reduce { (a,b) => a + b }
  }
  Range(0, dists.length) reduce { (i,j) =>
    if (dists(i) < dists(j)) i else j
  }
}

// Compute average of each cluster
// 1. Compute sum of all assigned points
// 2. Compute number of assigned points
// 3. Divide each dimension of sum by count
val newKmeans = clusters map { e =>
  val sum = e reduce { (v1,v2) => v1.zip(v2){ (a,b) => a + b } }
  val count = e map { v => 1 } reduce { (a,b) => a + b }
  sum map { a => a / count }
}
```
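
For readers who want to run the k-means step outside the pattern language, here is a minimal plain-Python rendering of the same two phases. It is a sketch, not the authors' implementation: the function names `closest_mean` and `kmeans_step` are made up for this example, and clusters that receive no samples are simply absent from the result.

```python
from functools import reduce

def sq(x):
    return x * x

def closest_mean(sample, means):
    """Index of the mean with the smallest squared distance to sample."""
    dists = [reduce(lambda a, b: a + b,
                    (sq(a - b) for a, b in zip(mean, sample)))
             for mean in means]
    # Same selection rule as the slide: keep i only if strictly closer than j.
    return reduce(lambda i, j: i if dists[i] < dists[j] else j,
                  range(len(dists)))

def kmeans_step(samples, means):
    """One iteration: group samples by closest mean, then average each group."""
    clusters = {}
    for s in samples:
        clusters.setdefault(closest_mean(s, means), []).append(s)
    new_means = {}
    for idx, members in clusters.items():
        total = reduce(lambda v1, v2: [a + b for a, b in zip(v1, v2)], members)
        count = len(members)
        new_means[idx] = [a / count for a in total]
    return new_means
```

Calling `kmeans_step(samples, means)` with lists of equal-length coordinate lists returns a dictionary from cluster index to the new mean.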
Our Approach

[Figure: compilation flow]
- Parallel Patterns
- Pattern Transformations: Fusion, Pattern Tiling, Code Motion
- Tiled Parallel Pattern IR
- Hardware Generation: Memory Allocation, Template Selection, Metapipeline Analysis
- MaxJ HGL
- Bitstream Generation
- FPGA Configuration
Annotations on the flow:
- High-level parallel patterns help productivity
- Data locality improved with parallel pattern tiling transformations
- Nested parallelism exploited with hierarchical pipelines and double buffers
- Generate MaxJ to generate VHDL
- Diagram label: Delite
Parallel Pattern Tiling: MultiFold
- Tiling using polyhedral analysis limits data access patterns to affine functions of loop indices
- Current parallel patterns cannot represent tiling
- New parallel pattern describes tiled computation

[Figure: multiFold over input tiles tile0, tile1, tile2, tile3 producing output tiles out_tile0 and out_tile1]
[Figure: input tiles tile0 to tile3 and output tiles out_tile0 and out_tile1, with panels labeled map, reduce, and groupBy key]
kMeans: Untiled
- Data dependent (non-affine) access to 'sum' and 'count'
- Lots of data locality
- Typically, n >> k

[Figure: data structures samples, kMeans, minDist, minDistIdx, sum, count, and newKmeans, with dimension labels n, k, and d]

kMeans #reads: n * k * d
Parallel Pattern Strip Mining
- Transform parallel pattern → nested patterns
- Strip mined patterns enable computation reordering
- Insert copies to enhance locality
  - Copies guide creation of on-chip buffers

Parallel patterns:

```scala
map(d){i => 2*x(i)}
```

Strip mined patterns:

```scala
multiFold(d/b){ii =>
  xTile = x.copy(b + ii)
  (i, map(b){i => 2*xTile(i)
  })
}
```
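
The strip-mining rewrite can be checked on a small example. The sketch below is a simplified model, not the paper's definition of multiFold: `multi_fold`, `untiled`, and `strip_mined` are hypothetical names, the tile offset is taken to be `ii * b`, and the tile size is assumed to divide the input length.

```python
def multi_fold(num_iters, init, body, combine):
    """Simplified multiFold: each iteration yields (location, value) and
    combine merges that value into the accumulator at the location."""
    acc = init
    for ii in range(num_iters):
        loc, val = body(ii)
        acc = combine(acc, loc, val)
    return acc

def untiled(x):
    # Corresponds to map(d){i => 2*x(i)}
    return [2 * xi for xi in x]

def strip_mined(x, b):
    d = len(x)
    assert d % b == 0, "sketch assumes the tile size divides d"

    def body(ii):
        x_tile = x[ii * b:(ii + 1) * b]                      # copy: models an on-chip buffer
        return ii * b, [2 * x_tile[i] for i in range(b)]     # inner map over one tile

    def write_tile(acc, loc, tile):
        acc[loc:loc + len(tile)] = tile                      # place the tile in the output
        return acc

    return multi_fold(d // b, [0] * d, body, write_tile)
```

Both versions compute the same result; the tiled one touches `x` only through the explicit tile copy, which is the hook a compiler can use to allocate a buffer.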
Strip Mining: kMeans

[Figure: samples (n) and kMeans (k) are accessed through tiles samplesBlock (size bs) and kMeansBlock (size bk); other structures shown: minDist, minDistIdx, sum, count]

kMeans #reads: n * k * d
Parallel Pattern Interchange
- Reorder nested patterns
  - Move 'copy' operations out toward outer pattern(s)
  - Improves locality and reuse of on-chip memory

Strip mined patterns:

```scala
multiFold(m/b0,n/b1){ii,jj =>
  xTl = x.copy(b0+ii, b1+jj)
  ((ii,jj), map(b0,b1){i,j =>
    multiFold(p/b2){kk =>
      yTl = y.copy(b1+jj, b2+kk)
      (0, multiFold(b2){ k =>
        (0, xTl(i,j)* yTl(j,k))
      }{(a,b) => a + b})
    }{(a,b) => a + b}
  })
}
```

Interchanged patterns:

```scala
multiFold(m/b0,n/b1){ii,jj =>
  xTl = x.copy(b0+ii, b1+jj)
  ((ii,jj), multiFold(p/b2){kk =>
    yTl = y.copy(b1+jj, b2+kk)
    (0, map(b0,b1){i,j =>
      (0, multiFold(b2){ k =>
        (0, xTl(i,j)* yTl(j,k))
      }{(a,b) => a + b})
    })
  }{(a,b) => map(b0,b1){i,j =>
    a(i,j) + b(i,j) }})
}
```
Pattern Interchange: kMeans

[Figure: same structures as the strip-mined version: samples (n) with samplesBlock (bs), kMeans (k) with kMeansBlock (bk), minDist, minDistIdx, sum, count]

kMeans #reads: (n / bs) * k * d
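
The read counts quoted on the kMeans slides can be reproduced by counting loop iterations. The sketch below is an illustration under simple assumptions (every coordinate read counts as one read, and `bs` divides `n`); the function names are hypothetical.

```python
def kmeans_reads_untiled(n, k, d):
    """Reads of kMeans when every sample scans all k means from off-chip memory."""
    reads = 0
    for _ in range(n):          # each sample
        for _ in range(k):      # each mean
            reads += d          # d coordinates per distance computation
    return reads

def kmeans_reads_interchanged(n, k, d, bs):
    """Reads of kMeans when the means are loaded once per block of bs samples."""
    assert n % bs == 0, "sketch assumes bs divides n"
    reads = 0
    for _ in range(n // bs):    # each samplesBlock
        reads += k * d          # all means loaded once, then reused by bs samples
    return reads
```

For example, with n = 1024, k = 8, d = 4, and bs = 64, the untiled count is 32768 and the interchanged count is 512, a reduction by the factor bs.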
Template Selection

| Memories | Description | IR Construct |
| --- | --- | --- |
| Buffer | Scratchpad memory | Statically sized array |
| Double buffer | Buffer coupling two stages in a metapipeline | Metapipeline |
| Cache | Tagged memory exploits locality in random accesses | Non-affine accesses |

| Pipe. Exec. Units | Description | IR Construct |
| --- | --- | --- |
| Vector | SIMD parallelism | Map over scalars |
| Reduction tree | Parallel reduction of associative operations | MultiFold over scalars |
| Parallel FIFO | Buffer ordered outputs of dynamic size | FlatMap over scalars |
| CAM | Fully associative key-value store | GroupByFold over scalars |

| Controllers | Description | IR Construct |
| --- | --- | --- |
| Sequential | Coordinates sequential execution | Sequential IR node |
| Parallel | Coordinates parallel execution | Independent IR nodes |
| Metapipeline | Execute nested parallel patterns in a pipelined fashion | Outer parallel pattern with multiple inner patterns |
| Tile memory | Fetch tiles of data from off-chip memory | Transformer-inserted array copy |
Metapipelining
- Hierarchical pipeline: a "pipeline of pipelines"
  - Exploits nested parallelism
- Inner stages could be other nested patterns or combinational logic
  - Does not require iteration space to be known statically
  - Does not require complete unrolling of inner patterns
- Intermediate data from each stage automatically stored in double buffers
  - Allows stages to have variable execution times
- No need to calculate initiation interval (II)
  - Use asynchronous control signals to begin next iteration
Metapipeline – Intuition

```scala
map(N) { r =>
  row = matrix.slice(r)
  diff = map(D) { i =>
    row(i) - sub(i)
  }
  vprod = map(D,D) { (i,j) =>
    diff(i) * diff(j)
  }
}
```

[Figure: Metapipeline – 4 stages. Pipe 1: TileMemController loads row. Pipe 2: diff (ld, ld, -, st) from row and sub. Pipe 3: vprod (ld, ld, *, st) from diff. Pipe 4: TileMemController stores vprod. The stages work on successive values of r at the same time.]
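
To make the benefit of overlapping stages concrete, the following sketch compares sequential and metapipelined execution times. It is a simplified timing model, not taken from the talk: stage latencies are fixed, the cycle counts are invented, and back-pressure from the finite double buffers is ignored. A stage starts an iteration once the upstream stage has finished that iteration and the stage itself has finished the previous one.

```python
def sequential_cycles(stage_cycles, iterations):
    """Every iteration runs all stages back to back, with no overlap."""
    return iterations * sum(stage_cycles)

def metapipeline_cycles(stage_cycles, iterations):
    """Stages overlap across iterations; returns the finish time of the last stage."""
    finish = [0] * len(stage_cycles)   # finish time of each stage's latest iteration
    for _ in range(iterations):
        upstream_done = 0
        for s, cycles in enumerate(stage_cycles):
            # Wait for the producer stage and for this stage's previous iteration.
            start = max(upstream_done, finish[s])
            finish[s] = start + cycles
            upstream_done = finish[s]
    return finish[-1]
```

With four stages taking 3, 5, 2, and 4 cycles and 10 iterations, the sequential model needs 140 cycles while the overlapped model needs 59, which is the sum of the stage times plus 9 repetitions of the slowest stage.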
Metapipeline Analysis
- Detects metapipelines in the tiled parallel pattern IR
- Detection
  - Chain of producer-consumer parallel patterns within the body of another parallel pattern
- Scheduling
  - Topological sort of IR of parallel pattern body
  - List of stages, where each stage consists of one or more independent parallel patterns
  - Promote intermediate buffers to double buffers
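
The scheduling step described above can be sketched as a level-by-level topological sort. This is an illustrative model only: the function name `schedule_metapipeline`, the dictionary representation of the pattern body, and the rule that every producer-consumer edge becomes a double buffer are assumptions, not the compiler's actual data structures.

```python
def schedule_metapipeline(consumes):
    """consumes maps each pattern to the patterns whose output it reads.

    Returns (stages, double_buffers): stages is a list of lists of mutually
    independent patterns; double_buffers lists (producer, consumer) pairs."""
    remaining = {node: set(srcs) for node, srcs in consumes.items()}
    stages = []
    while remaining:
        # Patterns whose inputs are all scheduled form the next stage.
        ready = sorted(node for node, srcs in remaining.items() if not srcs)
        if not ready:
            raise ValueError("dependency cycle or unknown producer")
        stages.append(ready)
        for node in ready:
            del remaining[node]
        for srcs in remaining.values():
            srcs.difference_update(ready)
    double_buffers = sorted((src, node)
                            for node, srcs in consumes.items()
                            for src in srcs)
    return stages, double_buffers
```

Applied to a body where `diff` reads `row`, `vprod` reads `diff`, and `store` reads `vprod`, the sketch yields four single-pattern stages and three double buffers, one per producer-consumer pair.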
Putting It All Together: kMeans

[Figure: generated hardware for kMeans. Units: samplesTile Load, kmeansTile Load, Vector Dist (Norm), Scalar Dist (Tree +), (MinDist, Idx), +, Inc, /, new kmeansTile Store. Memories: kmeansBlock buffer, samplesBlock double buffer, minIdx double buffer, sum buffer, count buffer, new kmeans double buffer.]

1. Load kmeans
2. Metapipeline: Calculate sum and count
3. Metapipeline: Calculate new kmeans, store results

Similar to (and more general than) hand-written designs [1]

[1] Hussain et al., "FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data", AHS 2011
Experimental Setup
- Board:
  - Altera Stratix V
  - 48 GB DDR3 off-chip DRAM, 6 memory channels
  - Board connected to host via PCIe
- Execution time reported = FPGA execution time
  - CPU ↔ FPGA communication and FPGA configuration time not included
- Goal: How beneficial are tiling and metapipelining?
Experimental Setup
- Baseline
  - Auto-generated MaxJ
  - Representative of state-of-the-art HLS tools
- Baseline optimizations
  - Pipelined execution of innermost loops
  - Parallelized (unrolled) inner loops
  - Parallelism factor chosen by hand
  - Data locality captured at the level of a DRAM burst (384 bytes)
- Parallelism factors are kept consistent across baseline and optimized versions from our flow
Evaluation
Results Summary
- Speedup with tiling: up to 15.5x
- Speedup with tiling + metapipelining: up to 39.4x
- Minimal (often positive!) impact on resource usage
  - Tiled designs have fewer off-chip data loaders and storers
Summary
- Two key optimizations, tiling and metapipelining, to generate efficient FPGA designs from parallel patterns
- Automatic tiling transformations placing fewer restrictions on memory access patterns
- Analysis to automatically infer designs with metapipelines and double buffers
- Significant speedups of up to 39.4x with minimal impact on FPGA resource utilization