TANGRAM: Efficient Kernel Synthesis for Performance Portable Programming
TRANSCRIPT

TANGRAM: Efficient Kernel Synthesis for Performance Portable Programming
Li-Wen Chang (1), Izzat El Hajj (1), Christopher Rodrigues (2), Juan Gómez-Luna (3), Wen-Mei Hwu (1)
(1) University of Illinois, (2) Huawei America Research Lab, (3) Universidad de Córdoba
Performance Portability
• Maintaining optimized programs for different devices is costly
• Ideally, programs written once should run on different devices with performance
[Diagram: separately optimized CPU, Accelerator A, and Accelerator B code must each be ported and then re-optimized for every new device generation (CPU 2.0, Accelerator A 2.0, Accelerator B 2.0)]
Performance Portability: OpenCL SGEMM

Performance (GFLOPS):
                                                   Tesla GPU (GTX 280) | Fermi GPU (C2050) | Sandy Bridge CPU (i7-3820)
Parboil (default naïve OpenCL version)                             9.2 |              74.0 |   2.3
Parboil (OpenCL version optimized for Tesla GPU)                 304.6 |             392.5 |  55.5
Reference (CUBLAS for GPU, MKL for CPU)                          348.0 |             608.1 | 183.9

The Tesla-optimized version achieves ~90% of the reference on Tesla, but that drops to ~60% on Fermi (90% → 60%) and ~30% on the Sandy Bridge CPU (90% → 30%).
[Diagram: the goal is a single source code that targets CPU, Accelerator A, Accelerator B, and their next generations (CPU 2.0, Accelerator A 2.0, Accelerator B 2.0) without per-device porting]
Composition-based Programming Languages
• NESL, Sequoia, Petabricks
• Highly adaptive to hierarchies
– Through composition
• Usually scale well
• Performance relies on base-rule implementations/libraries
Performance Sensitivity in Base Rule: DGEMM

[Chart: normalized DGEMM performance on a Sandy Bridge i7-3820 (higher is better) for MKL, TANGRAM, Petabricks (with MKL), and Petabricks (no MKL). Petabricks with MKL loses ~20% versus MKL; without MKL it reaches only 11% of MKL.]
TANGRAM
• Composition-based language
• Focuses on high-performance code synthesis within a node
– Removes the dependence on high-performance base-rule implementations/libraries
• Provides a representation for better SIMD utilization
• Provides an architectural hierarchy model to guide composition
TANGRAM Language

__codelet int sum(const Array<1,int> in) {
  unsigned len = in.size();
  int accum = 0;
  for (unsigned i = 0; i < len; ++i) { accum += in[i]; }
  return accum;
}
(a) Atomic autonomous codelet

__codelet __coop __tag(kog) int sum(const Array<1,int> in) {
  __shared int tmp[coopDim()];
  unsigned len = in.size();
  unsigned id = coopIdx();
  tmp[id] = (id < len) ? in[id] : 0;
  for (unsigned s = 1; s < coopDim(); s *= 2) {
    if (id >= s) tmp[id] += tmp[id-s];
  }
  return tmp[coopDim()-1];
}
(b) Atomic cooperative codelet

__codelet __tag(asso_tiled) int sum(const Array<1,int> in) {
  __tunable unsigned p;
  unsigned len = in.size();
  unsigned tile = (len + p - 1) / p;
  return sum(map(sum, partition(in, p, sequence(0, tile, len), sequence(1), sequence(tile, tile, len+1))));
}
(c) Compound codelet using adjacent tiling

__codelet __tag(stride_tiled) int sum(const Array<1,int> in) {
  __tunable unsigned p;
  unsigned len = in.size();
  unsigned tile = (len + p - 1) / p;
  return sum(map(sum, partition(in, p, sequence(0, 1, p), sequence(p), sequence((p-1)*tile, 1, len+1))));
}
(d) Compound codelet using strided tiling
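The two compound codelets differ only in how partition slices the input. As a rough sketch of their semantics, here is a plain C++ analogue (sum_adjacent and sum_strided are illustrative names, not TANGRAM API):

```cpp
#include <vector>

// Adjacent tiling (codelet c): tile i covers the contiguous range
// [i*tile, min((i+1)*tile, len)).
int sum_adjacent(const std::vector<int>& in, unsigned p) {
    unsigned len = in.size();
    unsigned tile = (len + p - 1) / p;          // same tile formula as the codelet
    int total = 0;
    for (unsigned i = 0; i < p; ++i) {          // map(sum, partition(in, p, ...))
        int partial = 0;
        for (unsigned j = i * tile; j < len && j < (i + 1) * tile; ++j)
            partial += in[j];
        total += partial;                        // outer sum over the partials
    }
    return total;
}

// Strided tiling (codelet d): tile i covers elements i, i+p, i+2p, ...
int sum_strided(const std::vector<int>& in, unsigned p) {
    unsigned len = in.size();
    int total = 0;
    for (unsigned i = 0; i < p; ++i) {
        int partial = 0;
        for (unsigned j = i; j < len; j += p)
            partial += in[j];
        total += partial;
    }
    return total;
}
```

Both decompositions compute the same sum; they differ only in memory-access pattern, which is what lets the compiler pick, e.g., adjacent tiles for cache-friendly CPU threads and strided tiles for coalesced GPU accesses.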
[Diagram: composition trees for sum, built from the partitions pc and pd over the atomic codelets ca and cb]
Rule Extraction
• TANGRAM parser
– Clang 3.5
– Customized TANGRAM AST builder
• Outputs a set of TANGRAM ASTs
[Pipeline: TANGRAM Lang. Codelets → Rule Extraction → Program Composition Rules → Rule Specialization (guided by the Architectural Hierarchy Model) → Specialized Composition Rules → Composition → Composition Plans → Device-specific Codegen → Kernel Versions]
Program Composition Rules (sum):
Rule 1: compose(sum, L) → SL, devolve(ℓL), compose(sum, ℓL)
Rule 2: compose(sum, L) → compute(ca, SEL)
Rule 3: compose(sum, L) → compute(cb, VEL)
Rule 4: compose(sum, L) → SL, regroup(pc, L), distribute(ℓL), compose(sum, ℓL), compose(sum, L)
Rule 5: compose(sum, L) → SL, regroup(pd, L), distribute(ℓL), compose(sum, ℓL), compose(sum, L)

Example of deriving composition rules from compound codelets (codelet c):
compose(sum, L)
→ compose(cc, L)
→ compose(sum(map(sum, partition(…, pc))), L)
→ compose(map(sum, partition(…, pc)), L), compose(sum, L)
→ compose(partition(…, pc), L), compose(map(sum, …), L), compose(sum, L)
→ SL, regroup(pc, L), distribute(ℓL), compose(sum, ℓL), compose(sum, L)
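The recursive flavor of these rules can be sketched in plain C++: at each level, either compute directly with the atomic codelet (as in Rule 2) or regroup into p tiles and recursively compose both the tiles and the resulting partials (as in Rule 4). This is an illustrative model, not generated code; the cutoff and p stand in for __tunable parameters.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: either compute(ca, SE) at the base (Rule 2) or regroup with
// pc and recurse into the subordinate level (Rule 4).
int compose_sum(const std::vector<int>& in, unsigned level) {
    if (level == 0 || in.size() <= 4) {          // Rule 2: atomic codelet ca
        int accum = 0;
        for (int v : in) accum += v;
        return accum;
    }
    const unsigned p = 2;                        // stand-in for __tunable p
    unsigned len = in.size();
    unsigned tile = (len + p - 1) / p;
    std::vector<int> partials;                   // map(sum, partition(in, p))
    for (unsigned i = 0; i < p; ++i) {
        std::vector<int> t(in.begin() + std::min<std::size_t>(std::size_t(i) * tile, len),
                           in.begin() + std::min<std::size_t>(std::size_t(i + 1) * tile, len));
        partials.push_back(compose_sum(t, level - 1));  // compose(sum, ℓL)
    }
    return compose_sum(partials, level - 1);     // compose(sum, L) on the partials
}
```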
Architectural Hierarchy Model
• Defines a "level"
– Computational capability
• Scalar or vector execution
– Capability to synchronize across the subordinate level of that level
• Extensible
– CPU SIMD, GPU warp, ILP, even GPU dynamic parallelism
Device Specification (GPU):
G := C_G = none, (ℓG, SG) = (B, terminate/launch)  // G: grid
B := C_B = VE_B, (ℓB, SB) = (T, __syncthreads())   // B: block
T := C_T = SE_T, (ℓT, ST) = none                   // T: thread
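The specification is just data a compiler can consult. A minimal sketch of how it might be represented (struct and field names are illustrative, mirroring the (C, (ℓ, S)) triples above):

```cpp
#include <string>
#include <vector>

// Each level records its computational capability and how to
// synchronize across its subordinate level.
struct Level {
    std::string name;        // "G" (grid), "B" (block), "T" (thread)
    std::string capability;  // "none", "VE" (vector), or "SE" (scalar)
    std::string sublevel;    // subordinate level, empty at the leaf
    std::string sync;        // synchronization primitive over the sublevel
};

std::vector<Level> gpu_hierarchy() {
    return {
        {"G", "none", "B", "terminate/launch"},
        {"B", "VE",   "T", "__syncthreads()"},
        {"T", "SE",   "",  ""},
    };
}
```

Extending the model, say with a warp level between B and T, would just add a row; that table-like shape is what makes the hierarchy model extensible.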
Rule Specialization
• TANGRAM analyzer
– AST traverser
• Outputs a lookup table
– Legal codelets for each level
– Also prioritizes them
Specialized Composition Rules:
G rules:
G1: compose(sum, G) → SG, devolve(B), compose(sum, B)
G4: compose(sum, G) → SG, regroup(pc, G), distribute(B), compose(sum, B), compose(sum, G)
G5: compose(sum, G) → SG, regroup(pd, G), distribute(B), compose(sum, B), compose(sum, G)
B rules:
B1: compose(sum, B) → SB, devolve(T), compose(sum, T)
B3: compose(sum, B) → compute(cb, VEB)
B4: compose(sum, B) → SB, regroup(pc, B), distribute(T), compose(sum, T), compose(sum, B)
B5: compose(sum, B) → SB, regroup(pd, B), distribute(T), compose(sum, T), compose(sum, B)
T rules:
T2: compose(sum, T) → compute(ca, SET)
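The analyzer's output, a per-level table of legal rules, can be sketched as follows. The rule names follow the slide; the priority ordering shown is an assumption for illustration, not the paper's actual heuristic.

```cpp
#include <map>
#include <string>
#include <vector>

// Lookup table: legal specialized rules per level, in (assumed)
// priority order.
std::map<std::string, std::vector<std::string>> specialized_rules() {
    return {
        {"G", {"G4", "G5", "G1"}},        // tiled rules, then devolve
        {"B", {"B4", "B5", "B3", "B1"}},  // tiled, cooperative, devolve
        {"T", {"T2"}},                    // only the scalar atomic codelet
    };
}
```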
Composition
• TANGRAM planner
– AST traverser/builder
– Selection of codelets or map policies
– Pruning
• Outputs ASTs for codegen
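A minimal sketch of what plan enumeration over a per-level rule table could look like, assuming one rule is chosen per level and leaving pruning out (function and table shapes are illustrative, not TANGRAM's internals):

```cpp
#include <map>
#include <string>
#include <vector>

// Enumerate composition plans by picking one specialized rule per
// level and recursing into the subordinate level.
std::vector<std::vector<std::string>> enumerate_plans(
        const std::map<std::string, std::vector<std::string>>& rules,
        const std::map<std::string, std::string>& sublevel,
        const std::string& level) {
    std::vector<std::vector<std::string>> plans;
    auto sub = sublevel.find(level);
    std::vector<std::vector<std::string>> subplans =
        (sub == sublevel.end())
            ? std::vector<std::vector<std::string>>{{}}   // leaf level
            : enumerate_plans(rules, sublevel, sub->second);
    for (const auto& rule : rules.at(level))              // one legal rule per level
        for (auto plan : subplans) {
            plan.insert(plan.begin(), rule);
            plans.push_back(plan);
        }
    return plans;
}
```

A real planner would interleave pruning here, discarding plans whose codelet choices or map policies cannot be profitable, rather than enumerating exhaustively.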
[Diagram: one composition plan mapped onto the GPU hierarchy: strided tiling (pd) at the grid level with terminate/launch between kernels, and within each block the cooperative codelet cb over adjacent tiles (pc) of the scalar codelet ca]
Codegen
• TANGRAM codegen
– AST traversers
– Conventional optimizations
• Outputs C/CUDA source code
GPU Codegen Example (GPU)
1. Grid
2. Block
3. Thread

tile = (len + gridDim.x - 1) / gridDim.x;
sub_tile = (tile + blockDim.x - 1) / blockDim.x;
accum = 0;
#pragma unroll
for (unsigned i = 0; i < sub_tile; ++i) {
  accum += in[blockIdx.x*tile + i*blockDim.x + threadIdx.x];
}
tmp[threadIdx.x] = accum;
__syncthreads();
for (unsigned s = 1; s < blockDim.x; s *= 2) {
  if (threadIdx.x >= s) tmp[threadIdx.x] += tmp[threadIdx.x - s];
  __syncthreads();
}
partial[blockIdx.x] = tmp[blockDim.x - 1];
return;  // Launch new kernel to sum up partial
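The synthesized kernel's indexing can be checked with a sequential C++ model. This is a sketch with bounds checks added (the simplified slide code omits them); gridDim/blockDim are plain parameters here, and partial_sums is an illustrative name.

```cpp
#include <vector>

// Sequential model of the generated kernel: each block b owns the
// adjacent tile [b*tile, (b+1)*tile), and each thread t walks that
// tile with stride blockDim, mirroring the kernel's indexing.
std::vector<int> partial_sums(const std::vector<int>& in,
                              unsigned gridDim, unsigned blockDim) {
    unsigned len = in.size();
    unsigned tile = (len + gridDim - 1) / gridDim;
    unsigned sub_tile = (tile + blockDim - 1) / blockDim;
    std::vector<int> partial(gridDim, 0);
    for (unsigned b = 0; b < gridDim; ++b)
        for (unsigned t = 0; t < blockDim; ++t)
            for (unsigned i = 0; i < sub_tile; ++i) {
                unsigned idx = b * tile + i * blockDim + t;
                if (idx < len && idx < (b + 1) * tile)  // guards the slide omits
                    partial[b] += in[idx];
            }
    return partial;
}
```

A second kernel launch (the grid-level terminate/launch step) would then reduce the partial array.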
Experimental Results
• TANGRAM delivers 70% or higher performance compared to highly-optimized libraries, such as Intel MKL, NVIDIA CUBLAS, CUSPARSE, or Thrust, or experts' optimized benchmarks (Rodinia)
[Chart: normalized performance (higher is better) of Scan, SGEMV-TS, SGEMV-SF, DGEMM, SpMV, KMeans, and BFS, comparing reference and TANGRAM versions on Kepler, Fermi, and CPU]
FAQ 1
• Why is TANGRAM better than other composition-based languages?
– TANGRAM provides an architectural hierarchy model to guide composition
– TANGRAM provides a representation of cooperative codelets for better SIMD utilization
• Especially shuffle instructions and scratchpad
FAQ 2
• Where do optimizations happen?
– Selection of codelets or map policies in Composition
– Conventional optimizations in Codegen
– Optimizations in backend compilers
FAQ 3
• What? Multiple versions?
– We did NOT ask users to write multiple versions of kernels
– Codelets can be used to synthesize different versions of kernels
– Codelets can be reused multiple times within one kernel, across kernels in a device, and across kernels for different devices
Takeaways of TANGRAM
• Performance portability
– 70% or higher performance compared to highly-optimized libraries
• Extensible architectural hierarchy model
– Supports CPU SIMD, GPU warp, ILP, even GPU dynamic parallelism
• Native description of the algorithmic design space
– Perfect for domain users