a compilation framework for irregular memory accesses on ... · 4 compiler-directed communication...
TRANSCRIPT
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummary
A Compilation Framework forIrregular Memory Accesses
on the Cell Broadband Engine
MAINAK CHAUDHURI
Department of Computer Science and Engineering, IIT Kanpur
Pramod Bhatotia � IBM ResearchSanjeev Aggarwal � IIT Kanpur
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryOutline
1 Compiling Irregular Accesses
2 Data�ow Analysis
3 Compiler Transformations and Code Generation
4 Compiler-Directed Communication Mechanism
5 Run-time Parallelization
6 Experiments and Results
7 Summary
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCell Broadband Engine OverviewCell Architecture
SPU ( 128 128b registers)
Local Store(256 KB)
(LS)
Memory Flow Controller
(MFC)
Power Processing Unit(PPU)
L1 Cache
L2 Cache
Element Interconnect Bus (EIB)
SPE (x8)
IOController
BIF & IO
PPE
MIC
System Memory
IOController
BIF & IO
Memory Subsystem IO Subystem
The Cell Broadband Engine
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiling Irregular Accesses on the Cell ProcessorIrregular Array Accesses
Irregular Array AccessAn array access is irregular if no closed-form expression, interms of the loop indices and constants, for the subscripts ofthe accessed array is available at compile-time.
do t =do i = num_edges
n1 = left[i ]n2 = right[i ]force = (x[n1]−x[n2])/4y [n1] + = forcey [n2] + = force
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiling Irregular Accesses on the Cell ProcessorIrregular Array Accesses
Irregular Array AccessAn array access is irregular if no closed-form expression, interms of the loop indices and constants, for the subscripts ofthe accessed array is available at compile-time.
do t =do i = num_edges
n1 = left[i ]n2 = right[i ]force = (x[n1]−x[n2])/4y [n1] + = forcey [n2] + = force
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiling Irregular Accesses
Compiling Irregular Accesses
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiling Irregular Accesses on the Cell ProcessorCompiler Analysis for Irregular Array Accesses
force = x[r ight [ i ] ] - x[ lef t [ i ] ] ;
access pattern @ run-time
int main(){ ..... .......... for() { ................... for() { ................... } }
for() { force = x[r ight[ i ] ] - x[ left[ i ] ] ; } ......... ........}
static analysis for DMA comm.
MFC-DMA
MFC-DMA
MFC-DMA
Dataflow frameworkbased on run-timepreprocessing for DMA comm.
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiling Irregular Accesses on the Cell ProcessorExplicit Dynamic Memory Management
Execution Unit
Local StoreSPU
DMA controller MFC
EIB
Main Memory
Limited amount of LS (256 KB)
Code
Data
Local Store
1
SPE vector processor computes quickly
2 Efficient t i l ing
3
Mult iple-buffering
4
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiling Irregular Accesses on the Cell ProcessorCell Architecture Speci�c Challenges
EU LSSPU
MFC
EU LSSPU
MFC
EU LSSPU
MFC
EU LSSPU
MFC
L1-C L1-DPPU
L2
SPE-1
SPE-3
SPE-2
SPE-4
Main Memory
SPE-all
SPMD ||el execution
workload part i t ioning + monitor control
1
Vector processor Data al ignment
2 Branch prediction compiler hints
3
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryNeed for Run-time Parallelization
Sparse UpdatesSparse update refers to reduction operations on some elements ofan array within a loop where the access pattern of the array in theloop may be irregular.
for(i = 1 to n)A[B[i ]] = A[B[i ]] + X ;
Flow
S1: A(B(i))=...S2: ...=A(B(i))
Anti
S1: ...=A(B(i))S2: A(B(i))=...
Output
S1: A(B(i))=...S2: A(B(i))=...
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiling Irregular Accesses on the Cell ProcessorRun-time Parallelization
EU LSSPU
MFC
SPE-0
EU LSSPU
MFC
No H/w Coherency
1 2 543 6
X2 3 2 3 1 3X
Iteration i
Array B
Array A X X X X
SPE-2
A[B[ i ] ]
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiling Irregular Accesses on the Cell ProcessorRun-time Parallelization
EU LSSPU
MFC
SPE-0
EU LSSPU
MFC
No H/w Coherency
1 2 543 6
X2 3 2 3 1 3X
Iteration i
Array B
Array A X X X X
SPE-2
A[B[ i ] ]
4
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiling Irregular Accesses on the Cell ProcessorRun-time Parallelization
EU LSSPU
MFC
SPE-0
EU LSSPU
MFC
No H/w Coherency
1 2 543 6
X2 3 2 3 1 3X
Iteration i
Array B
Array A X X X X
SPE-2
A[B[ i ] ]
4 3
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiling Irregular Accesses on the Cell ProcessorRun-time Parallelization
EU LSSPU
MFC
SPE-0
EU LSSPU
MFC
No H/w Coherency
1 2 543 6
X2 3 2 3 1 3X
Iteration i
Array B
Array A X X X X
SPE-2
A[B[ i ] ]
4 3 5
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryOverview of the System
Sequential Input program
Dataflow VariableAnalysis
Runtime System
SPMD Cell Code
CompilerTransformation
MemoryCommunicat ion
Runtime Parallelization
Parallelizing Compiler Framework
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryData�ow Analysis
Data�ow Analysis
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryData�ow Analysis for Memory Communication
Loop Flow Graph Builder
Result FlowVariable Analysis
Bit-VectorLibrary
GlobalInformation
Builder
global flow analysis
result flow analysis
Dependence Analyzer
Local FlowAnalysis
Global FlowAnalysis
Result FlowAnalysis
SourceLoop Flow
Graph
Global Info-DS
c2suifAST
(SUIF IR)
InspectorOutput
Figure: Data�ow analysis for memory communication
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler Transformations and Code Generation
Compiler Transformations and CodeGeneration
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler Transformations and Code Generation
Run-Time Library
Loop Transformations
Kernel Transformations
Communication Builder
Compiler Transformations and Code Generation
CommunicationTransformations
Multi-BufferingBuilder
Tiling InfoBuilder
InspectorOutput
GlobalInfo-DS
Input SourceSUIF IR
suif2c
TransformedSUIF IR
SPMD CellCode
Figure: Compiler transformations and code generation
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-Directed Communication Mechanism
Compiler-Directed CommunicationMechanism
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-directed Communication MechanismBlock Access Method for Regular Accesses
EU LSSPU
MFC
Main Memory
for( i=1 to 4){ sum += a[ i ] ;}
a [0 ]
a [1 ]a [2 ]a [3 ]
Block Access for Regular Accesses
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-directed Communication MechanismBlock Access Method for Regular Accesses
Execution Unit
Local StoreSPU
DMA controller MFC
EIB
Main Memory
1Tile-Size
Code
Data
Local Store
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-directed Communication MechanismBlock Access Method for Regular Accesses
Execution Unit
Local StoreSPU
DMA controller MFC
EIB
Main Memory
2Addressing
Code
Data
LS Address
EA
*Cache-line aligned*last 4 bits
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-directed Communication MechanismBlock Access Method for Regular Accesses
Execution Unit
Local StoreSPU
DMA controller MFC
EIB
Main Memory
3 DataPadding
Code
Data
Multiple of 16
Transfer unit
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-directed Communication MechanismBounded Method for Irregular Read Accesses
EU LSSPU
MFC
Main Memory
for( i=1 to 4){ sum += a[b[ i ] ] ;} a [b [0 ] ]
a [b [1 ] ]
a [b [2 ] ]a [b [3 ] ]
Bounded Access for Irregular read-only Accesses
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-directed Communication MechanismBounded Method for Irregular Read Accesses
EU LSSPU
MFC
Main Memory
for( i=1 to 4){ sum += a[b[ i ] ] ;}
a [b [0 ] ]
a [b [1 ] ]
a [b [2 ] ]a [b [3 ] ]
Bulk Access
a[0]
a[n]
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-directed Communication MechanismBounded Method for Irregular Read Accesses
EU LSSPU
MFC
Main Memory
for( i=1 to 4){ a [b [ i ] ]= a [b [ i ] ]+x ;}
a [b [0 ] ]
a [b [1 ] ]
a [b [2 ] ]a [b [3 ] ]
Gather Method
a[0]
a[n]
a [b [0 ] ]
a [b [1 ] ]
a [b [2 ] ]
a [b [3 ] ]
DMA-LIST
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-directed Communication MechanismBounded Method for Irregular Read Accesses
Execution Unit
Local StoreSPU
DMA controller MFC
Modif ied bounding box method
Main Memory
1 Prepare list ofbounding boxes
DMA-LIST
Executor prepares l ist of all the neededelements and ini t iates the communicat ionwith MFC DMA list command
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-directed Communication MechanismBounded Method for Irregular Read Accesses
Execution Unit
Local StoreSPU
DMA controller MFC
Modif ied bounding box method
Main Memory
2
*Keeps the size of box as 128 bytes, since DMA transfer unit is 128bytes
DMA-LIST
*Prepares single bounding box within same-ti le for duplicated values of ref-array
128 bytes
*Does memcopy for values across the t i le
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-directed Communication MechanismCompiler Controlled Cache for Sparse Updates
EU LSSPU
MFC
Main Memory
for( i=1 to 10){ a [b[ i ] ] += X;}
a [10]
Compiler Controlled Cache Regular Accesses
a[10]
a[10]
a[10]
1
2
3
DMARead-List
DMAWrite-List
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-directed Communication MechanismCompiler Controlled Cache for Sparse Updates
EU LSSPU
MFC
Main Memory
for( i=1 to 10){ a [b[ i ] ] += X;}
Compiler Controlled Cache Regular Accesses
Tag
Tag
Tag
L-0
L-i
L-128
0
1
32
128 bytes
Tag Line word
Compare
Hit inCache
Miss inCache
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryCompiler-directed Communication MechanismCompiler Controlled Cache for Sparse Updates
EU LSSPU
MFC
Main Memory
Tag
Tag
Tag
EU LSSPU
MFC
SPE-0 SPE-1
Tag
Tag
Tag
X
Y
Shared
Dirty
Dir ty
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryRun-time Parallelization
Run-time Parallelization
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryParallelization of Irregular Reduction Loops
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryParallel Construction of the Dependence Chain
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryExperiment and Results
Experiments and Results
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryIterative Partial Di�erential Equation (PDE) Solver
Figure: Speedup for IRREG
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryMolecular Dynamics Code
Figure: Speedup for MOLDYN
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryNonbonded Force Calculations
Figure: Speedup for NBF
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummarySummary
Summary
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummarySummary
Data�ow analysis framework for determining the programpoints for memory communication.
Compiler transformation for run-time data processing andactual computation.
Builds the memory communication schedules.
Run-time parallelization of loops judiciously partitions the dataand computational work.
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor
Compiling Irregular AccessesData�ow AnalysisCompiler Transformations and Code GenerationCompiler-Directed Communication MechanismRun-time ParallelizationExperiments and ResultsSummaryThank You
Thank You
Mainak Chaudhuri Compiling Irregular Accesses for the Cell Processor