University of Michigan, Electrical Engineering and Computer Science
Compiler-directed Data Partitioning for Multicluster Processors
Michael Chu and Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan
March 28, 2006
Multicluster Architectures
• Addresses the register file bottleneck
• Decentralizes the architecture
• Compilation focuses on partitioning operations
• Most previous work assumes a unified data memory
[Figure: two clusters, each with its own register file and I/F/M function units, connected by an intercluster communication network; the unified data memory is split into Data Mem 1 and Data Mem 2]
Problem: Partitioning of Data
• Determine object placement into data memories
• Limited by:
  – Memory sizes/capacities
  – Computation operations related to data
• Partitioning relevant to caches and scratchpad memories
[Figure: objects int x[100], struct foo, and int y[100] to be placed into Data Mem 1 and Data Mem 2 of Cluster 1 and Cluster 2]
Architectural Model
• This work focuses on the use of scratchpad-like static local memories
  – Each cluster has one local memory
  – Each object is placed in one specific memory
  – Each data object is available in its memory throughout the lifetime of the program
[Figure: int x[100] and struct foo placed in Cluster 1's memory; int y[100] in Cluster 2's memory]
Data Unaware Partitioning
• An average of 30% of performance is lost by ignoring data placement
[Chart: per-benchmark performance of data-unaware partitioning]
Our Objective
• Goal: produce efficient code
• Strategy:
  – Partition both data objects and computation operations
  – Balance memory size across clusters
• Improve memory bandwidth
• Maximize parallelism
[Figure: int x[100], struct foo, and int y[100] distributed across the two cluster memories]
First Try: Greedy Approach
• Computation-centric partition of data
  – Place data where computation references it most often
• Greedy approach:
  – Pass 1: region-view computation partition; greedy data cluster assignment
  – Pass 2: region-view computation repartition, with full knowledge of data location

  Scheme        | Data Partition                  | Computation Partition
  Data Unaware  | None; profile-based placement   | Region-view
  Greedy        | Greedy, profile-based           | Region-view
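As an illustration, the greedy data cluster assignment of Pass 1 might look like the sketch below. The function name, object sizes, reference counts, and capacities are invented for the example, not taken from the actual compiler:

```python
# Hypothetical sketch of greedy data placement: put each object in the
# memory of the cluster whose computation references it most often,
# subject to per-cluster memory capacity.

def greedy_assign(objects, ref_counts, capacities):
    """objects: {name: size_in_bytes}
    ref_counts: {name: [refs_from_cluster0, refs_from_cluster1, ...]}
    capacities: [bytes_available_per_cluster, ...]
    Returns {name: cluster_index}."""
    used = [0] * len(capacities)
    placement = {}
    # Visit the hottest objects first so they get their preferred memory.
    for name in sorted(objects, key=lambda n: -sum(ref_counts[n])):
        # Try clusters in order of how often this object is referenced there.
        for c in sorted(range(len(capacities)),
                        key=lambda c: -ref_counts[name][c]):
            if used[c] + objects[name] <= capacities[c]:
                placement[name] = c
                used[c] += objects[name]
                break
    return placement

placement = greedy_assign(
    objects={"x": 400, "y": 400, "foo": 1024},
    ref_counts={"x": [90, 10], "y": [5, 80], "foo": [50, 40]},
    capacities=[1500, 1500],
)
# x and foo land on cluster 0, y on cluster 1
```

Because each object is placed in isolation, early placements can crowd a memory and force later hot objects onto their second-choice cluster, which is one source of the remaining performance gap the slides note.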
Greedy Approach Results
• 2 clusters: one integer, float, memory, and branch unit per cluster
• Results relative to a unified, dual-ported memory
• Improvement over Data Unaware, but still room for improvement
Second Try: Global Data Partition
• Data-centric partition of computation
• Hierarchical technique
• Pass 1: global-view for data
  – Considers memory relationships throughout the program
  – Locks memory operations to clusters
• Pass 2: region-view for computation
  – Partitions computation based on data location
[Figure: the global data partition feeds the regional computation partition]
Pass 1: Global Data Partitioning
• Determine memory relationships
  – Pointer analysis and profiling of memory
• Build a program-level graph representation of all operations
• Perform data object / memory operation merging
  – Respects the correctness constraints of the program
• Steps:
  1. Interprocedural pointer analysis and memory profile
  2. Build program data graph
  3. Merge memory operations
  4. METIS graph partitioner
Global Data Graph Representation
• Nodes: operations, either memory or non-memory
  – Memory operations: loads, stores, malloc callsites
• Edges: data flow between operations
• Node weight: data object size
  – Sum of data sizes for referenced objects
• Object size determined by:
  – Globals/locals: pointer analysis
  – Malloc callsites: memory profile
[Figure: example graph with int x[100] (400 bytes), struct foo (1 KB), and malloc site 1 (200 bytes)]
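A minimal sketch of building this graph, with invented operations and sizes (the object names echo the slide's example); the weighted graph would then be handed to a partitioner such as METIS, which balances total node weight, and hence memory size, across clusters while minimizing cut data-flow edges:

```python
# Hypothetical program data graph. Nodes are operations; memory
# operations carry the sizes of the objects they may reference
# (globals/locals from pointer analysis, malloc sites from a profile).

object_sizes = {"x": 400, "foo": 1024, "malloc_site_1": 200}

ops = {
    "ld1": {"objects": ["x"]},              # load from x
    "ld2": {"objects": ["foo"]},            # load from foo
    "st1": {"objects": ["malloc_site_1"]},  # store through a malloc'd pointer
    "add": {"objects": []},                 # non-memory op
}
edges = [("ld1", "add"), ("ld2", "add"), ("add", "st1")]

# Node weight = sum of the sizes of the referenced objects,
# so a balanced graph partition also balances memory usage.
weights = {op: sum(object_sizes[o] for o in info["objects"])
           for op, info in ops.items()}
```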
Global Data Partitioning Example
[Figure: program data graph over basic blocks BB1 and BB2, with memory and non-memory ops; malloc site 1, malloc site 2, int x[100], struct foo, and struct bar are assigned across Cluster 0 and Cluster 1. Example merged node weights: 2 objects referenced (80 KB), 1 object referenced (100 KB), 2 objects referenced (200 KB)]
Pass 2: Computation Partitioning
• Observation: the global-level data partition is only half the answer
  – Doesn't account for operation resource usage
  – Doesn't consider code scheduling regions
• Second pass of partitioning on each scheduling region
  – Memory operations from the first phase are locked in place
[Figure: per-region repartitioning of basic block BB1]
Experimental Methodology
• 2 clusters: one integer, float, memory, and branch unit per cluster
• All results relative to a unified, dual-ported memory
• Compared configurations:

  Scheme         | Data Partitioning                        | Computation Partition
  Global         | Global-view, data-centric                | Knows data location
  Greedy         | Region-view, greedy computation-centric  | Knows data location
  Data Unaware   | None; assumes unified memory             | Assumes unified memory
  Unified Memory | N/A                                      | Unified memory
Performance: 1-cycle Remote Access
[Chart: performance per benchmark relative to the unified memory]
Performance: 10-cycle Remote Access
[Chart: performance per benchmark relative to the unified memory]
Case Study: rawcaudio
[Figure: schedules under the Global Data Partition vs. the greedy profile-based partition]
Summary
• Global Data Partitioning
  – Data placement as a first-order design principle
  – Global, data-centric partition of computation
  – Phase-ordered approach:
    • Global-view for decisions on data
    • Region-view for decisions on computation
• Achieves 96% of unified-memory performance on partitioned memories
• Future work: apply to cache memories
Data Partitioning for Multicores
• Adapt global data partitioning to the cache memory domain
• Similar goals:
  – Increase data bandwidth
  – Maximize parallel computation
• Different goals:
  – Reduce coherence traffic
  – Keep the working set ≤ cache size
Questions?
http://cccp.eecs.umich.edu
Backup
Future Work: Cache Memories
• Adapt global data partitioning to the cache memory domain
• Similar goals:
  – Increase data bandwidth
  – Maximize parallel computation
• Different goals:
  – Reduce coherence traffic
  – Balance the working set
Memory Operation Merging

    int *x;
    int foo[100];
    int bar[100];

    void main() {
        int *a = malloc(sizeof(int));  /* size illustrative */
        int *b;
        int c;
        if (cond) {
            c = foo[1];
            b = a;
        } else {
            c = bar[1];
            b = &bar[1];
        }
        *b = 100;     /* store to "malloc" or "bar" */
        foo[0] = c;   /* store to "foo" */
    }

[Figure: dataflow graph linking the malloc site, load "foo", load "bar", the store to "malloc" or "bar", and the store to "foo"]
• Interprocedural pointer analysis determines memory relationships
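One way to realize this merging, sketched here with invented points-to sets that mirror the C example above (the deck does not show the actual implementation), is a union-find over operations whose points-to sets overlap: any two memory operations that may touch the same object must land in the same memory, so they are merged before partitioning:

```python
# Hedged sketch: merge memory operations whose points-to sets overlap,
# so the graph partitioner cannot separate them across memories.

points_to = {
    "store_b":   {"malloc1", "bar"},  # *b = 100 may hit either object
    "load_foo":  {"foo"},
    "load_bar":  {"bar"},
    "store_foo": {"foo"},
    "malloc1":   {"malloc1"},
}

parent = {op: op for op in points_to}

def find(x):
    # Root lookup with path compression.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Merge any two operations that may reference a common object.
ops = list(points_to)
for i, a in enumerate(ops):
    for b in ops[i + 1:]:
        if points_to[a] & points_to[b]:
            union(a, b)

groups = {}
for op in ops:
    groups.setdefault(find(op), set()).add(op)
# Two merged groups: {store_b, load_bar, malloc1} and {load_foo, store_foo}
```

This is exactly why the ambiguous store `*b = 100` forces the malloc site and the accesses to `bar` into one cluster's memory: correctness requires every operation that may reach an object to address the same physical memory.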
Multicluster Compilation
• Previous techniques focused on operation partitioning [cite some papers]
• They ignore the issue of data object placement in memory
• They assume a shared memory accessible from each cluster
Phase 2: Computation Partitioning
• Observation: the global-level data partition is only half the solution
  – Doesn't properly account for resource usage details
  – Doesn't consider code scheduling regions
• Second pass of partitioning is done locally on each basic block of the program
  – Memory operations are locked into specific clusters
• Uses the Region-based Hierarchical Operation Partitioner (RHOP)
Computation Partitioning Example
[Figure: three snapshots of basic block BB1's dataflow graph (loads, stores, adds, a multiply, address computations) as operations are progressively assigned to clusters]
• Memory operations from the first phase are locked in place
• RHOP performs a detailed, resource-cognizant computation partition
• Modified multi-level Kernighan-Lin algorithm using schedule estimates
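The refinement step can be sketched as below. This is not RHOP itself: the graph is invented, and RHOP's schedule-estimate cost model is approximated here by the intercluster edge cut; the essential ingredient shown is that memory operations stay locked while other operations migrate to improve the partition:

```python
# Hedged sketch of a Kernighan-Lin-style refinement pass with locked
# memory operations. Cost model: number of intercluster (cut) edges,
# a crude stand-in for RHOP's schedule estimates.

edges = [("ld1", "add1"), ("ld2", "add1"), ("add1", "mul"),
         ("mul", "st1"), ("ld2", "add2"), ("add2", "st1")]
side = {"ld1": 0, "ld2": 1, "st1": 0,     # memory ops: fixed by Pass 1
        "add1": 0, "add2": 0, "mul": 1}   # free to move
locked = {"ld1", "ld2", "st1"}

def cut(side):
    # Edges whose endpoints sit on different clusters.
    return sum(side[a] != side[b] for a, b in edges)

improved = True
while improved:
    improved = False
    for v in side:
        if v in locked:
            continue
        before = cut(side)
        side[v] ^= 1              # tentatively move v to the other cluster
        if cut(side) < before:
            improved = True       # keep the improving move
        else:
            side[v] ^= 1          # revert
```

A real KL pass also considers swaps and temporarily accepts non-improving moves to escape local minima, and RHOP additionally weighs function-unit resource pressure; this greedy variant only shows the locked-node constraint.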