University of Michigan, Electrical Engineering and Computer Science
Compiler-directed Data Partitioning for Multicluster Processors
Michael Chu and Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan
March 28, 2006
Multicluster Architectures
• Addresses the register file bottleneck
• Decentralizes the architecture
• Compilation focuses on partitioning operations
• Most previous work assumes a unified data memory
[Figure: two clusters, each with its own register file and I/F/M function units, connected by an intercluster communication network; the unified data memory is split into Data Mem 1 and Data Mem 2]
Problem: Partitioning of Data
• Determine object placement into data memories
• Limited by:
  – Memory sizes/capacities
  – Computation operations related to data
• Partitioning relevant to caches and scratchpad memories
[Figure: objects int x[100], struct foo, and int y[100] to be placed into Data Mem 1 and Data Mem 2 of Cluster 1 and Cluster 2]
Architectural Model
• This work focuses on the use of scratchpad-like static local memories
  – Each cluster has one local memory
  – Each object is placed in one specific memory
  – Each data object is available in its memory throughout the lifetime of the program
[Figure: int x[100] and struct foo placed in Cluster 1's memory; int y[100] in Cluster 2's memory]
Data Unaware Partitioning
• An average of 30% of performance is lost by ignoring data placement
[Chart: per-benchmark performance of data-unaware partitioning]
Our Objective
• Goal: produce efficient code
• Strategy:
  – Partition both data objects and computation operations
  – Balance memory size across clusters
• Improve memory bandwidth
• Maximize parallelism
[Figure: int x[100], struct foo, and int y[100] distributed across the two cluster memories]
First Try: Greedy Approach
• Computation-centric partition of data
  – Place data where computation references it most often
• Greedy approach:
  – Pass 1: region-view computation partition; greedy data cluster assignment
  – Pass 2: region-view computation repartition, with full knowledge of data location

  Scheme        | Data Partition                  | Computation Partition
  Data Unaware  | None; profile-based placement   | Region-view
  Greedy        | Greedy, profile-based           | Region-view
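As an illustration, the greedy data cluster assignment of Pass 1 might look like the sketch below. The function name, object sizes, reference counts, and capacities are invented for the example, not taken from the actual compiler:

```python
# Hypothetical sketch of greedy data placement: put each object in the
# memory of the cluster whose computation references it most often,
# subject to per-cluster memory capacity.

def greedy_assign(objects, ref_counts, capacities):
    """objects: {name: size_in_bytes}
    ref_counts: {name: [refs_from_cluster0, refs_from_cluster1, ...]}
    capacities: [bytes_available_per_cluster, ...]
    Returns {name: cluster_index}."""
    used = [0] * len(capacities)
    placement = {}
    # Visit the hottest objects first so they get their preferred memory.
    for name in sorted(objects, key=lambda n: -sum(ref_counts[n])):
        # Try clusters in order of how often this object is referenced there.
        for c in sorted(range(len(capacities)),
                        key=lambda c: -ref_counts[name][c]):
            if used[c] + objects[name] <= capacities[c]:
                placement[name] = c
                used[c] += objects[name]
                break
    return placement

placement = greedy_assign(
    objects={"x": 400, "y": 400, "foo": 1024},
    ref_counts={"x": [90, 10], "y": [5, 80], "foo": [50, 40]},
    capacities=[1500, 1500],
)
# x and foo land on cluster 0, y on cluster 1
```

Because each object is placed in isolation, early placements can crowd a memory and force later hot objects onto their second-choice cluster, which is one source of the remaining performance gap the slides note.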
Greedy Approach Results
• 2 clusters: one integer, float, memory, and branch unit per cluster
• Results relative to a unified, dual-ported memory
• Improvement over Data Unaware, but still room for improvement
Second Try: Global Data Partition
• Data-centric partition of computation
• Hierarchical technique
• Pass 1: global-view for data
  – Considers memory relationships throughout the program
  – Locks memory operations to clusters
• Pass 2: region-view for computation
  – Partitions computation based on data location
[Figure: the global data partition feeds the regional computation partition]
Pass 1: Global Data Partitioning
• Determine memory relationships
  – Pointer analysis and profiling of memory
• Build a program-level graph representation of all operations
• Perform data object / memory operation merging
  – Respects the correctness constraints of the program
• Steps:
  1. Interprocedural pointer analysis and memory profile
  2. Build program data graph
  3. Merge memory operations
  4. METIS graph partitioner
Global Data Graph Representation
• Nodes: operations, either memory or non-memory
  – Memory operations: loads, stores, malloc callsites
• Edges: data flow between operations
• Node weight: data object size
  – Sum of data sizes for referenced objects
• Object size determined by:
  – Globals/locals: pointer analysis
  – Malloc callsites: memory profile
[Figure: example graph with int x[100] (400 bytes), struct foo (1 KB), and malloc site 1 (200 bytes)]
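A minimal sketch of building this graph, with invented operations and sizes (the object names echo the slide's example); the weighted graph would then be handed to a partitioner such as METIS, which balances total node weight, and hence memory size, across clusters while minimizing cut data-flow edges:

```python
# Hypothetical program data graph. Nodes are operations; memory
# operations carry the sizes of the objects they may reference
# (globals/locals from pointer analysis, malloc sites from a profile).

object_sizes = {"x": 400, "foo": 1024, "malloc_site_1": 200}

ops = {
    "ld1": {"objects": ["x"]},              # load from x
    "ld2": {"objects": ["foo"]},            # load from foo
    "st1": {"objects": ["malloc_site_1"]},  # store through a malloc'd pointer
    "add": {"objects": []},                 # non-memory op
}
edges = [("ld1", "add"), ("ld2", "add"), ("add", "st1")]

# Node weight = sum of the sizes of the referenced objects,
# so a balanced graph partition also balances memory usage.
weights = {op: sum(object_sizes[o] for o in info["objects"])
           for op, info in ops.items()}
```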
Global Data Partitioning Example
[Figure: program data graph over basic blocks BB1 and BB2, with memory and non-memory ops; malloc site 1, malloc site 2, int x[100], struct foo, and struct bar are assigned across Cluster 0 and Cluster 1. Example merged node weights: 2 objects referenced (80 KB), 1 object referenced (100 KB), 2 objects referenced (200 KB)]
Pass 2: Computation Partitioning
• Observation: the global-level data partition is only half the answer
  – Doesn't account for operation resource usage
  – Doesn't consider code scheduling regions
• Second pass of partitioning on each scheduling region
  – Memory operations from the first phase are locked in place
[Figure: per-region repartitioning of basic block BB1]
Experimental Methodology
• 2 clusters: one integer, float, memory, and branch unit per cluster
• All results relative to a unified, dual-ported memory
• Compared configurations:

  Scheme         | Data Partitioning                        | Computation Partition
  Global         | Global-view, data-centric                | Knows data location
  Greedy         | Region-view, greedy computation-centric  | Knows data location
  Data Unaware   | None; assumes unified memory             | Assumes unified memory
  Unified Memory | N/A                                      | Unified memory
Performance: 1-cycle Remote Access
[Chart: performance per benchmark relative to the unified memory]
Performance: 10-cycle Remote Access
[Chart: performance per benchmark relative to the unified memory]
Case Study: rawcaudio
[Figure: schedules under the Global Data Partition vs. the greedy profile-based partition]
Summary
• Global Data Partitioning
  – Data placement as a first-order design principle
  – Global, data-centric partition of computation
  – Phase-ordered approach:
    • Global-view for decisions on data
    • Region-view for decisions on computation
• Achieves 96% of unified-memory performance on partitioned memories
• Future work: apply to cache memories
Data Partitioning for Multicores
• Adapt global data partitioning to the cache memory domain
• Similar goals:
  – Increase data bandwidth
  – Maximize parallel computation
• Different goals:
  – Reduce coherence traffic
  – Keep the working set ≤ cache size
Questions?
http://cccp.eecs.umich.edu
Backup
Future Work: Cache Memories
• Adapt global data partitioning to the cache memory domain
• Similar goals:
  – Increase data bandwidth
  – Maximize parallel computation
• Different goals:
  – Reduce coherence traffic
  – Balance the working set
Memory Operation Merging

    int *x;
    int foo[100];
    int bar[100];

    void main() {
        int *a = malloc(sizeof(int));  /* size illustrative */
        int *b;
        int c;
        if (cond) {
            c = foo[1];
            b = a;
        } else {
            c = bar[1];
            b = &bar[1];
        }
        *b = 100;     /* store to "malloc" or "bar" */
        foo[0] = c;   /* store to "foo" */
    }

[Figure: dataflow graph linking the malloc site, load "foo", load "bar", the store to "malloc" or "bar", and the store to "foo"]
• Interprocedural pointer analysis determines memory relationships
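One way to realize this merging, sketched here with invented points-to sets that mirror the C example above (the deck does not show the actual implementation), is a union-find over operations whose points-to sets overlap: any two memory operations that may touch the same object must land in the same memory, so they are merged before partitioning:

```python
# Hedged sketch: merge memory operations whose points-to sets overlap,
# so the graph partitioner cannot separate them across memories.

points_to = {
    "store_b":   {"malloc1", "bar"},  # *b = 100 may hit either object
    "load_foo":  {"foo"},
    "load_bar":  {"bar"},
    "store_foo": {"foo"},
    "malloc1":   {"malloc1"},
}

parent = {op: op for op in points_to}

def find(x):
    # Root lookup with path compression.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Merge any two operations that may reference a common object.
ops = list(points_to)
for i, a in enumerate(ops):
    for b in ops[i + 1:]:
        if points_to[a] & points_to[b]:
            union(a, b)

groups = {}
for op in ops:
    groups.setdefault(find(op), set()).add(op)
# Two merged groups: {store_b, load_bar, malloc1} and {load_foo, store_foo}
```

This is exactly why the ambiguous store `*b = 100` forces the malloc site and the accesses to `bar` into one cluster's memory: correctness requires every operation that may reach an object to address the same physical memory.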
Multicluster Compilation
• Previous techniques focused on operation partitioning [cite some papers]
• They ignore the issue of data object placement in memory
• They assume a shared memory accessible from each cluster
Phase 2: Computation Partitioning
• Observation: the global-level data partition is only half the solution
  – Doesn't properly account for resource usage details
  – Doesn't consider code scheduling regions
• Second pass of partitioning is done locally on each basic block of the program
  – Memory operations are locked into specific clusters
• Uses the Region-based Hierarchical Operation Partitioner (RHOP)
Computation Partitioning Example
[Figure: three snapshots of basic block BB1's dataflow graph (loads, stores, adds, a multiply, address computations) as operations are progressively assigned to clusters]
• Memory operations from the first phase are locked in place
• RHOP performs a detailed, resource-cognizant computation partition
• Modified multi-level Kernighan-Lin algorithm using schedule estimates
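The refinement step can be sketched as below. This is not RHOP itself: the graph is invented, and RHOP's schedule-estimate cost model is approximated here by the intercluster edge cut; the essential ingredient shown is that memory operations stay locked while other operations migrate to improve the partition:

```python
# Hedged sketch of a Kernighan-Lin-style refinement pass with locked
# memory operations. Cost model: number of intercluster (cut) edges,
# a crude stand-in for RHOP's schedule estimates.

edges = [("ld1", "add1"), ("ld2", "add1"), ("add1", "mul"),
         ("mul", "st1"), ("ld2", "add2"), ("add2", "st1")]
side = {"ld1": 0, "ld2": 1, "st1": 0,     # memory ops: fixed by Pass 1
        "add1": 0, "add2": 0, "mul": 1}   # free to move
locked = {"ld1", "ld2", "st1"}

def cut(side):
    # Edges whose endpoints sit on different clusters.
    return sum(side[a] != side[b] for a, b in edges)

improved = True
while improved:
    improved = False
    for v in side:
        if v in locked:
            continue
        before = cut(side)
        side[v] ^= 1              # tentatively move v to the other cluster
        if cut(side) < before:
            improved = True       # keep the improving move
        else:
            side[v] ^= 1          # revert
```

A real KL pass also considers swaps and temporarily accepts non-improving moves to escape local minima, and RHOP additionally weighs function-unit resource pressure; this greedy variant only shows the locked-node constraint.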