Compiler-directed Data Partitioning for Multicluster Processors


Page 1: Compiler-directed Data Partitioning  for Multicluster Processors

University of Michigan, Electrical Engineering and Computer Science

Compiler-directed Data Partitioning for Multicluster Processors

Michael Chu and Scott Mahlke

Advanced Computer Architecture Lab

University of Michigan

March 28, 2006

Page 2: Compiler-directed Data Partitioning  for Multicluster Processors

Multicluster Architectures

• Addresses the register file bottleneck
• Decentralizes the architecture
• Compilation focuses on partitioning operations
• Most previous work assumes a unified memory

[Figure: a conventional processor (I/F/M units, one register file, one data memory) beside a multicluster design: Cluster 1 and Cluster 2, each with its own register file, functional units (I/F/M), and data memory (Data Mem 1, Data Mem 2), joined by an intercluster communication network]

Page 3: Compiler-directed Data Partitioning  for Multicluster Processors

Problem: Partitioning of Data

• Determine object placement into data memories
• Limited by:
  – Memory sizes/capacities
  – Computation operations related to the data
• Partitioning is also relevant to caches and scratchpad memories

[Figure: two clusters (I/F/M units) with Data Mem 1 and Data Mem 2; objects int x[100], struct foo, and int y[100] awaiting placement]

Page 4: Compiler-directed Data Partitioning  for Multicluster Processors

Architectural Model

• This work focuses on scratchpad-like static local memories:
  – Each cluster has one local memory
  – Each object is placed in one specific memory
  – A data object stays in that memory throughout the lifetime of the program

[Figure: Cluster 1 holds int x[100] and foo; Cluster 2 holds int y[100]]

Page 5: Compiler-directed Data Partitioning  for Multicluster Processors

Data Unaware Partitioning

An average of 30% of performance is lost by ignoring data.

[Figure: per-benchmark performance of data-unaware partitioning]

Page 6: Compiler-directed Data Partitioning  for Multicluster Processors

Our Objective

• Goal: produce efficient code
• Strategy:
  – Partition both data objects and computation operations
  – Balance memory size across clusters
• Improve memory bandwidth
• Maximize parallelism

[Figure: objects int x[100], struct foo, and int y[100] split across the two cluster memories]

Page 7: Compiler-directed Data Partitioning  for Multicluster Processors

First Try: Greedy Approach

• Computation-centric partition of data
  – Place data where computation references it most often
• Greedy two-pass approach:
  – Pass 1: region-view computation partition, then greedy data cluster assignment
  – Pass 2: region-view computation repartition, with full knowledge of data location

                 Data Partition                    Computation Partition
  Data Unaware   None, profile-based placement     Region-view
  Greedy         Greedy, profile-based             Region-view
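The greedy data assignment of Pass 1 can be sketched as a profile-weighted argmax over clusters: each object is placed where the Pass-1 computation partition references it most often. This is an illustrative Python sketch, not the actual compiler pass; `ref_counts`, a map from (object, cluster) to dynamic reference count, is an assumed input format.

```python
def greedy_place(objects, ref_counts, num_clusters=2):
    """Assign each data object to the cluster whose computation
    references it most often, per the profile.

    objects: list of object names
    ref_counts: dict mapping (object, cluster) -> reference count
    Returns a dict mapping object -> cluster id.
    """
    placement = {}
    for obj in objects:
        counts = [ref_counts.get((obj, c), 0) for c in range(num_clusters)]
        placement[obj] = counts.index(max(counts))  # argmax cluster
    return placement
```

Pass 2 would then repartition the computation region by region with this placement held fixed.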

Page 8: Compiler-directed Data Partitioning  for Multicluster Processors

Greedy Approach Results

• 2 clusters, each with one integer, float, memory, and branch unit
• Results relative to a unified, dual-ported memory
• An improvement over Data Unaware, but still room for improvement

[Figure: per-benchmark results of the greedy approach]

Page 9: Compiler-directed Data Partitioning  for Multicluster Processors

Second Try: Global Data Partition

• Data-centric partition of computation
• Hierarchical technique
• Pass 1: global view for data
  – Considers memory relationships throughout the program
  – Locks memory operations to clusters
• Pass 2: region view for computation
  – Partitions computation based on data location

[Figure: global data partition feeding a regional computation partition]

Page 10: Compiler-directed Data Partitioning  for Multicluster Processors

Pass 1: Global Data Partitioning

• Determine memory relationships
  – Pointer analysis & profiling of memory
• Build a program-level graph representation of all operations
• Perform data object and memory operation merging:
  – Respect the correctness constraints of the program

Step 1: Interprocedural pointer analysis & memory profile
Step 2: Build program data graph
Step 3: Merge memory operations
Step 4: METIS graph partitioner

Page 11: Compiler-directed Data Partitioning  for Multicluster Processors

Global Data Graph Representation

• Nodes: operations, either memory or non-memory
  – Memory operations: loads, stores, malloc callsites
• Edges: data flow between operations
• Node weight: data object size
  – Sum of data sizes for referenced objects
• Object size determined by:
  – Globals/locals: pointer analysis
  – Malloc callsites: memory profile

[Figure: graph with memory-op nodes referencing int x[100] (400 bytes), struct foo (1 Kbyte), and malloc site 1 (200 bytes)]
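Steps 2 and 4 can be illustrated with a toy stand-in for the METIS call: nodes carry the byte sizes of their referenced objects, and the partitioner balances total node weight across parts while preferring to keep dataflow neighbors together. This Python sketch is a simple greedy heuristic, not METIS's actual multilevel algorithm; the node and edge names are invented for illustration.

```python
def partition_data_graph(node_weights, edges, k=2):
    """Toy stand-in for the METIS call in Step 4: split graph nodes into
    k parts, balancing total node weight (bytes of referenced objects)
    while preferring to keep dataflow neighbors together."""
    adj = {n: set() for n in node_weights}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    part, totals = {}, [0] * k
    # Place the heaviest nodes first so large objects anchor the balance.
    for n in sorted(node_weights, key=node_weights.get, reverse=True):
        # Affinity: how many already-placed neighbors sit in each part.
        affinity = [0] * k
        for m in adj[n]:
            if m in part:
                affinity[part[m]] += 1
        # Prefer the part holding the most neighbors; break ties toward
        # the lighter part, balancing memory size across clusters.
        best = max(range(k), key=lambda p: (affinity[p], -totals[p]))
        part[n] = best
        totals[best] += node_weights[n]
    return part
```

On a graph with two independent load/store pairs of equal weight, the heuristic keeps each pair together and splits the pairs across the two parts.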

Page 12: Compiler-directed Data Partitioning  for Multicluster Processors

Global Data Partitioning Example

[Figure: example program data graph over BB1 and BB2, with memory and non-memory op nodes; merged groups reference 2 objects / 80 Kb, 1 object / 100 Kb, and 2 objects / 200 Kb, assigned across Cluster 0 and Cluster 1. Objects: malloc sites 1 and 2, int x[100], struct foo, struct bar]

Page 13: Compiler-directed Data Partitioning  for Multicluster Processors

Pass 2: Computation Partitioning

• Observation: the global-level data partition is only half the answer:
  – It doesn't account for operation resource usage
  – It doesn't consider code scheduling regions
• Second pass of partitioning on each scheduling region
  – Memory operations from the first phase are locked in place

[Figure: BB1 dataflow graph before and after computation partitioning]

Page 14: Compiler-directed Data Partitioning  for Multicluster Processors

Experimental Methodology

• 2 clusters, each with one integer, float, memory, and branch unit
• All results relative to a unified, dual-ported memory
• Compared configurations:

                   Data Partitioning                          Computation Partition
  Global           Global-view, data-centric                  Knows data location
  Greedy           Region-view, greedy computation-centric    Knows data location
  Data Unaware     None, assumes unified memory               Assumes unified memory
  Unified Memory   N/A                                        Unified memory

Page 15: Compiler-directed Data Partitioning  for Multicluster Processors

Performance: 1-cycle Remote Access

[Figure: per-benchmark performance with 1-cycle remote access, shown against the unified memory baseline]

Page 16: Compiler-directed Data Partitioning  for Multicluster Processors

Performance: 10-cycle Remote Access

[Figure: per-benchmark performance with 10-cycle remote access, shown against the unified memory baseline]

Page 17: Compiler-directed Data Partitioning  for Multicluster Processors

Case Study: rawcaudio

[Figure: cluster assignments chosen for rawcaudio by the greedy profile-based approach versus the global data partition]

Page 18: Compiler-directed Data Partitioning  for Multicluster Processors

Summary

• Global data partitioning
  – Data placement: a first-order design principle
  – Global, data-centric partition of computation
  – Phase-ordered approach
• Global view for decisions on data
• Region view for decisions on computation
• Achieves 96% of the performance of a unified memory on partitioned memories
• Future work: apply to cache memories

Page 19: Compiler-directed Data Partitioning  for Multicluster Processors

Data Partitioning for Multicores

• Adapt global data partitioning to the cache memory domain
• Similar goals:
  – Increase data bandwidth
  – Maximize parallel computation
• Different goals:
  – Reduce coherence traffic
  – Keep the working set ≤ cache size

Page 20: Compiler-directed Data Partitioning  for Multicluster Processors

Questions?

http://cccp.eecs.umich.edu

Page 21: Compiler-directed Data Partitioning  for Multicluster Processors

Backup

Page 22: Compiler-directed Data Partitioning  for Multicluster Processors

Future Work: Cache Memories

• Adapt global data partitioning to the cache memory domain
• Similar goals:
  – Increase data bandwidth
  – Maximize parallel computation
• Different goals:
  – Reduce coherence traffic
  – Balance the working set

Page 23: Compiler-directed Data Partitioning  for Multicluster Processors

Memory Operation Merging

• Interprocedural pointer analysis determines memory relationships

    int *x;
    int foo[100];
    int bar[100];

    void main() {
      int *a = malloc();
      int *b;
      int c;
      if (cond) {
        c = foo[1];
        b = a;
      } else {
        c = bar[1];
        b = &bar[1];
      }
      *b = 100;
      foo[0] = c;
    }

• Memory operation nodes: the malloc callsite; load "foo"; load "bar"; the store through b, which may access the malloc'd object or "bar" (so those objects must be merged); store "foo"
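The merging step can be sketched as a union-find over points-to sets: any two memory operations that may access the same object must land in the same group, and hence the same cluster memory. This is illustrative Python under the assumption that pointer-analysis results arrive as plain sets of object names, not the paper's actual data structures.

```python
def merge_memory_ops(points_to):
    """Group memory operations that may touch the same object, using
    points-to sets from interprocedural pointer analysis.  Ops in one
    group must be assigned to the same cluster memory.

    points_to: dict mapping op name -> set of objects it may access
    Returns a dict mapping op -> group representative.
    """
    parent = {op: op for op in points_to}

    def find(x):  # union-find root with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    owner = {}  # object -> first op seen touching it
    for op, objs in points_to.items():
        for obj in objs:
            if obj in owner:
                parent[find(op)] = find(owner[obj])  # merge groups
            else:
                owner[obj] = op
    return {op: find(op) for op in points_to}
```

On the slide's example, the store through b (points to the malloc'd object or bar) merges with both the malloc callsite and the load of bar, while the foo accesses form a separate group.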

Page 24: Compiler-directed Data Partitioning  for Multicluster Processors

Multicluster Compilation

• Previous techniques focused on operation partitioning [cite some papers]
• They ignore the issue of data object placement in memory
• They assume a shared memory accessible from each cluster

Page 25: Compiler-directed Data Partitioning  for Multicluster Processors

Phase 2: Computation Partitioning

• Observation: the global-level data partition is only half the solution:
  – It doesn't properly account for resource usage details
  – It doesn't consider code scheduling regions
• A second pass of partitioning is done locally on each basic block of the program
  – Memory operations are locked into specific clusters
• Uses the Region-based Hierarchical Operation Partitioner (RHOP)

Page 26: Compiler-directed Data Partitioning  for Multicluster Processors

Computation Partitioning Example

[Figure: BB1 dataflow graph (two loads, a store, adds, a multiply, and address computations) shown at three stages of the partitioning process]

• Memory operations from the first phase locked in place
• RHOP performs a detailed resource-cognizant computation partition
• Modified multi-level Kernighan-Lin algorithm using schedule estimates
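A much-simplified sketch of that refinement loop: memory operations stay locked to the clusters chosen by the data partition, and each unlocked operation is flipped to the other cluster when doing so improves the objective. Here the gain is just the reduction in intercluster edge cut; RHOP's actual partitioner is multi-level and driven by schedule estimates, so this Python is only a structural illustration with invented op names.

```python
def kl_refine(part, edges, locked, passes=4):
    """Kernighan-Lin-style refinement for a 2-cluster partition.

    part: dict mapping op -> cluster (0 or 1)
    edges: list of (u, v) dataflow edges
    locked: set of ops fixed by the data partition (memory ops)
    """
    part = dict(part)

    def gain(n):
        # Cut edges removed minus cut edges created by flipping n.
        g = 0
        for u, v in edges:
            if n in (u, v):
                other = v if u == n else u
                g += 1 if part[other] != part[n] else -1
        return g

    for _ in range(passes):
        moved = False
        for n in part:
            if n not in locked and gain(n) > 0:
                part[n] = 1 - part[n]  # flip to the other cluster
                moved = True
        if not moved:  # converged: no profitable single move left
            break
    return part
```

For example, an add stranded on cluster 1 between a load and a store that are both locked on cluster 0 gets pulled over to cluster 0, removing two intercluster transfers.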