
Slide 1

Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures

Michael Chu (Microsoft), Rajiv Ravindran (HP), Scott Mahlke

University of Michigan

Slide 2

Multicore is Here

• Power-efficient design
• Decrease core complexity, increase number of cores
• Examples: Intel Core 2 Duo (2006), AMD Athlon 64 X2 (2005), Sun Niagara (2005)

Image source: Intel

Slide 3

Compiling in the Multicore Era

• Coarse-grain vs fine-grain parallelism

[Figure: the same dataflow graph of fine-grain operations (LD, ST, shifts, arithmetic) shown mapped across large grids of cores, and across two cores (Core 1, Core 2) with register files (RF) connected by an operand network]

[Software Queues, PMUP 06], [Scalar Operand Network, HPCA 05]

Slide 4

Objectives of this Work

• Goal: detect and exploit available fine-grain parallelism
• Compiler support is key for good performance
  – Divide computation operations and data across cores
  – Maximize direct access to the values each core needs

[Figure: two cores (Core 1, Core 2) with private caches; arrays int x[100] and int y[100] are placed in separate caches, and the operations that use them (loads, shifts, adds) run on the corresponding core]

Slide 5

Data/Computation Partitioning

• Separated data partitioning from computation partitioning
  – First partition the data, then use that partition to guide the partitioning of computation

[Figure: compilation flow; the program goes through profile-driven data partitioning, producing binding decisions for memory operations; region selection then feeds region-level computation partitioning, which produces a partition assignment for all ops]

Slide 6

Data Partitioning for Caches

• Goals:
  – Maximize parallel computation
  – Reduce stall cycles
    • Coherence traffic
    • Conflicts/misses
• Static analysis¹
  – Object granularity
• Profile-driven partitioning
  – Memory instruction granularity

[Figure: two cores (Core 1, Core 2) with private caches connected by a coherence network]

¹ Compiler-directed Object Partitioning for Multicluster Processors [CGO 06]

Slide 7

Memory Access Graph

• Nodes: memory operations in the program
  – Node weight: working set size
• Edges: relationships between memory operations
  – Edge weight: affinity

[Figure: memory access graph spanning basic blocks BB1 and BB2]

Slide 8

Node Weight: Working Set Estimate

• Weighted average of the number of cache blocks an operation requires

[Figure: excerpt of the memory profile: ... Load 2 Address 5, Load 6 Address 3, Load 1 Address 1, Store 2 Address 5, Load 3 Address 3, Load 2 Address 4, Load 1 Address 2, Load 1 Address 1, Store 3 Address 7, Load 2 Address 8 ...; this excerpt requires 5 cache blocks, and the access-graph nodes are annotated with working-set weights (1, 2, 2, 3, 10, 11)]
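One way to read this metric, as a hedged sketch (the names and block size below are assumptions, not the paper's implementation): walk the memory profile and, for each static memory operation, count the distinct cache blocks it touches.

    from collections import defaultdict

    BLOCK_SIZE = 32  # bytes per cache block (assumed)

    def working_set_weights(trace):
        """trace: iterable of (op_id, address) pairs from a memory profile.
        Node weight for each memory op = number of distinct cache blocks
        it touches in the profile (a simple working-set estimate)."""
        blocks = defaultdict(set)
        for op, addr in trace:
            blocks[op].add(addr // BLOCK_SIZE)
        return {op: len(bs) for op, bs in blocks.items()}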

Slide 9

Edge Weight: Memory Op Affinity

• Finds relationships between memory accesses
• Positive affinity: possible hit
• Negative affinity: possible miss

[Figure: a sliding window moves over the profiled access stream; affinity is accumulated between the current memory access and the ops inside the window

Memory Op:     LD1 LD2 LD4 ST1 LD3 LD2 ST1 LD2 LD1 LD2 LD4 LD2 LD4 LD4
Block Address: B1  B2  B3  B4  B4  B2  B4  B1  B2  B4  B3  B3  B4  B1
Cache Line:    C1  C2  C1  C2  C2  C2  C2  C1  C2  C2  C1  C1  C2  C2  ]
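A hedged sketch of how such affinities might be accumulated (window length, weighting, and the line mapping are assumptions): ops touching the same block inside the window gain positive affinity, since co-locating them in one cache could turn the later access into a hit; ops whose blocks map to the same cache line gain negative affinity, since co-locating them risks conflict misses.

    from collections import defaultdict

    def affinities(trace, num_lines, window=8):
        """trace: list of (op_id, block_addr). Returns affinity per op pair:
        positive for same-block reuse, negative for same-line conflicts."""
        aff = defaultdict(int)
        for i, (op, blk) in enumerate(trace):
            for prev_op, prev_blk in trace[max(0, i - window):i]:
                if prev_op == op:
                    continue
                pair = tuple(sorted((op, prev_op)))
                if prev_blk == blk:
                    aff[pair] += 1           # possible hit if co-located
                elif prev_blk % num_lines == blk % num_lines:
                    aff[pair] -= 1           # possible conflict miss
        return dict(aff)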

Slide 10

Partitioning of Memory Access Graph

• Prevent overloading of node weights
  – Reduces cache conflicts
• Minimize positive affinity cuts
  – Localizes accesses to caches
• Cut highly negative affinity edges

[Figure: the weighted memory access graph cut into two cache partitions]
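These objectives fit any balanced graph partitioner. As a rough illustration only (the paper's partitioner is more sophisticated), a greedy bisection using the node weights and affinities sketched above might look like this:

    def partition_graph(weights, aff, capacity):
        """Greedily place each memory op on the cache where it has the
        most accumulated positive affinity, unless that cache's total
        working-set weight would exceed its capacity."""
        part, load = {}, [0, 0]
        for n in sorted(weights, key=weights.get, reverse=True):
            gain = [0, 0]
            for (a, b), w in aff.items():
                other = b if a == n else a if b == n else None
                if other in part:
                    gain[part[other]] += w
            side = 0 if gain[0] >= gain[1] else 1
            if load[side] + weights[n] > capacity:   # avoid overloading
                side = 1 - side
            part[n] = side
            load[side] += weights[n]
        return part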

Slide 11

Computation Partitioning

• Region-based Hierarchical Operation Partitioner (RHOP) [PLDI 03]
  – Modified multilevel-FM graph partitioning algorithm
• Any standard operation partitioner could be used
  – Must transfer the data decisions down

[Figure: a program region (e.g., the body of main) is turned into a weighted operation graph by weight calculation, then split by graph partitioning]

Slide 12

Dealing with Prebound Data

• The data partition locks memory operations in place
• Don't move memory operations
  – Conveys the global data partition down to RHOP

[Figure: basic block BB1 with its memory operations prebound to clusters before computation partitioning]

Slide 13

Profile-driven Partitioning

[Figure: basic blocks BB1 and BB2, with memory and non-memory operations colored by their assignment to Core 1 or Core 2]

Slide 14

Experimental Methodology

• Trimaran compiler toolset
• Profiled and ran with different input sets
• Machine model:
  – 2, 4 cores
  – 512 B, 1 kB, 4 kB, 8 kB caches per core
  – 1, 2, 3 cycle operand network move latency per hop

Slide 15

Coherence Traffic Reduction

[Bar chart: % reduction in snoops (0-100) per benchmark, grouped as kernels (lyapunov, fsed, sobel, channel, ...), Mediabench (cjpeg, djpeg, g721encode/decode, gsmencode/decode, mpeg2enc/dec, rawcaudio, rawdaudio), SPECint (164.gzip, 175.vpr, 181.mcf, 256.bzip2, 300.twolf), SPECfp (171.swim, 172.mgrid, 177.mesa, 188.ammp), and the average]

Slide 16

Stall Cycle Reduction

[Bar chart: % reduction in stall cycles (-20 to 100) per benchmark for 512 B, 1 kB, 4 kB, and 8 kB caches, across kernels, Mediabench, SPECint, and SPECfp, plus averages for 2-cycle and 3-cycle operand networks]

Slide 17

Speedup Over Single Core

[Bar chart: speedup (0.8-2.0) per benchmark for three schemes (data incognizant, data partitioned, unified), across kernels, Mediabench, SPECint, SPECfp, and the average]

Slide 18

Conclusion

• Fine-grain parallelism can be exploited
  – Instruction-level parallelism is not dead
  – Coarse-grain parallelism is still important!
• Data-cognizant partitioning can improve performance
  – Reduces stall cycles by 51%
  – Reduces coherence traffic by 87%

Slide 19

Thank You

http://cccp.eecs.umich.edu

Slide 20

Backup

Slide 21

2-Core Speedup vs 4-Core Speedup

[Bar chart: speedup (0.0-2.2) per benchmark for 2-PE and 4-PE configurations, across kernels, Mediabench, SPECint, SPECfp, and the average]

Slide 22

Speedup with Four Cores

[Bar chart: speedup (0.5-3.0) per benchmark for random, data partitioned, and unified schemes, across kernels, Mediabench, SPEC, and the average]

Slide 23

Multicore is the Past

• Embedded processors:
  – Low power constraints
  – High performance requirements
• Examples:
  – Multiflow Trace [1992]
  – TI C6x [1997]
  – Lx/ST200 [2000]
  – Philips TM1300 [1999]
  – MIT Raw [1997]

[Figure: progression from a single processor (one register file, one data memory) to a two-cluster processor (separate register files, intercluster communication network) to multicore (separate data caches with a coherence network)]

Slide 24

Basics of Computation Partitioning

• Goal: minimize schedule length
• Strategy:
  – Exploit parallel resources
  – Minimize critical intercluster communication

[Figure: a small dataflow graph (+, >>, *, &, +) split across Cluster 1 and Cluster 2, with an added intercluster move carrying a value over the intercluster communication network]
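The cost being traded off here is the number of intercluster moves a partition induces; a minimal sketch (names assumed):

    def intercluster_moves(edges, part):
        """Each dataflow edge whose producer and consumer sit on
        different clusters needs an explicit intercluster move."""
        return sum(1 for (u, v) in edges if part[u] != part[v])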

Slide 25

Problem #1: Local vs Region Scope

[Figure: the same dataflow graph (ops 1-12) clustered with local scope vs region scope; local-scope clustering inserts intercluster moves that stretch the schedule over more cycles, while region-scope clustering keeps tightly coupled operations together]

Examples: [BUG, UAS, B-ITER] vs [CARS, Aletà '01]

Slide 26

Problem #2: Scheduler-centric

• Cluster assignment during scheduling adds complexity
• Detailed resource models/reservation tables are slow
• Forces local decisions
• Examples: [BUG, UAS]

[Figure: a dataflow graph being placed cycle by cycle into per-cluster reservation tables]

Slide 27

Our Approach

• Opposite approach to conventional clustering
• Hierarchical region view
  – Graph partitioning strategy
  – Identify tightly coupled operations and treat them uniformly
• Non-scheduler-centric mindset
  – Pre-scheduling technique
  – Estimates schedule length
• Advantages:
  – Efficient, hierarchical, doesn't complicate the scheduler

Slide 28

Region-based Hierarchical Operation Partitioning (RHOP)

• Code is considered one region at a time
• Weight calculation creates guides for good partitions
• Partitioning groups operations into clusters based on the given weights

[Figure: a program region is turned into a weighted operation graph by weight calculation, then split by graph partitioning]

Region-based Hierarchical Operation Partitioning for Multicluster Processors [PLDI 03]

Slide 29

Node Weight

• Nodes: operations in the scheduling region
• Metric for resource usage
  – Resources used by a single operation per cycle
• Used to estimate overcommitment of resources

    op_wgt_c = 1 / (number of ops that can execute on cluster c in 1 cycle)

[Figure: a cluster with I, F, M, B function units and a register file; the weight accounts for the dedicated FUs an operation can use]

Slide 30

Edge Weight

• Edges: data flow between operations
• Slack distribution allocates slack to certain edges
  – Edge slack identifies preferred places to cut edges
  – First-come, first-served method used

[Figure: a dataflow graph (ops 1-14) with per-edge slack values (0, 1, 2) and high/medium/low cut-preference labels]
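Edge slack here is the standard scheduling notion. A sketch, assuming explicit per-op latencies and a DAG given as successor lists (my framing, not the paper's code):

    def edge_slacks(succs, latency):
        """Slack of edge (u, v) = ALAP(v) - (ASAP(u) + latency[u]).
        Zero-slack edges lie on a critical path; high-slack edges are
        the preferred places to cut when partitioning."""
        order, seen = [], set()
        def visit(u):
            if u not in seen:
                seen.add(u)
                for v in succs.get(u, []):
                    visit(v)
                order.append(u)
        for u in list(succs):
            visit(u)
        order.reverse()                      # topological order
        asap = {u: 0 for u in order}
        for u in order:
            for v in succs.get(u, []):
                asap[v] = max(asap[v], asap[u] + latency[u])
        length = max(asap[u] + latency[u] for u in order)
        alap = {u: length - latency[u] for u in order}
        for u in reversed(order):            # children before parents
            for v in succs.get(u, []):
                alap[u] = min(alap[u], alap[v] - latency[u])
        return {(u, v): alap[v] - (asap[u] + latency[u])
                for u in order for v in succs.get(u, [])}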

Slide 31

RHOP - Partitioning Phase

• Modified multilevel-FM algorithm [Fiduccia '82]
• Multilevel graph partitioning consists of two stages:
  1. Coarsening stage
  2. Refinement stage

[Figure: the operation graph is coarsened by merging tightly coupled nodes, bisected into Cluster 1 and Cluster 2, then refined back at finer levels]

Slide 32

Cluster Refinement

• 3 questions to answer:
  1. How good is the current partition?
  2. Which cluster should operations move from?
  3. How profitable is it to move X from cluster A to B?

[Figure: a candidate operation on Cluster 1 considered for a move to Cluster 2]
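Question 3 is the classic Fiduccia-Mattheyses gain computation. A simplified, greedy sketch (real FM also makes tentative negative-gain moves and rolls back; the names here are assumptions):

    def gain(n, part, adj):
        """Cut-weight reduction if op n switches clusters: edge weight
        to the other cluster minus edge weight to its own cluster."""
        return sum(w if part[m] != part[n] else -w
                   for m, w in adj[n].items())

    def refine(part, adj, weight, capacity, load):
        """One refinement pass: keep moving the best positive-gain op
        to the other cluster while the capacity estimate allows it."""
        moved = set()
        while True:
            candidates = [n for n in part if n not in moved]
            if not candidates:
                break
            best = max(candidates, key=lambda n: gain(n, part, adj))
            if gain(best, part, adj) <= 0:
                break
            moved.add(best)
            dst = 1 - part[best]
            if load[dst] + weight[best] > capacity:
                continue                     # move would overload dst
            load[part[best]] -= weight[best]
            load[dst] += weight[best]
            part[best] = dst
        return part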

Slide 33

Resource Overcommitment Estimate

[Figure: a candidate move evaluated on the schedule estimate; before the move, per-cycle resource usage gives Cluster_wgt1 = 5.0 (2.5 + 2.0 + 0.5) and Cluster_wgt2 = 0.67; after moving the subgraph to Cluster 2, Cluster_wgt1 = 1.0 and Cluster_wgt2 = 4.5]
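A sketch of such an estimate (my framing of the metric on this slide, with assumed helper names): each op contributes op_wgt = 1 / (ops the cluster can issue per cycle) at its estimated cycle; summing per cycle and then per cluster yields Cluster_wgt values like those above.

    from collections import defaultdict

    def cluster_weights(placement, est_cycle, issue_width):
        """placement: {op: cluster}; est_cycle: {op: estimated cycle};
        issue_width: {cluster: ops issuable per cycle}.
        Returns per-cycle usage and the total weight per cluster."""
        per_cycle = defaultdict(float)
        for op, c in placement.items():
            per_cycle[(c, est_cycle[op])] += 1.0 / issue_width[c]
        total = defaultdict(float)
        for (c, _), u in per_cycle.items():
            total[c] += u
        return per_cycle, dict(total)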

Slide 34

RHOP Computation Partitioning

• Improves over local algorithms
  – Prescheduling technique
  – Schedule length estimator
  – Combines slack distribution with multilevel partitioning
• Performs better as the number of resources increases

[Bar chart: average improvement over BUG (about -2% to 16%) for 2-cluster and 4-cluster machines with varying ALU counts]

Slide 35

Profile-driven Data Partitioning

• Static analysis: coarse-grain partition
  – Object granularity
• Profile: fine-grain partition
  – Memory instruction granularity
• Global view for data
  – Consider memory relationships throughout the program

[Figure: a static data partition at object granularity vs a profile-guided partition that can split accesses to struct foo and int bar across memories]

Slide 36

Communication Unaware Partitioning

• 8-wide centralized vs two-cluster, 1-cycle communication

Slide 37

Overall Performance Improvement

[Bar chart: % improvement in performance (-10 to 30) per benchmark for 512 B, 1 kB, 4 kB, and 8 kB caches, across kernels and Mediabench, plus averages for 2-cycle and 3-cycle operand networks]

Slide 38

Next Steps: Auto Parallelization

• Integrate fine- and coarse-grain parallelism
  – Better take advantage of hundreds of cores
• Future strategies for general parallelization
  – Analyze memory/data access patterns between threads
  – Break and untangle dependencies
  – New parallel programming models
• Balance data partitioning decisions with computation

Slide 39

Edge Weight: Memory Op Affinity

• Finds relationships between memory accesses
• Positive affinity: possible hit
• Negative affinity: possible miss

[Figure: the same sliding-window affinity example as Slide 9]

Slide 40

RHOP Computation Partitioning

• Improves over local algorithms
  – Prescheduling technique
  – Estimates of schedule length used instead of a scheduler
  – Combines slack distribution with multilevel partitioning
• Performs better as the number of resources increases

Average improvement:

    Machine                RHOP vs BUG
    2-cluster 8-issue       -1.8%
    2-cluster 10-issue       3.7%
    4-cluster 16-issue      14.3%
    4-cluster 18-issue      15.3%
    4-cluster H 13-issue     8.0%

Slide 41

2 Cluster Results vs 1 Cluster

[Bar chart: performance relative to a 1-cluster machine (0.4-1.0) for BUG and RHOP across DSP kernels (atmcell, channel, dct, fir, fsed, halftone, heat, huffman, LU, lyapunov, rls, sobel) and SPECint (164.gzip, 175.vpr, 181.mcf, 197.parser, 253.perlbmk, 254.gap, 255.vortex, 256.bzip, 300.twolf), plus the average]

Slide 42

4 Cluster Results vs 1 Cluster

[Bar chart: performance relative to a 1-cluster machine (0.4-1.0) for BUG and RHOP across the same DSP kernels and SPECint benchmarks, plus the average]

Slide 43

Experimental Evaluation

• Trimaran toolset: a retargetable VLIW compiler
• Evaluated DSP kernels and SPECint2000
• Baseline: centralized processor with the sum of the clustered resources
• 64 registers per cluster
• Latencies similar to Itanium
• Perfect caches

    Name                   Configuration
    2-cluster 8-issue      2 homogeneous clusters; 1 I, 1 F, 1 M, 1 B per cluster
    2-cluster 10-issue     2 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
    4-cluster 16-issue     4 homogeneous clusters; 1 I, 1 F, 1 M, 1 B per cluster
    4-cluster 18-issue     4 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
    4-cluster H 13-issue   4 heterogeneous clusters; IM, IF, IB, and IMF clusters

Slide 44

Architectural Model

• Solve the easier problem first: computation partitioning
  – Assume a centralized, shared data memory
  – Accessible from each cluster with uniform latency

[Figure: two clusters (I, F, M units each) sharing a centralized data memory]

Slide 45

Architectural Model

• Use scratchpad memories
  – Compiler-controlled memory
  – Each cluster has one memory
  – Each object is placed in one specific memory
  – A data object stays in that memory for the lifetime of the program

[Figure: two clusters with private scratchpads; int x[100], foo, and int y[100] each live in one memory]

Slide 46

Problem: Partitioning of Data

• Determine object placement into data memories
• Limited by:
  – Memory sizes/capacities
  – Computation operations related to the data
• Partitioning is relevant to both caches and scratchpad memories

[Figure: objects int x[100], struct foo, and int y[100] to be placed across Data Mem 1 and Data Mem 2]

Slide 47

Data Unaware Partitioning

• Lose an average of 30% performance by ignoring data

Slide 48

Our Objective

• Goal: produce efficient code
• Strategy:
  – Partition data objects and computation operations
    • Consider the interplay between data/computation decisions
    • Minimize intercluster transfers
  – Evenly distribute data objects
    • Improve memory bandwidth
    • Maximize parallelism

[Figure: objects int x[100], struct foo, and int y[100] distributed across the two memories]

Slide 49

Pass 1: Global Data Partitioning

• Determine memory relationships
  – Pointer analysis and profiling of memory
• Build a program-level graph representation of all operations
• Perform data object / memory operation merging
  – Respect the correctness constraints of the program

Flow: Step 1: interprocedural pointer analysis and memory profile → Step 2: build program data graph → Step 3: merge memory operations → Step 4: METIS graph partitioner

Slide 50

Global Data Graph Representation

• Nodes: operations, either memory or non-memory
  – Memory operations: loads, stores, malloc callsites
• Edges: data flow between operations
• Node weight: data object size
  – Sum of data sizes for referenced objects
• Object size determined by:
  – Globals/locals: pointer analysis
  – Malloc callsites: memory profile

[Figure: graph with nodes for int x[100], struct foo, and malloc site 1, weighted by object sizes (400 bytes, 1 Kbyte, 200 bytes)]
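Steps 2-4 of the previous slide can be sketched with pymetis, one Python binding for the METIS partitioner; the graph below is a hypothetical toy example, and the paper's actual graph construction is richer than this:

    import pymetis  # Python binding for the METIS graph partitioner

    # Toy data graph: nodes are memory ops / objects, adjacency lists
    # encode data-flow edges, vertex weights are referenced object sizes.
    adjacency = [[1, 2], [0, 2], [0, 1, 3], [2]]
    sizes = [400, 1024, 200, 400]        # bytes per node's objects

    cut, parts = pymetis.part_graph(2, adjacency=adjacency, vweights=sizes)
    # parts[i] in {0, 1}: the data memory assigned to node i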

Slide 51

Experimental Methodology

• Compared to:

    Scheme          Data Partitioning                       Computation Partitioning
    Global          Global-view, data-centric               Knows data location
    Greedy          Region-view, greedy computation-centric Knows data location
    Data Unaware    None, assume unified memory             Assumes unified memory
    Unified Memory  N/A                                     Unified memory

• 2 clusters:
  – One integer, float, memory, and branch unit per cluster
• All results relative to a unified, dual-ported memory

Slide 52

Performance: 1-cycle Remote Access

[Chart: speedup per benchmark, with the unified-memory configuration shown as the reference]

Slide 53

Performance: 10-cycle Remote Access

[Chart: speedup per benchmark, with the unified-memory configuration shown as the reference]

Slide 54

Global Data Partitioning

• Global data partitioning
  – Data placement as a first-order design principle
  – Global, data-centric partition of computation
  – Phase-ordered approach
    • Global view for decisions on data
    • Region view for decisions on computation
• Achieves 95.2% of unified-memory performance on partitioned memories

Slide 55

GDP for Cache Memories

• Used the scratchpad GDP algorithm with caches
• Improvement comes from coherence traffic reduction
• rls and fsed suffer because data is not replicated

Slide 56

Decentralized Architectures

• Multicluster architecture
  – Smaller, faster register files
  – Transfers data through an interconnect network
  – Cycle time, area, and power benefits
• Examples:
  – TI C6x
  – Analog TigerSharc
  – Alpha 21264
  – MIT Raw

[Figure: two clusters with separate register files and data memories joined by an intercluster communication network]

Slide 57

Clustered Architectures

[Figure: a unicluster processor (one register file, one data memory), a wider unicluster, and a two-cluster design with separate register files; clustering trades schedule length against clock rate]