eecs 583 – class 17 research topic 1 decoupled software pipelining

EECS 583 – Class 17Research Topic 1Decoupled Software Pipelining

University of Michigan

November 12, 2012

- 2 -

Announcements + Reading Material 1st paper review due today

» Should have submitted to andrew.eecs.umich.edu:/y/submit

Next Monday – Midterm exam in class» See room assignments

Today’s class reading» “Automatic Thread Extraction with Decoupled Software

Pipelining,” G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, Nov. 2005.

Next class» “Spice: Speculative Parallel Iteration Chunk Execution,” E.

Raman, N. Vachharajani, R. Rangan, and D. I. August, Proc 2008 Intl. Symposium on Code Generation and Optimization, April 2008.

- 3 -

Midterm Exam When: Monday, Nov 19, 2012, 10:40-12:30 Where

» 1003 EECS Uniquenames (not first or last names!) starting with A-E

» 1005 EECS Uniquenames starting with F-N

» 3150 Dow (our classroom) Uniquenames starting with P-Z

What to expect» Open book/notes, no laptops

» Apply techniques we discussed in class on examples

» Reason about solving compiler problems – why things are done

» A couple of thinking problems

» No LLVM code

» Reasonably long but you should finish

- 4 -

Midterm Exam Last 2 years exams are posted on the course website

» Note – Past exams may not accurately predict future exams!!

Office hours between now and Monday if you have questions» James: Tue, Thu, Fri: 1-3pm

» Scott: Tue 4:30-5:30pm, Fri 5-6pm (extra slot)

Studying» Yes, you should study even though its open notes

Lots of material that you have likely forgotten from early this semester Refresh your memories No memorization required, but you need to be familiar with the material to

finish the exam

» Go through lecture notes, especially the examples!

» If you are confused on a topic, go through the reading

» If still confused, come talk to me or James

» Go through the practice exams as the final step

- 5 -

Exam Topics Control flow analysis

» Control flow graphs, Dom/pdom, Loop detection

» Trace selection, superblocks

Predicated execution» Control dependence analysis, if-conversion, hyperblocks

» Can ignore control height reduction

Dataflow analysis» Liveness, reaching defs, DU/UD chains, available defs/exprs

» Static single assignment

Optimizations» Classical: Dead code elim, constant/copy prop, CSE, LICM, induction

variable strength reduction

» ILP optimizations - unrolling, renaming, tree height reduction, induction/accumulator expansion

» Speculative optimization – like HW 1

- 6 -

Exam Topics - Continued Acyclic scheduling

» Dependence graphs, Estart/Lstart/Slack, list scheduling

» Code motion across branches, speculation, exceptions

» Can ignore sentinel scheduling

Software pipelining» DSA form, ResMII, RecMII, modulo scheduling

» Make sure you can modulo schedule a loop!

» Execution control with LC, ESC

Register allocation» Live ranges, graph coloring

Research topics» Can ignore automatic parallelization

- 7 -

Last Class

Scientific codes – Successful parallelization» KAP, SUIF, Parascope, gcc w/ Graphite

» Affine array dependence analysis

» DOALL parallelization

C programs» Not dominated by array accesses – classic

parallelization fails

» Speculative parallelization – Hydra, Stampede, Speculative multithreading Profiling to identify statistical DOALL loops But not all loops DOALL, outer loops typically not!!

This class – Parallelizing loops with dependences

- 8 -

What About Non-Scientific Codes???

for(i=1; i<=N; i++) // C a[i] = a[i] + 1; // X

while(ptr = ptr->next) // LD ptr->val = ptr->val + 1; // X

Scientific Codes (FORTRAN-like) General-purpose Codes (legacy C/C++)

0

1

2

3

4

5

LD:1

X:1 LD:2

X:2

LD:4

X:4

LD:3

X:3

LD:5

X:5 LD:6

Cyclic Multithreading

(CMT)

Example: DOACROSS

[Cytron, ICPP 86]

Independent Multithreading (IMT)

Example: DOALL

parallelization

0

1

2

3

4

5

C:1

X:1

C:2

X:2

C:4

X:4

C:3

X:3

C:5

X:5

C:6

X:6

Core 1

Core 2

Core 1

Core 2

- 9 -

Alternative Parallelization Approaches

0

1

2

3

4

5

LD:1

X:1 LD:2

X:2

LD:4

X:4

LD:3

X:3

LD:5

X:5 LD:6

Core 1

Core 2


0

1

2

3

4

5

LD:1

LD:2 X:1

X:2

X:3

X:4

LD:3

LD:4

LD:5

LD:6 X:5

Core 1

Core 2Pipelined

Multithreading (PMT)

Example: DSWP[PACT 2004]

Cyclic Multithreadi

ng(CMT)

- 10 -

Comparison: IMT, PMT, CMT

0

1

2

3

4

5

C:1

X:1

C:2

X:2

C:4

X:4

C:3

X:3

C:5

X:5

C:6

X:6

Core 1

Core 2 0

1

2

3

4

5

LD:1

X:1 LD:2

X:2

LD:4

X:4

LD:3

X:3

LD:5

X:5 LD:6

Core 1

Core 2

CMTIMT

0

1

2

3

4

5

LD:1

LD:2 X:1

X:2

X:3

X:4

LD:3

LD:4

LD:5

LD:6 X:5

Core 1

Core 2

PMT

- 11 -


0

1

2

3

4

5

C:1

X:1

C:2

X:2

C:4

X:4

C:3

X:3

C:5

X:5

C:6

X:6

Core 1

Core 2

IMT

1 iter/cycle

0

1

2

3

4

5

LD:1

LD:2 X:1

X:2

X:3

X:4

LD:3

LD:4

LD:5

LD:6 X:5

Core 1

Core 2

PMT

1 iter/cyclelat(comm) = 1:

0

1

2

3

4

5

LD:1

X:1 LD:2

X:2

LD:4

X:4

LD:3

X:3

LD:5

X:5 LD:6

Core 1

Core 2

CMT

1 iter/cycle

- 12 -


0

1

2

3

4

5

C:1

X:1

C:2

X:2

C:4

X:4

C:3

X:3

C:5

X:5

C:6

X:6

Core 1

Core 2

IMT

0

1

2

3

4

5

LD:1

LD:2

X:1

X:2

X:3

X:4

LD:3

LD:4

LD:5

LD:6

Core 1

Core 2

PMT

1 iter/cyclelat(comm) = 1: 1 iter/cycle1 iter/cycle1 iter/cyclelat(comm) = 2: 0.5 iter/cycle1 iter/cycle

0

1

2

3

4

5

LD:1

X:1

LD:2

X:2

LD:3

X:3

Core 1

Core 2

CMT

- 13 -


0

1

2

3

4

5

C:1

X:1

C:2

X:2

C:4

X:4

C:3

X:3

C:5

X:5

C:6

X:6

Core 1 Core 2

IMT

0

1

2

3

4

5

LD:1

LD:2

X:1

X:2

X:3

X:4

LD:3

LD:4

LD:5

LD:6

Core 1 Core 2

PMT

0

1

2

3

4

5

LD:1

X:1

LD:2

X:2

LD:3

X:3

Core 1 Core 2

CMT

Cross-thread Dependences Wide Applicability

Thread-local Recurrences Fast Execution

- 14 -

Our Objective: Automatic Extraction of Our Objective: Automatic Extraction of Pipeline Parallelism using DSWPPipeline Parallelism using DSWP

FindEnglish

Sentences

ParseSentences

(95%)

EmitResults

Decoupled Software Pipelining PS-DSWP (Spec DOALL Middle Stage)

197.parser

Decoupled Software Pipelining

- 16 -

Decoupled Software Pipelining (DSWP)

A: while(node) B: ncost = doit(node);C: cost += ncost;D: node = node->next;

Inter-thread communication latency is a one-time cost

intra-iteration

loop-carried

register

control

communication queue

[MICRO 2005]

DependenceGraph

DAGSCCThread 1 Thread 2

D

B

C

A

A D

B

C

A

DB

C

- 17 -

Implementing DSWPL1:

Aux:

DFG

intra-iteration

loop-carried

register

memory

control

- 18 -

Optimization: Node SplittingTo Eliminate Cross Thread Control

L1

L2

intra-iteration

loop-carried

register

memory

control

- 19 -

Optimization: Node Splitting To Reduce Communication L1

L2

intra-iteration

loop-carried

register

memory

control

- 20 -

Constraint: Strongly Connected Components

Solution: DAGSCC

Consider:

intra-iteration

loop-carried

register

memory

control

Eliminates pipelined/decoupled property

- 21 -

2 Extensions to the Basic Transformation

Speculation» Break statistically unlikely dependences

» Form better-balanced pipelines

Parallel Stages» Execute multiple copies of certain “large” stages

» Stages that contain inner loops perfect candidates

- 22 -

Why Speculation?

A: while(node) B: ncost = doit(node);C: cost += ncost;D: node = node->next;

DependenceGraph D

B

C

A

DAGSCC A D

B

C

intra-iteration

loop-carried

register

control

communication queue

- 23 -

Why Speculation?

A: while(cost < T && node) B: ncost = doit(node);C: cost += ncost;D: node = node->next;

DependenceGraph D

B

C

A

DAGSCC A D

B

C

A B C D

PredictableDependenc

es

intra-iteration

loop-carried

register

control

communication queue

- 24 -

Why Speculation?


DependenceGraph D

B

C

A

DAGSCC D

B

C

PredictableDependenc

es

Aintra-iteration

loop-carried

register

control

communication queue

- 25 -

Execution Paradigm

Misspeculationdetected

DAGSCC D

B

C

A

Misspeculation RecoveryRerun Iteration 4

intra-iteration

loop-carried

register

control

communication queue

- 26 -

Understanding PMT Performance

0

1

2

3

4

5

A:1

A:2 B:1

B:2

B:3

B:4

A:3

A:4

A:5

A:6 B:5

Core 1

Core 2 0

1

2

3

4

5

A:1B:1

C:1

C:3

A:2B:2

A:3B:3

Core 1

Core 2

Idle

T

ime

1 cycle/iterSlowest thread:

Iteration Rate:1 iter/cycle

2 cycle/iter

0.5 iter/cycle

)max( itT

1. Rate ti is at least as large as the longest dependence recurrence.

2. NP-hard to find longest recurrence.

3. Large loops make problem difficult in practice.

- 27 -

Selecting Dependences To Speculate


DependenceGraph D

B

C

A

DAGSCC D

B

C

A

Thread 1

Thread 2

Thread 3

Thread 4intra-iteration

loop-carried

register

control

communication queue

- 28 -

Detecting Misspeculation

DAGSCC D

B

C

A

A1: while(consume(4)) D : node = node->next produce({0,1},node);T

hre

ad

1

A3: while(consume(6)) B3: ncost = consume(2);C : cost += ncost; produce(3,cost);T

hre

ad

3

A2: while(consume(5)) B : ncost = doit(node); produce(2,ncost);D2: node = consume(0);T

hre

ad

2

A : while(cost < T && node)B4: cost = consume(3); C4: node = consume(1); produce({4,5,6},cost < T && node);

Th

read

4

- 29 -


DAGSCC D

B

C

A

A1: while(TRUE) D : node = node->next produce({0,1},node);T

hre

ad

1

A3: while(TRUE) B3: ncost = consume(2);C : cost += ncost; produce(3,cost);T

hre

ad

3

A2: while(TRUE) B : ncost = doit(node); produce(2,ncost);D2: node = consume(0);T

hre

ad

2

A : while(cost < T && node)B4: cost = consume(3); C4: node = consume(1); produce({4,5,6},cost < T && node);

Th

read

4

- 30 -


DAGSCC D

B

C

A

A1: while(TRUE) D : node = node->next produce({0,1},node);T

hre

ad

1

A3: while(TRUE) B3: ncost = consume(2);C : cost += ncost; produce(3,cost);T

hre

ad

3

A2: while(TRUE) B : ncost = doit(node); produce(2,ncost);D2: node = consume(0);T

hre

ad

2

A : while(cost < T && node)B4: cost = consume(3); C4: node = consume(1); if(!(cost < T && node)) FLAG_MISSPEC();

Th

read

4

- 31 -

Breaking False Memory Dependences

MemoryVersion 3Memory

Version 3

Oldest VersionCommitted by

Recovery ThreadDependence

Graph D

B

C

A

intra-iteration

loop-carried

register

control

communication queue

false memory

- 32 -

Adding Parallel Stages to DSWP

LD = 1 cycleX = 2 cycles


ThroughputDSWP: 1/2 iteration/cycleDOACROSS: 1/2 iteration/cyclePS-DSWP: 1 iteration/cycle

Comm. Latency = 2 cycles

0

1

2

3

4

5

LD:1

LD:2

X:1

X:3

LD:3

LD:4

LD:5

LD:6

X:5

6

7

LD:7

LD:8

Core 1 Core 2 Core 3

X:2

X:4

- 33 -

p = list; sum = 0;A: while (p != NULL) {B: id = p->id;E: q = p->inner_list;C: if (!visited[id]) {D: visited[id] = true;F: while (foo(q))G: q = q->next;H: if (q != NULL)I: sum += p->value; }J: p = p->next; }

10

10

10

10

55

50

50

5

3 Reduction

Thread Partitioning

- 34 -

Thread Partitioning: DAGSCC

- 35 -

Thread Partitioning

Merging Invariants

• No cycles• No loop-carried dependence inside a doall node

20

10

15

5

100

5

3

- 36 -

Treated as sequential

20

10

15

113

Thread Partitioning

- 37 -

45

113

Thread Partitioning

Modified MTCG[Ottoni, MICRO’05] to generate code from partition

- 38 -

Discussion Point 1 – Speculation How do you decide what dependences to speculate?

» Look solely at profile data?

» How do you ensure enough profile coverage?

» What about code structure?

» What if you are wrong? Undo speculation decisions at run-time?

How do you manage speculation in a pipeline?» Traditional definition of a transaction is broken

» Transaction execution spread out across multiple cores

- 39 -

Discussion Point 2 – Pipeline Structure When is a pipeline a good/bad choice for parallelization?

Is pipelining good or bad for cache performance?» Is DOALL better/worse for cache?

Can a pipeline be adjusted when the number of available cores increases/decreases?

eecs 583 – class 17 research topic 1 decoupled software pipelining

Documents

class parallelizing

loops doall

practice exams

years exams

future exams

automatic parallelization

execution control

outer loops