EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining University of Michigan November 9, 2011

Page 1: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

University of Michigan

November 9, 2011

Page 2: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Announcements + Reading Material

2nd paper review due today
» Should have submitted to andrew.eecs.umich.edu:/y/submit

Next Monday – Midterm exam in class

Today’s class reading
» “Automatic Thread Extraction with Decoupled Software Pipelining,” G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, Nov. 2005.

Next class reading (Wednesday, Nov 16)
» “Spice: Speculative Parallel Iteration Chunk Execution,” E. Raman, N. Vachharajani, R. Rangan, and D. I. August, Proc. 2008 Intl. Symposium on Code Generation and Optimization, April 2008.

Page 3: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Midterm Exam

When: Monday, Nov 14, 2011, 10:40-12:30

Where
» 1005 EECS: uniquenames starting with A-H go here
» 3150 Dow (our classroom): uniquenames starting with I-Z go here

What to expect
» Open book/notes, no laptops
» Apply techniques we discussed in class on examples
» Reason about solving compiler problems – why things are done
» A couple of thinking problems
» No LLVM code
» Reasonably long but you should finish

Last 2 years' exams are posted on the course website
» Note – Past exams may not accurately predict future exams!!

Page 4: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Midterm Exam

Office hours between now and Monday if you have questions
» Daya: Thurs and Fri 3-5pm
» Scott: Wednesday 4:30-5:30, Fri 4:30-5:30

Studying
» Yes, you should study even though it's open notes
   Lots of material that you have likely forgotten
   Refresh your memories
   No memorization required, but you need to be familiar with the material to finish the exam
» Go through lecture notes, especially the examples!
» If you are confused on a topic, go through the reading
» If still confused, come talk to me or Daya
» Go through the practice exams as the final step

Page 5: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Exam Topics

Control flow analysis
» Control flow graphs, Dom/pdom, Loop detection
» Trace selection, superblocks

Predicated execution
» Control dependence analysis, if-conversion, hyperblocks
» Can ignore control height reduction

Dataflow analysis
» Liveness, reaching defs, DU/UD chains, available defs/exprs
» Static single assignment

Optimizations
» Classical: Dead code elim, constant/copy prop, CSE, LICM, induction variable strength reduction
» ILP optimizations - unrolling, renaming, tree height reduction, induction/accumulator expansion
» Speculative optimization – like HW 1

Page 6: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Exam Topics - Continued

Acyclic scheduling
» Dependence graphs, Estart/Lstart/Slack, list scheduling
» Code motion across branches, speculation, exceptions

Software pipelining
» DSA form, ResMII, RecMII, modulo scheduling
» Make sure you can modulo schedule a loop!
» Execution control with LC, ESC

Register allocation
» Live ranges, graph coloring

Research topics
» Can ignore these

Page 7: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Last Class

Scientific codes – Successful parallelization
» KAP, SUIF, Parascope, gcc w/ Graphite
» Affine array dependence analysis
» DOALL parallelization

C programs
» Not dominated by array accesses – classic parallelization fails
» Speculative parallelization – Hydra, Stampede, Speculative multithreading
   Profiling to identify statistical DOALL loops
   But not all loops DOALL, outer loops typically not!!

This class – Parallelizing loops with dependences

Page 8: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

What About Non-Scientific Codes???

Scientific Codes (FORTRAN-like)

for(i=1; i<=N; i++) // C
  a[i] = a[i] + 1; // X

General-purpose Codes (legacy C/C++)

while(ptr = ptr->next) // LD
  ptr->val = ptr->val + 1; // X

[Figure: two-core schedules over cycles 0-5. Independent Multithreading (IMT), example: DOALL parallelization, with C:i and X:i of alternating iterations on Core 1 and Core 2. Cyclic Multithreading (CMT), example: DOACROSS [Cytron, ICPP 86], with LD:i and X:i of alternating iterations on Core 1 and Core 2.]
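To make the contrast between the two loops concrete, here is a minimal C sketch. The `Node` type and the helper names `increment_range` and `increment_list` are assumptions for illustration and do not appear in the slides. The array loop can be cut into index ranges that touch disjoint elements, which is what makes it DOALL. The list loop cannot begin its next iteration until the load of `ptr->next` has completed.

```c
#include <stddef.h>

/* Hypothetical list node; the slides do not define one. */
typedef struct Node {
    int val;
    struct Node *next;
} Node;

/* Array loop from the slide, restricted to the index range [lo, hi].
 * Each iteration touches only a[i], so disjoint ranges are independent
 * and could be handed to different cores (DOALL). */
void increment_range(int *a, int lo, int hi) {
    for (int i = lo; i <= hi; i++)
        a[i] = a[i] + 1;
}

/* List loop from the slide. Reaching the next element requires the load
 * ptr->next from the current one (LD), so the traversal is a loop-carried
 * recurrence and cannot be split by index. As in the slide, the loop
 * advances before updating, so the head node itself is not modified. */
void increment_list(Node *ptr) {
    while ((ptr = ptr->next) != NULL)
        ptr->val = ptr->val + 1;
}
```

Calling `increment_range(a, 1, 3)` and `increment_range(a, 4, 6)` in either order gives the same result as one call over `[1, 6]`, which is the property a DOALL parallelizer relies on.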

Page 9: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Alternative Parallelization Approaches

while(ptr = ptr->next) // LD
  ptr->val = ptr->val + 1; // X

[Figure: two-core schedules of the loop over cycles 0-5. Cyclic Multithreading (CMT): LD:i and X:i of alternating iterations on Core 1 and Core 2. Pipelined Multithreading (PMT), example: DSWP [PACT 2004]: Core 1 runs LD:1 through LD:6, Core 2 runs X:1 through X:5 one cycle behind.]

Page 10: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Comparison: IMT, PMT, CMT

[Figure: the IMT, PMT, and CMT two-core schedules side by side over cycles 0-5.]

Page 11: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Comparison: IMT, PMT, CMT

[Figure: the IMT, PMT, and CMT two-core schedules side by side over cycles 0-5.]

lat(comm) = 1:
IMT: 1 iter/cycle
PMT: 1 iter/cycle
CMT: 1 iter/cycle

Page 12: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Comparison: IMT, PMT, CMT

                IMT            PMT            CMT
lat(comm) = 1:  1 iter/cycle   1 iter/cycle   1 iter/cycle
lat(comm) = 2:  1 iter/cycle   1 iter/cycle   0.5 iter/cycle

[Figure: IMT, PMT, and CMT two-core schedules over cycles 0-5 with lat(comm) = 2. The PMT schedule still issues LD:1 through LD:6, while the CMT schedule fits only LD:1 through LD:3 and X:1 through X:3.]
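The rates on this slide can be reproduced with a small back-of-the-envelope model. This is a sketch under stated assumptions, not the analysis used in the lecture: LD and X take `lat_ld` and `lat_x` cycles, and `lat_comm` is counted the way the schedules appear to draw it, so a value produced in the last cycle of LD is usable on another core `lat_comm` cycles later. In CMT the LD recurrence hops between cores every iteration, so the communication latency sits on the critical path. In PMT the recurrence stays on one core and the latency only delays pipeline fill. The function names are illustrative.

```c
static double max2(double a, double b) { return a > b ? a : b; }

/* CMT (DOACROSS): iterations alternate between cores, so LD of the next
 * iteration waits for the previous LD's value to cross cores.
 * Cycles per iteration = max(recurrence length, work per iteration / cores). */
double cmt_iters_per_cycle(int lat_ld, int lat_x, int lat_comm, int cores) {
    double recurrence = (double)(lat_ld - 1 + lat_comm);
    double work_bound = (double)(lat_ld + lat_x) / cores;
    return 1.0 / max2(recurrence, work_bound);
}

/* PMT (DSWP): the LD recurrence is local to stage 1, so the steady-state
 * pace is set by the slower of the two stages; lat_comm does not appear. */
double pmt_iters_per_cycle(int lat_ld, int lat_x) {
    return 1.0 / max2((double)lat_ld, (double)lat_x);
}
```

With `lat_ld = lat_x = 1` on two cores, the CMT function gives 1 iteration per cycle at `lat_comm = 1` and 0.5 at `lat_comm = 2`, while the PMT function stays at 1 in both cases.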

Page 13: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Comparison: IMT, PMT, CMT

[Figure: the IMT, PMT, and CMT two-core schedules side by side over cycles 0-5.]

Cross-thread Dependences => Wide Applicability

Thread-local Recurrences => Fast Execution

Page 14: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP

[Figure: 197.parser as a three-stage pipeline: Find English Sentences, Parse Sentences (95%), Emit Results. Labels: Decoupled Software Pipelining; PS-DSWP (Spec DOALL Middle Stage).]

Page 15: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Decoupled Software Pipelining

Page 16: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Decoupled Software Pipelining (DSWP) [MICRO 2005]

A: while(node)
B:   ncost = doit(node);
C:   cost += ncost;
D:   node = node->next;

Inter-thread communication latency is a one-time cost

[Figure: dependence graph over A, B, C, D; its DAG_SCC, in which A and D form one node followed by B and C; and the resulting partition into Thread 1 and Thread 2. Legend: intra-iteration and loop-carried edges; register and control dependences; communication queue.]
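The two-stage split can be sketched in C as follows. This is an illustration under assumptions, not the code DSWP generates: the `Node` type, the body of `doit`, the fixed-size `Queue`, and the signatures of `produce` and `consume` are all hypothetical. The two stages are run back to back on one thread to keep the example self-contained, so the list must have fewer than `QCAP` nodes. Stage 1 holds the pointer-chasing recurrence (A and D), stage 2 holds the work (B and C), and values flow in one direction only.

```c
#include <stddef.h>

/* Hypothetical list node and work function; not from the slides. */
typedef struct Node { int val; struct Node *next; } Node;
static int doit(const Node *n) { return n->val * 2; }

/* Original loop (statements A-D from the slide). */
int sequential_cost(Node *node) {
    int cost = 0;
    while (node) {                 /* A */
        int ncost = doit(node);    /* B */
        cost += ncost;             /* C */
        node = node->next;         /* D */
    }
    return cost;
}

/* Fixed-size FIFO standing in for the inter-thread communication queue. */
#define QCAP 64
typedef struct { Node *buf[QCAP]; size_t head, tail; } Queue;
static void produce(Queue *q, Node *n) { q->buf[q->tail++ % QCAP] = n; }
static Node *consume(Queue *q) { return q->buf[q->head++ % QCAP]; }

/* Stage 1 (A and D): traverse the list and forward each node.
 * A trailing NULL tells stage 2 that the loop has exited. */
void stage1_traverse(Node *node, Queue *q) {
    while (node) {
        produce(q, node);
        node = node->next;
    }
    produce(q, NULL);
}

/* Stage 2 (B and C): consume nodes and accumulate the cost.
 * It never sends anything back to stage 1. */
int stage2_accumulate(Queue *q) {
    int cost = 0;
    Node *node;
    while ((node = consume(q)) != NULL)
        cost += doit(node);
    return cost;
}
```

Because stage 2 never feeds stage 1, a slow queue delays only the first value to arrive, which is the sense in which the communication latency is a one-time cost. A real implementation would run the stages on separate threads with a synchronized queue.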

Page 17: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Implementing DSWP

[Figure: loop code labeled L1 and Aux with its data-flow graph (DFG). Legend: intra-iteration and loop-carried edges; register, memory, and control dependences.]

Page 18: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Optimization: Node Splitting To Eliminate Cross Thread Control

[Figure: dependence graph partitioned between code labeled L1 and L2. Legend: intra-iteration and loop-carried edges; register, memory, and control dependences.]

Page 19: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Optimization: Node Splitting To Reduce Communication

[Figure: dependence graph partitioned between code labeled L1 and L2. Legend: intra-iteration and loop-carried edges; register, memory, and control dependences.]

Page 20: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

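Constraint: Strongly Connected Components

Consider: a strongly connected component split across two threads

[Figure: dependence graph with a dependence cycle crossing the thread boundary. Legend: intra-iteration and loop-carried edges; register, memory, and control dependences.]

Eliminates pipelined/decoupled property: the threads would have to communicate in both directions on every iteration

Solution: DAG_SCC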
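A minimal sketch of finding the strongly connected components for the four-statement loop used earlier (A: `while(node)`, B: `ncost = doit(node)`, C: `cost += ncost`, D: `node = node->next`). The edge list is my reading of that loop's dependences and is an assumption, not a copy of the slide's figure. For simplicity the sketch uses a transitive closure rather than Tarjan's linear-time algorithm, which is fine for a handful of nodes. All names are illustrative.

```c
#define N_STMTS 4
enum { STMT_A, STMT_B, STMT_C, STMT_D };

/* Fill adj with the assumed dependence edges (adj[u][v] = 1 means v depends on u). */
void build_example_graph(int adj[N_STMTS][N_STMTS]) {
    for (int i = 0; i < N_STMTS; i++)
        for (int j = 0; j < N_STMTS; j++)
            adj[i][j] = 0;
    adj[STMT_A][STMT_B] = 1;  /* control: A decides whether B runs */
    adj[STMT_A][STMT_C] = 1;  /* control */
    adj[STMT_A][STMT_D] = 1;  /* control */
    adj[STMT_D][STMT_A] = 1;  /* register node, loop-carried */
    adj[STMT_D][STMT_B] = 1;  /* register node, loop-carried */
    adj[STMT_D][STMT_D] = 1;  /* register node, loop-carried */
    adj[STMT_B][STMT_C] = 1;  /* register ncost, intra-iteration */
    adj[STMT_C][STMT_C] = 1;  /* register cost, loop-carried */
}

/* Label each statement with the smallest statement index in its SCC.
 * Two statements share an SCC exactly when each can reach the other. */
void scc_labels(int adj[N_STMTS][N_STMTS], int comp[N_STMTS]) {
    int reach[N_STMTS][N_STMTS];
    for (int i = 0; i < N_STMTS; i++)
        for (int j = 0; j < N_STMTS; j++)
            reach[i][j] = adj[i][j] || i == j;
    /* Warshall's transitive closure. */
    for (int k = 0; k < N_STMTS; k++)
        for (int i = 0; i < N_STMTS; i++)
            for (int j = 0; j < N_STMTS; j++)
                if (reach[i][k] && reach[k][j])
                    reach[i][j] = 1;
    for (int i = 0; i < N_STMTS; i++)
        for (int j = 0; j < N_STMTS; j++)
            if (reach[i][j] && reach[j][i]) {
                comp[i] = j;
                break;
            }
}
```

On this graph A and D receive the same label while B and C each get their own, so a partitioner may place B or C in a different thread from A and D but must keep A and D together.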

Page 21: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

2 Extensions to the Basic Transformation

Speculation
» Break statistically unlikely dependences
» Form better-balanced pipelines

Parallel Stages
» Execute multiple copies of certain “large” stages
» Stages that contain inner loops perfect candidates

Page 22: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

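Why Speculation?

A: while(node)
B:   ncost = doit(node);
C:   cost += ncost;
D:   node = node->next;

[Figure: dependence graph over A, B, C, D and its DAG_SCC, in which A and D form one node alongside B and C. Legend: intra-iteration and loop-carried edges; register and control dependences; communication queue.]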

Page 23: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Why Speculation?

A: while(cost < T && node)
B:   ncost = doit(node);
C:   cost += ncost;
D:   node = node->next;

[Figure: dependence graph over A, B, C, D and its DAG_SCC, which now collapses into the single node A B C D. Some edges are marked Predictable Dependences. Legend: intra-iteration and loop-carried edges; register and control dependences; communication queue.]

Page 24: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Why Speculation?

A: while(cost < T && node)
B:   ncost = doit(node);
C:   cost += ncost;
D:   node = node->next;

[Figure: dependence graph over A, B, C, D with the Predictable Dependences marked, and the resulting DAG_SCC with separate nodes D, B, C, A. Legend: intra-iteration and loop-carried edges; register and control dependences; communication queue.]

Page 25: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Execution Paradigm

[Figure: pipelined execution of the DAG_SCC stages D, B, C, A. Annotations: Misspeculation detected; Misspeculation Recovery: Rerun Iteration 4. Legend: intra-iteration and loop-carried edges; register and control dependences; communication queue.]

Page 26: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Understanding PMT Performance

[Figure: two PMT schedules on Core 1 and Core 2 over cycles 0-5. In the first, Core 1 runs A:1 through A:6 and Core 2 runs B:1 through B:5. In the second, Core 1 runs A and B of each iteration back to back and Core 2 runs C, with idle time on Core 2.]

                 First schedule   Second schedule
Slowest thread:  1 cycle/iter     2 cycle/iter
Iteration Rate:  1 iter/cycle     0.5 iter/cycle

$T \propto \max_i(t_i)$

1. Rate $t_i$ is at least as large as the longest dependence recurrence.

2. NP-hard to find longest recurrence.

3. Large loops make problem difficult in practice.
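As a small illustration of the slowest-thread relation, the sketch below takes per-stage costs in cycles per iteration and reports which stage sets the pace and the resulting iteration rate. The function names and the array representation are assumptions made for the example. It ignores pipeline fill and communication latency, which do not affect the steady-state rate.

```c
#include <stddef.h>

/* Index of the slowest stage, the one that sets the steady-state pace.
 * Assumes n_stages >= 1. */
size_t bottleneck_stage(const int *stage_cycles, size_t n_stages) {
    size_t worst = 0;
    for (size_t i = 1; i < n_stages; i++)
        if (stage_cycles[i] > stage_cycles[worst])
            worst = i;
    return worst;
}

/* Steady-state iterations per cycle of a pipeline: 1 / max_i(t_i). */
double pipeline_iters_per_cycle(const int *stage_cycles, size_t n_stages) {
    return 1.0 / stage_cycles[bottleneck_stage(stage_cycles, n_stages)];
}
```

For stage costs `{1, 1}` this gives 1 iteration per cycle, and for `{2, 1}` it gives 0.5, matching the two schedules on the slide.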

Page 27: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Selecting Dependences To Speculate

A: while(cost < T && node)
B:   ncost = doit(node);
C:   cost += ncost;
D:   node = node->next;

[Figure: dependence graph over A, B, C, D and the DAG_SCC after speculation, with nodes D, B, C, A assigned to Thread 1, Thread 2, Thread 3, and Thread 4. Legend: intra-iteration and loop-carried edges; register and control dependences; communication queue.]

Page 28: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Detecting Misspeculation

[Figure: DAG_SCC with nodes D, B, C, A.]

Thread 1
A1: while(consume(4))
D :   node = node->next;
      produce({0,1},node);

Thread 2
A2: while(consume(5))
B :   ncost = doit(node);
      produce(2,ncost);
D2:   node = consume(0);

Thread 3
A3: while(consume(6))
B3:   ncost = consume(2);
C :   cost += ncost;
      produce(3,cost);

Thread 4
A : while(cost < T && node)
B4:   cost = consume(3);
C4:   node = consume(1);
      produce({4,5,6},cost < T && node);

Page 29: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Detecting Misspeculation

[Figure: DAG_SCC with nodes D, B, C, A.]

Thread 1
A1: while(TRUE)
D :   node = node->next;
      produce({0,1},node);

Thread 2
A2: while(TRUE)
B :   ncost = doit(node);
      produce(2,ncost);
D2:   node = consume(0);

Thread 3
A3: while(TRUE)
B3:   ncost = consume(2);
C :   cost += ncost;
      produce(3,cost);

Thread 4
A : while(cost < T && node)
B4:   cost = consume(3);
C4:   node = consume(1);
      produce({4,5,6},cost < T && node);

Page 30: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Detecting Misspeculation

[Figure: DAG_SCC with nodes D, B, C, A.]

Thread 1
A1: while(TRUE)
D :   node = node->next;
      produce({0,1},node);

Thread 2
A2: while(TRUE)
B :   ncost = doit(node);
      produce(2,ncost);
D2:   node = consume(0);

Thread 3
A3: while(TRUE)
B3:   ncost = consume(2);
C :   cost += ncost;
      produce(3,cost);

Thread 4
A : while(cost < T && node)
B4:   cost = consume(3);
C4:   node = consume(1);
      if(!(cost < T && node))
        FLAG_MISSPEC();
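The idea of running ahead and checking afterwards can be emulated on a single thread. This is a simplified sketch and not the SpecDSWP runtime: the `Node` type, `doit`, the `WINDOW` bound, and both function names are assumptions. Only the `cost < T` half of the exit condition is speculated, and the traversal still stops at NULL to keep the emulation memory-safe. The list is assumed to hold at most `WINDOW` nodes.

```c
#include <stddef.h>

/* Hypothetical list node and work function; not from the slides. */
typedef struct Node { int val; struct Node *next; } Node;
static int doit(const Node *n) { return n->val; }

/* Reference: the loop from the slide, run sequentially. */
int sequential_cost(Node *node, int T) {
    int cost = 0;
    while (cost < T && node) {
        cost += doit(node);
        node = node->next;
    }
    return cost;
}

#define WINDOW 16

/* Emulated speculation. The worker phase runs ahead as if `cost < T`
 * always held and logs the cost seen on entry to each iteration. The
 * checker phase then validates the real condition in iteration order.
 * On the first failure it reports that iteration through misspec_iter
 * and returns the last committed cost, discarding later work.
 * misspec_iter is set to -1 when no misspeculation occurred. */
int speculative_cost(Node *node, int T, int *misspec_iter) {
    int entry_cost[WINDOW];
    int n = 0, cost = 0;

    while (n < WINDOW && node) {      /* worker stages, speculating */
        entry_cost[n] = cost;
        cost += doit(node);
        node = node->next;
        n++;
    }

    *misspec_iter = -1;
    for (int i = 0; i < n; i++) {     /* checker stage */
        if (!(entry_cost[i] < T)) {
            *misspec_iter = i;
            return entry_cost[i];     /* roll back to committed state */
        }
    }
    return cost;
}
```

For a list of four nodes of value 5 and `T = 8`, the checker flags iteration index 2 and the committed cost is 10, the same value the sequential loop returns.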

Page 31: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Breaking False Memory Dependences

[Figure: dependence graph over A, B, C, D alongside versioned memory. Labels: Memory Version 3; Oldest Version Committed by Recovery Thread. Legend: intra-iteration and loop-carried edges; register, control, and false memory dependences; communication queue.]

Page 32: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Adding Parallel Stages to DSWP

while(ptr = ptr->next) // LD
  ptr->val = ptr->val + 1; // X

LD = 1 cycle
X = 2 cycles
Comm. Latency = 2 cycles

Throughput
DSWP: 1/2 iteration/cycle
DOACROSS: 1/2 iteration/cycle
PS-DSWP: 1 iteration/cycle

[Figure: three-core PS-DSWP schedule over cycles 0-7. Core 1 runs LD:1 through LD:8; Core 2 runs X:1, X:3, X:5; Core 3 runs X:2, X:4.]
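The DSWP and PS-DSWP throughput figures can be checked with a tiny schedule simulation. This is a sketch under assumptions: one core issues the LDs back to back, the X stage is replicated on `replicas` cores that receive iterations round-robin, and a value arrives `lat_comm` cycles after its LD completes. The exact arrival convention only shifts the pipeline fill, not the steady-state rate. The function name and `MAX_REPLICAS` are illustrative.

```c
#define MAX_REPLICAS 8

/* Cycle at which the last X finishes when n iterations flow through a
 * two-stage pipeline with a replicated second stage.
 * Assumes 1 <= replicas <= MAX_REPLICAS. */
long psdswp_finish_cycle(int n, int lat_ld, int lat_x, int lat_comm,
                         int replicas) {
    long core_free[MAX_REPLICAS] = {0};  /* when each X core is next idle */
    long finish = 0;
    for (int i = 0; i < n; i++) {
        long ld_done = (long)(i + 1) * lat_ld;  /* LDs run back to back */
        long arrival = ld_done + lat_comm;      /* value reaches an X core */
        int c = i % replicas;                   /* round-robin assignment */
        long start = arrival > core_free[c] ? arrival : core_free[c];
        core_free[c] = start + lat_x;
        if (core_free[c] > finish)
            finish = core_free[c];
    }
    return finish;
}
```

With LD = 1, X = 2, and a communication latency of 2, going from 100 to 200 iterations costs 200 extra cycles with one X core (1/2 iteration per cycle) and 100 extra cycles with two X cores (1 iteration per cycle).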

Page 33: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Thread Partitioning

   p = list;
   sum = 0;
A: while (p != NULL) {
B:   id = p->id;
E:   q = p->inner_list;
C:   if (!visited[id]) {
D:     visited[id] = true;
F:     while (foo(q))
G:       q = q->next;
H:     if (q != NULL)
I:       sum += p->value;
     }
J:   p = p->next;
   }

[Figure: dependence graph of the loop with node weights 10, 10, 10, 10, 5, 5, 50, 50, 5, 3; the node with weight 3 is labeled Reduction.]

Page 34: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Thread Partitioning: DAG_SCC

Page 35: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Thread Partitioning

Merging Invariants
• No cycles
• No loop-carried dependence inside a doall node

[Figure: DAG_SCC with node weights 20, 10, 15, 5, 100, 5, 3.]

Page 36: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Thread Partitioning

[Figure: partially merged graph with node weights 20, 10, 15, 113. Annotation: Treated as sequential.]

Page 37: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Thread Partitioning

[Figure: final partition with node weights 45 and 113.]

Modified MTCG [Ottoni, MICRO’05] to generate code from partition

Page 38: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Discussion Point 1 – Speculation

How do you decide what dependences to speculate?
» Look solely at profile data?
» What about code structure?

How do you manage speculation in a pipeline?
» Traditional definition of a transaction is broken
» Transaction execution spread out across multiple cores

Page 39: EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

Discussion Point 2 – Pipeline Structure

When is a pipeline a good/bad choice for parallelization?

Is pipelining good or bad for cache performance?
» Is DOALL better/worse for cache?

Can a pipeline be adjusted when the number of available cores increases/decreases?