The potential for Software-only thread-level speculation
Depth Oral Presentation
By: Chuck (Chengyan) Zhao, April 25, 2005
Co-Supervisors: Prof. Greg Steffan, Prof. Cristina Amza
Committee Members: Prof. Tarek Abdelrahman, Prof. Michael Voss, Prof. Ken Sevcik


Page 1

The potential for Software-only thread-level speculation

Depth Oral Presentation

Co-Supervisors: Prof. Greg Steffan, Prof. Cristina Amza

Committee Members: Prof. Tarek Abdelrahman, Prof. Michael Voss, Prof. Ken Sevcik

By: Chuck (Chengyan) Zhao

April 25, 2005

Page 2

Chip Multi-Processor (CMP) is now everywhere

From all major companies:
- IBM: Power 4, Power 5, …
- Intel: Montecito, Smithfield, …
- AMD: Dual-core Opteron
- Sun: MAJC
- Sony, Toshiba, IBM: Cell
- …

[Figures: Power 4, dual-core Intel chip, Cell, dual-core Opteron]

Abundant Chip Multiprocessors

Page 3

Improving Throughput with a Chip Multi-Processor

[Figure: multiprogramming workload. Applications run on a CMP made of processors (P) and caches (C); the axis is execution time.]

improve throughput

Page 4

Improving Single Application Performance with a Chip Multi-Processor

[Figure: a single application on a CMP made of processors (P) and caches (C); the axis is execution time.]

Single application: need parallel threads to reduce execution time

Page 5

Using Chip Multi-Processor for improvements

- Improve throughput for multi-programming workload
  - Easy
  - CMP behaves like a normal MP
- Improve single-application performance
  - Hard
  - Control and Data Dependence
  - Proposed approach: Thread-Level Speculation (TLS)

CMP trade-offs

Page 6

Thread-Level Speculation (TLS)

- Enable the compiler to create parallel threads despite the existence of ambiguous data dependences
- Optimistically parallelize at compile time
- Detect violations and recover at runtime

[Flowchart: Compile time: parallelize without dependency detection. Run time: detect violation? No: commit modification. Yes: squash and re-execute.]

Optimistic at compile time, detect and recover at runtime
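The runtime half of the flowchart can be sketched in C. This is only an illustration under stated assumptions, not the scheme proposed in the talk: the names tls_action, tls_violation and tls_decide are hypothetical, each speculative thread (epoch) is assumed to have recorded the addresses it touched, and the check is conservative because it flags any overlap between an earlier epoch's writes and a later epoch's reads, whatever order they actually happened in.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; "epoch" means one speculative thread. */
typedef enum { TLS_COMMIT, TLS_SQUASH } tls_action;

/* Returns 1 if the logically later epoch read any address that the
 * logically earlier epoch wrote (a possible read-after-write violation). */
int tls_violation(const void *const *earlier_writes, size_t n_writes,
                  const void *const *later_reads, size_t n_reads)
{
    for (size_t i = 0; i < n_writes; i++)
        for (size_t j = 0; j < n_reads; j++)
            if (earlier_writes[i] == later_reads[j])
                return 1;
    return 0;
}

/* No violation: commit the later epoch's modifications.
 * Violation: squash the later epoch so that it can be re-executed. */
tls_action tls_decide(const void *const *earlier_writes, size_t n_writes,
                      const void *const *later_reads, size_t n_reads)
{
    return tls_violation(earlier_writes, n_writes, later_reads, n_reads)
               ? TLS_SQUASH
               : TLS_COMMIT;
}
```

The quadratic comparison keeps the sketch short; tracking every read and write this way is exactly the kind of overhead a software-only scheme has to reduce.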

Page 7

Example of Thread-Level Speculation

Code to parallelize:

for ( …){
  …
  *p = …;
  …
  … = … … *q;
  …
}

- Cannot be parallelized by parallelizing compilers
- Uncertain dependence between *p and *q
- Might be runtime or user-input dependent

Break loop iterations into threads, explore uncertainty in each thread
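A concrete, hypothetical instance of such a loop is sketched below. The function names and the per-iteration pointer arrays p and q are assumptions made for illustration; the elided parts of the slide's loop are not filled in. The point is that whether a later iteration reads what an earlier iteration wrote depends only on the pointer values supplied at run time.

```c
#include <stddef.h>

/* Iteration i writes through p[i] and then reads through q[i].
 * out[i] receives the value that iteration i read. */
void run_loop(int *const *p, int *const *q, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        *p[i] = (int)i + 1;   /* *p = ...;     */
        out[i] = *q[i];       /* ... = ... *q; */
    }
}

/* Returns 1 if some iteration reads a location written by an earlier
 * iteration. Only this read-after-write case is checked here. */
int has_flow_dependence(int *const *p, int *const *q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (p[i] == q[j])
                return 1;
    return 0;
}
```

With p = {&a, &b} and q = {&c, &d} the iterations are independent; with q = {&c, &a} the second iteration reads the value the first one wrote, so running the iterations as parallel threads is safe only if that case is detected and repaired.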

Page 8

How Thread-Level Speculation works

[Figure: execution-time diagram of threads under TLS; labels: *p, *q, violation, Recover]

exploit available thread-level parallelism

Page 9

Thread-Level Speculation quick summary

Benefits
- Reduce inter-thread communication time among cores
- Scale
- New parallel programming model

Types of implementations
- Hardware only
- Combined hardware and software
- Software only

Thread-Level Speculation is good for Chip Multi-Processor

Page 10

Thread-Level Speculation Implementation Diagram

[Diagram: Thread-Level Speculation, with boxes for the HW-only approach, the SW-only approach, and our approach]

Overall picture of Thread-Level Speculation

Page 11

Thread-Level Speculation Implementation Comparison

Hardware-only approach
- Lots of research
- Good speedup in simulation
- Nobody has built it yet: costly, risky, needs both HW and SW at the same time
- Outcome
  - HW-only TLS looks promising
  - Significant hardware changes

Software-only approach: limited work, limited progress
- Major problem: high overhead
  - Buffer memory for speculative state
  - Track each memory read and write: violation detection
  - Recover from failed speculation: re-execution

Quick summary on HW-only and SW-only approaches

Page 12

Outline for the rest of the talk

- Hardware TLS schemes
- Software TLS schemes
- Our scheme
  - Our goals
  - Starting point
  - Potential applications
- Conclusion

Page 13

Hardware-only Thread-Level Speculation

[Diagram: Thread-Level Speculation, with boxes for the HW-only approach, the SW-only approach, and our approach]

Overall picture of HW-only TLS approach

Page 14

Hardware Thread-Level Speculation Schemes

Lots of hardware TLS research
- CMU Stampede
- Stanford Hydra
- Wisconsin Multiscalar
- UIUC IA-COMA
- UMN Super-threaded architecture
- …

Convergence of hardware schemes
- Use cache to buffer speculative state
- Extend cache coherence protocol to track data dependence

Convergence of HW-only Thread-Level Speculation

Page 15

Hardware TLS Schemes: quick summary

Result
- TLS is promising
- SPECint improvement: 30% - 100%
- Depends on aggressiveness of the hardware support

[Figure: four processors (P), each with a cache (C) holding speculative state (Sp-state), and a non-speculative cache (C)]

CMP with hardware speculative buffer and enhanced cache coherence protocol

Convergence of HW-only Thread-Level Speculation

Page 16

Software-only Thread-Level Speculation

[Diagram: Thread-Level Speculation, with boxes for the HW-only approach, the SW-only approach, and our approach]

Overall picture of SW-only TLS approach

Page 17

Software-only Thread-Level Speculation Schemes

- LRPD Test: UIUC
- VM for dependence tracking: Spiros’s, CMU
- Cintra’s SW TLS: U Edinburgh

Problem of software-only approach: high overhead
- Try to reduce it

overview of SW-only TLS approach

Page 18

LRPD Test (UIUC)

[Figure: execution-time diagram; labels: software dependence tracking, "was parallel execution safe?"]

+ implemented entirely in software
- applies only to array-based code
- no partial parallelism: the entire loop will re-execute sequentially if there is any dependence

Pros + Cons of LRPD

Page 19

Dependence tracking using Virtual Memory

[Figure: execution-time diagram; labels: software dependence tracking through VM pages, Virtual Memory, Synchronize: transfer VM pages]

Pros + Cons of VM Tracking

Page 20

CMU Spiros’s approach -- Dependence tracking using Virtual Memory

- Coarse-grain, software-only
- Based on memory tracking
  - virtual memory page protection mechanism
  - use software DSM (TreadMarks)
  - Synchronization through VM pages through cost analysis
- Overhead is prohibitive
  - 2 sec (seq) / 5 min (par)
  - Not a viable approach at this level of coarse granularity

SW-TLS through VM Tracking is not attractive

Page 21

Cintra’s SW TLS: Memory tracking tuned for performance

[Figure: execution-time diagram; label: efficient tracking for array references]

Efficient but custom-made for arrays only

Page 22

Cintra’s software-only Thread-Level Speculation: quick summary

Features
- Software simulation of an extended cache coherence protocol
- Provide speculative state transition table
- Violation detection through speculative state comparison
- Instrument each load and store

Pros + Cons:
+ advanced implementation of LRPD test
+ implemented entirely in software
+ covers partial parallelism
- hand-crafted code for performance
- applies only to array-based code

Summary of Cintra’s work

Page 23

Problems with Software Thread-Level Speculation

High overhead
- Buffer speculative state
- Track data dependences for all memory references
- Re-execute in case of failed speculation

Potential speedup largely unexplored

Possible directions for future research
- Reduce overhead
- Achieve speedup from TLS parallelism

Summary of Software TLS

Page 24

Our current Thread-Level Speculation approach

[Diagram: Thread-Level Speculation, with boxes for the HW-only approach, the SW-only approach, and our approach]

Overall position for our SW TLS approach

Page 25

Long term future plan

Goals
- Target Chip Multi-Processors
  - Tightly-coupled MPs
- Apply to general-purpose code: not only arrays
- Minimize overhead
  - Capitalize on compiler analysis and optimizations
    - Idempotency analysis <done>
    - Synchronization and communications <done>
    - PPA: Probabilistic pointer analysis Framework (Jeff’s work) <progressing>
    - Minimal backup and buffer retrieval analysis <progressing>
    - … more analysis we will invent <todo>

SW-only approach: room to improve
- Starting point: highly efficient software checkpointing

Goals and Plans

Page 26

Starting point: efficient software checkpointing

- Some program points in source code
- Buffer state changes between the current execution point and its latest checkpoint
- Execution can always efficiently rewind to its latest checkpoint

[Figure: program execution timeline with software checkpoints; labels: buffer memory changes, buffer more memory changes]

Introduce software checkpointing

Page 27

Potential use of Software checkpointing

- Software Rollback
  - automatic software TLS support
  - foundation of future automatic TLS parallelization
- Debug
  - controlled rewind
- Enhance application reliability
- Speculative optimizations in uni-processor programs
  - larger window size
  - deep branch speculation
  - speculative code motion

what can software checkpointing do

Page 28

Software checkpointing schemes

Compiler analysis
- Local: Basic Block level
  - Back up only needed memory writes
  - Optimize to minimize
    - number of backups
    - number of buffer retrievals
- Global: procedural level
  - Populate buffers through control-flow graph
  - Iterate until buffer stabilizes
- Inter-procedural level

Potential approaches for software backup
- Undo backup
- Todo backup

build software checkpointing

Page 29

Undo backup

- Compile-time analysis
- Back up once
  - per distinct memory write
  - per Basic Block
- Program continues to operate on non-backup memory
- Action upon execution completion
  - Commit: trash buffer
  - Rollback: restore from buffer

undo backup properties

Page 30

Undo backup example

Program, Basic Block level:

…
a = 10;
b = 12;
…
c = a + b;

Undo backup memory:

(&a, [a])
(&b, [b])
(&c, [c])

Undo backup action:

[Flowchart: conflicts check. N: trash undo memory and continue to the next Basic Block. Y: restore undo memory.]

undo backup process
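The undo scheme can be sketched as a small log of (address, old value) pairs. This is a minimal sketch under assumptions, not the implementation from the talk: the names undo_log, undo_backup, undo_commit and undo_rollback are hypothetical, the log has a fixed capacity, and only int locations are handled. In the scheme on the slides the backup calls would be placed by compile-time analysis, once per distinct memory write per Basic Block; here the caller places them by hand.

```c
#include <assert.h>
#include <stddef.h>

#define UNDO_CAPACITY 64   /* hypothetical fixed size for this sketch */

typedef struct {
    int *addr;       /* &x  */
    int old_value;   /* [x], the value before the write */
} undo_entry;

typedef struct {
    undo_entry entries[UNDO_CAPACITY];
    size_t count;
} undo_log;

/* Save (&x, [x]) before x is overwritten. */
void undo_backup(undo_log *log, int *addr)
{
    assert(log->count < UNDO_CAPACITY);
    log->entries[log->count].addr = addr;
    log->entries[log->count].old_value = *addr;
    log->count++;
}

/* Commit: memory was already updated in place, so just trash the log. */
void undo_commit(undo_log *log)
{
    log->count = 0;
}

/* Rollback: restore the saved values, newest entry first. */
void undo_rollback(undo_log *log)
{
    while (log->count > 0) {
        log->count--;
        *log->entries[log->count].addr = log->entries[log->count].old_value;
    }
}
```

For the example block, the caller would back up a, b and c, run a = 10; b = 12; c = a + b; directly on program memory, and then call undo_commit if the conflict check passes or undo_rollback if it fails. Reads need no buffer lookup, which is why the slides describe this mode as fast.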

Page 31

Todo backup

- Performed at runtime
- Happens on every single memory write inside a Basic Block
- Each following read might need to retrieve from the buffer
- Action upon completion (reverse of Undo type)
  - Commit: write back from buffer
  - Rollback: trash buffer

todo backup properties

Page 32

Todo backup example

Program, Basic Block level:

…
*p = a;
*q = b;
…
…*p + *q;

todo backup memory:

(p, a)
(q, b)

[Flowchart: conflicts check. N: write todo backup to memory and continue to the next Block. Y: trash todo backup.]

todo backup process
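The todo scheme can be sketched as a write buffer of (address, new value) pairs. Again this is a sketch under assumptions rather than the talk's implementation: the names todo_buffer, todo_store, todo_load, todo_commit and todo_rollback are hypothetical, the buffer has a fixed capacity, only int locations are handled, and lookups are a linear scan.

```c
#include <assert.h>
#include <stddef.h>

#define TODO_CAPACITY 64   /* hypothetical fixed size for this sketch */

typedef struct {
    int *addr;       /* p */
    int new_value;   /* value the program wanted to store through p */
} todo_entry;

typedef struct {
    todo_entry entries[TODO_CAPACITY];
    size_t count;
} todo_buffer;

/* Speculative store: record (p, value); program memory is not touched. */
void todo_store(todo_buffer *buf, int *addr, int value)
{
    assert(buf->count < TODO_CAPACITY);
    buf->entries[buf->count].addr = addr;
    buf->entries[buf->count].new_value = value;
    buf->count++;
}

/* Speculative load: the newest buffered value wins, otherwise read memory. */
int todo_load(const todo_buffer *buf, const int *addr)
{
    for (size_t i = buf->count; i > 0; i--)
        if (buf->entries[i - 1].addr == addr)
            return buf->entries[i - 1].new_value;
    return *addr;
}

/* Commit: write the buffer back to memory, oldest entry first. */
void todo_commit(todo_buffer *buf)
{
    for (size_t i = 0; i < buf->count; i++)
        *buf->entries[i].addr = buf->entries[i].new_value;
    buf->count = 0;
}

/* Rollback: memory was never modified, so just trash the buffer. */
void todo_rollback(todo_buffer *buf)
{
    buf->count = 0;
}
```

For the example block, *p = a; *q = b; become todo_store calls and *p + *q becomes two todo_load calls. Every store adds an entry and every later read may have to search the buffer, which is the cost the slides attribute to this mode; in exchange, the addresses do not need to be known at compile time.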

Page 33

Backup Comparison

Undo
- Pro: fast
  - Few backups
  - No need to retrieve from buffer for reads
- Con: memory address needs to be known statically
  - Scalar
  - Pointer to fixed location

Todo
- Pro: handles both scalar and general-purpose pointer cases
- Con: slow
  - Back up once per memory write
  - Need to retrieve each following read from buffer

In reality: both types are used

pros + cons of undo and todo

Page 34

An example in reality: mixed mode

Code to execute:

int a, b, c;
int * p, * q;
…
(d) a = 1;
(d) b = 2;
(d) *p = 5;
… …
(u) c = a + b;
… …
(u) … = * q;

Undo buffer:

(&a, [a])
(&b, [b])
(&c, [c])

Todo buffer:

(p, 5)

combined-backup process in reality

Page 35

Selection of backups in reality

Combined approach
- Undo: memory address known
  - Scalars
  - Pointers to fixed address
  - Compile-time analysis
- Todo: memory address unknown
  - Normal pointers
  - Run-time analysis

Plan for implementation
- put into SUIF, as an optimization pass
- Minimize performance drop

use both types together in reality

Page 36

Conclusion

- Thread-Level Speculation is compelling
  - Potential large performance gains
- Challenge
  - Software overhead
  - Limited SW TLS work
    - No previous SW TLS working on general-purpose programs
- Killer advantage: compiler analyses
- Modest starting point
  - efficient software checkpointing

summary

Page 37

Questions and Answers

Page 38

Concurrent HW-only Related Work

Approach          Composition   Compiler-assisted or Translator-only
DMT               HW-only
CSMP              HW-only
Trace Processor   HW-only
Krishnan99        SW/HW
Hydra             SW/HW
SVC               SW/HW
SUDS              SW/HW
Zhang99           SW/HW
Cintra00          SW/HW
STAMPede          SW/HW

Another view of HW-only Thread-Level Speculation Schemes