1
The potential for Software-only thread-level speculation
Depth Oral Presentation
Co-Supervisors: Prof. Greg Steffan, Prof. Cristina Amza
Committee Members: Prof. Tarek Abdelrahman, Prof. Michael Voss, Prof. Ken Sevcik
By: Chuck (Chengyan) Zhao
April 25, 2005
2
Chip Multi-Processors (CMPs) are now everywhere
From all major companies:
  IBM: Power 4, Power 5, …
  Intel: Montecito, Smithfield, …
  AMD: Dual-core Opteron
  Sun: MAJC
  Sony, Toshiba, IBM: Cell
  …
[Figure: chip photos of Power 4, a dual-core Intel chip, Cell, and dual-core Opteron]
Abundant Chip Multiprocessors
3
Improving Throughput with a Chip Multi-Processor
[Figure: a CMP with processors (P) and caches (C) running several applications; axis: execution time]
Multiprogramming workload: improve throughput
4
Improving Single Application Performance with a Chip Multi-Processor
[Figure: a CMP with processors (P) and caches (C) running one application, sequentially and as parallel threads; axis: execution time]
Single application: need parallel threads to reduce execution time
5
Using a Chip Multi-Processor for improvements
Improve throughput for multi-programming workloads
  Easy: a CMP behaves like a normal MP
Improve single-application performance
  Hard: control and data dependences
  Proposed approach: Thread-Level Speculation (TLS)
CMP trade-offs
6
Thread-Level Speculation (TLS)
Enable the compiler to create parallel threads despite the existence of ambiguous data dependences
  Optimistically parallelize at compile time
  Detect violations and recover at runtime
Compile time: parallelize without dependence detection
Run time: detect violation
  No: commit modifications
  Yes: squash and re-execute
Optimistic at compile time, detect and recover at runtime
7
Example of Thread-Level Speculation
Code to parallelize:
  for ( … ) { … *p = …; … … = … … *q; … }
Cannot be parallelized by parallelizing compilers
  Uncertain dependence between *p and *q
  Might depend on runtime values or user input
Break loop iterations into threads, explore uncertainty in each thread
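A small concrete instance of such a loop, as a sketch only. The function name `update` and the index arrays `widx` and `ridx` are hypothetical; they stand in for the runtime or user input that decides whether `*p` in one iteration aliases `*q` in a later one.

```c
#include <assert.h>
#include <stddef.h>

/* Each iteration writes through p and reads through q.
 * Iteration j depends on an earlier iteration i exactly when
 * ridx[j] == widx[i], which is only known at run time. */
int update(int *data, const size_t *widx, const size_t *ridx, size_t n) {
    assert(data != NULL && widx != NULL && ridx != NULL);
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        int *p = &data[widx[i]];
        int *q = &data[ridx[i]];
        *p = (int)i + 1;   /* *p = ...     */
        sum += *q;         /* ... = ... *q */
    }
    return sum;
}
```

With `widx = {0, 1}` and `ridx = {2, 3}` the two iterations are independent. With `ridx = {3, 0}` the second iteration reads what the first one wrote, so a compiler that cannot see the index values has to assume the worst.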
8
How Thread-Level Speculation works
exploit available thread-level parallelism
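[Figure: execution-time diagram of TLS threads; a read of *q conflicts with a write to *p (violation), and the violating thread recovers and re-executes its read of *q]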
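The violation-and-recover behaviour can be sketched in software by simulating two loop iterations as speculative epochs. This is an illustration under stated assumptions, not the scheme of any system named in this talk: the names `Epoch`, `spec_read`, `spec_write`, `violated`, `commit` and `run_two` are hypothetical, the loop body is assumed to be `*p = i + 1; result = *q;`, and the two "threads" are run one after the other from the same initial memory to mimic parallel execution.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOG 8

/* Speculative state of one thread (one loop iteration). */
typedef struct {
    int *waddr[MAX_LOG]; int wval[MAX_LOG]; size_t nw;  /* buffered writes */
    const int *raddr[MAX_LOG]; size_t nr;               /* reads served by memory */
} Epoch;

static void spec_write(Epoch *e, int *addr, int val) {
    assert(e->nw < MAX_LOG);
    e->waddr[e->nw] = addr;
    e->wval[e->nw++] = val;
}

static int spec_read(Epoch *e, const int *addr) {
    for (size_t i = e->nw; i-- > 0;)        /* newest own write wins */
        if (e->waddr[i] == addr) return e->wval[i];
    assert(e->nr < MAX_LOG);
    e->raddr[e->nr++] = addr;               /* exposed read: track it */
    return *addr;
}

/* True if `later` read a location that `earlier` wrote. */
static bool violated(const Epoch *earlier, const Epoch *later) {
    for (size_t i = 0; i < earlier->nw; i++)
        for (size_t j = 0; j < later->nr; j++)
            if (earlier->waddr[i] == later->raddr[j]) return true;
    return false;
}

static void commit(Epoch *e) {
    for (size_t i = 0; i < e->nw; i++) *e->waddr[i] = e->wval[i];
}

/* Assumed loop body: *p = i + 1; result = *q; */
static int body(Epoch *e, int *p, const int *q, int i) {
    spec_write(e, p, i + 1);
    return spec_read(e, q);
}

/* Runs iterations 0 and 1 speculatively; returns the number of squashes. */
int run_two(int *p0, const int *q0, int *p1, const int *q1, int out[2]) {
    Epoch e0 = {0}, e1 = {0};
    int squashes = 0;
    out[0] = body(&e0, p0, q0, 0);      /* both epochs start from the same */
    out[1] = body(&e1, p1, q1, 1);      /* memory, as if run in parallel   */
    commit(&e0);                        /* the oldest epoch always commits */
    if (violated(&e0, &e1)) {           /* e1 consumed a stale value       */
        e1 = (Epoch){0};                /* squash                          */
        out[1] = body(&e1, p1, q1, 1);  /* re-execute                      */
        squashes++;
    }
    commit(&e1);
    return squashes;
}
```

When iteration 1 reads the location that iteration 0 writes, `run_two` reports one squash and the re-executed read sees the committed value; when the pointers do not overlap, both epochs commit without recovery. The per-access bookkeeping in `spec_read` and `spec_write` is exactly the kind of overhead a software-only scheme has to pay.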
9
Thread-Level Speculation: quick summary
Benefits
  Reduce inter-thread communication time among cores
  Scale
  New parallel programming model
Types of implementations
  Hardware only
  Combined hardware and software
  Software only
Thread-Level Speculation is good for Chip Multi-Processors
10
Thread-Level Speculation Implementation Diagram
Thread-Level Speculation
HW-only approach
SW-only approach
Our approach
Overall picture of Thread-Level Speculation
11
Thread-Level Speculation Implementation Comparison
Hardware-only approach
  Lots of research
  Good speedup in simulation
  Nobody has built it yet: costly, risky, needs both HW + SW at the same time
  Outcome
    HW-only TLS looks promising
    Requires significant hardware changes
Software-only approach: limited work, limited progress
  Major problem: high overhead
    Buffer memory for speculative state
    Track each memory read + write: violation detection
    Recover from failed speculation: re-execution
Quick summary on HW-only and SW-only approaches
12
Outline for the rest of the talk
Hardware TLS schemes
Software TLS schemes
Our scheme
  Our goals
  Starting point
  Potential applications
Conclusion
13
Hardware-only Thread-Level Speculation
Thread-Level Speculation
HW-only approach
SW-only approach
Our approach
Overall picture of HW-only TLS approach
14
Hardware Thread-Level Speculation Schemes
Lots of hardware TLS research
  CMU STAMPede
  Stanford Hydra
  Wisconsin Multiscalar
  UIUC I-ACOMA
  UMN Superthreaded architecture
  …
Convergence of hardware schemes
  Use the cache to buffer speculative state
  Extend the cache coherence protocol to track data dependences
Convergence of HW-only Thread-Level Speculation
15
Hardware TLS Schemes: quick summary
Result
  TLS is promising
  SPECint improvement: 30% - 100%
  Depends on the aggressiveness of the hardware support
[Figure: CMP with a hardware speculative buffer and an enhanced cache coherence protocol; each processor (P) has a cache (C) holding speculative state (Sp-state), backed by a non-speculative cache]
Convergence of HW-only Thread-Level Speculation
16
Software-only Thread-Level Speculation
Thread-Level Speculation
HW-only approach
SW-only approach
Our approach
Overall picture of SW-only TLS approach
17
Software-only Thread-Level Speculation Schemes
  LRPD Test: UIUC
  VM for dependence tracking: Spiros’s, CMU
  Cintra’s SW TLS: U Edinburgh
Problem of the software-only approach: high overhead
  Try to reduce it
overview of SW-only TLS approach
18
LRPD Test (UIUC)
+ implemented entirely in software
– applies only to array-based code
– no partial parallelism: the entire loop will re-execute sequentially if there is any dependence
[Figure: execution-time diagram; software dependence tracking during parallel execution, followed by the test "was parallel execution safe?"]
Pros + Cons of LRPD
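The general shape of such a run-time test can be sketched with shadow arrays. This is a deliberately simplified, conservative check in the spirit of the LRPD test, not the published algorithm (which also handles privatization and reductions). The names `Shadow`, `mark`, `shadow_safe` and `run_loop`, and the loop body `a[widx[i]] = a[ridx[i]] + 1`, are assumptions made for illustration. The speculative pass runs the iterations in reverse order to stand in for an arbitrary parallel schedule.

```c
#include <assert.h>
#include <stdbool.h>

#define N_ELEMS 8
#define NONE (-1)   /* element not touched */
#define MANY (-2)   /* element touched by more than one iteration */

typedef struct {
    int writer[N_ELEMS];   /* which iteration wrote each element */
    int reader[N_ELEMS];   /* which iteration read each element  */
} Shadow;

static void mark(int *slot, int iter) {
    if (*slot == NONE) *slot = iter;
    else if (*slot != iter) *slot = MANY;
}

/* Post-execution analysis: was the parallel execution safe? */
static bool shadow_safe(const Shadow *s) {
    for (int e = 0; e < N_ELEMS; e++) {
        if (s->writer[e] == NONE) continue;       /* untouched or read-only */
        if (s->writer[e] == MANY) return false;   /* two iterations wrote it */
        if (s->reader[e] != NONE && s->reader[e] != s->writer[e])
            return false;                         /* another iteration read it */
    }
    return true;
}

/* Returns true if the speculative pass was kept, false if the loop
 * had to be restored and re-executed sequentially. */
bool run_loop(int a[N_ELEMS], const int *widx, const int *ridx, int n) {
    Shadow s;
    int backup[N_ELEMS];
    for (int e = 0; e < N_ELEMS; e++) {
        s.writer[e] = s.reader[e] = NONE;
        backup[e] = a[e];
    }
    for (int i = n - 1; i >= 0; i--) {            /* speculative pass */
        assert(widx[i] >= 0 && widx[i] < N_ELEMS);
        assert(ridx[i] >= 0 && ridx[i] < N_ELEMS);
        mark(&s.reader[ridx[i]], i);
        mark(&s.writer[widx[i]], i);
        a[widx[i]] = a[ridx[i]] + 1;
    }
    if (shadow_safe(&s)) return true;
    for (int e = 0; e < N_ELEMS; e++) a[e] = backup[e];   /* undo everything */
    for (int i = 0; i < n; i++) a[widx[i]] = a[ridx[i]] + 1;
    return false;
}
```

A single cross-iteration dependence anywhere in the loop makes `shadow_safe` fail and throws away all speculative work, which is the "no partial parallelism" drawback listed above.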
19
Dependence tracking using Virtual Memory
[Figure: execution-time diagram; software dependence tracking through virtual memory (VM) pages; synchronize: transfer VM pages]
Pros + Cons of VM Tracking
20
CMU Spiros’s approach: Dependence tracking using Virtual Memory
Coarse-grain, software-only
Based on memory tracking
  Virtual memory page protection mechanism
  Uses software DSM (TreadMarks)
  Synchronization through VM pages
Through cost analysis
  Overhead is prohibitive: 2 sec (seq) / 5 min (par)
  Not a viable approach at this coarse level of granularity
SW-TLS through VM Tracking is not attractive
21
Cintra’s SW TLS: Memory tracking tuned for performance
[Figure: execution-time diagram; efficient tracking for array references]
Efficient but custom-made for array only
22
Cintra’s software-only Thread-Level Speculation: quick summary
Pros + Cons:
+ advanced implementation of the LRPD test
+ implemented entirely in software
+ covers partial parallelism
– hand-crafted code for performance
– applies only to array-based code
Features
  Software simulation of an extended cache coherence protocol
  Provides a speculative state transition table
  Violation detection through speculative state comparison
  Instruments each load and store
Summary of Cintra’s work
23
Problems with Software Thread-Level Speculation
High overhead
  Buffer speculative state
  Track data dependences for all memory references
  Re-execute in case of failed speculation
Potential speedup largely unexplored
Possible directions for future research
  Reduce overhead
  Achieve speedup from TLS parallelism
Summary of Software TLS
24
Our current Thread-Level Speculation approach
Thread-Level Speculation
HW-only approach
SW-only approach
Our approach
Overall position for our SW TLS approach
25
Long term future plan
Goals
  Target
    Chip Multi-Processors
    Tightly-coupled MPs
  Apply to general-purpose code: not only arrays
  Minimize overhead
Capitalize on compiler analysis and optimizations
  Idempotency analysis <done>
  Synchronization and communications <done>
  PPA: Probabilistic Pointer Analysis framework (Jeff’s work) <progressing>
  Minimal backup and buffer retrieval analysis <progressing>
  … more analysis we will invent <todo>
SW-only approach: room to improve
Starting point: highly efficient software checkpointing
Goals and Plans
26
Starting point: efficient software checkpointing
Software checkpointing
  Checkpoints are placed at some program points in the source code
  Buffer state changes between the current execution point and its latest checkpoint
  Execution can always efficiently rewind to its latest checkpoint
[Figure: program execution timeline with checkpoints; buffer memory changes, then buffer more memory changes]
Introduce software checkpointing
27
Potential use of Software checkpointing
Software rollback
  Automatic software TLS support
  Foundation of future automatic TLS parallelization
Debugging: controlled rewind
Enhance application reliability
Speculative optimizations in uni-processor programs
  Larger window size
  Deep branch speculation
  Speculative code motion
what software checkpointing can do
28
Software checkpointing schemes
Compiler analysis
  Local: Basic Block level
    Back up only needed memory writes
    Optimize to minimize the number of backups and the number of buffer retrievals
  Global: procedural level
    Populate buffers through the control-flow graph
    Iterate until the buffers stabilize
  Inter-procedural level
Potential approaches for software backup
  Undo backup
  Todo backup
build software checkpointing
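One possible reading of "populate buffers through the control-flow graph, iterate until the buffers stabilize" is a standard forward dataflow analysis. The sketch below is an assumption about what such a pass could compute, not the analysis from this talk: for each basic block it finds the locations already backed up on every path from the checkpoint, so the block only needs to back up the rest. The names `Block`, `plan_backups`, `N_BLOCKS` and `ALL_LOCS` are hypothetical, and locations are modelled as bits in a 32-bit mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N_BLOCKS 4
#define ALL_LOCS UINT32_MAX

typedef struct {
    uint32_t writes;       /* bit k set: the block writes location k */
    int preds[N_BLOCKS];   /* ids of predecessor blocks */
    int n_preds;           /* 0 marks the entry block (the checkpoint) */
} Block;

/* need[b] receives the locations block b must back up itself. */
void plan_backups(const Block cfg[N_BLOCKS], uint32_t need[N_BLOCKS]) {
    uint32_t in[N_BLOCKS], out[N_BLOCKS];
    for (int b = 0; b < N_BLOCKS; b++) out[b] = ALL_LOCS;  /* optimistic start */
    bool changed = true;
    while (changed) {                     /* iterate until the sets stabilize */
        changed = false;
        for (int b = 0; b < N_BLOCKS; b++) {
            /* Backed up on entry = backed up on ALL incoming paths. */
            in[b] = cfg[b].n_preds == 0 ? 0 : ALL_LOCS;
            for (int k = 0; k < cfg[b].n_preds; k++) {
                assert(cfg[b].preds[k] >= 0 && cfg[b].preds[k] < N_BLOCKS);
                in[b] &= out[cfg[b].preds[k]];
            }
            uint32_t new_out = in[b] | cfg[b].writes;
            if (new_out != out[b]) { out[b] = new_out; changed = true; }
        }
    }
    for (int b = 0; b < N_BLOCKS; b++)
        need[b] = cfg[b].writes & ~in[b];
}
```

For a diamond-shaped graph where block 0 writes `a`, block 1 writes `a` and `b`, block 2 writes `c`, and the join block 3 writes `a` and `b`, the join only has to back up `b`: `a` was saved on both paths, while `b` was saved on just one of them.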
29
Undo backup
Compile-time analysis
Back up once per distinct memory write per Basic Block
Program continues to operate on the original (non-backup) memory
Action upon execution completion
  Commit: trash buffer
  Rollback: restore from buffer
undo backup properties
30
Undo backup example
Program, Basic Block level:
  …
  a = 10;
  b = 12;
  …
  c = a + b;
  …
Undo backup memory:
  (&a, [a]) (&b, [b]) (&c, [c])
Undo backup action: conflicts check
  N: trash undo memory, continue to next Basic Block
  Y: restore undo memory
undo backup process
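A minimal undo log in C, as a sketch of the undo scheme. The names `UndoLog`, `undo_backup`, `undo_commit`, `undo_rollback` and `run_block` are hypothetical. One simplification to note: the "once per distinct write" check is done at run time here, whereas the slides place that decision in compile-time analysis.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UNDO 16

typedef struct { int *addr; int old; } UndoEntry;      /* (&a, [a]) */
typedef struct { UndoEntry e[MAX_UNDO]; size_t n; } UndoLog;

/* Save the current value of *addr, once per distinct address. */
void undo_backup(UndoLog *log, int *addr) {
    for (size_t i = 0; i < log->n; i++)
        if (log->e[i].addr == addr) return;   /* already backed up */
    assert(log->n < MAX_UNDO);
    log->e[log->n].addr = addr;
    log->e[log->n].old = *addr;
    log->n++;
}

/* Commit: the program already wrote to real memory, so trash the buffer. */
void undo_commit(UndoLog *log) { log->n = 0; }

/* Rollback: restore every saved value, newest first. */
void undo_rollback(UndoLog *log) {
    while (log->n > 0) {
        log->n--;
        *log->e[log->n].addr = log->e[log->n].old;
    }
}

/* The basic block from the example: a = 10; b = 12; c = a + b; */
void run_block(UndoLog *log, int *a, int *b, int *c) {
    undo_backup(log, a);
    undo_backup(log, b);
    undo_backup(log, c);
    *a = 10;      /* writes and reads go straight to real memory */
    *b = 12;
    *c = *a + *b;
}
```

Reads cost nothing extra in this scheme because memory always holds the newest values; the price is that every backed-up address must be known before the write happens.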
31
Todo backup
Performed at runtime
Happens on every single memory write inside a Basic Block
Each following read might need to retrieve from the buffer
Action upon completion (reverse of Undo type)
  Commit: write back from buffer
  Rollback: trash buffer
todo backup properties
32
Todo backup example
Program, Basic Block level:
  …
  *p = a;
  *q = b;
  …
  … *p + *q;
  …
Todo backup memory:
  (p, a) (q, b)
conflicts check
  N: write todo backup to memory, continue to next Block
  Y: trash todo backup
todo backup process
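A matching sketch of a todo (write) buffer in C. The names `TodoBuf`, `todo_write`, `todo_read`, `todo_commit` and `todo_rollback` are hypothetical, and the linear search is only for brevity; a practical implementation would presumably use a faster lookup.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TODO 16

typedef struct { int *addr; int val; } TodoEntry;      /* (p, a) */
typedef struct { TodoEntry e[MAX_TODO]; size_t n; } TodoBuf;

/* Every write goes to the buffer; real memory stays untouched. */
void todo_write(TodoBuf *buf, int *addr, int val) {
    for (size_t i = 0; i < buf->n; i++)
        if (buf->e[i].addr == addr) { buf->e[i].val = val; return; }
    assert(buf->n < MAX_TODO);
    buf->e[buf->n].addr = addr;
    buf->e[buf->n].val = val;
    buf->n++;
}

/* A read checks the buffer first, then falls back to memory. */
int todo_read(const TodoBuf *buf, const int *addr) {
    for (size_t i = 0; i < buf->n; i++)
        if (buf->e[i].addr == addr) return buf->e[i].val;
    return *addr;
}

/* Commit: write the buffered values back to memory. */
void todo_commit(TodoBuf *buf) {
    for (size_t i = 0; i < buf->n; i++) *buf->e[i].addr = buf->e[i].val;
    buf->n = 0;
}

/* Rollback: memory was never modified, so trash the buffer. */
void todo_rollback(TodoBuf *buf) { buf->n = 0; }
```

Because entries are keyed by the run-time address, this works even when `p` and `q` turn out to point to the same location, which is why the todo scheme can handle general pointers. The cost is a buffer lookup on every read.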
33
Backup Comparison
Undo
  Pro: fast
    Few backups
    No need to retrieve from buffer for reads
  Con: memory address needs to be known statically
    Scalar
    Pointer to fixed location
Todo
  Pro: handles both scalar and general-purpose pointer cases
  Con: slow
    Backup once per memory write
    Need to retrieve each following read from buffer
In reality: both types are used
pros + cons of undo and todo
34
An example in reality: mixed mode
Code to execute:
  int a, b, c;
  int *p, *q;
  …
  (d) a = 1;
  (d) b = 2;
  (d) *p = 5;
  …
  (u) c = a + b;
  …
  (u) … = *q;
  …
Undo buffer: (&a, [a]) (&b, [b]) (&c, [c])
Todo buffer: (p, 5)
combined-backup process in reality
35
Selection of backups in reality
Combined approach
  Undo: memory address known
    Scalars
    Pointers to fixed address
    Compile-time analysis
  Todo: memory address unknown
    Normal pointers
    Run-time analysis
Plan for implementation
  Put into SUIF, as an optimization pass
  Minimize performance drop
use both types together in reality
36
Conclusion
Thread-Level Speculation is compelling
  Potentially large performance gains
Challenge
  Software overhead
  Limited SW TLS work
  No previous SW TLS working on general-purpose programs
Killer advantage: compiler analyses
Modest starting point
  Efficient software checkpointing
summary
37
Questions and Answers
38
Concurrent HW-only Related Work
Approach          Composition    Compiler-assisted or Translator-only
DMT               HW-only
CSMP              HW-only
Trace Processor   HW-only
Krishnan99        SW/HW
Hydra             SW/HW
SVC               SW/HW
SUDS              SW/HW
Zhang99           SW/HW
Cintra00          SW/HW
STAMPede          SW/HW
Another view of HW-only Thread-Level Speculation Schemes