Alias Speculation using Atomic Regions
(To appear at ASPLOS 2013)
Wonsun Ahn*, Yuelu Duan, Josep Torrellas
University of Illinois at Urbana Champaign
Disclaimer
• This talk is not about parallelism.
• This talk is about decreasing the amount of work that needs to be done through better code generation.
• We want to do this by making the software-hardware barrier more porous.
[Figure: Compiler and Hardware, with Assumptions and Information passing between them]
What prevents good code generation?
• Many popular optimizations require code motion
– Loop Invariant Code Motion (LICM): From the body to the preheader of a loop
– Redundancy elimination: From the location of the redundant computation to the first computation
• Memory aliasing prevents code motion
Example: redundancy elimination with no store in the path of code motion
Original:
    r1 = a + b
    …
    c = a + b
Split the second computation:
    r1 = a + b
    …
    r2 = a + b
    c = r2
Move r2 = a + b up to the first computation:
    r1 = a + b
    r2 = a + b
    …
    c = r2
Eliminate the redundancy:
    r1 = a + b
    r2 = r1
    …
    c = r2
Same example with a store through p in the path of code motion
Original:
    r1 = a + b
    *p = …
    c = a + b
Split the second computation:
    r1 = a + b
    *p = …
    r2 = a + b
    c = r2
Move r2 = a + b above the store (legal only if *p does not alias a or b):
    r1 = a + b
    r2 = a + b
    *p = …
    c = r2
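The hazard in the second sequence can be reproduced in plain C. The sketch below is illustrative only: the function names and the globals standing in for a, b, and c are assumptions, not code from the talk. `reuse_unsafe` forwards the first a + b past the store through p, which is correct only when p points at neither a nor b.

```c
#include <assert.h>
#include <stddef.h>

/* Globals standing in for the slide's a, b, c (illustrative names). */
static int a, b, c;

/* Faithful version: recomputes a + b after the store through p. */
static int original(int *p, int v) {
    assert(p != NULL);
    int r1 = a + b;
    *p = v;          /* may overwrite a or b */
    c = a + b;       /* second computation sees the store */
    return r1 + c;
}

/* "Optimized" version: reuses r1, valid only if p aliases neither a nor b. */
static int reuse_unsafe(int *p, int v) {
    assert(p != NULL);
    int r1 = a + b;
    *p = v;
    c = r1;          /* redundancy eliminated by assuming no alias */
    return r1 + c;
}

/* Runs both versions from the same initial state; returns 1 if they agree. */
int versions_agree(int *p, int v) {
    a = 1; b = 2;
    int x = original(p, v);
    a = 1; b = 2;
    int y = reuse_unsafe(p, v);
    return x == y;
}
```

Calling `versions_agree(&a, 7)` returns 0 because the store changes a between the two computations, while a pointer to unrelated storage returns 1. A compiler that cannot rule out the first case has to keep the second computation.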
Alias Analysis is Difficult
• Alias analysis returns one of three results
– Must-Alias, No-Alias, May-Alias
• Accurate static analysis is fundamentally difficult
– Requires points-to analysis, heap modeling etc.
– Quickly becomes intractable in space/time complexity
• Alternative: insert runtime checks
– Software checks
– Hardware checks (e.g. Itanium ALAT, Transmeta)
• We propose to leverage atomic regions to do runtime checks and automatic recovery
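A software check of the kind listed above can be sketched in C. This is a minimal illustration with hypothetical names (`add_k_checked`, `points_into`, `add_k_safe`), not code from the talk: one address comparison before the loop decides whether the load of *k may be hoisted, and the unoptimized loop serves as the fallback.

```c
#include <assert.h>
#include <stdint.h>

/* Safe version: reloads *k every iteration, correct under any aliasing. */
static void add_k_safe(int *dst, const int *src, const int *k, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + *k;
}

/* Returns 1 if address p lies inside the n-element array starting at base. */
static int points_into(const int *p, const int *base, int n) {
    uintptr_t lo = (uintptr_t)base;
    uintptr_t hi = (uintptr_t)(base + n);
    uintptr_t x  = (uintptr_t)p;
    return x >= lo && x < hi;
}

/* Checked version: one software alias check guards the hoisted load. */
void add_k_checked(int *dst, const int *src, const int *k, int n) {
    assert(n >= 0);
    if (points_into(k, dst, n)) {      /* the may-alias turned out to be real */
        add_k_safe(dst, src, k, n);
        return;
    }
    int kv = *k;                       /* LICM: load hoisted out of the loop */
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + kv;
}
```

The check itself costs instructions on every call, which is the overhead the hardware-assisted approach in this talk aims to avoid.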
Background: Atomic Regions (aka Transactions)
• Sections of code demarcated in software that are either committed atomically on success or rolled back on failure
• Atomic regions are here and now:
– Intel TSX, AMD ASF, IBM Blue Gene/Q, IBM Power
– Originally to ease parallel programming… but again that’s not what the talk is about today
• Does two things well that software finds difficult
– Checkpointing: to guarantee atomic commit of transaction
• Exposed to software through begin atomic, end atomic
– Memory alias detection: to guarantee isolation of transaction
• Hidden from software
Proposal: Leverage Atomic Regions for Alias Speculation
• Expose alias checking HW to SW through ISA extensions
• Use HW support for Atomic Regions to perform alias speculation in a compiler for optimizations
– Cover path of code motion in an Atomic Region
– Speculate may-aliases in code motion path are no-aliases
– Check speculated aliases using alias checking HW
– Recover from failure by rolling back to checkpoint
– Apply this to optimizations such as:
• Loop Invariant Code Motion (LICM)
• Partial Redundancy Elimination (PRE)
• Global Value Numbering (GVN)
Modifications to Atomic Regions
• Key insight
– Atomic regions maintain a read set and a write set
• Speculative Read (SR), Speculative Written (SW) bits in speculative cache
– Only SW bits are needed for checkpointing
• Repurpose SR bits to mark certain load locations for monitoring alias speculation failures
– Do not mark SR bits for regular loads
• Add ISA extensions to manipulate and check SR and SW bits to do alias checks
Extensions to the ISA (for Checkpointing)
already supported
• begin_atomic PC / end_atomic / abort_atomic
– Starts / ends / aborts atomic region
– PC is the address of the Safe-Version of atomic region
• Safe-Version: atomic region code without speculative optimizations
• abort_atomic jumps to Safe-Version after rollback
Extensions to the ISA (for Alias Checking)
newly added
• load.add.sr r1, addr
– Loads location addr to r1 just like a regular load
– Marks SR bit in cache line containing addr
– Used for marking monitored loads
• clear.sr addr
– Clears SR bit in cache line containing addr
– Used to mark end of load monitoring
• store.chk.(sr / sw / srsw) addr, r1
– Stores r1 to location addr just like a regular store
– sr: If SR bit is set, atomic region is aborted
– sw: If SW bit is set, atomic region is aborted
– srsw: If either the SR or the SW bit is set, atomic region is aborted
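The instruction semantics above can be modeled in software to make the SR/SW bookkeeping concrete. The following C sketch is a toy model under stated assumptions, not the hardware design: memory is a small array of words, a cache line holds 16 words, the checkpoint is a full copy of memory, and every `ar_*` name is invented for this example.

```c
#include <assert.h>
#include <string.h>

enum { WORDS = 256, WORDS_PER_LINE = 16, LINES = WORDS / WORDS_PER_LINE };
enum { CHK_SR = 1, CHK_SW = 2 };   /* srsw is CHK_SR | CHK_SW */

typedef struct {
    int mem[WORDS];          /* current (speculative) memory contents */
    int checkpoint[WORDS];   /* snapshot taken at begin_atomic */
    unsigned char sr[LINES]; /* Speculative Read bit per cache line */
    unsigned char sw[LINES]; /* Speculative Written bit per cache line */
    int aborted;             /* set when an alias check fails */
} AtomicRegion;

static int line_of(int addr) {
    assert(addr >= 0 && addr < WORDS);
    return addr / WORDS_PER_LINE;
}

/* begin_atomic: take a checkpoint and clear all SR/SW bits. */
void ar_begin(AtomicRegion *ar) {
    memcpy(ar->checkpoint, ar->mem, sizeof ar->mem);
    memset(ar->sr, 0, sizeof ar->sr);
    memset(ar->sw, 0, sizeof ar->sw);
    ar->aborted = 0;
}

/* abort_atomic: roll memory back; the caller then runs the Safe-Version. */
void ar_abort(AtomicRegion *ar) {
    memcpy(ar->mem, ar->checkpoint, sizeof ar->mem);
    memset(ar->sr, 0, sizeof ar->sr);
    memset(ar->sw, 0, sizeof ar->sw);
    ar->aborted = 1;
}

/* load.add.sr: regular load that also marks the line as monitored. */
int ar_load_add_sr(AtomicRegion *ar, int addr) {
    ar->sr[line_of(addr)] = 1;
    return ar->mem[addr];
}

/* clear.sr: end monitoring of the line containing addr. */
void ar_clear_sr(AtomicRegion *ar, int addr) {
    ar->sr[line_of(addr)] = 0;
}

/* Regular store: writes the value and sets the SW bit of the line. */
void ar_store(AtomicRegion *ar, int addr, int v) {
    ar->sw[line_of(addr)] = 1;
    ar->mem[addr] = v;
}

/* store.chk.*: abort if a checked bit is set, else behave like a store.
   Returns 1 if the store completed, 0 if the region was aborted. */
int ar_store_chk(AtomicRegion *ar, int addr, int v, int mode) {
    int l = line_of(addr);
    if (((mode & CHK_SR) && ar->sr[l]) || ((mode & CHK_SW) && ar->sw[l])) {
        ar_abort(ar);
        return 0;
    }
    ar_store(ar, addr, v);
    return 1;
}
```

Because the bits are kept per cache line in this model, a checked store to a different word of a monitored line also aborts. Regular loads are deliberately absent from the bookkeeping, mirroring the point that SR bits are not marked for them.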
How are these Instructions Used?
• Instrumentation goals
– Minimize alias checking instruction overhead
– Allow alias checks on a subset of accesses in AR
• A single AR can enable multiple optimizations
• Each code motion involves only a subset of accesses
• Two cases of code motion that involve alias checks
– Moving (hoisting) loads
– Moving (sinking) stores
Code Motion 1: Hoisting Loads
1. Assume a may alias with x and y
2. Hoist load a above store x and set up monitoring of a
– store.chk.sr x will roll back the AR on alias check failure
3. Sink clear.sr a to the end of the AR (if possible)
– store y will not trigger a rollback on an alias with a
– Now clear.sr a can be removed
Original:
    begin_atomic
    store x
    load a
    store y
    end_atomic
After hoisting load a (step 2):
    begin_atomic
    load.add.sr a
    store.chk.sr x
    clear.sr a
    store y
    end_atomic
After sinking and removing clear.sr a (step 3):
    begin_atomic
    load.add.sr a
    store.chk.sr x
    store y
    end_atomic
• Can selectively check against stores in path of code motion
• (Often) no instruction overhead for checking
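The effect of hoisting a load can be emulated in C with pointer comparisons standing in for the SR-bit check. This is a sketch under assumptions: the names `hoist_safe`, `hoist_spec`, and `run_hoist` are hypothetical, and the check is done at exact-address granularity rather than per cache line.

```c
#include <assert.h>
#include <stddef.h>

/* Safe-Version: original order (store x, load a, store y). */
static int hoist_safe(int *a, int *x, int *y, int v, int w) {
    *x = v;
    int r = *a;
    *y = w;
    return r;
}

/* Speculative version: returns 1 on commit, 0 on abort. */
static int hoist_spec(int *a, int *x, int *y, int v, int w, int *r) {
    *r = *a;          /* load.add.sr a: hoisted above store x, a is monitored */
    if (x == a)       /* store.chk.sr x: abort if x hits the monitored location */
        return 0;     /* nothing has been written yet, so rollback is trivial */
    *x = v;
    *y = w;           /* regular store: not checked against a */
    return 1;
}

/* begin_atomic PC: try the speculative code, fall back to the Safe-Version. */
int run_hoist(int *a, int *x, int *y, int v, int w) {
    assert(a != NULL && x != NULL && y != NULL);
    int r;
    if (hoist_spec(a, x, y, v, w, &r))
        return r;
    return hoist_safe(a, x, y, v, w);
}
```

Only the store that the load moved across needs a check. A store to y that aliases a is harmless here because the load still precedes it, as in the original order.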
Code Motion 2: Sinking Stores
1. Assume a may alias with x and y
2. Sink store a below load x and store y
– Alias with x is checked when SR bits are checked in store.chk.srsw a
– Alias with y is checked when SW bits are checked in store.chk.srsw a
Original:
    begin_atomic
    store a
    load x
    store y
    end_atomic
After sinking store a:
    begin_atomic
    load.add.sr x
    store y
    store.chk.srsw a
    end_atomic
• Can selectively check only loads in path of code motion
• Must check against all previous stores in atomic region
– Because SW bits cannot be set selectively
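Sinking a store can be emulated the same way, and this case shows why a checkpoint is needed: by the time the check fails, another store has already executed. The C sketch below uses hypothetical names (`sink_safe`, `sink_spec`, `run_sink`), compares exact addresses instead of cache lines, and saves only the one location written early.

```c
#include <assert.h>
#include <stddef.h>

/* Safe-Version: original order (store a, load x, store y). */
static int sink_safe(int *a, int *x, int *y, int v, int w) {
    *a = v;
    int r = *x;
    *y = w;
    return r;
}

/* Speculative version: returns 1 on commit, 0 on abort after rollback. */
static int sink_spec(int *a, int *x, int *y, int v, int w, int *r) {
    int saved_y = *y;         /* checkpoint of the location written early */
    *r = *x;                  /* load.add.sr x: x joins the monitored read set */
    *y = w;                   /* store y: y joins the write set (SW bit) */
    if (a == x || a == y) {   /* store.chk.srsw a: SR check and SW check */
        *y = saved_y;         /* rollback to the checkpoint */
        return 0;
    }
    *a = v;                   /* the sunk store */
    return 1;
}

/* begin_atomic PC: try the speculative code, fall back to the Safe-Version. */
int run_sink(int *a, int *x, int *y, int v, int w) {
    assert(a != NULL && x != NULL && y != NULL);
    int r;
    if (sink_spec(a, x, y, v, w, &r))
        return r;
    return sink_safe(a, x, y, v, w);
}
```

If a aliases x, the speculative load would miss the value stored to a. If a aliases y, the final value of the location would come from the wrong store. Both cases are caught by the single checked store.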
Illustrative Example: LICM and GVN
Original loop:
    // a, b may alias with *p, *q, *s.
    // *p, *q, *s may alias with each other.
    for (i = 0; i < 100; i++) {
        a = b + 10;
        *p = *q + 20;
        *s = *q + 20;
    }
With an atomic region around the loop:
    // PC points to the original loop
    begin_atomic PC
    for (i = 0; i < 100; i++) {
        a = b + 10;
        *p = *q + 20;
        *s = *q + 20;
    }
    end_atomic
• Put atomic region around loop
• Perform optimizations after inserting appropriate checks
Illustrative Example: LICM and GVN
Original loop:
    // a may alias with *p, *q
    // b may alias with *p
    // *p, *q, *s may alias with each other
    for (i = 0; i < 100; i++) {
        a = b + 10;
        *p = *q + 20;
        *s = *q + 20;
    }
After hoisting b + 10:
    // PC points to the original loop
    register int r1, r2;
    begin_atomic PC
    load.add.sr r1, b
    r2 = r1 + 10;
    for (i = 0; i < 100; i++) {
        store a, r2;
        store.chk.sr *p, *q + 20;
        store *s, *q + 20;
    }
    clear.sr b
    end_atomic
• Put atomic region around loop
• Perform optimizations after inserting appropriate checks
– Hoist b + 10 (LICM)
Illustrative Example: LICM and GVN
After eliminating the 2nd *q + 20:
    // PC points to the original loop
    register int r1, r2, r3, r4;
    begin_atomic PC
    load.add.sr r1, b
    r2 = r1 + 10;
    for (i = 0; i < 100; i++) {
        store a, r2;
        load.add.sr r3, *q
        r4 = r3 + 20;
        store.chk.sr *p, r4;
        clear.sr *q
        store *s, r4;
    }
    clear.sr b
    end_atomic
• Put atomic region around loop
• Perform optimizations after inserting appropriate checks
– Hoist b + 10 (LICM)
– Eliminate 2nd *q + 20 (GVN)
Illustrative Example: LICM and GVN
After sinking clear.sr *q out of the loop:
    // PC points to the original loop
    register int r1, r2, r3, r4;
    begin_atomic PC
    load.add.sr r1, b
    r2 = r1 + 10;
    for (i = 0; i < 100; i++) {
        store a, r2;
        load.add.sr r3, *q
        r4 = r3 + 20;
        store.chk.sr *p, r4;
        store *s, r4;
    }
    clear.sr *q
    clear.sr b
    end_atomic
• Put atomic region around loop
• Perform optimizations after inserting appropriate checks
– Hoist b + 10 (LICM)
– Eliminate 2nd *q + 20 (GVN)
– Sink clear.sr *q
Illustrative Example: LICM and GVN
After sinking the store to a out of the loop:
    // PC points to the original loop
    register int r1, r2, r3, r4;
    begin_atomic PC
    load.add.sr r1, b
    r2 = r1 + 10;
    for (i = 0; i < 100; i++) {
        load.add.sr r3, *q
        r4 = r3 + 20;
        store.chk.sr *p, r4;
        store *s, r4;
    }
    store.chk.srsw a, r2;
    clear.sr *q
    clear.sr b
    end_atomic
• Put atomic region around loop
• Perform optimizations after inserting appropriate checks
– Hoist b + 10 (LICM)
– Eliminate 2nd *q + 20 (GVN)
– Sink clear.sr *q
– Sink a = r2 (LICM)
Checked needlessly, but is fine since it does not alias with “a”
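The fully optimized loop can be emulated in portable C to check that the inserted checks cover the speculated aliases. The sketch below is an approximation with hypothetical names (`loop_safe`, `loop_spec`, `run_loop`): pointer comparisons replace the SR/SW bit checks, two saved values replace the hardware checkpoint, and the caller is assumed to respect the slide's static alias facts (in particular, b does not alias *s).

```c
#include <assert.h>
#include <stddef.h>

enum { N = 100 };

/* Safe-Version: the original loop, correct under any aliasing. */
static void loop_safe(int *a, int *b, int *p, int *q, int *s) {
    for (int i = 0; i < N; i++) {
        *a = *b + 10;
        *p = *q + 20;
        *s = *q + 20;
    }
}

/* Speculative version: returns 1 on commit, 0 after rolling back. */
static int loop_spec(int *a, int *b, int *p, int *q, int *s) {
    int old_p = *p, old_s = *s;      /* checkpoint of locations written early */
    int r2 = *b + 10;                /* load.add.sr b, then hoisted b + 10 (LICM) */
    for (int i = 0; i < N; i++) {
        int r4 = *q + 20;            /* load.add.sr *q */
        if (p == b || p == q)        /* store.chk.sr *p: monitored set is {b, *q} */
            goto rollback;
        *p = r4;
        *s = r4;                     /* reuses r4 instead of recomputing (GVN) */
    }
    if (a == b || a == q || a == p || a == s)   /* store.chk.srsw a */
        goto rollback;
    *a = r2;                         /* sunk store (LICM) */
    return 1;
rollback:
    *s = old_s;                      /* restore the checkpoint */
    *p = old_p;
    return 0;
}

/* begin_atomic PC: try the speculative loop, fall back to the original. */
void run_loop(int *a, int *b, int *p, int *q, int *s) {
    assert(a != NULL && b != NULL && p != NULL && q != NULL && s != NULL);
    if (!loop_spec(a, b, p, q, s))
        loop_safe(a, b, p, q, s);
}
```

When no pointers alias, the speculative path performs one load of b and one computation of *q + 20 per iteration. When p aliases q, or a aliases p, the check fails and the original loop produces the result instead.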
Where should we Place Atomic Regions?
• We chose to focus on loops
– Where most of the execution time is spent
– Loops provide ample range for opts such as LICM or PRE to perform large scale redundancy elimination
– Can amortize cost of atomic region instrumentation over multiple iterations for a given optimization
• When loops can potentially overflow speculation resources, loops are blocked into nested sub-loops appropriately
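Loop blocking of this kind can be sketched in C as a pair of nested loops, with one atomic region per inner sub-loop. The example is illustrative: `begin_atomic_stub` and `end_atomic_stub` are empty placeholders for the real instructions, and the `block` parameter stands in for whatever iteration count the footprint estimate would choose.

```c
#include <assert.h>

/* Empty placeholders for begin_atomic / end_atomic (hypothetical names). */
static void begin_atomic_stub(void) {}
static void end_atomic_stub(void) {}

/* Doubles src[0..n) into dst in sub-loops of at most `block` iterations,
   so each atomic region writes a bounded number of cache lines.
   Returns the number of atomic regions executed. */
int blocked_scale(int *dst, const int *src, int n, int block) {
    assert(block > 0);
    int regions = 0;
    for (int start = 0; start < n; ) {
        /* Clamp the sub-loop to the remaining iterations without overflow. */
        int end = (n - start < block) ? n : start + block;
        begin_atomic_stub();
        for (int i = start; i < end; i++)
            dst[i] = src[i] * 2;
        end_atomic_stub();
        regions++;
        start = end;
    }
    return regions;
}
```

A smaller block lowers the risk of overflowing speculative state but pays the begin/end cost more often, which is the trade-off the blocking pass has to balance.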
Memory Consistency Issues
• In a multiprocessor system, disabling conflict checks on speculative read lines can change access ordering
– Stores commit out of order at the end of an atomic region even when loads read values from remote processors
• Conventionally, this causes a rollback
• Not a problem in reality
– Compiler code motion causes access re-orderings anyway.
– If it is legal for the compiler to re-order, it is legal for HW
– If it was illegal for the compiler to re-order (e.g. due to synchronization), the atomic region would not be placed there
Compiler Toolchain
1. Run loop blocking pass that uses loop footprint estimation
2. Run application instrumented with alias check instructions to profile how many Atomic Region aborts a particular speculation would have caused.
3. Run Atomic Region instrumentation pass for loops that would benefit according to a cost-benefit model and the abort profile information.
4. Run modified optimization passes (e.g. LICM, PRE, GVN) that perform the code movements deemed beneficial by the cost-benefit model. Insert appropriate alias checks.
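The talk does not spell out its cost-benefit model, so the following C sketch is purely hypothetical. It shows one plausible shape such a test could take: combine the profiled abort rate with cycle estimates, and speculate only when the expected cost beats the unoptimized loop. The function name, parameters, and formula are all assumptions.

```c
#include <assert.h>

/* Hypothetical decision rule, not the model used in the talk.
   safe_cycles:     estimated cost of the original (Safe-Version) loop
   spec_cycles:     estimated cost of the speculatively optimized loop
   region_overhead: estimated cost of atomic region instrumentation
   abort_rate:      fraction of profiled executions that would have aborted */
int should_speculate(double safe_cycles, double spec_cycles,
                     double region_overhead, double abort_rate) {
    assert(abort_rate >= 0.0 && abort_rate <= 1.0);
    double commit_cost = spec_cycles + region_overhead;
    /* Pessimistic abort cost: all speculative work is wasted,
       then the Safe-Version runs from the checkpoint. */
    double abort_cost = commit_cost + safe_cycles;
    double expected = (1.0 - abort_rate) * commit_cost
                    + abort_rate * abort_cost;
    return expected < safe_cycles;
}
```

With these made-up inputs, a loop costing 1000 cycles that speculation reduces to 800 plus 50 of overhead is worth instrumenting at a 10% abort rate but not at 50%.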
Experimental Setup
• Compare three environments using LICM and GVN/PRE optimizations:
– BaselineAA:
• Unmodified LLVM-2.8 using basic alias analysis
• Default alias analysis used by -O3 optimization
– DSAA:
• Unmodified LLVM-2.8 using data structure alias analysis
• Experimental alias analysis with high time/space complexity
– LAS:
• Modified LLVM-2.8 using loop-based alias speculation
• Applications:
– SPEC INT2006, SPEC FP2006
• Simulation:
– SESC with Pin-based front end with Atomic Region support
– 32KB 8-way associative speculative L1 cache w/ 64B lines
Alias Analysis Results
• Breakdown of alias analysis results when run with LICM pass
• LAS is able to convert almost all may-aliases to no-aliases using profile information
Speedups
• Speedups normalized to BaselineAA
Atomic Region Characterization
• Low L1 cache occupancy due to not buffering speculatively read lines
• Overhead amortized over large atomic region
Summary
• Proposed exposing HW Atomic Region alias checking primitive to SW using ISA extensions
• Proposed loop-based Atomic Region instrumentation
– To maximize speculation opportunity
– To minimize instrumentation overhead
• Proposed an alias speculation framework leveraging Atomic Regions and evaluated using LICM and GVN/PRE
– May-alias results: 56% → 4% SPECINT2006, 43% → 1% SPECFP2006
– Speedup: 3% for SPECINT2006, 9% for SPECFP2006