Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era
Andrew Hilton, University of Pennsylvania ([email protected])
Duke :: March 18, 2010
Multi-Core Architecture
Single-thread performance growth has diminished
• Clock frequency has hit an energy wall
• Instruction-level parallelism (ILP) has hit energy, memory, and idea walls
Future chips will be heterogeneous multi-cores
• Few high-performance out-of-order cores (Core i7) for serial code
• Many low-power in-order cores (Atom) for parallel code
[chip diagram: one Core i7 alongside eight Atom cores]
Multi-Core Performance
Obvious performance key: write more parallel software
Less obvious performance key: speed up existing cores
• Core i7? Keep the serial portion from becoming a bottleneck (Amdahl)
• Atoms? Parallelism is typically not elastic
Key constraint: energy
• Thermal limitations of the chip, cost of energy, cooling costs, …
“TurboBoost”
Existing technique: Dynamic Voltage and Frequency Scaling (DVFS)?
• Increase clock frequency (requires increasing voltage)
+ Simple
+ Applicable to both types of cores
- Not very energy-efficient (energy ≈ frequency²)
- Doesn't help "memory bound" programs (performance gain < frequency gain)
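The slide's two approximations (energy per unit of work ≈ f², and only the core/L1/L2 fraction of runtime speeds up with frequency) can be put into a small sketch; the function name and the core/memory time split are mine, chosen purely for illustration:

```python
def turbo_ed2(base_ghz, boost_ghz, core_frac):
    """Relative speedup and ED^2 of a frequency boost, under the
    slide's approximations: energy-per-unit-work ~ f^2, and only the
    core/L1/L2 fraction of runtime (core_frac) scales with frequency."""
    boost = boost_ghz / base_ghz
    delay = core_frac / boost + (1.0 - core_frac)        # relative runtime
    energy = core_frac * boost ** 2 + (1.0 - core_frac)  # relative energy
    return 1.0 / delay, energy * delay ** 2

# Compute-bound code: full 25% speedup at constant ED^2.
# A half memory-bound program: smaller speedup, worse ED^2.
ideal = turbo_ed2(3.2, 4.0, 1.0)
membound = turbo_ed2(3.2, 4.0, 0.5)
```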
Effectiveness of “TurboBoost”
[charts: speedup (higher is better) and ED² (lower is better)]
Example: TurboBoost 3.2 GHz → 4.0 GHz (25%)
• Ideal conditions: 25% speedup, constant Energy × Delay²
• Memory bound programs: far from ideal
“Memory Bound”
Main memory is slow relative to the core (~250 cycles)
Cache hierarchy makes most accesses fast
• "Memory bound" = many L3 misses
• … or in some cases many L2 misses
• … or, for in-order cores, many L1 misses
• Clock frequency ("TurboBoost") accelerates only the core/L1/L2
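The point that most accesses are fast yet a few L3 misses dominate can be seen with a standard average-memory-access-time calculation; the L2/L3/memory latencies below come from the deck's hierarchy figure, while the 3-cycle L1 latency and all the miss rates are made-up illustration values:

```python
def amat(l1_lat, l1_miss, l2_lat, l2_miss, l3_lat, l3_miss, mem_lat):
    """Average memory access time in cycles: each level's latency is
    paid only by accesses that miss in the level above it."""
    return l1_lat + l1_miss * (l2_lat + l2_miss * (l3_lat + l3_miss * mem_lat))

# With every access hitting the L1, AMAT is just the L1 latency;
# a 10% L1 miss rate with misses half-filtered at each level already
# quadruples the average, driven almost entirely by the 250-cycle memory.
fast = amat(3, 0.0, 10, 0.5, 40, 0.5, 250)
bound = amat(3, 0.1, 10, 0.5, 40, 0.5, 250)
```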
[diagram: memory hierarchy with per-core L1$ and L2$ (10 cycles), a shared L3$ (40 cycles), and main memory (250 cycles), shared by the Core i7 and the Atoms]
Goal: Help Memory Bound Programs
Wanted: a complementary technique to TurboBoost
Successful applicants should
• Help "memory bound" programs
• Be at least as energy efficient as TurboBoost (at worst, constant ED²)
• Work well with both out-of-order and in-order cores
Promising previous idea: latency tolerance
• Helps "memory bound" programs
My work: energy efficient latency tolerance for all cores
• Today: primarily out-of-order (BOLT) [HPCA'10]
Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects
Other work and future plans
LLC (Last-Level Cache) Misses
What is this picture? Loads A & H miss the cache
This is an in-order processor
• Misses serialize: latencies add up and dominate performance
We want Miss-Level Parallelism (MLP): overlap A & H
[timeline: two serialized 250-cycle misses (not to scale)]
Miss-Level Parallelism (MLP)
One option: prefetching
• Requires predicting the address of H at A
Another option: out-of-order execution (Core i7)
• Requires a sufficiently large "window" to do this
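The payoff of MLP is easy to quantify with a toy model; the function below is my own sketch, assuming misses overlap perfectly up to some MLP degree:

```python
import math

def miss_stall_cycles(n_misses, miss_latency, mlp):
    """Cycles spent waiting on misses when up to `mlp` of them overlap.
    With mlp=1 the latencies serialize and add; with mlp=2 the two
    250-cycle misses from the slide cost only one miss latency."""
    return miss_latency * math.ceil(n_misses / mlp)

serialized = miss_stall_cycles(2, 250, 1)   # in-order: A then H
overlapped = miss_stall_cycles(2, 250, 2)   # MLP: A and H together
```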
[timeline: the two 250-cycle misses overlap]
Out-of-Order Execution & “Window”
Important "window" structures
• Register file (bounds in-flight instructions): 128 insns on Core i7
• Issue queue (bounds un-executed instructions): 36 on Core i7
• Sized to "tolerate" (keep the core busy for) ~30-cycle latencies
• To tolerate ~250 cycles, structures an order of magnitude bigger are needed
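The order-of-magnitude claim follows from a Little's-law style estimate; this framing (and the 4-wide issue rate) is mine, not the slide's:

```python
def window_needed(miss_latency, issue_rate):
    """Little's-law estimate: to keep a core issuing `issue_rate`
    instructions/cycle busy across a miss of `miss_latency` cycles,
    roughly latency * rate instructions must be in flight."""
    return miss_latency * issue_rate

# ~30-cycle latencies need ~120 in-flight instructions, close to the
# Core i7's 128 registers; ~250-cycle misses need roughly 8x more.
small = window_needed(30, 4)
large = window_needed(250, 4)
```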
Latency tolerance big idea: scale window virtually
[pipeline diagram: Fetch → Rename → Issue Queue → Register File/FU → D$, with a reorder buffer; load A is an LLC miss, B is completed, C and D are unexecuted]
Latency Tolerance
Prelude: add a slice buffer
• New structure (not in conventional processors)
• Can be relatively large: low bandwidth, not in the critical execution core
[pipeline diagram: slice buffer added alongside the reorder buffer]
Phase #1: long-latency cache miss → slice out
• Pseudo-execute: copy to the slice buffer, release the register & IQ slot
• Propagate "poison" to identify dependents
• Pseudo-execute them too
• Proceed under the miss
[pipeline diagram animation: miss A and its dependents drain into the slice buffer while later, independent instructions proceed]
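The slice-out phase above amounts to a forward pass over the instruction stream, propagating poison through register dependences; here is a toy model of that pass (the instruction encoding and register names are mine):

```python
def slice_out(instrs, missing_loads):
    """Toy slice-out: an instruction that reads a poisoned register is
    pseudo-executed into the slice buffer and poisons its own
    destination; independent instructions execute normally.
    instrs: (name, src_regs, dst_reg) triples in program order."""
    poisoned = set()
    slice_buf, executed = [], []
    for name, srcs, dst in instrs:
        if name in missing_loads or poisoned & set(srcs):
            slice_buf.append(name)        # pseudo-execute
            if dst:
                poisoned.add(dst)         # propagate poison
        else:
            executed.append(name)
            poisoned.discard(dst)         # dst now holds a good value
    return slice_buf, executed

# A misses; C and D depend (transitively) on it; B and E do not.
program = [("A", ["r1"], "r2"), ("B", ["r3"], "r4"),
           ("C", ["r2"], "r5"), ("D", ["r5"], "r6"),
           ("E", ["r4"], "r7")]
```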
Phase #2: cache miss return → slice in
• Allocate new registers
• Put back in the issue queue
• Re-execute the instruction
• Problems with sliced-in instructions (exceptions, mis-predictions)?
• Recover to the checkpoint (taken before A)
[pipeline diagram animation: returning miss A re-enters rename and the issue queue from the slice buffer; an exception on a sliced-in instruction triggers recovery to the checkpoint]
Slice Self Containment
Important for latency tolerance: self-contained slices
• A, D, & E have miss-independent inputs
• Capture these values during slice out
• This decouples the slice from the rest of the program
[dependence graph of instructions A–H showing the miss slice and its captured inputs]
Latency Tolerance
Latency tolerance example
• Slice out: the miss and its dependent instructions "grow" the window
• Slice in: after the miss returns
[timeline, energy ≈ number of boxes: Energy 1.5×, Delay 0.5×; combined into ED² (< 1.0 is good): ED² = 0.38×]
Previous Design: CFP
Prior design: Continual Flow Pipelines [Srinivasan'04]
• Obtains speedups, but also slowdowns
• Typically not energy efficient
[charts: speedup (higher is better) and ED² (lower is better) for CFP]
Energy-Efficient Latency Tolerance?
Efficient implementation
• Re-use existing structures when possible
• New structures must be simple and low-overhead
Runtime efficiency
• Minimize superfluous re-executions
Previous designs have not achieved (or considered) these
• Waiting Instruction Buffer [Lebeck'02]
• Continual Flow Pipeline [Srinivasan'04]
• Decoupled Kilo-Instruction Processor [Pericas'06,'07]
Sneak Preview: Final Results
This talk: my work on efficient latency tolerance
+ Improved performance
+ Performance robustness (do no harm)
+ Performance is energy efficient
[charts: speedup (higher is better) and ED² (lower is better)]
Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects
Other work and future plans
Examination of the Problem
Problem with existing designs: register management
• Miss-dependent instructions free registers when they execute
• Actually, all instructions free registers when they execute
What's wrong with this?
• No instruction-level precise state → hurts on branch mispredictions
• Execution-order slice buffer → hard to re-rename & re-acquire registers
[pipeline diagram: execution-ordered slice buffer (A D H K L E) with checkpoints]
BOLT Register Management
Youngest instructions: kept in the re-order buffer
• Conventional, in-order register freeing
Miss-dependent instructions: in the slice buffer
• Execution-based register freeing
In-order speculative retirement stage
• Is the head of the ROB completed or poisoned?
• Release its registers
• Poisoned instructions enter the slice buffer
• Completed instructions are done and simply removed
[pipeline diagram animation: instructions retire speculatively from the ROB head; poisoned A moves to the slice buffer, completed instructions are removed]
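The speculative retirement loop above can be sketched directly; this is a toy model (the ROB encoding is mine), not BOLT's actual logic:

```python
def speculative_retire(rob):
    """Toy in-order speculative retirement: drain the ROB head while it
    is completed or poisoned; poisoned instructions move to the slice
    buffer, completed ones are simply removed. Retirement stops at the
    first unexecuted instruction. rob: list of (name, state) pairs."""
    slice_buf, removed = [], []
    while rob and rob[0][1] in ("completed", "poison"):
        name, state = rob.pop(0)
        (slice_buf if state == "poison" else removed).append(name)
    return slice_buf, removed

# Poisoned A goes to the slice buffer, completed B is removed,
# and retirement stops at unexecuted C (D stays behind it).
rob = [("A", "poison"), ("B", "completed"),
       ("C", "unexecuted"), ("D", "completed")]
```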
BOLT Register Management
Benefits of BOLT's register management
• Youngest instructions (ROB) get conventional recovery (do no harm)
• Program-order slice buffer allows re-use of SMT ("HyperThreading")
• Scale a single, conventionally sized register file
Register File Contribution #1: hybrid register management, the best of both worlds
Challenging part: two algorithms, one register file
• Note: two register files are not a good solution
[pipeline diagram: SMT threads and slice-buffer instructions sharing one register file]
Two Algorithms, One Register File
Conventional algorithm (ROB)
• In-order allocation/freeing from a circular queue
• Efficient squashing support by moving the queue pointer
Aggressive algorithm (slice instructions)
• Execution-driven reference counting scheme
How to combine these two algorithms?
• The execution-based algorithm uses reference counting
• Efficiently encode the conventional algorithm as reference counting
• Combine both into one reference count matrix
Register File Contribution #2: efficient implementation of the new hybrid algorithm
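The unifying idea is that any freeing discipline can be expressed as reference counting: a register is free when its count reaches zero. This toy free list is my own illustration of that idea, not BOLT's reference count matrix:

```python
class RefCountRegFile:
    """Toy reference-counted physical register free list: the ROB's
    in-order free and a slice reader's execution-driven free are both
    just count decrements; the register returns to the free list only
    when the last reference is released."""
    def __init__(self, n):
        self.counts = [0] * n
        self.free = list(range(n))

    def alloc(self):
        reg = self.free.pop()
        self.counts[reg] = 1          # one reference: the producer
        return reg

    def add_ref(self, reg):
        self.counts[reg] += 1         # e.g. a slice-buffer consumer

    def release(self, reg):
        self.counts[reg] -= 1
        if self.counts[reg] == 0:
            self.free.append(reg)     # last reference gone: truly free
```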
Management of Loads and Stores
Large window requires support for many loads and stores
• The window is effectively A–V now; what about the loads & stores?
• This could be an hour+ talk by itself… so just a small piece
Store to Load Dependences
Different from register state: cannot capture inputs
• Store → load dependences are determined by addresses
• Cannot "capture" them like registers
• Must be able to find the proper (older, matching) stores
• Must avoid younger matching stores ("write-after-read" hazards)
[diagram: a load searching among older stores A–E for the correct match]
[conventional store buffer, tail (younger) → head (older):
index:   86   85   84   83   82   81   80
address: 7B0  2AC  388  1B4  384  1AC  380
value:   90   78   ??   56   ??   34   12
poison:   0    0    1    0    1    0    0 ]
Conventional store queue/store buffer
• Holds stores in program order
• Loads search "associatively" (all entries in parallel)
• Doesn't scale to the sizes we need
For latency tolerance, we need…
• Poison (easy)
• A scalable way to search, accounting for age (hard)
Chained Store Buffer
Replace associative search with iterative, indexed search
• Overlay the store buffer with an address-based hash table
• Exploits the in-order nature of speculative retirement to build it
[chained store buffer, tail (younger) → head (older):
index:   86   85   84   83   82   81   80
address: 7B0  2AC  388  1B4  384  1AC  380
value:   90   78   ??   56   ??   34   12
poison:   0    0    1    0    1    0    0
link:    44   81    0   15    0   77    0
plus a 64-entry root table indexed by low address bits, e.g. AC → 85, B0 → 86, B4 → 83 ]
Loads follow the chain starting at the appropriate root table entry
• For example, a load to address 1AC: root AC → store 85 (address 2AC, no match) → link → store 81 (address 1AC): match, forward
Deferred loads ignore younger stores to avoid WAR hazards
• For example, a deferred load to address 1B4 …
• … whose immediately older store is 81 (noted when entering the pipeline)
• Root B4 → store 83 is younger than 81: ignore, follow its link → go to D$
+ Non-speculative search, scalable
+ Fast
• Most non-forwarding loads access the root table only
• Most forwarding loads find their store on the first hop
• Average number of excess hops < 0.05 with a 64-entry root table
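The root-table-plus-link structure behaves like a per-bucket linked list threaded through the store buffer. The sketch below is my toy model of that search (poison bits are not modeled, and the class/field names are mine); the addresses mirror the slide's example:

```python
class ChainedStoreBuffer:
    """Toy chained store buffer: stores arrive in program order, a root
    hash table (indexed by low address bits) points at the youngest
    store in each bucket, and each store links to the previous store in
    its bucket. A load walks the chain, skipping stores younger than
    itself, until it finds an address match."""
    def __init__(self, root_entries=64):
        self.root_entries = root_entries
        self.root = {}      # bucket -> index of youngest store
        self.stores = {}    # index -> (addr, value, link to older store)

    def _bucket(self, addr):
        return addr % self.root_entries

    def add_store(self, index, addr, value):
        b = self._bucket(addr)
        self.stores[index] = (addr, value, self.root.get(b))
        self.root[b] = index

    def load(self, addr, older_than):
        """Forwarded value from the youngest matching store older than
        index `older_than`, or None (meaning: go to the D$)."""
        idx = self.root.get(self._bucket(addr))
        while idx is not None:
            s_addr, value, link = self.stores[idx]
            if idx < older_than and s_addr == addr:
                return value
            idx = link
        return None

csb = ChainedStoreBuffer()
for idx, (addr, val) in enumerate(
        [(0x380, 0x12), (0x1AC, 0x34), (0x384, None), (0x1B4, 0x56),
         (0x388, None), (0x2AC, 0x78), (0x7B0, 0x90)], start=80):
    csb.add_store(idx, addr, val)
```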
BOLT: Implementation Recap
Three key implementation efficiencies in BOLT
1. Re-use of existing renaming hardware
2. Hybrid register management algorithm in a single register file
3. Efficient management of loads and stores
[pipeline diagram: BOLT datapath with slice buffer, checkpoint, and chained store buffer]
Experimental Evaluation
SPEC 2006 benchmarks
• Focus on memory bound programs (TurboBoost gets < 15%)
Performance: detailed cycle-level timing simulation of x86
• Baseline "Core i7" (includes prefetching)
Energy: re-execution overhead + new structures
• Estimate energy of new structures using CACTI-4.1 [Tarjan06]
CFP vs. BOLT
• Speedups: overall, 5% (CFP) → 11% (BOLT); MEM subset, 14% (CFP) → 18% (BOLT)
• Re-execution: increases due to more latency tolerance
• ED²: overall improvement
• Fewer and simpler new structures (lower energy)
• Increased re-executions typically correspond to higher performance
[charts: speedup (higher is better); re-execution overhead and ED² (lower is better)]
Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects
Other work and future plans
Non-Blocking Latency Tolerance
Latency tolerance = non-blocking execution
• Re-execution should not block the pipeline either
• Suppose B & C miss (C depends on B)
• C should also not block the pipeline: reapply latency tolerance
Execution Inefficiency
Dynamic inefficiency: excessive multiple re-execution
• Observe: multiple re-execution comes from dependence on multiple loads
• Two possibilities: loads in parallel or loads in series
• Different approaches to each
Loads in Parallel
Example: accumulating sum
    for (i = 0; i < n; i++)
        total += array[i];
Assembly:
    loop:
        load [r1] -> r2
        add  r2 + r3 -> r3
        add  r1 + 4 -> r1
        bne  r1, r5, loop
[execution diagram: each iteration's load is independent of the others]
Loads in Parallel
[timeline: misses A–P overlap (MLP!); Energy 3.8×, Delay 0.4×, ED² 0.6×]
Goal: keep the performance, reduce the re-executions
Join Pruning
A's miss poisoned B… so A's return provides its antidote
B now executes correctly and provides the antidote to D
• D must capture this input
D is still poisoned by C, so it cannot provide an antidote
F is not receiving any antidote: no need to re-execute it
[dependence graph of instructions A–H illustrating antidote propagation]
Antidote Vector
BOLT filters re-execution using an antidote bit-vector
• Track (per logical register) whether an antidote is available
• Also works through store → load dependences (the poisoning store is known)
[pipeline diagram: antidote vector added alongside the slice buffer and chained store buffer]
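The antidote filter from the join-pruning example can be modeled as a single pass over the slice: re-dispatch an instruction only if one of its inputs just received an antidote, and let it pass the antidote on only if all its inputs are clean. This is my toy rendering (register and instruction names invented), not BOLT's hardware:

```python
def join_prune(slice_instrs, returned_regs):
    """Toy antidote filter. slice_instrs: (name, srcs, dst) in program
    order, every dst initially poisoned; returned_regs: registers whose
    values just arrived (the returning miss's antidote)."""
    poisoned = {dst for _, _, dst in slice_instrs}
    antidote = set(returned_regs)
    poisoned -= antidote
    redispatch = []
    for name, srcs, dst in slice_instrs:
        if antidote & set(srcs):              # receives an antidote
            redispatch.append(name)
            if not poisoned & set(srcs):      # all inputs clean:
                antidote.add(dst)             # pass the antidote on
                poisoned.discard(dst)
    return redispatch

# A's return cures B; D re-executes to capture B's value but stays
# poisoned by C; F receives no antidote and is pruned.
slice_instrs = [("B", ["rA"], "rB"), ("C", ["rM"], "rC"),
                ("D", ["rB", "rC"], "rD"), ("F", ["rD"], "rF")]
```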
Join Pruning
[timeline: join pruning keeps the overlap but skips useless re-executions; Energy 3.8× → 2.8×, Delay 0.4×, ED² 0.6× → 0.45×]
Join Pruning Performance
• Performance: strictly better (especially lbm)
• Execution overhead: strictly lower (especially lbm)
• ED²: overall improvements (again, especially lbm)
[charts: speedup (higher is better); execution overhead and ED² (lower is better)]
Loads in Series
Example: count the elements of a linked list
    while (node != NULL) {
        count++;
        node = node->next;
    }
Assembly:
    loop:
        load [r1] -> r1
        add  r2 + 1 -> r2
        bnz  r1, loop
[execution diagram: each load depends on the previous one]
Pointer Chasing
[timeline: the loads serialize, so latency tolerance buys no overlap; Energy 2.2×, Delay 1×, ED² 2.2×]
Dr! Dr! It hurts when I apply latency tolerance to pointer chasing…
So Don't Do It…
[timeline: without latency tolerance, Energy 1×, Delay 1×, ED² 1× (vs. Energy 2.2×, ED² 2.2× with it)]
Loads in Series
Not all dependent loads are bad
    for (int i = 0; i < n; i++)
        x += objects[i]->val;
Assembly:
    loop:
        load [r1] -> r2
        load [r2] -> r3
        add  r4, r3 -> r4
        add  r1, 4 -> r1
        bne  r1, r5, loop
Important: prune pointer chasing only
• Preserve general indirection
[execution diagram: the two loads in an iteration are dependent, but iterations are parallel with each other]
Pointer Chasing
How to distinguish the two?
• Pointer chasing: a load poisons younger instances of itself
• Benign indirection: poison comes from a different (static) load
• Loop induction → not a chain of poisoned loads
    loop1:
        load [r1] -> r1
        add  r2 + 1 -> r2
        bnz  r1, loop1
    loop2:
        load [r1] -> r2
        load [r2] -> r3
        add  r4, r3 -> r4
        add  r1, 4 -> r1
        bne  r1, r5, loop2
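The distinguishing rule reduces to comparing the poisoned load's PC with the poisoning load's PC; this classifier is a toy sketch of that rule (the event encoding and PCs are invented for illustration):

```python
def classify_poison(events):
    """Toy PC-based classifier: a load poisoned by an older dynamic
    instance of its own static load (same PC) is pointer chasing;
    poison arriving from a different static load is benign indirection.
    events: (consumer_pc, producer_pc) pairs recorded on poisoning."""
    chasing, benign = set(), set()
    for consumer_pc, producer_pc in events:
        (chasing if consumer_pc == producer_pc else benign).add(consumer_pc)
    return chasing, benign - chasing

# loop1's load poisons itself (pointer chasing); loop2's second load
# is poisoned by a different static load (benign indirection).
events = [(0x40, 0x40), (0x88, 0x80)]
```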
Extended Antidote Vector
Idea: extend poison information with the low bits of the PC
• Poison from the same PC → pointer chasing
One implementation: detect at execution
• Shuts pointer-chasing down immediately
• Complicates latency-critical execution structures
A better one: detect at re-dispatch (extend the antidotes)
• Learn the identity of the pointer-chasing PC and shut down future instances
[pipeline diagram: antidote vector extended with PC bits, alongside the slice buffer and chained store buffer]
Pointer Chasing Performance
• Speedups: same (good: no harm)
• Execution overhead: reduced (mcf 290% → 44%)
• ED²: overall improvement (mcf basically breaks even now)
[charts: speedup (higher is better); execution overhead and ED² (lower is better)]
BOLT vs. TurboBoost
BOLT is able to help performance where TurboBoost cannot…
…and more energy efficiently
BOLT + TurboBoost?
• Synergistic: BOLT "un-memory-bounds" programs
• BOLT + TurboBoost is still an ED² win!
[charts: speedup (higher is better) and ED² (lower is better)]
Partial Summary
Latency tolerance
• Scale the window virtually under long cache misses
• No good implementations + excessive overhead
• Potentially a good complement to TurboBoost
Energy-efficient latency tolerance
• Low-cost implementation: re-use SMT, registers & load/store structures
• Low runtime overhead: prune pointer chasing and "joins"
• Actually a good complement to TurboBoost
• Applicable to both in-order and out-of-order cores
iCFP: In-order Latency Tolerance
BOLT – (out-of-order core) + (in-order core) = iCFP [HPCA'09]
• Some details obviously differ due to the in-order pipeline
• Useful for a miss at any cache level (L1, L2, L3)
• Joint work with Santosh Nagarakatte
Other in-order latency-tolerant designs
• Sun's Rock "processor" [Chaudhry'09]
• Simple Latency Tolerant Processor [Nekkalapu'09]
[diagram: in-order pipeline (Fetch, two register files, FU, D$) with slice buffer, checkpoint, antidote vector, and chained store buffer]
Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
Other work and future plans
Other Work / Future Directions
Micro-architecture
• Control independence [ISCA'07]
• Plans for more work related to latency tolerance
• Store latency tolerance: possibly changes the out-of-order "sweet spot"
• In submission / in progress: energy efficient load/store data path
• Trident: reduce D$ accesses → improve energy + performance
• SMT-directory: reduce load-queue coherence searches in SMT
• Future work: register reference counting → register file gating
• Generally interested in performance and energy
Other Work / Future Directions
Simulation and workload methodologies
• Multi-programming workload methodology [MobS'09]
• Future plans include adapting these ideas to multi-threaded applications
• Generally interested in research on better simulation
Operating systems and security
• Operating-system-based security project for layered sandboxing
• Provides system calls to restrict the behavior of less trusted code
• Many future plans on this project, most involving hardware support
• Generally interested in how hardware can improve software
[ 98 ][ 98 ]
Sun’s Rock
Rock [Chaudhry'09] does in-order latency tolerance
• Slice buffer ("Deferral Queues") divided by multiple checkpoints
• Re-execution limited to the oldest region
• Values from slices are reintegrated into the main register file when the DQs empty
[diagram: in-order pipeline with two register files and a slice buffer divided by multiple checkpoints]
Unrolled Loops
What if the compiler unrolled a loop with pointer chasing?
• Still detectable, it just takes one detection per unrolled copy
    loop1:
        load [r1] -> r1
        bz   r1, endMidLoop
        load [r1] -> r1
        add  r2 + 2 -> r2
        bnz  r1, loop1
FIESTA
Experiments with multiple programs
• Simulation: cannot run the entire program (too slow)
• How do you do this?
Fixed workloads: run all programs for X million insns
Variable workloads: run both until the sum = X million insns
Two very different answers to one question…
Why is this? Which is the right answer?
Traditional Fixed-Workload
Single-program workload × N
• X insns (e.g. 5M/sample) from each program
• Workload composition is fixed across experiments
+ Direct comparisons between experiments
- Load imbalance: time spent executing only the slowest programs
[timeline: A: 5M and B: 5M instructions]
Variable-Workload
Multi-program execution defines the workload
• Execute all programs until some condition (e.g. total insns = 10M)
• Normalize to the single-program region defined by this execution
+ Eliminates load imbalance (by construction)
- Naturally oversamples programs which perform better
[timeline: A: 3M and B: 7M instructions]
Deconstructing Load Imbalance
Fixed-workload runs experience two forms of imbalance
Sample imbalance: different standalone runtimes
• Artifact of finite experiments
• Should be eliminated
• Easy: choose samples with the same standalone runtimes
Schedule imbalance: asymmetric ("unfair") contention
• Characteristic of concurrent execution
• Should be preserved and measured
FIESTA
FIESTA: Fixed-Instruction with Equal STAndalone runtimes
• Run single programs for C cycles, record the instruction count
• Build fixed workloads from the time-balanced samples
+ Eliminates sample imbalance
+ Remaining imbalance is schedule imbalance
[timeline: A: 5M and B: 7M instruction samples with equal standalone runtimes; the remaining difference is schedule imbalance]
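FIESTA's sample construction can be sketched in a few lines; this is my toy rendering (function name and IPC values invented), where running each program alone for the same cycle budget yields instruction counts that define the fixed workload:

```python
def fiesta_workload(standalone_ipc, budget_cycles):
    """FIESTA-style sample construction (toy): run each program alone
    for the same cycle budget C and record how many instructions it
    retires; those per-program counts define the fixed workload, so the
    samples are time-balanced by construction."""
    return {prog: round(ipc * budget_cycles)
            for prog, ipc in standalone_ipc.items()}

# A program retiring 1.0 IPC and one retiring 1.4 IPC, each run alone
# for 5M cycles, give the slide's A: 5M / B: 7M sample pair.
workload = fiesta_workload({"A": 1.0, "B": 1.4}, 5_000_000)
```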
Other Work: FIESTA
Experiments with multiple programs
• Simulation: cannot run the entire program (too slow)
• How do you do this?
Fixed workloads: run all programs for X million insns
Variable workloads: run both until the sum = X million insns
FIESTA [MobS'09]: create a-priori balanced samples
Joint work with Neeraj Eswaran
Other Work: Paladin
Large software systems: many components with different levels of trust
• No way to restrict the behavior of called modules
Paladin [in submission]: OS support for layered sandboxing
• New system calls to restrict system-call behavior
• Also ensures restrictions are only removed when the module returns
• Joint work with Jeff Vaughan
[diagram: trusted code calling a junior developer's module, a plugin, and a third-party library]