TRANSCRIPT
PUMA2: Bridging the CPU/Memory Gap through Prediction & Speculation

Babak Falsafi
Team Members: Chi Chen, Chris Gniady, Jangwoo Kim, Tom Wenisch, Se-Hyun Yang
Alumni: Cem Fide, An-Chow Lai
Impetus Group, Computer Architecture Lab (CALCM), Carnegie Mellon University
http://www.ece.cmu.edu/~impetus

Copyright © 2002 by Babak Falsafi
Our Group's Research Focus

Handheld to server: memory design, processor design
Design issues: performance, power, reliability, programmability
Impetus Projects

Today's talk: PUMA2: Bridging the CPU/Memory Gap
Others:
1. PowerTap: Power-Aware Computer Systems
2. JITR: Soft-Error-Tolerant Microarchitecture
3. GigaTrans: Beyond Superscalar & ILP
Goals: impact products, research, and education in architecture
E.g., Reactive NUMA => Sun WildFire DSM
Outline

Impetus Overview
PUMA2:
  Hitting the Memory Wall
  Last-Touch Memory Access
  Speculative Memory Ordering
Hitting the Memory Wall

Growing distance: processors getting faster, memory only getting larger
Caching less effective:
  Simplistic (demand) fetch/replace
  Deeper hierarchies => higher worst-case access latencies
  Multiple hierarchies in multiprocessors!
50% processor utilization in servers [Ailamaki, VLDB'99]: commercial databases running on a Xeon server
Conventional Data Demand Fetch

Fetch data upon CPU request: zero lookahead upon miss, crude guess for replacement
Works only when the working set fits in L1/L2 and changes infrequently
An out-of-order core can at best tolerate the L1-to-L2 latency

[Figure: hierarchy CPU -> L1 -> L2 -> L3 -> Memory, with per-level access latencies of 2, 10, 50, and 500 clocks]
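The compounding cost of these per-level latencies can be made concrete with a small average-memory-access-time calculation. This sketch is not from the talk: the latencies follow the hierarchy figure (2/10/50/500 clocks), while the hit rates are invented purely for illustration.

```python
# Illustrative AMAT (average memory access time) calculation, not from the
# talk: per-level latencies follow the hierarchy figure (2/10/50/500 clocks);
# the hit rates are made up purely for illustration.

def amat(levels):
    """levels: list of (latency_clk, hit_rate), deepest level last (hit_rate 1.0)."""
    time, reach = 0.0, 1.0  # reach = fraction of accesses that get this far
    for latency, hit_rate in levels:
        time += reach * latency    # every access reaching this level pays its latency
        reach *= 1.0 - hit_rate    # the rest miss and go one level deeper
    return time

# Hypothetical hit rates of 95% (L1), 80% (L2), 50% (L3):
hierarchy = [(2, 0.95), (10, 0.80), (50, 0.50), (500, 1.0)]
print(f"AMAT = {amat(hierarchy):.1f} clk")  # the rare memory trips dominate
```

Even with only 0.5% of accesses reaching memory, the 500-clock trips account for nearly half of the average access time, which is why hiding that latency matters so much.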
PUMA2: Proactively Uniform Memory Access Architecture

Goal: bridge the CPU/memory performance gap
How? Prediction and speculation
  Hide/tolerate memory latency
  Hardware techniques transparent to software
This Talk

1. Last-touch memory access model
   Predict the last processor reference; evict and fetch upon the last reference
   + Significantly enhances fetch lookahead
2. Speculative memory ordering
   Overlap accesses to tolerate latency
   But overlapping memory accesses affects memory order
   Show that hardware can both relax & enforce order
Outline

Impetus Overview
PUMA2:
  Hitting the Memory Wall
  Last-Touch Memory Access
  Speculative Memory Ordering
Hiding the Memory Latency Using Prediction/Speculation

Mechanisms required:
1. Predict "what" memory address to fetch. Goal: minimize traffic, avoid thrashing, etc.
2. Predict "when" to fetch. Goal: maximize latency hiding
3. Decide "where" fetched data is placed. Goal: avoid lookups in auxiliary structures
Current Proposals for Data Prefetching

Custom prefetchers: stride, stream, dependence-based, etc.
General-purpose prefetchers: precomputation/slipstream prefetchers, address-correlating prefetchers
Key shortcomings:
  Insufficient lookahead (e.g., 10~100 cycles)
  Low accuracy for general access patterns
  Cannot place data directly in L1 (must use buffers)
Related Work: Address-Correlating Prefetchers

Markov prefetchers (Joseph & Grunwald, ISCA'97)
  Predict "what": correlate L1 miss addresses
  Predict "when": consecutive L1 misses
  + High prediction coverage
  - Clustered L1 misses => insufficient lookahead
  - One-to-many predictions => low accuracy
  - Prefetch into a buffer => high prefetch hit time
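To make the baseline concrete, here is a minimal sketch (class and method names are mine) of a Markov-style prefetcher in the spirit of Joseph & Grunwald: each L1 miss address is correlated with the miss addresses observed to follow it, and those successors are prefetched on the next miss to the same address.

```python
# Minimal sketch of a Markov-style prefetcher (names are mine): each L1 miss
# address is correlated with the miss addresses that followed it, and those
# successors are prefetched on the next miss to the same address.

from collections import defaultdict, deque

class MarkovPrefetcher:
    def __init__(self, fanout=2):
        # Per-address list of recently seen successor misses (one-to-many).
        self.successors = defaultdict(lambda: deque(maxlen=fanout))
        self.last_miss = None

    def on_miss(self, addr):
        # Learn: the current miss follows the previous one.
        if self.last_miss is not None and addr not in self.successors[self.last_miss]:
            self.successors[self.last_miss].appendleft(addr)
        self.last_miss = addr
        # Predict: prefetch previously seen successors -- into a buffer, since
        # one-to-many predictions are too inaccurate to pollute the L1 with.
        return list(self.successors[addr])

# Miss stream echoing the slides' example addresses: A1, C3, A3 repeating.
pf = MarkovPrefetcher()
for a in ["A1", "C3", "A3", "A1", "C3"]:
    pf.on_miss(a)
pred = pf.on_miss("A3")  # A3 was previously followed by A1
print(pred)
```

Because the predictor only fires on a miss and the next miss often follows within a few cycles, the prefetch has little time to hide latency, which is exactly the lookahead problem the next slide illustrates.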
Insufficient Lookahead in Correlating Prefetchers

Consecutive L1 misses are often clustered; exacerbated in out-of-order cores

Fetch on miss (lookahead starts only at the miss):
load/store A1 (miss)
load/store A1 (hit)
load/store C3 (miss)
...
load/store A3 (miss)
We Propose: Fetch on Last Touch

Predict & fetch on the last touch
+ Evict dead block
+ Enhance fetch lookahead
+ Fetch directly into L1

Fetch on last touch (lookahead starts at the last touch):
load/store A1 (miss)
load/store A1 (hit)    <- last touch
load/store C3 (miss)
...
load/store A3 (miss)

Fetch on miss (lookahead starts only at the miss):
load/store A1 (miss)
load/store A1 (hit)
load/store C3 (miss)
...
load/store A3 (miss)
Enhancing Fetch Lookahead

[Figure: cumulative distribution of processor cycles (2 to >2048) between last touch & next miss (our proposal) vs. between two misses (Markov), with the L2 and memory latencies marked]
Dead-Block Prediction [ISCA'01]

Correlate a trace of memory accesses to a block; uniquely identify different dead times

Access stream to a block frame:
PC0: load/store A0 (hit)
PC1: load/store A1 (miss)   <- first touch
PC3: load/store A1 (hit)
PC3: load/store A1 (hit)    <- last touch
PC5: load/store A3 (miss)

Trace = (PC1, PC3, PC3)
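The trace-matching idea can be sketched in a few lines. This is a toy model of my own: the PCs touching a block are folded into a signature (the paper encodes traces by truncated addition of PCs; the 16-bit width here is an assumption), and when a signature matches one recorded at a previous eviction, the current reference is predicted to be the last touch.

```python
# Toy sketch (mine) of the dead-block predictor: PCs touching a cache block
# are folded into a trace signature by truncated addition (the 16-bit width
# is an assumption). A signature seen at a previous eviction predicts that
# the current reference is the block's last touch.

MASK = (1 << 16) - 1  # truncate signatures to 16 bits

def fold(signature, pc):
    return (signature + pc) & MASK

class DeadBlockPredictor:
    def __init__(self):
        self.dead_signatures = set()  # signatures that ended a block's live time
        self.live = {}                # block -> signature of its current trace

    def access(self, block, pc):
        sig = fold(self.live.get(block, 0), pc)
        self.live[block] = sig
        return sig in self.dead_signatures  # True => predicted last touch

    def evict(self, block):
        # Train: the trace accumulated up to eviction marked the last touch.
        self.dead_signatures.add(self.live.pop(block, 0))

dbp = DeadBlockPredictor()
# First generation of block A1, touched by PC1, PC3, PC3 as on the slide:
for pc in (0x101, 0x103, 0x103):
    dbp.access("A1", pc)
dbp.evict("A1")
# Second generation, same trace: the third touch is now flagged as the last.
flags = [dbp.access("A1", pc) for pc in (0x101, 0x103, 0x103)]
print(flags)
```

Because the signature completes at the last touch itself, the predictor fires well before the next miss, which is the lookahead advantage over miss-triggered schemes.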
Miss-Address Prediction

Correlate the last two misses within a cache block frame:

PC0: load/store A0 (hit)
PC1: load/store A1 (miss)
PC3: load/store A1 (hit)
PC3: load/store A1 (hit)
PC5: load/store A3 (miss)

Trace = (PC1, PC3, PC3)
Correlation: (A0, A1, PC1, PC3, PC3) -> (A3)
Dead-Block Correlating Prefetcher

[Figure: on the current access (PC3 to latest address A1), the history table supplies the frame's recent history (A0, PC1, PC3); encoding yields the key (A0, A1, PC1, PC3, PC3), which indexes the correlating prediction table entry for A3 -> prefetch A3]

Two-level prediction table:
  History table
  Correlating prediction table
  Encoding: truncated addition
  Two-bit saturating counter
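The two-level lookup can be modeled compactly. This is a much-simplified sketch of my own structuring: plain dictionaries and tuple keys stand in for the encoded, set-associative hardware tables. The history table tracks each frame's last two miss addresses plus its PC trace; the correlating prediction table maps that history to the next miss address, guarded by a two-bit saturating counter.

```python
# Much-simplified sketch (my structuring) of the two-level DBCP: dictionaries
# and tuple keys stand in for the truncated-addition-encoded hardware tables.

class DBCP:
    def __init__(self):
        self.history = {}  # frame -> (2nd-last miss addr, last miss addr, PC trace)
        self.cpt = {}      # encoded history -> [predicted next miss addr, 2-bit counter]

    def record_access(self, frame, pc, addr, miss):
        a2, a1, trace = self.history.get(frame, (None, None, ()))
        if miss:
            # Train: the history accumulated before this miss should predict it.
            self._train((a2, a1) + trace, addr)
            self.history[frame] = (a1, addr, (pc,))  # a new miss starts a fresh trace
        else:
            self.history[frame] = (a2, a1, trace + (pc,))
        # Look up the updated history; a confident hit is a prefetch candidate.
        a2, a1, trace = self.history[frame]
        entry = self.cpt.get((a2, a1) + trace)
        return entry[0] if entry and entry[1] >= 2 else None

    def _train(self, key, next_addr):
        entry = self.cpt.setdefault(key, [next_addr, 0])
        if entry[0] == next_addr:
            entry[1] = min(entry[1] + 1, 3)  # saturate upward on agreement
        else:
            entry[1] = max(entry[1] - 1, 0)  # decay on disagreement...
            if entry[1] == 0:
                entry[0] = next_addr         # ...then retrain

# Frame F repeats the slide's pattern: miss A1 @PC1, hits @PC3 @PC3, miss A3 @PC5.
dbcp = DBCP()
prefetches = []
for _ in range(4):  # several generations of the same pattern
    prefetches.append(dbcp.record_access("F", 0x1, 0xA1, True))
    prefetches.append(dbcp.record_access("F", 0x3, 0xA1, False))
    prefetches.append(dbcp.record_access("F", 0x3, 0xA1, False))  # last touch
    prefetches.append(dbcp.record_access("F", 0x5, 0xA3, True))
# Once the counter saturates, the last touch of A1 triggers a prefetch of A3.
print(hex(prefetches[14]))
```

The saturating counter is what keeps one-off history matches from issuing prefetches, trading a warm-up period for the accuracy that lets DBCP place data directly in L1.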
Methodology

Simulated using SimpleScalar 3.0:
  2 GHz, 8-issue, 128-entry window
  32K, direct-mapped, 1-cycle L1D
  1M, 4-way, 12-cycle L2
  70-cycle memory
  2M, 8-way, 24-cycle prediction table
  128-entry prefetch buffer (for Markov only)
Memory-intensive integer, floating-point, and linked-data apps:
  14 benchmarks: 5 Olden, 4 SpecINT, 5 SpecFP
Dead-Block Coverage and Accuracy

[Figure: fraction of all misses (predicted / training / mispredicted, 0% to 120%) for bh, em3d, health, mst, treeadd, compress, perl, gcc, mcf, ammp, art, equake, mgrid, swim]

DBP predicts 90% of misses and mispredicts by only 4%
Miss-Address Prediction

[Figure: fraction of all misses (predicted / training / mispredicted, clipped at >190%) for Markov (M) vs. DBCP (D) across bh, em3d, health, mst, treeadd, compress, perl, gcc, mcf, ammp, art, equake, mgrid, swim]

DBCP predicts 82% of misses, mispredicting only 3%
Markov (Joseph & Grunwald) predicts 81%, but mispredicts 229%
Memory Stall Time Reduction

[Figure: fraction of memory stall time reduced (0% to 100%) by Markov vs. DBCP across the 14 benchmarks]

DBCP reduces memory stall time by 62% on average; Markov by only 30%
DBCP vs. Larger On-Chip L2

[Figure: fraction of memory stall time reduced (-40% to 100%) across the 14 benchmarks for an 18-cycle 3M L2, a 12-cycle 3M L2, and a 24-cycle 2M DBCP]
Conclusions

Dead-block predictors (DBP):
  Predict when to evict a block, enabling timely prefetching
  Can prefetch into the L1 cache
  High coverage of 90%, mispredicting only 4%
Dead-block correlating prefetchers (DBCP):
  Accurate and timely prefetches
  Reduce memory stall time by 62%
Other Mechanisms in PUMA2

Self-invalidation predictors [ISCA'00]: predict when to self-invalidate in multiprocessors; convert 3-hop latencies to 2-hop
Memory sharing predictors [ISCA'99]: predict subsequent sharers of a block; a powerful mechanism to move data
Both exhibit high coverage and accuracy
Outline

Impetus Overview
PUMA2:
  Hitting the Memory Wall
  Last-Touch Memory Access
  Speculative Memory Ordering
What Programmers Want: Sequential Consistency (SC) [Lamport]

Memory accesses should appear in program order & atomic
E.g., critical section: lock; modify data; unlock
+ Intuitive programming
- Extremely slow!

[Figure: processors P ... P sharing memory]
What Machines Provide: Release Consistency (RC) [Gharachorloo et al.]

Overlap remote accesses; software enforces order (when needed)
  E.g., first lock, then data, via special "ordering" instructions
Allows any (re-)ordering, as in e.g. IA-64, SPARC
+ High performance
- Complicates programming

[Figure: processors P ... P overlapping accesses to shared memory]
Can We Have SC Programming With RC Performance?

Observation: SC must only *appear* in program order; order is needed only when others race to access
SC hardware can emulate RC if it:
  overlaps accesses speculatively
  keeps a log of the computation in program order
  rolls back in case of a race
+ No help from software: SC programming
+ Infrequent rollback: better-than-RC performance
Related Work: Hiding the Store Latency

A number of SC optimizations:
1. Multiple pending prefetches, committed to L1 in order [Gharachorloo et al.]; e.g., the MIPS R10000's pending misses
2. Relaxing order within the ROB: speculative loads [Gharachorloo et al.]; e.g., the MIPS R10000's speculative loads
3. Extensions to the ROB: speculative retirement [Ranganathan et al.]
All limit speculation to small associative buffers!
Execution in an SC Memory System

[Figure: the pipeline's reorder buffer holds WR X (miss), RD Y, RD A, WR A, RD Z; WR X heads the memory queue while WR A, RD A, RD Y sit idle behind it]

WR X, RD Y, RD Z access remote memory
X, Y, Z, A are unrelated and need not be ordered
WR X blocks the pipeline for hundreds of cycles
Cannot overlap RD Y & RD Z with WR X
Execution in an RC Memory System

[Figure: WR A and RD A retire out of order past the reorder buffer while the memory queue overlaps the WR X, RD Y, RD Z misses]

+ Accesses to A complete while WR X is pending
+ Overlaps remote accesses to X, Y, Z
- Software must guarantee that X, Y, Z, A are unrelated
Speculatively & Fully Relaxing Order

With Vijaykumar [ISCA'99]: hardware support for relaxing all order
  Storage to tolerate long latencies: old processor state, old memory state
  Fast lookup to detect possible order violations upon cache invalidations and replacements
  Infrequent rollbacks: typical of well-behaved applications; rollbacks are due to false sharing or data races
SC++: A Design for Speculative SC

SHiQ (Speculative History Queue): backs up computation in a queue
BLT (Block Lookup Table): quick lookup to detect races from directory accesses

[Figure: while WR X, RD Y, RD Z miss in the memory queue, WR A and RD A retire past the reorder buffer; the SHiQ logs the completed accesses and the BLT tracks the blocks containing A and Y & Z]
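The interplay of the two structures can be sketched as a small model. This is a schematic of my own, heavily simplified: the SHiQ logs completed accesses in program order so they could be undone, and the BLT lets an incoming invalidation quickly check whether it races with a speculative access.

```python
# Schematic model (heavily simplified, mine) of SC++'s two structures: the
# Speculative History Queue (SHiQ) logs completed accesses in program order,
# and the Block Lookup Table (BLT) lets an incoming invalidation quickly
# check whether it races with a speculatively accessed block.

from collections import deque

class SCpp:
    def __init__(self):
        self.shiq = deque()  # (block, old_value) log, in program order
        self.blt = {}        # block -> count of speculative log entries

    def speculative_access(self, block, old_value):
        # Retire past a pending miss, logging enough state to roll back.
        self.shiq.append((block, old_value))
        self.blt[block] = self.blt.get(block, 0) + 1

    def invalidate(self, block):
        # A remote write to a speculatively accessed block is a race:
        # True means the SHiQ must be unwound to restore the old state.
        return block in self.blt

    def commit_oldest(self):
        # A pending miss completed: the oldest entry becomes non-speculative.
        block, _ = self.shiq.popleft()
        self.blt[block] -= 1
        if self.blt[block] == 0:
            del self.blt[block]

# While WR X is pending, RD Y, RD Z, WR A proceed speculatively:
sc = SCpp()
for blk, old in [("Y", 0), ("Z", 0), ("A", 1)]:
    sc.speculative_access(blk, old)
r1 = sc.invalidate("B")  # unrelated block: no rollback needed
r2 = sc.invalidate("A")  # racing write to a logged block: roll back
sc.commit_oldest()       # WR X completed: Y's entry leaves the log
r3 = sc.invalidate("Y")
print(r1, r2, r3)
```

The key property the sketch captures is that the common case (no race) costs only a log append and a table lookup; the expensive rollback is paid only on actual races or false sharing, which the talk reports to be rare.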
Applications Beyond Memory Order

SC++ can serve as a generic speculation substrate when:
1. rollbacks are rare, and
2. speculation must be verified at a rate far beyond what the ROB can sustain
Examples: value speculation [Sorin et al.], speculating beyond locks [Rajwar et al.]
Performance of SC, RC, and SC++

[Figure: speedup over SC (1.0 to 1.8) for RC and SC++]

Data from the RSIM DSM simulator: 16 1-GHz MIPS R10000 processors
Up to 70% gap between SC & RC
SC++ can fully emulate RC
Sensitivity to Queue Size

Queue size varies across apps (& systems); history is highly bursty; can spill history to L2

[Figure: fraction of best performance (50% to 100%) vs. number of queue entries (16 to 8192) for appbt, barnes, em3d, fft, radix, tomcatv, unstruct., water, water-sp, and their average]
Conclusions

First to show SC + speculation = RC!
  Identified the design requirements
  Current systems do not satisfy the requirements
  Proposed a design: SC++
Hardware can provide simple programming with high performance!
Other Ongoing Projects

Ultra-Deep-Submicron Designs
1. Power Management [MICRO'01, HPCA'02, HPCA'01, ISLPED'00]
   First architectural proposal to reduce leakage
   Resizable caches, way/bank-predicting caches, power-aware snoopy coherence
2. Transient-Error-Tolerant Superscalar [MICRO'01]: error-tolerant instruction scheduling
3. Beyond ILP & Superscalar [ICS'01, PPoPP'01]: low-overhead mechanisms for thread-level speculation; selective dependence tracking