Stack Value File: Custom Microarchitecture for the Stack
Hsien-Hsin Lee, Mikhail Smelyanskiy, Chris Newburn, Gary Tyson
University of Michigan
Intel Corporation
Hsien-Hsin Lee, HPCA-7
Agenda
Organization of Memory Regions
Stack Reference Characteristics
Stack Value File
Performance Analysis
Conclusions
Memory Space Partitioning
Based on programming language
Non-overlapped subdivisions
Split code and data: I-cache & D-cache
Split data into regions
– Stack
– Heap
– Global (static)
– Read-only (static)
[Figure: memory map between min mem and max mem, showing reserved and protected areas, the Code Region, Read-only data, the Global Static Data Region, the heap (grows upward), and the stack (grows downward)]
Memory Access Distribution
[Figure: stacked bar chart (0% to 100%) of memory references by region; legend: Read-only, Heap ref, Static ref, Stack ref]
SPEC2000int benchmark (Alpha binary)
42% of instructions access memory
Access Method Breakdown
[Figure: stacked bar chart (0% to 100%) of memory references by access method; legend: Read-only ref, Heap ref, Static thru $gpr, Static thru $gp, Stack thru $gpr, Stack thru $fp, Stack thru $sp]
86% of the stack references use ($sp+disp)
Morphing $sp-relative References
Morph $sp-relative references into register accesses
Use a Stack Value File (SVF)
Resolve address early in decode stage for stack-pointer indexed accesses
Resolve stack memory dependency early
Aliased references are re-routed to SVF
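The decode-stage decision described above can be sketched roughly as follows. This is an illustration only, not the authors' design: the 8-byte entry size, the `SVF_ENTRIES` capacity, the `MemOp` record, and the rule that negative or unaligned displacements fall back to the memory pipeline are all assumptions introduced here.

```python
from dataclasses import dataclass

WORD_BYTES = 8       # assumed: one SVF entry per 64-bit quadword
SVF_ENTRIES = 1024   # hypothetical SVF capacity, not given in the slides


@dataclass
class MemOp:
    op: str     # "ld" or "st"
    base: str   # base register name, e.g. "$sp", "$fp", "$r4"
    disp: int   # signed displacement in bytes


def decode_route(inst: MemOp, sp: int):
    """Return (route, svf_index, address) for one memory instruction.

    Only $sp+disp references can be resolved here, because the decoder
    is assumed to track the current stack pointer value.
    """
    if inst.base != "$sp":
        # Address is not known until execute: use the normal memory pipeline.
        return ("memory_pipeline", None, None)
    if inst.disp < 0 or inst.disp % WORD_BYTES != 0:
        # Assumed fallback for references this sketch does not morph.
        return ("memory_pipeline", None, None)
    index = inst.disp // WORD_BYTES
    if index >= SVF_ENTRIES:
        return ("memory_pipeline", None, None)
    # Address and SVF entry are both known at decode time.
    return ("svf", index, sp + inst.disp)


morphed = decode_route(MemOp("st", "$sp", 24), sp=0x1000)
normal = decode_route(MemOp("ld", "$r4", 24), sp=0x1000)
```

Because the address of a morphed reference is known at decode, dependences between stack stores and loads can be checked by comparing entry indices instead of waiting for address generation.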
Stack Reference Characteristics
Contiguity
– Good temporal and spatial locality
– Can be stored in a simple, fast structure
  • Smaller die area relative to a regular cache
  • Less power dissipation
– No address tag needed for each datum
Stack Reference Characteristics
First touch is almost always a store
– Avoids wasting bandwidth to bring in dead data
– A register write to the SVF
Deallocated stack frame
– Dead data
– No need to write it back to memory
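A toy model may help make these two properties concrete: a first-touch store allocates an entry without fetching anything, and popping a frame discards its words without writing them back. The class below is a sketch under assumptions, not the SVF hardware: the dictionary storage, the 8-byte word size, and the way a load to an untouched word is serviced (a single-word read from a backing dictionary) are all invented for illustration.

```python
class StackValueFile:
    """Toy SVF: word-granular entries with a dirty flag and no fill on store."""

    def __init__(self, word_bytes=8):
        self.word_bytes = word_bytes   # assumed entry size
        self.entries = {}              # word address -> [value, dirty]
        self.mem_reads = 0             # words fetched from memory

    def store(self, addr, value):
        # First touch is usually a store: allocate the word without
        # fetching its old (dead) contents from memory.
        self.entries[addr] = [value, True]

    def load(self, addr, memory):
        if addr not in self.entries:
            # Assumed behaviour: read just this one word from memory.
            self.mem_reads += 1
            self.entries[addr] = [memory.get(addr, 0), False]
        return self.entries[addr][0]

    def deallocate(self, old_sp, new_sp):
        # Frame popped ($sp moves up): words in [old_sp, new_sp) are dead,
        # so they are dropped with no write-back. Returns words discarded.
        dropped = 0
        for addr in range(old_sp, new_sp, self.word_bytes):
            if self.entries.pop(addr, None) is not None:
                dropped += 1
        return dropped


memory = {}
svf = StackValueFile()
sp = 0x1000 - 32                        # callee allocates a 32-byte frame
svf.store(sp + 24, 10)                  # first touch: no memory read
value = svf.load(sp + 24, memory)       # served from the SVF
dropped = svf.deallocate(sp, sp + 32)   # frame popped, dirty word discarded
```

In this run the store and the following load generate no memory reads, and the dirty word disappears at deallocation without ever reaching `memory`.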
Baseline Microarchitecture
[Figure: baseline out-of-order pipeline with stages Fetch, Decode, Dispatch, Issue, Execute, Commit; blocks: Instr-Cache, Decoder, DecoderQ, Reg Renamer (RAT), Reservation Station / LSQ, Func Unit, Ld/St Unit, MOB, ReOrder Buffer, ArchRF]
Microarchitecture Extension
[Figure: the baseline pipeline (Fetch, Decode, Dispatch, Issue, Execute, Commit) extended with Pre-Decode, SP and offset, MaxSP, Hash, Morphing at the Reg Renamer (RAT), an interlock, and the Stack Value File]
Microarchitecture Extension
[Figure: same extended pipeline, annotated with the example instruction stq $r10, 24($sp) and TOS]
Microarchitecture Extension
[Figure: same extended pipeline, annotated with stq $r10, 24($sp), the value 3, and TOS]
Microarchitecture Extension
[Figure: same extended pipeline with Morphing and the RAT highlighted, annotated with stq $r10, 24($sp), TOS, and $r35 ROB-18]
Microarchitecture Extension
[Figure: same extended pipeline with Morphing and the RAT highlighted, annotated with stq $r10, 24($sp), TOS, and $r35 SVF3]
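One plausible reading of the annotations in this slide sequence is sketched below. The interpretation is an assumption: it supposes that the displacement 24 divided by an 8-byte quadword gives SVF entry 3, that the entry is named as an extended logical register after the 32 Alpha integer registers (hence $r35), and that the RAT maps that name to a reorder-buffer tag while the store is in flight and to the SVF entry once it commits. The `morph_store` helper and the dictionary-based RAT are hypothetical.

```python
ARCH_INT_REGS = 32   # Alpha integer registers $r0..$r31
WORD_BYTES = 8       # stq stores one 64-bit quadword


def morph_store(src_reg, disp):
    """Rewrite `stq src, disp($sp)` as a move into an SVF register name.

    The naming rule (32 + entry index) is an assumed interpretation of
    the $r35 label in the slides.
    """
    index = disp // WORD_BYTES
    return {
        "op": "mov",
        "src": src_reg,
        "dest": f"$r{ARCH_INT_REGS + index}",
        "svf_index": index,
    }


rat = {}                                     # logical name -> current location
uop = morph_store("$r10", 24)                # stq $r10, 24($sp)

rat[uop["dest"]] = "ROB-18"                  # in flight: value lives in the ROB
in_flight = dict(rat)

rat[uop["dest"]] = f"SVF{uop['svf_index']}"  # committed: value lives in the SVF
committed = dict(rat)
```

Under this reading, a later load from 24($sp) would be renamed like any other read of $r35 and would pick up the value from the ROB or the SVF, whichever the RAT currently names.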
Why could SVF be faster?
It reduces the latency of stack references
It effectively increases the number of memory ports by rerouting more than ½ of all memory references to the SVF
It reduces contention in the MOB
More flexibility in renaming stack references
It reduces memory traffic
Simulation Framework
Parameters      4-wide        8-wide        16-wide
Decode width    4             8             16
Issue width     4             8             16
Commit width    4             8             16
IFQ size        16            32            64
LSQ size        32            64            128
RUU size        64            128           256
DL1$ size       4w 64KB       4w 64KB       4w 64KB
DL1$ latency    3             3             3
UL2$ size       4w 512KB      4w 512KB      4w 512KB
UL2$ latency    16            16            16
Mem latency     60            60            60
SimpleScalar (Alpha binary), OOO model
Speedup Potential of SVF
[Figure: bar chart of speedup (y-axis 1.0 to 1.9) for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, perlbmk, vpr, and Avg; series: 4-wide, 8-wide, 16-wide, 16-wide (gshare)]
Assume all references can be morphed
~30% speedup for a 16-wide with dual-ported L1
SVF Reference Type Breakdown
86% of stack references can be morphed
Re-routed references enter the normal memory pipeline
[Figure: stacked bar chart (0% to 100%) of SVF reference types; legend: rerouted_svf_st, rerouted_svf_ld, fast_svf_st, fast_svf_ld]
Comparison with stack cache
[Figure: bar chart of relative performance (y-axis 0.8 to 1.6) for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, perlbmk, vpr, and Avg; series: Baseline (2+0), StackCache (2+2), SVF (2+2), Baseline (4+0)]
(R+S): Regular and Stack or SVF cache ports
Memory Traffic
SVF dramatically reduces memory traffic, by orders of magnitude.
– For gcc, ~28M (Stk$ to L2) reduced to ~86K (SVF to L1).
Incoming traffic is eliminated because SVF does not allocate a cache line on a miss.
Outgoing traffic consists of only those words that are dirty when evicted (instead of entire cache lines).
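The difference between writing back whole lines and writing back only dirty words can be illustrated with a small calculation. This is a sketch with assumed parameters: the 32-byte line and 8-byte word sizes are hypothetical defaults chosen for the example, and `writeback_bytes` is not taken from the slides.

```python
def writeback_bytes(dirty_word_addrs, line_bytes=32, word_bytes=8):
    """Bytes written back when the given dirty words are evicted.

    Returns (line_granular, word_granular): a line-based cache writes
    every line that holds a dirty word, while a word-granular structure
    writes only the dirty words themselves.
    """
    words = set(dirty_word_addrs)
    lines = {addr // line_bytes for addr in words}
    return len(lines) * line_bytes, len(words) * word_bytes


# Three dirty words, each in a different line.
sparse = writeback_bytes([0, 64, 128])

# Four dirty words filling one line: both schemes write the same amount.
dense = writeback_bytes([0, 8, 16, 24])
```

The saving is largest when dirty words are scattered across lines and vanishes when every word of a line is dirty.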
SVF over Baseline Performance
[Figure: bar chart of speedup (y-axis 1.0 to 2.6) for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, perlbmk, vpr, and Avg; series: (1+1)/(1+0), (1+2)/(1+0), (2+2)/(2+0), (2+3)/(2+0), (2+4)/(2+0)]
(R+S): Regular and SVF cache ports
Conclusions
Stack references have several unique characteristics
– Contiguity, $sp+disp, first reference store, frame deallocation
Stack Value File
– a microarchitecture extension to exploit these characteristics
– improves performance by 24 - 65%
Offset Locality of Stack
Cumulative offset within a function call
Avg: 3b - 380b
>80% offset within "400b"
>99% offset within "8Kb"
[Figure: cumulative % (10 to 100) versus offset in bytes (log scale, 10 to 10000)]
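A cumulative offset distribution of this kind could be computed from a trace of $sp-relative displacements roughly as follows. The trace values below are made up purely to exercise the function and are not measurements from the study; `cumulative_percent` is a hypothetical helper.

```python
def cumulative_percent(offsets, thresholds):
    """Percentage of offsets that are at most each threshold (in bytes)."""
    total = len(offsets)
    return {
        t: 100.0 * sum(1 for o in offsets if o <= t) / total
        for t in thresholds
    }


# Hypothetical displacement trace in bytes, for illustration only.
trace = [8, 16, 24, 24, 32, 48, 120, 380, 512, 4096]
cdf = cumulative_percent(trace, [400, 8192])
```

Reading the result at the two thresholds used on the slide shows what fraction of references a structure of that size would cover.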