Dynamically Collapsing Dependencies for IPC and Frequency Gain
Peter G. Sassone, D. Scott Wills
Georgia Tech, Electrical and Computer Engineering
{ sassone, scott.wills } @ ece.gatech.edu
MICRO-37 Sassone & Wills / Georgia Tech / Dynamic Strands 2
Motivation
• Outside of pipeline, global communication dominates
• Memory wall is well studied
• Inside, traditionally computation or logic dominated
[Figure: pipeline stages fetch, decode, rename, issue, exec, commit, with I cache, D cache, L2 cache, and memory]
Motivation
• Now dominated by local communication paths:
– issue window
– reorder buffer
– register file
– bypass network
• Bottlenecks both IPC and frequency
[Figure: issue queue, issue logic, three ALUs, and reg file]
Motivation
• RISC instruction sets create superfluous traffic
• All instructions and operands are treated as equal
• Little focus on exposing sequentiality
[Figure: issue queue, issue logic, three ALUs, and reg file]
Contributions
• Dynamic Strands:
– collapse dependence-chains without fan-out
– exploit properties for simple value precomputation
– increase efficiency of critical resources
– preserve binary compatibility
• IPC improvements:
– 17-20% speedup on Spec2000int and MediaBench
• Frequency improvements:
– 37% fewer in-flight instructions
– reduced dependence on dependencies
Outline
• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion
Dyadic Dilemma
Performing any operation on more than two sources requires temporary values
int sum( int a, int b, int c, int d )
{
    return a + b + c + d;
}
. . .
add R1 ← R1, R2
add R1 ← R1, R3
add R9 ← R1, R4
. . .
[Figure: dataflow graph: R1 + R2 gives R1', R1' + R3 gives R1'', R1'' + R4 gives R9]
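The dilemma can be made concrete with a small sketch. Assuming a two-source ISA like the one in the example, the hypothetical helper `lower_sum` below expands an n-input sum into dyadic adds and counts the intermediate values. The temporary names (t1, t2, ...) are illustrative only; the slide reuses R1 instead.

```python
def lower_sum(sources, dest):
    """Lower dest = sum(sources) into two-source adds (hypothetical helper).

    Returns (instructions, number_of_temporaries). Every add except the
    last one produces a value that exists only to feed the next add.
    """
    if len(sources) < 2:
        raise ValueError("need at least two sources")
    insts = []
    acc = sources[0]
    for i, src in enumerate(sources[1:], start=1):
        last = (i == len(sources) - 1)
        out = dest if last else f"t{i}"   # t1, t2, ... are temporaries
        insts.append(f"add {out} <- {acc}, {src}")
        acc = out
    return insts, len(sources) - 2


insts, temps = lower_sum(["R1", "R2", "R3", "R4"], "R9")
for line in insts:
    print(line)
print("temporaries:", temps)
```

With four sources the sum needs three adds and two temporaries; in general n sources need n - 2.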
Transient Operands
• We term these temporary values transient operands:
– values produced by an ALU inst
– values consumed only once, and only by an ALU inst
• Common in modern integer workloads…
[Figure: percent of dynamic operands (0% to 60%) per benchmark for Mediabench (jpeg, epic, g721, mpeg2, pegwit, adpcm; encode and decode) and Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr)]
On average, about 40% of all dynamic operands are transient
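A minimal sketch of how the definition above could be checked on a dynamic trace. The `Inst` record and `count_transients` function are hypothetical, and the choice of "all produced values" as the denominator is an assumption; the slide does not say exactly how the percentage is computed.

```python
from collections import namedtuple

Inst = namedtuple("Inst", "kind dest srcs")  # kind: "alu", "load", "store", ...


def count_transients(trace):
    """Return (transient_values, produced_values) for a dynamic trace.

    A value is counted as transient when an ALU instruction produced it and
    it was read exactly once, by an ALU instruction, before its register
    was overwritten. Values still live at the end of the trace are not
    counted, since a later reader could still appear.
    """
    live = {}          # reg -> [producer_kind, list_of_consumer_kinds]
    transients = produced = 0

    for inst in trace:
        for reg in set(inst.srcs):          # record reads first
            if reg in live:
                live[reg][1].append(inst.kind)
        if inst.dest is not None:           # the write retires the old value
            old = live.pop(inst.dest, None)
            if old is not None and old[0] == "alu" and old[1] == ["alu"]:
                transients += 1
            live[inst.dest] = [inst.kind, []]
            produced += 1
    return transients, produced


trace = [
    Inst("alu", "R1", ("R1", "R2")),
    Inst("alu", "R1", ("R1", "R3")),
    Inst("alu", "R9", ("R1", "R4")),
    Inst("load", "R1", ("R9",)),     # overwrites R1, reads R9 as a non-ALU inst
]
print(count_transients(trace))
```

In this toy trace the two intermediate R1 values are transient, while R9 is not because its only reader is a load.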
Strands
• Strands:
– linear chains of instructions joined by transient operands
– non-consecutive
– span basic blocks
– three instructions
– only the final output needs to be committed
• Strands are common
– dyadic temporaries
– compiler strategies
– language semantics
[Figure: a chain of three additions over inputs a, b, c, d]
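The properties listed above suggest a simple grouping procedure, sketched here under stated assumptions. `form_strands` is a hypothetical helper that takes one (producer, consumer) pair per transient operand, using instruction indices in program order, and greedily builds linear chains of at most three instructions. The greedy policy is an illustration, not necessarily what the hardware does.

```python
def form_strands(transient_links, max_len=3):
    """Group instructions into linear chains joined by transient operands.

    transient_links: (producer_index, consumer_index) pairs, one per
    transient operand. max_len=3 follows the three-instruction strands
    on the slide.
    """
    strands = []      # each strand is a list of instruction indices
    tail_of = {}      # last instruction of a strand -> that strand
    member = set()    # instructions already placed in some strand

    for prod, cons in sorted(transient_links):
        if cons in member:
            continue                      # keep chains linear: one strand per inst
        strand = tail_of.get(prod)
        if strand is None and prod not in member:
            strand = [prod]               # producer starts a new strand
            strands.append(strand)
            member.add(prod)
        if strand is None or len(strand) >= max_len:
            continue                      # chain is full; its output is committed
        tail_of.pop(prod, None)
        strand.append(cons)
        member.add(cons)
        tail_of[cons] = strand
    return strands


# A five-instruction dependence chain splits into a 3-strand and a 2-strand.
print(form_strands([(0, 1), (1, 2), (2, 3), (3, 4)]))
# Members need not be consecutive instructions.
print(form_strands([(0, 2), (2, 5)]))
```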
Outline
• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion
Hardware Overview
[Figure: block diagram with fetch, decode, rename, issue queue, closed-loop ALUs, reg file, and commit, plus the strand cache fill unit, strand cache, and dispatch engine off the critical path; arrows labeled instructions, transients, and strands]
Algorithm Example
[Figure: the hardware overview block diagram (fetch, decode, rename, issue queue, closed-loop ALUs, reg file, commit, strand cache fill unit, strand cache, dispatch engine) annotated with numbered steps]
Strand Cache Fill Unit
• Based around the operand table
• Detects conditions of transients
• When found…
– append to existing strand
– begin new strand
[Figure: operand table indexed by arch reg (R4, R5, R6) with columns last producer instruction, last consumer instruction, and consumer count. Example code: 1404: R5 ← R0 + 0; 1408: . . .; 1412: R1 ← R5 + 0; 1416: R5 ← R0 + 0. At PC 1416 the R5 entry reads PC 1404, PC 1412, 1]
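The operand table example can be sketched as follows. This is a simplified software model, not the hardware design: the `Entry` and `OperandTable` names are hypothetical, and the `alu_readers_only` flag is an added assumption, since the slide shows only the three columns. A transient is reported at the moment its register is overwritten, because only then is the consumer count final.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    last_producer: Optional[int] = None   # PC of the ALU inst that wrote the reg
    last_consumer: Optional[int] = None   # PC of the most recent reader
    consumer_count: int = 0
    alu_readers_only: bool = True         # assumption: not shown on the slide


class OperandTable:
    def __init__(self):
        self.entries = {}                 # arch reg name -> Entry

    def observe(self, pc, dest, srcs, is_alu=True):
        """Process one instruction. Returns (producer_pc, consumer_pc) if
        overwriting dest reveals a transient operand, otherwise None."""
        for reg in set(srcs):
            e = self.entries.get(reg)
            if e is not None:
                e.last_consumer = pc
                e.consumer_count += 1
                e.alu_readers_only = e.alu_readers_only and is_alu
        found = None
        if dest is not None:
            old = self.entries.get(dest)
            if (old is not None and old.last_producer is not None
                    and old.consumer_count == 1 and old.alu_readers_only):
                found = (old.last_producer, old.last_consumer)
            self.entries[dest] = Entry(last_producer=pc if is_alu else None)
        return found


table = OperandTable()
table.observe(1404, "R5", ["R0"])          # R5 <- R0 + 0
# PC 1408 is elided on the slide and skipped here
table.observe(1412, "R1", ["R5"])          # R1 <- R5 + 0, the single reader of R5
link = table.observe(1416, "R5", ["R0"])   # overwriting R5 exposes the transient
print(link)
```

The returned pair is what a fill unit would use to append to an existing strand or begin a new one.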
Strand Cache
[Figure: strand cache line layout. Labels: status bits (101110101), instructions (i1, i2, i3), previous reader info (seen, pc, inst for this instruction, source 1, and source 2), and pc, ready, value; lines shown for strand 1, strand 2, and strand 3]
About 175 bytes per line, though very few lines are needed for effect
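As a quick arithmetic check, the per-line figure can be scaled to the cache sizes evaluated later in the talk. The helper name is hypothetical, and 175 bytes is the approximate value from the slide, so the totals are estimates.

```python
BYTES_PER_LINE = 175          # "about 175 bytes per line" from the slide


def strand_cache_bytes(entries, bytes_per_line=BYTES_PER_LINE):
    """Approximate storage for a strand cache with the given number of lines."""
    return entries * bytes_per_line


for entries in (16, 64, 256, 1024):   # sizes used in the coverage results
    size = strand_cache_bytes(entries)
    print(f"{entries:5d} lines: {size:7d} bytes ({size / 1024:.1f} KiB)")
```

A 16-line cache comes to 2800 bytes, which is consistent with the roughly 3KB strand cache quoted in the conclusion.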
Dispatch Engine
• Watches for strand cache matches
• Inserts ready strands into the stream eagerly
• Removes component instructions when seen
• Correctness checking with dirty table
[Figure: dispatch engine with decode, rename, strand cache, and dirty table; labels: pre-renamed instructions, strands, recovery strands, kill signals]
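The first three bullets can be sketched as a stream rewrite. This is a toy model under loud assumptions: the strand cache is a plain dictionary keyed by the PC of a strand's first instruction, every cached strand is treated as ready, and the dirty table and recovery paths are left out. The function name `dispatch` is hypothetical.

```python
from collections import Counter


def dispatch(stream, strand_cache):
    """Rewrite a stream of fetched PCs using a strand cache.

    strand_cache maps the PC of a strand's first instruction to the tuple
    of all component PCs (an assumed, simplified lookup key). Returns a
    list of ("inst", pc) and ("strand", pcs) items.
    """
    pending_kill = Counter()      # component PCs still to be dropped
    out = []
    for pc in stream:
        if pending_kill[pc] > 0:
            pending_kill[pc] -= 1             # already covered by a strand
            continue
        pcs = strand_cache.get(pc)
        if pcs is not None:
            out.append(("strand", pcs))       # insert the whole strand eagerly
            pending_kill.update(pcs[1:])      # later components removed when seen
        else:
            out.append(("inst", pc))
    return out


cache = {100: (100, 108, 116)}
print(dispatch([100, 104, 108, 112, 116], cache))
```

Here five fetched instructions become one strand and two ordinary instructions, which is the kind of traffic reduction the later occupancy results refer to.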
Closed-Loop ALUs
• Full bypass is half of the execute stage delay
• Regular ALUs with double-speed closed-loop mode
– two dependent ALU operations in a single cycle
– intermediate values (the transients) are discarded!
– final result still takes ½ cycle for full bypass
[Figure: ALU (½ cycle) with a mode switch selecting between the “free” local bypass and the full bypass network (½ cycle)]
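One reading of the timing on this slide, written out as a small model. It assumes that ALU logic takes the half of the execute stage not spent on full bypass, and that the local bypass adds no delay; both are interpretations of the slide rather than stated figures, and `chain_latency` is a hypothetical name.

```python
from fractions import Fraction

HALF = Fraction(1, 2)   # slide: full bypass is half of the execute stage delay


def chain_latency(n_ops, closed_loop):
    """Cycles until the final result of n dependent ALU ops is fully bypassed.

    Assumed model: in normal mode every op pays ALU logic plus full bypass.
    In closed-loop mode intermediate values use the "free" local bypass and
    only the final result pays the full bypass.
    """
    if n_ops < 1:
        raise ValueError("need at least one operation")
    if closed_loop:
        return n_ops * HALF + HALF
    return n_ops * (HALF + HALF)


for n in (1, 2, 3):
    print(n, chain_latency(n, closed_loop=False), chain_latency(n, closed_loop=True))
```

Under this model a three-instruction strand finishes in 2 cycles instead of 3, and a single instruction sees no change.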
Oops… Dirty Read
[Figure: the strand R1 + R2 gives R1', R1' + R3 gives R1'', R1'' + R4 gives R9, followed by load 16 [ R1 ]]
R1 is dirty!
insert recovery sub-strand (R1 + R2 gives R1', R1' + R3 gives R1'') to recover R1
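The dirty-read case can be illustrated with a short sketch. Because a collapsed strand discards its intermediate values, any register written only by a non-final instruction holds a stale value afterwards. The two helpers below are hypothetical, and taking the shortest prefix of the strand as the recovery sub-strand is an assumed policy; the slide does not describe how the dirty table is organised.

```python
def dirty_registers(strand):
    """Registers left stale after the strand is collapsed: destinations of
    all but the last instruction, unless the last one overwrites them."""
    *body, (final_dest, _) = strand
    return {dest for dest, _ in body} - {final_dest}


def recovery_substrand(strand, reg):
    """Shortest prefix of the strand that recomputes reg (assumed policy)."""
    writers = [i for i, (dest, _) in enumerate(strand[:-1]) if dest == reg]
    if not writers:
        raise ValueError(f"{reg} is not dirtied by this strand")
    return strand[: writers[-1] + 1]


# (dest, sources) for the three adds on the slide
strand = [("R1", ("R1", "R2")), ("R1", ("R1", "R3")), ("R9", ("R1", "R4"))]

print(dirty_registers(strand))
print(recovery_substrand(strand, "R1"))
```

For the slide's example, R1 is the only dirty register and the recovery sub-strand is the first two adds, matching the figure.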
Oops… Anti-Dependence Violation
[Figure: the strand R1 + R2 gives R1', R1' + R3 gives R1'', R1'' + R4 gives R9, with load 32 [ R9 ] needing the previous value of R9]
R9 has already been replaced
insert load immediate of previous value
renaming not sufficient: outside reorder buffer safety net
Outline
• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion
Instruction Coverage
coverage with various strand cache sizes
[Figure: ALU instruction coverage (0% to 60%) for a Mediabench sample (jpeg, epic, g721, mpeg2, pegwit, adpcm; encode only), a Spec2000int sample (gcc, gzip, parser, vpr), and the average, with 16, 64, 256, and 1024 cache entries]
High coverage rates, but only with a big strand cache.
Less than a 15% replacement rate, regardless of cache size
Average ALU inst coverage:
16: 12%
1024: 27%
IPC Improvements
4-wide IPC speedup with 16-entry strand cache
[Figure: IPC speedup (1.0 to 2.0) per benchmark for Mediabench (jpeg, epic, g721, mpeg2, pegwit, adpcm; encode and decode) and Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr), with the harmonic mean]
Average IPC Speedup:
4-wide: 17%
8-wide: 20%
Some benchmarks almost double in IPC
Some see almost no speedup at all
Resource Occupancy
• CISCification of instructions reduces traffic
– reorder buffer occupancy is reduced by up to 37%
– issue queue occupancy is reduced by up to 34%
– traffic reduction coverage
• Reduced dependence on dependencies
– opportunity for pipelined bypass
– opportunity for pipelined issue
Resource Occupancy
• Caveat emptor
– more worst case issue CAMs
– more worst case register ports
• Prior work applicable
– only 1.2 live inputs / strand
Outline
• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion
Conclusion
• Key points:
– eagerly executing macro-instructions value precomputation
– limiting focus to transient operands
– all new hardware off critical path
• Results:
– IPC speedup of 18-20% with 3KB strand cache
– potential for frequency gains
– full binary compatibility
• Lots of current and future research:
– relaxed constraint of ALU instructions
– quantified frequency improvements
– static detection of strands
Questions?
Backup Slides
Sensitivity to Dispatch Delay
4-wide IPC speedup with 16-entry strand cache
[Figure: IPC speedup (1.0 to 2.0) per benchmark for Mediabench (jpeg, epic, g721, mpeg2, pegwit, adpcm; encode and decode) and Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr), with the harmonic mean, for 0 cycle delay and 3 cycle delay]
On average, speedup only drops 1% with three cycles of delay
Some actually get faster due to fewer errant strands
Most benchmarks lose a small amount of speedup