
Dynamically Collapsing Dependencies for IPC and Frequency Gain

Peter G. Sassone
D. Scott Wills

Georgia Tech
Electrical and Computer Engineering
{ sassone, scott.wills } @ ece.gatech.edu

MICRO-37 / Sassone & Wills / Georgia Tech / Dynamic Strands

Motivation

• Outside of pipeline, global communication dominates
• Memory wall is well studied
• Inside, traditionally computation or logic dominated

[Figure: pipeline stages fetch, decode, rename, issue, exec, and commit, with I cache, D cache, L2 cache, and memory]



• Now dominated by local communication paths:
– issue window
– reorder buffer
– register file
– bypass network
• Bottlenecks both IPC and frequency

[Figure: issue queue, issue logic, three ALUs, and register file]



• RISC instruction sets create superfluous traffic
• All instructions and operands are treated as equal
• Little focus on exposing sequentiality

[Figure: issue queue, issue logic, three ALUs, and register file]


Contributions

• Dynamic Strands:
– collapse dependence-chains without fan-out
– exploit properties for simple value precomputation
– increase efficiency of critical resources
– preserve binary compatibility
• IPC improvements:
– 17-20% speedup on Spec2000int and MediaBench
• Frequency improvements:
– 37% fewer in-flight instructions
– reduced dependence on dependencies


Outline

• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion


Dyadic Dilemma

Performing any operation on more than two sources requires temporary values

    int sum( int a, int b, int c, int d )
    {
        return a + b + c + d;
    }

    . . .
    add R1 ← R1, R2
    add R1 ← R1, R3
    add R9 ← R1, R4
    . . .

[Figure: dataflow chain R1 + R2 → R1', R1' + R3 → R1'', R1'' + R4 → R9]
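A minimal sketch of the dilemma, written in Python rather than the slide's C and assembly: lowering an n-input sum onto a two-source add needs n - 2 temporary values. The function name lower_sum and the temporary names T1, T2 are illustrative assumptions; the slide instead reuses R1 for the temporaries (R1', R1'').

```python
def lower_sum(sources, dest):
    """Lower dest = sum(sources) into two-source adds.

    Returns a list of (dest, src1, src2) tuples. Every intermediate
    result lands in a temporary (hypothetical names T1, T2, ...).
    """
    if len(sources) < 2:
        raise ValueError("need at least two sources")
    insts = []
    acc = sources[0]
    for i, src in enumerate(sources[1:], start=1):
        is_last = i == len(sources) - 1
        target = dest if is_last else f"T{i}"
        insts.append((target, acc, src))
        acc = target
    return insts


insts = lower_sum(["R1", "R2", "R3", "R4"], "R9")
temporaries = [d for d, _, _ in insts[:-1]]
for d, a, b in insts:
    print(f"add {d} <- {a}, {b}")
print("temporaries:", temporaries)
```

Four sources produce three adds and two temporaries, which matches the three-instruction chain on the slide.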


Transient Operands

• We term these temporary values transient operands:
– values produced by an ALU inst
– values consumed only once, and only by an ALU inst
• Common in modern integer workloads…

[Figure: percent of dynamic operands (axis 0% to 60%) per benchmark, for Mediabench (jpeg, epic, g721, mpeg2, pegwit, and adpcm, each encode and decode) and Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr)]

On average, about 40% of all dynamic operands are transient
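The two conditions in the definition can be checked over an instruction trace. The sketch below is an illustration, not the authors' tool: the Inst record, the kind labels, and find_transients are assumed names. A value is only classified once its register is overwritten, since until then it could still be read again.

```python
from typing import NamedTuple, Optional, Tuple


class Inst(NamedTuple):
    kind: str                 # "alu", "load", ... (hypothetical labels)
    dest: Optional[str]       # destination register, or None
    srcs: Tuple[str, ...]     # source registers


def find_transients(trace):
    """Return indices of instructions whose result is a transient operand."""
    live = {}         # reg -> [producer index, producer kind, consumer kinds]
    transients = []
    for i, inst in enumerate(trace):
        # Record this instruction as a consumer of each live source value.
        for src in inst.srcs:
            if src in live:
                live[src][2].append(inst.kind)
        if inst.dest is not None:
            # Overwriting a register ends the old value's lifetime.
            old = live.pop(inst.dest, None)
            if old is not None:
                producer, kind, consumers = old
                # Produced by an ALU inst, consumed once, by an ALU inst.
                if kind == "alu" and consumers == ["alu"]:
                    transients.append(producer)
            live[inst.dest] = [i, inst.kind, []]
    return transients


trace = [
    Inst("alu", "R1", ("R5", "R6")),   # hypothetical producer of R1
    Inst("alu", "R1", ("R1", "R2")),
    Inst("alu", "R1", ("R1", "R3")),
    Inst("alu", "R9", ("R1", "R4")),
    Inst("load", "R1", ("R7",)),       # hypothetical overwrite of R1
]
print(find_transients(trace))
```

The value in R9 is left unclassified because nothing in this short trace overwrites it.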


Strands

• Strands:
– linear chains of instructions joined by transient operands
– non-consecutive
– span basic blocks
– three instructions
– only the final output needs to be committed
• Strands are common
– dyadic temporaries
– compiler strategies
– language semantics

[Figure: chain of three adds over the inputs a, b, c, and d]
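To illustrate "only the final output needs to be committed", here is a small sketch that runs a strand as a single macro-instruction. The run_strand function, the OPS table, and the register values are assumptions made for the example; intermediate results live in a local dictionary that stands in for the transients.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub}


def run_strand(strand, regfile):
    """Execute a strand of (op, dest, src1, src2) as one macro-instruction.

    Intermediate results stay in a local dict; only the destination of
    the last instruction is written back to regfile.
    """
    transients = {}

    def read(reg):
        # Prefer a value produced earlier in the strand.
        return transients[reg] if reg in transients else regfile[reg]

    for op, dest, a, b in strand:
        transients[dest] = OPS[op](read(a), read(b))

    final_dest = strand[-1][1]
    regfile[final_dest] = transients[final_dest]
    return regfile[final_dest]


regfile = {"R1": 1, "R2": 2, "R3": 3, "R4": 4, "R9": 0}
strand = [
    ("add", "R1", "R1", "R2"),
    ("add", "R1", "R1", "R3"),
    ("add", "R9", "R1", "R4"),
]
result = run_strand(strand, regfile)
print(result, regfile["R1"])
```

Note that the architectural R1 keeps its old value here, which is the situation the talk's Dirty Read slide deals with.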




Hardware Overview

[Figure: block diagram with fetch, decode, rename, issue queue, closed-loop ALUs, reg file, and commit, plus the strand cache fill unit, strand cache, and dispatch engine. Edge labels: instructions, strands, transients. Annotation: off the critical path]


Algorithm Example

[Figure: the hardware overview block diagram, annotated with steps 1, 2, and 3]


Strand Cache Fill Unit

• Based around the operand table
• Detects conditions of transients
• When found…
– append to existing strand
– begin new strand

Example instructions:

    1404: R5 ← R0 + 0
    1408: . . .
    1412: R1 ← R5 + 0
    1416: R5 ← R0 + 0

operand table:

    arch reg | last producer instruction | last consumer instruction | consumer count
    R4       |                           |                           |
    R5       | PC 1404 (PC 1416 arriving)| PC 1412                   | 1
    R6       |                           |                           |
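The operand table and the append/begin decision can be sketched as follows. This is a simplified model under stated assumptions, not the authors' design: every instruction is treated as an ALU instruction, the class and method names are invented, strands are capped at three instructions, and PCs 1420 and 1424 are hypothetical additions to the slide's example.

```python
MAX_STRAND_LEN = 3   # the Strands slide lists "three instructions"


class OperandTable:
    """Per architectural register: last producer, last consumer, consumer count."""

    def __init__(self):
        self.rows = {}      # reg -> {"producer": pc, "consumer": pc, "count": n}
        self.strands = []   # each strand is a list of PCs

    def observe(self, pc, dest, srcs):
        # Update consumer info for every source register being read.
        for reg in srcs:
            row = self.rows.get(reg)
            if row is not None:
                row["consumer"] = pc
                row["count"] += 1
        # Overwriting dest ends the old value: one consumer means transient.
        old = self.rows.get(dest)
        if old is not None and old["count"] == 1:
            self._link(old["producer"], old["consumer"])
        self.rows[dest] = {"producer": pc, "consumer": None, "count": 0}

    def _link(self, producer, consumer):
        for strand in self.strands:
            if strand[-1] == producer and len(strand) < MAX_STRAND_LEN:
                strand.append(consumer)          # append to existing strand
                return
        if not any(producer in s for s in self.strands):
            self.strands.append([producer, consumer])   # begin new strand


table = OperandTable()
table.observe(1404, "R5", ["R0"])
table.observe(1412, "R1", ["R5"])
table.observe(1416, "R5", ["R0"])   # overwrite of R5 reveals the transient
after_slide_example = [list(s) for s in table.strands]
table.observe(1420, "R2", ["R1"])   # hypothetical
table.observe(1424, "R1", ["R0"])   # hypothetical
print(after_slide_example, table.strands)
```

After the slide's three instructions the table holds one two-instruction strand; the hypothetical follow-on instructions extend it to three.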


Strand Cache

[Figure: strand cache lines for strand 1, strand 2, and strand 3. Each line holds status bits, the instructions i1, i2, and i3, and previous reader info (pc, ready, value). Each instruction entry holds seen, pc, and inst fields for this instruction, source 1, and source 2]

About 175 bytes per line, though very few lines are needed for effect
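A quick arithmetic check on storage cost, using the roughly 175 bytes per line from this slide and the cache sizes evaluated in the results (16, 64, 256, and 1024 entries). The function name cache_bytes is an assumption.

```python
BYTES_PER_LINE = 175   # "about 175 bytes per line" from the slide


def cache_bytes(entries, bytes_per_line=BYTES_PER_LINE):
    """Approximate strand cache storage in bytes."""
    return entries * bytes_per_line


for entries in (16, 64, 256, 1024):
    total = cache_bytes(entries)
    print(f"{entries:5d} entries: {total:7d} bytes ({total / 1024:.1f} KiB)")
```

The 16-entry case comes to 2800 bytes, in line with the 3KB strand cache quoted in the conclusion.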


Dispatch Engine

• Watches for strand cache matches
• Inserts ready strands into the stream eagerly
• Removes component instructions when seen
• Correctness checking with dirty table

[Figure: dispatch engine between decode and rename, connected to the strand cache and the dirty table. It inserts pre-renamed instructions (strands, recovery strands) and sends kill signals]
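The first three bullets can be sketched as a rewrite of the fetched instruction stream. This is a simplification under assumptions: the strand is inserted when the PC of its first component is fetched, the strand cache is a plain dictionary keyed by that PC, and readiness and dirty-table checks are left out. The names dispatch and strand_cache and all PCs are hypothetical.

```python
def dispatch(stream, strand_cache):
    """Replace component instructions with their strand.

    stream: list of PCs in fetch order.
    strand_cache: {first component PC: tuple of all component PCs}.
    Returns a list of ("inst", pc) and ("strand", pcs) entries.
    """
    out = []
    pending_kill = []   # component PCs still to be removed when seen
    for pc in stream:
        if pc in pending_kill:
            pending_kill.remove(pc)      # kill: already covered by a strand
            continue
        strand = strand_cache.get(pc)
        if strand is not None:
            out.append(("strand", strand))
            pending_kill.extend(strand[1:])
        else:
            out.append(("inst", pc))
    return out


stream = [1400, 1404, 1408, 1412, 1416, 1420]
strand_cache = {1404: (1404, 1412, 1420)}
dispatched = dispatch(stream, strand_cache)
print(dispatched)
```

Six fetched instructions become four dispatched entries, since three of them collapse into one strand.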


Closed-Loop ALUs

• Full bypass is half of the execute stage delay
• Regular ALUs with double-speed closed-loop mode
– two dependent ALU operations in a single cycle
– intermediate values (the transients) are discarded!
– final result still takes ½ cycle for full bypass

[Figure: ALU (½ cycle) with a "free" local bypass and a mode switch, feeding the full bypass network (½ cycle)]
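A back-of-the-envelope latency model for the bullets above. It assumes the ALU itself takes the other half of the execute stage and that the local bypass costs nothing, which is one reading of the slide rather than a measured result. Latencies are counted in half-cycles to keep the arithmetic in integers; the function name is an assumption.

```python
ALU_HALF_CYCLES = 1      # assumption: ALU logic is the other half of the stage
BYPASS_HALF_CYCLES = 1   # slide: full bypass is half of the execute stage delay


def chain_latency_half_cycles(n_ops, closed_loop):
    """Latency of n_ops dependent ALU operations, in half-cycles."""
    if n_ops < 1:
        raise ValueError("need at least one operation")
    if closed_loop:
        # Intermediate values use the local bypass; only the final
        # result pays for the full bypass network.
        return n_ops * ALU_HALF_CYCLES + BYPASS_HALF_CYCLES
    # Every result goes through the full bypass before the next op.
    return n_ops * (ALU_HALF_CYCLES + BYPASS_HALF_CYCLES)


for n in (1, 2, 3):
    normal = chain_latency_half_cycles(n, closed_loop=False) / 2
    closed = chain_latency_half_cycles(n, closed_loop=True) / 2
    print(f"{n} ops: {normal} cycles normal, {closed} cycles closed-loop")
```

Under these assumptions a three-instruction strand finishes in two cycles instead of three.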


Oops… Dirty Read

[Figure: the strand R1 + R2 → R1', R1' + R3 → R1'', R1'' + R4 → R9, followed by load 16 [ R1 ]. A recovery sub-strand computes R1 + R2 → R1', R1' + R3 → R1'']

R1 is dirty!
Insert recovery sub-strand to recover R1
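One way to picture the dirty table is as a map from each register whose write was skipped to the sub-strand that recomputes it. The sketch below is an assumed bookkeeping scheme for illustration only; the talk does not spell out the table's layout, and the function names are invented.

```python
def mark_dirty(strand, dirty):
    """Record the sub-strand that recomputes each skipped transient write.

    strand: list of (op, dest, src1, src2); only the last dest is committed.
    dirty: dict mapping a register to its recovery sub-strand.
    """
    final_dest = strand[-1][1]
    for i, (_, dest, _, _) in enumerate(strand[:-1]):
        if dest != final_dest:
            # A later write to the same register replaces the entry.
            dirty[dest] = strand[: i + 1]


def on_read(reg, dirty):
    """Return the recovery sub-strand to insert before reading reg, if any."""
    return dirty.pop(reg, None)


def on_write(reg, dirty):
    """A fresh write makes the register clean again."""
    dirty.pop(reg, None)


strand = [
    ("add", "R1", "R1", "R2"),
    ("add", "R1", "R1", "R3"),
    ("add", "R9", "R1", "R4"),
]
dirty = {}
mark_dirty(strand, dirty)
recovery = on_read("R1", dirty)   # e.g. the load 16 [ R1 ] on the slide
print(recovery)
```

Here the read of R1 triggers a two-add recovery sub-strand, after which R1 is no longer marked dirty.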


Oops… Anti-Dependence Violation

[Figure: the strand R1 + R2 → R1', R1' + R3 → R1'', R1'' + R4 → R9, and load 32 [ R9 ], with R9 labeled as the previous value]

R9 has already been replaced
Insert load immediate of previous value
Renaming not sufficient: outside reorder buffer safety net





Instruction Coverage

[Figure: ALU instruction coverage (axis 0% to 60%) with various strand cache sizes (16, 64, 256, and 1024 cache entries), for a Mediabench sample (jpeg encode, epic encode, g721 encode, mpeg2 encode, pegwit encode, adpcm encode), a Spec2000int sample (gcc, gzip, parser, vpr), and the average]

High coverage rates, but only with a big strand cache.
Less than a 15% replacement rate, regardless of cache size
Average ALU inst coverage: 16 entries: 12%, 1024 entries: 27%


IPC Improvements

[Figure: 4-wide IPC speedup (axis 1.0 to 2.0) with 16-entry strand cache, per Mediabench and Spec2000int benchmark, with harmonic mean]

Average IPC speedup: 4-wide: 17%, 8-wide: 20%
Some benchmarks almost double in IPC
Some see almost no speedup at all


Resource Occupancy

• CISCification of instructions reduces traffic
– reorder buffer occupancy is reduced by up to 37%
– issue queue occupancy is reduced by up to 34%
– traffic reduction coverage
• Reduced dependence on dependencies
– opportunity for pipelined bypass
– opportunity for pipelined issue


Resource Occupancy

• Caveat emptor
– more worst case issue CAMs
– more worst case register ports
• Prior work applicable
– only 1.2 live inputs / strand

[Figure: strands drawn as groups of add operations]





Conclusion

• Key points:
– eagerly executing macro-instructions value precomputation
– limiting focus to transient operands
– all new hardware off critical path
• Results:
– IPC speedup of 18-20% with 3KB strand cache
– potential for frequency gains
– full binary compatibility
• Lots of current and future research:
– relaxed constraint of ALU instructions
– quantified frequency improvements
– static detection of strands

Questions?


Backup Slides


Sensitivity to Dispatch Delay

[Figure: 4-wide IPC speedup (axis 1.0 to 2.0) with 16-entry strand cache, comparing 0 cycle delay and 3 cycle delay, per Mediabench and Spec2000int benchmark, with harmonic mean]

On average, speedup only drops 1% with three cycles of delay
Most benchmarks lose a small amount of speedup
Some actually get faster due to fewer errant strands
