Dynamically Collapsing Dependencies for IPC and Frequency Gain

Peter G. Sassone
D. Scott Wills

Georgia Tech
Electrical and Computer Engineering
{ sassone, scott.wills } @ ece.gatech.edu

MICRO-37 Sassone & Wills / Georgia Tech / Dynamic Strands 2

Motivation

• Outside of pipeline, global communication dominates
• Memory wall is well studied
• Inside, traditionally computation or logic dominated

[Figure: pipeline stages fetch, decode, rename, issue, exec, commit, with I cache, D cache, L2 cache, and memory]



• Now dominated by local communication paths:
– issue window
– reorder buffer
– register file
– bypass network
• Bottlenecks both IPC and frequency

[Figure: issue queue, issue logic, three ALUs, and register file]



• RISC instruction sets create superfluous traffic
• All instructions and operands are treated as equal
• Little focus on exposing sequentiality


Contributions

• Dynamic Strands:
– collapse dependence-chains without fan-out
– exploit properties for simple value precomputation
– increase efficiency of critical resources
– preserve binary compatibility
• IPC improvements:
– 17-20% speedup on Spec2000int and MediaBench
• Frequency improvements:
– 37% fewer in-flight instructions
– reduced dependence on dependencies


Outline

• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion


Dyadic Dilemma

Performing any operation on more than two sources requires temporary values

int sum( int a, int b, int c, int d )
{
    return a + b + c + d;
}

. . .
add R1 ← R1, R2
add R1 ← R1, R3
add R9 ← R1, R4
. . .

[Figure: dataflow chain R1 + R2 → R1', R1' + R3 → R1'', R1'' + R4 → R9]


Transient Operands

• We term these temporary values transient operands:
– values produced by an ALU inst
– values consumed only once, and only by an ALU inst
• Common in modern integer workloads…

[Figure: percent of dynamic operands (y-axis, 0% to 60%) per benchmark. Mediabench: jpeg, epic, g721, mpeg2, pegwit, and adpcm, each encode and decode. Spec2000int: bzip, gcc, gzip, mcf, parser, vortex, vpr.]

On average, about 40% of all dynamic operands are transient


Strands

• Strands:
– linear chains of instructions joined by transient operands
– non-consecutive
– span basic blocks
– three instructions
– only the final output needs to be committed
• Strands are common
– dyadic temporaries
– compiler strategies
– language semantics

[Figure: chain of three additions over inputs a, b, c, and d]
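A minimal sketch of what collapsing a strand could look like in software terms, assuming a toy register file and add-only instructions. The names `Inst`, `run_separately`, and `run_collapsed` are hypothetical and not from the talk. It evaluates the three-add chain from the Dyadic Dilemma slide once as three ordinary instructions and once as a single strand that writes only its final output.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 16
#define STRAND_LEN 3

/* One add instruction: regs[dst] = regs[src1] + regs[src2]. */
typedef struct { int dst, src1, src2; } Inst;

/* Baseline: every instruction writes its destination register. */
void run_separately(uint32_t regs[NUM_REGS], const Inst strand[STRAND_LEN]) {
    for (int i = 0; i < STRAND_LEN; i++)
        regs[strand[i].dst] = regs[strand[i].src1] + regs[strand[i].src2];
}

/* Collapsed: the transient is carried in a local variable and only the
 * last destination is written back. Assumes (for this sketch) that each
 * later instruction takes the transient as src1, as in the slide's chain. */
void run_collapsed(uint32_t regs[NUM_REGS], const Inst strand[STRAND_LEN]) {
    uint32_t transient = regs[strand[0].src1] + regs[strand[0].src2];
    for (int i = 1; i < STRAND_LEN; i++) {
        assert(strand[i].src1 == strand[i - 1].dst); /* must be a linear chain */
        transient += regs[strand[i].src2];           /* live input only */
    }
    regs[strand[STRAND_LEN - 1].dst] = transient;    /* final output only */
}
```

With R1 through R4 holding 1, 2, 3, and 4, both versions leave 10 in R9. The collapsed version leaves R1 at its old value, which is why a later read of R1 needs the care described on the Dirty Read slide.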




Hardware Overview

[Figure: block diagram. Pipeline: fetch, decode, rename, issue queue, three closed-loop ALUs, reg file, commit. Added units, off the critical path: strand cache fill unit, strand cache, dispatch engine. Edge labels: instructions, transients, strands.]


Algorithm Example

[Figure: the Hardware Overview block diagram (fetch, decode, rename, issue queue, closed-loop ALUs, reg file, commit, strand cache fill unit, strand cache, dispatch engine) annotated with numbered steps 1, 2, 3]


Strand Cache Fill Unit

• Based around the operand table
• Detects conditions of transients
• When found…
– append to existing strand
– begin new strand

Instruction stream (current PC 1416):
1404: R5 ← R0 + 0
1408: . . .
1412: R1 ← R5 + 0
1416: R5 ← R0 + 0

operand table
arch reg | last producer instruction | last consumer instruction | consumer count
R4       |                           |                           |
R5       | PC 1404                   | PC 1412                   | 1
R6       |                           |                           |
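The operand table lookup can be sketched roughly as follows. This is an illustration under assumptions, not the authors' design: the slide shows only the producer, consumer, and count columns, so the ALU flags, the names `OperandRow`, `note_read`, and `note_write`, and the choice to detect a transient at the moment its register is overwritten are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_ARCH_REGS 32

/* One operand table row, mirroring the slide's columns. */
typedef struct {
    bool     valid;           /* a producer has been recorded               */
    bool     producer_is_alu; /* assumption: kept next to the producer PC   */
    bool     consumers_alu;   /* assumption: all consumers so far were ALU  */
    uint32_t producer_pc;     /* last producer instruction                  */
    uint32_t consumer_pc;     /* last consumer instruction                  */
    unsigned consumer_count;
} OperandRow;

typedef struct { OperandRow row[NUM_ARCH_REGS]; } OperandTable;

/* Record that the instruction at `pc` reads architectural register `reg`. */
void note_read(OperandTable *t, int reg, uint32_t pc, bool is_alu) {
    assert(reg >= 0 && reg < NUM_ARCH_REGS);
    OperandRow *r = &t->row[reg];
    r->consumer_pc = pc;
    r->consumer_count++;
    r->consumers_alu = r->consumers_alu && is_alu;
}

/* Record that the instruction at `pc` overwrites `reg`. The old value is
 * dead, so its consumer count is final. Returns true if the old value was
 * transient and reports the producer/consumer pair to link into a strand. */
bool note_write(OperandTable *t, int reg, uint32_t pc, bool is_alu,
                uint32_t *producer_pc, uint32_t *consumer_pc) {
    assert(reg >= 0 && reg < NUM_ARCH_REGS);
    OperandRow *r = &t->row[reg];
    bool transient = r->valid && r->producer_is_alu &&
                     r->consumers_alu && r->consumer_count == 1;
    if (transient) {
        *producer_pc = r->producer_pc;
        *consumer_pc = r->consumer_pc;
    }
    /* Start tracking the new value produced at `pc`. */
    *r = (OperandRow){ .valid = true, .producer_is_alu = is_alu,
                       .consumers_alu = true, .producer_pc = pc };
    return transient;
}
```

Replaying the slide's stream (reads are noted before the write of the same instruction), the write to R5 at PC 1416 reports the pair PC 1404 and PC 1412, matching the R5 row shown in the table.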


Strand Cache

[Figure: strand cache lines (strand 1, strand 2, strand 3). Fields shown: status bits, previous reader info, instructions i1, i2, i3, and pc, ready, value. Each of this instruction, source 1, and source 2 has seen, pc, and inst entries.]

About 175 bytes per line, though very few lines are needed for effect
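As a rough consistency check, assuming the 16-entry configuration used for the IPC results and the 175-byte line size quoted above (the slides do not state the total on this slide), the storage comes to roughly the 3KB strand cache quoted in the conclusion:

```latex
\[
16 \ \text{lines} \times 175 \ \text{bytes/line} = 2800 \ \text{bytes} \approx 2.7\ \text{KiB} \approx 3\ \text{KB}
\]
```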


Dispatch Engine

• Watches for strand cache matches
• Inserts ready strands into the stream eagerly
• Removes component instructions when seen
• Correctness checking with dirty table

[Figure: decode and rename stages with the strand cache and dirty table. Labels: pre-renamed instructions, strands, recovery strands, kill signals.]


Closed-Loop ALUs

• Full bypass is half of the execute stage delay
• Regular ALUs with double-speed closed-loop mode
– two dependent ALU operations in a single cycle
– intermediate values (the transients) are discarded!
– final result still takes ½ cycle for full bypass

[Figure: ALU with a mode switch between the full bypass network and a “free” local bypass; two ½ cycle segments are marked]


Oops… Dirty Read

[Figure: the strand R1 + R2 → R1', R1' + R3 → R1'', R1'' + R4 → R9 is followed by load 16 [ R1 ]. Callouts: "R1 is dirty!" and "insert recovery sub-strand to recover R1", shown as R1 + R2 → R1', R1' + R3 → R1''.]
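One way to picture the dirty table mentioned on the Dispatch Engine slide is as a flag per architectural register. The sketch below is a guess at the bookkeeping, not the authors' hardware; `DirtyTable`, `mark_skipped`, `mark_clean`, and `read_needs_recovery` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_ARCH_REGS 32

/* dirty[r] is true while architectural register r holds a stale value
 * because a dispatched strand skipped writing one of its transients. */
typedef struct { bool dirty[NUM_ARCH_REGS]; } DirtyTable;

/* A strand was dispatched and will not write transient register `reg`. */
void mark_skipped(DirtyTable *d, int reg) {
    assert(reg >= 0 && reg < NUM_ARCH_REGS);
    d->dirty[reg] = true;
}

/* `reg` was overwritten by a later instruction, or a recovery sub-strand
 * recomputed it, so reading it is safe again. */
void mark_clean(DirtyTable *d, int reg) {
    assert(reg >= 0 && reg < NUM_ARCH_REGS);
    d->dirty[reg] = false;
}

/* True when an instruction reading `reg` must wait for a recovery
 * sub-strand, as with load 16 [ R1 ] on the slide. */
bool read_needs_recovery(const DirtyTable *d, int reg) {
    assert(reg >= 0 && reg < NUM_ARCH_REGS);
    return d->dirty[reg];
}
```

In the slide's example, the strand skips writing R1, so R1 is marked; the load of R1 then sees the flag, and the flag is cleared once the recovery sub-strand has produced R1''.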


Oops… Anti-Dependence Violation

[Figure: the strand R1 + R2 → R1', R1' + R3 → R1'', R1'' + R4 → R9, and load 32 [ R9 ]. Callouts: "R9 has already been replaced", "R9 previous value", and "insert load immediate of previous value".]

renaming not sufficient – outside reorder buffer safety net





Instruction Coverage

coverage with various strand cache sizes

[Figure: ALU instruction coverage (y-axis, 0% to 60%) with 16, 64, 256, and 1024 cache entries. Mediabench sample: jpeg encode, epic encode, g721 encode, mpeg2 encode, pegwit encode, adpcm encode. Spec2000int sample: gcc, gzip, parser, vpr. Final group: average.]

High coverage rates, but only with a big strand cache.

Less than a 15% replacement rate, regardless of cache size

Average ALU inst coverage:
– 16 entries: 12%
– 1024 entries: 27%


IPC Improvements

[Figure: 4-wide IPC speedup with 16-entry strand cache. y-axis: IPC speedup, 1.0 to 2.0. Benchmarks: jpeg, epic, g721, mpeg2, pegwit, and adpcm (each encode and decode), bzip, gcc, gzip, mcf, parser, vortex, vpr, plus the harmonic mean.]

Average IPC Speedup:
– 4-wide: 17%
– 8-wide: 20%

Some benchmarks almost double in IPC

Some see almost no speedup at all


Resource Occupancy

• CISCification of instructions reduces traffic
– reorder buffer occupancy is reduced by up to 37%
– issue queue occupancy is reduced by up to 34%
– traffic reduction coverage
• Reduced dependence on dependencies
– opportunity for pipelined bypass
– opportunity for pipelined issue


Resource Occupancy

• Caveat emptor
– more worst case issue CAMs
– more worst case register ports
• Prior work applicable
– only 1.2 live inputs / strand





Conclusion

• Key points:
– eagerly executing macro-instructions value precomputation
– limiting focus to transient operands
– all new hardware off critical path
• Results:
– IPC speedup of 18-20% with 3KB strand cache
– potential for frequency gains
– full binary compatibility
• Lots of current and future research:
– relaxed constraint of ALU instructions
– quantified frequency improvements
– static detection of strands

Questions?


Backup Slides


Sensitivity to Dispatch Delay

[Figure: 4-wide IPC speedup with 16-entry strand cache, for 0 cycle delay and 3 cycle delay. y-axis: IPC speedup, 1.0 to 2.0. Benchmarks: jpeg, epic, g721, mpeg2, pegwit, and adpcm (each encode and decode), bzip, gcc, gzip, mcf, parser, vortex, vpr, plus the harmonic mean.]

On average, speedup only drops 1% with three cycles of delay

Some actually get faster due to fewer errant strands

Most benchmarks lose a small amount of speedup
