Dynamically Collapsing Dependencies for IPC and Frequency Gain


Page 1: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Dynamically Collapsing Dependencies for IPC and Frequency Gain

Peter G. Sassone, D. Scott Wills

Georgia Tech, Electrical and Computer Engineering
{ sassone, scott.wills } @ ece.gatech.edu

Page 2: Dynamically Collapsing Dependencies for IPC and Frequency Gain

MICRO-37 Sassone & Wills / Georgia Tech / Dynamic Strands 2

Motivation

• Outside of pipeline, global communication dominates
• Memory wall is well studied
• Inside, traditionally computation or logic dominated

[Figure: pipeline stages (fetch, decode, rename, issue, exec, commit) with I cache, D cache, L2 cache, and memory]

Page 3: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Motivation

• Now dominated by local communication paths:
  – issue window
  – reorder buffer
  – register file
  – bypass network
• Bottlenecks both IPC and frequency

[Figure: issue queue, issue logic, three ALUs, and register file]

Page 4: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Motivation

• RISC instruction sets create superfluous traffic
• All instructions and operands are treated as equal
• Little focus on exposing sequentiality

[Figure: issue queue, issue logic, three ALUs, and register file]

Page 5: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Contributions

• Dynamic Strands:
  – collapse dependence-chains without fan-out
  – exploit properties for simple value precomputation
  – increase efficiency of critical resources
  – preserve binary compatibility
• IPC improvements:
  – 17-20% speedup on Spec2000int and MediaBench
• Frequency improvements:
  – 37% fewer in-flight instructions
  – reduced dependence on dependencies

Page 6: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Outline

• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

Page 7: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Dyadic Dilemma

Performing any operation on more than two sources requires temporary values

int sum( int a, int b, int c, int d ){ return a + b + c + d; }

. . .
add R1 ← R1, R2
add R1 ← R1, R3
add R9 ← R1, R4
. . .

[Figure: dataflow graph of three chained additions: R1 + R2 gives R1’, R1’ + R3 gives R1’’, and R1’’ + R4 gives R9]
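The dyadic dilemma can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' tooling: the function `lower_sum` and the temporary names `t0`, `t1` are hypothetical. It lowers an n-source sum into two-source adds and reports the temporaries this creates.

```python
from itertools import count


def lower_sum(sources, dest):
    """Lower dest = sources[0] + ... + sources[-1] into two-source adds.

    Returns (instructions, temporaries); each instruction is (dest, src1, src2).
    Temporary names t0, t1, ... are hypothetical, not taken from the slides.
    """
    if len(sources) < 2:
        raise ValueError("need at least two sources")
    temp_ids = count()
    instructions, temporaries = [], []
    acc = sources[0]
    for i, src in enumerate(sources[1:], start=1):
        last = i == len(sources) - 1
        out = dest if last else f"t{next(temp_ids)}"
        if not last:
            temporaries.append(out)   # value exists only to feed the next add
        instructions.append((out, acc, src))
        acc = out
    return instructions, temporaries


insts, temps = lower_sum(["a", "b", "c", "d"], "r9")
for out, s1, s2 in insts:
    print(f"add {out} <- {s1}, {s2}")
print("temporaries:", temps)
```

A four-source sum needs three adds and two temporaries; in general n sources need n - 1 adds and n - 2 temporaries.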

Page 8: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Transient Operands

• We term these temporary values transient operands:
  – values produced by an ALU inst
  – values consumed only once, and only by an ALU inst
• Common in modern integer workloads…

[Figure: percent of dynamic operands that are transient (y-axis, 0% to 60%) for Mediabench (jpeg, epic, g721, mpeg2, pegwit, and adpcm, encode and decode) and Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr)]

On average, about 40% of all dynamic operands are transient
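The definition above can be checked against an instruction trace. The following is a minimal sketch under stated assumptions: the `Inst` record, the `kind` labels, and the choice to count produced register values as the denominator are all hypothetical and not taken from the slides. Values still live at the end of the trace are conservatively treated as not transient, since later consumers are unknown.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Inst:
    kind: str                 # "alu", "load", ... (labels are assumptions)
    dest: Optional[str]       # destination register, or None
    srcs: Tuple[str, ...]     # source registers


def transient_producers(trace):
    """Return indices of instructions whose result is a transient operand."""
    live = {}        # register -> [producer index, list of consumer indices]
    transient = []

    def retire(reg):
        producer, consumers = live.pop(reg)
        if (trace[producer].kind == "alu"
                and len(consumers) == 1
                and trace[consumers[0]].kind == "alu"):
            transient.append(producer)

    for i, inst in enumerate(trace):
        for reg in inst.srcs:
            if reg in live:
                live[reg][1].append(i)      # this instruction consumes the value
        if inst.dest is not None:
            if inst.dest in live:
                retire(inst.dest)           # overwritten: consumer list is final
            live[inst.dest] = [i, []]
    return sorted(transient)


trace = [
    Inst("alu", "R1", ("R1", "R2")),   # add R1 <- R1, R2
    Inst("alu", "R1", ("R1", "R3")),   # add R1 <- R1, R3
    Inst("alu", "R9", ("R1", "R4")),   # add R9 <- R1, R4
    Inst("alu", "R1", ("R0", "R0")),   # hypothetical later overwrite of R1
]
found = transient_producers(trace)
produced = sum(1 for inst in trace if inst.dest is not None)
print("transient producers:", found)
print(f"fraction of produced values: {len(found) / produced:.2f}")
```

The first two adds produce values that are each read exactly once by another ALU instruction, so they qualify; R9 does not, because its consumers are not yet known.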

Page 9: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Strands

• Strands:
  – linear chains of instructions joined by transient operands
  – non-consecutive
  – span basic blocks
  – three instructions
  – only the final output needs to be committed
• Strands are common
  – dyadic temporaries
  – compiler strategies
  – language semantics

[Figure: three chained additions combining a, b, c, and d]

Page 10: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Outline

• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

Page 11: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Hardware Overview

[Figure: block diagram of the pipeline (fetch, decode, rename, issue queue, reg file, closed-loop ALUs, commit) with the added strand cache fill unit, strand cache, and dispatch engine; arrows are labeled instructions, transients, and strands; the added units are marked as off the critical path]

Page 12: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Algorithm Example

[Figure: the Hardware Overview block diagram (fetch, decode, rename, issue queue, reg file, closed-loop ALUs, commit, strand cache fill unit, strand cache, dispatch engine) annotated with numbered steps 1, 2, and 3]

Page 13: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Strand Cache Fill Unit

• Based around the operand table
• Detects conditions of transients
• When found…
  – append to existing strand
  – begin new strand

[Figure: operand table indexed by arch reg (R4, R5, R6 shown) with columns last producer instruction, last consumer instruction, and consumer count. Example instruction stream, with the current PC at 1416:]

1404: R5 ← R0 + 0
1408: . . .
1412: R1 ← R5 + 0
1416: R5 ← R0 + 0

[The R5 row of the operand table reads: last producer PC 1404, last consumer PC 1412, consumer count 1]
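The fill unit's bookkeeping can be sketched as follows. This is a simplified model under stated assumptions, not the authors' hardware: the class names `Entry` and `FillUnit`, the stored "is ALU" bits, the `links` set used to avoid recording the same pair twice, and the three-instruction cap are all hypothetical choices made for illustration. A value is judged only when its register is overwritten, because that is when its consumer count is final.

```python
from dataclasses import dataclass, field

MAX_STRAND_LEN = 3  # assumption: the slides describe three-instruction strands


@dataclass
class Entry:
    producer_pc: int
    producer_is_alu: bool
    consumer_pc: int = -1          # -1 means "no consumer seen yet"
    consumer_is_alu: bool = False
    consumer_count: int = 0


@dataclass
class FillUnit:
    table: dict = field(default_factory=dict)    # arch reg -> Entry
    strands: list = field(default_factory=list)  # each strand is a list of PCs
    links: set = field(default_factory=set)      # (producer, consumer) pairs seen

    def observe(self, pc, is_alu, dest, srcs):
        # Record this instruction as a consumer of each source's current value.
        for reg in srcs:
            entry = self.table.get(reg)
            if entry is not None:
                entry.consumer_pc = pc
                entry.consumer_is_alu = is_alu
                entry.consumer_count += 1
        if dest is None:
            return
        # Overwriting dest ends the old value's lifetime, so it can be judged.
        old = self.table.get(dest)
        if old is not None and self._is_transient(old):
            self._link(old.producer_pc, old.consumer_pc)
        self.table[dest] = Entry(producer_pc=pc, producer_is_alu=is_alu)

    @staticmethod
    def _is_transient(entry):
        return (entry.producer_is_alu and entry.consumer_is_alu
                and entry.consumer_count == 1)

    def _link(self, producer_pc, consumer_pc):
        if (producer_pc, consumer_pc) in self.links:
            return                                   # already recorded
        self.links.add((producer_pc, consumer_pc))
        for strand in self.strands:
            if strand[-1] == producer_pc and len(strand) < MAX_STRAND_LEN:
                strand.append(consumer_pc)           # append to existing strand
                return
        self.strands.append([producer_pc, consumer_pc])  # begin new strand


fill = FillUnit()
fill.observe(1404, True, "R5", ("R0",))   # 1404: R5 <- R0 + 0
fill.observe(1412, True, "R1", ("R5",))   # 1412: R1 <- R5 + 0
fill.observe(1416, True, "R5", ("R0",))   # 1416: R5 <- R0 + 0 (overwrites R5)
print(fill.strands)
```

With the slide's example stream, the write at PC 1416 finds the R5 entry with one ALU producer (1404) and one ALU consumer (1412), so the pair becomes a new strand. The sketch only appends at the tail of a strand, so links discovered out of program order would start separate strands.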

Page 14: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Strand Cache

[Figure: layout of strand cache lines (strand 1, strand 2, strand 3 shown), drawn next to a chain of three additions. Labels in the figure: status bits (example 101110101); previous reader info; instructions i1, i2, i3; pc, ready, value; this instruction, source 1, source 2; seen, pc, inst]

About 175 bytes per line, though very few lines are needed for effect
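As a quick check on the storage cost, the per-line figure can be multiplied out for the cache sizes evaluated in the results. This is back-of-the-envelope arithmetic that assumes one cache entry is one 175-byte line; the helper name `strand_cache_bytes` is hypothetical.

```python
BYTES_PER_LINE = 175  # "about 175 bytes per line" from the slide


def strand_cache_bytes(entries, bytes_per_line=BYTES_PER_LINE):
    """Total strand cache storage, assuming one entry per line."""
    return entries * bytes_per_line


for entries in (16, 64, 256, 1024):
    size = strand_cache_bytes(entries)
    print(f"{entries:5d} entries: {size:7d} bytes ({size / 1024:.1f} KiB)")
```

A 16-entry cache comes to 2,800 bytes, which lines up with the roughly 3KB strand cache quoted in the conclusion.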

Page 15: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Dispatch Engine

• Watches for strand cache matches
• Inserts ready strands into the stream eagerly
• Removes component instructions when seen
• Correctness checking with dirty table

[Figure: the dispatch engine sits between decode and rename and is connected to the strand cache and the dirty table; labeled signals are pre-renamed instructions, strands, recovery strands, and kill signals]

Page 16: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Closed-Loop ALUs

• Full bypass is half of the execute stage delay
• Regular ALUs with double-speed closed-loop mode
  – two dependent ALU operations in a single cycle
  – intermediate values (the transients) are discarded!
  – final result still takes ½ cycle for full bypass

[Figure: an ALU (½ cycle) with a mode switch that selects between a “free” local bypass and the full bypass network (½ cycle)]
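One way to read the timing on this slide is as a simple latency model. The sketch below is an interpretation, not a statement of the authors' circuit: it assumes an ALU operation and a full bypass each take half a cycle, that a regular dependent chain pays both for every operation, and that in closed-loop mode only the final result pays the full bypass. The function names are hypothetical.

```python
ALU_DELAY = 0.5     # cycles per ALU operation (assumed from the "½ cycle" labels)
BYPASS_DELAY = 0.5  # cycles for the full bypass network (assumed likewise)


def regular_chain_latency(n_ops):
    """Dependent chain on regular ALUs: every result crosses the full bypass."""
    return n_ops * (ALU_DELAY + BYPASS_DELAY)


def closed_loop_chain_latency(n_ops):
    """Closed-loop mode: intermediates use the local bypass at no cost,
    and only the final result crosses the full bypass."""
    return n_ops * ALU_DELAY + BYPASS_DELAY


for n in (1, 2, 3):
    print(n, regular_chain_latency(n), closed_loop_chain_latency(n))
```

Under these assumptions a chain of three dependent operations drops from three cycles to two, and a single operation is unchanged.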

Page 17: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Oops… Dirty Read

[Figure: the strand computing R9 from R1, R2, R3, and R4 (intermediate values R1’ and R1’’), followed by an instruction load 16 [ R1 ] that reads R1]

R1 is dirty!
insert recovery sub-strand to recover R1

[Figure: recovery sub-strand of two additions over R1, R2, and R3, producing R1’ and R1’’]

Page 18: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Oops… Anti-Dependence Violation

[Figure: the strand computing R9 from R1, R2, R3, and R4 (intermediate values R1’ and R1’’), and an instruction load 32 [ R9 ] that needs the previous value of R9]

R9 has already been replaced
insert load immediate of previous value
renaming not sufficient: outside reorder buffer safety net

Page 19: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Outline

• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

Page 20: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Instruction Coverage

[Figure: coverage with various strand cache sizes. ALU instruction coverage (y-axis, 0% to 60%) for a Mediabench sample (jpeg encode, epic encode, g721 encode, mpeg2 encode, pegwit encode, adpcm encode), a Spec2000int sample (gcc, gzip, parser, vpr), and the average, with 16, 64, 256, and 1024 cache entries]

High coverage rates, but only with a big strand cache.
Less than a 15% replacement rate, regardless of cache size

Average ALU inst coverage:
16 entries: 12%
1024 entries: 27%

Page 21: Dynamically Collapsing Dependencies for IPC and Frequency Gain

IPC Improvements

[Figure: 4-wide IPC speedup with 16-entry strand cache. IPC speedup (y-axis, 1.0 to 2.0) for Mediabench (jpeg, epic, g721, mpeg2, pegwit, and adpcm, encode and decode) and Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr), plus the harmonic mean]

Average IPC Speedup:
4-wide: 17%
8-wide: 20%

Some benchmarks almost double in IPC
Some see almost no speedup at all
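The chart's summary bar is labeled as a harmonic mean. The short sketch below shows how such a summary behaves when a few benchmarks nearly double and others barely move. The speedup values are hypothetical placeholders, not numbers read from the chart.

```python
from statistics import harmonic_mean

# Hypothetical per-benchmark IPC speedups, NOT values read from the chart.
speedups = [1.9, 1.02, 1.15, 1.2]

summary = harmonic_mean(speedups)
arithmetic = sum(speedups) / len(speedups)
print(f"harmonic mean speedup:   {summary:.3f}")
print(f"arithmetic mean speedup: {arithmetic:.3f}")
```

The harmonic mean never exceeds the arithmetic mean, so a single benchmark that almost doubles in IPC does not dominate the reported average.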

Page 22: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Resource Occupancy

• CISCification of instructions reduces traffic
  – reorder buffer occupancy is reduced up to 37%
  – issue queue occupancy is reduced up to 34%
  – traffic reduction coverage
• Reduced dependence on dependencies
  – opportunity for pipelined bypass
  – opportunity for pipelined issue

Page 23: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Resource Occupancy

• Caveat emptor
  – more worst case issue CAMs
  – more worst case register ports
• Prior work applicable
  – only 1.2 live inputs / strand

[Figure: strands drawn as chains of + operations]

Page 24: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Outline

• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

Page 25: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Conclusion

• Key points:
  – eagerly executing macro-instructions value precomputation
  – limiting focus to transient operands
  – all new hardware off critical path
• Results:
  – IPC speedup of 18-20% with 3KB strand cache
  – potential for frequency gains
  – full binary compatibility
• Lots of current and future research:
  – relaxed constraint of ALU instructions
  – quantified frequency improvements
  – static detection of strands

Questions?

Page 26: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Backup Slides

Page 27: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Sensitivity to Dispatch Delay

[Figure: 4-wide IPC speedup with 16-entry strand cache, comparing 0 cycle delay and 3 cycle delay. IPC speedup (y-axis, 1.0 to 2.0) for Mediabench (jpeg, epic, g721, mpeg2, pegwit, and adpcm, encode and decode) and Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr), plus the harmonic mean]

On average, speedup only drops 1% with three cycles of delay
Some actually get faster due to fewer errant strands
Most benchmarks lose a small amount of speedup