
Page 1: Dynamically Collapsing Dependencies for IPC and Frequency Gain

Peter G. Sassone
D. Scott Wills

Georgia Tech
Electrical and Computer Engineering
{ sassone, scott.wills } @ ece.gatech.edu

Page 2: Motivation

MICRO-37 Sassone & Wills / Georgia Tech / Dynamic Strands 2

• Outside of pipeline, global communication dominates
• Memory wall is well studied
• Inside, traditionally computation or logic dominated

Figure: pipeline stages (fetch, decode, rename, issue, exec, commit) with I cache, D cache, L2 cache, and memory

Page 3: Motivation

• Now dominated by local communication paths:
  – issue window
  – reorder buffer
  – register file
  – bypass network
• Bottlenecks both IPC and frequency

Figure: issue queue and issue logic, three ALUs, and the reg file

Page 4: Motivation

• RISC instruction sets create superfluous traffic
• All instructions and operands are treated as equal
• Little focus on exposing sequentiality

Figure: issue queue and issue logic, three ALUs, and the reg file

Page 5: Contributions

• Dynamic Strands:
  – collapse dependence-chains without fan-out
  – exploit properties for simple value precomputation
  – increase efficiency of critical resources
  – preserve binary compatibility
• IPC improvements:
  – 17-20% speedup on Spec2000int and MediaBench
• Frequency improvements:
  – 37% fewer in-flight instructions
  – reduced dependence on dependencies

Page 6: Outline

• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

Page 7: Dyadic Dilemma

Performing any operation on more than two sources requires temporary values

int sum( int a, int b, int c, int d )
{
    return a + b + c + d;
}

. . .
add R1 ← R1, R2
add R1 ← R1, R3
add R9 ← R1, R4
. . .

Figure: dataflow of the three adds. R1 + R2 gives R1', R1' + R3 gives R1'', and R1'' + R4 gives R9

Page 8: Transient Operands

• We term these temporary values transient operands:
  – values produced by an ALU instruction
  – values consumed only once, and only by an ALU instruction
• Common in modern integer workloads…

Figure: percent of dynamic operands that are transient (y-axis 0% to 60%) for Mediabench (jpeg, epic, g721, mpeg2, pegwit, adpcm; encode and decode) and Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr)

On average, about 40% of all dynamic operands are transient
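
The two conditions above can be checked mechanically over an instruction trace. The following is a minimal sketch, not the authors' measurement tool: the `Inst` trace format, the 32-register file, and the rule that a value is only classified once its register is overwritten are all assumptions made for illustration.

```c
#include <stddef.h>

#define NUM_REGS 32      /* assumed register count; registers must be 0..31 */
#define NO_REG   (-1)

/* One dynamic instruction in a hypothetical trace format. */
typedef struct {
    int is_alu;          /* 1 for ALU instructions, 0 for loads, stores, branches */
    int dest;            /* destination register, or NO_REG */
    int src[2];          /* source registers, NO_REG if unused */
} Inst;

/* State of the value currently held in one register. */
typedef struct {
    int live;            /* a value was produced inside this trace */
    int made_by_alu;     /* its producer was an ALU instruction */
    int reads;           /* number of reads so far */
    int non_alu_read;    /* set if any reader was not an ALU instruction */
} ValueState;

/* Count values that were produced by an ALU instruction and consumed
 * exactly once, by an ALU instruction, before being overwritten. */
size_t count_transients(const Inst *trace, size_t n)
{
    ValueState regs[NUM_REGS] = {0};
    size_t transients = 0;

    for (size_t i = 0; i < n; i++) {
        const Inst *in = &trace[i];

        /* Reads come first, so "add R1 <- R1, R2" reads the old R1. */
        for (int s = 0; s < 2; s++) {
            int r = in->src[s];
            if (r == NO_REG)
                continue;
            regs[r].reads++;
            if (!in->is_alu)
                regs[r].non_alu_read = 1;
        }

        if (in->dest != NO_REG) {
            ValueState *v = &regs[in->dest];
            /* The old value dies here, so it can now be classified. */
            if (v->live && v->made_by_alu && v->reads == 1 && !v->non_alu_read)
                transients++;
            v->live = 1;
            v->made_by_alu = in->is_alu;
            v->reads = 0;
            v->non_alu_read = 0;
        }
    }
    return transients;
}
```

Feeding it the three adds from the Dyadic Dilemma slide followed by any instruction that overwrites R1 classifies both R1' and R1'' as transient. Values still live at the end of the trace are left unclassified, which makes the count conservative.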

Page 9: Strands

• Strands:
  – linear chains of instructions joined by transient operands
  – non-consecutive
  – span basic blocks
  – three instructions
  – only the final output needs to be committed
• Strands are common
  – dyadic temporaries
  – compiler strategies
  – language semantics

Figure: a strand of three chained adds over the inputs a, b, c, and d

Page 10: Outline

• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

Page 11: Hardware Overview

Figure: pipeline (fetch, decode, rename, issue queue, reg file, closed-loop ALUs, commit) extended with a strand cache fill unit, a strand cache, and a dispatch engine. The connecting paths are labeled instructions, transients, and strands, and the added units are marked as off the critical path

Page 12: Algorithm Example

Figure: the Hardware Overview diagram (fetch, decode, rename, issue queue, reg file, closed-loop ALUs, commit, strand cache fill unit, strand cache, dispatch engine) annotated with numbered steps

Page 13: Strand Cache Fill Unit

• Based around the operand table
• Detects conditions of transients
• When found…
  – append to existing strand
  – begin new strand

Example instruction stream (current PC 1416):
1404: R5 ← R0 + 0
1408: . . .
1412: R1 ← R5 + 0
1416: R5 ← R0 + 0

Operand table (one row per arch reg; rows shown for R4, R5, R6):
arch reg | last producer instruction | last consumer instruction | consumer count
R5 | PC 1404 | PC 1412 | 1
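
The operand table lookup can be sketched in C as follows. This is an illustrative model under stated assumptions, not the hardware described in the talk: the `OperandTable`, `TransientLink`, and `operand_table_observe` names, the 32 architectural registers, and the choice to report a transient at the moment its register is overwritten are all hypothetical.

```c
#include <stdint.h>

#define NUM_ARCH_REGS 32   /* assumed register count */
#define NO_REG (-1)

/* One operand table row, mirroring the columns on the slide. */
typedef struct {
    int      valid;              /* a producer has been recorded */
    uint32_t producer_pc;        /* last producer instruction */
    int      producer_is_alu;
    uint32_t consumer_pc;        /* last consumer instruction */
    int      consumer_count;     /* reads since the last write */
    int      consumers_all_alu;
} OperandEntry;

typedef struct { OperandEntry entry[NUM_ARCH_REGS]; } OperandTable;

/* A producer/consumer pair joined by a transient operand. */
typedef struct { uint32_t producer_pc; uint32_t consumer_pc; } TransientLink;

static void note_read(OperandTable *t, int reg, uint32_t pc, int is_alu)
{
    if (reg == NO_REG)
        return;
    OperandEntry *e = &t->entry[reg];
    e->consumer_pc = pc;
    e->consumer_count++;
    if (!is_alu)
        e->consumers_all_alu = 0;
}

/* Observe one instruction. Returns 1 and fills *link when the value being
 * overwritten in dest met the transient conditions, otherwise returns 0. */
int operand_table_observe(OperandTable *t, uint32_t pc, int is_alu,
                          int dest, int src1, int src2, TransientLink *link)
{
    int found = 0;

    note_read(t, src1, pc, is_alu);
    note_read(t, src2, pc, is_alu);

    if (dest != NO_REG) {
        OperandEntry *e = &t->entry[dest];
        if (e->valid && e->producer_is_alu &&
            e->consumer_count == 1 && e->consumers_all_alu) {
            link->producer_pc = e->producer_pc;
            link->consumer_pc = e->consumer_pc;
            found = 1;
        }
        /* Start tracking the new value of dest. */
        e->valid = 1;
        e->producer_pc = pc;
        e->producer_is_alu = is_alu;
        e->consumer_pc = 0;
        e->consumer_count = 0;
        e->consumers_all_alu = 1;
    }
    return found;
}
```

With the slide's stream, the instructions at 1404 and 1412 report nothing, and the write to R5 at 1416 reports the link (1404, 1412). A fill unit would then either append that link to an existing strand or begin a new one.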

Page 14: Strand Cache

Figure: strand cache lines (strand 1, strand 2, strand 3). Field labels on a line: status bits (example pattern 101110101), previous reader info, the instructions i1, i2, i3, the fields pc, ready, and value, and seen / pc / inst groups for this instruction, source 1, and source 2

About 175 bytes per line, though very few lines are needed for effect
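
As a rough size check, assuming the 16-entry strand cache used for the IPC results later in the talk and the 175 bytes per line quoted here:

```latex
16 \times 175\ \text{bytes} = 2800\ \text{bytes} \approx 2.7\ \text{KB}
```

This is consistent with the roughly 3KB strand cache quoted in the conclusion, though the deck does not state that derivation explicitly.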

Page 15: Dispatch Engine

• Watches for strand cache matches
• Inserts ready strands into the stream eagerly
• Removes component instructions when seen
• Correctness checking with dirty table

Figure: dispatch engine between decode and rename, connected to the strand cache and the dirty table. Signal labels: pre-renamed instructions; strands, recovery strands, kill signals

Page 16: Closed-Loop ALUs

• Full bypass is half of the execute stage delay
• Regular ALUs with double-speed closed-loop mode
  – two dependent ALU operations in a single cycle
  – intermediate values (the transients) are discarded!
  – final result still takes ½ cycle for full bypass

Figure: an ALU (½ cycle) with a mode switch between the full bypass network (½ cycle) and a "free" local bypass
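
The timing argument can be written out as a small latency model. This is a sketch under assumptions: latencies are counted in half-cycles, the ALU and the full bypass each cost one half-cycle as the slide states, and the local bypass is modeled as zero delay because the slide calls it "free". The function names are hypothetical.

```c
/* Latencies in half-cycles, following the slide's split of the execute
 * stage: half for the ALU, half for the full bypass network. */
enum { ALU_HALF_CYCLES = 1, FULL_BYPASS_HALF_CYCLES = 1 };

/* Normal mode: every result crosses the full bypass network before the
 * next dependent operation can start. */
unsigned normal_chain_half_cycles(unsigned n_ops)
{
    return n_ops * (ALU_HALF_CYCLES + FULL_BYPASS_HALF_CYCLES);
}

/* Closed-loop mode: intermediate values use the local bypass (assumed to
 * add no delay), and only the final result pays for the full bypass. */
unsigned closed_loop_chain_half_cycles(unsigned n_ops)
{
    if (n_ops == 0)
        return 0;
    return n_ops * ALU_HALF_CYCLES + FULL_BYPASS_HALF_CYCLES;
}
```

Under this model a three-instruction strand takes 4 half-cycles (2 cycles) in closed-loop mode against 6 half-cycles (3 cycles) as three separate instructions, while a single instruction costs the same in either mode.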

Page 17: Oops… Dirty Read

Figure: the strand R1 + R2 → R1', R1' + R3 → R1'', R1'' + R4 → R9, together with the instruction load 16 [ R1 ]. A second panel shows the sub-strand R1 + R2 → R1', R1' + R3 → R1''

• R1 is dirty!
• Insert recovery sub-strand to recover R1
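
The dirty read can be reproduced with plain register arithmetic. The sketch below is one reading of the slide, not the authors' mechanism: it assumes the collapsed strand commits only R9 and leaves R1 at its old value, and it models the recovery sub-strand as recomputing R1 from the strand's first two adds. All function names are hypothetical.

```c
#define NUM_REGS 16   /* assumed size; only R1..R4 and R9 are used */

/* The original three-instruction sequence from the Dyadic Dilemma slide. */
void run_original(int r[NUM_REGS])
{
    r[1] = r[1] + r[2];
    r[1] = r[1] + r[3];
    r[9] = r[1] + r[4];
}

/* The collapsed strand: transients stay internal, only R9 is committed. */
void run_strand(int r[NUM_REGS])
{
    int t1 = r[1] + r[2];   /* R1'  */
    int t2 = t1 + r[3];     /* R1'' */
    r[9] = t2 + r[4];
}

/* Recovery sub-strand: recompute the value R1 would have held. */
void run_recovery_substrand(int r[NUM_REGS])
{
    int t1 = r[1] + r[2];
    r[1] = t1 + r[3];
}

/* Returns 1 when a read of R1 after the strand would see a stale value. */
int r1_is_dirty(const int original[NUM_REGS], const int collapsed[NUM_REGS])
{
    return original[1] != collapsed[1];
}
```

Starting from R1 = 1, R2 = 2, R3 = 3, R4 = 4, both versions produce R9 = 10, but the original leaves R1 = 6 while the strand leaves R1 = 1. A later `load 16 [ R1 ]` would therefore read a dirty register until the recovery sub-strand restores R1 = 6.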

Page 18: Oops… Anti-Dependence Violation

Figure: the strand R1 + R2 → R1', R1' + R3 → R1'', R1'' + R4 → R9, together with the instruction load 32 [ R9 ], which needs the previous value of R9

• R9 has already been replaced
• Insert load immediate of previous value
• Renaming not sufficient: outside reorder buffer safety net

Page 19: Outline

• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

Page 20: Instruction Coverage

Figure: coverage with various strand cache sizes. ALU instruction coverage (y-axis 0% to 60%) for a Mediabench sample (jpeg, epic, g721, mpeg2, pegwit, and adpcm encode) and a Spec2000int sample (gcc, gzip, parser, vpr), plus the average, with 16, 64, 256, and 1024 cache entries

• High coverage rates, but only with a big strand cache.
• Less than a 15% replacement rate, regardless of cache size
• Average ALU inst coverage:
  – 16: 12%
  – 1024: 27%

Page 21: IPC Improvements

Figure: 4-wide IPC speedup with 16-entry strand cache. IPC speedup (y-axis 1.0 to 2.0) for Mediabench (jpeg, epic, g721, mpeg2, pegwit, adpcm; encode and decode) and Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr), plus the harmonic mean

• Average IPC Speedup:
  – 4-wide: 17%
  – 8-wide: 20%
• Some benchmarks almost double in IPC
• Some see almost no speedup at all

Page 22: Resource Occupancy

• CISCification of instructions reduces traffic
  – reorder buffer occupancy is reduced by up to 37%.
  – issue queue occupancy is reduced by up to 34%.
  – traffic reduction coverage
• Reduced dependence on dependencies
  – opportunity for pipelined bypass
  – opportunity for pipelined issue.

Page 23: Resource Occupancy

• Caveat emptor
  – more worst case issue CAMs
  – more worst case register ports
• Prior work applicable
  – only 1.2 live inputs / strand

Figure: strands drawn as chains of adds

Page 24: Outline

• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

Page 25: Conclusion

• Key points:
  – eagerly executing macro-instructions value precomputation
  – limiting focus to transient operands
  – all new hardware off critical path
• Results:
  – IPC speedup of 18-20% with 3KB strand cache
  – potential for frequency gains
  – full binary compatibility
• Lots of current and future research:
  – relaxed constraint of ALU instructions
  – quantified frequency improvements
  – static detection of strands

Questions?

Page 26: Backup Slides

Page 27: Sensitivity to Dispatch Delay

Figure: 4-wide IPC speedup with 16-entry strand cache. IPC speedup (y-axis 1.0 to 2.0) with 0 cycle delay and 3 cycle delay for Mediabench (jpeg, epic, g721, mpeg2, pegwit, adpcm; encode and decode) and Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr), plus the harmonic mean

• On average, speedup only drops 1% with three cycles of delay
• Some actually get faster due to fewer errant strands
• Most benchmarks lose a small amount of speedup