cse502: computer architecture superscalar decode

20
CSE502: Computer Architecture CSE 502: Computer Architecture Superscalar Decode

Upload: luis-wisby

Post on 31-Mar-2015

219 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

CSE 502:Computer Architecture

Superscalar Decode

Page 2: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Superscalar Decode for RISC ISAs• Decode X insns. per cycle (e.g., 4-wide)– Just duplicate the hardware– Instructions aligned at 32-bit boundaries

32-bit inst

Decoder

decodedinst

scalar

Decoder Decoder Decoder

32-bit inst

Decoder

decodedinst

superscalar

4-wide superscalar fetch

32-bit inst32-bit inst32-bit inst

decodedinst

decodedinst

decodedinst

1-Fetch

Page 3: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

uop Limits• How many uops can a decoder generate?– For complex x86 insts, many are needed (10’s, 100’s?)– Makes decoder horribly complex– Typically there’s a limit to keep complexity under control

• One x86 instruction 1-4 uops• Most instructions translate to 1.5-2.0 uops

• What if a complex insn. needs more than 4 uops?

Page 4: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

UROM/MS for Complex x86 Insts• UROM (microcode-ROM) stores the uop equivalents– Used for nasty x86 instructions

• Complex like > 4 uops (PUSHA or STRREP.MOV)• Obsolete (like AAA)

• Microsequencer (MS) handles the UROM interaction

Page 5: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

UROM/MS Example (3 uop-wide)

ADD

STORE

SUB

Cycle 1

ADD

STA

STD

SUB

Cycle 2

REP.MOV

ADD [ ]

Fetch- x86 insts

Decode - uops

UROM - uops

SUB

LOAD

STORE

REP.MOV

ADD [ ]

XOR

Cycle 3

INC

mJCC

LOAD

Cycle 4

Complex instructions, getuops from mcode sequencer

REP.MOV

ADD [ ]

XOR

STORE

INC

mJCC

Cycle 5

REP.MOV

ADD [ ]

XOR

Cycle n

REP.MOV

ADD [ ]

XOR

mJCC

LOAD

Cycle n+1

ADD

XOR

Page 6: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Superscalar CISC Decode• Instruction Length Decode (ILD)– Where are the instructions?

• Limited decode – just enough to parse prefixes, modes

• Shift/Alignment– Get the right bytes to the decoders

• Decode– Crack into uops

Do this for N instructions per cycle!

Page 7: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

ILD Recurrence/LoopPCi = XPCi+1 = PCi + sizeof( Mem[PCi] )PCi+2 = PCi+1 + sizeof( Mem[PCi+1] )

= PCi + sizeof( Mem[PCi] ) + sizeof( Mem[PCi+1] )

• Can’t find start of next insn. before decoding the first• Must do ILD serially– ILD of 4 insns/cycle implies cycle time will be 4x

Critical component not pipeline-able

Page 8: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Bad x86 Decode Implementation

Left Shifter

Decoder 3

Cycl

e 3

Decoder 2Decoder 1

Instruction Bytes (ex. 16 bytes)

ILD (limited decode)Length 1

Length 2

+

Cycl

e 1

Inst 1 Inst 2 Inst 3 Remainder

Cycl

e 2

+

Length 3

bytesdecoded

ILD (limited decode)

ILD (limited decode)

ILD dominates cycle time; not scalable

Page 9: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Hardware-Intensive Decode

Decode from everypossible instructionstarting point!

Giant MUXes toselect instructionbytes

ILD

DecoderIL

D

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

ILD

DecoderDecoder

Page 10: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

ILD in Hardware-Intensive Approach

6 bytes

4 bytes

3 bytes+

Total bytes decode = 11Previous: 3 ILD + 2add

Now: 1ILD + 2(mux+add)

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Deco

der

Page 11: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Predecoding• ILD loop is hardware intensive– Impacts latency– Consumes substantial power

• If instructions A, B and C are decoded – ... lengths for A, B, and C will still be the same next time– No need to repeat ILD

Possible to cache the ILD work

Page 12: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Decoder Example: AMD K5 (1/2)

Predecode Logic

From Memory

b0 b1 b2 … b78 bytes

I$

b0 b1 b2 … b7

8 bits

+5 bits

13 bytes

Decode

16 (8-bit inst + 5-bit predecode)

8 (8-bit inst + 5-bit predecode)

Up to 4 ROPs

Compute ILD on fetch, store ILD in the I$

Page 13: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Decoder Example: AMD K5 (2/2)• Predecode makes decode easier by providing:– Instruction start/end location (ILD)– Number of ROPs needed per inst– Opcode and prefix locations

• Power/performance tradeoffs– Larger I$ (increase array by 62.5%)

• Longer I$ latency, more I$ power consumption

– Remove logic from decode• Shorter pipeline, simpler logic• Cache and reused decode work less decode power

– Longer effective I-L1 miss latency (ILD on fill)

Page 14: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Decoder Example: Intel P-Pro

• Only Decoder 0 interfaces with the uROM and MS• If insn. in Decoder 1 or Decoder 2 requires > 1

uop1) do not generate output2) shift to Decoder 0 on the next cycle

16 Raw Instruction Bytes

Decoder0

Decoder1

Decoder2

BranchAddress

Calculator

Fetch resteerif needed

mROM

4 uops 1 uop 1 uop

Page 15: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Fetch Rate is an ILP Upper Bound• Instruction fetch limits performance– To sustain IPC of N, must sustain a fetch rate of N per cycle

• If you consume 1500 calories per day,but burn 2000 calories per day,then you will eventually starve.

– Need to fetch N on average, not on every cycle

• N-wide superscalar ideally fetches N insns. per cycle• This doesn’t happen in practice due to:– Instruction cache organization– Branches– … and interaction between the two

Page 16: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Instruction Cache Organization• To fetch N instructions per cycle...– L1-I line must be wide enough for N instructions

• PC register selects L1-I line• A fetch group is the set of insns. starting at PC– For N-wide machine, [PC,PC+N-1]

Deco

der

Tag Inst Inst Inst InstTag Inst Inst Inst InstTag Inst Inst Inst Inst

Tag Inst Inst Inst InstTag Inst Inst Inst Inst

Cache LinePC

Page 17: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Fetch Misalignment (1/2)• If PC = xxx01001, N=4:– Ideal fetch group is xxx01001 through xxx01100 (inclusive)

Deco

der

Tag Inst Inst Inst InstTag Inst Inst Inst InstTag Inst Inst Inst Inst

Tag Inst Inst Inst Inst

Tag Inst Inst Inst Inst

000001010011

111

PC: xxx01001 00 01 10 11

Line widthFetch group

Misalignment reduces fetch width

Page 18: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Fetch Misalignment (2/2)• Now takes two cycles to fetch N instructions– ½ fetch bandwidth!

Deco

der

Tag Inst Inst Inst InstTag Inst Inst Inst InstTag Inst Inst Inst Inst

Tag Inst Inst Inst Inst

Tag Inst Inst Inst Inst

000001010011

111

PC: xxx01001 00 01 10 11

Deco

der

Tag Inst Inst Inst InstTag Inst Inst Inst InstTag Inst Inst Inst Inst

Tag Inst Inst Inst Inst

Tag Inst Inst Inst Inst

000001010011

111

PC: xxx01100 00 01 10 11

Inst Inst Inst

Inst

Cycle 1

Cycle 2

Inst Inst Inst

Might not be ½ by combining with the next fetch

Page 19: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Reducing Fetch Fragmentation (1/2)• Make |Fetch Group| < |L1-I Line|

Deco

der

Tag Inst Inst Inst InstInst Inst Inst Inst

Tag Inst Inst Inst Inst

TagInst Inst Inst InstInst Inst Inst Inst

Inst Inst Inst InstCache Line

PC

Can deliver N insns. when PC > N from end of line

Page 20: CSE502: Computer Architecture Superscalar Decode

CSE502: Computer Architecture

Reducing Fetch Fragmentation (2/2)• Needs a “rotator” to decode insns. in correct order

Deco

der

Tag Inst Inst Inst InstInst Inst Inst Inst

Tag Inst Inst Inst Inst

TagInst Inst Inst InstInst Inst Inst Inst

Inst Inst Inst Inst

Rotator

Inst Inst Inst Inst

Aligned fetch group

PC