ee382a lecture 4: instruction fetch techniquesinstruction

EE382A Lecture 4:

Instruction Fetch TechniquesInstruction Fetch Techniques

Department of Electrical EngineeringStanford University

EE382A – Autumn 2009 John P. ShenLecture 4 - 1

Stanford University

http://eeclass.stanford.edu/ee382a

Announcements

• Start thinking about the research projectT 2 3 t d t– Team: 2-3 students

– Topic: an open problem in computer architecture• See our suggestions or propose your own• Don’t limit yourselves to scope of lectures• Don t limit yourselves to scope of lectures

– Project proposal due by 10/14

N i d• No new required paper– But you have one from previous lecture due today– One optional paper on-line, don’t forget textbook readings

• Class photos– Upload your picture on axess


– Upload your picture on axess

A Dynamic Superscalar Processor

Instruction/decode buffer

Fetch

Decode

Dispatch buffer

Dispatch

In o

rder

Reservationstations

Issue

Finish

Execute

Out

of

orde

r

Reorder/completion buffer

Store buffer

Complete

In o

rder


Sto e bu e

Retire

Lecture 4 Outline

1. Superscalar Pipeline Implementationa. Instruction Fetch & Decodeb. Instruction Dispatch & Issuec. Instruction Executec. Instruction Executed. Instruction Complete & Retire

2. Instruction Flow Techniques


1 Superscalar Pipeline Implementation1. Superscalar Pipeline Implementation


a. Instruction Fetch & Decode

• Goal of instruction fetch & decode– Supply pipeline with maximum number of useful instructions per cycle– The fetch logic sets the maximum possible performance (IPCmax)

• Impediments– Instruction cache misses

I t ti li tInstruction Memory

PC

– Instruction alignment– Complex instruction sets

• CISC (x86, 390, etc)

y

3 instructions fetched– Branches and jumps

• Determining the instruction address (branch direction & targets) • Will start in this lecture & finish discussion in next one…

3 instructions fetched


Instruction Cache Organization

1 cache line = 1 physical row 1 cache line = 2 physical rows


The Fetch Alignment Problem

PC � XX00001PC XX00001

00000 01 10 11

001001

w d

ecod

er

111Row

Fetch group


g p

Row width

Solving the Alignment Problem

• Software solution: align taken branch targets to I-cache row startsEffect on code size?– Effect on code size?

– Effect on the I-cache miss rate? – What happens when we go to the next chip?

• Hardware solutionDetect (mis)alignment case– Detect (mis)alignment case

– Allow access of multiple rows (current and sequential next)• True multi-ported cache, over-clocked cache, multi-banked cache, …• Or keep around the cache line from previous access• Or keep around the cache line from previous access

– Assuming a large basic block

– Collapse the two fetched rows into one instruction group


IBM RS/6000 2-way I-Cacheand Fetch Logicand Fetch Logic

Tags

Tags


Instruction Decoding Issues

• Primary decoding tasksIdentify individual instructions for CISC ISAs– Identify individual instructions for CISC ISAs

– Determine instruction types (especially branches)• Drop instructions after (predicted) taken branches and jumps

P t ti ll t t th f t h i li f t t dd• Potentially restart the fetch pipeline from target address

– Determine dependences between instructions

• Two important factors– Instruction set architecture (CISC vs RISC)

• Determines decoding difficulty• Determines decoding difficulty

– Pipeline width• Determines the number of comparators needed for dependence detection


Pentium Pro Fetch/Decode


I-Cache Pre-decoding in the AMD K5

From memory

Byte1 Byte2 Byte88 instruction bytes 64

Predecode logic5 bits 5 bits 5 bits

Byte1 Byte2 Byte88 instruction bytes � predecode bits 64 � 40

I-Cache

16 instruction bytes � predecode bits 128 � 80

Decode, translate,d di h • Overheads of pre-decoding?

Up to 4 ROPs ROP1 ROP2 ROP3 ROP4

and dispatch • Overheads of pre-decoding?

• Any uses for RISC processors?


b. Instruction Dispatch & Issue

Instruction fetching

Instruction decoding

Instruction dispatching

FU nFU 3FU 2FU 1


Instruction execution

Centralized Reservation Station

Centralized reservationt ti (di t h b ff )

Dispatch(i ) station (dispatch buffer)(issue)

Execute

Completion buffer


Completion buffer

Distributed Reservation Station

Dispatch bufferDispatch

Distributedreservationstations

Issue

Execute

Completion buffer

Finish


Completion buffer

Complete

Instruction Fetch Buffer

FetchUnit

Out-of-orderCore

• Smoothes out the rate mismatch between fetch and execution– Neither the fetch bandwidth nor the execution bandwidth is consistent

F t h b d idth h ld b hi h th ti b d idth• Fetch bandwidth should be higher than execution bandwidth– We prefer to have a stockpile of instructions in the buffer to hide cache

miss latencies. This requires both raw cache bandwidth + control flow l ti


speculation

c. Instruction Execute

• Current trends– More parallelism bypassing very challenging– More parallelism bypassing very challenging– Deeper pipelines– More diversity

• Functional unit types– Integer– Integer– Floating point– Load/store most difficult to make parallel– Branch– Specialized units (media)


Bypass Networks

PCI-Cache

BRFetch Q BR Scan

BR Predict

Fetch Q

Decode

R d B ffBR/CRFX/LD 1 FX/LD 2FP Reorder BufferBR/CRIssue Q

CRUnit

BRUnit

FX/LD 1Issue Q

FX1Unit LD1

Unit

FX/LD 2Issue Q

LD2Unit

FX2Unit

FPIssue Q

FP1Unit

FP2Unit

• O(n2) interconnect from/to FU inputs and outputs

StQ

D-Cache

• O(n ) interconnect from/to FU inputs and outputs• Associative tag‐match to find operands• Solutions (hurt IPC, help cycle time)

– Use RF only (IBM Power4) with no bypass network


Use RF only (IBM Power4) with no bypass network– Decompose into clusters (Alpha 21264)

Specialized units

Intel Pentium 4 d ddstaggered adders

– FireballRun at 2x clock f

Carry

frequencyTwo 16‐bit bitslicesDependent ops execute on half‐cycle boundariesFull result not

l bl l f llavailable until full cycle later


Specialized units

FP multiply‐accumulate

ALU 0ALU 2Shift

accumulateR = (A x B) + CDoubles FLOP/instruction

A � B

(A � B) � C

FLOP/instructionLose RISC instruction format symmetry:

3 source operands

ALU C

(A � B) � C

Round/Normalize

– 3 source operandsWidely used

(a) (b)


(a) (b)

d. Instruction Complete & Retire

• Out‐of‐order execution– ALU instructionsALU instructions– Load/store instructions

• In‐order completion/retirement– Precise exceptions– Memory coherence and consistency

• Solutions• Solutions– Reorder buffer– Store buffer– Load queue snooping (later)


Exceptional Limitations

• Precise exceptions: when exception occurs on instruction i– Instruction i has not modified the processor state (regs, memory)– All older instructions have fully completed– No younger instruction has modified the processor state (regs, memory)

• How do we maintain precise exception in a simple pipelined ?processor?

• What makes precise exceptions difficult on a – In-order diversified pipeline?– Out-of-order pipeline?


Out of order pipeline?

Precise Exceptions & OOO Processors

• Solution: force instructions to update processor state in-orderRegister and memory updates done in order– Register and memory updates done in order

– Usually called instruction retirement or graduation or commitment

• Implementation: Re-order Buffer (ROB)– A FIFO for instruction tracking– 1 entry per instruction

• PC, register/memory address, new value, ready, exceptiong y y p

• ROB algorithm– Allocate ROB entries at dispatch time in-order (FIFO)

I t ti d t th i ROB t h th l t– Instructions update their ROB entry when they complete– Examine head of ROB for in-order retirement

• If done retire, otherwise wait (this forces order)If ti fl h t t f ROB t t f i t ti PC ft h dl


• If exception, flush contents of ROB, restart from instruction PC after handler

Re-order Buffer

In-order Allocate

Out-of-order Complete

In-order Retire


PC Dst Value Dst Address Ready Exception Info

Re-order Buffer Issues

• ProblemAl d t d lt t it i th d b ff h th b– Already computed results must wait in the reorder buffer when they may be needed by other instructions which could otherwise execute.

– Data dependent instructions must wait until the result has been committed to registerregister

• Solution– Forwarding from the re-order buffer

• Allows data in ROB to be used in place of data in registers– Forwarding implementation 1: search ROB for values when registers read

• Only latest entry in ROB can be used• Many comparators but logic is conceptually simple• Many comparators, but logic is conceptually simple

– Forwarding implementation 2: use score-board to track results in ROB• Register scoreboard notes if latest value in ROB and # of ROB entry• Don’t need to track FU any more; an instruction is fully identified by ROB entry


Don t need to track FU any more; an instruction is fully identified by ROB entry

ROB Alternatives: History File

• A FIFO buffer operated similarly to the ROB but logs old register valuesEntry format just like ROB– Entry format just like ROB

• Algorithm– Entries allocated in-order at dispatch – Entries updated out-of-order at completion time

• Destination register updated immediately• Old value of register noted in re-order buffer

– Examine head of history file in-order• If no exception just de-allocate• If exception, reverse history file and undo all register updates before flushing

• Advantage: no need for separate forwarding from ROB

• Disadvantage: slower recovery from exceptions


ROB Alternatives: Future File

• Use two separate register files:Architectural file ⇒ Represents sequential execution– Architectural file ⇒ Represents sequential execution.

– Future file ⇒ Updated immediately upon instruction execution and used as the working file.

• Algorithm– When instruction reaches the head of ROB, it is committed to the

architectural file– On an exception changes are brought over from the architectural file to the

future file based on which instructions are still represented in the ROB

• Advantage: no need for separate forwarding from ROBg p g

• Disadvantage: slower recovery from exceptions


Impediments to High IPC

I-cache

FETCHBranchPredictor Instruction

Buffer

InstructionFlow

DECODE

Buffer

Integer Floating-point Media Memory

MemoryData

EXECUTE Flow

COMMITD-cacheStore

Queue

ReorderBufferRegister

Data (ROB)

Flow


Queue

2 Instruction Flow Techniques2. Instruction Flow Techniques


Control Flow Graph

• Shows possible paths of control flow through basic blocksBB 1

BB 2

main: addi r2, r0, A addi r3, r0, B addi r4, r0, C BB 1 addi r5, r0, N add r10,r0, r0

b 10 5 d

BB 3 BB 4

bge r10,r5, end loop: lw r20, 0(r2) lw r21, 0(r3) BB 2 bge r20,r21,T1 sw r21, 0(r4) BB 3

b T2

BB 5

b T2 T1: sw r20, 0(r4) BB 4 T2: addi r10,r10,1 addi r2, r2, 4 addi r3, r3, 4 BB 5

• Control DependenceNode X is control dependant on Node Y if the computation in Y determines

add 3, 3, 5 addi r4, r4, 4 blt r10,r5, loop end:


– Node X is control dependant on Node Y if the computation in Y determines whether X executes

Mapping CFG to Linear Instruction Sequence

AA

BBB

C

D

D

C


Disruption of Sequential Control Flow

Instruction/Decode Buffer

Fetch

Instruction/Decode Buffer

Dispatch Buffer

Decode

Reservation

Dispatch

StationsI StationsIssue

Execute

Branch

Reorder/

Complete

FinishCompletion Buffer


Store Buffer

Retire

Condition Resolution

Decode Buffer

Fetch

Dispatch Buffer

Decode

Dispatch

CCreg.

GPreg.value

Reservation

p

StationsIssue

valuecomp.

Execute

Branch

Complete

Finish Completion Buffer


Store Buffer

Retire

Target Address Generation

Decode Buffer

Fetch

PC-rel.

Dispatch Buffer

Decode

Dispatch

rel.Reg.ind.

Reg.ind.with

Reservation

p

StationsIssue

withoffset

Execute

Branch

Complete



Store Buffer

Retire

What’s So Bad About Branches?

• Performance Penalties– Use up execution resources– Fragmentation of I-Cache lines– Disruption of sequential control flows up o o seque a co o o

• Need to determine branch direction (conditional branches)• Need to determine branch target

Robs instruction fetch bandwidth and ILP

E l• Example:– We have an N-way superscalar processor (N is very large)– A branch every 5 instructions and each branch takes 3 cycles to resolve


– What is the effective fetch bandwidth?

Riseman and Foster’s Study

• 7 benchmark programs on CDC-3600

• Assume infinite machine:– Infinite memory and instruction stack, register file, functional units

Consider only true dependency at data-flow limity p y

• If bounded to single basic block, i.e. no bypassing of branches ⇒maximum speedup is 1.72

• Suppose one can bypass conditional branches and jumps (i.e. assume the actual branch path is always known such that branches do not impede instruction execution)p )

Br. Bypassed: 0 1 2 8 32 128

Max Speedup:1 72 2 72 3 62 7 21 24 4 51 2


Max Speedup:1.72 2.72 3.62 7.21 24.4 51.2

Determining Branch Direction

Problem: Cannot fetch subsequent instructions until branch direction is d t i ddetermined

• Minimize penalty– Move the instruction that computes the branch condition away fromMove the instruction that computes the branch condition away from

branch (ISA & compiler)– 3 branch components can be separated

S if d f BB if diti if t t• Specify end of BB, specify condition, specify target

• Make use of penalty– Bias for not-taken– Fill delay slots with useful/safe instructions (ISA &compiler)– Follow both paths of execution (hardware)


– Predict branch direction (hardware)

Determining Branch Target

Problem: Cannot fetch subsequent instructions until branch target is determineddetermined

• Minimize delayy– Generate branch target early in the pipeline (ISA & compiler)

• Make use of delay– Bias for not taken– Predict branch target (hardware)Predict branch target (hardware)

PC-relative vs Register Indirect targets


Branch Condition Speculation Options

• Software PredictionEncode an extra bit in the branch instruction– Encode an extra bit in the branch instruction

• Predict not taken: set bit to 0• Predict taken: set bit to 1

Bit set by compiler or user; can use profiling– Bit set by compiler or user; can use profiling– Static prediction, same behavior every time

• Biased For Not Taken– Does not affect the instruction set architecture– Not effective for loops

• Prediction Based on Branch Offsets• Prediction Based on Branch Offsets– Positive offset: predict not taken– Negative offset: predict taken


• Prediction Based on History

Branch Instruction SpeculationnPC to Icache

Fetch

nPC to Icache

nPC(seq.) = PC+4PCBranch

specu. target

prediction FA-mux

Decode Buffer

Fetch

Decode

BranchPredictor(using a BTB)

BTB

specu. cond.

Dispatch Buffer

Dispatch

update(target addr.and history)

ReservationStations

Issue

Execute

Branch



Branch Target Buffer (BTB)

• A small “cache-like” memory in the instruction fetch stage

……. ……. ……

currentPC

Branch Inst.Address (tag)

Branch History

Branch Target(Most Recent)

• Remembers previously executed branches, their addresses, information to aid prediction, and most recent target addresses

• I-fetch stage compares current PC against those in BTB to “guess” nPC– If matched then prediction is made else nPC=PC+4– If predict taken then nPC=target address in BTB else nPC=PC+4


• When branch is actually resolved, BTB is updated

More on BTB (aka BTAC)

• Typically a large associative structure Pentium3: 512 entries 4 way SA– Pentium3: 512 entries, 4-way SA

– Opteron: 2K entries, 4-way SA• Entry format

V lid bit dd t (PC) t t dd f ll th h BB dd (l th– Valid bit, address tag (PC), target address, fall-through BB address (length of BB), branch type info, branch direction prediction

• BTB provides both target and direction prediction– More on direction prediction later on

• Multi-cycle BTB access? – The case in most modern processors (2 cycle BTB)– Start BTB access along with I-cache in cycle 0– In cycle 1, fetch from BTB+N (predict not-taken)– In cycle 2, use BTB output to verify


• 1 cycle fetch bubble if branch was taken

I-Cache & BTB Integration

• Why does this make sense?

Pl BTB t i h h li• Place a BTB entry in each cache line– Each cache line tells you which line to address next– Do not need the full PC, just an index the cache + a way select– This is called way & line prediction

Implemented in Alpha 21264 processor• Implemented in Alpha 21264 processor– On refills, prediction value points to the next sequential fetch line.– Prediction is trained later on as program executes…

• When correct targets are known…

– Line prediction is verified in next cycle. If line-way prediction is incorrect, slot stage is flushed and PC generated using instruction info and direction


prediction

Alpha 21264 Line & Way Predictionp y


Source: IEEE Micro, March-April 1999

Branch Direction UCB Study [Lee and Smith 1984]UCB Study [Lee and Smith, 1984]

• Benchmarks– 26 programs (traces on IBM 370, DEC PDP-11, CDC 6400)p g ( , , )– Use trace-driven simulation with parameterized machine models

• Branch types– Unconditional: always taken or always not taken– Subroutine call: always taken– Loop control: usually taken (loop back)– Decision: either way, e.g. IF-THEN-ELSE– Computed GOTO: always taken, with changing targetComputed GOTO: always taken, with changing target– Supervisor call: always taken– “Execute”: always taken (IBM 370)

• Branch behavior: Taken vs Not TakenIBM1 IBM2 IBM3 IBM4 DEC CDC Average

T 0.640 0.657 0.704 0.540 0.738 0.778 0.676NT 0.360 0.343 0.296 0.460 0.262 0.222 0.324


Branch Prediction Function

• Based on opcode only (%)IBM1 IBM2 IBM3 IBM4 DEC CDCIBM1 IBM2 IBM3 IBM4 DEC CDC66 69 71 55 80 78

• Based on history of branchBased on history of branch– Branch prediction function F (X1, X2, .... )– Use up to 5 previous branches for history (%)

IBM1 IBM2 IBM3 IBM4 DEC CDCIBM1 IBM2 IBM3 IBM4 DEC CDC0 64.1 64.4 70.4 54.0 73.8 77.81 91.9 95.2 86.6 79.7 96.5 82.32 93 3 96 5 90 8 83 4 97 5 90 62 93.3 96.5 90.8 83.4 97.5 90.63 93.7 96.7 91.2 83.5 97.7 93.54 94.5 97.0 92.0 83.7 98.1 95.35 94.7 97.1 92.2 83.9 98.2 95.7


Example Prediction Algorithm based on History

• Prediction accuracy approaches maximum with as few as 2 preceding branch occurrences used as historypreceding branch occurrences used as history

TT

last two branches

TTT

N

NTT

TT

TTT next prediction

N

TNTTNT

NNN

N

TN

Results (%)IBM1 IBM2 IBM3 IBM4 DEC CDC93 3 96 5 90 8 83 4 97 5 90 6

NN


93.3 96.5 90.8 83.4 97.5 90.6

Other Prediction Algorithms Based on History

N

TN

TN

Tt

Tt?tt?

T NSaturation Hysteresis

N

T

TNTn? N

TNT

T N

n? nn

T NCounter Counter

TT

TNN

• Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide the net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.


Designing History-based FSMs for Prediction

– 2-bit FSM as good as n-bit FSM– Saturating counter as good as any FSM IBM Study [Nair, 1992]g g y

• 220 possible FSMs, 5248 actually interestingBenchmark Optimal “Counter”

spice2g6 97.2 97.0

N

Tspice2g6 97.2 97.0

doduc 94.3 94.3 *

T

gcc 89.1 89.1

espresso 89.1 89.1 *

*

li 87.1 86.8

eqntott 87.9 87.2

*

*


* Initial state Predict NT Predict T

Simple Branch History Table (BHT)

• Can be integrated with the BTB

• See any shortcoming?


PowerPC 604 Branch Predictor

FA

FA-m

ux

I-cache

FA FA

FAR

Branch TargetAddress Cache

x

Branch HistoryTable (BHT)

BTAC

(BTAC)+4

Decode Buffer

Dispatch Buffer

DecodeBHT update

updateBHT prediction

Dispatch Buffer

ReservationDispatch

StationsSFX SFX CFX FPU LSBRN

BTAC predictionBTAC:- 64 entries- fully associative

hit => predict takenIssue

ExecuteBranch

- hit => predict taken

BHT:- 512 entries- direct mapped

2 bit saturating counter



- 2-bit saturating counter history based prediction- overrides BTAC prediction

Branch Correlation

• So far, the prediction of each static branch instruction is based solely on its own past behavior and not the behaviors of other neighboringon its own past behavior and not the behaviors of other neighboring static branch instructions

• How about this one? x=0;If (someCondition) x=3; /* Branch A*/If (someOtherCondition) y+=19; /* Branch B*/If (someOtherCondition) y 19; / Branch B /If (x<=0) dosomething(); /* Branch C*?

• Other correlation examples?


Global History Branch Predictor

• BHR: a shift register for global history (Shift in latest result in each cycle)

• Advantages & shortcomings?


Local History Branch Predictor

• What could be the patterns detected by such a predictor?

• Advantages & shortcomings?


2-Level Adaptive Prediction [Yeh & Patt]

Nomenclature: {G,P}A{g,p,s}{ , } {g,p, }Pattern History Table (PHT)PC

00...0000...0100...10

Branch History Shift

(shift left when update)Register (BHSR)

1 0 1 1 1 PHTBits

old1 1 1 0 0

11...1011...11 Prediction

index

FSM

new

1 1 1 1 0

To achieve 97% average prediction accuracy: G (1) BHR: 18 bits; g (1) PHT: 218 x 2 bits total = 524 kbits

Branch ResultLogic


P (512x4) BHR: 12 bits; g (1) PHT: 212 x 2 bits total = 33 kbitsP (512x4) BHR: 6 bits; s (512) PHT: 26 x 2 bits total = 78 kbits

Example: Global BHSR Scheme (GAs)

Branch Addressj bitj bits

ctio

n

Branch HistoryShift Register (BHSR)

Pre

di

k bits


BHT of 2 x 2 j+k

Example: Per-Branch BHSR Scheme (PAs)

Branch Address

j bits Standard BHT

Branch HistoryShift R i t (BHSR)

j bitsi bits

Standard BHT

ctio

n

yShift Register (BHSR)k x 2 i

Pre

di

k bit


BHT of 2 x 2 j+kk bits

Gshare Branch Prediction [McFarling]

Branch Address

j bits

xor

tionBranch History

Shift Register (BHSR)

Pre

dict

k bits


BHT of 2 x 2 max(j,k)

Next Lecture

• More Branch Prediction


ee382a lecture 4: instruction fetch techniquesinstruction

Documents