IBM’s Experience on Pipelined Processors [Agerwala and Cocke 1987]

Attributes and Assumptions:

Memory Bandwidth
- at least one word/cycle to fetch 1 instruction/cycle from I-cache
- 40% of instructions are load/store, require access to D-cache

Code Characteristics (dynamic)
- loads - 25%
- stores - 15%
- ALU/RR - 40%
- branches - 20%
  - 1/3 unconditional (always taken); 1/3 conditional taken; 1/3 conditional not taken


More Statistics and Assumptions

Cache Performance
- hit ratio of 100% is assumed in the experiments
- cache latency: I-cache = i; D-cache = d; default: i = d = 1 cycle

Load and Branch Scheduling

loads:
• 25% cannot be scheduled
• 75% can be moved back 1 instruction

branches:
• unconditional - 100% schedulable
• conditional - 50% schedulable


CPI Calculations I

No cache bypass of reg. file, no scheduling of loads or branches
- Load Penalty: 2 cycles (0.25 * 2 = 0.5)
- Branch Penalty: 2 cycles on taken branches (0.2 * 2/3 * 2 ≈ 0.27)
- Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI

Bypass, no scheduling of loads or branches
- Load Penalty: 1 cycle (0.25 * 1 = 0.25)
- Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
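As a quick check, both CPI figures can be recomputed directly from the stated instruction mix. A minimal sketch; the variable names are ours, not the slides':

```python
# Instruction-mix assumptions from the slides.
load_frac = 0.25          # loads are 25% of dynamic instructions
branch_frac = 0.20        # branches are 20%
taken_frac = 2.0 / 3.0    # 1/3 unconditional + 1/3 conditional taken

# Case I: no cache bypass of the register file, no scheduling.
load_penalty = load_frac * 2                     # 2-cycle load penalty
branch_penalty = branch_frac * taken_frac * 2    # 2 cycles per taken branch
cpi_no_bypass = 1 + load_penalty + branch_penalty   # about 1.77

# Case II: bypassing cuts the load penalty to 1 cycle.
cpi_bypass = 1 + load_frac * 1 + branch_penalty     # about 1.52
```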


CPI Calculations II

Bypass, scheduling of loads and branches

Load Penalty:
- 75% can be moved back 1 instruction => no penalty
- remaining 25% => 1 cycle penalty (0.25 * 0.25 * 1 = 0.063)

Branch Penalty:
- 1/3 unconditional, 100% schedulable => 1 cycle (0.2 * 1/3 = 0.067)
- 1/3 conditional not taken, if biased for NT => no penalty
- 1/3 conditional taken:
  - 50% schedulable => 1 cycle (0.2 * 1/3 * 0.5 = 0.033)
  - 50% unschedulable => 2 cycles (0.2 * 1/3 * 0.5 * 2 = 0.067)

Total CPI: 1 + 0.063 + 0.167 = 1.23 CPI
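The scheduled-case arithmetic above can be replayed in a few lines (a sketch; the helper names are ours):

```python
# CPI with bypassing plus load/branch scheduling, per the slide's fractions.
load_penalty = 0.25 * 0.25 * 1                 # 25% of loads unschedulable, 1 cycle

uncond  = 0.2 * (1/3) * 1.0                    # 100% schedulable -> 1 cycle each
cond_nt = 0.0                                  # biased for not-taken -> no penalty
cond_t  = 0.2 * (1/3) * (0.5 * 1 + 0.5 * 2)    # 50% schedulable (1 cy), 50% not (2 cy)
branch_penalty = uncond + cond_nt + cond_t

cpi = 1 + load_penalty + branch_penalty        # about 1.23
```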


CPI Calculations III

Parallel target address generation
- 90% of branches can be coded as PC-relative, i.e. the target address can be computed without register access
- A separate adder can compute (PC + offset) in the decode stage

Branch Penalty:

PC-relative addressing | Schedulable | Branch penalty
YES (90%)              | YES (50%)   | 0 cycles
YES (90%)              | NO  (50%)   | 1 cycle
NO  (10%)              | YES (50%)   | 1 cycle
NO  (10%)              | NO  (50%)   | 2 cycles

Total CPI: 1 + 0.063 + 0.087 = 1.15 CPI = 0.87 IPC
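The 0.087 branch penalty follows from the table if, as the totals imply, the table applies to all conditional branches (2/3 of branches, 50% schedulable) while unconditional branches (1/3) are 100% schedulable. A sketch checking this reading; the function name is ours:

```python
# Penalty table from the slide: (PC-relative?, schedulable?) -> cycles.
penalty = {(True, True): 0, (True, False): 1,
           (False, True): 1, (False, False): 2}

def expected_penalty(p_pcrel, p_sched):
    # Average penalty over the four (PC-relative, schedulable) cases.
    return sum(penalty[(rel, sch)]
               * (p_pcrel if rel else 1 - p_pcrel)
               * (p_sched if sch else 1 - p_sched)
               for rel in (True, False) for sch in (True, False))

branch_frac = 0.2
branch_penalty = branch_frac * ((2/3) * expected_penalty(0.9, 0.5)
                                + (1/3) * expected_penalty(0.9, 1.0))
cpi = 1 + 0.063 + branch_penalty       # load penalty 0.063 as before
```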


Pipeline Depth

Processor Performance = 1 / Time

Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
             = (code size) x (CPI) x (cycle time)

Unpipelined: an instruction takes the full logic delay T (plus latch overhead S).
k-stage pipelined: each stage takes T/k, so the cycle time is (T/k + S).

What speedup does the k-stage pipeline achieve over the unpipelined machine?
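The question above can be answered with a small sketch, using T and S as defined above (the function name and the (k + N - 1)-cycle fill model are ours):

```python
# Speedup of a k-stage pipeline over an unpipelined machine.
# N instructions take N*T unpipelined; pipelined, the first instruction
# needs k cycles and each later one adds a cycle of (T/k + S).
def pipeline_speedup(T, S, k, n_instr):
    unpipelined = n_instr * T
    pipelined = (k + n_instr - 1) * (T / k + S)
    return unpipelined / pipelined

# For large N this approaches T / (T/k + S): latch overhead S, not k,
# eventually limits how much deeper pipelining can help.
```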


Limitations of Scalar Pipelines

Upper Bound on Scalar Pipeline Throughput
- Limited by IPC = 1

Inefficient Unification Into Single Pipeline
- Long latency for each instruction

Performance Lost Due to Rigid Pipeline
- Unnecessary stalls


Stalls in an Inorder Scalar Pipeline

[Figure: a stalled instruction blocks the pipeline; bypassing of the stalled instruction is not allowed, and the stall propagates backward to the earlier stages.]

Instructions are in order with respect to any one stage, i.e. no dynamic reordering.


Architectures for Instruction-Level Parallelism

Scalar Pipeline (baseline)
- Instruction Parallelism = D
- Operation Latency = 1
- Peak IPC = 1

[Figure: successive instructions 1-6 issued one per cycle into the IF, DE, EX, WB stages over cycles 0-9 of the baseline machine; D successive instructions are in flight at once.]


Superpipelined Machine

Superpipelined Execution
- IP = D x M
- OL = M minor cycles
- Peak IPC = 1 per minor cycle (M per baseline cycle)

[Figure: instructions 1-6 entering IF, DE, EX, WB one per minor cycle; 1 major cycle = M minor cycles.]


Superscalar Machines

Superscalar (Pipelined) Execution
- IP = D x N
- OL = 1 baseline cycle
- Peak IPC = N per baseline cycle

[Figure: N instructions (1-9 shown in groups of three) entering IF, DE, EX, WB together in each baseline cycle.]


Superscalar and Superpipelined

Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e. if N = M then both have about the same IPC.

Superscalar Parallelism
- Operation Latency: 1
- Issuing Rate: N
- Superscalar Degree (SSD): N (determined by issue rate)

Superpipeline Parallelism
- Operation Latency: M
- Issuing Rate: 1
- Superpipelined Degree (SPD): M (determined by operation latency)

[Figure: superscalar vs. superpipelined execution over cycles 0-13 of the base machine; key: IFetch, Dcode, Execute, Writeback.]


Limitations of Inorder Pipelines

CPI of inorder pipelines degrades very sharply if the machine parallelism is increased beyond a certain point, i.e. when N x M approaches the average distance between dependent instructions
- Forwarding is no longer effective, so the machine must stall more often
- The pipeline may never be full due to frequent dependency stalls!!

[Figure: superscalar IF, DE, EX, WB pipeline with instructions 1-9 in flight.]


What is Parallelism?

Work
- T1 = time to complete the computation on a sequential system

Critical Path
- T_inf = time to complete the same computation on an infinitely-parallel system

Average Parallelism
- Pavg = T1 / T_inf

For a p-wide system:
- Tp >= max{ T1/p, T_inf }
- If Pavg >> p, then Tp is approximately T1/p

Example:
  x = a + b;  y = b * 2;
  z = (x - y) * (x + y)

[Figure: dataflow graph of the example; the add (a + b) and multiply (b * 2) feed a subtract and an add, which feed the final multiply.]
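For the example above, T1 and T_inf can be computed mechanically from the dataflow graph. A sketch, with the dependence lists transcribed by hand (the names are ours):

```python
# Dataflow graph of: x = a + b; y = b * 2; z = (x - y) * (x + y)
# Each node maps to the nodes it depends on (inputs a, b are free).
deps = {
    "x":   [],             # a + b
    "y":   [],             # b * 2
    "x-y": ["x", "y"],
    "x+y": ["x", "y"],
    "z":   ["x-y", "x+y"],
}

def depth(node):
    # Critical-path length ending at this node, in operations.
    return 1 + max((depth(d) for d in deps[node]), default=0)

T1 = len(deps)                       # total work: 5 operations
T_inf = max(depth(n) for n in deps)  # critical path: 3 operations
P_avg = T1 / T_inf                   # 5/3
```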


ILP: Instruction-Level Parallelism

ILP is a measure of the inter-instruction dependences in a program

Average ILP = (no. of instructions) / (no. of cycles required)

code1: ILP = 1, i.e. must execute serially
  r1 <- r2 + 1
  r3 <- r1 / 17
  r4 <- r0 - r3

code2: ILP = 3, i.e. can execute at the same time
  r1 <- r2 + 1
  r3 <- r9 / 17
  r4 <- r0 - r10
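With unit latencies, the ILP of a straight-line sequence is just its length divided by the depth of its RAW chains. A sketch that checks code1 and code2 (representation and names are ours; immediates are omitted since only register dependences matter):

```python
# Each instruction is (dest, *source registers); immediates are dropped.
def ilp(instrs):
    ready = {}       # register -> earliest cycle its value is available
    cycles = 0
    for dest, *srcs in instrs:
        start = max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = start + 1          # unit-latency execution
        cycles = max(cycles, ready[dest])
    return len(instrs) / cycles

code1 = [("r1", "r2"), ("r3", "r1"), ("r4", "r0", "r3")]   # serial RAW chain
code2 = [("r1", "r2"), ("r3", "r9"), ("r4", "r0", "r10")]  # all independent
```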


Inter-instruction Dependences

Data dependence (Read-after-Write, RAW):
  r3 <- r1 op r2
  r5 <- r3 op r4

Anti-dependence (Write-after-Read, WAR):
  r3 <- r1 op r2
  r1 <- r4 op r5

Output dependence (Write-after-Write, WAW):
  r3 <- r1 op r2
  r5 <- r3 op r4
  r3 <- r6 op r7

Control dependence


Scope of ILP Analysis

  r1  <- r2 + 1
  r3  <- r1 / 17
  r4  <- r0 - r3
  r11 <- r12 + 1
  r13 <- r19 / 17
  r14 <- r0 - r20

Analyzed three instructions at a time, the first block is a serial chain: ILP = 1.
Analyzed over the whole six-instruction window, the second block overlaps the first: ILP = 2.

Out-of-order execution permits more ILP to be exploited


Purported Limits on ILP

Weiss and Smith [1984]         1.58
Sohi and Vajapeyam [1987]      1.81
Tjaden and Flynn [1970]        1.86
Tjaden and Flynn [1973]        1.96
Uht [1986]                     2.00
Smith et al. [1989]            2.00
Jouppi and Wall [1988]         2.40
Johnson [1991]                 2.50
Acosta et al. [1986]           2.79
Wedig [1982]                   3.00
Butler et al. [1991]           5.8
Melvin and Patt [1991]         6
Wall [1991]                    7
Kuck et al. [1972]             8
Riseman and Foster [1972]      51
Nicolau and Fisher [1984]      90


Flow Path Model of Superscalars

[Figure: superscalar flow paths. FETCH reads from the I-cache, guided by the branch predictor, into the instruction buffer; DECODE feeds dispatch; EXECUTE comprises integer, floating-point, media, and memory units; COMMIT drains the reorder buffer (ROB) and store queue to the D-cache. Three flows cross the machine: instruction flow, register data flow, and memory data flow.]


Superscalar Pipeline Design

Fetch
  (Instruction Buffer)
Decode
  (Dispatch Buffer)
Dispatch
  (Issuing Buffer)
Execute
  (Completion Buffer)
Complete
  (Store Buffer)
Retire

(Instruction flow runs down the pipeline; the buffers between stages carry the data flow.)


Inorder Pipelines

Intel i486: a single five-stage inorder pipeline: IF, D1, D2, EX, WB.

Intel Pentium: two parallel inorder pipelines (U-pipe and V-pipe), each with IF, D1, D2, EX, WB.

Inorder pipeline: no WAW, no WAR hazards (almost always true)


Out-of-order Pipelining 101

[Figure: an inorder front end (IF, ID, RD) feeding parallel function units (INT, Fadd1-Fadd2, Fmult1-Fmult3, LD/ST) with out-of-order writeback (WB).]

Program Order:
  Ia: F1 <- F2 x F3
  . . .
  Ib: F1 <- F4 + F5

What is the value of F1? WAW!!!

Out-of-order WB:
  Ib: F1 <- "F4 + F5"
  . . .
  Ia: F1 <- "F2 x F3"


Superscalar Execution Check List

INSTRUCTION PROCESSING CONSTRAINTS
- Resource Contention (Structural Dependences)
- Code Dependences
  - Control Dependences
  - Data Dependences
    - True Dependences (RAW)
    - Storage Conflicts
      - Anti-Dependences (WAR)
      - Output Dependences (WAW)


In-order Issue into Diversified Pipelines

[Figure: an inorder instruction stream is decoded into the form RD <- Fn(RS, RT), naming a destination register, function unit, and source registers, then issued to diversified pipelines: INT, Fadd1-Fadd2, Fmult1-Fmult3, LD/ST.]

Issue stage needs to check:
1. Structural Dependence
2. RAW Hazard
3. WAW Hazard
4. WAR Hazard


Simple Scoreboarding

Scoreboard: a bit-array, 1 bit for each GPR
- if the bit is not set, the register has valid data
- if the bit is set, the register has stale data, i.e. some outstanding instruction is going to change it

Dispatch in order: "RD <- Fn(RS, RT)"
- if SB[RS] or SB[RT] is set => RAW, stall
- if SB[RD] is set => WAW, stall
- else dispatch to FU, set SB[RD]

Complete out-of-order:
- update GPR[RD], clear SB[RD]
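The dispatch rules above fit in a few lines of code. A minimal sketch of the 1-bit-per-GPR scoreboard (class and method names are ours; registers are plain indices):

```python
# One busy bit per GPR: set while an outstanding instruction will write it.
class SimpleScoreboard:
    def __init__(self, n_regs=32):
        self.busy = [False] * n_regs

    def can_dispatch(self, rd, rs, rt):
        if self.busy[rs] or self.busy[rt]:   # RAW hazard: stale source
            return False
        if self.busy[rd]:                    # WAW hazard: pending write to rd
            return False
        return True

    def dispatch(self, rd, rs, rt):
        assert self.can_dispatch(rd, rs, rt)
        self.busy[rd] = True                 # result now outstanding

    def complete(self, rd):
        self.busy[rd] = False                # GPR[rd] holds valid data again
```

Dispatch is in order, so a single stalled instruction blocks everything behind it; completion may still occur out of order.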


Scoreboarding Example

[Worksheet: FU status columns (Int, Fadd, FMult, FDiv, WB) and scoreboard bits (R0-R5) for cycles t0-t11. At t0, i1 occupies FDiv and sets R3's bit; at t1, i2 occupies Int and sets R1's bit; the rest is to be filled in.]

i1: FDIV R3, R3, R2
i2: LD   R1, 0(R6)
i3: FMUL R0, R1, R2
i4: FDIV R4, R3, R1
i5: FSUB R5, R0, R3
i6: FMUL R3, R3, R1

Assume 1 issue per cycle.


Scoreboarding Example (filled in)

[Table: cycle-by-cycle FU status (Int, Fadd, FMult, FDiv, WB) and scoreboard bits for t0-t11. i1 holds FDiv from t0 with R3's bit set; i2 and then i3 proceed once R1 is written; i4 (FDIV) waits on R3, i5 (FSUB) waits on R0 and R3, and i6's writeback at t11 finally clears R3.]

Can WAW really happen here? Can we go to multiple issue? Can we go to out-of-order issue?

i1: FDIV R3, R3, R2
i2: LD   R1, 0(R6)
i3: FMUL R0, R1, R2
i4: FDIV R4, R3, R1
i5: FSUB R5, R0, R3
i6: FMUL R3, R3, R1


Out-of-Order Issue

[Figure: an inorder instruction stream passes through IF, ID, and a dispatch stage (RD), then issues out of order to the INT, Fadd, Fmult, and LD/ST units (EX), with out-of-order writeback (WB).]


Scoreboarding for Out-of-Order Issue

Scoreboard: one entry per GPR (what do we need to record?)

Dispatch in order: "RD <- Fn(RS, RT)"
- if FU is busy => structural hazard, stall
- if SB[RD] is set => WAW, stall
- if SB[RS] or SB[RT] is set => RAW (what to do??)

Issue out-of-order: (when?)

Complete out-of-order:
- update GPR[RD], clear SB[RD]
(what about WAR?)


Scoreboard for Out-of-Order Issue [H&P pp. 242-251]

Function Unit Status
Name     Busy  Op  Fi  Fj  Fk  Qj  Qk  Rj   Rk
Integer  Yes   Fn  RD  RS  RT          Yes  No
FAdd
FMult
LD/ST

Register Result Status (a.k.a. Scoreboard)
     R0  R1  R2  R3  R4  R5  R6  . . . . . . . .
FU

For "RD <- Fn(RS, RT)":
- Qj, Qk: which FU is computing the new source value, if it is not yet ready?
- Result(RD): which FU is going to update the register?


Scoreboard Management: "RD <- Fn(RS, RT)"

Dispatch
  Wait until: not Busy(FU) and not Result('RD')
  Bookkeeping: Busy(FU) <- yes; Op(FU) <- Fn;
               Fi(FU) <- 'RD'; Fj(FU) <- 'RS'; Fk(FU) <- 'RT';
               Qj(FU) <- Result('RS'); Qk(FU) <- Result('RT');
               Rj(FU) <- not Qj(FU); Rk(FU) <- not Qk(FU);
               Result('RD') <- FU

Issue (read operands)
  Wait until: Rj(FU) and Rk(FU)
  Bookkeeping: Rj(FU) <- no; Rk(FU) <- no; Qj(FU) <- 0; Qk(FU) <- 0

Execution complete
  Wait until: functional unit done

Write result
  Wait until: for all f: ( Fj(f) != Fi(FU) or Rj(f) == No )
                     and ( Fk(f) != Fi(FU) or Rk(f) == No )
  Bookkeeping: for all f: if Qj(f) == FU then Rj(f) <- yes;
               for all f: if Qk(f) == FU then Rk(f) <- yes;
               Result(Fi(FU)) <- 0; Busy(FU) <- no

Legend: FU = the function unit used by the instruction; Fj(X) = content of entry Fj for function unit X; Result(X) = register result status entry for register X.


Scoreboarding Example 1/3

Instruction Status
Instruction        Dispatch  Read Operands  Execution Complete  Write Result
LD F6, 43(R2)         X           X                 X                X
LD F2, 45(R3)         X           X                 X
MULTD F0, F2, F4      X
SUBD F8, F6, F2       X
DIVD F10, F0, F6      X
ADDD F6, F8, F2

Function Unit Status
Name        Busy  Op     Fi   Fj  Fk  Qj       Qk       Rj   Rk
Integer(1)  Yes   LD     F2   R3                        No
Mult1(10)   Yes   MULTD  F0   F2  F4  Integer           No   Yes
Mult2(10)   No
Add(2)      Yes   SUBD   F8   F6  F2           Integer  Yes  No
Div(40)     Yes   DIVD   F10  F0  F6  Mult1             No   Yes

Register Result Status (a.k.a. Scoreboard)
     F0     F2       F4  F6  F8   F10     F12  . . . . . . . .
FU   Mult1  Integer          Add  Divide


Scoreboarding Example 2/3

Instruction Status
Instruction        Dispatch  Read Operands  Execution Complete  Write Result
LD F6, 43(R2)         X           X                 X                X
LD F2, 45(R3)         X           X                 X                X
MULTD F0, F2, F4      X           X                 X
SUBD F8, F6, F2       X           X                 X                X
DIVD F10, F0, F6      X
ADDD F6, F8, F2       X           X                 X

Function Unit Status
Name        Busy  Op     Fi   Fj  Fk  Qj     Qk  Rj  Rk
Integer(1)  No
Mult1(10)   Yes   MULTD  F0   F2  F4             No  No
Mult2(10)   No
Add(2)      Yes   ADDD   F6   F8  F2             No  No
Div(40)     Yes   DIVD   F10  F0  F6  Mult1      No  Yes

Register Result Status (a.k.a. Scoreboard)
     F0     F2  F4  F6   F8  F10     F12  . . . . . . . .
FU   Mult1          Add      Divide


Scoreboarding Example 3/3

Instruction Status
Instruction        Dispatch  Read Operands  Execution Complete  Write Result
LD F6, 43(R2)         X           X                 X                X
LD F2, 45(R3)         X           X                 X                X
MULTD F0, F2, F4      X           X                 X                X
SUBD F8, F6, F2       X           X                 X                X
DIVD F10, F0, F6      X           X                 X
ADDD F6, F8, F2       X           X                 X                X

Function Unit Status
Name        Busy  Op    Fi   Fj  Fk  Qj  Qk  Rj  Rk
Integer(1)  No
Mult1(10)   No
Mult2(10)   No
Add(2)      No
Div(40)     Yes   DIVD  F10  F0  F6          No  No

Register Result Status (a.k.a. Scoreboard)
     F0  F2  F4  F6  F8  F10     F12  . . . . . . . .
FU                       Divide


Limitations of Scoreboarding

Consider a scoreboard processor with an infinitely wide datapath.
In the best case, how many instructions can be simultaneously outstanding?

Hints:
- no structural hazards
- one can always write a RAW-free code sequence:
    addi r1, r0, 1;  addi r2, r0, 1;  addi r3, r0, 1;  ...
- think about the x86 ISA with only 8 registers


Contribution to Register Recycling

COMPILER REGISTER ALLOCATION
- Code generation: single assignment, symbolic registers
- Register allocation: map symbolic registers to physical registers; maximize reuse of registers

INSTRUCTION LOOPS
- Reuse the same set of registers in each iteration
- Overlapped execution of different iterations

for (k = 1; k <= 10; k++)
    t += a[i][k] * b[k][j];

$34: mul  $14, $7, 40
     addu $15, $4, $14
     mul  $24, $9, 4
     addu $25, $15, $24
     lw   $11, 0($25)
     mul  $12, $9, 40
     addu $13, $5, $12
     mul  $14, $8, 4
     addu $15, $13, $14
     lw   $24, 0($15)
     mul  $25, $11, $24
     addu $10, $10, $25
     addu $9, $9, 1
     ble  $9, 10, $34


Resolving False Dependences

WAR: must prevent (2) from completing before (1) is dispatched
  (1) R4 <- R3 + 1
  . . .
  (2) R3 <- R5 + 1

WAW: must prevent (2) from completing before (1) completes
  (1) R3 <- R3 op R5
  . . .
  (2) R3 <- R5 + 1

- Stalling: delay dispatching (or write-back) of the second instruction
- Copy Operands: copy the not-yet-used operand to prevent it being overwritten (WAR)
- Register Renaming: use a different register (WAW & WAR)


Register Renaming

Anti- and output dependences are false dependences: the dependence is on name/location rather than data.

Given an infinite number of registers, anti- and output dependences can always be eliminated.

  r3 <- r1 op r2
  r5 <- r3 op r4
  r3 <- r6 op r7

Original:
  r1 <- r2 / r3
  r4 <- r1 * r5
  r1 <- r3 + r6
  r3 <- r1 - r4

Renamed:
  r1 <- r2 / r3
  r4 <- r1 * r5
  r8 <- r3 + r6
  r9 <- r8 - r4
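The renaming step is mechanical: give each write a fresh register and make each read use the latest binding of its source. A sketch of this on the example above (function and names are ours; unlike the slide, which keeps r1 and r4 where there is no clash, this version renames every write):

```python
# instrs: list of (dest, [srcs]); first_free: first unused register number.
def rename(instrs, first_free):
    binding, out, nxt = {}, [], first_free
    for dest, srcs in instrs:
        srcs = [binding.get(s, s) for s in srcs]   # read latest binding
        binding[dest] = f"r{nxt}"                  # fresh register per write
        out.append((binding[dest], srcs))
        nxt += 1
    return out

original = [("r1", ["r2", "r3"]),   # r1 <- r2 / r3
            ("r4", ["r1", "r5"]),   # r4 <- r1 * r5
            ("r1", ["r3", "r6"]),   # r1 <- r3 + r6   (WAW on r1)
            ("r3", ["r1", "r4"])]   # r3 <- r1 - r4   (WAR on r3)
```

After renaming, the only remaining dependences are true (RAW) ones, so the two halves of the sequence can overlap freely.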


Hardware Register Renaming

A rename table maintains bindings from ISA register names (e.g. R12) to rename registers (e.g. T56) in a rename register file (t0 ... t63).

When issuing an instruction that updates 'RD':
- allocate an unused rename register TX
- record the binding from 'RD' to TX

When to remove a binding? When to de-allocate a rename register?

  R1 <- R2 / R3
  R4 <- R1 * R5
  R1 <- R3 + R6

To be continued next lecture!!