Advanced Microarchitecture
Lecture 2: Pipelining and Superscalar Review
Pipelined Design
• Motivation: increase throughput with little increase in cost (hardware, power, complexity, etc.)
• Bandwidth or throughput = performance
– BW = number of tasks / unit time
– For a system that operates on one task at a time: BW = 1 / latency
• Pipelining can increase BW if there are many repetitions of the same operation/task
• Latency per task remains the same or increases
Pipelining Illustrated
• Unpipelined: one block of combinational logic, N gate delays; BW = ~(1/n)
• 2-stage: two blocks of N/2 gate delays each; BW = ~(2/n)
• 3-stage: three blocks of N/3 gate delays each; BW = ~(3/n)
Performance Model
• Starting from an unpipelined version with propagation delay T and BW = 1/T
• Dividing the logic into k stages of delay T/k, each followed by a latch of delay S:

  Perf_pipe = BW_pipe = 1 / (T/k + S)

  where S = latch delay and k = number of stages
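A quick numeric check of the model above (a sketch; `pipelined_bw` is an illustrative helper name, and the T/S values are taken from the parameter sets used later in this lecture):

```python
def pipelined_bw(T, S, k):
    """Throughput of a k-stage pipeline: 1 / (T/k + S)."""
    return 1.0 / (T / k + S)

T, S = 400, 22   # propagation delay and latch delay, arbitrary time units
base = 1.0 / T   # unpipelined bandwidth
for k in (1, 2, 4, 8, 16):
    # Speedup over the unpipelined design; the latch overhead S keeps it below k
    print(k, round(pipelined_bw(T, S, k) / base, 2))
```

Note that as k grows, throughput saturates at 1/S: the latch delay, not the logic, eventually sets the clock.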
Hardware Cost Model
• Starting from an unpipelined version with hardware cost G
• Each of the k stages adds a latch of cost L:

  Cost_pipe = G + kL

  where L = latch cost (incl. control) and k = number of stages
Cost/Performance Tradeoff

Cost/Performance:
  C/P = (Lk + G) / [1 / (T/k + S)] = (Lk + G)(T/k + S)
      = LT + GS + LSk + GT/k

Optimal Cost/Performance: find the minimum C/P w.r.t. the choice of k:

  d/dk [(Lk + G)(T/k + S)] = 0 + 0 + LS - GT/k² = 0

  k_opt = sqrt(GT / (LS))
“Optimal” Pipeline Depth: kopt

(plot: cost/performance ratio C/P, ×10⁴, versus pipeline depth k from 0 to 50; each curve dips to a minimum at its kopt)
– G=175, L=41, T=400, S=22
– G=175, L=21, T=400, S=11
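The minima in the plot can be reproduced directly from the kopt formula derived on the previous slide (`k_opt` is an illustrative function name; the parameters are the two sets from the plot):

```python
from math import sqrt

def k_opt(G, L, T, S):
    """Depth minimizing C/P = (Lk + G)(T/k + S), from d(C/P)/dk = LS - GT/k^2 = 0."""
    return sqrt(G * T / (L * S))

# The two parameter sets plotted on this slide
print(round(k_opt(G=175, L=41, T=400, S=22), 1))  # 8.8
print(round(k_opt(G=175, L=21, T=400, S=11), 1))  # 17.4
```

Cheaper, faster latches (smaller L and S) push the optimal depth substantially higher.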
Cost?
• “Hardware cost”
– Transistor/gate count
• Should include the additional logic to control the pipeline
– Area (related to gate count)
– Power!
• More gates → more switching
• More gates → more leakage
• Many metrics to optimize
• Very difficult to determine what really is “optimal”
Pipelining Idealism
• Uniform suboperations
– The operation to be pipelined can be evenly partitioned into uniform-latency suboperations
• Repetition of identical operations
– The same operations are to be performed repeatedly on a large number of different inputs
• Repetition of independent operations
– All the repetitions of the same operation are mutually independent, i.e., no data dependences and no resource conflicts

Good examples: automobile assembly line, floating-point multiplier, instruction pipeline (?)
Instruction Pipeline Design
• Uniform suboperations … NOT!
– Balance pipeline stages
• Stage quantization to yield balanced stages
• Minimize internal fragmentation (some waiting stages)
• Identical operations … NOT!
– Unify instruction types
• Coalesce instruction types into one “multi-function” pipe
• Minimize external fragmentation (some idling stages)
• Independent operations … NOT!
– Resolve data and resource hazards
• Inter-instruction dependency detection and resolution
• Minimize performance loss
The Generic Instruction Cycle
• The “computation” to be pipelined:
1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Operand(s) Fetch (OF)
4. Instruction Execution (EX)
5. Operand Store (OS), a.k.a. writeback (WB)
6. Update Program Counter (PC)
The Generic Instruction Pipeline

Based on the obvious subcomputations:
  Instruction Fetch   → IF
  Instruction Decode  → ID
  Operand Fetch       → OF/RF
  Instruction Execute → EX
  Operand Store       → OS/WB
Balancing Pipeline Stages

Stage latencies: TIF = 6 units, TID = 2 units, TOF = 9 units, TEX = 5 units, TOS = 9 units

• Without pipelining: Tcyc = TIF + TID + TOF + TEX + TOS = 31
• Pipelined: Tcyc = max{TIF, TID, TOF, TEX, TOS} = 9

Speedup = 31 / 9

Can we do better in terms of either performance or efficiency?
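The arithmetic on this slide is easy to check (a sketch; variable names are illustrative):

```python
# Per-stage latencies from the slide, in arbitrary time units
stage_times = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

t_unpipelined = sum(stage_times.values())  # 31: one long combinational path
t_cyc = max(stage_times.values())          # 9: the slowest stage sets the clock
speedup = t_unpipelined / t_cyc            # 31/9, well short of the ideal 5x

print(t_unpipelined, t_cyc, round(speedup, 2))  # 31 9 3.44
```

The gap between 3.44 and the ideal 5 is the internal fragmentation caused by unbalanced stages.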
Balancing Pipeline Stages
• Two methods for stage quantization
– Merging multiple subcomputations into one
– Subdividing a subcomputation into multiple smaller ones
• Recent/current trends
– Deeper pipelines (more and more stages)
• To a certain point: then the cost function takes over
– Multiple different pipelines/subpipelines
– Pipelining of memory accesses (tricky)
Granularity of Pipeline Stages
• Coarser-grained machine cycle: 4 machine cycles / instruction
– Merge IF and ID: TIF&ID = 8 units, TOF = 9 units, TEX = 5 units, TOS = 9 units
• Finer-grained machine cycle: 11 machine cycles / instruction
– Tcyc = 3 units
– Original stage times TIF/TID/TOF/TEX/TOS = 6/2/9/5/9 split into 2/1/3/2/3 subcycles
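The stage quantization on this slide can be verified numerically (a sketch; the 3-unit subcycle and the merge of IF with ID come from the slide):

```python
from math import ceil

stage_times = [6, 2, 9, 5, 9]   # IF, ID, OF, EX, OS in time units

# Coarser-grained: merge IF and ID into one 8-unit stage
coarse = [6 + 2, 9, 5, 9]
print(len(coarse), max(coarse))  # 4 machine cycles/instruction, clock = 9 units

# Finer-grained: carve every stage into 3-unit subcycles
t_cyc = 3
fine = [ceil(t / t_cyc) for t in stage_times]   # [2, 1, 3, 2, 3]
print(sum(fine), t_cyc)          # 11 machine cycles/instruction, clock = 3 units
```

Going finer triples the clock rate (9 units down to 3) at the cost of more stages and more latches.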
Hardware Requirements
• Logic needed for each pipeline stage
• Register file ports needed to support all (relevant) stages
• Memory accessing ports needed to support all (relevant) stages
(refers to the coarser- and finer-grained pipelines from the previous slide)
Pipeline Examples

MIPS R2000/R3000 (5 stages):
  IF → RD → ALU → MEM → WB
  (generic mapping: IF, ID/OF, EX, OS)

AMDAHL 470V/7 (12 stages):
  PC GEN → Cache Read → Cache Read → Decode → Read REG → Add GEN →
  Cache Read → Cache Read → EX 1 → EX 2 → Check Result → Write Result
  (generic mapping: IF = PC GEN and the first cache reads; ID = Decode; OF = Read REG through the second pair of cache reads; EX = EX 1, EX 2; OS = Check Result, Write Result)
Instruction Dependencies
• Data dependence
– True dependence (RAW)
• Instruction must wait for all required input operands
– Anti-dependence (WAR)
• Later write must not clobber a still-pending earlier read
– Output dependence (WAW)
• Earlier write must not clobber an already-finished later write
• Control dependence (a.k.a. procedural dependence)
– Conditional branches cause uncertainty in instruction sequencing
– Instructions following a conditional branch depend on the execution of the branch instruction
– Instructions following a computed branch depend on the execution of the branch instruction
Example: Quick Sort on MIPS

# for (; (j<high) && (array[j]<array[low]); ++j);
# $10 = j; $9 = high; $6 = array; $8 = low
      bge   $10, $9, $36
      mul   $15, $10, 4
      addu  $24, $6, $15
      lw    $25, 0($24)
      mul   $13, $8, 4
      addu  $14, $6, $13
      lw    $15, 0($14)
      bge   $25, $15, $36
$35:  addu  $10, $10, 1
      . . .
$36:  addu  $11, $11, -1
      . . .
Hardware Dependency Analysis
• Processor must handle
– Register data dependencies: RAW, WAW, WAR
– Memory data dependencies: RAW, WAW, WAR
– Control dependencies
Terminology
• Pipeline hazards:
– Potential violations of program dependencies
– Must ensure program dependencies are not violated
• Hazard resolution:
– Static method: performed at compile time in software
– Dynamic method: performed at runtime using hardware
– Either way: stall, flush, or forward
• Pipeline interlock:
– Hardware mechanism for dynamic hazard resolution
– Must detect and enforce dependencies at runtime
Pipeline: Steady State

           t0   t1   t2   t3   t4   t5
  Instj    IF   ID   RD   ALU  MEM  WB
  Instj+1       IF   ID   RD   ALU  MEM
  Instj+2            IF   ID   RD   ALU
  Instj+3                 IF   ID   RD
  Instj+4                      IF   ID

(one instruction enters the pipeline per cycle; once the pipe is full, one completes per cycle)
Pipeline: Data Hazard

(same diagram as the steady state: if a nearby younger instruction needs a result of Instj, its RD stage overlaps with or precedes Instj's WB, so reading the register file would return a stale value)
Pipeline: Stall on Data Hazard

(Instj+2 depends on an earlier instruction: it is stalled in RD, which in turn holds Instj+3 stalled in ID and Instj+4 stalled in IF, until the needed result is written back; then all three proceed down the pipeline)
Different View

        t0   t1    t2    t3    t4    t5    t6    t7    t8    t9    t10
  IF    Ij   Ij+1  Ij+2  Ij+3  Ij+4  Ij+4  Ij+4  Ij+4
  ID         Ij    Ij+1  Ij+2  Ij+3  Ij+3  Ij+3  Ij+3  Ij+4
  RD               Ij    Ij+1  Ij+2  Ij+2  Ij+2  Ij+2  Ij+3  Ij+4
  ALU                    Ij    Ij+1  nop   nop   nop   Ij+2  Ij+3  Ij+4
  MEM                          Ij    Ij+1  nop   nop   nop   Ij+2  Ij+3
  WB                                 Ij    Ij+1  nop   nop   nop   Ij+2

(stalled instructions are held in place for three cycles while nops/bubbles flow down ALU, MEM, and WB)
Pipeline: Forwarding Paths

(same diagram; results can be forwarded from the ALU, MEM, and WB stages of older instructions directly to the inputs of younger ones)
• Many possible paths
• MEM → ALU: a load's consumer requires stalling even with forwarding paths
ALU Forwarding Paths

(datapath sketch: after IF and ID, src1/src2 are read from the register file; comparators (==) match each source register against the dest of the instructions currently in the ALU and MEM stages, muxing the in-flight result into the ALU inputs on a match)
• Deeper pipelines may require additional forwarding paths
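The comparator-and-mux logic in the sketch can be expressed as a small selection function (a sketch only; `forward` and its arguments are illustrative names, and the register-0 convention is a MIPS-style assumption):

```python
def forward(src, exmem_dest, memwb_dest, regfile, exmem_val, memwb_val):
    """Select the value for one ALU source operand.

    Prefer the youngest in-flight producer (EX/MEM), then the older one
    (MEM/WB), else fall back to the register file. Register 0 is assumed
    hardwired to zero and is never forwarded.
    """
    if src != 0 and src == exmem_dest:
        return exmem_val
    if src != 0 and src == memwb_dest:
        return memwb_val
    return regfile[src]

regfile = [0, 10, 20, 30]
print(forward(2, 2, None, regfile, 99, -1))   # 99: bypassed from EX/MEM
print(forward(2, 3, 2, regfile, -1, 77))      # 77: bypassed from MEM/WB
print(forward(3, 1, 2, regfile, -1, -1))      # 30: no match, register file read
```

The priority order (youngest producer first) is what makes back-to-back dependent instructions see the most recent value.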
Pipeline: Control Hazard

(same diagram; Insti is a branch: the instructions behind it are fetched before the branch outcome and target are known)
Pipeline: Stall on Control Hazard

(instructions after the branch are held, stalled in IF, until the branch resolves; a bubble enters the pipeline for each stalled cycle)
Pipeline: Prediction for Control Hazards

(fetch continues down the predicted path past the branch Insti; on a misprediction, the speculative Insti+2 .. Insti+4 are squashed into nops, the speculative state is cleared, and fetch is resteered to the correct-path new Insti+2 .. Insti+4)
Going Beyond Scalar
• A simple pipeline is limited to CPI ≥ 1.0
• “Superscalar” can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)
– Superscalar means executing more than one scalar instruction in parallel (e.g., add + xor + mul)
– Contrast with vector, which effectively executes multiple operations in parallel, but they must all be the same operation (e.g., four parallel additions)
Architectures for Instruction Parallelism
• Scalar pipeline (baseline)
– Instruction/overlap parallelism = D
– Operation latency = 1
– Peak IPC = 1

(diagram: successive instructions vs. time in cycles; D different instructions overlapped in a D-deep pipeline)
Superscalar Machine
• Superscalar (pipelined) execution
– Instruction parallelism = D × N
– Operation latency = 1
– Peak IPC = N per cycle

(diagram: N instructions enter the D-deep pipeline each cycle; D × N different instructions overlapped)
Ex. Original Pentium
• Prefetch: 4 × 32-byte buffers
• Decode1: decode up to 2 instructions
• Decode2 (one per pipe): read operands, address computation
• Execute (one per pipe), Writeback (one per pipe)
• Asymmetric pipes:
– u-pipe: shift, rotate, some FP
– v-pipe: jmp, jcc, call, fxch
– Both: mov, lea, simple ALU, push/pop, test/cmp
Pentium Hazards, Stalls
• “Pairing rules” (when can/can't two insts exec at the same time?)
– read/flow dependence:
      mov eax, 8
      mov [ebp], eax
– output dependence:
      mov eax, 8
      mov eax, [ebp]
– partial register stalls:
      mov al, 1
      mov ah, 0
– function unit rules
• some instructions can never be paired: MUL, DIV, PUSHA, MOVS, some FP
Limitations of In-Order Pipelines
• CPI of in-order pipelines degrades very sharply if the machine parallelism is increased beyond a certain point
– i.e., when N approaches the average distance between dependent instructions
– Forwarding is no longer effective: the machine must stall more often
– The pipeline may never be full due to the frequency of dependency stalls
N Instruction Limit
• Ex. superscalar degree N = 4: any dependency among the instructions issued together will cause a stall; a dependent instruction must be at least N = 4 instructions away from its parent
• On average, the parent-child separation is only about 5 instructions! (Franklin and Sohi ’92)
• An average of 5 means there are many cases when the separation is < 4; each of these limits parallelism
• Pentium: superscalar degree N = 2 is reasonable; going much further encounters rapidly diminishing returns
In Search of Parallelism
• “Trivial” parallelism is limited
– What is trivial parallelism?
• In-order: sequential instructions that do not have dependencies
• In all previous examples, every instruction executed either at the same time as, or after, earlier instructions
– The previous slides show that superscalar execution quickly hits a ceiling
• So what is “non-trivial” parallelism? …
What is Parallelism?
• Work
– T1: time to complete the computation on a sequential system
• Critical path
– T∞: time to complete the same computation on an infinitely-parallel system
• Average parallelism
– Pavg = T1 / T∞
• For a p-wide system
– Tp ≈ max{T1/p, T∞}
– If Pavg >> p, then Tp ≈ T1/p

Example:
  x = a + b; y = b * 2
  z = (x - y) * (x + y)
(dataflow graph: the two operations producing x and y run in parallel, then x-y and x+y, then the final multiply)
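The T1 and T∞ of the example can be computed from its dataflow graph (a sketch with unit-latency operations; the node names `d`, `s`, `z` for the intermediate values are illustrative):

```python
# Dataflow graph for: x = a + b; y = b * 2; z = (x - y) * (x + y)
deps = {
    "x": [],          # a + b   (a, b already available)
    "y": [],          # b * 2
    "d": ["x", "y"],  # x - y
    "s": ["x", "y"],  # x + y
    "z": ["d", "s"],  # d * s
}

T1 = len(deps)  # work: 5 unit-latency operations

depth = {}
def critical_path(op):
    if op not in depth:
        depth[op] = 1 + max((critical_path(p) for p in deps[op]), default=0)
    return depth[op]

T_inf = max(critical_path(op) for op in deps)  # longest chain, e.g. x -> d -> z
print(T1, T_inf, round(T1 / T_inf, 2))         # 5 3 1.67 = average parallelism
```

So even an infinitely wide machine finishes in 3 steps: the critical path, not the work, is the floor.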
ILP: Instruction-Level Parallelism
• ILP is a measure of the amount of inter-dependencies between instructions
• Average ILP = num instructions / longest path

code1 (ILP = 1, must execute serially; T1 = 3, T∞ = 3):
  r1 ← r2 + 1
  r3 ← r1 / 17
  r4 ← r0 - r3

code2 (ILP = 3, can execute at the same time; T1 = 3, T∞ = 1):
  r1 ← r2 + 1
  r3 ← r9 / 17
  r4 ← r0 - r10
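The "num instructions / longest path" definition is mechanical enough to code up (a sketch; `avg_ilp` is an illustrative name, and unit latency is assumed as in the slide):

```python
def avg_ilp(instructions):
    """instructions: (dest, sources) tuples in program order; unit latency."""
    chain = {}       # register -> length of the dependence chain producing it
    longest = 0
    for dest, srcs in instructions:
        d = 1 + max((chain.get(s, 0) for s in srcs), default=0)
        chain[dest] = d
        longest = max(longest, d)
    return len(instructions) / longest

code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]   # serial chain
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]  # independent
print(avg_ilp(code1), avg_ilp(code2))  # 1.0 3.0
```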
ILP != IPC
• ILP usually assumes infinite resources, perfect fetch, and unit latency for all instructions
• ILP is more a property of the program's dataflow
• IPC is the “real” observed metric of exactly how many instructions are executed per machine cycle, and it includes all of the limitations of a real machine
• The ILP of a program is an upper bound on the attainable IPC
Scope of ILP Analysis

  r1 ← r2 + 1
  r3 ← r1 / 17       ILP = 1 (this block alone)
  r4 ← r0 - r3

  r11 ← r12 + 1
  r13 ← r19 / 17     ILP = 3 (this block alone)
  r14 ← r0 - r20

Analyzed together: 6 instructions / longest path of 3 → ILP = 2
DFG Analysis

A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 - 1
K: R3 → ST 0[R1]
In-Order Issue, Out-of-Order Completion
• Issue = send an instruction to execution
• Issue stage needs to check:
1. Structural dependence
2. RAW hazard
3. WAW hazard
4. WAR hazard

(diagram: an in-order instruction stream issues to functional units of different latencies (INT; Fadd1-Fadd2; Fmul1-Fmul3; Ld/St); execution begins in order, but completion is out of order)
Example

A: R1 = R2 + R3      Cycle 1: A, B
B: R4 = R5 + R6      Cycle 2: C
C: R1 = R1 * R4      Cycle 3: D
D: R7 = LD 0[R1]     Cycle 4: (waiting on the load)
E: BEQZ R7, +32      Cycle 5: (waiting on the load)
F: R4 = R7 - 3       Cycle 6: E, F, G
G: R1 = R1 + 1       Cycle 7: H, J
H: R4 → ST 0[R1]     Cycle 8: K
J: R1 = R1 - 1
K: R3 → ST 0[R1]

IPC = 10/8 = 1.25
Example (2)

A variant of the previous code with some registers renamed:

A: R1 = R2 + R3      Cycle 1: A, B
B: R4 = R5 + R6      Cycle 2: C
C: R1 = R1 * R4      Cycle 3: D
D: R9 = LD 0[R1]     Cycle 4: (waiting on the load)
E: BEQZ R7, +32      Cycle 5: E, F, G
F: R4 = R7 - 3       Cycle 6: H, J
G: R1 = R1 + 1       Cycle 7: K
H: R4 → ST 0[R9]
J: R1 = R9 - 1
K: R3 → ST 0[R1]

IPC = 10/7 = 1.43
Track with Simple Scoreboarding
• Scoreboard: a bit array, 1 bit for each GPR
– If the bit is not set: the register has valid data
– If the bit is set: the register has stale data, i.e., some outstanding instruction is going to change it
• Issue in order: RD ← Fn(RS, RT)
– If SB[RS] or SB[RT] is set → RAW, stall
– If SB[RD] is set → WAW, stall
– Else, dispatch to FU (Fn) and set SB[RD]
• Complete out of order
– Update GPR[RD], clear SB[RD]
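The issue rules above can be sketched as a tiny simulator (illustrative class and method names; the string return values just label which rule fired):

```python
class Scoreboard:
    """One busy bit per GPR: set means an outstanding write is pending."""

    def __init__(self, nregs=32):
        self.busy = [False] * nregs

    def try_issue(self, rd, rs, rt):
        if self.busy[rs] or self.busy[rt]:
            return "RAW stall"     # a source is still being produced
        if self.busy[rd]:
            return "WAW stall"     # destination already claimed
        self.busy[rd] = True       # dispatch to FU, claim the destination
        return "issued"

    def complete(self, rd):        # out-of-order completion / writeback
        self.busy[rd] = False

sb = Scoreboard()
print(sb.try_issue(1, 2, 3))  # issued:    r1 = f(r2, r3)
print(sb.try_issue(4, 1, 5))  # RAW stall: r1 still outstanding
print(sb.try_issue(1, 6, 7))  # WAW stall: r1 already claimed
sb.complete(1)
print(sb.try_issue(4, 1, 5))  # issued
```

Note the single bit per register is exactly why WAW must stall here: there is nowhere to record two pending writers.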
Out-of-Order Issue

(diagram: the same functional units as before (INT; Fadd1-Fadd2; Fmul1-Fmul3; Ld/St), each fronted by dependency-resolution (DR) buffers; the in-order instruction stream now executes out of program order and completes out of order)
• Need an extra stage/buffers for dependency resolution
OOO Scoreboarding
• Similar to in-order scoreboarding
– Need new tables to track the status of individual instructions and functional units
– Still enforce dependencies
• Stall dispatch on WAW
• Stall issue on RAW
• Stall completion on WAR
• Limitations of scoreboarding?
• Hints
– No structural hazards
– Can always write a RAW-free code sequence:
    Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …
– Think about the x86 ISA with only 8 registers
• The finite number of registers in any ISA will force you to reuse register names at some point → WAR, WAW stalls
Lessons thus Far
• More out-of-orderness → more ILP exposed, but more hazards
• Stalling is a generic technique to ensure sequencing
• RAW stall is a fundamental requirement (?)
• Compiler analysis and scheduling can help (not covered in this course)
Ex. Tomasulo’s Algorithm [IBM 360/91, 1967]

(block diagram of the IBM 360/91 floating-point unit: the storage bus and instruction unit feed the Floating Operand Stack (FLOS), which is decoded in order; operands live in the Floating Point Buffers (FLB, entries 1-6) and the Floating Point Registers (FLR, registers 0/2/4/8, each with busy bits and tags); stores drain through the Store Data Buffers (SDB, entries 1-3); reservation stations in front of the Adder and the Multiply/Divide unit hold (sink tag, tag, source, ctrl) entries; results broadcast on the Common Data Bus (CDB) and are captured by tag match in the reservation stations, the FLR, and the SDB)
FYI: Historical Note
• Tomasulo’s algorithm (1967) was not the first
• Also at IBM, Lynn Conway proposed multi-issue dynamic instruction scheduling (OOO) in Feb 1966
– The ideas got buried due to internal politics, changing project goals, etc.
– But it’s still the first (as far as I know)
Modern Enhancements to Tomasulo’s Algorithm

                    Tomasulo                  Modern
  Machine width     Peak IPC = 1              Peak IPC = 6+
  Structural deps   2 FP FU’s, single CDB     6-10+ FU’s, many forwarding buses
  Anti-deps         Operand copying           Renamed registers
  Output-deps       RS tag                    Renamed registers
  True deps         Tag-based forwarding      Tag-based forwarding
  Exceptions        Imprecise                 Precise (requires ROB)