cse502: computer architecture core pipelining. cse502: computer architecture before there was...

76
CSE502: Computer Architecture CSE 502: Computer Architecture Core Pipelining

Upload: gabriel-pleasants

Post on 29-Mar-2015

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

CSE 502:Computer Architecture

Core Pipelining

Page 2: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Before there was pipelining…

• Single-cycle control: hardwired– Low CPI (1)– Long clock period (to accommodate slowest instruction)

• Multi-cycle control: micro-programmed– Short clock period– High CPI

• Can we have both low CPI and short clock period?

Single-cycle

Multi-cycle

insn0.(fetch,decode,exec) insn1.(fetch,decode,exec)

insn0.decinsn0.fetch insn1.decinsn1.fetchinsn0.exec insn1.exec

time

Page 3: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Pipelining

• Start with multi-cycle design• When insn0 goes from stage 1 to stage 2

… insn1 starts stage 1• Each instruction passes through all stages

… but instructions enter and leave at faster rate

Multi-cycle insn0.decinsn0.fetch insn1.decinsn1.fetchinsn0.exec insn1.exec

timePipelined

insn0.execinsn0.decinsn0.fetch

insn1.decinsn1.fetch insn1.exec

insn2.decinsn2.fetch insn2.exec

Can have as many insns in flight as there are stages

Page 4: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Pipeline Examples

Stage delay = Bandwidth =

Stage delay =Bandwidth =

Stage delay = Bandwidth =

address

hit?

==

==

Increases throughput at the expense of latency

address

hit?

==

==

address

hit?

==

==

Page 5: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Processor Pipeline Review

I-cacheRegFile

PC

+4

D-cacheALU

Fetch Decode Memory

(Write-back)

Execute

Page 6: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Stage 1: Fetch• Fetch an instruction from memory every cycle

– Use PC to index memory– Increment PC (assume no branches for now)

• Write state to the pipeline register (IF/ID)– The next stage will read this pipeline register

Page 7: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Stage 1: Fetch Diagram

Inst

ruct

ion

bit

s

IF / IDPipeline register

PC

InstructionCache

en

en

1

+

MUX

PC

+ 1

Deco

de

target

Page 8: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Stage 2: Decode• Decodes opcode bits

– Set up Control signals for later stages

• Read input operands from register file– Specified by decoded instruction bits

• Write state to the pipeline register (ID/EX)– Opcode– Register contents– PC+1 (even though decode didn’t use it)– Control signals (from insn) for opcode and destReg

Page 9: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Stage 2: Decode Diagram

ID / EXPipeline register

regA

conte

nts

regB

conte

ntsRegister File

regAregB

en

Inst

ruct

ion

bit

s

IF / IDPipeline register

PC

+ 1

PC

+ 1

Contr

ol

signals

Fetc

h

Exe

cute

destReg

data

target

Page 10: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Stage 3: Execute• Perform ALU operations

– Calculate result of instruction• Control signals select operation• Contents of regA used as one input• Either regB or constant offset (from insn) used as second input

– Calculate PC-relative branch target• PC+1+(constant offset)

• Write state to the pipeline register (EX/Mem)– ALU result, contents of regB, and PC+1+offset– Control signals (from insn) for opcode and destReg

Page 11: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Stage 3: Execute Diagram

ID / EXPipeline register

regA

conte

nts

regB

conte

nts

ALU

resu

lt

EX/MemPipeline register

PC

+ 1

Contr

ol

signals

Contr

ol

signals

PC

+1

+off

set

+

regB

conte

nts

ALUM

UX

Deco

de

Mem

ory

destRegdata

target

Page 12: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Stage 4: Memory• Perform data cache access

– ALU result contains address for LD or ST– Opcode bits control R/W and enable signals

• Write state to the pipeline register (Mem/WB)– ALU result and Loaded data– Control signals (from insn) for opcode and destReg

Page 13: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Stage 4: Memory Diagram

ALU

resu

lt

Mem/WBPipeline register

ALU

resu

lt

EX/MemPipeline register

Contr

ol

signals

PC

+1

+off

set

regB

conte

nts

Loaded

data

Data Cache

en R/W

in_data

in_addr

Contr

ol

signals

Exe

cute

Wri

te-b

ack

destRegdata

target

Page 14: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Stage 5: Write-back• Writing result to register file (if required)

– Write Loaded data to destReg for LD – Write ALU result to destReg for arithmetic insn– Opcode bits control register write enable signal

Page 15: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Stage 5: Write-back DiagramA

LUre

sult

Mem/WBPipeline register

Contr

ol

signals

Loaded

data

MUX

data

destRegMUX

Mem

ory

Page 16: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Putting It All Together

PC InstCache

Regis

ter

file

MUX

ALU

1

DataCache

++

MUX

IF/ID ID/EX EX/Mem Mem/WB

MUX

op

dest

offset

valB

valA

PC+1PC+1target

ALUresult

op

dest

valB

op

dest

ALUresult

mdata

eq?instru

ction

0

R2R3R4R5

R1

R6

R0

R7

regAregB

datadest

MUX

Page 17: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Pipelining Idealism• Uniform Sub-operations

– Operation can partitioned into uniform-latency sub-ops

• Repetition of Identical Operations– Same ops performed on many different inputs

• Repetition of Independent Operations– All repetitions of op are mutually independent

Page 18: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Pipeline Realism• Uniform Sub-operations … NOT!

– Balance pipeline stages• Stage quantization to yield balanced stages• Minimize internal fragmentation (left-over time near end of cycle)

• Repetition of Identical Operations … NOT!– Unifying instruction types

• Coalescing instruction types into one “multi-function” pipe• Minimize external fragmentation (idle stages to match length)

• Repetition of Independent Operations … NOT!– Resolve data and resource hazards

• Inter-instruction dependency detection and resolution

Pipelining is expensive

Page 19: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

The Generic Instruction Pipeline

Instruction Fetch

Instruction Decode

Operand Fetch

Instruction Execute

Write-back

IF

ID

OF

EX

WB

Page 20: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Balancing Pipeline StagesTIF= 6 units

TID= 2 units

TID= 9 units

TEX= 5 units

TOS= 9 units

Without pipeliningTcyc TIF+TID+TOF+TEX+TOS

= 31

PipelinedTcyc max{TIF, TID, TOF, TEX, TOS}

= 9

Speedup= 31 / 9

IF

ID

OF

EX

WB

Can we do better?

Page 21: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Balancing Pipeline Stages (1/2)• Two methods for stage quantization

– Merge multiple sub-ops into one– Divide sub-ops into smaller pieces

• Recent/Current trends– Deeper pipelines (more and more stages)– Multiple different pipelines/sub-pipelines– Pipelining of memory accesses

Page 22: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Balancing Pipeline Stages (2/2)Coarser-Grained Machine Cycle:

4 machine cyc / instructionFiner-Grained Machine Cycle: 11

machine cyc /instruction

TIF&ID= 8 units

TOF= 9 units

TEX= 5 units

TOS= 9 units

IFID

OF

WB

EX# stages = 11Tcyc= 3 units

IF

IFID

OF

OF

OF

EXEX

WB

WB

WB

# stages = 4Tcyc= 9 units

Page 23: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Pipeline Examples

IF

RD

ALU

MEM

WB

IF

ID

OF

EX

WB

PC GEN

Cache Read

Cache Read

Decode

Read REG

Addr GEN

Cache Read

Cache Read

EX 1

EX 2

Check Result

Write Result

WB

EX

OF

ID

IFMIPS R2000/R3000

AMDAHL 470V/7

Page 24: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Instruction Dependencies (1/2)• Data Dependence

– Read-After-Write (RAW) (only true dependence)• Read must wait until earlier write finishes

– Anti-Dependence (WAR)• Write must wait until earlier read finishes (avoid clobbering)

– Output Dependence (WAW)• Earlier write can’t overwrite later write

• Control Dependence (a.k.a. Procedural Dependence)– Branch condition must execute before branch target– Instructions after branch cannot run before branch

Page 25: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Instruction Dependencies (1/2)# for (;

(j<high)&&(array[j]<array[low]);++j);

bge j, high, $36mul $15, j, 4addu $24, array, $15lw $25, 0($24)mul $13, low, 4addu $14, array, $13lw $15, 0($14)bge $25, $15, $36

$35:addu j, j, 1. . .

$36:addu $11, $11, -1. . .Real code has lots of dependencies

Page 26: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Hardware Dependency Analysis• Processor must handle

– Register Data Dependencies (same register)• RAW, WAW, WAR

– Memory Data Dependencies (same address)• RAW, WAW, WAR

– Control Dependencies

Page 27: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Pipeline Terminology• Pipeline Hazards

– Potential violations of program dependencies– Must ensure program dependencies are not violated

• Hazard Resolution– Static method: performed at compile time in software– Dynamic method: performed at runtime using hardware– Two options: Stall (costs perf.) or Forward (costs hw.)

• Pipeline Interlock– Hardware mechanism for dynamic hazard resolution– Must detect and enforce dependencies at runtime

Page 28: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Pipeline: Steady State

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM

IF ID RD ALU

IF ID RD

IF ID

IF

t0 t1 t2 t3 t4 t5

Instj

Instj+1

Instj+2

Instj+3

Instj+4

Page 29: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Pipeline: Data Hazard

t0 t1 t2 t3 t4 t5

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM

IF ID RD ALU

IF ID RD

IF ID

IF

Instj

Instj+1

Instj+2

Instj+3

Instj+4

Page 30: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Option 1: Stall on Data Hazard

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID Stalled in RD ALU MEM WB

IF Stalled in ID RD ALU MEM WB

Stalled in IF ID RD ALU MEM

IF ID RD ALU

t0 t1 t2 t3 t4 t5

RD

ID

IF

IF ID RD

IF ID

IF

Instj

Instj+1

Instj+2

Instj+3

Instj+4

Page 31: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Option 2: Forwarding Paths (1/3)

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM

IF ID RD ALU

IF ID RD

IF ID

IF

t0 t1 t2 t3 t4 t5

Many possible pathsInstj

Instj+1

Instj+2

Instj+3

Instj+4

MEM ALURequires stalling even with forwarding paths

Page 32: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Option 2: Forwarding Paths (2/3)

IF IDRegister File

src1

src2

ALU

MEM

dest

WB

Page 33: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Option 2: Forwarding Paths (3/3)

Deeper pipeline mayrequire additionalforwarding paths

IFRegister File

src1

src2

ALU

MEM

dest

==

==

WB

==

ID

Page 34: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Pipeline: Control Hazard

t0 t1 t2 t3 t4 t5

InstiInsti+1

Insti+2

Insti+3

Insti+4

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM

IF ID RD ALU

IF ID RD

IF ID

IF

Page 35: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Pipeline: Stall on Control Hazard

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU MEM

IF ID RD ALU

IF ID RD

IF ID

IF

t0 t1 t2 t3 t4 t5

InstiInsti+1

Insti+2

Insti+3

Insti+4

Stalled in IF

Page 36: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Pipeline: Prediction for Control Hazards

t0 t1 t2 t3 t4 t5

InstiInsti+1

Insti+2

Insti+3

Insti+4

IF ID RD ALU MEM WB

IF ID RD ALU MEM WB

IF ID RD ALU nop nop

IF ID RD nop nop

IF ID nop nop

IF ID RD

IF ID

IF

nop

nop nop

ALU nop

RD ALU

ID RD

nop

nop

nop

New Insti+2New Insti+3New Insti+4

Speculative State Cleared

Fetch Resteered

Page 37: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Going Beyond Scalar

• Scalar pipeline limited to CPI ≥ 1.0– Can never run more than 1 insn. per cycle

• “Superscalar” can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)– Superscalar means executing multiple insns. in parallel

Page 38: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Architectures for Instruction Parallelism• Scalar pipeline (baseline)

– Instruction/overlap parallelism = D– Operation Latency = 1– Peak IPC = 1.0

D

Su

ccess

ive

Inst

ruct

ion

s

Time in cycles1 2 3 4 5 6 7 8 9 10 11 12

D different instructions overlapped

Page 39: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Superscalar Machine• Superscalar (pipelined) Execution

– Instruction parallelism = D x N– Operation Latency = 1– Peak IPC = N per cycle

Su

ccess

ive

Inst

ruct

ion

s

Time in cycles1 2 3 4 5 6 7 8 9 10 11 12

N

D x N different instructions overlapped

Page 40: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Superscalar Example: Pentium

Prefetch

Decode1

Decode2 Decode2

Execute Execute

WritebackWriteback

4× 32-byte buffers

Decode up to 2 insts

Read operands, Addr comp

Asymmetric pipes

u-pipe v-pipeshift

rotatesome FP

jmp, jcc,call,fxch

bothmov, lea,

simple ALU,push/poptest/cmp

Page 41: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Pentium Hazards & Stalls• “Pairing Rules” (when can’t two insns exec?)

– Read/flow dependence• mov eax, 8• mov [ebp], eax

– Output dependence• mov eax, 8• mov eax, [ebp]

– Partial register stalls• mov al, 1• mov ah, 0

– Function unit rules• Some instructions can never be paired

– MUL, DIV, PUSHA, MOVS, some FP

Page 42: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Limitations of In-Order Pipelines• If the machine parallelism is increased

– … dependencies reduce performance– CPI of in-order pipelines degrades sharply

• As N approaches avg. distance between dependent instructions• Forwarding is no longer effective

– Must stall often

In-order pipelines are rarely full

Page 43: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

The In-Order N-Instruction Limit• On average, parent-child separation is about ± 5 insn.

– (Franklin and Sohi ’92)

Ex. Superscalar degree N = 4

Any dependencybetween theseinstructions willcause a stall

Dependent insnmust be N = 4

instructions away

Average of 5 means there are many cases when the

separation is < 4… each of these limits parallelism

Reasonable in-order superscalar is effectively N=2

Page 44: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

In Search of Parallelism• “Trivial” Parallelism is limited

– What is trivial parallelism?• In-order: sequential instructions do not have dependencies• In all previous examples, all instructions executed either at the

same time or after earlier instructions

– previous slides show that superscalar execution quickly hits a ceiling

• So what is “non-trivial” parallelism? …

Page 45: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

What is Parallelism?• Work

– T1: time to complete a computation on a sequential system

• Critical Path– T: time to complete the same computation on an

infinitely-parallel system

• Average Parallelism– Pavg = T1/ T

• For a p-wide system– Tp max{T1/p , T}– Pavg >> p Tp T1/p

+

+-

*

*2

a b

xy

x = a + b; y = b * 2z =(x-y) * (x+y)

Page 46: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

ILP: Instruction-Level Parallelism• ILP is a measure of the amount of inter-dependencies

between instructions• Average ILP = num instructions / longest path

– code1: ILP = 1 (must execute serially)– T1 = 3, T = 3– code2: ILP = 3 (can execute at the same

time)– T1 = 3, T = 1code1: r1 r2 + 1

r3 r1 / 17r4 r0 - r3

code2: r1 r2 + 1r3 r9 / 17r4 r0 - r10

Page 47: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

ILP != IPC• Instruction level parallelism usually assumes infinite

resources, perfect fetch, and unit-latency for all instructions

• ILP is more a property of the program dataflow

• IPC is the “real” observed metric of exactly how many instructions are executed per machine cycle, which includes all of the limitations of a real machine

• The ILP of a program is an upper-bound on the attainable IPC

Page 48: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Scope of ILP Analysis

r1 r2 + 1r3 r1 / 17r4 r0 - r3

r11 r12 + 1r13 r19 / 17r14 r0 - r20

ILP=2

ILP=1

ILP=3

Page 49: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

DFG Analysis• A: R1 = R2 + R3• B: R4 = R5 + R6• C: R1 = R1 * R4• D: R7 = LD 0[R1]• E: BEQZ R7, +32• F: R4 = R7 - 3• G: R1 = R1 + 1• H: R4 ST 0[R1]• J: R1 = R1 – 1• K: R3 ST 0[R1]

Page 50: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

In-Order Issue, Out-of-Order Completion

Issue stage needs to check: 1. Structural Dependence 2. RAW Hazard 3. WAW Hazard 4. WAR Hazard

Issue = send an instructionto execution

INT Fadd1

Fadd2

Fmul1

Fmul2

Fmul3

Ld/St

In-orderInst.

Stream

ExecutionBegins

In-order

Out-of-orderCompletion

Page 51: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

ExampleA: R1 = R2 + R3B: R4 = R5 + R6C: R1 = R1 * R4D: R7 = LD 0[R1]E: BEQZ R7, +32F: R4 = R7 - 3G: R1 = R1 + 1H: R4 ST 0[R1]J: R1 = R1 – 1K: R3 ST 0[R1]

A BCycle 1:

C2:

D

3:

4:

5:

E F6: G

H J

K

7:

8:

IPC = 10/8 = 1.25

A B

C

D

E F

G

H

J

K

Page 52: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Example (2)A: R1 = R2 + R3B: R4 = R5 + R6C: R1 = R1 * R4D: R9 = LD 0[R1]E: BEQZ R7, +32F: R4 = R7 - 3G: R1 = R1 + 1H: R4 ST 0[R9]J: R1 = R9 – 1K: R3 ST 0[R1]

A BCycle 1:

C2:

D

3:

4:

5:

E F G

IPC = 10/7 = 1.43

H J6:

K7:

A B

C

D

E

F G

H J

K

Page 53: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Track with Simple Scoreboarding• Scoreboard: a bit-array, 1-bit for each GPR

– If the bit is not set: the register has valid data– If the bit is set: the register has stale data

• i.e., some outstanding instruction is going to change it

• Issue in Order: RD Fn (RS, RT)– If SB[RS] or SB[RT] is set RAW, stall– If SB[RD] is set WAW, stall– Else, dispatch to FU (Fn) and set SB[RD]

• Complete out-of-order– Update GPR[RD], clear SB[RD]

Page 54: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Out-of-Order Issue

INT Fadd1

Fadd2

Fmul1

Fmul2

Fmul3

Ld/St

In-orderInst.

Stream

DR DR DR DR

Out-of-orderCompletion

Out ofProgram

OrderExecution

Need an extraStage/buffers forDependencyResolution

Page 55: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

OOO Scoreboarding• Similar to In-Order scoreboarding

– Need new tables to track status of individual instructions and functional units

– Still enforce dependencies• Stall dispatch on WAW• Stall issue on RAW• Stall completion on WAR

• Limitations of Scoreboarding?• Hints

– No structural hazards– Can always write a RAW-free code sequence

• Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …

– Think about x86 ISA with only 8 registers

Finite number of registers inany ISA will force you to reuseregister names at some point

WAR, WAW stalls

Page 56: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Lessons thus Far• More out-of-orderness More ILP exposed

– But more hazards

• Stalling is a generic technique to ensure sequencing• RAW stall is a fundamental requirement (?)

• Compiler analysis and scheduling can help• (not covered in this course)

Page 57: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Ex. Tomasulo’s Algorithm [IBM 360/91, 1967]

Adder

Floating Point

Registers FLR

0

2

4

8

Store

Data

1

2

3

Buffers SDB

Control

Decoder

Floating

Operand

Stack

FLOSControl

Floating Point

Buffers FLB

1

2

3

4

5

6

Decoder

Floating PointRegisters (FLR)

Control

0

248

Floating

Operand

Stack

Floating Point

Buffers (FLB)

1234

56

Store

Data

123

Buffers (SDB)

Control

Storage Bus Instruction Unit

Result

Multiply/Divide

Common Data Bus (CDB)

Point

BusyBits

Adder

FLB BusFLR Bus

CDB ••

Tags

Tags

Sink TagTag Source Ctrl.Sink TagTag Source Ctrl.

Sink TagTag Source Ctrl.

Sink TagTag Source Ctrl.

Sink TagTag Source Ctrl.

Result

(FLOS)

Page 58: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

FYI: Historical Note• Tomasulo’s algorithm (1967) was not the first• Also at IBM, Lynn Conway proposed multi-issue

dynamic instruction scheduling (OOO) in Feb 1966– Ideas got buried due to internal politics, changing project

goals, etc.– But it’s still the first (as far as I know)

Page 59: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Modern Enhancements to Tomasulo’s Algorithm

TomasuloPeak IPC = 12 FP FU’sSingle CDBOperand copyingRS TagTag-based forwardingImprecise

ModernPeak IPC = 6+6-10+ FU’sMany forwarding busesRenamed registersRenamed registersTag-based forwardingPrecise (requires ROB)

Machine WidthStructural Deps

Anti-DepsOutput-DepsTrue DepsExceptions

Page 60: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Page 61: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer ArchitectureBalancing Pipeline Stages

Without pipeliningTcyc TIF+TID+TEX+TMEM+TWB

= 31

PipelinedTcyc max{TIF,TID,TEX,TMEM,TWB}

= 9

Speedup= 31 / 9

Can we do better in terms of either performance or efficiency?

IF

ID

EX

MEM

WB

TIF= 6 units

TID= 2 units

TEX= 9 units

TMEM= 5 units

TWB= 9 units

Page 62: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Balancing Pipeline StagesTwo Methods for Stage Quantization:

– Merging of multiple stages– Further subdividing a stage

Recent Trends:– Deeper pipelines (more and more stages)

• Pipeline depth growing more slowly since Pentium 4. Why?

– Multiple pipelines (subpipelines)– Pipelined memory/cache accesses (tricky)

Page 63: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

The Cost of Deeper PipelinesInstruction pipelines are not ideal

i.e. Instructions in different stages can have dependencies

Suppose add 1 2 3nand 3 4 5

F D E M WF D E M W

t0 t1 t2 t3 t4 t5

Inst0

Inst1

RAW!!

F D E M WF D E M W

t0 t1 t2 t3 t4 t5

addnand E Stall

F E MD Stall D

Page 64: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Types of Dependencies and HazardsData Dependence (Both memory and register)

– True dependence (RAW)Instruction must wait for all required input operands

– Anti-Dependence (WAR)Later write must not clobber a still-pending earlier read

– Output dependence (WAW)Earlier write must not clobber already-completed later write

Control Dependence (aka Procedural Dependence)– Conditional branches may change instruction sequence– Instructions after cond. branch depend on outcome

(more exact definition later)

Page 65: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

TerminologyPipeline Hazards:

– Potential violations of program dependences– Must ensure program dependences are not violated

Hazard Resolution:– Static Method: Performed at compiled time in software– Dynamic Method: Performed at run time using hardware

Pipeline Interlock:– Hardware mechanisms for dynamic hazard resolution– Must detect and enforce dependences at run time

Page 66: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer ArchitectureNecessary Conditions for Data Hazards

i:rk_

j:rk_ Reg Write

Reg Write i:_rk

j:rk_ Reg Write

Reg Read i:rk_

j:_rk Reg Read

Reg Write

stage X

stage Y

dist(i,j) dist(X,Y) ??

dist(i,j) > dist(X,Y) ??

WAW Hazard WAR Hazard RAW Hazard

dist(i,j) dist(X,Y) Hazard!!

dist(i,j) > dist(X,Y) Safe

Haz

ard

Dis

tanc

e

Page 67: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Handling Data Hazards

Avoidance (static)– Make sure there are no hazards in the code

Detect and Stall (dynamic)– Stall until earlier instructions finish

Detect and Forward (dynamic)– Get correct value from elsewhere in pipeline

Page 68: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Handling Data Hazards:Avoidance

Programmer/compiler must know implementation details– Insert nops between dependent instructions

add 1 2 3nopnopnand 3 4 5

write R3 in cycle 5

read R3 in cycle 6

Page 69: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Problems with Avoidance

Binary compatability– New implementations may require more nops

Code size– Higher instruction cache footprint– Longer binary load times– Worse in machines that execute multiple instructions /

cycle• Intel Itanium – 25-40% of instructions are nops

Slower execution– CPI=1, but many instructions are nops

Page 70: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Handling Data Hazards:Detect & Stall

Detection– Compare regA & regB with DestReg of preceding insn.

• 3 bit comparators

Stall– Do not advance pipeline register for Fetch/Decode– Pass nop to Execute

Page 71: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Problems with Detect & Stall

CPI increases on every hazard

Are these stalls necessary? Not always!– The new value for R3 is in the EX/Mem register– Reroute the result to the nand

• Called “forwarding” or “bypassing”

Page 72: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Handling Data Hazards:Detect & Forward

Detection– Same as detect and stall, but…

• each possible hazard requires different forwarding paths

Forward– Add data paths for all possible sources– Add mux in front of ALU to select source

“bypassing logic” often a critical path in wide-issue machines– # paths grows quadratically with machine width

Page 73: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Handling Control Hazards

Avoidance (static)– No branches?– Convert branches to predication

• Control dependence becomes data dependence

Detect and Stall (dynamic)– Stop fetch until branch resolves

Speculate and squash (dynamic)– Keep going past branch, throw away instructions if wrong

Page 74: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Avoidance: if-conversion

if (a == b) { x++; y = n / d;}

sub t1 a, bjnz t1, PC+2add x x, #1div y n, d

sub t1 a, badd(t1) x x, #1div(t1) y n, d

sub t1 a, badd t2 x, #1div t3 n, dcmov(t1) x t2cmov(t1) y t3

Page 75: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Handling Control Hazards:Detect & Stall

Detection– In decode, check if opcode is branch or jump

Stall– Hold next instruction in Fetch– Pass noop to Decode

Page 76: CSE502: Computer Architecture Core Pipelining. CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1)

CSE502: Computer Architecture

Problems with Detect & Stall

CPI increases on every branch

Are these stalls necessary? Not always!– Branch is only taken half the time– Assume branch is NOT taken

• Keep fetching, treat branch as noop• If wrong, make sure bad instructions don’t complete