
The Basics: Pipelining

J. Nelson Amaral, University of Alberta

The Pipeline Concept

Bauer p. 32

Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns

Consider the pipeline above with the indicated delays. We want to know what the pipeline throughput and the pipeline latency are.

Pipeline throughput: instructions completed per second.

Pipeline latency: how long it takes to execute a single instruction in the pipeline.

Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns

Pipeline throughput: how often is an instruction completed?

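\[
T = \frac{1\ \text{instr}}{\max[\text{lat(IF)}, \text{lat(ID)}, \text{lat(EX)}, \text{lat(MEM)}, \text{lat(WB)}]}
  = \frac{1\ \text{instr}}{\max[5\,\text{ns}, 4\,\text{ns}, 5\,\text{ns}, 10\,\text{ns}, 4\,\text{ns}]}
  = \frac{1\ \text{instr}}{10\,\text{ns}}
\]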

Pipeline latency: how long does it take to execute an instruction in the pipeline?

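\[
L = \text{lat(IF)} + \text{lat(ID)} + \text{lat(EX)} + \text{lat(MEM)} + \text{lat(WB)}
  = 5\,\text{ns} + 4\,\text{ns} + 5\,\text{ns} + 10\,\text{ns} + 4\,\text{ns} = 28\,\text{ns}
\]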

Is this right?

Simply adding the stage latencies gives the pipeline latency only for an isolated instruction.

[Figure: timing diagram of instructions I1 to I4 in the unbalanced pipeline.]

L(I1) = 28 ns, L(I2) = 33 ns, L(I3) = 38 ns, L(I4) = 43 ns

We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.
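The growing latencies can be reproduced with a small simulation. The sketch below is an illustration and is not taken from Bauer. It assumes that an instruction starts a stage as soon as it has finished its previous stage and the stage is free, and that latency is measured from the start of IF to the end of WB. The function name `instruction_latencies` is made up for this example.

```python
def instruction_latencies(stage_delays, n_instructions):
    """Latency (ns) of each instruction, from the start of its first stage
    to the end of its last stage."""
    stage_free_at = [0] * len(stage_delays)   # time at which each stage becomes free
    latencies = []
    for _ in range(n_instructions):
        t = 0          # time at which this instruction left its previous stage
        start = None   # time at which this instruction entered the first stage
        for s, delay in enumerate(stage_delays):
            begin = max(t, stage_free_at[s])  # wait for the stage to be free
            if start is None:
                start = begin
            t = begin + delay
            stage_free_at[s] = t
        latencies.append(t - start)
    return latencies


# IF, ID, EX, MEM, WB delays in ns
print(instruction_latencies([5, 4, 5, 10, 4], 4))   # [28, 33, 38, 43]
```

Under the same assumptions, five equal stages of 10 ns give a latency of 50 ns for every instruction.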



The slowest pipeline stage also limits the latency!

[Figure: timing diagram of instructions I1 to I4 in the balanced pipeline, with a time axis from 0 to 60 ns.]

L(I1) = L(I2) = L(I3) = L(I4) = 50 ns



How long does it take to execute 20000 instructions in this pipeline? (disregard bubbles caused by branches, cache misses, and hazards)

\[
\text{ExecTime}_{\text{pipe}} = 20000 \times 10\,\text{ns} = 200000\,\text{ns} = 200\,\mu\text{s}
\]

How long would it take using the same modules without pipelining?

\[
\text{ExecTime}_{\text{non-pipe}} = 20000 \times 28\,\text{ns} = 560000\,\text{ns} = 560\,\mu\text{s}
\]

What is the speedup due to pipelining?



The speedup that we got from the pipeline is:

\[
\text{Speedup} = \frac{\text{ExecTime}_{\text{non-pipe}}}{\text{ExecTime}_{\text{pipe}}}
  = \frac{560\,\mu\text{s}}{200\,\mu\text{s}} = 2.8
\]

How can we improve this pipeline design?

We need to reduce the imbalance to increase the clock speed.

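Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM1  MEM2  WB
Delay:  5 ns  4 ns  5 ns  5 ns  5 ns  4 ns

Now we have one more pipeline stage. What is the throughput now?

\[
T = \frac{1\ \text{instr}}{\max[\text{lat(IF)}, \text{lat(ID)}, \text{lat(EX)}, \text{lat(MEM1)}, \text{lat(MEM2)}, \text{lat(WB)}]}
  = \frac{1\ \text{instr}}{\max[5\,\text{ns}, 4\,\text{ns}, 5\,\text{ns}, 5\,\text{ns}, 5\,\text{ns}, 4\,\text{ns}]}
  = \frac{1\ \text{instr}}{5\,\text{ns}}
\]

What is the new latency for a single instruction?

\[
L = 6 \times 5\,\text{ns} = 30\,\text{ns}
\]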


[Figure: timing diagram of instructions I1, I2, ..., I7 in the six-stage pipeline (IF, ID, EX, MEM1, MEM2, WB).]

Pipeline Throughput and Latency

How long does it take to execute 20000 instructions in this pipeline? (disregard bubbles caused by branches, cache misses, etc., for now)

\[
\text{ExecTime}_{\text{pipe}} = 20000 \times 5\,\text{ns} = 100000\,\text{ns} = 100\,\mu\text{s}
\]

What is the speedup that we get from pipelining?

\[
\text{Speedup} = \frac{\text{ExecTime}_{\text{non-pipe}}}{\text{ExecTime}_{\text{pipe}}}
  = \frac{560\,\mu\text{s}}{100\,\mu\text{s}} = 5.6
\]



What have we learned from this example?

1. It is important to balance the delays in the stages of the pipeline

2. The throughput of a pipeline is 1/max(delay).

3. The latency is N × max(delay), where N is the number of stages in the pipeline.
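The three lessons can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name `pipeline_metrics` is hypothetical, and, like the slides, it ignores the time needed to fill the pipeline and any bubbles.

```python
def pipeline_metrics(stage_delays_ns, n_instructions):
    """Return (throughput in instr/ns, latency in ns, execution time in ns)."""
    cycle = max(stage_delays_ns)             # clock period is set by the slowest stage
    throughput = 1 / cycle                   # lesson 2: 1 / max(delay)
    latency = len(stage_delays_ns) * cycle   # lesson 3: N * max(delay)
    exec_time = n_instructions * cycle       # ignores pipeline fill time and bubbles
    return throughput, latency, exec_time


N_INSTR = 20000
five_stage = [5, 4, 5, 10, 4]     # IF, ID, EX, MEM, WB
six_stage = [5, 4, 5, 5, 5, 4]    # IF, ID, EX, MEM1, MEM2, WB
non_pipelined_ns = N_INSTR * sum(five_stage)   # 20000 * 28 ns

for name, delays in [("five-stage", five_stage), ("six-stage", six_stage)]:
    throughput, latency, exec_time = pipeline_metrics(delays, N_INSTR)
    print(name, latency, exec_time, non_pipelined_ns / exec_time)
# five-stage 50 200000 2.8
# six-stage 30 100000 5.6
```

Splitting the 10 ns MEM stage in two halves the clock period, which is why the speedup doubles from 2.8 to 5.6 (lesson 1).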

Execution Snapshot

Bauer p. 33

Pipeline with Control Unit

Bauer p. 34

Data Hazards and Forwarding

Example 1:

i: R7 ← R12 + R15

i+1: R8 ← R7 – R12

i+2: R15 ← R8 + R7

Read-After-Write (RAW) dependencies (true dependencies)

Write-After-Read (WAR) dependencies (anti dependencies)

Bauer p. 35
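The dependencies in Example 1 can be listed mechanically. The sketch below is an illustration only: the tuple encoding `(dest, src1, src2)` and the function name `find_dependencies` are assumptions made for this example. It compares every pair of instructions and does not check whether another instruction in between overwrites the same register, which is sufficient for Example 1.

```python
def find_dependencies(instructions):
    """instructions: list of (dest, src1, src2) register names in program order.
    Returns (raw, war), each a list of (earlier_index, later_index, register)."""
    raw, war = [], []
    for j, (dest_j, *srcs_j) in enumerate(instructions):
        for i in range(j):
            dest_i, *srcs_i = instructions[i]
            if dest_i in srcs_j:      # later instruction reads what the earlier one writes
                raw.append((i, j, dest_i))
            if dest_j in srcs_i:      # later instruction writes what the earlier one reads
                war.append((i, j, dest_j))
    return raw, war


example1 = [
    ("R7", "R12", "R15"),   # i:   R7  <- R12 + R15
    ("R8", "R7", "R12"),    # i+1: R8  <- R7 - R12
    ("R15", "R8", "R7"),    # i+2: R15 <- R8 + R7
]
raw, war = find_dependencies(example1)
print(raw)   # [(0, 1, 'R7'), (0, 2, 'R7'), (1, 2, 'R8')]
print(war)   # [(0, 2, 'R15')]
```

Indices 0, 1, and 2 stand for instructions i, i+1, and i+2.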

Data Hazards and Forwarding

Bauer p. 36

Forwarding

Bauer p. 37

Load-ALU RAW Dependency

Example 2:

i: R6 ← Mem[R2]

i+1: R7 ← R6 + R4

The data from the load is not available until the Mem/WB of instruction i, but it is needed at the ID/EX of instruction i+1.

Cannot forward back in time!

Bauer p. 36

Bubble because of load

Bauer p. 38
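The condition that forces the bubble can be written as a simple check. This is a sketch under stated assumptions: the dictionary encoding of instructions, the function names, and the one-cycle penalty (the usual value for a five-stage pipeline with forwarding) are all assumptions made for this example.

```python
def needs_load_bubble(producer, consumer):
    """True if `consumer` reads the register that the load `producer` writes."""
    return producer["op"] == "load" and producer["dest"] in consumer["srcs"]


def issue_cycles(program):
    """Cycle in which each instruction enters the pipeline, inserting one
    bubble after a load whose result the next instruction needs."""
    cycles, cycle = [], 0
    for k, instr in enumerate(program):
        if k > 0 and needs_load_bubble(program[k - 1], instr):
            cycle += 1          # one-cycle bubble (assumed penalty)
        cycles.append(cycle)
        cycle += 1
    return cycles


example2 = [
    {"op": "load", "dest": "R6", "srcs": ["R2"]},        # i:   R6 <- Mem[R2]
    {"op": "alu",  "dest": "R7", "srcs": ["R6", "R4"]},  # i+1: R7 <- R6 + R4
]
print(issue_cycles(example2))   # [0, 2]
```

Instruction i+1 enters the pipeline one cycle later than it would without the dependency.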

Priority on Forwarding

Example:

i: R10 ← R4 + R5

i+1: R10 ← R4 – R10

i+2: R8 ← R10 + R7

The RAW from i+1 to i+2 must take priority over the RAW from i to i+2.

Bauer p. 38
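The priority rule amounts to checking the most recent producer first. The sketch below is illustrative and not taken from Bauer: the function name and the pipeline register names "EX/MEM" and "MEM/WB" follow the common five-stage naming and are assumptions here.

```python
def select_operand_source(src_reg, ex_mem_dest, mem_wb_dest):
    """Where the EX stage should take operand `src_reg` from.

    ex_mem_dest: destination register of the instruction one ahead (i+1)
    mem_wb_dest: destination register of the instruction two ahead (i)
    """
    if ex_mem_dest == src_reg:
        return "EX/MEM"          # most recent value wins
    if mem_wb_dest == src_reg:
        return "MEM/WB"
    return "register file"


# Instruction i+2 (R8 <- R10 + R7) is in EX; both i and i+1 wrote R10.
print(select_operand_source("R10", ex_mem_dest="R10", mem_wb_dest="R10"))   # EX/MEM
print(select_operand_source("R7", ex_mem_dest="R10", mem_wb_dest="R10"))    # register file
```

Testing the EX/MEM register before the MEM/WB register is what gives the i+1 result priority over the stale i result.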

Forwarding from Mem/WB to Mem

Example:

i: R5 ← Mem[R6]

i+1: Mem[R8] ← R5

After the load, the contents of the Mem/WB register must be forwarded to be written to memory (not only to R5).

Bauer p. 39

Pipelining with Forwarding and Stall

Bauer p. 38

Control Hazards (branches)

Bauer p. 40

Control Hazards: Exceptions and Interruptions

• Exceptions can occur in any stage (except WB)
  – IF: page faults
  – ID: illegal opcodes
  – EX: arithmetic exceptions
  – Mem: illegal address, page faults

• Interruptions:
  – I/O termination, time-outs
  – Power failures

Bauer p. 40

Handling Exceptions/Interruptions

Save the Process State

Schedule Process Restart

Clear Exception Condition

Abort Program

“Correct” Exception

Perform Unrelated Task

Bauer p. 41

Precise Exceptions in a Pipeline

• If an exception happens in instruction i:
• Instructions i-1, i-2, … complete normally and contribute to the saved state of the process
• Instructions i, i+1, i+2, … become no-ops
• After the exception is handled, execution re-starts at instruction i
  – The PC saved is the PC of instruction i.

Bauer p. 41

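[Figure: instruction stream ⋅⋅⋅, i-2, i-1, i, i+1, i+2, ⋅⋅⋅. Instructions before i complete normally; instruction i and those after it become no-ops. The exception happens at instruction i, and execution re-starts at instruction i.]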
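The precise-exception rule can be stated as a partition of the instructions in flight. This is a minimal sketch of the policy only, not of the hardware: the function name, the list-of-PCs encoding, and the example addresses are assumptions made for illustration.

```python
def handle_precise_exception(in_flight_pcs, faulting_index):
    """in_flight_pcs: PCs of the instructions in the pipeline, oldest first.
    faulting_index: position of instruction i, the one that raised the exception.
    Returns (completed, squashed, exception_pc)."""
    completed = in_flight_pcs[:faulting_index]    # i-1, i-2, ... complete normally
    squashed = in_flight_pcs[faulting_index:]     # i, i+1, i+2, ... become no-ops
    exception_pc = in_flight_pcs[faulting_index]  # execution re-starts at instruction i
    return completed, squashed, exception_pc


pcs = [100, 104, 108, 112, 116]   # hypothetical instruction addresses
print(handle_precise_exception(pcs, 2))
# ([100, 104], [108, 112, 116], 108)
```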

Implementing Precise Exceptions in the Pipeline

1. Flag the pipeline register at the right of the stage where the exception was detected
   – This flag moves along the pipeline

2. Set all control lines at a stage with the flag to transform the instruction into a no-op

3. Stop instruction fetching

4. When the flag reaches the Mem/WB stage, save the PC of that instruction as the exception PC

Bauer p. 41

Program Order vs. Temporal Order

[Figure: pipeline diagram with two labeled events: divide-by-zero exception and page-fault exception.]

Which exception occurs first in time?

Which exception should be handled first?

Bauer p. 41

Design Issues:
Can’t avoid Load/ALU instr. bubble
Branch resolution in EX stage → Two-cycle branch penalty
Mem stage unused for ALU instr.

Bauer p. 38

Alternative Pipelining Design: Avoiding the load latency penalty

Example:

i: R4 ← Mem[R8]

i+1: R7 ← R4 + R5

Bauer p. 43

Avoiding the load latency penalty

Example:

i: R4 ← Mem[R8]

i+1: R7 ← R4 + R5

Bauer p. 43

Address Generation Latency Penalty

Example:

i: R5 ← R6 + R7

i+1: R9 ← Mem[R5]

Can’t forward from the future. Has to stall.

Bauer p. 43

Other changes

AG used for branch resolution

AG unused for ALU operations

Bauer p. 43

Tradeoffs:

Avoids load/ALU bubble vs. additional ALU unit
Move branch resolution to AG → same penalty
AG stage unused for ALU operations
Stalls for ALU/Store instr. dependency

Bauer p. 43

Which one is better?

MIPS

Intel 486

Bauer p. 44

Pipelining Functional Units: the EX stage

• Parameters of interest:
  – number of stages
  – minimum number of cycles before two independent (no RAW) instructions of the same type can enter the functional unit

Bauer p. 44

Single-Precision Floating Point Representation

Most standard floating point representations use:
 1 bit for the sign (positive or negative)
 8 bits for the range (exponent field)
23 bits for the precision (fraction field)

Field:  S (sign)  E (exponent)  F (fraction)
Bits:   1         8             23

\[
N = \begin{cases}
(-1)^{S} \times 1.\text{fraction} \times 2^{\text{exponent}-127}, & 1 \le \text{exponent} \le 254 \\
(-1)^{S} \times 0.\text{fraction} \times 2^{\text{exponent}-126}, & \text{exponent} = 0
\end{cases}
\]

From: Patt and Patel, pp. 33; P-H. p. 245; Bauer p. 45
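The two cases of the formula translate directly into code. The sketch below is an illustration: the function name `decode_single` is made up, and the input is assumed to be the 32-bit pattern given as an integer. The standard `struct` module is used only to cross-check the result against the machine's own interpretation of the bits.

```python
import struct


def decode_single(bits):
    """Value of a 32-bit single-precision pattern, for exponents 0 to 254."""
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF
    fraction = (bits & 0x7FFFFF) / 2**23           # the 23 fraction bits as 0.fraction
    if 1 <= exponent <= 254:
        magnitude = (1 + fraction) * 2.0 ** (exponent - 127)   # 1.fraction
    elif exponent == 0:
        magnitude = fraction * 2.0 ** (-126)                   # 0.fraction
    else:
        raise ValueError("exponent 255 is not covered by this formula")
    return -magnitude if sign else magnitude


bits = 0xC0200000                                   # S=1, E=128, F=0.25
print(decode_single(bits))                          # -2.5
print(struct.unpack(">f", struct.pack(">I", bits))[0])   # -2.5
```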

Special Floating Point Representations

In the 8-bit field of the exponent we can represent numbers from 0 to 255. We studied how to read numbers with exponents from 0 to 254. What is the value represented when the exponent is 255 (i.e. 11111111₂)?

An exponent equal to 255 = 11111111₂ in a floating point representation indicates a special value.

When the exponent is equal to 255 = 11111111₂ and the fraction is 0, the value represented is infinity.

When the exponent is equal to 255 = 11111111₂ and the fraction is non-zero, the value represented is Not a Number (NaN).

Hen/Patt, pp. 301; P-H. p. 246; Bauer p. 45
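The special values can be recognized from the exponent and fraction fields alone. A minimal sketch, with a hypothetical function name and the 32-bit pattern passed as an integer:

```python
def classify_single(bits):
    """Classify a 32-bit single-precision pattern by its exponent and fraction."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent != 255:
        return "finite"
    return "infinity" if fraction == 0 else "NaN"


print(classify_single(0x7F800000))   # infinity  (exponent 255, fraction 0)
print(classify_single(0x7FC00000))   # NaN       (exponent 255, fraction non-zero)
print(classify_single(0x3F800000))   # finite    (exponent 127)
```

The sign bit is ignored here, so both positive and negative infinity are reported as "infinity".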

Floating Point Addition

Operands: (S1, E1, F1) and (S2, E2, F2)

[Figure: flowchart of the addition, divided into Stage 1, Stage 2-3, and Stage 4. The steps recovered from the figure are listed below.]

E1 < E2? If yes: swap operands
Insert 1 to left of F1 and to left of F2
S1 ≠ S2? If yes: replace F2 by its 2-complement
D = E1 – E2
F2 ← F2 >> D
add mantissas
Normalize and round off

Bauer p. 46
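The steps of the addition can be sketched in software. The code below is a simplified illustration, not the hardware algorithm from Bauer. It assumes normalized operands given as (S, E, F) triples with a 23-bit fraction, truncates instead of rounding, does not handle exponent overflow, underflow, or special values, and uses Python's signed integers in place of an explicit 2-complement. All names are made up for this example.

```python
import struct

FRACTION_BITS = 23
HIDDEN_ONE = 1 << FRACTION_BITS


def fp_add(a, b):
    """Add two normalized single-precision numbers given as (S, E, F) triples."""
    (s1, e1, f1), (s2, e2, f2) = a, b
    if e1 < e2:                                   # swap so operand 1 has the larger exponent
        (s1, e1, f1), (s2, e2, f2) = (s2, e2, f2), (s1, e1, f1)
    d = e1 - e2
    m1 = HIDDEN_ONE | f1                          # insert 1 to the left of F1
    m2 = (HIDDEN_ONE | f2) >> d                   # insert 1, then align (truncating)
    total = (-m1 if s1 else m1) + (-m2 if s2 else m2)   # add signed mantissas
    if total == 0:
        return (0, 0, 0)
    sign = 1 if total < 0 else 0
    mantissa, exponent = abs(total), e1
    while mantissa >= (HIDDEN_ONE << 1):          # normalize: result too large
        mantissa >>= 1
        exponent += 1
    while mantissa < HIDDEN_ONE:                  # normalize: leading bits cancelled
        mantissa <<= 1
        exponent -= 1
    return (sign, exponent, mantissa & (HIDDEN_ONE - 1))


def to_float(sef):
    """Pack an (S, E, F) triple into 32 bits and read it back as a float."""
    s, e, f = sef
    bits = (s << 31) | (e << FRACTION_BITS) | f
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# 1.5 = (0, 127, 0x400000) and 2.25 = (0, 128, 0x100000)
print(to_float(fp_add((0, 127, 0x400000), (0, 128, 0x100000))))   # 3.75
```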
