The Basics: Pipelining
J. Nelson Amaral, University of Alberta
Post on 21-Dec-2015
Pipeline Throughput and Latency
IF    ID    EX    MEM    WB
5 ns  4 ns  5 ns  10 ns  4 ns
Consider the pipeline above with the indicated delays. We want to know the pipeline throughput and the pipeline latency.
Pipeline throughput: instructions completed per second.
Pipeline latency: how long does it take to execute a single instruction in the pipeline.
Pipeline throughput: how often is an instruction completed?
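\[
T = \frac{1\ \text{instr}}{\max[\mathrm{lat}(IF), \mathrm{lat}(ID), \mathrm{lat}(EX), \mathrm{lat}(MEM), \mathrm{lat}(WB)]}
  = \frac{1\ \text{instr}}{\max[5\,\text{ns}, 4\,\text{ns}, 5\,\text{ns}, 10\,\text{ns}, 4\,\text{ns}]}
  = \frac{1\ \text{instr}}{10\,\text{ns}}
\]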
Pipeline latency: how long does it take to execute an instruction in the pipeline?
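\[
L = \mathrm{lat}(IF) + \mathrm{lat}(ID) + \mathrm{lat}(EX) + \mathrm{lat}(MEM) + \mathrm{lat}(WB)
  = 5\,\text{ns} + 4\,\text{ns} + 5\,\text{ns} + 10\,\text{ns} + 4\,\text{ns}
  = 28\,\text{ns}
\]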
Is this right?
Simply adding the stage latencies gives the pipeline latency only for an isolated instruction.
[Timing diagram: instructions I1 to I4 overlapped in the unbalanced pipeline, each one waiting for the 10 ns MEM stage to become free.]
L(I1) = 28 ns
L(I2) = 33 ns
L(I3) = 38 ns
L(I4) = 43 ns
We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.
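The growing latencies can be reproduced with a small simulation. This is only a sketch under a simplifying assumption that is not stated on the slide: an instruction enters a stage as soon as it has finished the previous stage and the previous instruction has finished with that stage. The function name `unbalanced_latencies` is made up for this illustration.

```python
STAGE_DELAYS_NS = [5, 4, 5, 10, 4]  # IF, ID, EX, MEM, WB


def unbalanced_latencies(delays, n_instructions):
    """Per-instruction latency (ns) when stages are not padded to equal length.

    Assumption: an instruction enters a stage once it has left the previous
    stage and the preceding instruction has finished using that stage.
    """
    prev_finish = [0] * len(delays)  # when the previous instruction left each stage
    latencies = []
    for _ in range(n_instructions):
        finish = []
        t = 0          # time at which this instruction left its previous stage
        start = 0      # time at which this instruction entered IF
        for stage, delay in enumerate(delays):
            begin = max(t, prev_finish[stage])
            if stage == 0:
                start = begin
            t = begin + delay
            finish.append(t)
        latencies.append(t - start)
        prev_finish = finish
    return latencies


print(unbalanced_latencies(STAGE_DELAYS_NS, 4))  # [28, 33, 38, 43]
```

Each new instruction waits 5 ns longer than the one before it, because it reaches the 10 ns MEM stage 5 ns before that stage is free again.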
The slowest pipeline stage also limits the latency!
[Timing diagram: I1 to I4 advance one stage every 10 ns; time axis from 0 to 60 ns.]
L(I1) = L(I2) = L(I3) = L(I4) = 50 ns
How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, and hazards.)
How long would it take using the same modules without pipelining?
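\[
\text{ExecTime}_{\text{pipe}} = 20000 \times 10\,\text{ns} = 200000\,\text{ns} = 200\,\mu\text{s}
\]
\[
\text{ExecTime}_{\text{non-pipe}} = 20000 \times 28\,\text{ns} = 560000\,\text{ns} = 560\,\mu\text{s}
\]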
What is the speedup due to pipelining?
The speedup that we got from the pipeline is:
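\[
\text{Speedup}_{\text{pipe}} = \frac{\text{ExecTime}_{\text{non-pipe}}}{\text{ExecTime}_{\text{pipe}}} = \frac{560\,\mu\text{s}}{200\,\mu\text{s}} = 2.8
\]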
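The arithmetic for the five-stage pipeline can be checked with a few lines of Python. This is a sketch of the same steady-state estimate the slides use, so it ignores the time needed to fill the pipeline; the function names are chosen here for illustration.

```python
STAGE_DELAYS_NS = {"IF": 5, "ID": 4, "EX": 5, "MEM": 10, "WB": 4}
N_INSTRUCTIONS = 20_000


def pipelined_time_ns(delays, n):
    """Steady-state estimate: one instruction completes every max(delay) ns."""
    return n * max(delays.values())


def unpipelined_time_ns(delays, n):
    """Each instruction goes through every module before the next one starts."""
    return n * sum(delays.values())


pipe_ns = pipelined_time_ns(STAGE_DELAYS_NS, N_INSTRUCTIONS)
nonpipe_ns = unpipelined_time_ns(STAGE_DELAYS_NS, N_INSTRUCTIONS)
speedup = nonpipe_ns / pipe_ns

print(f"pipelined:     {pipe_ns / 1000:.0f} us")     # 200 us
print(f"non-pipelined: {nonpipe_ns / 1000:.0f} us")  # 560 us
print(f"speedup:       {speedup:.1f}")               # 2.8
```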
How can we improve this pipeline design?
We need to reduce the imbalance between stages in order to increase the clock speed.
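IF    ID    EX    MEM1  MEM2  WB
5 ns  4 ns  5 ns  5 ns  5 ns  4 ns

Now we have one more pipeline stage. What is the throughput now?

\[
T = \frac{1\ \text{instr}}{\max[\mathrm{lat}(IF), \mathrm{lat}(ID), \mathrm{lat}(EX), \mathrm{lat}(MEM1), \mathrm{lat}(MEM2), \mathrm{lat}(WB)]}
  = \frac{1\ \text{instr}}{\max[5\,\text{ns}, 4\,\text{ns}, 5\,\text{ns}, 5\,\text{ns}, 5\,\text{ns}, 4\,\text{ns}]}
  = \frac{1\ \text{instr}}{5\,\text{ns}}
\]

What is the new latency for a single instruction?

\[
L = 6 \times 5\,\text{ns} = 30\,\text{ns}
\]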
[Timing diagram: instructions I1 to I7 flowing through IF, ID, EX, MEM1, MEM2, WB, with a new instruction entering the pipeline every 5 ns.]
How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, etc., for now.)
\[
\text{ExecTime}_{\text{pipe}} = 20000 \times 5\,\text{ns} = 100000\,\text{ns} = 100\,\mu\text{s}
\]
What is the speedup that we get from pipelining?
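\[
\text{Speedup}_{\text{pipe}} = \frac{\text{ExecTime}_{\text{non-pipe}}}{\text{ExecTime}_{\text{pipe}}} = \frac{560\,\mu\text{s}}{100\,\mu\text{s}} = 5.6
\]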
What have we learned from this example?
1. It is important to balance the delays in the stages of the pipeline
2. The throughput of a pipeline is 1/max(delay).
3. The latency is N × max(delay), where N is the number of stages in the pipeline.
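The three lessons can be applied side by side to the two designs in this example. The helper below is a sketch with a made-up name (`pipeline_metrics`); it assumes, as the slides do, that the unpipelined time per instruction is the sum of the stage delays and that bubbles and pipeline fill time are ignored.

```python
def pipeline_metrics(delays_ns):
    """Clock period, throughput, latency and speedup for a clocked pipeline."""
    cycle_ns = max(delays_ns)            # the slowest stage sets the clock
    return {
        "cycle_ns": cycle_ns,
        "throughput_instr_per_ns": 1 / cycle_ns,
        "latency_ns": len(delays_ns) * cycle_ns,
        # Unpipelined time per instruction is assumed to be sum(delays_ns).
        "speedup_vs_unpipelined": sum(delays_ns) / cycle_ns,
    }


five_stage = pipeline_metrics([5, 4, 5, 10, 4])     # IF ID EX MEM WB
six_stage = pipeline_metrics([5, 4, 5, 5, 5, 4])    # IF ID EX MEM1 MEM2 WB

for name, m in (("5-stage", five_stage), ("6-stage", six_stage)):
    print(name, m["cycle_ns"], m["latency_ns"], round(m["speedup_vs_unpipelined"], 1))
```

Splitting MEM halves the clock period, which both shortens the latency (50 ns to 30 ns) and doubles the speedup (2.8 to 5.6), even though the pipeline is one stage longer.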
Data Hazards and Forwarding
Example 1:
i: R7 ← R12 + R15
i+1: R8 ← R7 – R12
i+2: R15 ← R8 + R7
Read-After-Write (RAW) dependencies (true dependencies)
Write-After-Read (WAR) dependencies (anti dependencies)
Bauer p. 35
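The dependencies in Example 1 can be listed mechanically. The sketch below encodes each instruction as a (destination, sources) pair, a representation chosen here for illustration, and compares every earlier instruction with every later one. It is a naive check that does not account for an intervening instruction redefining a register, which does not occur in this example.

```python
# (destination, sources) for instructions i, i+1, i+2 of Example 1
PROGRAM = [
    ("R7", ("R12", "R15")),   # i:   R7  <- R12 + R15
    ("R8", ("R7", "R12")),    # i+1: R8  <- R7 - R12
    ("R15", ("R8", "R7")),    # i+2: R15 <- R8 + R7
]


def find_dependencies(program):
    """Return (raw, war) lists of (earlier index, later index, register)."""
    raw, war = [], []
    for early in range(len(program)):
        for late in range(early + 1, len(program)):
            early_dst, early_srcs = program[early]
            late_dst, late_srcs = program[late]
            if early_dst in late_srcs:      # later instruction reads what earlier wrote
                raw.append((early, late, early_dst))
            if late_dst in early_srcs:      # later instruction overwrites what earlier read
                war.append((early, late, late_dst))
    return raw, war


raw, war = find_dependencies(PROGRAM)
print("RAW:", raw)  # [(0, 1, 'R7'), (0, 2, 'R7'), (1, 2, 'R8')]
print("WAR:", war)  # [(0, 2, 'R15')]
```

Index 0 stands for instruction i, 1 for i+1, and 2 for i+2.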
Load-ALU RAW Dependency
Example 2:
i: R6 ← Mem[R2]
i+1: R7 ← R6 + R4
The data from the load is not available until the Mem/WB of instruction i, but it is needed at the ID/EX of instruction i+1.
Cannot forward back in time!
Bauer p. 36
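A hazard-detection check for this case can be sketched as follows. The dictionary encoding of instructions and the function name are assumptions made for this illustration; the sketch also assumes the usual one-cycle bubble when a load is immediately followed by an instruction that reads its destination register.

```python
def load_use_stalls(program):
    """Count bubbles caused by a load immediately followed by a consumer.

    Assumes one bubble per load-use pair, with forwarding covering
    every other RAW dependency.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev["op"] == "load" and prev["dst"] in curr["srcs"]:
            stalls += 1
    return stalls


EXAMPLE_2 = [
    {"op": "load", "dst": "R6", "srcs": ("R2",)},       # i:   R6 <- Mem[R2]
    {"op": "alu", "dst": "R7", "srcs": ("R6", "R4")},   # i+1: R7 <- R6 + R4
]

print(load_use_stalls(EXAMPLE_2))  # 1
```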
Priority on Forwarding
Example:
i: R10 ← R4 + R5
i+1: R10 ← R4 – R10
i+2: R8 ← R10 + R7
The RAW from i+1 to i+2 must take priority over the RAW from i to i+2.
Bauer p. 38
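The priority rule amounts to checking the most recent producer first. The sketch below assumes the conventional five-stage arrangement in which, when i+2 needs its operand, the result of i+1 sits in the EX/MEM pipeline register and the result of i sits in MEM/WB; the function and parameter names are illustrative.

```python
def forwarding_source(register, ex_mem_dst, mem_wb_dst):
    """Pick where an operand should come from, newest producer first."""
    if ex_mem_dst == register:      # produced by the immediately preceding instruction
        return "EX/MEM"
    if mem_wb_dst == register:      # produced two instructions earlier
        return "MEM/WB"
    return "register file"


# Instruction i+2 reads R10: i+1 (in EX/MEM) and i (in MEM/WB) both wrote R10.
print(forwarding_source("R10", ex_mem_dst="R10", mem_wb_dst="R10"))  # EX/MEM
```

Checking EX/MEM before MEM/WB is what gives i+2 the value written by i+1 rather than the stale value written by i.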
Forwarding from Mem/WB to Mem
Example:
i: R5 ← Mem[R6]
i+1: Mem[R8] ← R5
After the load, the contents of the Mem/WB register must be forwarded to be written to memory (not only to R5).
Bauer p. 39
Control Hazards: Exceptions and Interruptions
• Exceptions can occur in any stage (except WB)
– IF: page faults
– ID: illegal opcodes
– EX: arithmetic exceptions
– Mem: illegal address, page faults
• Interruptions:
– I/O termination, time-outs
– Power failures
Bauer p. 40
Handling Exceptions/Interruptions
[Flowchart with the boxes: Save the Process State; Abort Program; “Correct” Exception; Perform Unrelated Task; Clear Exception Condition; Schedule Process Restart.]
Bauer p. 41
Precise Exceptions in a Pipeline
• If an exception happens in instruction i:
• Instructions i-1, i-2, … complete normally and contribute to the saved state of the process
• Instructions i, i+1, i+2, … become no-ops
• After the exception is handled, execution re-starts at instruction i
– The PC saved is the PC of instruction i.
Bauer p. 41
⋅⋅⋅, i-2, i-1: complete normally
i, i+1, i+2, ⋅⋅⋅: become no-ops
Exception happens at instruction i; execution re-starts at instruction i.
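The precise-exception rule can be written as a split of the in-flight instructions around the faulting one. This is a sketch of the rule only, not of the hardware; the function name and the list representation (oldest instruction first) are assumptions for illustration.

```python
def split_on_exception(in_flight, faulting):
    """Apply the precise-exception rule to instructions listed oldest first.

    Returns (instructions that complete, instructions turned into no-ops,
    instruction whose PC is saved and where execution re-starts).
    """
    position = in_flight.index(faulting)
    completed = in_flight[:position]    # older instructions finish normally
    squashed = in_flight[position:]     # faulting and younger ones become no-ops
    return completed, squashed, faulting


completed, squashed, restart = split_on_exception(
    ["i-2", "i-1", "i", "i+1", "i+2"], faulting="i"
)
print(completed)  # ['i-2', 'i-1']
print(squashed)   # ['i', 'i+1', 'i+2']
print(restart)    # i
```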
Implementing Precise Exceptions in the Pipeline
1. Flag the pipeline register at the right of the stage where the exception was detected
– This flag moves along the pipeline
2. Set all control lines at a stage with the flag to transform the instruction into a no-op
3. Stop instruction fetching
4. When the flag reaches the Mem/WB stage, save the PC of that instruction as the exception PC
Bauer p. 41
Program Order vs. Temporal Order
[Pipeline diagram with two labelled events: a divide-by-zero exception and a page-fault exception.]
Which exception occurs first in time?
Which exception should be handled first?
Bauer p. 41
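The two questions can be explored with a toy example. The original figure is not recoverable from this transcript, so the scenario below is hypothetical: instruction i raises a divide-by-zero in EX at cycle 3, while the younger instruction i+1 raises a page fault in IF at cycle 2. The sketch also assumes that, to keep exceptions precise, they are handled in program order.

```python
# Hypothetical events: (position in program order, cycle detected, description)
EVENTS = [
    (0, 3, "divide-by-zero in EX of instruction i"),
    (1, 2, "page fault in IF of instruction i+1"),
]

first_in_time = min(EVENTS, key=lambda event: event[1])   # earliest cycle
first_handled = min(EVENTS, key=lambda event: event[0])   # earliest in program order

print("occurs first: ", first_in_time[2])
print("handled first:", first_handled[2])
```

Under these assumptions the page fault is detected first, but the divide-by-zero belongs to the older instruction and is handled first. Waiting until the flag reaches Mem/WB before saving the exception PC is one way to obtain this ordering.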
Design Issues:
• Can’t avoid Load/ALU instr. bubble
• Branch resolution in EX stage → two-cycle branch penalty
• Mem stage unused for ALU instr.
Bauer p. 38
Alternative Pipelining Design: Avoiding the load latency penalty
Example:
i: R4 ← Mem[R8]
i+1: R7 ← R4 + R5
Bauer p. 43
Address Generation Latency Penalty
Example:
i: R5 ← R6 + R7
i+1: R9 ← Mem[R5]
Can’t forward from the future. Has to stall.
Bauer p. 43
Tradeoffs:
• Avoids load/ALU bubble vs. additional ALU unit
• Move branch resolution to AG → same penalty
• AG stage unused for ALU operations
• Stalls for ALU/Store instr. dependency
Bauer p. 43
Pipelining Functional Units: the EX stage
• Parameters of interest:
– number of stages
– minimum number of cycles before two independent (no RAW) instructions of the same type can enter the functional unit
Bauer p. 44
Single-Precision Floating Point Representation
Most standard floating point representations use:
• 1 bit for the sign (positive or negative)
• 8 bits for the range (exponent field)
• 23 bits for the precision (fraction field)
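S (sign): 1 bit | E (exponent): 8 bits | F (fraction): 23 bits

\[
N =
\begin{cases}
(-1)^{S} \times 1.\text{fraction} \times 2^{\text{exponent}-127}, & 1 \le \text{exponent} \le 254 \\
(-1)^{S} \times 0.\text{fraction} \times 2^{-126}, & \text{exponent} = 0
\end{cases}
\]

From: Patt and Patel, pp. 33; P-H, p. 245; Bauer p. 45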
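The two cases of the formula can be tried out directly on a 32-bit pattern. The decoder below is a sketch written for this illustration (the names `decode_single` and `bits_of` are not from the slides); it uses Python's standard `struct` module only to obtain the bit pattern of a known value for comparison.

```python
import struct


def decode_single(bits):
    """Decode a 32-bit pattern using the two cases of the formula above."""
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if 1 <= exponent <= 254:            # normalized: implicit leading 1
        return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
    if exponent == 0:                   # denormalized: implicit leading 0
        return (-1) ** sign * (fraction / 2**23) * 2.0 ** -126
    raise ValueError("exponent 255 encodes a special value")


def bits_of(x):
    """Bit pattern of x stored as a single-precision float."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


print(decode_single(0x3F800000))          # 1.0
print(decode_single(0xC0400000))          # -3.0
print(decode_single(bits_of(0.15625)))    # 0.15625
```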
Special Floating Point Representations
In the 8-bit field of the exponent we can represent numbers from 0 to 255. We studied how to read numbers with exponents from 0 to 254. What is the value represented when the exponent is 255 (i.e. 11111111₂)?
An exponent equal to 255 = 11111111₂ in a floating point representation indicates a special value.
When the exponent is equal to 255 = 11111111₂ and the fraction is 0, the value represented is infinity.
When the exponent is equal to 255 = 11111111₂ and the fraction is non-zero, the value represented is Not a Number (NaN).
Hen/Patt, pp. 301; P-H, p. 246; Bauer p. 45
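The special values can be recognized from the exponent and fraction fields alone. The classifier below is a sketch with an illustrative name; the cross-check at the end uses Python's standard `struct` and `math` modules to interpret the same bit patterns as single-precision floats.

```python
import math
import struct


def classify_single(bits):
    """Classify a 32-bit pattern as finite, infinity, or NaN."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent != 255:
        return "finite"
    return "infinity" if fraction == 0 else "NaN"


def as_float(bits):
    """Interpret a 32-bit pattern as a single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


for pattern in (0x7F800000, 0xFF800000, 0x7FC00000, 0x3F800000):
    print(f"{pattern:#010x}", classify_single(pattern), as_float(pattern))
```

The sign bit still applies when the exponent is 255 and the fraction is 0, so both positive and negative infinity can be represented.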