Pipelining (Chapter 8)
TU-Delft TI1400/12-PDS
http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_8.ppt
Course website: http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_results.htm
Basic idea (1)
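[Figure: sequential execution. Instructions I1 to I4 are handled one after another along the time axis: F1 E1 F2 E2 F3 E3 F4 E4.]
[Figure: hardware organization. Instruction fetch unit -> buffer B1 -> Execution unit.]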
Basic idea (2): Overlap
[Figure: pipelined execution of I1 to I4 over clock cycles 1 to 5. The fetch of each instruction overlaps with the execution of its predecessor: F2 with E1, F3 with E2, F4 with E3.]
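The gain from overlapping fetch and execute can be checked with a few lines of Python. This is only a sketch under the assumption that every step takes exactly one clock cycle and nothing stalls; the function names `total_cycles` and `schedule` are made up for this illustration.

```python
def total_cycles(num_instructions, num_stages=2, pipelined=True):
    """Cycles needed to run the instructions when no stage ever stalls."""
    if pipelined:
        # The first instruction fills the pipeline; each later one adds a cycle.
        return num_stages + num_instructions - 1
    # Without overlap every instruction occupies all stages by itself.
    return num_stages * num_instructions


def schedule(num_instructions, stages=("F", "E")):
    """Map each instruction number to the clock cycle of each of its steps."""
    return {
        i: {stage: i + offset for offset, stage in enumerate(stages)}
        for i in range(1, num_instructions + 1)
    }


if __name__ == "__main__":
    print(total_cycles(4, pipelined=False))  # sequential execution
    print(total_cycles(4))                   # overlapped execution
    print(schedule(4)[2])                    # cycles of F2 and E2
```

For the four instructions of the slide, the overlapped version finishes in clock cycle 5, while the sequential version would need 8 cycles.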
Instruction phases
• F Fetch instruction
• D Decode instruction and fetch operands
• O Perform operation
• W Write result
Four-stage pipeline
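[Figure: four-stage pipelined execution of I1 to I4 along the clock-cycle axis. Each instruction passes through F, D, O and W in consecutive cycles, and each instruction starts one cycle after its predecessor.]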
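The staircase shape of the four-stage timing diagram can be reproduced in text form. The sketch below assumes an ideal pipeline (one cycle per stage, no stalls); `diagram` and `STAGES` are names chosen here for illustration only.

```python
STAGES = ("F", "D", "O", "W")


def diagram(num_instructions, stages=STAGES):
    """Return one text row per instruction; column c stands for clock cycle c+1."""
    total = len(stages) + num_instructions - 1
    rows = []
    for i in range(1, num_instructions + 1):
        cells = ["--"] * total
        for offset, stage in enumerate(stages):
            # Instruction i enters the first stage in clock cycle i.
            cells[i - 1 + offset] = f"{stage}{i}"
        rows.append(f"I{i}: " + " ".join(cells))
    return rows


if __name__ == "__main__":
    for row in diagram(4):
        print(row)
```

Each row is shifted one column to the right of the previous one, which is the overlap the figure shows.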
Hardware organization (1)
[Figure: Fetch unit -> B1 -> Decode and fetch operands unit -> B2 -> Operation unit -> B3 -> Write unit.]
Hardware organization (2)
During cycle 4, the buffers contain:
• B1:
  - instruction I3
• B2:
  - the source operands of I2
  - the specification of the operation
  - the specification of the destination operand
• B3:
  - the result of the operation of I1
  - the specification of the destination operand
Hardware organization (3)
[Figure: the same four units and three buffers during cycle 4. B1 holds I3, B2 holds the operands and the operation of I2, B3 holds the result of I1.]
Pipeline stall (1)
• Pipeline stall: delay in a stage of the pipeline due to an instruction
• Reasons for pipeline stall:
  - Cache miss
  - Long operation (for example, division)
  - Dependency between successive instructions
  - Branching
Pipeline stall (2): Cache miss
[Figure: four-stage execution of I1 to I3 over clock cycles 1 to 8, with a cache miss in I2.]
Pipeline stall (3): Cache miss
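[Figure: activity per stage (F, D, O, W) over clock cycles 1 to 8. The fetch stage stays busy with F2 for several cycles before F3; meanwhile the D, O and W stages are each idle for three cycles between instruction 1 and instruction 2.]
Effect of cache miss in F2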
Pipeline stall (4): Long operation
[Figure: four-stage execution of I1 to I4 over clock cycles 1 to 8, in which one operation step takes more than one cycle.]
Pipeline stall (5): Dependencies
• Instructions:
    ADD R1, 3(R1)
    ADD R4, 4(R1)
  cannot be done in parallel
• Instructions:
    ADD R2, 3(R1)
    ADD R4, 4(R3)
  can be done in parallel
Pipeline stall (6): Branch
[Figure: along the time axis, the branch instruction Ii runs Fi Ei; only afterwards the branch target Ik runs Fk Ek.]
Pipeline stall due to branch: only start fetching instructions after the branch has been executed
Data dependency (1): example
MUL R2,R3,R4 /* R4 destination */
ADD R5,R4,R6 /* R6 destination */
New value of R4 must be available before ADD instruction uses it
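The check that finds this kind of dependency can be sketched as follows. The `Instr` record and the `raw_hazard` function are hypothetical helpers, not part of the course material; they follow the convention of the example above, where the last operand is the destination.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    sources: Tuple[str, ...]  # registers that are read
    dest: str                 # register that is written


def raw_hazard(first, second):
    """True if `second` reads the register that `first` writes."""
    return first.dest in second.sources


mul = Instr("MUL", ("R2", "R3"), "R4")  # MUL R2,R3,R4  (R4 destination)
add = Instr("ADD", ("R5", "R4"), "R6")  # ADD R5,R4,R6  (R6 destination)

if __name__ == "__main__":
    print(raw_hazard(mul, add))
```

The ADD reads R4, which the MUL writes, so the pair is flagged; in the reverse order there would be no such dependency.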
Data dependency (2): example
[Figure: four-stage execution of I1 (MUL), I2 (ADD), I3 and I4 along the time axis; each instruction has the steps F, D, O, W.]
Pipeline stall due to data dependence between W1 and D2
Branching: Instruction queue
[Figure: Fetch -> instruction queue -> Dispatch -> Operation -> Write.]
Idling at branch
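[Figure: two-step (F, E) execution along the time axis. Ij is a branch (Fj Ej); Ij+1 is fetched (Fj+1) but not executed; the execution unit is idle for one cycle; then the branch target Ik (Fk Ek) and Ik+1 (Fk+1 Ek+1) follow.]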
Branch with instruction queue
[Figure: two-step (F, E) execution of I1, I2, I3, I4 and Ij to Ij+3 along the time axis, with one instruction marked as a branch. I4 is fetched (F4) but discarded; execution continues with Ij.]
Branch folding: execute a later branch instruction simultaneously (i.e., compute target)
Delayed branch (1): reordering
Original (always lose a cycle):
LOOP  Shift_left   R1
      Decrement    R2
      Branch_if>0  LOOP
NEXT  Add          R1,R3

Reordered (the instruction after the branch is always executed):
LOOP  Decrement    R2
      Branch_if>0  LOOP
      Shift_left   R1
NEXT  Add          R1,R3
Delayed branch (2): execution timing
[Figure: execution timing of the reordered loop with two-step (F, E) instructions, in the order Decrement, Branch, Shift, Decrement, Branch, Shift, Add.]
Branch prediction (1)
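[Figure: I1 (Compare) runs F1 D1 E1 W1; I2 (Branch-if>) runs F2 and E2; the speculatively fetched I3 (F3 D3 E3) and I4 (F4 D4) are cancelled (X); fetching continues at Ik (Fk Dk).]
Effect of incorrect branch prediction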
Branch prediction (2)
Possible implementation:
- use a single bit
- bit records previous choice of branch
- bit tells from which location to fetch next instructions
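The single-bit scheme can be sketched in software as follows. This is an illustration only: the class `OneBitPredictor`, the helper `count_mispredictions` and the initial value of the bit are assumptions, and a real predictor is implemented in hardware, typically with one such bit per branch.

```python
class OneBitPredictor:
    """Predicts that a branch behaves the same way as last time."""

    def __init__(self, initial_taken=False):
        self.last_taken = initial_taken  # the single prediction bit

    def predict(self):
        # The bit tells from which location to fetch next.
        return self.last_taken

    def update(self, taken):
        # The bit records the previous choice of the branch.
        self.last_taken = taken


def count_mispredictions(outcomes, predictor=None):
    """Run the predictor over a sequence of actual branch outcomes."""
    predictor = predictor or OneBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    # A loop-closing branch: taken three times, then falls through.
    print(count_mispredictions([True, True, True, False]))
```

For a loop-closing branch the bit is wrong on loop exit and again on the first iteration of the next run of the loop, so each run of the loop costs two mispredictions in this sketch.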
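Data paths of CPU (1)
[Figure: the register file drives the Source 1 and Source 2 buses into registers SRC1 and SRC2, which feed the ALU; the ALU result goes to register RSLT and via the Destination bus back to the register file.]
Operand forwarding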
Data paths of CPU (2)
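[Figure: in the Operation stage, SRC1 and SRC2 feed the ALU, which fills RSLT; in the Write stage, RSLT is written to the register file; a forwarding data path leads from RSLT back to the ALU inputs.]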
Pipelined operation
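I1: Add R1, R2, R3
I2: Shift_left R3
[Figure: four-stage execution of I1 (Add), I2 (Shift), I3 and I4. I1 computes R1 + R2 and then writes R3; I2 shifts R3 and then writes R3; I3 and I4 run F D O W.]
Result of Add has to be available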
Short pipeline
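[Figure: I1 runs F, computes R1 + R2 and writes R3; I2 runs F and D, then receives the forwarded result and shifts it (fwd, shift) before writing R3; I3 runs F D O W.]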
Long pipeline
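[Figure: pipeline with three operation steps. I1 runs F D O1 O2 O3 W; its result is forwarded (fwd) to I2, which runs F, D, O1 O2 O3, W; I3 runs F D O1 O2 O3 W.]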
Compiler solution
I1: Add R1, R2, R3
I2: Shift_left R3

Insert no-operations to wait for result:

I1: Add R1, R2, R3
    NOP
    NOP
I2: Shift_left R3
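A compiler pass of this kind could look roughly like the sketch below. It assumes that a result becomes usable two instruction slots after the instruction that produces it (the `delay` parameter, matching the two NOPs above); the tuple layout and the name `insert_nops` are made up for this illustration.

```python
NOP = ("NOP", (), None)


def insert_nops(program, delay=2):
    """program: list of (text, source_registers, destination_register)."""
    out = []
    last_write = {}  # register -> index in `out` of the instruction writing it
    for text, sources, dest in program:
        needed = 0
        for reg in sources:
            if reg in last_write:
                # Slots already sitting between the producer and this instruction.
                between = len(out) - last_write[reg] - 1
                needed = max(needed, delay - between)
        out.extend([NOP] * needed)
        out.append((text, sources, dest))
        if dest is not None:
            last_write[dest] = len(out) - 1
    return out


if __name__ == "__main__":
    program = [
        ("Add R1, R2, R3", ("R1", "R2"), "R3"),
        ("Shift_left R3", ("R3",), "R3"),
    ]
    for text, _, _ in insert_nops(program):
        print(text)
```

Independent instructions that already sit between producer and consumer count towards the delay, so the pass only pads where the program itself leaves no useful work to fill the gap.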
Side effects
I2: ADD D1, D2
I3: ADDX D3, D4   (carry copy)
Other form of (implicit) data dependency: instructions can have side effects that are used by the next instruction
Complex addressing mode
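Load (X(R1)), R2
[Figure: the Load runs F and D, computes X+[R1], reads [X+[R1]], reads [[X+[R1]]] and writes R2; X is part of the instruction. The next instruction runs F, stays in D for several cycles, then receives the forwarded operand (fwd, O) and writes (W).]
Causes a pipeline stall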
Simple addressing modes
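Add #X,R1,R2
Load (R2),R2
Load (R2),R2
[Figure: the Add computes X+[R1] and writes R2; the first Load reads [X+[R1]] into R2; the second Load reads [[X+[R1]]] into R2; the next instruction receives the forwarded operand (fwd, O) and then writes (W).]
Build up from simple instructions: same amount of time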
Addressing modes
• Requirements for addressing modes with pipelining:
  - operand access takes no more than one memory access
  - only load and store instructions access memory
  - addressing modes do not have side effects
• Possible addressing modes:
  - register
  - register indirect
  - index
Condition codes (1)
• Problems in RISC with condition codes (CCs):
  - do instructions after reordering have access to the right CC values?
  - are CCs already available at the next instruction?
• Solutions:
  - compiler detection
  - no automatic use of CCs, only when explicitly given in instruction
Explicit specification of CCs
Increment R5
Add R2, R4                  (double precision addition)
Add-with-increment R1, R3   (double precision addition)

PowerPC instructions (C: change carry flag, E: use carry flag):
ADDI R5, R5, 1
ADDC R4, R2, R4
ADDE R3, R1, R3
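How the explicitly named carry links the two halves of a double-precision addition can be sketched in Python. The 32-bit word size is an assumption made for this example, and `addc`, `adde` and `add_double` are illustrative functions that only mimic the roles of ADDC and ADDE; they are not an emulation of the PowerPC instructions.

```python
WORD_BITS = 32                 # assumed word size for this sketch
MASK = (1 << WORD_BITS) - 1


def addc(a, b):
    """Add two words and return (sum, carry_out): changes the carry."""
    total = a + b
    return total & MASK, total >> WORD_BITS


def adde(a, b, carry_in):
    """Add two words plus an incoming carry: uses the carry."""
    total = a + b + carry_in
    return total & MASK, total >> WORD_BITS


def add_double(a_hi, a_lo, b_hi, b_lo):
    """Double-precision addition: low words first, then high words plus carry."""
    lo, carry = addc(a_lo, b_lo)
    hi, _ = adde(a_hi, b_hi, carry)
    return hi, lo


if __name__ == "__main__":
    print(add_double(0, 0xFFFFFFFF, 0, 1))
```

An instruction placed between the two additions, like the increment in the example, is harmless only as long as it does not change the carry, which is why naming the carry explicitly in the instruction helps.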
Two execution units
[Figure: Fetch -> instruction queue -> Dispatch unit -> Integer unit and FP unit (in parallel) -> Write.]
Instruction flow (superscalar)
[Figure: I1 (Fadd) runs F1 D1 O1 O1 O1 W1; I2 (Add) runs F2 D2 O2 W2; I3 (Fsub) runs F3 D3 O3 O3 O3 W3; I4 (Sub) runs F4 D4 O4 W4.]
Simultaneous execution of floating point and integer operations
Completion in program order
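[Figure: the same four instructions (I1 Fadd, I2 Add, I3 Fsub, I4 Sub) with the steps F, D, O, W, where the floating-point instructions have three O steps and the write steps are held back so that they occur in program order.]
Wait until previous instruction has completed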
Consequences completion order
When an exception occurs:
• writes not necessarily in order of instructions: imprecise exceptions
• writes in order: precise exceptions
PowerPC pipeline
[Figure: PowerPC pipeline block diagram with the components Instr. cache, Instr. fetch, Branch unit, Instruction queue, Dispatcher, IU, LSU, FPU, store queue, Data cache and Completion queue.]
Performance Effects (1)
• Execution time of a program: T
• Dynamic instruction count: N
• Number of cycles per instruction: S
• Clock rate: R
• Without pipelining: T = (N x S) / R
• With an n-stage pipeline: T' = T / n ???
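The two formulas translate directly into code. In the sketch below the function names and the sample numbers (one million instructions, four cycles each, 500 MHz) are assumptions chosen for illustration; `ideal_pipelined_time` is only the upper bound that the question marks on the slide call into doubt.

```python
def execution_time(n_instructions, cycles_per_instruction, clock_rate_hz):
    """T = (N x S) / R, in seconds."""
    return n_instructions * cycles_per_instruction / clock_rate_hz


def ideal_pipelined_time(t, n_stages):
    """T' = T / n: best case, ignoring every kind of stall."""
    return t / n_stages


if __name__ == "__main__":
    t = execution_time(1_000_000, 4, 500e6)
    print(t, ideal_pipelined_time(t, 4))
```

Stalls make the real T' larger than this bound.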
Performance Effects (2)
• Cycle time: 2 ns (R is 500 MHz)
• Cache hit (miss) ratio instructions: 0.95 (0.05)
• Cache hit (miss) ratio data: 0.90 (0.10)
• Fraction of instructions that need data from memory: 0.30
• Cache miss penalty: 17 cycles
• Average extra delay per instruction: (0.05 + 0.3 x 0.1) x 17 = 1.36 cycles, so slow down by a factor of more than 2!!
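The arithmetic can be reproduced with a short Python sketch. The function names are made up for this example, and the slowdown is computed under the assumption that the ideal pipeline completes one instruction per cycle.

```python
def extra_delay_per_instruction(instr_miss, data_fraction, data_miss, penalty):
    """Average number of stall cycles per instruction caused by cache misses."""
    return (instr_miss + data_fraction * data_miss) * penalty


def slowdown(extra_delay, ideal_cycles=1.0):
    """Factor by which the average time per instruction grows."""
    return (ideal_cycles + extra_delay) / ideal_cycles


if __name__ == "__main__":
    delay = extra_delay_per_instruction(0.05, 0.30, 0.10, 17)
    print(round(delay, 2), round(slowdown(delay), 2))
```

With the figures above, every instruction loses 1.36 cycles on average on top of its one ideal cycle.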
Performance Effects (3)
• On average, the fetch stage takes, due to instruction cache misses: 1 + (0.05 x 17) = 1.85 cycles
• On average, the decode stage takes, due to operand cache misses: 1 + (0.3 x 0.1 x 17) = 1.51 cycles
• For a total additional cost of 1.36 cycles
Performance Effects (4)
• If only one stage takes longer, the additional time should be counted relative to one stage, not relative to the complete instruction
• In other words: here, the pipeline is as slow as the slowest stage
[Figure: two timing diagrams of one instruction with the steps F1 D1 O1 W1.]
Performance Effects (5)
• Delay of 1 cycle every 4 instructions in only one stage: average penalty: 0.25
• Average inter-completion time: (3x1 + 1x2)/4 = 1.25
[Figure: four-stage execution of I1 to I5 (F1 D1 O1 W1 to F5 D5 O5 W5).]
Performance Effects (6)
• Delays in two stages:
  - k % of the instructions in one stage, penalty s cycles
  - l % of the instructions in another stage, penalty t cycles
• Average inter-completion time: ((100-k-l) x 1 + k(1+s) + l(1+t))/100 = (100 + ks + lt)/100
• In example (k=5, l=3, s=t=17): 2.36
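The formula is easy to check numerically. The sketch below uses a hypothetical function name, `avg_inter_completion_time`, and evaluates both the two-stage example and the earlier case of a 1-cycle delay in every fourth instruction (k=25, s=1, no second delay).

```python
def avg_inter_completion_time(k, s, l=0, t=0):
    """Average cycles between completions.

    k percent of the instructions pay s extra cycles in one stage,
    l percent pay t extra cycles in another stage.
    """
    return (100 + k * s + l * t) / 100


if __name__ == "__main__":
    print(avg_inter_completion_time(5, 17, 3, 17))  # two delayed stages
    print(avg_inter_completion_time(25, 1))         # one delay every 4 instructions
```

Both results agree with the values worked out by hand, 2.36 and 1.25.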
Performance Effects (7)
• Large number of pipeline stages seems advantageous, but:
  - more instructions simultaneously being processed, so more opportunity for conflicts
  - branch penalty becomes larger
  - ALU is usually bottleneck, no use having smaller time steps