
1

Pipelining (Chapter 8)

TU-Delft TI1400/12-PDS

http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_8.ppt

Course website: http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_results.htm


2

Basic idea (1)

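[Figure: sequential execution over time. Each instruction I1 to I4 is fetched (F) and then executed (E) before the next one starts: F1 E1 F2 E2 F3 E3 F4 E4. Hardware: instruction fetch unit, buffer B1, execution unit.]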


3

Basic idea (2): Overlap

Pipelined execution (time runs left to right):

Clock cycle   1    2    3    4    5
I1            F1   E1
I2                 F2   E2
I3                      F3   E3
I4                           F4   E4


4

Instruction phases

• F Fetch instruction
• D Decode instruction and fetch operands
• O Perform operation
• W Write result


5

Four-stage pipeline

Pipelined execution (time runs left to right):

Clock cycle   1    2    3    4    5    6    7
I1            F1   D1   O1   W1
I2                 F2   D2   O2   W2
I3                      F3   D3   O3   W3
I4                           F4   D4   O4   W4
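The staircase pattern of the four-stage pipeline can be reproduced with a few lines of Python. This is only a sketch under ideal conditions (one cycle per stage, no stalls); the function names are made up for this illustration and do not come from the course material.

```python
def ideal_cycles(num_instructions: int, num_stages: int) -> int:
    """Clock cycles needed when every stage takes one cycle and nothing stalls."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles; each later one finishes
    # exactly one cycle after its predecessor.
    return num_stages + num_instructions - 1


def sequential_cycles(num_instructions: int, num_stages: int) -> int:
    """Clock cycles needed without any overlap between instructions."""
    return num_instructions * num_stages


def timing_diagram(num_instructions: int, stages: str = "FDOW") -> list[str]:
    """Return one text row per instruction, shifted one cycle per instruction."""
    rows = []
    for i in range(1, num_instructions + 1):
        # Instruction i enters the first stage in clock cycle i.
        cells = ["  "] * (i - 1) + [f"{s}{i}" for s in stages]
        rows.append(f"I{i}  " + " ".join(cells))
    return rows


if __name__ == "__main__":
    print("\n".join(timing_diagram(4)))
    print("pipelined:", ideal_cycles(4, 4), "sequential:", sequential_cycles(4, 4))
```

For four instructions and four stages this gives 7 cycles with overlap against 16 without.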


6

Hardware organization (1)

Fetch unit -> B1 -> Decode and fetch operands -> B2 -> Operation unit -> B3 -> Write unit


7


During cycle 4, the buffers contain:
• B1:
  - instruction I3
• B2:
  - the source operands of I2
  - the specification of the operation
  - the specification of the destination operand
• B3:
  - the result of the operation of I1
  - the specification of the destination operand


8


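[Figure: Fetch unit -> B1 -> Decode and fetch operands -> B2 -> Operation unit -> B3 -> Write unit, annotated with buffer contents: B1 holds I3; B2 holds the operands and the operation of I2; B3 holds the result of I1.]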


9

Pipeline stall (1)

• Pipeline stall: delay in a stage of the pipeline due to an instruction
• Reasons for pipeline stall:
  - Cache miss
  - Long operation (for example, division)
  - Dependency between successive instructions
  - Branching


10

Pipeline stall (2): Cache miss

[Figure: timing diagram for I1 to I3 over clock cycles 1 to 8, each instruction passing through F, D, O, W. A cache miss in I2 stretches its timing and delays the instructions behind it.]


11

Pipeline stall (3): Cache miss

Effect of cache miss in F2 (activity of each stage per clock cycle):

Clock cycle   1    2    3    4    5    6    7    8    9
F             F1   F2   F2   F2   F2   F3
D                  D1   idle idle idle D2   D3
O                       O1   idle idle idle O2   O3
W                            W1   idle idle idle W2   W3
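The effect of a slow fetch on the later stages can be checked with a small scheduling sketch. It is a simplified model, not the slide's own method: an instruction enters a stage as soon as it has left the previous stage and its predecessor has left this one, and back-pressure from a busy successor stage is ignored. The function name and the latency table are illustrative assumptions.

```python
def schedule(latencies):
    """Compute when each instruction leaves each stage.

    latencies[i][s] is the number of cycles instruction i spends in stage s.
    Returns finish[i][s], the clock cycle at the end of which instruction i
    leaves stage s (cycles are numbered from 1).
    """
    finish = []
    for i, row in enumerate(latencies):
        done = []
        for s, cycles in enumerate(row):
            ready = done[s - 1] if s > 0 else 0      # left the previous stage
            free = finish[i - 1][s] if i > 0 else 0  # predecessor left this stage
            done.append(max(ready, free) + cycles)
        finish.append(done)
    return finish


if __name__ == "__main__":
    # Stages F, D, O, W; the fetch of I2 takes 4 cycles because of a cache miss.
    table = schedule([[1, 1, 1, 1],
                      [4, 1, 1, 1],
                      [1, 1, 1, 1]])
    for number, row in enumerate(table, start=1):
        print(f"I{number}: F,D,O,W finish in cycles {row}")
```

With a four-cycle F2, the model places W2 in cycle 8 and W3 in cycle 9, three cycles later than in the stall-free case.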


12

Pipeline stall (4): Long operation

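[Figure: timing diagram for I1 to I4 over clock cycles 1 to 8, each instruction passing through F, D, O, W. One operation (O) phase takes more than one clock cycle, which holds up the instructions behind it.]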


13

Pipeline stall (5): Dependencies

• Instructions:
    ADD R1, 3(R1)
    ADD R4, 4(R1)
  cannot be done in parallel
• Instructions:
    ADD R2, 3(R1)
    ADD R4, 4(R3)
  can be done in parallel


14

Pipeline stall (6): Branch

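[Figure: timing diagram. Ii (branch) is fetched and executed (Fi Ei); the branch target Ik is fetched and executed (Fk Ek) only afterwards.]

Pipeline stall due to branch: only start fetching instructions after the branch has been executed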


15

Data dependency (1): example

MUL R2,R3,R4 /* R4 destination */

ADD R5,R4,R6 /* R6 destination */

New value of R4 must be available before ADD instruction uses it
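A dependency of this kind can be detected mechanically by comparing register names. The sketch below assumes the three-operand format of the example, with the destination written last; the helper names are hypothetical and only registers are considered, not memory operands.

```python
def parse(instr: str):
    """Split 'MUL R2,R3,R4' into (opcode, source registers, destination)."""
    opcode, operand_text = instr.split(None, 1)
    regs = [r.strip() for r in operand_text.split(",")]
    # Assumption: the last operand is the destination, as in the example.
    return opcode, regs[:-1], regs[-1]


def depends_on(later: str, earlier: str) -> bool:
    """True if `later` reads the register that `earlier` writes."""
    _, sources, _ = parse(later)
    _, _, destination = parse(earlier)
    return destination in sources


if __name__ == "__main__":
    print(depends_on("ADD R5,R4,R6", "MUL R2,R3,R4"))  # ADD reads R4
    print(depends_on("ADD R5,R7,R6", "MUL R2,R3,R4"))  # no shared register
```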


16

Data dependency (2): example

[Figure: timing diagram for I1 (MUL), I2 (ADD), I3 and I4, each passing through F, D, O, W. I2 is held up until I1 has written its result.]

Pipeline stall due to data dependence between W1 and D2


17

Branching: Instruction queue

Fetch -> instruction queue -> Dispatch -> Operation -> Write


18

Idling at branch

Time runs left to right:

Ij (branch)   Fj     Ej
Ij+1                 Fj+1   idle
Ik                          Fk     Ek
Ik+1                               Fk+1   Ek+1


19

Branch with instruction queue

[Figure: timing diagram. I1, I2 and I3 are fetched and executed (F, E); I4 is fetched and then discarded because of the branch; Ij, Ij+1, Ij+2 and Ij+3 are then fetched and executed.]

Branch folding: execute a later branch instruction simultaneously (i.e., compute target)


20

Delayed branch (1): reordering

Original (always lose a cycle):

LOOP  Shift_left R1
      Decrement R2
      Branch_if>0 LOOP
NEXT  Add R1,R3

Reordered:

LOOP  Decrement R2
      Branch_if>0 LOOP
      Shift_left R1       (always executed)
NEXT  Add R1,R3


21

Delayed branch (2): execution timing

Time runs left to right:

Decrement   F  E
Branch         F  E
Shift             F  E
Decrement            F  E
Branch                  F  E
Shift                      F  E
Add                           F  E
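The gain from filling the delay slot can be illustrated by counting instruction slots for both versions of the loop. This is a rough sketch with its own assumptions: each instruction occupies one slot, and the original loop wastes one slot after every branch. The function names are invented for this illustration.

```python
def original_loop(r1: int, r2: int):
    """LOOP: Shift_left R1; Decrement R2; Branch_if>0 LOOP. Returns (R1, slots)."""
    slots = 0
    while True:
        r1 <<= 1      # Shift_left R1
        slots += 1
        r2 -= 1       # Decrement R2
        slots += 1
        slots += 1    # Branch_if>0 LOOP
        slots += 1    # slot after the branch is lost
        if r2 <= 0:
            break
    return r1, slots


def reordered_loop(r1: int, r2: int):
    """LOOP: Decrement R2; Branch_if>0 LOOP; Shift_left R1 (delay slot)."""
    slots = 0
    while True:
        r2 -= 1       # Decrement R2
        slots += 1
        slots += 1    # Branch_if>0 LOOP (takes effect one slot later)
        r1 <<= 1      # Shift_left R1 fills the delay slot, always executed
        slots += 1
        if r2 <= 0:
            break
    return r1, slots


if __name__ == "__main__":
    print(original_loop(1, 3))
    print(reordered_loop(1, 3))
```

Both versions leave the same value in R1; under these assumptions the reordered loop needs three slots per iteration instead of four.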


22

Branch prediction (1)

Effect of incorrect branch prediction (X: instruction discarded):

I1 (Compare)      F1 D1 E1 W1
I2 (Branch-if>)   F2 E2
I3                F3 D3 E3 X
I4                F4 D4 X
Ik                Fk Dk


23

Branch prediction (2)

Possible implementation:
- use a single bit
- bit records previous choice of branch
- bit tells from which location to fetch next instructions
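A single-bit scheme like this is easy to simulate. The sketch below is a minimal model of one branch with one prediction bit; the initial value of the bit and the sample outcome sequence are assumptions made for the illustration.

```python
def one_bit_predictor(outcomes, initial=False):
    """Count mispredictions of a one-bit predictor for a single branch.

    outcomes: iterable of booleans, True meaning the branch was taken.
    initial:  assumed starting value of the prediction bit.
    """
    bit = initial
    mispredictions = 0
    for taken in outcomes:
        if bit != taken:
            mispredictions += 1
        bit = taken  # the bit records the most recent choice
    return mispredictions


if __name__ == "__main__":
    # A loop branch that is taken three times and then falls through, twice.
    history = [True, True, True, False] * 2
    print(one_bit_predictor(history), "mispredictions out of", len(history))
```

In this model a loop-closing branch is mispredicted twice per loop execution: once on exit and once on re-entry.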


24

Data paths of CPU (1)

[Figure: data paths between the register file (Source 1, Source 2, Destination), the SRC1 and SRC2 registers, the ALU and the RSLT register, with a path for operand forwarding.]


25

Data paths of CPU (2)

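[Figure: Operation and Write stages. SRC1 and SRC2 feed the ALU, RSLT holds the result on its way to the register file, and a forwarding data path leads from the result back to the ALU inputs.]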


26

Pipelined operation

I1: Add R1, R2, R3
I2: Shift_left R3

[Figure: timing diagram for I1 (Add) to I4. I1 computes R1 + R2 and writes R3; I2 (Shift) shifts R3 and writes R3; I3 and I4 pass through F D O W.]

The result of Add has to be available


27

Short pipeline

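[Figure: timing diagram. I1 computes R1 + R2 in its operation phase and writes R3; I2 receives the forwarded result (fwd) and shifts it in its operation phase, then writes R3; I3 passes through F D O W.]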


28

Long pipeline

[Figure: timing diagram for I1 to I3 with a three-cycle operation phase (F D O1 O2 O3 W). The result of I1 is forwarded (fwd) to I2.]


29

Compiler solution

I1: Add R1, R2, R3
I2: Shift_left R3

Insert no-operations to wait for the result:

I1: Add R1, R2, R3
    NOP
    NOP
I2: Shift_left R3
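The NOP insertion a compiler performs here can be sketched as a short pass over the instruction list. The sketch assumes that the destination register is the last operand, that a single-operand instruction such as Shift_left both reads and writes its register, and that a result becomes usable two instructions later (matching the two NOPs in the example). The helper names are hypothetical.

```python
NOP = "NOP"


def operands(instr: str):
    """Return (source registers, destination) for an instruction string."""
    parts = instr.split(None, 1)
    if len(parts) < 2:
        return [], None                  # NOP or other operand-free instruction
    regs = [r.strip() for r in parts[1].split(",")]
    if len(regs) == 1:
        return regs, regs[0]             # e.g. Shift_left R3 reads and writes R3
    return regs[:-1], regs[-1]           # assumption: destination comes last


def insert_nops(program, delay=2):
    """Insert NOPs so a result is used at least delay + 1 slots after it is produced."""
    out = []
    for instr in program:
        sources, _ = operands(instr)
        needed = 0
        # Look at the last `delay` emitted instructions for a producer.
        for distance in range(1, min(delay, len(out)) + 1):
            _, destination = operands(out[-distance])
            if destination is not None and destination in sources:
                needed = max(needed, delay - distance + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out


if __name__ == "__main__":
    for line in insert_nops(["Add R1, R2, R3", "Shift_left R3"]):
        print(line)
```

A real compiler would first try to move independent instructions into these slots and fall back to NOPs only when none are available.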


30

Side effects

I2: ADD D1, D2
I3: ADDX D3, D4     (carry copy)

Other form of (implicit) data dependency: instructions can have side effects that are used by the next instruction


31

Complex addressing mode

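Load (X(R1)), R2     (X in instruction)

[Figure: timing diagram. After F and D, the Load computes X+[R1], reads [X+[R1]], reads [[X+[R1]]] and writes R2. The next instruction is fetched and decoded, then waits for the forwarded value (fwd) before its O and W phases.]

Causes a pipeline stall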


32

Simple addressing modes

Add #X,R1,R2
Load (R2),R2
Load (R2),R2

[Figure: timing diagram. Add computes X+[R1] into R2; the first Load reads [X+[R1]] into R2; the second Load reads [[X+[R1]]] into R2; the next instruction follows with fwd, O and W.]

Built up from simple instructions: same amount of time


33

Addressing modes

• Requirements for addressing modes with pipelining:
  - operand access takes not more than one memory access
  - only load and store instructions access memory
  - addressing modes do not have side effects
• Possible addressing modes:
  - register
  - register indirect
  - index


34

Condition codes (1)

• Problems in RISC with condition codes (CCs):
  - do instructions after reordering have access to the right CC values?
  - are CCs already available at the next instruction?
• Solutions:
  - compiler detection
  - no automatic use of CCs, only when explicitly given in instruction


35

Explicit specification of CCs

Double precision addition:

Increment R5                  ADDI R5, R5, 1
Add R2, R4                    ADDC R4, R2, R4
Add-with-increment R1, R3     ADDE R3, R1, R3

PowerPC instructions (C: change carry flag, E: use carry flag)


36

Two execution units

[Figure: Fetch -> instruction queue -> Dispatch unit -> FP unit and Integer unit (in parallel) -> Write]


37

Instruction flow (superscalar)

I1 (Fadd)   F1 D1 O1 O1 O1 W1
I2 (Add)    F2 D2 O2 W2
I3 (Fsub)   F3 D3 O3 O3 O3 W3
I4 (Sub)    F4 D4 O4 W4

Simultaneous execution of floating point and integer operations


38

Completion in program order

I1 (Fadd)   F1 D1 O1 O1 O1 W1
I2 (Add)    F2 D2 O2 W2
I3 (Fsub)   F3 D3 O3 O3 O3 W3
I4 (Sub)    F4 D4 O4 W4

Each write waits until the previous instruction has completed


39

Consequences completion order

When an exception occurs:
• writes not necessarily in order of instructions: imprecise exceptions
• writes in order: precise exceptions


40

PowerPC pipeline

[Figure: PowerPC pipeline block diagram with data cache, instruction cache, instruction fetch, branch unit, instruction queue, dispatcher, LSU, IU, FPU, store queue and completion queue.]


41

Performance Effects (1)

• Execution time of a program: T
• Dynamic instruction count: N
• Number of cycles per instruction: S
• Clock rate: R
• Without pipelining: T = (N x S) / R
• With an n-stage pipeline: T' = T / n ???


42

Performance Effects (2)

• Cycle time: 2 ns (R is 500 MHz)
• Cache hit (miss) ratio instructions: 0.95 (0.05)
• Cache hit (miss) ratio data: 0.90 (0.10)
• Fraction of instructions that need data from memory: 0.30
• Cache miss penalty: 17 cycles
• Average extra delay per instruction: (0.05 + 0.3 x 0.1) x 17 = 1.36 cycles, so slow down by a factor of more than 2!!
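The arithmetic of this example can be written out as a short calculation. The sketch below uses the numbers from the slide; the function and parameter names are illustrative, and the baseline of one cycle per instruction for an ideal pipeline is an assumption of the sketch.

```python
def extra_delay_per_instruction(instr_miss, data_miss, data_fraction, penalty):
    """Average extra cycles per instruction caused by cache misses."""
    # Every instruction is fetched; only a fraction also accesses data.
    return (instr_miss + data_fraction * data_miss) * penalty


def execution_time(n, s, r):
    """T = (N x S) / R, with R in cycles per second."""
    return n * s / r


if __name__ == "__main__":
    extra = extra_delay_per_instruction(
        instr_miss=0.05, data_miss=0.10, data_fraction=0.30, penalty=17)
    ideal_s = 1.0                      # assumed: one cycle per instruction
    print("extra cycles per instruction:", round(extra, 2))
    print("slowdown factor:", round((ideal_s + extra) / ideal_s, 2))
    # One million instructions at 500 MHz, with and without the miss delay.
    print(execution_time(1_000_000, ideal_s, 500e6), "s ideal")
    print(execution_time(1_000_000, ideal_s + extra, 500e6), "s with misses")
```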


43

Performance Effects (3)

• On average, the fetch stage takes, due to instruction cache misses: 1 + (0.05 x 17) = 1.85 cycles
• On average, the decode stage takes, due to operand cache misses: 1 + (0.3 x 0.1 x 17) = 1.51 cycles
• For a total additional cost of 0.85 + 0.51 = 1.36 cycles


44

Performance Effects (4)

• If only one stage takes longer, the additional time should be counted relative to one stage, not relative to the complete instruction
• In other words: here, the pipeline is as slow as the slowest stage

[Figure: two timing rows of the form F1 D1 O1 W1.]


45

Performance Effects (5)

• Delay of 1 cycle every 4 instructions in only one stage: average penalty: 0.25
• Average inter-completion time: (3x1 + 1x2)/4 = 1.25

[Figure: timing diagram for I1 to I5, each passing through F, D, O, W, with one stage of one instruction taking an extra cycle.]


46

Performance Effects (6)

• Delays in two stages:
  - k % of the instructions in one stage, penalty s cycles
  - l % of the instructions in another stage, penalty t cycles
• Average inter-completion time: ((100-k-l) x 1 + k(1+s) + l(1+t))/100 = (100 + ks + lt)/100
• In example (k=5, l=3, s=t=17): 2.36
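The inter-completion formula translates directly into code. The sketch below evaluates both the long and the simplified form for the example values; the function names are chosen for this illustration only.

```python
def inter_completion_time(k, s, l, t):
    """Average cycles between completions: (100 + k*s + l*t) / 100.

    k, l: percentages of instructions delayed in the two stages.
    s, t: penalty in cycles for each kind of delay.
    """
    return (100 + k * s + l * t) / 100


def inter_completion_time_long(k, s, l, t):
    """Same quantity, written as the weighted average it is derived from."""
    undelayed = (100 - k - l) * 1
    return (undelayed + k * (1 + s) + l * (1 + t)) / 100


if __name__ == "__main__":
    print(inter_completion_time(5, 17, 3, 17))       # example from the slide
    print(inter_completion_time_long(5, 17, 3, 17))  # same value, long form
    print(inter_completion_time(25, 1, 0, 0))        # 1 extra cycle every 4 instructions
```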


47

Performance Effects (7)

• Large number of pipeline stages seems advantageous, but:
  - more instructions simultaneously being processed, so more opportunity for conflicts
  - branch penalty becomes larger
  - ALU is usually bottleneck, no use having smaller time steps
