cs 1104 help session iv five issues in pipelining colin tan, [email protected] s15-04-15

CS 1104Help Session IV

Five Issues in PipeliningColin Tan,

[email protected]

S15-04-15

Issue 1Pipeline Registers

• Instruction execution typically involves several stages– Fetch: Instructions are read from memory

– Decode: The instruction is interpreted, data is read from registers.

– Execute: The instruction is actually executed on the data

– Memory: Any data memory accesses are performed (e.g. read, write data memory)

– Writeback: Results are written to destination registers.


• Stages are built using combinational devices– The second you put something on the input of a

combinational device, the outputs change.

– These outputs form inputs to other stages, causing their outputs to change

– Hence instructions in the fetch stage affect every other stage in the CPU.

– Not possible to have multiple instructions at different stages, since each stage will now affect stages further down.

• Effect is that CPU can only execute 1 instruction at a time.


• To support pipelining, stages must be de-coupled from each other.– Add pipeline registers!

– Pipeline registers allow each stage to hold a different instruction, as they prevent one stage from affecting another stage until the appropriate time.

• Hence we can now execute 5 different instructions at 5 stages of the pipeline– 1 different instruction at each stage.

Issue 2Speedup

• Figure below shows a non-pipelined CPU executing 2 instructions

IF ID EX MEM WB

IF ID EX MEM WB

• Assuming each stage takes 1 cycles, each instruction will take 5 cycles (i.e. CPI=5)

Issue 2Speedup

• For pipelined case:– First instruction takes 5 cycles

– Subsequent instructions will take 1 cycle• The other 4 cycles of each subsequent instruction is amortized

by the previous instruction (see diagram).

IF ID EX MEM WB

IF ID EX MEM WB

IF ID EX MEM WB

Issue 2Speedup

• For N+1 instructions, 1st instruction takes 5 cycles, subsequent N instructions take 1 cycle.

• Total number of cycles is not 5+N.• Hence average number of cycles per instruction is:

CPI = (5+N)/(N+1)• As N tends to infinity, CPI tends to 1.• Compare with none-pipeline case, a 5-stage pipeline gives

a 5:1 speedup!• Ideally, an M stage pipeline will give an M times speedup.

Issue 3Hazards

• A hazard is a situation where computation is prevented from proceeding correctly.– Hazards can cause performance problems.– Even worse, hazards can cause computation to

be incorrect.– Hence hazards must be resolved

Issue 3AStructural Hazards

• Structural Hazards• Generally a shared resource (e.g. Memory, disk drive) can be

used by only 1 processor or pipeline stage at a time.

• If >1 processor or pipeline stage needs to use the resource, we have a structural hazard.

• E.g. if 2 processors want to use a bus, we have a structural hazard that must be resolved by arbitration (See I/O notes).

• Structural hazards are reduced in the MIPS by having separate instruction and data memory

– If they were in the same memory, it is possible that the IF stage may try to fetch instructions at the same time as the MEM stage accesses data => Structural Hazard results.

Issue 3BData Hazards

• Data Hazards• Caused by having >1 instruction executing concurrently.• Consider the following instructions:

add $1, $2, $3

add $4, $1, $1

IF ID EX MEM WB

IF ID EX MEM WB

add $1,$2,$3

add $4,$1,$1

• The first add instruction will update the contents of $1 in WB stage.• But the second instruction will read it $1 2 cycles earlier in the ID

stage! It will obviously read the old value of $1, and the add instruction will give the wrong result.


• The result that will be written to $1 in the WB stage first becomes available at the ALU in the EX stage.

• The result for the first add become available from the ALU just as the second instruction needs it.

• If we can just send this result over, can resolve hazard already! This is called “Forwarding”.

IF ID EX MEM WB

IF ID EX MEM WB

add $1,$2,$3

add $4,$1,$1


• Sometimes forwarding doesn’t quite work. Consider:

lw $1, 4($3)

add $4, $1, $1

• For the lw instruction the EX stage actually computes the result of 4 + $3 (i.e. the 4($3) portion of the instruction).

• This forms the fetch address. No use forwarding this to the add instruction!


• The result of the ALU stage (i.e. the fetch address of the lw instruction) gets sent to the memory system in the MEM stage, and the contents of that addresses becomes available at the end of the MEM stage.

• But the add instruction needs it at the start of the EX stage. We have this situation:


• The forwarding is being done backwards, meaning that at the point when the add instruction needs the data, the data is not yet available.

• Since it is not yet available, cannot forward!!

IF ID EX MEM WB

IF ID EX MEM WB

lw $1,4($3)

add $4,$1,$1


• This form of hazard is called a “load-use” hazard, and is the only type of data hazard that cannot be resolved by forwarding.

• The way to resolve this is stall the add instruction by 1 cycle.

IF ID EX MEM WB

IF ID EX MEM WB

lw $1, 4($3)

add $4,$1,$1 (stall)


• Stalling is a bad idea as the processor spends 1 cycle not doing anything.

• Can also find an independent instruction to place between the lw and add.

IF ID EX MEM WB

IF ID EX MEM WB

IF ID EX MEM WB

lw $1, 4($3)

add $4,$1,$1

sub $5,$7,$7


• Forwarding can be done either from the ALU stage or from the MEM stage.

IF ID EX MEM WB

IF ID EX MEM WB

IF ID EX MEM WB

add $1,$2,$3

add $4,$1, $5

add $6,$1, $7


• Hazards between adjacent instructions are resolved from the ALU stage (between add $1,$2, $3 and add $4,$1,$5)

• Hazards between instructions separated by another instructions (between add $1,$2,$3 and add $6,$1,$7) are resolved from the MEM stage.– This is because if we try to resolve this from the MEM

stage, the add $6,$1,$7 instruction will actually get the results of the previous (add $4,$1,$5) instruction instead (since it is the instruction that is in the EX stage)

Issue 3CControl Hazards

• In an unoptimized pipeline branch decisions are made after the EX stage– The EX stage is where the comparisons are made.– Hence the branching decision becomes available at the end of

the EX stage.

• Pipeline can be optimized by moving the comparisons to the ID stage– Comparisons are always made between register contents (e.g.

beq $1, $3, R).– The register contents are available by the end of the ID stage.

Issue 3CControl Hazards

• However still have a problem. E.g.L1: add $3, $1, $1

beq $3, $4, L1

sub $4, $3, $3

• Depending on whether the beq is taken or not, the next instruction to be fetched is either add (if the branch is taken) or sub (if the branch is not taken)

Issue 3CControl Hazard

• We don’t know which instruction to fetch until the end of the ID stage.

• But the IF stage must still fetch something!– Fetch add or fetch sub?

IF ID EX MEM WBbeq $3,$4,L1

sub $4,$3,$3or

add $3,$1,$1IF

Issue 3CControl Hazards - Solutions

• Always assume that the branch is taken:– The fetch stage will fetch the add instruction.

– By the time this fetch is complete, the outcome of the branching is known.

– If the branch is taken, the add instruction proceeds to completion.

– If the branch is not taken, the add instruction is flushed from the pipeline, and the sub instruction is fetched and executed. This causes a 1 cycle stall.


• Always assume that the branch is not taken– The IF stage will fetch the sub instruction first.

– By this time, the outcome of the branch will be known.

– If the branch is not taken, then the sub instruction executes to completion.

– If the branch is taken, then the sub instruction is flushed, and the add instruction is fetched and executed. Results in 1 cycle stall.


• Delayed BranchingL1: add $3, $1, $1

beq $3, $4, L1

sub $4, $3, $3

ori $5, $2, $3

• Just like the assume not taken strategy, the IF stage fetches the sub instruction. However, the sub instruction executes to completion regardless of the outcome of the branch.

Issue 3CControl Hazards-Solutions

– The outcome of the branching will now be known.

– If the branch is taken, the add instruction is fetched and executed.

– Otherwise the ori instruction is fetched and executed.

• This strategy is called “delayed branching” because the effect of the branch is not felt until after the sub instruction (i.e. 1 instruction later).

• The sub instruction here is called the delay slot or delay instruction, and it will always be executed regardless of the outcome of the branching.

Issue 4Instruction Scheduling

• We must prevent pipeline stalls in order to gain maximum pipeline performance.

• For example, for the load-use hazard, we must find an instruction to place between the lw instruction and the instruction that uses the lw results to prevent stalling.

• Also may need to place instructions into delay slots.

• This re-arrangement of instructions is called Instruction Scheduling.


• Basic criteria: – An instruction I3 to be placed between two instructions

I1 and I2 must be independent of both I1 and I2.lw $1, 0($3)

add $2, $1, $4

sub $4, $3, $2

In this example, the sub instruction modifies $4, which is used by the add instruction. Hence it cannot be moved between the lw and the add.


– An instruction I3 that is moved must not violate dependency orderings. For example:

add $1, $2, $3

sub $5, $1, $7

lw $4, 0($6)

ori $9, $4, $4• The add instruction cannot be moved between the lw and ori instructions

as it would violate the dependency ordering with the sub instruction.– I.e. the sub depends on the add, and moving the add after the sub would

cause incorrect computation of the sub.


• The nop instruction stands for “no operation”.• When the CPU reads and executes the nop

instruction, absolutely nothing happens, except that 1 cycle is wasted executing this instruction.

• The nop instruction can be used in delay slots or simply to waste time.


• Delay branch example: Suppose we have the following program, and suppose that branches are not delayed:

add $3, $4, $5

add $2, $3, $7

beq $2, $3, L1

sub $7, $2, $4

L1:

• In this program, the 2 add instructions will be executed regardless of the outcome of the branch, but the sub instruction will not be executed if the branch is taken.


• Suppose a hardware designer modifies the beq instruction to become a delayed branch.– The sub instruction is now in the delay slot, and will be executed

regardless of the outcome of the branch!– This is obviously not what the programmer originally intended.– To correct this, we must place an instruction that will be executed

regardless of the outcome of the branch into the delay slot. – Either of the 2 add instructions qualify, since they will be executed no

matter how the branch turns out.– BUT moving either of them into the delay slot will cause incorrect

computation• They will violate the dependency orderings between the first and second

add, and between the second add and the lw.


• But if we don’t move anything into the delay slot, the program will not execute correctly.

• Solution: Place a nop instruction there.add $3, $4, $5

add $2, $3, $7

beq $2, $3, L1

nop #delay slot here

sub $7, $2, $4

L1:

• The sub instruction now moves out of the delay slot.


• Loop Unrolling– Idea: If we loop 16 times to perform an operation, we

can duplicate that operation 4 times and loop only 4 times.

E.g.

for(int i=0; i<16; i++)

my[i] = my[i]+3;


• This loop can be unrolled to:for(int i=0; i<16; i=i+4)

{

my[i] = my[i]+3;

my[i+1]=my[i+1]+3;

my[i+2]=my[i+2]+3;

my[i+3]=my[i+3]+3;

}


• But why even bother doing this?– Loop unrolling actually generates more instructions!

• Previously we only had 1 instruction doing my[i]=my[i]+3

• Now we have 4 such instructions!

– Increasing the number of instructions gives us more flexibility in scheduling the code.

• This allows us to eliminate pipeline stalls etc. more effectively.

Issue 5Improving Performance

• Super-pipelines– Each pipeline stage is further broken down.

– Effectively each pipeline stage is in turn pipe-lined.

– E.g. if each stage can be further broken down into 2 sub-stages:

IF1 ID1 EX1 M1IF2 ID2 EX2 M2 WB1 WB2


• This allows us to accommodate more instructions in the pipeline.– Now we can have 10 instructions operating

simultaneously.– So now we can have a 10x speedup over the

non-pipeline architecture instead of just a 5x speedup!


• Unfortunately when things go wrong, penalties are higher:– E.g. a branch misprediction resulting in an IF-stage

flush will now cause 2 bubbles (in the IF1 and IF2 stages) instead of just 1.

– In a load-use stall, 2 bubbles must be inserted.


• Superscalar Architectures– Have 2 or more pipelines working at the same time!

• In a single pipeline, normally the best CPI possible is 1.

• With 2 pipelines, the average CPI goes down to 1/2!

– This will allow us to execute twice as many instructions per second.

– Real situation not that ideal• Instructions going to 2 different pipelines simultaneously must

be independent of each other– There is NO forwarding between pipelines!


• If CPU is unable to find independent instructions, then 1 pipeline will remain idle.

• Example of superscalar machine:– Pentium processor - 2 integer pipelines, 1 floating-point

pipeline, giving a total of 3 pipelines!

Summary

• Issue 1: Pipeline registers– These decouple stages so that different instructions can

exist in each stage.

– Allows us to execute multiple instructions in a single pipeline.

• Issue 2: Speed-up– Ideally, an N stage pipeline should give you an N times

speedup.

Summary

• Issue 3: Hazards– Structural hazards: Solved by having multiple

resources. E.g. Separate memory for instruction and data.

– Data hazards: Solved by forwarding or stalling.

– Control hazards: Solved by branch prediction or delayed branching.

Summary

• Issue 4: Instruction Scheduling– Instructions may need to be re-arranged to avoid

pipeline stalls (e.g. load-use hazards) or to ensure correct execution (e.g. filling in delayed slots).

– Loop unrolling gives extra instructions.• This in turn gives better scheduling opportunities.

Summary

• Issue 5: Improving Performance– Super-pipelines: Increases pipeline depth.

• A 5-stage pipeline becomes a 10-stage pipeline, improving performance by 10x instead of 5x.

• Also causes larger penalties.

– Super-scalar pipelines: Have multiple pipelines• Can increase instruction execution rate.

• Average CPI can actually fall below 1!

cs 1104 help session iv five issues in pipelining colin tan, [email protected] s15-04-15

Documents