csci2510 computer organization lecture 12: pipeliningmcyang/csci2510/2018f/lec12 pipelining.pdf ·...

CSCI2510 Computer Organization

Lecture 12: Pipelining

Ming-Chang YANG

[email protected]

Reading: Chap. 8 (5th Ed.)

mailto:[email protected]

Why We Need Pipelining?

• Real-life example:

Four loads of

laundry that need

to be washed,

dried, and folded.

– Washing: 30 min

– Drying: 40 min

– Folding: 20 min

• Without pipeline:

– 360 min in total

• With pipeline:

– 210 min in total!

CSCI2510 Lec12: Pipelining 2

https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/index.html

Outline

• Sequential Execution vs Pipelining

• Pipeline Stall: Hazard

– Data Hazard

– Instruction Hazard

– Structural Hazard

• Superscalar Operation


Sequential Execution

• The processor fetches and executes instructions, one

after the other.

– Fi: Fetch steps for instruction Ii

– Ei: Execute steps for instruction Ii

• Execution of a program consists of a sequential

sequence of fetch and execute steps:

• How to improve the speed of execution?

– Use faster technologies to build CPU and memory ($$$).

– Arrange hardware to do multiple operations at a time ($).CSCI2510 Lec12: Pipelining 4

F1 E1 F2 E2 F3 E3

I1 I2 I3

Time

Separate HW & Interstage Buffer

• Consider a computer having two separate hardware

units:

– One hardware unit is for fetching instructions.

– The other hardware unit is for executing instructions.

• Interstage Buffer: Deposit the fetched instruction.

– Execution unit executes the deposited instruction.

– Fetch unit fetches the next instruction at the same time.


InstructionFetchUnit

ExecutionUnit

Interstage buffer

Instruction

Instr

uction

• Assume the computer is controlled by a clock.

– The fetch and execute steps of any instruction can be

completed in one clock cycle.

• Fetch and execute units form a two-stage pipeline:

– Both units are kept busy all the time.

– An interstage buffer is needed to hold the instruction.


Basic Idea of Instruction Pipelining (1/2)

F1 E 1

F2 E 2

F3 E 3

I 1

I 2

I 3

Instruction

Clock cycle 1 2 3 4 Time

• Parallelism is increased by overlapping the fetch and

execute steps.

– If executions sustain for a long time, the completion rate of

a two-stage pipelining will be twice.

• More is better? How about 4-stage pipeline?

– F: Fetch instruction from memory

– D: Decode instruction and fetch source operands

– E: Execute instruction

– W: Write the result


Basic Idea of Instruction Pipelining (2/2)

F

Fetch

instruction

D

Decode

instruction

4-Stage Pipeline (1/2)


F4I4

F1

F2

F3

I1

I2

I3

D1

D2

D3

D4

E1

E2

E3

E4

W1

W2

W3

W4

Instruction

Clock cycle 1 2 3 4 5 6 7

E

Execute

operation

W

Write

results

Interstage buffers

B1 B2 B3

Time

Class Exercise 12.1

• During clock cycle 4, what is the information hold by

the three interstage buffers (i.e., B1, B2, and B3)

respectivley?


Student ID:

Name:

Date:

F

Fetch

instruction

D

Decode

instruction

F4I4

F1

F2

F3

I1

I2

I3

D1

D2

D3

D4

E1

E2

E3

E4

W1

W2

W3

W4

Clock cycle 1 2 3 4 5 6 7

E

Execute

operation

W

Write

results

B1 B2 B3

Time

4-Stage Pipeline (2/2)

• The four hardware units perform their tasks

simultaneously without interfering others.

– The required information is passed from one unit to the

next through a interstage buffer.

• Each stage should be roughly the same maximum

clock period.

– Why? A unit that completes its task early is idle for the

remainder of the clock period.

• Question: What is the ideal speedup of an N-stage pipeline compared to the sequential execution?


Outline



– Data Hazard





Reality: Pipeline may Stall

• If a pipeline stage requires more than 1 cycle, others

have to wait (pipeline stalled)

– E.g. E2 requires three cycles to complete


F1

F2

F3

I1

I2

I3

E1

E2

E3

D1

D 2

D 3

W 1

W 2

W 3

Instruction

F4 D 4I4

Clock cycle 1 2 3 4 5 6 7 8 9

E4

F5I5 D 5

Time

E5

W 4

In cycles 5 and 6: Write, Decode and

Fetch units must wait and do nothing …

Stall & Hazard

• Hazard: Any condition that causes pipeline to stall.

• Another example: A cache miss occurs in F2:

Figure: Instruction execution steps in successive clock cycles.

Figure: Statuses of processor stages in successive clock cycles.


F1

F2

F3

I1

I2

I3

D1

D2

D3

E1

E2

E3

W1

W2

W3

Instruction

1 2 3 4 5 6 7 8 9Clock cycle

Time

1 2 3 4 5 6 7 8Clock cycle

Stage

F: Fetch

D: Decode

E: Execute

W: Write

F1 F2 F3

D1 D2 D3idle idle idle

E1 E2 E3idle idle idle

W1 W2idle idle idle

9

W3

F2 F2 F2

Time

Types of Hazards

• Data Hazard

– Either the source or the destination operands of an

instruction are not available when required.

• Instruction Hazard

– A delay in the availability of an instruction (this may

be a result of a miss in the cache).

• Structural Hazard

– Two instructions require the use of a given

hardware resource at the same time.


Outline



– Data Hazard





Data Hazard


I1 (Mul)

I2 (Add)

I3

Instruction

1 2 3 4 5 6 7 8 9Clock cycle

I4

F1

F2

F3

D1

D3

E1

E3

E2

W3

W1

D2-A W2

F4 D4 E4 W4

D2

Time

Pipeline is stalled for two cycles.

I1: A = 3 * A;

I2: B = 4 + A;

D: Decode and fetch

source operands

• A data hazard is a situation in which the pipeline is

stalled because the operands are delayed.

• Example:

– Dependent operations must be performed sequentially to

ensure the data consistency.

Class Exercise 12.2

• Please specify whether we will encounter data

hazards for the following instructions.


I1: A = 5 * C;

I2: B = 20 + C;

I1: C = A * B;

I2: E = C + D;

Software Solution to Data Hazard

• The compiler detects and introduces two-cycle delay

by inserting NOP (No-operation) instructions.

– Advantage: Simpler hardware, less cost

– Disadvantage: Larger code size, less flexibility, and

reduced performance


F1

F2

I1 (Mul)

I2 (Add)

D1 E1

E2

Instruction

1 2 3 4 5 6 7 8 9Clock cycle

W1

W2D2

Time

NOP

NOP

I1: A = 3 * A;

I2: B = 4 + A;

Question: Do we really avoid the pipline stalling?

Hardware Solution to Data Hazard (1/2)

• The data hazard arises because I2 is waiting for data

to be written in the register A.

• In fact, the result of I1 is available at the output of ALU.

• Delay is reduced if the result can be forwarded to E2.


F1

F2

F3

I1 (Mul)

I2 (Add)

I3

D1

D3

E1

E3

E2

W3

Instruction

1 2 3 4 5 6 7 8 9Clock cycle

W1

D2-A W2

F4 D4 E4 W4I4

D2

Time

D: Decode and fetch

source operands

I1: A = 3 * A;

I2: B = 4 + A;

Result of I1 is available here!

Hardware Solution to Data Hazard (2/2)

• Operand Forwarding: By introducing the forwarding

path, the execution of I2 can proceed without stalling.

– Disadvantage: Additional hardware cost


E: Execute(ALU)

W: Write(Register file)

SRC1,SRC2 RSLT

(b) Source and result registers

Register

file

SRC1 SRC2

RSLT

Destination

Source 1

Source 2

(a) Datapath (3 buses)

ALU

Port A Port B

Outline



– Data Hazard





Instruction Hazard

• Recall: The purpose of the instruction fetch unit is to

supply the execution units with instructions.

– F: Fetch instruction from memory

– D: Decode instruction and fetch source operands

– E: Execute instruction

– W: Write the result

• Instruction Hazard: The cases cause the pipeline to

stall, because of the delay of instructions.

1) Cache miss

2) Brach instruction (both unconditional and conditional)CSCI2510 Lec12: Pipelining 24

F

Fetch

instruction

D

Decode

instruction

E

Execute

operation

W

Write

results

B1 B2 B3

Instruction Hazard: Cache Miss

• The effect of a cache miss on the pipelined operation

is as follows:

– I1 is fetched from the cache in cycle 1.

– The fetch operation F2 for I2 results in a cache miss.

• The instruction fetch unit must suspend any further fetch requests until

F2 is completed.


F1

F2

I1

I2

I3

D1

D2

E1

E2

W1

W2

F3 D3 E3 W3

Instruction

1 2 3 4 5 6 7 8 9Clock cycleTime

F3

Postponed

• Branches may also cause the pipeline to stall.

– Branch Penalty: The time lost because of a branch inst.

– Branch penalty can be reduced by computing the branch

address earlier in Decode stage (rather than Execute stage)

• However, it still results in 1 cycle branch penalty to the pipeline.


F1 D1 E1 W1

I2 (Branch to Ik)

I1

1 2 3 4 5 6 7Clock cycle

F2 D2

Branch address computed in Execute stage

Branch Penalty: 2 clock cycles

E2

8

Time

F1 D1 E1 W1

I2 (Branch to Ik)

I1


F2 D2

Branch address computed in Decode stage

Branch Penalty: 1 clock cycle

Time

Fk Dk Ek

Fk+1 Dk+1

Ik

Ik+1

Wk

Ek+1

XF3I3

D3

F4 XI4

I3 and I4 must be

discarded

F3 X

Fk Dk Ek

Fk+ 1 Dk+ 1

I3

Ik

Ik+ 1

Wk

Ek+ 1

Only I3 is

discarded

Instruction Hazard: Unconditional Branch

Solution to Instruction Hazard

• Instruction Queue: The interstage buffer between

Fetch and Decode units can keep multiple instructions.

– Fetch unit gets and deposits one instruction at a time.

– Decode unit consumes one instruction at a time.


Instruction queue

E

Execute

operation

W

Write

results

D

Decode

instruction

F

Fetch

instruction

Interstage buffers

F4

W3E3

F2 D2 E2 W2

F3 D3

E4D4 W4F4

• F4, F5, F6, Fk, and Fk+1, are delayed.

• I1, I2, I3, I4, and Ik cannot complete in successive cycles.


Example: Without Instruction Queue

F1 D1 E1 E1 E1 W1

I5 (Branch to Ik)

I1

1 2 3 4 5 6 7 8 9Clock cycle

I2

I3

I4

I6

Ik

Ik+1

10Time

XF6

Fk Dk Ek

Fk+1 Dk+1

Wk

Ek+1

11 12

Instruction 1 takes 3

Execute cycles (i.e., 2-

cycle stall).

Instruction 4 is delayed.

Instruction 5 is a branch .

Instruction 6 is discarded.

F5 D5

Since there is no

instruction queue!

• I6 is still being discarded, but the instruction queue can avoid

delaying F4, F5, F6, Fk, and Fk+1 if the queue is not empty.

• I1, I2, I3, I4, and Ik can complete in successive cycles.CSCI2510 Lec12: Pipelining 29

Example: With Instruction Queue

F1 D1 E1 E1 E1 W1

I5 (Branch to Ik)

I1

1 2 3 4 5 6 7 8 9Clock cycle

I2

I3

I4

I6

Ik

Ik+1

10

1Queue length 1 1 12 3 2 1 1 1

Time

X

F4

W3E3

F2 D2 E2 W2

F3 D3

E4D4 W4

F5

F6

Fk Dk Ek

Fk+1 Dk+1

Wk

Ek+1

Keep

fetching

D5



cycle stall),

The queue length rises to

3 before cycle 6.


Instruction 6 is discarded,

after taking Branch.

The queue length drops to

1 before cycle 8.

Without vs With Instruction Queue

• With instruction queue, the branch instruction does

not increase the overall execution time (if the queue

is not empty).

– Since instructions can complete in successive clock cycles.

• Branch address is computed in parallel with other

instructions, so no cycles lost due to branch.

– This is called branch folding.

• Instruction queue is also possible to hide the effect of

cache miss (if the queue is not empty).


Class Exercise 12.3

• Please show how instruction queue can hide the

effect of cache miss (three cycles) caused by F4.


F1 D1 E1 E1 E1 W1

W3E3

I1

F2 D2

1 2 3 4 5 6 7 8 9Clock cycle

E2 W2

F3 D3

I2

I3

I4

10Time

11 12

F1 D1 E1 E1 E1 W1

W3E3

I1

F2 D2 E2 W2

F3 D3

I2

I3

I4

1 2 3 4 5 6 7 8 9Clock cycle 10Time

11 12

1 1 1Queue length

Without

Instruction

Queue

With

Instruction

Queue

All intermediate instructions

must be discarded …

• Branch folding is not working for conditional branches.

• Conditional branches may result in added hazard.

– Since the condition is based on the preceding instruction.

• Example:


Instruction Hazard: Conditional Branch

Add

LOOP Shift_left R1

Decrement

Branch=0

R2

LOOP

NEXT R1,R3

R2 is used as the

branch condition.

We need to wait for R2 to

determine whether to perform

the conditional branching.

F1 D1 E1 W1

I2 (Decrement)

I1


F3 D3I3

Time

F2 D2 E2 W2

(Shift)

(Branch if R2 = 0) D3-R2

Fk Dk EkIkWk

8 9 10

LOOP

Solution 1) Delayed Branch (1/2)

• The location following a

branch instruction is

called a branch delay slot.

• Delayed branching can

minimize the penalty by

– Placing useful instructions

in branch delay slots, and

– Internally re-ordering the

instructions.


Add

LOOP Shift_left R1

Decrement

Branch=0

R2

LOOP

NEXT R1,R3

(a) Original program loop

Add

LOOP

Shift_left R1

Decrement

Branch=0

R2

LOOP

NEXT R1,R3

(b) Internally Re-ordered instructions

(actual program logic NOT affected)

Branch Delay Slot

Solution 1) Delayed Branch (2/2)

• Delayed branching can minimize the branch penalty.


Instruction

1 2 3 4 5 6 7 8Clock cycle Time

F ENEXT: Add (Branch not taken) WD

9 10

ALU

Result

Forwarding

ALU

Result

Forwarding

F D

F Daddr

F E

Decrement

Branch=0?

Shift (delay slot)

E

W

W

D

(get branch address)

F E

F

F E

Decrement (Branch is taken)

Branch=0?

Shift (delay slot)

W

W

D

Daddr

D

(get branch address)

Solution 2) Branch Prediction (1/2)


F1

F2

I1 (Compare)

I2 (Branch>0)

D1 E1 W1

Instruction

E2

Clock cycle 1 2 3 4 5 6

D2 / P2

Time

I3 F3 D3 X(Branch Delay Slot)

F4

Fk Dk

XI4

Ik

Incorrect Prediction

Fk DkIk

Correct Prediction

• Attempt to predict

whether conditional

branch will take place.

– Delayed branch can

be applied together.

• Branch Prediction:

– If we get it right: no

lost cycles.

• Registers and memory

cannot be updated until

we know we got it right.

– If we get it wrong, just

cancel the instructions.

– Branch prediction can

be dynamic or static.

Solution 2) Branch Prediction (2/2)

• Static Branch Prediction

– The same choice is used every time the conditional branch

is encountered.

– For example, a branch instruction at the end of a loop

causes a branch to the start of the loop for every pass

through the loop except the last one.

• It is helpful to assume this branch will be taken under this case.

– A flexible approach is to have the compiler decide.

• Dynamic Branch Prediction

– The choice is influenced by the past behavior.

– For example, a simple prediction is to use the result of the

most recent execution of the branch instruction.


Outline



– Data Hazard





Structural Hazard

• A structural hazard is the situation when two

instructions require the use of a hardware resource at

the same time.

• The most common case is in accessing to memory.

– Case 1: One instruction is accessing memory during the

Execute or Write stage; while another is being fetched.

– Solution 1: Many processors use separate instruction and

data caches to avoid this delay.

– Case 2: Another example is when two instructions require

access to the register file at the same time.

– Solution 2: Let the register file have more input/output ports.

• In general, the structural hazard can be avoided by

providing sufficient hardware resources ($$$).CSCI2510 Lec12: Pipelining 39

Class Exercise 12.4

• What is the cause of the following structure hazard?


Outline



– Data Hazard





Superscalar Operation (1/2)

• Superscalar: Execute multiple instructions at any

time via multiple processing units (i.e., we can

execute more than one instruction per cycle)


W : Writeresults

Decode /

Dispatchunit

Instruction

queue

Floating-pointunit

Integerunit

F : Instructionfetch unit

Fetch two instructions

at a time

Decode two

instructions

at a time

I1 (FracAdd)

Instruction

Clock cycle 1 2 3 4 5 6

Time

F1 D1 E1A E1B E1C W1

I2 (Add) F2 D2 E2 W2

Superscalar Operation (2/2)

• Superscalar operation may result in out-of-order

execution, and cause data consistency issue.

– In our previous example, I1 and I2 are dispatched in the

same order as they appear.

– However, their execution is completed out of order.

– To guarantee a consistent state when out-of-order

execution occur, the results of the execution of instructions

must be written in program order strictly .

• The out-of-order execution is also a common

technique to make use of instruction cycles by re-

ordering instructions.

– E.g., Delayed branching reorders the instructions to

minimize the branch penalty.CSCI2510 Lec12: Pipelining 44

Out-of-Order Execution

R1 mem[r0] /* Instruction 1 */

R2 R1 + R2 /* Instruction 2 */

R5 R5 + 1 /* Instruction 3 */

R6 R6 – R3 /* Instruction 4 */

• Instruction 1 results in a cache miss, and a cache

miss can stall entire processor for 20-30 cycles.

• Instruction 2 cannot be executed since it needs R1.

• In instruction queue, look ahead and find instructions

3 and 4 to execute first (reordering).

R1 mem[r0] /* Instruction 1 */

R5 R5 + 1 /* Instruction 3 */

R6 R6 – R3 /* Instruction 4 */

R2 R1 + R2 /* Instruction 2 */


• I6 is still being discarded, but the instruction queue can avoid

delaying F4, F5, F6, Fk, and Fk+1 if the queue is not empty.

• I1, I2, I3, I4, and Ik can complete in successive cycles.CSCI2510 Lec12: Pipelining 46

Recall: With Instruction Queue

F1 D1 E1 E1 E1 W1

I5 (Branch to Ik)

I1

1 2 3 4 5 6 7 8 9Clock cycle

I2

I3

I4

I6

Ik

Ik+1

10

1Queue length 1 1 12 3 2 1 1 1

Time

X

F4

W3E3

F2 D2 E2 W2

F3 D3

E4D4 W4

F5

F6

Fk Dk Ek

Fk+1 Dk+1

Wk

Ek+1

Keep

fetching

D5



cycle stall),

The queue length rises to

3 before cycle 6.


Instruction 6 is discarded,

after taking Branch.

The queue length drops to

1 before cycle 8.

Decode two

instructionsat a time

Summary



– Data Hazard





csci2510 computer organization lecture 12: pipeliningmcyang/csci2510/2018f/lec12 pipelining.pdf ·...

Documents