Pipelining Datapath Adapted from the lecture notes of Dr. John Kubiatowicz (UC Berkeley) and Hank Walker (TAMU)

Post on 21-Dec-2015


Pipelining is Natural!

• Laundry Example

• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold

• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes

Sequential Laundry

• Sequential laundry takes 6 hours for 4 loads

[Figure: sequential laundry timeline, 6 PM to midnight. Task order A, B, C, D down the side; each load takes 30 + 40 + 20 minutes before the next one starts.]

Pipelined Laundry: Start work ASAP

• Pipelined laundry takes 3.5 hours for 4 loads

[Figure: pipelined laundry timeline starting at 6 PM. Task order A, B, C, D down the side; the overlapped stages take 30, 40, 40, 40, 40, 20 minutes.]
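The two totals can be checked with a small scheduling sketch. It assumes each load visits the washer, dryer and folder in order, each machine handles one load at a time, and a load moves on as soon as both it and the next machine are ready. The function and variable names are illustrative and do not come from the slides.

```python
# Stage times in minutes, taken from the laundry example.
STAGE_MINUTES = [30, 40, 20]  # washer, dryer, "folder"


def sequential_minutes(loads, stages=STAGE_MINUTES):
    """Each load runs through all stages before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads, stages=STAGE_MINUTES):
    """Finish time of the last load when the stages overlap.

    A load enters a stage once it has left the previous stage and the
    previous load has left this stage.
    """
    free_at = [0] * len(stages)  # time at which each machine becomes free
    for _ in range(loads):
        ready = 0  # time at which this load finished its previous stage
        for s, minutes in enumerate(stages):
            start = max(ready, free_at[s])
            ready = start + minutes
            free_at[s] = ready
    return free_at[-1]


print(sequential_minutes(4) / 60)  # 6.0 hours
print(pipelined_minutes(4) / 60)   # 3.5 hours
```

Note that load A still takes 90 minutes from start to finish in both schedules; only the total for all four loads shrinks.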

Pipelining Lessons

• Latency vs. Throughput

• Question
– What is the latency in both cases?
– What is the throughput in both cases?

Pipelining doesn’t help latency of single task, it helps throughput of entire workload

Pipelining Lessons [contd…]

• Question
– What is the fastest operation in the example?
– What is the slowest operation in the example?

Pipeline rate limited by slowest pipeline stage

Pipelining Lessons [contd…]


Multiple tasks operating simultaneously using different resources

Pipelining Lessons [contd…]

• Question
– Would the speedup increase if we had more steps?

Potential Speedup = Number of pipe stages

Pipelining Lessons [contd…]

• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes

• Question
– Would it matter if the “Folder” also took 40 minutes?

Unbalanced lengths of pipe stages reduce speedup
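To see the effect of balance, the sketch below compares the original stage times with two variations: a “Folder” that also takes 40 minutes, and three equal 30-minute stages with the same total work. It assumes one load per machine at a time, as in the laundry timeline; `finish_time` and `speedup` are illustrative names.

```python
def finish_time(loads, stages):
    """Minutes until the last load leaves the last stage (stages overlap)."""
    free_at = [0] * len(stages)
    for _ in range(loads):
        ready = 0
        for s, minutes in enumerate(stages):
            ready = max(ready, free_at[s]) + minutes
            free_at[s] = ready
    return free_at[-1]


def speedup(loads, stages):
    """Sequential time divided by pipelined time for the same work."""
    return loads * sum(stages) / finish_time(loads, stages)


original = speedup(4, [30, 40, 20])   # 360 / 210
slow_fold = speedup(4, [30, 40, 40])  # 440 / 230
balanced = speedup(4, [30, 30, 30])   # 360 / 180, same total work as original
print(round(original, 2), round(slow_fold, 2), round(balanced, 2))
```

With a 40-minute “Folder” the four loads finish 20 minutes later (230 instead of 210 minutes), while the rate of one load per 40 minutes is unchanged because the dryer is still the slowest stage. With perfectly balanced stages the same 90 minutes of work per load yields a higher speedup.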

Pipelining Lessons [contd…]


Time to “fill” pipeline and time to “drain” it reduces speedup
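The fill and drain cost can be made concrete with the laundry numbers. The sketch assumes the dryer (40 minutes) is the bottleneck, so n loads take 30 minutes to fill, 40 minutes per load, and 20 minutes to drain. This closed form is an illustration derived from the timeline, not a formula given in the slides.

```python
WASH, DRY, FOLD = 30, 40, 20  # minutes, from the laundry example


def pipelined_minutes(n):
    """Fill (first wash) + one dryer slot per load + drain (last fold)."""
    return WASH + n * DRY + FOLD


def speedup(n):
    """Sequential time over pipelined time for n loads."""
    return n * (WASH + DRY + FOLD) / pipelined_minutes(n)


for n in (1, 4, 20, 1000):
    print(n, round(speedup(n), 2))
# For large n the speedup approaches (30 + 40 + 20) / 40 = 2.25,
# which is still below the number of stages (3) because the stages are unbalanced.
```

For a single load there is no speedup at all; the fixed 50 minutes of fill and drain matter less and less as the number of loads grows.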

Five Stages of an Instruction

• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory

• Reg/Dec: Registers Fetch and Instruction Decode

• Exec: Calculate the memory address

• Mem: Read the data from the Data Memory

• Wr: Write the data back to the register file

        Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
Load    Ifetch  Reg/Dec Exec    Mem     Wr

Conventional Pipelined Execution Representation

Time runs left to right; program flow runs downward.

IFetch Dcd    Exec   Mem    WB
       IFetch Dcd    Exec   Mem    WB
              IFetch Dcd    Exec   Mem    WB
                     IFetch Dcd    Exec   Mem    WB
                            IFetch Dcd    Exec   Mem    WB
                                   IFetch Dcd    Exec   Mem    WB

Example

Example [contd…]

• Time (pipelined) = Time (non-pipelined) / Number of pipe stages
– Assumptions
   • Stages are perfectly balanced
   • Ideal conditions

• Ideally, time per instruction = 8 ns / 5 = 1.6 ns, i.e. a speedup of 5

• Most cases are not ideal!

Example [contd…]

• Speedup in this case = 24 ns / 14 ns = 1.7

• Let's add 1000 more instructions
– Time (non-pipelined) = 1000 x 8 + 24 ns = 8024 ns
– Time (pipelined) = 1000 x 2 + 14 ns = 2014 ns
– Speedup = 8024 / 2014 = 3.98, approximately 4 = 8/2

Instruction throughput is the important metric (as opposed to the latency of an individual instruction), because real programs execute billions of instructions!
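The speedup arithmetic follows from two simple counts. The sketch assumes, as the numbers in the example imply, an 8 ns non-pipelined instruction and a 5-stage pipeline clocked at 2 ns (the slowest stage). The constant and function names are illustrative.

```python
NONPIPELINED_NS = 8  # time per instruction without pipelining
CYCLE_NS = 2         # pipelined clock period, set by the slowest stage
STAGES = 5


def time_nonpipelined(n):
    """n instructions executed one after another."""
    return n * NONPIPELINED_NS


def time_pipelined(n):
    """The first instruction needs STAGES cycles; each later one adds a cycle."""
    return (STAGES + n - 1) * CYCLE_NS


for n in (3, 1003):
    t_np, t_p = time_nonpipelined(n), time_pipelined(n)
    print(n, t_np, t_p, round(t_np / t_p, 2))
```

With 3 instructions the fill time dominates and the speedup is only about 1.7; with 1003 instructions it approaches 8/2 = 4.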

Pipeline Hazards

• Structural Hazard

[Figure: six overlapping instructions, each passing through IFetch, Dcd, Exec, Mem, WB; program flow runs downward.]

Pipeline Hazards [contd…]

• Control Hazard

• Example
– add $4, $5, $6
– beq $1, $2, 40
– lw $3, 300($0)

Pipeline Hazards [contd…]

• Data Hazards

• Example
– add $s0, $t0, $t1
– sub $t2, $s0, $t3

Summary of Pipelining Lessons

• Pipelining doesn’t help latency of single task, it helps throughput of entire workload

• Pipeline rate limited by slowest pipeline stage

• Multiple tasks operating simultaneously using different resources

• Potential speedup = Number of pipe stages

• Unbalanced lengths of pipe stages reduce speedup

• Time to “fill” pipeline and time to “drain” it reduces speedup

• Stall for Dependences


Summary of Pipeline Hazards

• Structural Hazards
– Hardware design

• Control Hazard
– Decision based on results

• Data Hazard
– Data Dependency

Control Signals for existing Datapath

The Right to Left Control can lead to hazards

Place registers between each step

Example

10 lw r1, r2(35)

14 addI r2, r2, 3

20 sub r3, r4, r5

24 beq r6, r7, 100

30 ori r8, r9, 17

34 add r10, r11, r12

100 and r13, r14, 15

Start: Fetch 10

[Figure: pipelined datapath snapshot (Next PC, PC, Inst. Mem, IR, Decode, Reg. File, A, B, Exec, S, Mem Access, Data Mem, M, D, Reg. File, with Mem Ctrl and WB Ctrl). PC = 10; the instruction at 10 is in IF; the decode fields rs, rt, im are not yet set and the later stages are marked “n”.]

Fetch 14, Decode 10

[Figure: pipelined datapath snapshot. PC = 14; IR holds lw r1, r2(35), which is in ID with rs = 2; the later stages are marked “n”.]

Fetch 20, Decode 14, Exec 10

[Figure: pipelined datapath snapshot. PC = 20; IR holds addI r2, r2, 3; lw r1 is in EX with A = r2 and immediate 35; the later stages are marked “n”.]

Fetch 24, Decode 20, Exec 14, Mem 10

[Figure: pipelined datapath snapshot. PC = 24; IR holds sub r3, r4, r5 (ID, register fields 4, 5, 3); addI r2, r2, 3 is in EX with A = r2; lw r1 is in M with S = r2+35; the last stage is marked “n”.]

Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10

[Figure: pipelined datapath snapshot. PC = 30; IR holds beq r6, r7, 100 (ID, register fields 6, 7); sub r3 is in EX with A = r4, B = r5; addI r2 is in M with S = r2+3; lw r1 is in WB with M = M[r2+35].]

Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14

[Figure: pipelined datapath snapshot. PC = 100; IR holds ori r8, r9, 17 (ID); beq is in EX with A = r6, B = r7 and target 100; sub r3 is in M with S = r4-r5; addI r2 is in WB with r2+3; r1 = M[r2+35] has been written.]
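The cycle-by-cycle captions in this example follow a simple rule: the instruction fetched in cycle c is in Decode in cycle c+1, in Exec in cycle c+2, and so on. A minimal sketch, assuming the fetch order 10, 14, 20, 24, 30, 100 given in the captions (the branch at 24 redirects the fetch to 100 after the instruction at 30 has been fetched); `occupancy` is an illustrative helper, not part of the slides.

```python
STAGES = ["Fetch", "Decode", "Exec", "Mem", "WB"]
FETCH_ORDER = [10, 14, 20, 24, 30, 100]  # addresses in the order they are fetched


def occupancy(cycle):
    """Map stage name -> instruction address for a 1-based cycle number."""
    result = {}
    for depth, stage in enumerate(STAGES):
        index = cycle - 1 - depth  # which fetched instruction sits in this stage
        if 0 <= index < len(FETCH_ORDER):
            result[stage] = FETCH_ORDER[index]
    return result


for cycle in range(1, 7):
    print(", ".join(f"{stage} {addr}" for stage, addr in occupancy(cycle).items()))
```

The last line printed corresponds to the final snapshot: Fetch 100, Decode 30, Exec 24, Mem 20, WB 14.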

Pipelining Load Instruction

• The five independent functional units in the pipeline datapath are:

– Instruction Memory for the Ifetch stage

– Register File’s Read ports (bus A and bus B) for the Reg/Dec stage

– ALU for the Exec stage

– Data Memory for the Mem stage

– Register File’s Write port (bus W) for the Wr stage

Clock   Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
1st lw  Ifetch  Reg/Dec Exec    Mem     Wr
2nd lw          Ifetch  Reg/Dec Exec    Mem     Wr
3rd lw                  Ifetch  Reg/Dec Exec    Mem     Wr

Pipelining the R Instruction

• Ifetch: Instruction Fetch

– Fetch the instruction from the Instruction Memory

• Reg/Dec: Registers Fetch and Instruction Decode

• Exec:

– ALU operates on the two register operands

– Update PC

• Wr: Write the ALU output back to the register file

        Cycle 1 Cycle 2 Cycle 3 Cycle 4
R-type  Ifetch  Reg/Dec Exec    Wr

Pipelining Both Load and R-type

• We have a pipeline conflict or structural hazard:
– Two instructions try to write to the register file at the same time!
– Only one write port

        Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
R-type  Ifetch  Reg/Dec Exec    Wr
R-type          Ifetch  Reg/Dec Exec    Wr
Load                    Ifetch  Reg/Dec Exec    Mem     Wr
R-type                          Ifetch  Reg/Dec Exec    Wr
R-type                                  Ifetch  Reg/Dec Exec    Wr

Oops! We have a problem!

Important Observations

• Each functional unit can only be used once per instruction

• Each functional unit must be used at the same stage for all instructions:
– Load uses Register File’s Write Port during its 5th stage
– R-type uses Register File’s Write Port during its 4th stage

Stage   1       2       3       4       5
Load    Ifetch  Reg/Dec Exec    Mem     Wr
R-type  Ifetch  Reg/Dec Exec    Wr

Solution

• Delay R-type’s register write by one cycle:
– Now R-type instructions also use Reg File’s write port at Stage 5
– Mem stage is a NOOP stage: nothing is being done.

Stage   1       2       3       4       5
R-type  Ifetch  Reg/Dec Exec    Mem     Wr

        Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
R-type  Ifetch  Reg/Dec Exec    Mem     Wr
R-type          Ifetch  Reg/Dec Exec    Mem     Wr
Load                    Ifetch  Reg/Dec Exec    Mem     Wr
R-type                          Ifetch  Reg/Dec Exec    Mem     Wr
R-type                                  Ifetch  Reg/Dec Exec    Mem     Wr
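The write-port conflict and its fix can be checked mechanically. The sketch assumes one instruction is issued per cycle in the order R-type, R-type, Load, R-type, R-type, and that each instruction writes the register file in its last stage. The names are illustrative.

```python
from collections import Counter


def write_conflicts(stage_counts):
    """Return the cycles in which more than one instruction is in Wr.

    stage_counts[i] is the number of stages of the i-th instruction;
    instruction i (0-based) starts in cycle i + 1 and writes in its last stage,
    that is, in cycle i + stage_counts[i].
    """
    write_cycles = Counter(i + stages for i, stages in enumerate(stage_counts))
    return sorted(cycle for cycle, uses in write_cycles.items() if uses > 1)


R_TYPE, LOAD = 4, 5
print(write_conflicts([R_TYPE, R_TYPE, LOAD, R_TYPE, R_TYPE]))  # [7]
print(write_conflicts([5, 5, 5, 5, 5]))                         # []
```

With 4-stage R-type instructions the Load and the R-type issued right after it both reach Wr in cycle 7. Once every instruction takes 5 stages, each cycle has at most one writer.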

Datapath (Without Pipeline)

IR <- Mem[PC]; PC <- PC+4;

A <- R[rs]; B <- R[rt]

S <- A + B;
R[rd] <- S;

S <- A + SX;
M <- Mem[S]
R[rd] <- M;

S <- A or ZX;
R[rt] <- S;

S <- A + SX;
Mem[S] <- B

If Cond, PC <- PC+SX;

[Figure: datapath with Next PC, PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Equal, Mem Access, Data Mem, M, D, Reg. File.]

Datapath (With Pipeline)

IR <- Mem[PC]; PC <- PC+4;

A <- R[rs]; B <- R[rt]

S <- A + B;
M <- S
R[rd] <- M;

S <- A + SX;
M <- Mem[S]
R[rd] <- M;

S <- A or ZX;
M <- S
R[rt] <- M;

S <- A + SX;
Mem[S] <- B

If Cond, PC <- PC+SX;

[Figure: datapath with Next PC, PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Equal, Mem Access, Data Mem, M, D, Reg. File; instructions that do not access memory pass S through the Mem stage (M <- S).]

Structural Hazard and Solution

[Figure: pipeline diagram, time (clock cycles) across and instruction order down: Load, Instr 1, Instr 2, Instr 3, Instr 4. Each instruction passes through Mem, Reg, ALU, Mem, Reg.]

Control Hazard - #1 Stall

• Stall: wait until decision is clear

• Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow

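[Figure: pipeline diagram, time (clock cycles) across and instruction order down: Add, Beq, Load. Each instruction passes through Mem, Reg, ALU, Mem, Reg; the Load after Beq starts late, and the empty slots are labeled “Lost potential”.]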

Control Hazard – #2 Predict

• Predict: guess one direction then back up if wrong

• Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of time)

• More dynamic scheme: history of 1 branch

[Figure: pipeline diagram, time (clock cycles) across and instruction order down: Add, Beq, Load. Each instruction passes through Mem, Reg, ALU, Mem, Reg.]
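A “history of 1 branch” scheme can be sketched as a one-bit predictor that guesses each branch will do what it did last time. The outcome sequence and the initial prediction below are made-up inputs for illustration, not data from the slides.

```python
def one_bit_mispredictions(outcomes, initial_prediction=False):
    """Count wrong guesses when always predicting the previous outcome.

    outcomes is a sequence of booleans (True = branch taken).
    """
    prediction = initial_prediction
    wrong = 0
    for taken in outcomes:
        if taken != prediction:
            wrong += 1
        prediction = taken  # remember only the most recent outcome
    return wrong


# A loop branch taken 9 times and then falling through, with the loop run twice.
history = ([True] * 9 + [False]) * 2
print(one_bit_mispredictions(history), "wrong out of", len(history))
```

For this loop-like pattern the predictor is wrong only when the branch changes direction, which is much better than the 50% assumed for a fixed guess.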

Control Hazard - #3 Delayed Branch

• Delayed Branch: Redefine branch behavior (takes place after next instruction)

• Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the “slot” (50% of time)

[Figure: pipeline diagram, time (clock cycles) across and instruction order down: Add, Beq, Misc, Load. Each instruction passes through Mem, Reg, ALU, Mem, Reg; Misc occupies the slot after Beq.]

Data Hazards (RAW)

• Dependencies backwards in time are hazards

[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB (drawn as Im, Reg, ALU, Dm, Reg) for the instruction order below.]

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

Data Hazards [contd…]

• “Forward” result from one stage to another

[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB (drawn as Im, Reg, ALU, Dm, Reg) for the instruction order below.]

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

Data Hazards [contd…]

• Dependencies backwards in time are hazards

• Can’t solve with forwarding:
– Must delay/stall instruction dependent on loads

[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB (drawn as Im, Reg, ALU, Dm, Reg) for the two instructions below; the second one is marked “Stall”.]

lw r1,0(r2)

sub r4,r1,r3
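The stall rule for these cases can be written down compactly. The sketch below assumes the classic five-stage pipeline used in these slides, a register file that writes in the first half of a cycle and reads in the second half, and a dependent instruction issued `distance` instructions after the producer. These assumptions and the function name are illustrative and are not stated in the slides.

```python
def stall_cycles(producer, distance, forwarding):
    """Bubbles needed before a consumer `distance` instructions after `producer`.

    producer is "alu" or "load"; distance is 1 for the very next instruction.
    """
    if not forwarding:
        # The consumer may decode no earlier than the producer's write-back.
        return max(0, 3 - distance)
    if producer == "load":
        # Loaded data exists only after Mem, one stage later than an ALU result.
        return max(0, 2 - distance)
    return 0  # an ALU result is forwarded straight into the next Exec


print(stall_cycles("alu", 1, forwarding=False))  # add then sub, no forwarding
print(stall_cycles("alu", 1, forwarding=True))   # add then sub, with forwarding
print(stall_cycles("load", 1, forwarding=True))  # lw then sub: still one stall
```

Under these assumptions forwarding removes every stall for ALU results, while the instruction right after a load still waits one cycle.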

Hazard Detection

[Figure: example stage sequences for each hazard type.]

• Structural Hazard: I-Fetch DCD MemOpFetch OpFetch Exec Store, followed by IFetch DCD ...

• Control Hazard: I-Fetch DCD OpFetch Jump, followed by IFetch DCD ...

• RAW (read after write) Data Hazard

• WAW (write after write) Data Hazard

• WAR (write after read) Data Hazard

The data hazard panels pair instructions with the stage sequences IF DCD EX Mem WB, IF DCD OF Ex Mem, and IF DCD OF Ex RS.

Hazard Detection

• Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.

• A RAW hazard exists on register r if r ∈ Rregs(i) ∩ Wregs(j)

• A WAW hazard exists on register r if r ∈ Wregs(i) ∩ Wregs(j)

• A WAR hazard exists on register r if r ∈ Wregs(i) ∩ Rregs(j)

Window on execution: only pending instructions can cause hazards

[Figure: instruction movement through the execution window: New Inst, Inst I, Inst J.]
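The three hazard conditions are set intersections, so they translate directly into code. A minimal sketch, assuming each instruction is described by the set of registers it reads (Rregs) and the set it writes (Wregs); the `Instr` type and the `hazards` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    reads: frozenset   # Rregs: registers the instruction reads
    writes: frozenset  # Wregs: registers the instruction writes


def hazards(i, j):
    """Hazards between instruction i and a predecessor j still in the pipeline."""
    return {
        "RAW": i.reads & j.writes,   # i reads a register that j writes
        "WAW": i.writes & j.writes,  # both write the same register
        "WAR": i.writes & j.reads,   # i writes a register that j reads
    }


j = Instr("add r1,r2,r3", frozenset({"r2", "r3"}), frozenset({"r1"}))
i = Instr("sub r4,r1,r3", frozenset({"r1", "r3"}), frozenset({"r4"}))
print(hazards(i, j)["RAW"])  # frozenset({'r1'})
```

An empty set means no hazard of that kind; here only the RAW set is non-empty, because sub reads r1 before add has written it.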

Computing CPI

• Start with Base CPI

• Add stalls

CPI = CPI_base + CPI_stall

CPI_stall = STALL_type1 x freq_type1 + STALL_type2 x freq_type2

• Suppose:
– CPI_base = 1
– freq_branch = 20%, freq_load = 30%
– Suppose branches always cause 1 cycle stall
– Loads cause a 2 cycle stall

• Then: CPI = 1 + (1 x 0.20) + (2 x 0.30) = 1.8
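The same calculation in code, as a sketch. The dictionary layout and function name are illustrative; the frequencies and stall counts are the ones supposed in this example.

```python
def effective_cpi(cpi_base, stall_sources):
    """CPI_base plus the sum of (frequency x stall cycles) over instruction types."""
    cpi_stall = sum(freq * stalls for freq, stalls in stall_sources.values())
    return cpi_base + cpi_stall


sources = {
    "branch": (0.20, 1),  # 20% of instructions, 1 stall cycle each
    "load": (0.30, 2),    # 30% of instructions, 2 stall cycles each
}
print(round(effective_cpi(1, sources), 2))  # 1.8
```

Changing a single entry, for example reducing the load stall to 1 cycle, shows how much each hazard type contributes to the total.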

Summary

• Control Signals need to be propagated

• Insert Registers between every stage to “remember” and “propagate” values

• Solutions to Control Hazard are Stall, Predict and Delayed Branch

• Solution to Data Hazard is “Forwarding”, with a stall for instructions dependent on loads

• Effective CPI = CPI_ideal + CPI_stall