ca-4 multi cycle processor

8/2/2019 CA-4 Multi Cycle Processor

BKTP.HCM

2010

dce

COMPUTER ARCHITECTURE

CE2010, Faculty of Computer Science and Engineering

Department of Computer Engineering

Dinh Duc Anh Vu
http://www.cse.hcmut.edu.vn/~anhvu



Computer Architecture, Chapter 4, 2010, Dr. Dinh Duc Anh Vu

Chapter 4

The Processor




What is Pipelining?

A way of speeding up execution of instructions

Key idea: overlap execution of multiple instructions

Analogy: doing your laundry

1. Run load through washer

2. Run load through dryer

3. Fold clothes

4. Put away clothes

5. Go to 1

Observation: we can start another load as soon as we finish step 1!




The Laundry Analogy

Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold

Washer takes 30 minutes

Dryer takes 30 minutes

Folder takes 30 minutes

Stasher takes 30 minutes to put clothes into drawers

[Figure: the four loads A, B, C, D]




If we do laundry sequentially...

Time Required: 8 hours for 4 loads

[Figure: timeline from 6 PM to 2 AM; task order A, B, C, D; each load occupies four consecutive 30-minute slots and starts only after the previous load has finished]




To Pipeline, We Overlap Tasks

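Time Required: 3.5 Hours for 4 Loads

Latency remains 2 hours

Throughput improves by a factor of 2.3 (more for larger numbers of loads, approaching 4)

[Figure: timeline from 6 PM to 2 AM; task order A, B, C, D; each load starts 30 minutes after the previous one, so the four loads finish after seven 30-minute slots]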
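The laundry numbers can be checked with a short calculation. This is only a sketch of the arithmetic, assuming four 30-minute stages as in the analogy; the function names are made up for illustration.

```python
# Laundry analogy: four 30-minute stages (wash, dry, fold, stash).
STAGE_MINUTES = 30
NUM_STAGES = 4

def sequential_minutes(loads: int) -> int:
    """Each load finishes all stages before the next one starts."""
    return loads * NUM_STAGES * STAGE_MINUTES

def pipelined_minutes(loads: int) -> int:
    """A new load enters the washer every 30 minutes."""
    return (NUM_STAGES + loads - 1) * STAGE_MINUTES

def speedup(loads: int) -> float:
    return sequential_minutes(loads) / pipelined_minutes(loads)

for n in (1, 4, 100):
    print(n, sequential_minutes(n) / 60, pipelined_minutes(n) / 60, round(speedup(n), 2))
```

For 4 loads this gives 8 hours against 3.5 hours; the speedup grows toward the number of stages as the number of loads increases.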




Pipelining a Digital System

Key idea: break big computation up into pieces

Separate each piece with a pipeline register

[Figure: a 1 ns block of logic split into five 200 ps pieces, each followed by a pipeline register]




Pipelining a Digital System

Why do this?

Because it's faster for repeated computations

Non-pipelined (one 1 ns block): 1 operation finishes every 1 ns

Pipelined (five 200 ps stages): 1 operation finishes every 200 ps




Comments about pipelining

Pipelining increases throughput, but not latency

Answer available every 200 ps, BUT

A single computation still takes 1 ns

Limitations:

Computations must be divisible into stage-sized pieces

Pipeline registers add overhead
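A small sketch of the throughput versus latency trade-off follows. The 20 ps register overhead is a made-up figure used only to show the effect; the slides give no value for it, and the even split of the logic is also an assumption.

```python
def pipeline_timing(logic_ps: int, stages: int, reg_overhead_ps: int):
    """Return (clock period, latency) in ps, assuming the logic splits evenly."""
    stage_ps = -(-logic_ps // stages)        # ceiling division
    period = stage_ps + reg_overhead_ps      # one result per period (throughput)
    latency = period * stages                # time for a single computation
    return period, latency

print(pipeline_timing(1000, 5, 0))    # (200, 1000)
print(pipeline_timing(1000, 5, 20))   # (220, 1100)
```

With any register overhead, the latency of one computation becomes slightly worse than the unpipelined 1 ns even though a result still emerges every clock period.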




The Five Steps of the Load Instruction

      Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
lw    IFetch   Dec      Exec     Mem      WB

IFetch: Instruction Fetch and Update PC

Dec: Instruction Decode, Register Read, Sign Extend Offset

Exec: Execute R-type; Calculate Memory Address; Branch Comparison; Branch and Jump Completion

Mem: Memory Read; Memory Write Completion; R-type Completion (RegFile write)

WB: Memory Read Completion (RegFile write)

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!




Review: Single-Cycle Processor

All 5 steps done in a single clock cycle

Dedicated hardware required for each step

[Figure: single-cycle datapath divided into IF (Instruction Fetch), ID (Instruction Decode), EX (Execute / Address Calc.), MEM (Memory Access), WB (Write Back)]




Pipelining - Key Idea

Question: What happens if we break execution into multiple cycles, but keep the extra hardware?

Answer: In the best case, we can start executing a new instruction on each clock cycle; this is pipelining

Pipelining stages:

IF - Instruction Fetch

ID - Instruction Decode

EX - Execute / Address Calculation

MEM - Memory Access (read / write)

WB - Write Back (results into register file)




Basic Pipelined Processor

[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between the stages]




Comments about Pipelining

The good news

Multiple instructions are being processed at the same time

This works because stages are isolated by registers

Best case speedup of N (for N pipeline stages)

The bad news

Instructions interfere with each other: hazards

Example: different instructions may need the same piece of hardware (e.g., memory) in the same clock cycle

Example: an instruction may require a result produced by an earlier instruction that is not yet complete

Worst case: must suspend execution (stall)




Multicycle Datapath Approach

Let an instruction take more than 1 clock cycle to complete

Break up instructions into steps where each step takes a cycle, while trying to

  balance the amount of work to be done in each step

  restrict each cycle to use only one major functional unit

Not every instruction takes the same number of clock cycles

In addition to faster clock rates, multicycle allows functional units to be used more than once per instruction as long as they are used on different clock cycles; as a result

  only need one memory, but only one memory access per cycle

  need only one ALU/adder, but only one ALU operation per cycle




Our Multicycle Datapath Approach

At the end of a cycle

Store values needed in a later cycle by the current instruction in an internal register (not visible to the programmer). All (except IR) hold data only between a pair of adjacent clock cycles (no write control signal needed)

  IR - Instruction Register

  MDR - Memory Data Register

  A, B - regfile read data registers

  ALUout - ALU output register

Data used by subsequent instructions are stored in programmer-visible registers (i.e., register file, PC, or memory)

[Figure: multicycle datapath: PC; Memory (Address, Read Data (Instr. or Data), Write Data); IR; MDR; Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data, Read Data 1, Read Data 2); A; B; ALU; ALUout]




Clocking the Multicycle Datapath

[Figure: the multicycle datapath (PC, Memory, IR, MDR, Register File, A, B, ALU, ALUout) shown with the System Clock waveform, one clock cycle marked, and the MemWrite and RegWrite signals]




The Multicycle Datapath with Control Signals

[Figure: multicycle datapath with its control unit. Control input: Instr[31-26]. Control signals: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcA, ALUSrcB, RegWrite, RegDst. Other elements: Sign Extend on Instr[15-0]; Shift left 2 on the sign-extended offset; Shift left 2 on Instr[25-0] (26 to 28 bits) joined with PC[31-28] to form the 32-bit jump address; ALU control driven by Instr[5-0] and ALUOp; ALU zero output; constant 4 as an ALU input]




Multicycle Control Unit

Multicycle datapath control signals are not determined solely by the bits in the instruction

  e.g., op code bits tell what operation the ALU should be doing, but not what instruction cycle is to be done next

Must use a finite state machine (FSM) for control

  a set of states (current state stored in State Register)

  next state function (determined by current state and the input)

  output function (determined by current state and the input)

[Figure: combinational control logic with inputs from the State Reg and the Inst Opcode, and outputs to the datapath control points and Next State]
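The next-state function can be sketched in a few lines. This is a simplified illustration, not the actual control unit: it uses one state per step (IFetch, Dec, Exec, Mem, WB) and treats the opcode class as the only input, which is an assumption made here to keep the example small.

```python
# States named after the five steps; the names are illustrative only.
IFETCH, DEC, EXEC, MEM, WB = "IFetch", "Dec", "Exec", "Mem", "WB"

def next_state(state: str, op: str) -> str:
    """Next-state function: depends on the current state and the opcode class."""
    if state == IFETCH:
        return DEC
    if state == DEC:
        return EXEC
    if state == EXEC:
        # Branches and jumps complete in Exec.
        return IFETCH if op in ("beq", "j") else MEM
    if state == MEM:
        # Stores and R-type complete in Mem; loads still need WB.
        return WB if op == "lw" else IFETCH
    return IFETCH  # WB always returns to fetch

def cycles(op: str) -> int:
    """Count the clock cycles one instruction spends in the FSM."""
    state, n = next_state(IFETCH, op), 1
    while state != IFETCH:
        n += 1
        state = next_state(state, op)
    return n

print({op: cycles(op) for op in ("lw", "sw", "R-type", "beq", "j")})
```

Walking the state machine reproduces the 3 to 5 cycle range: branches and jumps take 3 cycles, stores and R-type take 4, and loads take 5.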




Multicycle Advantages and Disadvantages

Uses the clock cycle efficiently: the clock cycle is timed to accommodate the slowest instruction step

Multicycle implementations allow functional units to be used more than once per instruction as long as they are used on different clock cycles

but

Requires additional internal state registers, more muxes, and more complicated (FSM) control

[Figure: multicycle timing over Cycles 1 to 10 of Clk: lw (IFetch, Dec, Exec, Mem, WB) in cycles 1-5, sw (IFetch, Dec, Exec, Mem) in cycles 6-9, R-type begins IFetch in cycle 10]




Single Cycle vs. Multiple Cycle Timing

[Figure: Single Cycle Implementation: lw in Cycle 1 and sw in Cycle 2, with part of the sw cycle wasted. Multiple Cycle Implementation: lw (IFetch, Dec, Exec, Mem, WB) in cycles 1-5, sw (IFetch, Dec, Exec, Mem) in cycles 6-9, R-type begins IFetch in cycle 10]

The multicycle clock cycle is longer than 1/5th of the single-cycle clock cycle due to state register overhead




How Can We Make It Faster?

Start fetching and executing the next instruction before the current one has completed

  Pipelining: (all?) modern processors are pipelined for performance

  Remember the performance equation: CPU time = CPI * CC * IC

Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages

  A five stage pipeline is nearly five times faster because the CC is nearly five times faster

Fetch (and execute) more than one instruction at a time

  Superscalar processing: stay tuned
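The performance equation can be evaluated directly. The values below (a 1000 ps unpipelined cycle, a 200 ps pipelined cycle, one million instructions) are hypothetical and chosen only to show the ideal five-stage case; `cpu_time_ps` is a name invented for this sketch.

```python
def cpu_time_ps(cpi: float, cc_ps: float, ic: int) -> float:
    """CPU time = CPI * CC * IC, with the clock cycle (CC) in picoseconds."""
    return cpi * cc_ps * ic

IC = 1_000_000                        # hypothetical instruction count
single = cpu_time_ps(1.0, 1000, IC)   # one long cycle per instruction
piped = cpu_time_ps(1.0, 200, IC)     # ideal five-stage pipeline, CPI = 1
print(single / piped)                 # 5.0
```

CPI and IC are unchanged in the ideal case, so the whole gain comes from the shorter clock cycle.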




A Pipelined MIPS Processor

Start the next instruction before the current one has completed

  improves throughput: total amount of work done in a given time

  instruction latency (execution time, delay time, response time: time from the start of an instruction to its completion) is not reduced

  clock cycle (pipeline stage time) is limited by the slowest stage

  some stages don't need the whole clock cycle (e.g., WB)

  for some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)

Cycle    1       2       3       4       5       6       7       8
lw       IFetch  Dec     Exec    Mem     WB
sw               IFetch  Dec     Exec    Mem     WB
R-type                   IFetch  Dec     Exec    Mem     WB




Single Cycle vs. Pipeline

To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?

How long does each take to complete 1,000,000 adds?

[Figure: Single Cycle Implementation (CC = 800 ps): lw in Cycle 1, sw in Cycle 2 with part of the cycle wasted. Pipeline Implementation (CC = 200 ps): lw, sw, and R-type each pass through IFetch, Dec, Exec, Mem, WB, each starting one cycle after the previous one; a 400 ps interval is marked]
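One way to work out the two questions above is sketched below, assuming an ideal pipeline with no stalls. A pipelined instruction takes 5 stages x 200 ps = 1000 ps because every stage is given a full 200 ps cycle, even the register stages that need less. The function names are illustrative.

```python
def single_cycle_ps(n: int, cc_ps: int = 800) -> int:
    """Every instruction takes one 800 ps cycle."""
    return n * cc_ps

def pipelined_ps(n: int, cc_ps: int = 200, stages: int = 5) -> int:
    # The first instruction needs `stages` cycles; each later one adds one cycle.
    return (stages + n - 1) * cc_ps

n = 1_000_000
print(single_cycle_ps(n))   # 800000000 ps = 800 us
print(pipelined_ps(n))      # 200000800 ps, about 200 us
print(pipelined_ps(1))      # 1000 ps latency for one instruction
```

For a million adds the pipeline is about four times faster, not five, because the 200 ps clock is set by the slowest stage rather than by one fifth of 800 ps.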




Pipeline Performance

Assume time for stages is

  100ps for register read or write

  200ps for other stages

Compare pipelined datapath with single-cycle datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps
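The totals in the table, and the two clock periods they imply, can be recomputed as follows. The dictionary keys are shorthand invented for this sketch; the stage times are the ones assumed above.

```python
STAGES = ("fetch", "reg_read", "alu", "mem", "reg_write")
STAGE_PS = {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100}

# Which stages each instruction class actually uses.
USES = {
    "lw": STAGES,
    "sw": ("fetch", "reg_read", "alu", "mem"),
    "R-format": ("fetch", "reg_read", "alu", "reg_write"),
    "beq": ("fetch", "reg_read", "alu"),
}

totals = {instr: sum(STAGE_PS[s] for s in used) for instr, used in USES.items()}
single_cycle_tc = max(totals.values())   # clock must fit the slowest instruction
pipelined_tc = max(STAGE_PS.values())    # clock must fit the slowest stage
print(totals, single_cycle_tc, pipelined_tc)
```

The single-cycle clock is set by lw (800 ps), while the pipelined clock is set by the slowest single stage (200 ps).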




Pipeline Performance

[Figure: execution timing, single-cycle (Tc = 800ps) versus pipelined (Tc = 200ps)]




Pipeline Speedup

If all stages are balanced

  i.e., all take the same time

  Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages

If not balanced, speedup is less

Speedup due to increased throughput

  Latency (time for each instruction) does not decrease
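Plugging the earlier stage times into the balanced-stage formula shows why the real speedup falls short of the stage count. This is a sketch using the 800 ps and 200 ps figures from the performance table; the function name is illustrative.

```python
def time_between_pipelined(nonpipelined_ps: float, stages: int) -> float:
    """Balanced case: nonpipelined time divided by the number of stages."""
    return nonpipelined_ps / stages

ideal = time_between_pipelined(800, 5)   # 160.0 ps if perfectly balanced
actual = 200                             # slowest stage sets the real clock
print(800 / ideal, 800 / actual)         # 5.0 4.0
```

Because the register read and write stages need only 100 ps, the stages are unbalanced and the speedup is 4 rather than 5.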




Pipelining the MIPS ISA

What makes it easy

  all instructions are the same length (32 bits)

    can fetch in the 1st stage and decode in the 2nd stage

  few instruction formats (three) with symmetry across formats

    can begin reading register file in 2nd stage

  memory operations occur only in loads and stores

    can use the execute stage to calculate memory addresses

  each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)

  operands must be aligned in memory so a single data transfer takes only one data memory access

MIPS Pipeline Datapath Additions/Mods

State registers between each pipeline stage to isolate them

[Figure: pipelined MIPS datapath with stages IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack, separated by the state registers IF/ID, ID/EX, EX/MEM, MEM/WB and driven by the System Clock. Elements: PC, Add (PC + 4), Instruction Memory, Register File, Sign Extend (16 to 32 bits), Shift left 2, Add (branch target), ALU, Data Memory]

MIPS Pipeline Control Path Modifications

All control signals can be determined during Decode and held in the state registers between pipeline stages

[Figure: the pipelined datapath with a Control unit in the Decode stage and an ALU cntrl unit in the Execute stage; labeled signals: RegWrite, MemRead, MemtoReg, RegDst, ALUOp, ALUSrc, Branch, PCSrc]



Pipeline Control

IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)

ID Stage: no optional control signals to set

      EX Stage                        MEM Stage                WB Stage
      RegDst ALUOp1 ALUOp0 ALUSrc     Brch MemRead MemWrite    RegWrite MemtoReg
R     1      1      0      0          0    0       0           1        0
lw    0      0      0      1          0    1       0           1        1
sw    X      0      0      1          0    0       1           0        X
beq   X      0      1      0          1    0       0           0        X
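The control table can be written down as data to see how many control bits each pipeline register has to carry forward. This is a sketch: the rows are copied from the table, but the encoding as strings and the helper names are choices made for this example, not part of the slides.

```python
EX_SIGS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc")
MEM_SIGS = ("Brch", "MemRead", "MemWrite")
WB_SIGS = ("RegWrite", "MemtoReg")

# Rows copied from the table; "X" marks a don't-care.
CONTROL = {
    "R":   "110000010",
    "lw":  "000101011",
    "sw":  "X0010010X",
    "beq": "X0101000X",
}

def decode(instr: str) -> dict:
    """Split one row into the bits consumed by each stage."""
    bits = CONTROL[instr]
    return {
        "EX": dict(zip(EX_SIGS, bits[0:4])),
        "MEM": dict(zip(MEM_SIGS, bits[4:7])),
        "WB": dict(zip(WB_SIGS, bits[7:9])),
    }

# Each pipeline register holds only the bits still needed downstream.
CARRIED = {
    "ID/EX": len(EX_SIGS) + len(MEM_SIGS) + len(WB_SIGS),
    "EX/MEM": len(MEM_SIGS) + len(WB_SIGS),
    "MEM/WB": len(WB_SIGS),
}
print(decode("lw")["MEM"], CARRIED)
```

All nine bits are produced in Decode; four are used up in EX, three in MEM, and the last two in WB.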

Graphically Representing MIPS Pipeline

[Figure: one instruction drawn as the five stages IF, ID, EX, MEM, WB]

Can help with answering questions like:

  How many cycles does it take to execute this code?

  What is the ALU doing during cycle 4?

  Is there a hazard, why does it occur, and how can it be fixed?


Why Pipeline? For Performance!

[Figure: Inst 0 through Inst 4 in instruction order (vertical axis) against time in clock cycles (horizontal axis); each passes through IF, ID, EX, MEM, WB and starts one cycle after the previous one; the initial cycles are marked as the time to fill the pipeline]

Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
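The staircase diagram is easy to regenerate as text, which also answers "how many cycles does this code take". This is a sketch for an ideal pipeline with no hazards; the `diagram` helper is invented for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(names):
    """Return (total cycles, text rows); instruction i starts in cycle i + 1."""
    total = len(names) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(names):
        cells = [""] * i + STAGES + [""] * (total - i - len(STAGES))
        rows.append(f"{name:<8}" + "".join(f"{c:<5}" for c in cells))
    return total, rows

total, rows = diagram([f"Inst {i}" for i in range(5)])
print("\n".join(rows))
print("total cycles:", total)   # 9
```

Five instructions finish in 9 cycles: 4 cycles to fill the pipeline, then one completion per cycle.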

Can Pipelining Get Us Into Trouble?

Yes: Pipeline Hazards

Situations that prevent starting the next instruction in the next cycle

  structural hazards: attempt to use the same resource by two different instructions at the same time

  data hazards: attempt to use data before it is ready

    An instruction's source operand(s) are produced by a prior instruction still in the pipeline

  control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated

    branch and jump instructions, exceptions

Can usually resolve hazards by waiting

  pipeline control must detect the hazard and take action to resolve hazards

Structural hazard: Conflict for use of a resource

[Figure: lw followed by Inst 1 through Inst 4 in instruction order against time in clock cycles, each passing through IF, ID, EX, MEM, WB; lw is reading data from memory in its MEM stage in the same cycle in which a later instruction (Inst 3) is being read from memory in its IF stage]

With a single memory, instruction fetch would have to stall for that cycle

Fix with separate instr and data memories (I$ and D$)

How About Register File Access?

[Figure: add $1, followed by Inst 1, Inst 2, and add $2,$1, in instruction order against time in clock cycles, each passing through IF, ID, EX, MEM, WB; the WB of the first add and the ID of the second add fall in the same cycle. One clock edge controls register writing; the other controls loading of the pipeline state registers]

Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half

Note: accessing a register takes some time. To operate as shown, the clock cycle must be more than twice the time needed for one register access

Register Usage Can Cause Data Hazards

Dependencies backward in time cause hazards

add $1,

sub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9

[Figure: the five instructions above in instruction order, each passing through IF, ID, EX, MEM, WB; the instructions right after add read $1 before add has written it]

Read before write data hazard
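Detecting these read-before-write hazards amounts to comparing each destination register with the sources of the next few instructions. The sketch below assumes no forwarding and a register file that writes in the first half of a cycle and reads in the second half, so only the two instructions immediately after the producer are at risk. The tuple encoding is made up for illustration, and the sources of the add are left empty because the slide does not show them.

```python
# (mnemonic, destination, sources); the add's sources are not given on the slide.
PROGRAM = [
    ("add", "$1", ()),
    ("sub", "$4", ("$1", "$5")),
    ("and", "$6", ("$1", "$7")),
    ("xor", "$4", ("$1", "$5")),
    ("or",  "$8", ("$1", "$9")),
]

def raw_hazards(program, window=2):
    """List (producer, consumer, distance) triples that read a register too early."""
    found = []
    for i, (_, dest, _) in enumerate(program):
        for d in range(1, window + 1):
            if i + d < len(program) and dest in program[i + d][2]:
                found.append((program[i][0], program[i + d][0], d))
    return found

print(raw_hazards(PROGRAM))  # [('add', 'sub', 1), ('add', 'and', 2)]
```

The instruction three places after add reads $1 in the same cycle in which add writes it, which is safe under the split-cycle register file assumption.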

One Way to Fix a Data Hazard

add $1,

stall

stall

sub $4,$1,$5

and $6,$1,$7

[Figure: add, two stall cycles, then sub and and, each instruction passing through IF, ID, EX, MEM, WB]

Can fix data hazard by waiting (stall), but impacts CPI

Forwarding (aka Bypassing)

Forwarding results as soon as they are available to where they are needed

  Use result when it is computed

  Don't wait for it to be stored in a register

  Requires extra connections in the datapath

add $s0,$t0,$t1

sub $t2,$s0,$t3

[Figure: add and sub each passing through IF, ID, EX, MEM, WB, with the result of add forwarded to sub]

Note: to avoid waiting on the earlier instruction, the result is passed straight from EX to the following instruction (the forwarding technique)

- needs extra dedicated connections

- works only in some cases

Loads Can Cause Data Hazards

Dependencies backward in time cause hazards

lw $1,4($2)

sub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9

[Figure: the five instructions above in instruction order, each passing through IF, ID, EX, MEM, WB]

Load-use data hazard

Load-Use Data Hazard

Can't always avoid stalls by forwarding

  If value not computed when needed

  Can't forward backward in time!

lw $s0, 20($t1)

sub $t2,$s0,$t3

[Figure: lw and sub each passing through IF, ID, EX, MEM, WB]

Note: this is a hazard caused by data access. Use stalling together with forwarding; combine different techniques to optimize the program
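The number of stall cycles in each case can be captured by a small rule. This is a sketch under stated assumptions: a five-stage pipeline, a register file that writes before it reads within a cycle, and forwarding paths into the EX stage from the end of EX (ALU results) and the end of MEM (loads). The function name and arguments are illustrative.

```python
def stalls(producer: str, distance: int, forwarding: bool) -> int:
    """Bubbles needed before a consumer `distance` instructions after the producer.

    Without forwarding, the consumer must read the register file in ID no
    earlier than the cycle in which the producer is in WB.
    """
    if not forwarding:
        return max(0, 3 - distance)
    # With forwarding, an ALU result is ready after EX, a load after MEM.
    ready_after = 2 if producer == "lw" else 1
    return max(0, ready_after - distance)

print(stalls("add", 1, forwarding=False))  # 2
print(stalls("add", 1, forwarding=True))   # 0
print(stalls("lw", 1, forwarding=True))    # 1
print(stalls("lw", 2, forwarding=True))    # 0
```

Forwarding removes the stalls after an ALU instruction, but a load followed immediately by a use still costs one cycle because the data does not exist until the end of MEM.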

Code Scheduling to Avoid Stalls

Reorder code to avoid use of load result in the next instruction

C code for A = B + E; C = B + F;

Original order (13 cycles):

lw $t1, 0($t0)

lw $t2, 4($t0)

(stall)

add $t3, $t1, $t2

sw $t3, 12($t0)

lw $t4, 8($t0)

(stall)

add $t5, $t1, $t4

sw $t5, 16($t0)

Reordered (11 cycles):

lw $t1, 0($t0)

lw $t2, 4($t0)

lw $t4, 8($t0)

add $t3, $t1, $t2

sw $t3, 12($t0)

add $t5, $t1, $t4

sw $t5, 16($t0)

Note: to avoid waiting, reorder the instructions of the program. A use that comes one instruction after its load stalls for 1 cycle; a use that comes two instructions after its load does not wait
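The 13-cycle and 11-cycle counts can be reproduced by counting load-use pairs. This is a rough sketch, assuming a five-stage pipeline with full forwarding in which the only stall is one bubble when an instruction uses the result of the lw directly before it; the tiny parser handles just the instruction shapes in this example.

```python
import re

def parse(line: str):
    """Return (opcode, destination or None, set of source registers)."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "sw":                      # sw reads both of its registers
        return op, None, set(regs)
    return op, regs[0], set(regs[1:])

def total_cycles(program, stages=5):
    """Cycles with forwarding: one bubble whenever a lw result is used next."""
    parsed = [parse(line) for line in program]
    bubbles = sum(
        1
        for prev, cur in zip(parsed, parsed[1:])
        if prev[0] == "lw" and prev[1] in cur[2]
    )
    return len(program) + stages - 1 + bubbles

original = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2", "sw $t3, 12($t0)",
    "lw $t4, 8($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
reordered = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
print(total_cycles(original), total_cycles(reordered))  # 13 11
```

Moving the third lw up puts an independent instruction between each load and its use, so both bubbles disappear.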

Branch Instructions Cause Control Hazards

[Figure: beq, lw, Inst 3, and Inst 4 in instruction order, each passing through IF, ID, EX, MEM, WB]

Stall on Branch

Wait until branch outcome determined before fetching next instruction

beq $s1, $s2, L

sub $t2,$t1,$t3

[Figure: beq and sub each passing through IF, ID, EX, MEM, WB, with the fetch of sub delayed until the branch outcome is known]

Note: the simplest approach is to wait

Branch Prediction

Longer pipelines can't readily determine branch outcome early

  Stall penalty becomes unacceptable

Predict outcome of branch

  Only stall if prediction is wrong

In MIPS pipeline

  Can predict branches not taken

  Fetch instruction after branch, with no delay

Notes:

- the commonly used technique is prediction

- at worst it is as slow as waiting

- if the prediction is correct, there is no need to wait and no cycle is lost

- MIPS predicts that the branch condition does not hold, i.e., it fetches the instruction right after beq

- this is static prediction (not dynamic prediction)

MIPS with Predict Not Taken

Prediction correct:

beq $s1, $s2, L

sub $t2,$t1,$t3

[Figure: beq and sub each passing through IF, ID, EX, MEM, WB with no delay]

Prediction incorrect:

beq $s1, $s2, L

add $t2,$t1,$t3

[Figure: beq and add each passing through IF, ID, EX, MEM, WB, with add delayed]
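The cost of predict-not-taken can be estimated by charging a penalty only for taken branches. The sketch below assumes a one-cycle penalty per misprediction (the branch outcome is known one cycle after fetch) and uses a made-up instruction mix; neither figure comes from the slides.

```python
def cycles_with_branches(instructions: int, branches: int, taken: int,
                         penalty: int = 1, stages: int = 5) -> int:
    """Predict not taken: only taken branches pay `penalty` bubble cycles."""
    assert 0 <= taken <= branches <= instructions
    return instructions + stages - 1 + taken * penalty

# Hypothetical run: 100 instructions, 20 branches, 12 of them taken.
print(cycles_with_branches(100, 20, taken=12))   # 116
print(cycles_with_branches(100, 20, taken=20))   # 124, same as always stalling
```

In the worst case (every branch taken) the cost equals that of stalling on every branch; any branch that falls through is free.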

More-Realistic Branch Prediction

Static branch prediction

  Based on typical branch behavior

  Example: loop and if-statement branches

    Predict backward branches taken

    Predict forward branches not taken

  Note: static prediction does not rely on earlier outcomes; it always chooses one of the two, taken or not taken

Dynamic branch prediction

  Hardware measures actual branch behavior

    e.g., record recent history of each branch

  Assume future behavior will continue the trend

    When wrong, stall while re-fetching, and update history

  Note: dynamic prediction relies on earlier outcomes. One strategy uses 2 history bits: the prediction changes only after two wrong predictions
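The two-bit history scheme mentioned in the note is commonly built as a saturating counter. The sketch below is one such predictor for a single branch; the class name, the starting state, and the loop pattern are all assumptions chosen for illustration.

```python
class TwoBitPredictor:
    """Saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state: int = 0):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

def mispredictions(outcomes, predictor=None) -> int:
    """Run the predictor over a list of actual outcomes and count the misses."""
    predictor = predictor or TwoBitPredictor()
    wrong = 0
    for taken in outcomes:
        wrong += predictor.predict() != taken
        predictor.update(taken)
    return wrong

# A loop branch taken 9 times then not taken, executed twice.
loop = ([True] * 9 + [False]) * 2
print(mispredictions(loop))   # 4
```

After warming up, the counter mispredicts only the loop exit: a single not-taken outcome is not enough to flip the prediction for the next run of the loop.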

Other Pipeline Structures Are Possible

What about the (slow) multiply operation?

  Make the clock twice as slow, or

  let it take two cycles (since it doesn't use the DM stage)

[Figure: pipeline IF, ID, EX, MEM, WB with MUL marked]

Note: multiply executes more slowly than add, so we choose between stretching the clock and letting multiply take several cycles

What if the data memory access is twice as slow as the instruction memory?

  make the clock twice as slow, or

  let data memory access take two cycles (and keep the same clock rate)

[Figure: pipeline IF, ID, EX, MEM1, MEM2, WB]

Note: loads are usually very slow compared with other instructions, so the memory access is split into 2 consecutive cycles

Other Sample Pipeline Alternatives

ARM7

  IM: PC update, IM access

  Reg: decode, reg access

  EX: ALU op, DM access, shift/rotate, commit result (write back)

XScale

  IM1: PC update, BTB access, start IM access

  IM2: IM access

  Reg: decode, reg 1 access

  SHFT: shift/rotate, reg 2 access

  ALU: ALU op

  DM1: start DM access, exception

  DM2: DM write, reg write

Note: with a register-based structure, the instruction set is simple and the structure (design) is simple

Summary

All modern day processors use pipelining

Pipelining doesn't help latency of single task, it helps throughput of entire workload

  Note: instruction fetches overlap

Potential speedup: a CPI of 1 and a fast CC

Pipeline rate limited by slowest pipeline stage

  Unbalanced pipe stages make for inefficiencies

  The time to fill pipeline and time to drain it can impact speedup for deep pipelines and short code runs

  Note: balance the stages (rd, sd, mem ...) and keep them filled; if the pipeline stays continuously full, performance is at its maximum

Must detect and resolve hazards

  Stalling negatively affects CPI (makes CPI greater than the ideal of 1)
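The effect of stalls and fill time on CPI can be put into numbers with the earlier 7-instruction scheduling example. This is a sketch with illustrative function names; it assumes a five-stage pipeline.

```python
def effective_cpi(instructions: int, stall_cycles: int) -> float:
    """CPI once the pipeline is full: ideal 1 plus stall cycles per instruction."""
    return 1 + stall_cycles / instructions

def total_cycles(instructions: int, stall_cycles: int, stages: int = 5) -> int:
    """Includes the stages - 1 cycles needed to fill the pipeline."""
    return instructions + stages - 1 + stall_cycles

print(effective_cpi(7, 2))        # about 1.29 with two load-use stalls
print(total_cycles(7, 2))         # 13
print(total_cycles(7, 0) / 7)     # fill time dominates such a short run
```

Each stall pushes CPI above 1, and for very short code runs the fill cycles matter as much as the stalls do.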
