instruction fetch - university of albertabryanhu/cmpe382_winter2010/cmpe382-02-01-processor2.pdf ·...


31 January, 2010

Chapter 4 — The Processor 1

§4.3 Building a Datapath

Datapath
  Elements that process data and addresses in the CPU
    Registers, ALUs, muxes, memories, …
We will build a MIPS datapath incrementally
  Refining the overview design


Instruction Fetch

[Figure: instruction fetch datapath. Labels: PC is a 32-bit register; an adder increments it by 4 for the next instruction; the instruction memory outputs the instruction.]
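
The fetch step can be sketched in a few lines of Python. This is only an illustrative model, not the hardware: the dict-based instruction memory and the names `fetch`, `instr_mem`, and `MASK32` are assumptions introduced here.

```python
# Sketch only: instruction memory is modeled as a dict from byte address
# to 32-bit word (an assumption for illustration).
MASK32 = 0xFFFFFFFF

def fetch(instr_mem, pc):
    """Return (instruction, next_pc) for the word at byte address pc."""
    instruction = instr_mem[pc]      # read the 32-bit instruction word
    next_pc = (pc + 4) & MASK32      # increment by 4 for the next instruction
    return instruction, next_pc

# Two example words: lw $t1, 4($t0) and add $t2, $t0, $t1
instr_mem = {0x00400000: 0x8D090004, 0x00400004: 0x01095020}
instr, pc = fetch(instr_mem, 0x00400000)
```

The PC is masked to 32 bits because it is a 32-bit register, so the incremented value wraps rather than growing without bound.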


R-Format Instructions
  Read two register operands
  Perform arithmetic/logical operation
  Write register result
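
A minimal sketch of the three R-format steps, assuming a register file modeled as a Python list of 32 integers. The function name `execute_r_format` and the use of `operator` functions to stand in for the ALU are illustrative choices, not part of the slides.

```python
import operator

def execute_r_format(regs, rs, rt, rd, op):
    """Read two registers, apply an arithmetic/logical op, write the result."""
    a, b = regs[rs], regs[rt]          # read two register operands
    result = op(a, b) & 0xFFFFFFFF     # perform the operation, keep 32 bits
    regs[rd] = result                  # write register result
    return result

regs = [0] * 32
regs[8], regs[9] = 5, 7
execute_r_format(regs, 8, 9, 10, operator.add)   # like add $t2, $t0, $t1
```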


Load/Store Instructions
  Read register operands
  Calculate address using 16-bit offset
    Use ALU, but sign-extend offset
  Load: Read memory and update register
  Store: Write register value to memory
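
The address calculation can be illustrated with a short sketch. The helper names (`sign_extend16`, `effective_address`, `load_word`, `store_word`) and the dict-based data memory are assumptions made for this example.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def effective_address(base, offset16):
    """Base register plus sign-extended 16-bit offset, wrapped to 32 bits."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF

def load_word(regs, data_mem, rs, rt, offset16):
    # Load: read memory and update register rt
    regs[rt] = data_mem[effective_address(regs[rs], offset16)]

def store_word(regs, data_mem, rs, rt, offset16):
    # Store: write the value of register rt to memory
    data_mem[effective_address(regs[rs], offset16)] = regs[rt]
```

For example, an offset field of 0xFFFC sign-extends to -4, so a base of 0x1000 yields address 0x0FFC.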


Branch Instructions
  Read register operands
  Compare operands
    Use ALU, subtract and check Zero output
  Calculate target address
    Sign-extend displacement
    Shift left 2 places (word displacement)
    Add to PC + 4
      Already calculated by instruction fetch
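
The branch steps can be sketched as follows. This is a software illustration under the assumption of a branch-on-equal instruction; the function names are hypothetical.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc_plus_4, displacement16):
    """Sign-extend the word displacement, shift left 2, add to PC + 4."""
    return (pc_plus_4 + (sign_extend16(displacement16) << 2)) & 0xFFFFFFFF

def beq_next_pc(pc, a, b, displacement16):
    """Next PC for branch-on-equal: subtract the operands and check for zero."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF          # already calculated by fetch
    zero = ((a - b) & 0xFFFFFFFF) == 0         # ALU subtract, Zero output
    return branch_target(pc_plus_4, displacement16) if zero else pc_plus_4
```

With `pc = 0x100` and a displacement of 3 words, a taken branch lands at 0x104 + 12 = 0x110; a displacement of 0xFFFF (that is, -1 word) lands back at 0x100.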

Branch Instructions

[Figure: branch datapath. Annotations: shift left 2 just re-routes wires; sign-extend replicates the sign-bit wire.]


Composing the Elements
  First-cut datapath does an instruction in one clock cycle
    Each datapath element can only do one function at a time
    Hence, we need separate instruction and data memories
  Use multiplexers where alternate data sources are used for different instructions

R-Type/Load/Store Datapath


Full Datapath


§4.4 A Simple Implementation Scheme

ALU Control
  ALU used for
    Load/Store: F = add
    Branch: F = subtract
    R-type: F depends on funct field

  ALU control   Function
  0000          AND
  0001          OR
  0010          add
  0110          subtract
  0111          set-on-less-than
  1100          NOR
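
The control encoding in the table above can be exercised with a small software model of the ALU. This is a sketch: the function names are assumptions, and set-on-less-than is assumed here to compare the operands as signed 32-bit values.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a signed two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu(control, a, b):
    """Return (result, zero) for the 4-bit ALU control value."""
    if control == 0b0000:
        result = a & b                                      # AND
    elif control == 0b0001:
        result = a | b                                      # OR
    elif control == 0b0010:
        result = a + b                                      # add
    elif control == 0b0110:
        result = a - b                                      # subtract
    elif control == 0b0111:
        result = 1 if to_signed(a) < to_signed(b) else 0    # set-on-less-than
    elif control == 0b1100:
        result = ~(a | b)                                   # NOR
    else:
        raise ValueError(f"unsupported ALU control: {control:04b}")
    result &= MASK32
    return result, result == 0
```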


ALU Control
  Assume 2-bit ALUOp derived from opcode
    Combinational logic derives ALU control

  opcode   ALUOp   Operation          funct    ALU function       ALU control
  lw       00      load word          XXXXXX   add                0010
  sw       00      store word         XXXXXX   add                0010
  beq      01      branch equal       XXXXXX   subtract           0110
  R-type   10      add                100000   add                0010
                   subtract           100010   subtract           0110
                   AND                100100   AND                0000
                   OR                 100101   OR                 0001
                   set-on-less-than   101010   set-on-less-than   0111
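
The table above amounts to a small lookup. A possible sketch in Python follows; the names `alu_control` and `FUNCT_TO_CONTROL` are hypothetical, and real hardware would implement this as combinational logic rather than a dictionary.

```python
# funct field -> 4-bit ALU control, for R-type instructions (ALUOp = 10)
FUNCT_TO_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}

def alu_control(alu_op, funct):
    """Derive the 4-bit ALU control from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:          # lw / sw: add, funct is a don't-care
        return 0b0010
    if alu_op == 0b01:          # beq: subtract, funct is a don't-care
        return 0b0110
    if alu_op == 0b10:          # R-type: depends on funct
        return FUNCT_TO_CONTROL[funct & 0x3F]
    raise ValueError(f"unsupported ALUOp: {alu_op:02b}")
```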

The Main Control Unit
  Control signals derived from instruction

  R-type       0          rs      rt      rd      shamt   funct
               31:26      25:21   20:16   15:11   10:6    5:0

  Load/Store   35 or 43   rs      rt      address
               31:26      25:21   20:16   15:0

  Branch       4          rs      rt      address
               31:26      25:21   20:16   15:0

  Field annotations (from the figure)
    31:26: opcode
    25:21 (rs): always read
    20:16 (rt): read, except for load
    15:11 (rd) or 20:16 (rt): write for R-type and load
    15:0 (address): sign-extend and add
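
The field layout can be checked with a short decoding sketch. The function names and the returned dictionary are assumptions for illustration; only the bit positions and opcode values come from the slide.

```python
def decode_fields(instruction):
    """Split a 32-bit MIPS instruction word into its bit fields."""
    return {
        "opcode":  (instruction >> 26) & 0x3F,   # bits 31:26
        "rs":      (instruction >> 21) & 0x1F,   # bits 25:21
        "rt":      (instruction >> 16) & 0x1F,   # bits 20:16
        "rd":      (instruction >> 11) & 0x1F,   # bits 15:11 (R-type)
        "shamt":   (instruction >> 6) & 0x1F,    # bits 10:6  (R-type)
        "funct":   instruction & 0x3F,           # bits 5:0   (R-type)
        "address": instruction & 0xFFFF,         # bits 15:0  (load/store/branch)
    }

def instruction_class(opcode):
    """Classify by opcode, covering only the formats listed on the slide."""
    classes = {0: "R-type", 35: "load", 43: "store", 4: "branch"}
    return classes.get(opcode, "other")

fields = decode_fields(0x8D090004)   # encodes lw $t1, 4($t0)
```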


Datapath With Control


R-Type Instruction


Load Instruction


Branch-on-Equal Instruction


Implementing Jumps

  Jump   2       address
         31:26   25:0

  Jump uses word address
  Update PC with concatenation of
    Top 4 bits of old PC
    26-bit jump address
    00
  Target address = PC[31…28] : (address × 4)
  Need an extra control signal decoded from opcode
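
The concatenation can be written as a pair of mask-and-shift operations. In this sketch the `pc` argument is whatever value the slide calls the "old PC"; the function name is hypothetical.

```python
def jump_target(pc, address26):
    """Concatenate the top 4 bits of pc, the 26-bit address, and two zero bits."""
    upper = pc & 0xF0000000                     # PC[31..28]
    lower = (address26 & 0x03FFFFFF) << 2       # address x 4 (appends 00)
    return upper | lower
```

For example, with `pc = 0x40000008` and a 26-bit address field of 0x10, the target is 0x40000040.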

Datapath With Jumps Added


Performance Issues
  Longest delay determines clock period
    Critical path: load instruction
    Instruction memory → register file → ALU → data memory → register file
  Not feasible to vary period for different instructions
  Violates design principle
    Making the common case fast
  We will improve performance by pipelining

§4.5 An Overview of Pipelining

Pipelining Analogy
  Pipelined laundry: overlapping execution
    Parallelism improves performance
  Four loads:
    Speedup = 8/3.5 = 2.3
  Non-stop:
    Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
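
The two speedup figures can be reproduced numerically. The sketch below assumes four laundry stages of half an hour each, which is what the formula 2n/(0.5n + 1.5) implies; the function and parameter names are illustrative.

```python
def laundry_speedup(n_loads, stages=4, stage_hours=0.5):
    """Speedup of pipelined over sequential laundry for n_loads loads."""
    sequential = n_loads * stages * stage_hours                      # 2n hours
    pipelined = (stages - 1) * stage_hours + n_loads * stage_hours   # 1.5 + 0.5n
    return sequential / pipelined

four_loads = laundry_speedup(4)        # 8 / 3.5
many_loads = laundry_speedup(10**6)    # approaches the number of stages
```

The first loads pay the cost of filling the pipeline, which is why four loads give only about 2.3 while a very long run approaches 4.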


MIPS Pipeline
  Five stages, one step per stage
    1. IF: Instruction fetch from memory
    2. ID: Instruction decode & register read
    3. EX: Execute operation or calculate address
    4. MEM: Access memory operand
    5. WB: Write result back to register
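
To visualize how the five stages overlap, the sketch below builds a cycle-by-cycle occupancy table. It assumes an ideal pipeline with no stalls or hazards, and the names `STAGES` and `pipeline_schedule` are introduced here for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(n_instructions):
    """Row i gives the stage occupied by instruction i in each cycle (None if idle)."""
    total_cycles = n_instructions + len(STAGES) - 1
    schedule = []
    for i in range(n_instructions):
        row = [None] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name          # instruction i enters stage s in cycle i + s
        schedule.append(row)
    return schedule

for row in pipeline_schedule(3):
    print(" ".join(f"{stage or '-':>3}" for stage in row))
```

Each instruction starts one cycle after its predecessor, so n instructions finish in n + 4 cycles under these assumptions.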


Pipeline Performance
  Assume time for stages is
    100ps for register read or write
    200ps for other stages
  Compare pipelined datapath with single-cycle datapath

  Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
  lw         200ps         100ps           200ps    200ps           100ps            800ps
  sw         200ps         100ps           200ps    200ps                            700ps
  R-format   200ps         100ps           200ps                    100ps            600ps
  beq        200ps         100ps           200ps                                     500ps
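
The totals in the table follow from summing the stages each instruction class uses. A small sketch, with dictionary and function names chosen here for illustration:

```python
# Stage latencies in picoseconds, as assumed on the slide
STAGE_PS = {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100}

# Which stages each instruction class uses, per the table
USES = {
    "lw":       ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "sw":       ["fetch", "reg_read", "alu", "mem"],
    "R-format": ["fetch", "reg_read", "alu", "reg_write"],
    "beq":      ["fetch", "reg_read", "alu"],
}

def total_time_ps(instr):
    """Sum of the stage latencies used by one instruction class."""
    return sum(STAGE_PS[stage] for stage in USES[instr])

# Single-cycle clock must fit the slowest instruction; the pipelined
# clock must fit the slowest stage.
single_cycle_period = max(total_time_ps(i) for i in USES)
pipelined_period = max(STAGE_PS.values())
```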


Pipeline Performance

[Figure: timing diagrams. Single-cycle (Tc = 800ps); Pipelined (Tc = 200ps).]
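
Given the two clock periods, the total time for a run of instructions can be estimated as below. The sketch assumes five pipeline stages and no stalls; the function names are hypothetical.

```python
def single_cycle_time_ps(n_instructions, tc_ps=800):
    """Each instruction takes one full 800ps cycle."""
    return n_instructions * tc_ps

def pipelined_time_ps(n_instructions, tc_ps=200, stages=5):
    """The first instruction needs `stages` cycles; each later one adds one cycle."""
    return (n_instructions + stages - 1) * tc_ps

three_single = single_cycle_time_ps(3)   # 2400 ps
three_piped = pipelined_time_ps(3)       # 1400 ps
```

For three instructions the gain is modest because filling the pipeline dominates; for long runs the ratio approaches 800/200 = 4.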


Pipeline Speedup
  If all stages are balanced
    i.e., all take the same time

    $$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$$

  If not balanced, speedup is less
  Speedup due to increased throughput
    Latency (time for each instruction) does not decrease
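
As a worked check of the balanced-stage formula, the sketch below compares the ideal case with the 800ps / 200ps figures used earlier in these slides. The function names are assumptions for illustration.

```python
def ideal_time_between_instructions_ps(nonpipelined_ps, n_stages):
    """Balanced stages: nonpipelined time divided by the number of stages."""
    return nonpipelined_ps / n_stages

def steady_state_speedup(nonpipelined_ps, pipelined_ps):
    """Ratio of time between instructions, nonpipelined over pipelined."""
    return nonpipelined_ps / pipelined_ps

ideal = ideal_time_between_instructions_ps(800, 5)   # 160.0 ps if perfectly balanced
actual = steady_state_speedup(800, 200)              # 4.0 with the 200ps stage clock
```

Because the register stages need only 100ps but the clock must accommodate the 200ps stages, the stages are not balanced and the speedup is 4 rather than 5.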


Pipelining and ISA Design
  MIPS ISA designed for pipelining
    All instructions are 32 bits
      Easier to fetch and decode in one cycle
      cf. x86: 1- to 17-byte instructions
    Few and regular instruction formats
      Can decode and read registers in one step
    Load/store addressing
      Can calculate address in 3rd stage, access memory in 4th stage
    Alignment of memory operands
      Memory access takes only one cycle
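
One way to see why alignment matters: an aligned word lies within a single memory word, while a misaligned one straddles two and would need two accesses. The sketch below assumes 4-byte memory words; the function name is hypothetical.

```python
def words_touched(address, size=4, word_bytes=4):
    """Number of word-sized memory locations covered by an access."""
    first = address // word_bytes
    last = (address + size - 1) // word_bytes
    return last - first + 1

aligned = words_touched(8)       # bytes 8..11 lie in one word
misaligned = words_touched(10)   # bytes 10..13 span two words
```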