ECE 154A Introduction to Computer Architecture
Dmitri Strukov
MIPS Instruction Set Architecture & Single Cycle Datapath and Control
Outline
• Admin
• Choices and Basic Design Principles
• RISC Architecture
• Major Types of Instructions
  – Arithmetic Instructions
  – Load Store Instructions
  – Control Instructions
• Datapath & Control
  – Single Cycle Implementation
  – Multi Cycle Implementation
Admin
‐ Website with lecture slides: http://www.ece.ucsb.edu/~strukov/ece154aFall2013/ece154A.htm
‐ Check the reading assignments on the web
‐ HW#1 is online and due next Monday 11 pm (hw box)
‐ Tentative midterm dates:
  ‐ #1: October 29th
  ‐ #2: November 19th
Simple Computer
• Stored-program (von Neumann) computer
[Figure: block diagram of Memory, Control, and Datapath, with addresses, data, and operation signals]
Algorithm for F = A x B + C / D
Step 1: Temp1 = A x B
Step 2: Temp2 = C / D
Step 3: F = Temp1 + Temp2
Execution over time:
1. Load first instruction to control; read A and B from memory, compute temp1, write temp1 to memory
2. Load second instruction to control; read C and D from memory, compute temp2, write temp2 to memory
3. Load third instruction to control; read temp1 and temp2 from memory, compute F, write F to memory
Performance = 1 / Exec Time = 1 / (CCT x IC x CPI)
Improving Performance
• Performance depends on
  – Algorithm: affects IC, possibly CPI
  – Programming language: affects IC, CPI
  – Compiler: affects IC, CPI
  – Instruction set architecture: affects IC, CPI, CCT
Performance = 1 / Exec Time = 1 / (CCT x IC x CPI)
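A minimal sketch of the performance equation in C. The instruction count, CPI, and clock cycle time passed in are hypothetical inputs; the point is only that the three factors multiply, so reducing any one of them by some factor reduces execution time by the same factor.

```c
#include <assert.h>

/* Execution time = IC x CPI x CCT (clock cycle time, here in ns). */
double exec_time_ns(double ic, double cpi, double cct_ns)
{
    assert(ic >= 0.0 && cpi > 0.0 && cct_ns > 0.0);
    return ic * cpi * cct_ns;
}

/* Performance is the reciprocal of execution time (here in 1/ns). */
double performance(double ic, double cpi, double cct_ns)
{
    return 1.0 / exec_time_ns(ic, cpi, cct_ns);
}
```

For example, the three-instruction program above with an assumed CPI of 2 and a 0.5 ns clock gives `exec_time_ns(3, 2.0, 0.5)` = 3 ns.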
Control
• An instruction is encoded as a sequence of bits (in memory)
  – E.g., an instruction may encode the type of operation together with the data and/or the memory addresses of the data
• Control circuitry
  – decodes the instruction into a set of (or, for a multicycle implementation, a sequence of) control signals, which orchestrate the data movement and data processing
  – should also take care of choosing the next instruction (FSM)
Control: Program Counter
• Program counter = register holding the address of the currently executed instruction
• With sequential execution, the PC is simply incremented to point to the next instruction
Stored-Program Architecture
• Where do we store the program?
• Two options:
  – In a separate memory, or hardwired (Harvard architecture)
  – In the same memory where the data are kept (stored-program architecture)
• Cons and pros?
Stored-Program Architecture
• Where do we store code?
• Two options:
  – In a separate memory (Harvard architecture)
    ‐ Higher throughput / faster read
    ‐ Better CCT and/or CPI
  – In the same memory where the data are kept (stored-program architecture)
    ‐ Can modify code from the program (a plus in general, but could be a problem for security)
    ‐ More efficient use of memory
Two Key Principles of Machine Design
1. Instructions are represented as numbers and, as such, are indistinguishable from data
2. Programs are stored in alterable memory (that can be read or written to) just like data
[Figure: one memory holding the accounting program (machine code), the C compiler (machine code), payroll data, and the source code in C for the accounting program]
Operands of Instructions
• Choices?
  – Directly on memory
    • have instructions in which the addresses of input data and output data are directly specified as addresses in main memory
  – Only with a small (local) memory called the register file (so-called LOAD-STORE ARCHITECTURE)
    • have instructions in which the addresses of input data and output data are directly specified as addresses in local memory
    • have additional instructions which move data between local and main memory
Load Store Architecture
BEFORE (operands directly in memory):
1. Load first instruction to control; read A and B from memory, compute temp1, write temp1 to memory
2. Load second instruction to control; read C and D from memory, compute temp2, write temp2 to memory
3. Load third instruction to control; read temp1 and temp2 from memory, compute F, write F to memory
UCSB | ECE 154A | Fall 2013
LOAD-STORE ARCHITECTURE
Timeline for the first step of the algorithm:
• Load first instruction to control
• Read A and B from (main) memory to local memory (register file, RF)
• Compute temp1, and write temp1 to local memory (RF)
• Write temp1 to main memory
Algorithm for F = A x B + C / D
Step 1: Temp1 = A x B
Step 2: Temp2 = C / D
Step 3: F = Temp1 + Temp2
Load Store Architecture: Effect on Performance?
‐ IC is worse
‐ CPI x CCT is better
  ‐ Large memory = large delay (CCT or CPI)
  ‐ Temporal locality of data
‐ Better code density (smaller operand fields)
‐ The step "Write temp1 to main memory" does not have to be done: temp1 can stay in the register file until Step 3 uses it!
Operands of Instructions: Variations
• Accumulator architecture
  ‐ Results of operations are always stored in a special (accumulator) register
• Stack architecture
  ‐ Datapath always operates on the most recent data (which are at the top of a stack)
  ‐ Cons and pros?
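To make the stack option concrete, here is a minimal sketch of a zero-address stack machine evaluating F = A x B + C / D. The opcodes PUSH, ADD, MUL, DIV, the 16-entry stack, and integer division are assumptions for illustration, not a real ISA.

```c
#include <assert.h>

enum op { PUSH, ADD, MUL, DIV };          /* hypothetical stack-machine opcodes */
struct instr { enum op op; int value; };  /* value is used only by PUSH */

/* Runs a program on an operand stack; arithmetic always uses the top two entries. */
int run_stack(const struct instr *prog, int n)
{
    int stack[16];
    int sp = 0;                           /* number of entries on the stack */
    for (int i = 0; i < n; i++) {
        if (prog[i].op == PUSH) {
            assert(sp < 16);
            stack[sp++] = prog[i].value;
            continue;
        }
        assert(sp >= 2);
        int b = stack[--sp];              /* top of stack */
        int a = stack[--sp];              /* entry below it */
        switch (prog[i].op) {
        case ADD: stack[sp++] = a + b; break;
        case MUL: stack[sp++] = a * b; break;
        case DIV: assert(b != 0); stack[sp++] = a / b; break;
        default:  assert(0);
        }
    }
    assert(sp == 1);                      /* the result is the only entry left */
    return stack[0];
}
```

The program PUSH A, PUSH B, MUL, PUSH C, PUSH D, DIV, ADD needs no operand fields in the arithmetic instructions (good code density), but every operand must be pushed first (higher IC) and the evaluation order is dictated by the stack.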
Choice of Instructions?
• Fixed length vs. flexible length
• Number of instructions, i.e., few vs. many
  – 1 instruction is enough (OISC)
    • Subtract and Branch if Less than or Equal to zero
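The single-instruction idea can be sketched as an interpreter. This assumes the common SUBLEQ convention, which goes beyond what the slide states: each instruction is three memory words a, b, c; it computes mem[b] = mem[b] - mem[a] and jumps to c if the result is less than or equal to zero (otherwise it falls through), and a negative jump target halts.

```c
#include <assert.h>

/* Runs a SUBLEQ program in place; returns the number of instructions executed. */
int run_subleq(int *mem, int size, int max_steps)
{
    int pc = 0, steps = 0;
    while (pc >= 0 && steps < max_steps) {
        assert(pc + 2 < size);
        int a = mem[pc], b = mem[pc + 1], c = mem[pc + 2];
        assert(a >= 0 && a < size && b >= 0 && b < size);
        mem[b] -= mem[a];                      /* subtract ... */
        pc = (mem[b] <= 0) ? c : pc + 3;       /* ... and branch if <= 0 */
        steps++;
    }
    return steps;
}
```

For instance, the instructions (9, 11, 3), (11, 10, 6), (11, 11, -1) add mem[9] into mem[10] using mem[11] as a zero scratch word; the last instruction clears the scratch word and halts.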
Choice of Instructions?
• Fixed length = simpler design
  – Easy decoding (faster CCT) …
  – … but could be sparser code (higher IC)
CISC (complex instruction set computing). Examples: x86 (Intel Atom, Intel Core, AMD Opteron), Motorola 68k, VAX
vs.
RISC (reduced instruction set computing). Examples: MIPS (focus of this class; Sony PlayStation 2), ARM (Apple A5x (iPad), Qualcomm Snapdragon, Cortex-A9 (Microsoft Surface), NVIDIA Tegra)
How Many Bits in One Register?
• 8-bit Intel 8080 processor (1974)
• 32-bit for mobile and 64-bit for high-performance processors today
  – Could be much larger for vector processors
MIPS (RISC) Design Principles
• Simplicity favors regularity
  – fixed size instructions
  – small number of instruction formats
  – opcode always the first 6 bits
• Smaller is faster
  – limited instruction set
  – limited number of registers in register file
  – limited number of addressing modes (TBD)
• Make the common case fast
  – arithmetic operands from the register file (load-store machine)
  – allow instructions to contain immediate operands (TBD)
• Good design demands good compromises
  – three instruction formats
MIPS-32 ISA
• Instruction Categories
  – Computational: Arith, Shift, Logical
  – Memory transfer: Load/Store
  – Control: Jump and Branch
  – Others:
    • Floating Point – coprocessor
    • Memory Management
    • Special
• Registers: R0 - R31, PC, HI, LO
3 Instruction Formats: all 32 bits wide
R format: op | rs | rt | rd | sa | funct
I format: op | rs | rt | immediate
J format: op | jump target
MIPS Register File
• Holds thirty-two 32-bit registers
  – Two read ports and
  – One write port
[Figure: register file with 5-bit inputs src1 addr, src2 addr, and dst addr, 32-bit write data, and write control; 32-bit outputs src1 data and src2 data; 32 locations]
Registers are
• Faster than main memory
  ‐ But register files with more locations are slower (e.g., a 64-word file could be as much as 50% slower than a 32-word file)
  ‐ Read/write port increase impacts speed quadratically
• Easier for a compiler to use
  ‐ e.g., (A*B) – (C*D) – (E*F) can do the multiplies in any order vs. a stack
• Can hold variables so that
  ‐ code density improves (since registers are named with fewer bits than a memory location)
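A behavioral sketch of this register file in C: two read ports, one write port gated by a write-control signal, and register 0 hardwired to zero (as $zero is in the register convention). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Thirty-two 32-bit registers; register 0 always reads as zero. */
struct regfile { uint32_t r[32]; };

/* Two read ports: both source registers can be read in the same cycle. */
void rf_read(const struct regfile *rf, unsigned src1, unsigned src2,
             uint32_t *data1, uint32_t *data2)
{
    assert(src1 < 32 && src2 < 32);       /* 5-bit register addresses */
    *data1 = src1 ? rf->r[src1] : 0;
    *data2 = src2 ? rf->r[src2] : 0;
}

/* One write port, used only when the write-control signal is asserted. */
void rf_write(struct regfile *rf, int reg_write, unsigned dst, uint32_t data)
{
    assert(dst < 32);
    if (reg_write && dst != 0)            /* writes to register 0 are discarded */
        rf->r[dst] = data;
}
```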
Aside: MIPS Register Convention
Name        Register Number   Usage                     Preserve on call?
$zero       0                 constant 0 (hardware)     n.a.
$at         1                 reserved for assembler    n.a.
$v0 - $v1   2-3               returned values           no
$a0 - $a3   4-7               arguments                 no
$t0 - $t7   8-15              temporaries               no
$s0 - $s7   16-23             saved values              yes
$t8 - $t9   24-25             temporaries               no
$gp         28                global pointer            yes
$sp         29                stack pointer             yes
$fp         30                frame pointer             yes
$ra         31                return addr (hardware)    yes
Memory Operands
• To apply computational operations
  – Load values from memory into registers
  – Store result from register to memory
• Memory is byte addressed (for historic reasons)
  – Each address identifies an 8-bit byte
• Words are aligned in memory
  – Address must be a multiple of 4 (last two bits are always 0)
MIPS Instruction Fields
• MIPS fields are given names to make them easier to refer to
op | rs | rt | rd | shamt | funct
op      6 bits   opcode that specifies the operation
rs      5 bits   register file address of the first source operand
rt      5 bits   register file address of the second source operand
rd      5 bits   register file address of the result's destination
shamt   5 bits   shift amount (for shift instructions)
funct   6 bits   function code augmenting the opcode
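Packing the named fields into a 32-bit word is just shifting and OR-ing each field into place. A minimal sketch of an encoder for the three formats; the function names are hypothetical, and the test values reuse encodings that appear on these slides (e.g., add $t0, $s1, $s2 = 0 17 18 8 0 0x22).

```c
#include <assert.h>
#include <stdint.h>

/* R format: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6) */
uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt, uint32_t rd,
                  uint32_t shamt, uint32_t funct)
{
    assert(op < 64 && rs < 32 && rt < 32 && rd < 32 && shamt < 32 && funct < 64);
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct;
}

/* I format: op(6) rs(5) rt(5) immediate(16); only the low 16 bits of imm are kept */
uint32_t encode_i(uint32_t op, uint32_t rs, uint32_t rt, int32_t imm)
{
    assert(op < 64 && rs < 32 && rt < 32);
    assert(imm >= -32768 && imm <= 65535);    /* fits as signed or unsigned 16-bit */
    return (op << 26) | (rs << 21) | (rt << 16) | ((uint32_t)imm & 0xFFFFu);
}

/* J format: op(6) target(26) */
uint32_t encode_j(uint32_t op, uint32_t target)
{
    assert(op < 64 && target < (1u << 26));
    return (op << 26) | target;
}
```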
Levels of Representation
High Level Language Program (e.g., C)
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  ↓ Compiler
Assembly Language Program (e.g., ARM)   [focus of discussion now]
    ldr r0, [r2]
    ldr r1, [r2, #4]
    str r1, [r2]
    str r0, [r2, #4]
  ↓ Assembler
Machine Language Program (ARM)
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
  ↓ Machine Interpretation
Hardware Architecture Description (e.g., block diagrams)
  ↓ Architecture Implementation
Logic Circuit Description (Circuit Schematic Diagrams)
Dan Garcia
MIPS Arithmetic Instructions
• MIPS assembly language arithmetic statement
    add $t0, $s1, $s2
    sub $t0, $s1, $s2
• Each arithmetic instruction performs one operation
• Each specifies exactly three operands that are all contained in the datapath's register file ($t0, $s1, $s2)
    destination = source1 op source2
• Instruction Format (R format), for add:
    0 | 17 | 18 | 8 | 0 | 0x22
MIPS Shift Operations
• Need operations to pack and unpack 8-bit characters into 32-bit words
• Shifts move all the bits in a word left or right
    sll $t2, $s0, 8    #$t2 = $s0 << 8 bits
    srl $t2, $s0, 8    #$t2 = $s0 >> 8 bits
• Instruction Format (R format), for sll (rs is unused and set to 0):
    0 | 0 | 16 | 10 | 8 | 0x00
• Such shifts are called logical because they fill with zeros
• Notice that a 5-bit shamt field is enough to shift a 32-bit value 2^5 – 1 or 31 bit positions
• MIPS also has sllv, srlv, and srav
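A sketch of the shift semantics in C, with hypothetical function names. Logical shifts fill with zeros; the arithmetic right shift (sra) copies the sign bit instead. It is written without a signed `>>` because C leaves right-shifting a negative signed value implementation-defined. The variable-shift form (sllv) takes its amount from the low 5 bits of a register.

```c
#include <assert.h>
#include <stdint.h>

/* sll / srl: logical shifts fill the vacated bits with zeros. */
uint32_t mips_sll(uint32_t rt, unsigned shamt) { assert(shamt < 32); return rt << shamt; }
uint32_t mips_srl(uint32_t rt, unsigned shamt) { assert(shamt < 32); return rt >> shamt; }

/* sra: arithmetic shift fills the vacated bits with copies of the sign bit (msb). */
uint32_t mips_sra(uint32_t rt, unsigned shamt)
{
    assert(shamt < 32);
    uint32_t result = rt >> shamt;
    if (rt & 0x80000000u)
        result |= ~(UINT32_MAX >> shamt);   /* set the top shamt bits */
    return result;
}

/* sllv: shift amount comes from the low 5 bits of a register. */
uint32_t mips_sllv(uint32_t rt, uint32_t rs) { return rt << (rs & 31u); }
```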
MIPS Logical Operations
• There are a number of bit-wise logical operations in the MIPS ISA
    and $t0, $t1, $t2    #$t0 = $t1 & $t2
    or  $t0, $t1, $t2    #$t0 = $t1 | $t2
    nor $t0, $t1, $t2    #$t0 = not($t1 | $t2)
• Instruction Format (R format), for and:
    0 | 9 | 10 | 8 | 0 | 0x24
    andi $t0, $t1, 0xFF00    #$t0 = $t1 & ff00
    ori  $t0, $t1, 0xFF00    #$t0 = $t1 | ff00
• Instruction Format (I format), for ori:
    0x0D | 9 | 8 | 0xFF00
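The same operations as a C sketch (function names hypothetical). andi and ori zero-extend their 16-bit immediate, so the upper half of the register is cleared by andi and left unchanged by ori.

```c
#include <assert.h>
#include <stdint.h>

uint32_t mips_and(uint32_t a, uint32_t b) { return a & b; }
uint32_t mips_or(uint32_t a, uint32_t b)  { return a | b; }
uint32_t mips_nor(uint32_t a, uint32_t b) { return ~(a | b); }

/* andi / ori: the 16-bit immediate is zero-extended before the operation. */
uint32_t mips_andi(uint32_t a, uint16_t imm) { return a & (uint32_t)imm; }
uint32_t mips_ori(uint32_t a, uint16_t imm)  { return a | (uint32_t)imm; }
```

Since nor with $zero as one operand inverts every bit, MIPS can express NOT without a separate instruction.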
MIPS Immediate Instructions
• Small constants are used often in typical code
• Possible approaches?
  – put "typical constants" in memory and load them
  – create hard-wired registers (like $zero) for constants like 1
  – have special instructions that contain constants!
    addi $sp, $sp, 4     #$sp = $sp + 4
    slti $t0, $s2, 15    #$t0 = 1 if $s2<15
• Machine format (I format), for slti:
    0x0A | 18 | 8 | 0x0F
• Best approach: the constant is kept inside the instruction itself!
  – Immediate format limits values to the range +2^15–1 to –2^15
  – Note that how the constant is treated is determined by the type of instruction, e.g., addi sign-extends it (two's complement), while andi and ori zero-extend it; addiu also sign-extends, and its "u" only means that overflow is not trapped
Review: Unsigned Binary Integers
• Given an n-bit number
  $x = x_{n-1}2^{n-1} + x_{n-2}2^{n-2} + \cdots + x_1 2^1 + x_0 2^0$
• Range: 0 to $+2^n - 1$
• Example
  0000 0000 0000 0000 0000 0000 0000 1011_2
  $= 0 + \cdots + 1\times2^3 + 0\times2^2 + 1\times2^1 + 1\times2^0 = 0 + \cdots + 8 + 0 + 2 + 1 = 11_{10}$
• Using 32 bits: 0 to +4,294,967,295
Review: 2s-Complement Signed Integers
• Given an n-bit number
  $x = -x_{n-1}2^{n-1} + x_{n-2}2^{n-2} + \cdots + x_1 2^1 + x_0 2^0$
• Range: $-2^{n-1}$ to $+2^{n-1} - 1$
• Example
  1111 1111 1111 1111 1111 1111 1111 1100_2
  $= -1\times2^{31} + 1\times2^{30} + \cdots + 1\times2^2 + 0\times2^1 + 0\times2^0$
  = –2,147,483,648 + 2,147,483,644 = $-4_{10}$
• Using 32 bits: –2,147,483,648 to +2,147,483,647
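Both formulas can be evaluated bit by bit, as sketched below (function names hypothetical). The only difference is the weight of the msb: $+2^{n-1}$ for unsigned and $-2^{n-1}$ for two's complement. The code uses the equivalent form "unsigned value minus $2^n$ when the msb is 1".

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned value of the low n bits: sum of x_i * 2^i. */
uint64_t unsigned_value(uint32_t bits, int n)
{
    assert(n >= 1 && n <= 32);
    uint64_t v = 0;
    for (int i = 0; i < n; i++)
        v += (uint64_t)((bits >> i) & 1u) << i;
    return v;
}

/* Two's-complement value: the msb weighs -2^(n-1) instead of +2^(n-1). */
int64_t twos_value(uint32_t bits, int n)
{
    int64_t v = (int64_t)unsigned_value(bits, n);
    if ((bits >> (n - 1)) & 1u)
        v -= (int64_t)1 << n;   /* -2^(n-1) = +2^(n-1) - 2^n */
    return v;
}
```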
Review: 2s-Complement Signed Integers
• Bit 31 is the sign bit
  – 1 for negative numbers
  – 0 for non-negative numbers
• $-(-2^{n-1})$ can't be represented
• Non-negative numbers have the same unsigned and 2s-complement representation
• Some specific numbers
  – 0: 0000 0000 … 0000
  – –1: 1111 1111 … 1111
  – Most-negative: 1000 0000 … 0000
  – Most-positive: 0111 1111 … 1111
Review: Signed Negation
• Complement and add 1
  – Complement means 1 → 0, 0 → 1
  $x + \bar{x} = 1111\ldots111_2 = -1 \;\Rightarrow\; \bar{x} + 1 = -x$
• Example: negate +2
  +2 = 0000 0000 … 0010_2
  –2 = 1111 1111 … 1101_2 + 1 = 1111 1111 … 1110_2
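In C, the rule "complement and add 1" is a one-liner on unsigned 32-bit values, whose arithmetic wraps modulo $2^{32}$ just like the hardware. A sketch (function name hypothetical); note that negating the most-negative number returns itself, which is exactly why $-(-2^{n-1})$ cannot be represented.

```c
#include <assert.h>
#include <stdint.h>

/* Two's-complement negation: complement every bit, then add 1 (mod 2^32). */
uint32_t negate(uint32_t x)
{
    return ~x + 1u;
}
```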
Sign Extension
• Representing a number using more bits
  – Preserve the numeric value
• In MIPS instruction set
  – addi: extend immediate value
  – lb, lh: extend loaded byte/halfword (will discuss later)
  – beq, bne: extend the displacement (will discuss later)
• Replicate the sign bit to the left
  – c.f. unsigned values: extend with 0s
• Examples: 8-bit to 16-bit
  – +2: 0000 0010 => 0000 0000 0000 0010
  – –2: 1111 1110 => 1111 1111 1111 1110
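A sketch of sign extension and zero extension to 32 bits (function names hypothetical). The expression `(low ^ sign) - sign` is a standard branch-free way to replicate the field's top bit: flipping the sign bit and then subtracting its weight gives the field's two's-complement value, which unsigned wraparound stores as a correctly sign-extended 32-bit pattern.

```c
#include <assert.h>
#include <stdint.h>

/* Replicate bit (from_bits - 1) into all higher bit positions. */
uint32_t sign_extend(uint32_t value, int from_bits)
{
    assert(from_bits >= 1 && from_bits < 32);
    uint32_t low = value & ((1u << from_bits) - 1u);   /* keep only the field */
    uint32_t sign = 1u << (from_bits - 1);             /* weight of the field's msb */
    return (low ^ sign) - sign;
}

/* Unsigned values are extended with 0s instead. */
uint32_t zero_extend(uint32_t value, int from_bits)
{
    assert(from_bits >= 1 && from_bits < 32);
    return value & ((1u << from_bits) - 1u);
}
```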
MIPS Memory Access Instructions
• MIPS has two basic data transfer instructions for accessing memory
    lw $t0, 4($s3)    #load word from memory
    sw $t0, 8($s3)    #store word to memory
• The data is loaded into (lw) or stored from (sw) a register in the register file
• The memory address (a 32-bit address) is formed by adding the contents of the base address register to the offset value
  – The offset is a 16-bit signed field, meaning access is limited to memory locations within a region of ±2^13 or 8,192 words (±2^15 or 32,768 bytes) of the address in the base register
Machine Language – Load Instruction
• Load/Store Instruction Format (I format):
    lw $t0, 24($s3)
    35 | 19 | 8 | 24_10
• Address computation, with $s3 = 0x12004094:
    24_10 + $s3 = . . . 0001 1000 + . . . 1001 0100 = . . . 1010 1100 = 0x120040ac
[Figure: memory with word addresses (hex) 0x00000000, 0x00000004, 0x00000008, 0x0000000c, …, 0xffffffff; the data word at 0x120040ac is loaded into $t0]
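The address computation above, as a C sketch (function name hypothetical): the 16-bit offset field is sign-extended and added to the base register, with the sum wrapping modulo $2^{32}$ like the hardware adder.

```c
#include <assert.h>
#include <stdint.h>

/* lw/sw effective address: base register + sign-extended 16-bit offset field. */
uint32_t effective_address(uint32_t base, uint16_t offset_field)
{
    uint32_t offset = offset_field;          /* zero-extended so far */
    if (offset & 0x8000u)
        offset |= 0xFFFF0000u;               /* replicate the sign bit */
    return base + offset;                    /* wraps modulo 2^32 */
}
```

A negative offset such as 0xFFFC (that is, -4) reaches the word just below the base address.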
Byte Addresses
• Most architectures address individual bytes in memory
  – Alignment restriction: the memory address of a word must be on natural word boundaries (a multiple of 4 in MIPS-32)
• Big Endian: leftmost byte is word address
  IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA
• Little Endian: rightmost byte is word address
  Intel 80x86, DEC Vax, DEC Alpha (Windows NT)
[Figure: a word drawn msb to lsb; little endian numbers its bytes 3 2 1 0, big endian numbers them 0 1 2 3; byte 0 is at the word address]
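A sketch of both ideas in C (function names hypothetical): the alignment test checks that the two low address bits are 0, and `store_word` lays out one 32-bit word in either byte order so the difference between big and little endian is visible byte by byte.

```c
#include <assert.h>
#include <stdint.h>

/* A word address must be a multiple of 4: its two low bits are 0. */
int is_word_aligned(uint32_t addr) { return (addr & 3u) == 0; }

/* Store a 32-bit word into 4 consecutive bytes in the chosen byte order. */
void store_word(uint8_t mem[4], uint32_t word, int big_endian)
{
    for (int i = 0; i < 4; i++) {
        /* Big endian puts the msb at the word address; little endian puts the lsb there. */
        int shift = big_endian ? 8 * (3 - i) : 8 * i;
        mem[i] = (uint8_t)(word >> shift);
    }
}
```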
Aside: Loading and Storing Bytes
• MIPS provides special instructions to move bytes
    lb $t0, 1($s3)    #load byte from memory
    sb $t0, 6($s3)    #store byte to memory
• Instruction Format (I format), for sb:
    0x28 | 19 | 8 | 16-bit offset
• What 8 bits get loaded and stored?
  – load byte places the byte from memory in the rightmost 8 bits of the destination register
    ‐ what happens to the other bits in the register?
  – store byte takes the byte from the rightmost 8 bits of a register and writes it to a byte in memory
    ‐ what happens to the other bits in the memory word?
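One way to answer these questions, consistent with the instruction table later in these slides (lb sign-extends, lbu zero-extends): a C sketch with memory modeled as a flat byte array. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* lb: load one byte and sign-extend it to fill the 32-bit register. */
uint32_t load_byte(const uint8_t *mem, uint32_t addr)
{
    uint32_t b = mem[addr];
    return (b & 0x80u) ? (b | 0xFFFFFF00u) : b;
}

/* lbu: load one byte and zero-extend it. */
uint32_t load_byte_unsigned(const uint8_t *mem, uint32_t addr)
{
    return mem[addr];
}

/* sb: write only the rightmost 8 bits of the register; neighboring bytes are untouched. */
void store_byte(uint8_t *mem, uint32_t addr, uint32_t reg)
{
    mem[addr] = (uint8_t)(reg & 0xFFu);
}
```

So the upper 24 register bits become copies of the byte's msb (lb) or zeros (lbu), and the other three bytes of the memory word keep their old values after sb.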
Aside: How About Larger Constants?
• Small (<16 bit) constants can be loaded with the addi instruction
• We'd also like to be able to load a 32-bit constant into a register; for this we must use two instructions
• A new "load upper immediate" instruction
    lui $t0, 1010101010101010
    15 | 0 | 8 | 1010101010101010_2      (lui opcode = 0x0F)
  $t0 = 1010101010101010 0000000000000000
• Then we must get the lower-order bits right; use
    ori $t0, $t0, 1010101010101010
      1010101010101010 0000000000000000
    | 0000000000000000 1010101010101010
    = 1010101010101010 1010101010101010
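The two-instruction sequence as a C sketch (function names hypothetical). ori is used for the lower half rather than addi because ori zero-extends its immediate; addi would sign-extend a lower half such as 0xAAAA and disturb the upper half.

```c
#include <assert.h>
#include <stdint.h>

/* lui: place the 16-bit immediate in the upper half and clear the lower half. */
uint32_t lui_op(uint16_t imm) { return (uint32_t)imm << 16; }

/* ori: OR in the zero-extended 16-bit immediate. */
uint32_t ori_op(uint32_t rs, uint16_t imm) { return rs | (uint32_t)imm; }

/* Build an arbitrary 32-bit constant with the lui + ori pair. */
uint32_t load_constant(uint32_t value)
{
    uint32_t r = lui_op((uint16_t)(value >> 16));      /* upper 16 bits */
    return ori_op(r, (uint16_t)(value & 0xFFFFu));     /* lower 16 bits */
}
```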
MIPS Control Flow Instructions
• MIPS conditional branch instructions:
    bne $s0, $s1, Lbl    #go to Lbl if $s0≠$s1
    beq $s0, $s1, Lbl    #go to Lbl if $s0=$s1
  – Ex: if (i==j) h = i + j;
          bne $s0, $s1, Lbl1
          add $s3, $s0, $s1
    Lbl1: ...
• Instruction Format (I format), for bne:
    0x05 | 16 | 17 | 16-bit offset
• How is the branch destination address specified?
Specifying Branch Destinations
• Use a register (like in lw and sw) added to the 16-bit offset
  – which register? Instruction Address Register (the PC)
    • its use is automatically implied by the instruction
    • the PC has already been updated (PC+4), so it holds the address of the next instruction
  – limits the branch distance to –2^15 to +2^15–1 (word) instructions from the (instruction after the) branch instruction, but most branches are local anyway
[Figure: the low-order 16 bits of the branch instruction (offset) are sign-extended to 32 bits, shifted left by 2 (appending 00), and added to PC + 4 to form the branch destination address]
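The branch-target computation in the figure, as a C sketch (function name hypothetical). The offset counts words, so it is scaled by 4 after sign extension and added to the already-incremented PC.

```c
#include <assert.h>
#include <stdint.h>

/* beq/bne target: (PC + 4) + (sign-extended 16-bit offset << 2). */
uint32_t branch_target(uint32_t pc, uint16_t offset_field)
{
    uint32_t offset = (offset_field & 0x8000u) ? (0xFFFF0000u | offset_field)
                                               : offset_field;
    return pc + 4u + (offset << 2);   /* wraps modulo 2^32 for negative offsets */
}
```

For example, a branch at address 80012 with offset 2 lands at 80016 + 8 = 80024, and an offset of -1 (0xFFFF) branches back to the branch itself.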
In Support of Branch Instructions
• We have beq, bne, but what about other kinds of branches (e.g., branch-if-less-than)? For this, we need yet another instruction, slt
• Set on less than instruction:
    slt $t0, $s0, $s1    # if $s0 < $s1 then
                         # $t0 = 1 else
                         # $t0 = 0
• Instruction format (R format):
    0 | 16 | 17 | 8 | 0 | 0x2a
• Alternate versions of slt
    slti  $t0, $s0, 25     # if $s0 < 25 then $t0=1 ...
    sltu  $t0, $s0, $s1    # if $s0 < $s1 then $t0=1 ...
    sltiu $t0, $s0, 25     # if $s0 < 25 then $t0=1 ...
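The difference between slt and sltu is only how the same 32-bit patterns are interpreted, as this C sketch shows (function names hypothetical): 0xFFFFFFFF is -1 to slt but 4,294,967,295 to sltu.

```c
#include <assert.h>
#include <stdint.h>

/* slt: compare the register values as two's-complement numbers. */
uint32_t slt(uint32_t rs, uint32_t rt)
{
    int64_t a = (rs & 0x80000000u) ? (int64_t)rs - 4294967296LL : (int64_t)rs;
    int64_t b = (rt & 0x80000000u) ? (int64_t)rt - 4294967296LL : (int64_t)rt;
    return a < b ? 1u : 0u;
}

/* sltu: compare the same bit patterns as unsigned numbers. */
uint32_t sltu(uint32_t rs, uint32_t rt)
{
    return rs < rt ? 1u : 0u;
}
```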
Aside: More Branch Instructions
• Can use slt, beq, bne, and the fixed value of 0 in register $zero to create other conditions
  – less than:                 blt $s1, $s2, Label
        slt $at, $s1, $s2        #$at set to 1 if
        bne $at, $zero, Label    #$s1 < $s2
  – less than or equal to:     ble $s1, $s2, Label
  – greater than:              bgt $s1, $s2, Label
  – greater than or equal to:  bge $s1, $s2, Label
• Such branches are included in the instruction set as pseudo instructions: recognized (and expanded) by the assembler
  – That is why the assembler needs a reserved register ($at)
Other Control Flow Instructions
• MIPS also has an unconditional branch instruction or jump instruction:
    j label    #go to label
• Instruction Format (J Format):
    0x02 | 26-bit address
[Figure: the low-order 26 bits of the jump instruction are shifted left by 2 (appending 00) and combined with the upper 4 bits of the PC to form the 32-bit jump address]
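The jump-address formation as a C sketch (function name hypothetical). It takes the upper 4 bits from PC + 4, as in the single-cycle datapath where the jump address is formed after the PC + 4 adder; this differs from using the PC itself only for a jump in the last word of a 256 MB region.

```c
#include <assert.h>
#include <stdint.h>

/* j target: upper 4 bits of (PC + 4), then the 26-bit field shifted left by 2. */
uint32_t jump_target(uint32_t pc, uint32_t target_field)
{
    assert(target_field < (1u << 26));
    return ((pc + 4u) & 0xF0000000u) | (target_field << 2);
}
```

Because the upper 4 bits are inherited, j can only reach targets within the same 256 MB region as the jump.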
Aside: Branching Far Away
• What if the branch destination is further away than can be captured in 16 bits?
  – The assembler comes to the rescue: it inserts an unconditional jump to the branch target and inverts the condition
        beq $s0, $s1, L1
    becomes
        bne $s0, $s1, L2
        j   L1
    L2:
Another Example: If Statements
• C code:
    if (i==j) f = g+h;
    else f = g-h;
  – f, g, … in $s0, $s1, …
• Compiled MIPS code:
          bne $s3, $s4, Else
          add $s0, $s1, $s2
          j   Exit
    Else: sub $s0, $s1, $s2
    Exit: …
• Assembler calculates addresses
Another Way of Describing What Instructions Do
Assembly instruction     What it does (Verilog-like format)
add Rd, Rs, Rt           RF[Rd] = RF[Rs] + RF[Rt]
addi Rt, Rs, Imm         RF[Rt] = RF[Rs] + se Imm
and Rd, Rs, Rt           RF[Rd] = RF[Rs] AND RF[Rt]
andi Rt, Rs, Imm         RF[Rt] = RF[Rs] AND ze Imm
sll Rd, Rt, sa           RF[Rd] = RF[Rt] << sa
sllv Rd, Rt, Rs          RF[Rd] = RF[Rt] << RF[Rs]
sra Rd, Rt, sa           RF[Rd] = RF[Rt] >> sa (padding with msb)
srl Rd, Rt, sa           RF[Rd] = RF[Rt] >> sa (padding with 0)
lb Rt, offset(Rs)        RF[Rt] = se (Mem[RF[Rs] + se Offset])
lbu Rt, offset(Rs)       RF[Rt] = ze (Mem[RF[Rs] + se Offset])
lui Rt, Imm              RF[Rt] = Imm << 16 | 0x0000
lw Rt, offset(Rs)        RF[Rt] = Mem[RF[Rs] + se Offset]
sw Rt, offset(Rs)        Mem[RF[Rs] + se Offset] = RF[Rt]
beq Rs, Rt, Label        If (RF[Rs] == RF[Rt]) then PC = PC + 4 + se (Imm << 2)
j Label                  PC = PC(31:28) || (Imm << 2)
slti Rt, Rs, Imm         If (RF[Rs] < se Imm) then RF[Rt] = 1 else RF[Rt] = 0
(se = sign-extend, ze = zero-extend, || = concatenation)
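MIPS Addressing Modes
(Addressing | Instruction | Other elements involved | Operand)
• Implied: the operand is some place in the machine
• Immediate: the operand is a constant in the instruction (extended, if required)
• Register: a register specifier in the instruction selects the register file entry; the operand is the register data
• Base: a base register from the register file is added to a constant offset in the instruction to form the memory address; the operand is the memory data
• PC-relative: the (incremented) PC is added to a constant offset in the instruction to form the memory address
• Pseudodirect: the memory address is formed from the PC and the address field of the instruction
Schematic representation of addressing modes in MIPS.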
More Elaborate Addressing Modes
(Addressing | Instruction | Other elements involved | Operand)
• Indexed: a base register and an index register from the register file are added to form the memory address
    x := B[i]
• Update (with base): the base register gives the memory address and is then incremented by an increment amount
    x := Mem[p]; p := p + 1
• Update (with indexed): base register + index register give the memory address, and the index is then incremented
    x := B[i]; i := i + 1
• Indirect: the first memory access returns the address for a second memory access, which returns the operand (the first address may be replaced with any other form of address specification)
    t := Mem[p]; x := Mem[t], i.e., x := Mem[Mem[p]]
Schematic representation of more elaborate addressing modes not supported in MIPS.
C to Assembly for Loop Statements
• C code:
    while (save[i] == k) i += 1;
  – i in $s3, k in $s5, address of save in $s6
• Compiled MIPS code:
    Loop: sll  $t1, $s3, 2      #$t1 = i*4
          add  $t1, $t1, $s6
          lw   $t0, 0($t1)
          bne  $t0, $s5, Exit
          addi $s3, $s3, 1
          j    Loop
    Exit: …
‐ There are multiple ways of translating C code to assembly!
‐ The fewer instructions executed, the faster the execution time (neglecting other complications like the effect on CPI)!
Assembly to Binary for Loop Example
• Loop code from earlier example
  – Assume Loop at location 80000
                                 Address   Fields
    Loop: sll  $t1, $s3, 2       80000     0  | 0  | 19 | 9  | 4 | 0
          add  $t1, $t1, $s6     80004     0  | 9  | 22 | 9  | 0 | 32
          lw   $t0, 0($t1)       80008     35 | 9  | 8  | 0
          bne  $t0, $s5, Exit    80012     5  | 8  | 21 | 2
          addi $s3, $s3, 1       80016     8  | 19 | 19 | 1
          j    Loop              80020     2  | 20000
    Exit: …                      80024
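As a cross-check of the table, a C sketch that recomputes the two address-dependent fields from the instruction addresses, assuming (as the slide does) that Loop starts at 80000. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Branch offset field: distance in words from the instruction after the branch. */
int32_t branch_offset(uint32_t branch_addr, uint32_t target_addr)
{
    int64_t diff = (int64_t)target_addr - ((int64_t)branch_addr + 4);
    assert(diff % 4 == 0);
    return (int32_t)(diff / 4);
}

/* Jump target field: the word address (byte address / 4), low 26 bits. */
uint32_t jump_field(uint32_t target_addr)
{
    assert(target_addr % 4 == 0);
    return (target_addr >> 2) & 0x03FFFFFFu;
}
```

bne at 80012 targeting Exit at 80024 gives (80024 - 80016) / 4 = 2, and j Loop gives 80000 / 4 = 20000, matching the fields above.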
More Examples: Okay For-Loop Code
C code: for(i = 1; i <= 10; i++) A[i] = A[i] + 1;
Okay assembly code: assume i: $s1, base of A: $s2
          addi $s1, $0, 1        # i = 1
    LOOP: slti $t0, $s1, 11      # LOOP: if(i < 11) $t0 = 1 else $t0 = 0
          beq  $t0, $0, END      # if ($t0==0) goto END; else {
          sll  $t0, $s1, 2       #   $t0 = i * 4;
          add  $t0, $s2, $t0     #   $t0 = $t0 + addr A;
          lw   $t1, 0($t0)       #   $t1 = A[i];
          addi $t1, $t1, 1       #   $t1 = $t1 + 1;
          sw   $t1, 0($t0)       #   A[i] = $t1;
          addi $s1, $s1, 1       #   i = i + 1;
          j    LOOP              #   goto LOOP }
    END:                         # END:
2 control flow instructions + 7 other instructions in the loop
More Examples: Better For-Loop Code
C code: for(i = 1; i <= 10; i++) A[i] = A[i] + 1;
Better assembly code: assume i: $s1, base of A: $s2
          addi $s1, $0, 1        # i = 1;
          addi $t0, $s2, 4       # $t0 = addr A + 4; (* pointer to A[1] *)
          addi $t2, $0, 11       # $t2 = 11;
    LOOP: lw   $t1, 0($t0)       # do { $t1 = A[i];
          addi $t1, $t1, 1       #   $t1 = $t1 + 1;
          sw   $t1, 0($t0)       #   A[i] = $t1;
          addi $s1, $s1, 1       #   i = i + 1;
          addi $t0, $t0, 4       #   $t0 = $t0 + 4; }
          bne  $s1, $t2, LOOP    # while (i != 11);
1 control flow instruction + 5 other instructions in the loop
More Examples: Even Better For-Loop Code
C code: for(i = 1; i <= 10; i++) A[i] = A[i] + 1;
Even better assembly code: assume i: $s1, base of A: $s2
          addi $t0, $s2, 4       # $t0 = addr A + 4; (* pointer to A[1] *)
          addi $t2, $t0, 40      # $t2 = $t0 + 40; (* pointer to A[11] *)
    LOOP: lw   $t1, 0($t0)       # do { $t1 = A[i];
          addi $t1, $t1, 1       #   $t1 = $t1 + 1;
          sw   $t1, 0($t0)       #   A[i] = $t1;
          addi $t0, $t0, 4       #   $t0 = $t0 + 4; }
          bne  $t0, $t2, LOOP    # while ($t2 != $t0);
          addi $s1, $0, 11       # i = 11
(Note that in this case the variable i is not used at all. The last line only makes the assembly code functionally equivalent to the C code, since in C the variable i will be equal to 11 after the completion of the loop.)
1 control flow instruction + 4 other instructions in the loop
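For comparison, the "even better" version corresponds roughly to this pointer-based C code, where the loop is controlled by comparing pointers instead of an index. A sketch; the function name is hypothetical and it assumes A has at least 11 elements.

```c
#include <assert.h>

/* Adds 1 to A[1]..A[10] by walking a pointer, mirroring the "even better" assembly. */
int increment_1_to_10(int *A)
{
    assert(A);
    int *p = A + 1;          /* $t0 = addr A + 4  (pointer to A[1])  */
    int *end = p + 10;       /* $t2 = $t0 + 40    (pointer to A[11]) */
    do {
        *p = *p + 1;         /* lw, addi, sw */
        p++;                 /* addi $t0, $t0, 4 */
    } while (p != end);      /* bne $t0, $t2, LOOP */
    return 11;               /* value i would hold after the C for-loop */
}
```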
Processor Datapath and Control
• Our implementation of the MIPS is simplified
  – memory-reference instructions: lw, sw
  – arithmetic-logical instructions: add, sub, and, or, slt
  – control flow instructions: beq, j
• Generic implementation
  – use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)   [Fetch: PC = PC+4]
  – decode the instruction (and read registers)   [Decode]
  – execute the instruction   [Exec]
• All instructions (except j) use the ALU after reading the registers
  – How? memory-reference? arithmetic? control flow?
CSE431 Chapter 4A.61 Irwin, PSU, 2008
Review: Clocking Methodology
• The clocking methodology defines when data in a state element is valid and stable relative to the clock
  – State elements: a memory element such as a register
  – Edge-triggered: all state changes occur on a clock edge
• Typical execution
  – read contents of state elements -> send values through combinational logic -> write results to one or more state elements
[Figure: State element 1 -> Combinational logic -> State element 2, all within one clock cycle]
• Assumes state elements are written on every clock cycle; if not, need an explicit write control signal
  – write occurs only when both the write control is asserted and the clock edge occurs
Fetching Instructions
• Fetching instructions involves
  – reading the instruction from the Instruction Memory
  – updating the PC value to be the address of the next (sequential) instruction
[Figure: the PC drives the Read Address of the Instruction Memory, which outputs the Instruction; an adder computes PC + 4, which is written back to the PC on the clock edge]
• PC is updated every clock cycle, so it does not need an explicit write control signal, just a clock signal
• Reading from the Instruction Memory is a combinational activity, so it doesn't need an explicit read control signal
Decoding Instructions
• Decoding instructions involves
  – sending the fetched instruction's opcode and function field bits to the control unit
  – reading two values from the Register File
    ‐ Register File addresses are contained in the instruction
[Figure: the Instruction feeds the Control Unit and the Register File inputs Read Addr 1, Read Addr 2, Write Addr, and Write Data; the Register File outputs Read Data 1 and Read Data 2]
Executing R Format Instructions
• R format operations (add, sub, slt, and, or)
    R-type: op (31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
  – perform the operation (op and funct) on the values in rs and rt
  – store the result back into the Register File (into location rd)
[Figure: Register File Read Data 1 and Read Data 2 feed the ALU (set by ALU control, with overflow and zero outputs); the ALU result returns to the Register File Write Data, written when RegWrite is asserted]
• Note that the Register File is not written every cycle (e.g., sw), so we need an explicit write control signal for the Register File
Executing Load and Store Instructions
• Load and store operations involve
  – computing the memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction
  – store: the value (read from the Register File during decode) is written to the Data Memory
  – load: the value, read from the Data Memory, is written to the Register File
[Figure: the ALU adds Read Data 1 and the sign-extended (16 -> 32 bits) offset to form the Data Memory Address; Read Data 2 feeds the Data Memory Write Data; control signals RegWrite, ALU control, MemWrite, MemRead]
Executing Branch Instructions
• Branch operations involve
  – comparing the operands read from the Register File during decode for equality (zero ALU output)
  – computing the branch target address by adding the updated PC to the 16-bit sign-extended offset field in the instruction, shifted left by 2
[Figure: the sign-extended offset is shifted left 2 and added to PC + 4 to form the branch target address; the ALU compares Read Data 1 and Read Data 2 and sends its zero output to the branch control logic]
Executing Jump Instructions
• Jump operation involves
  – replacing the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
[Figure: the 26-bit field from the Instruction Memory is shifted left 2 to 28 bits and combined with the upper 4 bits of PC + 4 to form the jump address]
Creating a Single Datapath from the Parts
• Assemble the datapath segments and add control lines and multiplexors as needed
• Single cycle design: fetch, decode and execute each instruction in one clock cycle
  – no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., several adders)
  – multiplexors needed at the input of shared elements with control lines to do the selection
  – write signals to control writing to the Register File and Data Memory
• Cycle time is determined by the length of the longest path
Fetch, R, and Memory Access Portions
[Figure: combined datapath with PC, Add 4, Instruction Memory, Register File, Sign Extend 16 to 32, ALU (ALU control, ovf, zero), and Data Memory; control signals RegWrite, ALUSrc, MemWrite, MemRead, MemtoReg]
Adding Control
• Selecting the operations to perform (ALU, Register File and Memory read/write)
• Controlling the flow of data (multiplexor inputs)
Instruction formats (bit positions):
R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
I-type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
J-type: op [31-26] | target address [25-0]
Observations
• op field always in bits 31-26
• addresses of the registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16); for lw and sw, rs is the base register
• address of the register to be written is in one of two places: in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
• offset for beq, lw, and sw always in bits 15-0
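The field positions above translate directly into shifts and masks. This hypothetical Python decoder extracts every field regardless of format. That mirrors the observations: because the fields sit at fixed positions, the hardware can route them to the Register File and sign-extend unit before it knows the instruction type. For reference, `add $t0, $t1, $t2` encodes as `0x012A4020` and `lw $t0, 4($t1)` as `0x8D280004`.

```python
def decode(instr: int) -> dict:
    """Slice every MIPS field out of a 32-bit instruction word.

    All fields are extracted in parallel; which ones are meaningful
    depends on the format implied by the op field.
    """
    return {
        "op": (instr >> 26) & 0x3F,      # bits 31-26 (all formats)
        "rs": (instr >> 21) & 0x1F,      # bits 25-21 (R, I)
        "rt": (instr >> 16) & 0x1F,      # bits 20-16 (R, I)
        "rd": (instr >> 11) & 0x1F,      # bits 15-11 (R)
        "shamt": (instr >> 6) & 0x1F,    # bits 10-6  (R)
        "funct": instr & 0x3F,           # bits 5-0   (R)
        "offset": instr & 0xFFFF,        # bits 15-0  (I)
        "target": instr & 0x03FFFFFF,    # bits 25-0  (J)
    }
```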
Single Cycle Datapath & Control
[Figure: complete single-cycle datapath with a Control Unit driven by Instr[31-26] (outputs RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite), an ALU control block driven by ALUOp and Instr[5-0], a RegDst mux choosing Instr[20-16] or Instr[15-11] as the write address, Sign Extend of Instr[15-0] from 16 to 32 bits, and a PCSrc mux choosing PC+4 or the branch target]
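The Control Unit in this datapath is a combinational function of Instr[31-26]. Below is a sketch of its truth table. It uses the standard MIPS opcodes and the commonly published control values for this single-cycle design. These values are not listed on the slides. `None` marks a don't-care. The `Jump` column applies only once the datapath is extended to support jumps.

```python
from typing import Dict, Optional

OP_RTYPE, OP_LW, OP_SW, OP_BEQ, OP_J = 0x00, 0x23, 0x2B, 0x04, 0x02
X = None  # don't-care

SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
           "MemWrite", "Branch", "Jump", "ALUOp")

_ROWS = {
    #          RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch Jump ALUOp
    OP_RTYPE: (1,     0,     0,       1,       0,      0,       0,     0,   0b10),
    OP_LW:    (0,     1,     1,       1,       1,      0,       0,     0,   0b00),
    OP_SW:    (X,     1,     X,       0,       0,      1,       0,     0,   0b00),
    OP_BEQ:   (X,     0,     X,       0,       0,      0,       1,     0,   0b01),
    OP_J:     (X,     X,     X,       0,       0,      0,       0,     1,   X),
}


def main_control(op: int) -> Dict[str, Optional[int]]:
    """Map the 6-bit op field (Instr[31-26]) to the datapath control signals."""
    if op not in _ROWS:
        raise ValueError(f"unsupported opcode {op:#04x}")
    return dict(zip(SIGNALS, _ROWS[op]))
```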
R-Type Instruction Datapath & Control Flow
[Figure: the single-cycle datapath & control, showing the data and control flow for an R-type instruction]
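For an R-type instruction, the ALU control block combines ALUOp with the funct field (Instr[5-0]). A minimal sketch follows. It assumes the common 4-bit ALU control encoding and a small set of funct codes (add, sub, and, or, slt). Those encodings and all names below are assumptions and do not come from these slides.

```python
MASK32 = 0xFFFFFFFF

# Assumed 4-bit ALU control encodings.
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# funct field (Instr[5-0]) -> ALU control; consulted only when ALUOp == 0b10.
FUNCT_TO_ALU = {0x20: ALU_ADD, 0x22: ALU_SUB, 0x24: ALU_AND, 0x25: ALU_OR, 0x2A: ALU_SLT}


def alu_control(alu_op: int, funct: int) -> int:
    if alu_op == 0b00:  # lw/sw: add to form the address
        return ALU_ADD
    if alu_op == 0b01:  # beq: subtract for the equality test
        return ALU_SUB
    if alu_op == 0b10 and funct in FUNCT_TO_ALU:  # R-type: decode funct
        return FUNCT_TO_ALU[funct]
    raise ValueError(f"unsupported ALUOp/funct: {alu_op:#b}/{funct:#x}")


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x


def alu(ctrl: int, a: int, b: int) -> tuple:
    """Return (32-bit result, zero flag)."""
    if ctrl == ALU_AND:
        result = a & b
    elif ctrl == ALU_OR:
        result = a | b
    elif ctrl == ALU_ADD:
        result = a + b
    elif ctrl == ALU_SUB:
        result = a - b
    elif ctrl == ALU_SLT:
        result = int(to_signed(a) < to_signed(b))
    else:
        raise ValueError(f"unsupported ALU control {ctrl:#06b}")
    result &= MASK32
    return result, result == 0
```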
Load Word Instruction Datapath & Control Flow
[Figure: the single-cycle datapath & control, showing the data and control flow for a load word (lw) instruction]
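For `lw`, the two write-back multiplexors select rt (Instr[20-16]) as the destination register and the Data Memory output as the write data. The sketch below models those muxes using the control signal names from the datapath. The `writeback` helper and its `$0` check are illustrative additions. In MIPS, register `$0` always reads as zero.

```python
def reg_dst_mux(reg_dst: int, rt: int, rd: int) -> int:
    """RegDst = 0 selects rt (Instr[20-16]); 1 selects rd (Instr[15-11])."""
    return rd if reg_dst else rt


def memto_reg_mux(memto_reg: int, alu_result: int, mem_data: int) -> int:
    """MemtoReg = 0 selects the ALU result; 1 selects the Data Memory read data."""
    return mem_data if memto_reg else alu_result


def writeback(regs: list, reg_write: int, reg_dst: int, memto_reg: int,
              rt: int, rd: int, alu_result: int, mem_data: int) -> None:
    """Write the selected value to the selected register when RegWrite is asserted."""
    dest = reg_dst_mux(reg_dst, rt, rd)
    if reg_write and dest != 0:  # $0 is hard-wired to zero
        regs[dest] = memto_reg_mux(memto_reg, alu_result, mem_data)
```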
Branch Instruction Datapath & Control Flow
[Figure: the single-cycle datapath & control, showing the data and control flow for a branch (beq) instruction]
Adding Jump Instruction
[Figure: single-cycle datapath & control extended with a Jump control signal: Instr[25-0] is shifted left 2 (26 to 28 bits) and combined with PC+4[31-28] to form the 32-bit jump address, which an added mux selects in place of the branch/PC+4 result]
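Once jump is added, two multiplexors in series choose the next PC. PCSrc, which is Branch AND the ALU zero output, picks either PC + 4 or the branch target. The Jump signal then picks either that result or the jump address. The sketch below models only this selection logic and assumes the three candidate addresses are computed elsewhere. The function and parameter names are illustrative.

```python
def next_pc(pc_plus_4: int, branch_target: int, jump_address: int,
            branch: int, zero: int, jump: int) -> int:
    """Select the next PC through the PCSrc mux followed by the Jump mux."""
    pc_src = branch & zero  # taken only for a branch whose operands were equal
    after_branch_mux = branch_target if pc_src else pc_plus_4
    return jump_address if jump else after_branch_mux
```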
Instruction Critical Paths
What is the clock cycle time, assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times, except:
• Instruction and Data Memory (200 ps)
• ALU and adders (200 ps)
• Register File access (reads or writes) (100 ps)
Instr.   I Mem  Reg Rd  ALU Op  D Mem  Reg Wr  Total
R-type   200    100     200     -      100     600
load     200    100     200     200    100     800
store    200    100     200     200    -       700
beq      200    100     200     -      -       500
jump     200    -       -       -      -       200
(all delays in ps)
The clock cycle must fit the slowest instruction, so the cycle time is 800 ps (load).
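The table can be reproduced by summing the delays of the units each instruction class uses in series and then taking the maximum over all classes. This minimal sketch uses the component delays given on the slide and treats every other delay as zero. The unit and class labels are illustrative.

```python
# Component delays from the slide (ps); muxes, control, sign extend, PC access,
# shifters, wires, and setup/hold are treated as zero.
DELAY_PS = {"IMem": 200, "RegRd": 100, "ALU": 200, "DMem": 200, "RegWr": 100}

# Units each instruction class uses in series.
PATHS = {
    "R-type": ["IMem", "RegRd", "ALU", "RegWr"],
    "load":   ["IMem", "RegRd", "ALU", "DMem", "RegWr"],
    "store":  ["IMem", "RegRd", "ALU", "DMem"],
    "beq":    ["IMem", "RegRd", "ALU"],
    "jump":   ["IMem"],
}


def path_delay_ps(instr_class: str) -> int:
    """Total delay along one instruction class's critical path."""
    return sum(DELAY_PS[unit] for unit in PATHS[instr_class])


def single_cycle_clock_ps() -> int:
    """A single-cycle clock must fit the slowest instruction's path."""
    return max(path_delay_ps(c) for c in PATHS)


for c in PATHS:
    print(f"{c:7s} {path_delay_ps(c)} ps")
print("clock cycle:", single_cycle_clock_ps(), "ps")
```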
Single Cycle Implementation Cons and Pros
• Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
– especially problematic for more complex instructions like floating point multiply
[Figure: clock timing for lw (Cycle 1) and sw (Cycle 2); the unused part of the sw cycle is labeled Waste]
• May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
• Is simple and easy to understand
Multi‐Cycle Datapath
Main Idea: Break the execution of an instruction into smaller steps (cycles) and let each instruction execute in a variable number of cycles
• Note that the CCT is smaller, but the CPI is now larger compared to the single cycle design
• Instruction execution time is no longer defined by the slowest instruction
Issues with multicycle datapath
• Each cycle should do an equal amount of work. Typical steps are IF: Instruction Fetch, ID: Instruction Decode and Register Fetch, EXE: Execution, MEM: Memory Transfer, WB: Write Back to Registers
• More complicated control
[Figure: timing comparison of Instr 1 to Instr 4. Single cycle: each instruction is allotted one long clock period regardless of the time it needs. Multicycle: the instructions take 3, 3, 4, and 5 cycles, and the difference is labeled time saved]
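The trade-off can be checked on the four instructions in the timing comparison using Performance = 1 / (CCT × IC × CPI). This sketch assumes the 800 ps single-cycle clock from the critical-path table and a hypothetical 200 ps multicycle clock. The 200 ps figure assumes one major unit per step and ignores the overhead of the extra registers.

```python
SINGLE_CYCLE_CCT_PS = 800  # slowest instruction (load) from the critical-path table
MULTI_CYCLE_CCT_PS = 200   # assumption: longest single step (memory or ALU)


def exec_time_ps(ic: int, cpi: float, cct_ps: int) -> float:
    """Execution time = IC x CPI x CCT."""
    return ic * cpi * cct_ps


cycles = [3, 3, 4, 5]            # per-instruction cycle counts from the figure
ic = len(cycles)
multi_cpi = sum(cycles) / ic     # average CPI of this mix

single = exec_time_ps(ic, 1, SINGLE_CYCLE_CCT_PS)
multi = exec_time_ps(ic, multi_cpi, MULTI_CYCLE_CCT_PS)
print(f"single cycle: {single:.0f} ps, multicycle: {multi:.0f} ps")
```

Here the saving comes entirely from not stretching short instructions to 800 ps. With a different instruction mix, or with slower steps, the comparison could reverse.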
Multi Cycle Implementation
• FSM in control to implement execution in a variable number of cycles
• Option #1: Use the same datapath
• Option #2: Reuse resources, i.e., one ALU for branch, PC increment, and the execution stage (also one memory in the next figure)
– Cons and pros: smaller area, so it could be faster, but extra registers are needed to keep intermediate values
Option #2: Datapath and FSM Example
Note the extra registers needed to reuse one ALU and memory. For example, an R‐type instruction uses the same ALU in the EXE stage and in the WB stage, where it calculates PC + 4
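The variable cycle count comes from the control FSM, because each instruction passes through only a subset of the IF, ID, EXE, MEM, and WB steps. The sketch below is simplified. It assumes one state per step and the usual step assignment: beq and j finish after EXE, R-type after WB, sw after MEM, and lw after WB. A real controller has separate states per instruction class and drives datapath signals in each state, which this model omits.

```python
from enum import Enum


class Step(Enum):
    IF = "instruction fetch"
    ID = "decode / register fetch"
    EXE = "execution"
    MEM = "memory transfer"
    WB = "write back"


def next_step(step: Step, kind: str) -> Step:
    """Next FSM state; returning Step.IF means the instruction has completed."""
    if step is Step.IF:
        return Step.ID
    if step is Step.ID:
        return Step.EXE
    if step is Step.EXE:
        if kind in ("beq", "j"):
            return Step.IF
        if kind == "R":
            return Step.WB
        if kind in ("lw", "sw"):
            return Step.MEM
    if step is Step.MEM:
        return Step.WB if kind == "lw" else Step.IF
    if step is Step.WB:
        return Step.IF
    raise ValueError(f"unsupported instruction kind {kind!r}")


def cycles(kind: str) -> int:
    """Count clock cycles from fetch until the FSM returns to IF."""
    step, n = Step.IF, 0
    while True:
        n += 1
        step = next_step(step, kind)
        if step is Step.IF:
            return n
```

With these assumptions, beq, j, R-type, sw, and lw take 3, 3, 4, 4, and 5 cycles. This matches the 3-to-5-cycle spread in the multicycle timing comparison.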