TRANSCRIPT
Chapter 7 <1>
MICROARCHITECTURE
Digital Design and Computer Architecture, 2nd Edition
Chapter 7
David Money Harris and Sarah L. Harris
Chapter 7 <2>
• Introduction
• Performance Analysis
• Single-Cycle Processor
• Pipelined Processor
• Exceptions
• Advanced Microarchitecture
Chapter 7 :: Topics
Chapter 7 <3>
• Microarchitecture: the implementation of an architecture in hardware
• Processor:
  – Datapath: functional blocks
  – Control: control signals
[Figure: levels of abstraction, top to bottom, with typical building blocks]
Application Software: programs
Operating Systems: device drivers
Architecture: instructions, registers
Microarchitecture: datapaths, controllers
Logic: adders, memories
Digital Circuits: AND gates, NOT gates
Analog Circuits: amplifiers, filters
Devices: transistors, diodes
Physics: electrons
Introduction
Chapter 7 <4>
• Multiple implementations for a single architecture:
  – Single-cycle: each instruction executes in a single cycle
  – Multicycle: each instruction is broken into a series of shorter steps and executes in n cycles, where n varies with the instruction
  – Pipelined: each instruction is broken into a series of steps, and multiple instructions execute at once
• Note: AMD and Intel pipelines are different, even though both implement the same IA-32 architecture (a.k.a. the x86 ISA)
Microarchitecture
Chapter 7 <5>
• Program execution time:
  Execution Time = (#instructions)(cycles/instruction)(seconds/cycle)
• Definitions:
  – IC: Instruction Count (= #instructions)
  – CPI: Cycles/Instruction
  – Clock period: seconds/cycle
  – IPC: Instructions/Cycle (= 1/CPI)
• Challenge is to satisfy constraints of:
  – Cost
  – Power
  – Performance
Processor Performance
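The execution-time relation and the IPC definition above can be written compactly. The symbol $T_c$ for the clock period is the notation the later performance slides use; note that for a single-cycle processor CPI is 1 by construction.

```latex
\text{Execution Time} = \text{IC} \times \text{CPI} \times T_c,
\qquad
\text{IPC} = \frac{1}{\text{CPI}}
```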
Chapter 7 <6>
• Consider a subset of MIPS instructions:
  – R-type instructions: and, or, add, sub, slt
  – Memory instructions: lw, sw
  – Branch instructions: beq
MIPS Processor
Chapter 7 <7>
• Determines everything about a processor:
  – PC and special registers
  – Register File
  – Memory
Architectural State
Chapter 7 <8>
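[Figure: the MIPS state elements, all clocked by CLK: the 32-bit PC register (input PC', output PC); the instruction memory (32-bit address A, 32-bit read data RD); the register file (5-bit addresses A1, A2, A3; 32-bit read data RD1, RD2; 32-bit write data WD3; write enable WE3); and the data memory (32-bit address A, read data RD, write data WD; write enable WE)]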
MIPS State Elements
Plus the HI and LO registers
Chapter 7 <9>
• Datapath: design it first, to make the instruction actions possible
• Control: design it second, to make them happen
Single-Cycle MIPS Processor
Chapter 7 <10>
STEP 1: Fetch instruction
IM[PC]
[Figure: the PC drives instruction memory address A; the read data RD is the instruction, Instr]
Single-Cycle Datapath: lw fetch
Chapter 7 <11>
STEP 2: Read source operands from RF
RF[rs], i.e. RF[Instr(25:21)]
[Figure: Instr(25:21) drives register file address A1]
Single-Cycle Datapath: lw Register Read
Chapter 7 <12>
STEP 3: Sign-extend the immediate
SignExt(immed)
[Figure: Instr(15:0) feeds the Sign Extend block, which produces SignImm]
Single-Cycle Datapath: lw Immediate
Chapter 7 <13>
STEP 4: Compute the memory address
addr = RF[rs] + SignExt(immed)
[Figure: an ALU is added with SrcA = RD1 and SrcB = SignImm; ALUControl2:0 = 010 (add); outputs ALUResult and Zero]
Single-Cycle Datapath: lw address
Chapter 7 <14>
STEP 5: Read data from memory and write it back to the register file
RF[rt] ← DM[addr]
[Figure: ALUResult drives data memory address A; ReadData returns to register file port WD3; Instr(20:16) drives A3; RegWrite = 1; ALUControl2:0 = 010]
Single-Cycle Datapath: lw Memory Read
Chapter 7 <15>
STEP 6: Determine the address of the next instruction
PC ← PC + 4
[Figure: an adder computes PCPlus4 = PC + 4, which feeds PC'; the rest of the lw datapath is unchanged (RegWrite = 1, ALUControl2:0 = 010)]
Single-Cycle Datapath: lw PC Increment
Chapter 7 <16>
IM[PC]
RF[rt] ← DM[RF[rs] + SignExt(immed)]
PC ← PC + 4
Full RTL Expression for lw
Chapter 7 <17>
Write data in rt to memory: DM[addr] ← RF[rt]
[Figure: Instr(20:16) also drives register file address A2; RD2 becomes WriteData on data memory port WD; MemWrite = 1, RegWrite = 0, ALUControl2:0 = 010]
Single-Cycle Datapath: sw
Chapter 7 <18>
• Read from rs and rt
• Write ALUResult to register file
• Write to rd (instead of rt)
RF[rd] ← RF[rs] op RF[rt]
[Figure: three multiplexers are added: ALUSrc selects SrcB (RD2 or SignImm), RegDst selects WriteReg4:0 (Instr(20:16) or Instr(15:11)), and MemtoReg selects Result (ALUResult or ReadData); for R-type: RegWrite = 1, RegDst = 1, ALUSrc = 0, MemWrite = 0, MemtoReg = 0, ALUControl2:0 varies]
Single-Cycle Datapath: R-Type
Chapter 7 <19>
• Determine whether values in rs and rt are equal
• Calculate branch target address: BTA = PC + 4 + (SignExt(immed) << 2)   # << 2 multiplies by 4
[Figure: a shift-left-2 block and a second adder compute PCBranch = PCPlus4 + (SignImm << 2); PCSrc = Branch AND Zero selects PC' (PCPlus4 or PCBranch); for beq: RegWrite = 0, RegDst = X, ALUSrc = 0, Branch = 1, MemWrite = 0, MemtoReg = X, ALUControl2:0 = 110 (subtract)]
Single-Cycle Datapath: beq
Chapter 7 <20>
IM[PC]
if (RF[rs] - RF[rt] == 0) PC ← BTA
else PC ← PC + 4
RTL Expression for beq
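As a worked instance of the branch target calculation, assume a beq located at PC = 0x20 whose 16-bit immediate is 0x0010. Both values are hypothetical, picked only to make the arithmetic concrete.

```latex
\begin{aligned}
\text{BTA} &= \text{PC} + 4 + \bigl(\text{SignExt}(\text{immed}) \ll 2\bigr) \\
           &= \texttt{0x20} + 4 + (\texttt{0x0010} \ll 2) \\
           &= \texttt{0x24} + \texttt{0x40} = \texttt{0x64}
\end{aligned}
```

The shift converts the immediate from a word offset into a byte offset, so the immediate counts instructions relative to PC + 4.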
Chapter 7 <21>
[Figure: complete single-cycle datapath with the Control Unit, which takes Op = Instr(31:26) and Funct = Instr(5:0) and generates RegWrite, RegDst, ALUSrc, Branch, MemWrite, MemtoReg, and ALUControl2:0; PCSrc = Branch AND Zero]
Single-Cycle Processor
Chapter 7 <22>
[Figure: the Control Unit consists of the Main Decoder (input Opcode5:0; outputs RegWrite, RegDst, ALUSrc, Branch, MemWrite, MemtoReg, ALUOp1:0) and the ALU Decoder (inputs ALUOp1:0 and Funct5:0; output ALUControl2:0)]
Single-Cycle Control
Chapter 7 <23>
[Figure: ALU symbol with N-bit inputs A and B, 3-bit function select F, and N-bit output Y]
F2:0 Function
000 A & B
001 A | B
010 A + B
011 not used
100 A & ~B
101 A | ~B
110 A - B
111 SLT
Review: ALU
Chapter 7 <24>
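[Figure: ALU implementation: F2 selects B or ~B and also supplies the adder carry-in (Cout is the adder carry-out); a 4:1 multiplexer controlled by F1:0 selects among the AND result, the OR result, the adder sum S, and the zero-extended sign bit S[N-1] (for SLT) to produce Y]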
Review: ALU
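The datapath model later in the deck instantiates an `alu` module but the deck does not list one. Below is a minimal sketch that follows the F2:0 function table; the port order (a, b, f, y, zero) is an assumption chosen to match the instantiation `alu alu (srca, srcb, alucontrol, aluout, zero)`, and overflow is ignored for SLT.

```verilog
// Sketch of a 32-bit ALU following the F2:0 function table.
// Port names and order are assumptions, not taken from the slides.
module alu (input      [31:0] a, b,
            input      [2:0]  f,
            output reg [31:0] y,
            output            zero);

  wire [31:0] bb  = f[2] ? ~b : b;   // F2 selects B or ~B
  wire [31:0] sum = a + bb + f[2];   // F2 is also the carry-in, so F2 = 1 gives A - B

  always @(*)
    case (f[1:0])
      2'b00: y = a & bb;             // 000: A & B    100: A & ~B
      2'b01: y = a | bb;             // 001: A | B    101: A | ~B
      2'b10: y = sum;                // 010: A + B    110: A - B
      2'b11: y = {31'b0, sum[31]};   // 111: SLT, sign bit of A - B (011 is not used)
    endcase

  assign zero = (y == 32'b0);        // used by beq
endmodule
```

Sharing one adder between add, subtract, and SLT mirrors the schematic: subtraction is A + ~B + 1, and SLT reuses the sign of that difference.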
Chapter 7 <25>
ALUOp1:0   Meaning
00         Add (for lw, sw)
01         Subtract (for beq)
10         Look at funct (R-type)
11         Not used

ALUOp1:0   funct          ALUControl2:0
00         X              010 (Add)
X1         X              110 (Subtract)
1X         100000 (add)   010 (Add)
1X         100010 (sub)   110 (Subtract)
1X         100100 (and)   000 (And)
1X         100101 (or)    001 (Or)
1X         101010 (slt)   111 (SLT)
Control Unit: ALU Decoder
Chapter 7 <26>
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000
lw           100011
sw           101011
beq          000100
[Figure: complete single-cycle processor, shown for reference while filling in the table]
Control Unit: Main Decoder
Chapter 7 <27>
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000  1         1       0       0       0         0         10
lw           100011  1         0       1       0       0         1         00
sw           101011  0         X       1       0       1         X         00
beq          000100  0         X       0       1       0         X         01
Control Unit: Main Decoder
Chapter 7 <28>
[Figure: complete single-cycle processor executing or: RegWrite = 1, RegDst = 1, ALUSrc = 0, Branch = 0, MemWrite = 0, MemtoReg = 0, ALUControl2:0 = 001]
Single-Cycle Datapath: or
Chapter 7 <29>
[Figure: complete single-cycle processor]
No change to datapath
Extended Functionality: addi
Chapter 7 <30>
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000  1         1       0       0       0         0         10
lw           100011  1         0       1       0       0         1         00
sw           101011  0         X       1       0       1         X         00
beq          000100  0         X       0       1       0         X         01
addi         001000
Main Decoder table: addi
Chapter 7 <31>
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000  1         1       0       0       0         0         10
lw           100011  1         0       1       0       0         1         00
sw           101011  0         X       1       0       1         X         00
beq          000100  0         X       0       1       0         X         01
addi         001000  1         0       1       0       0         0         00
Main Decoder table: addi
Chapter 7 <32>
[Figure: single-cycle processor extended for j: PCJump = {PCPlus4(31:28), Instr(25:0) << 2}; a new multiplexer controlled by Jump selects PCJump as PC']
Extended Functionality: j
Chapter 7 <33>
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0  Jump
R-type       000000  1         1       0       0       0         0         10        0
lw           100011  1         0       1       0       0         1         00        0
sw           101011  0         X       1       0       1         X         00        0
beq          000100  0         X       0       1       0         X         01        0
j            000010
Main Decoder table: j
Chapter 7 <34>
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0  Jump
R-type       000000  1         1       0       0       0         0         10        0
lw           100011  1         0       1       0       0         1         00        0
sw           101011  0         X       1       0       1         X         00        0
beq          000100  0         X       0       1       0         X         01        0
j            000010  0         X       X       X       0         X         XX        1
Main Decoder table: j
Chapter 7 <35>
Program Execution Time = (#instructions)(cycles/instruction)(seconds/cycle) = IC × CPI × Tc
Review: Processor Performance
Chapter 7 <36>
[Figure: complete single-cycle processor with the lw critical path highlighted: PC, instruction memory, register file read, ALU, data memory, result multiplexer, register file write]
Tc is limited by the critical path (lw)
Single-Cycle Performance
Chapter 7 <37>
• Single-cycle critical path:
  Tc = tpcq_PC + tmem + max(tRFread, tsext + tmux) + tALU + tmem + tmux + tRFsetup
• Typically, the limiting paths are memory, ALU, and register file:
  Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
Single-Cycle Performance
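The step between the two expressions above is the assumption that reading the register file is slower than sign-extending the immediate and passing it through the SrcB multiplexer. Written out, with the two memory accesses merged:

```latex
\begin{aligned}
T_c &= t_{pcq\_PC} + t_{mem} + \max\bigl(t_{RFread},\; t_{sext} + t_{mux}\bigr)
       + t_{ALU} + t_{mem} + t_{mux} + t_{RFsetup} \\
    &= t_{pcq\_PC} + 2\,t_{mem} + t_{RFread} + t_{mux} + t_{ALU} + t_{RFsetup}
       \qquad \text{if } t_{RFread} \ge t_{sext} + t_{mux}
\end{aligned}
```

The remaining $t_{mux}$ term is the result multiplexer between the data memory and the register file write port.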
Chapter 7 <38>
Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     30
Register setup        tsetup      20
Multiplexer           tmux        25
ALU                   tALU        200
Memory read           tmem        250
Register file read    tRFread     150
Register file setup   tRFsetup    20
Tc = ?
Single-Cycle Performance Example
Chapter 7 <39>
Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     30
Register setup        tsetup      20
Multiplexer           tmux        25
ALU                   tALU        200
Memory read           tmem        250
Register file read    tRFread     150
Register file setup   tRFsetup    20
Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
   = [30 + 2(250) + 150 + 25 + 200 + 20] ps = 925 ps
fclk = 1/Tc = 1/(0.925 ns) ≈ 1.08 GHz
Single-Cycle Performance Example
Chapter 7 <40>
Program with IC = 100 billion instructions:
Execution Time = IC × CPI × Tc
               = (100 × 10^9)(1)(925 × 10^-12 s) = 92.5 seconds
Single-Cycle Performance Example
Chapter 7 <41>
Pros and cons of single-cycle implementation:
+ Simple design
+ One cycle per instruction
- Slow cycle time, limited by the longest instruction (lw)
- HW: 2 adders + ALU; 2 memories
Evaluation of Single-Cycle Processor
Chapter 7 <42>
[Figure: complete single-cycle processor including the jump hardware (PCJump multiplexer controlled by Jump)]
Review: Single-Cycle Processor
Chapter 7 <43>
//------------------------------------------------
// [email protected] 9 November 2005
// Top level system including MIPS and memories
//------------------------------------------------
module top (input         clk, reset,
            output [31:0] writedata, dataadr,
            output        memwrite);

  wire [31:0] pc, instr, readdata;

  // instantiate processor and memories
  mips mips (clk, reset, pc, instr, memwrite, dataadr, writedata, readdata);
  imem imem (pc[7:2], instr);
  dmem dmem (clk, memwrite, dataadr, writedata, readdata);
endmodule
Verilog Model
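To simulate the top-level module, a testbench has to supply the clock and reset. The sketch below is hypothetical: the module name `testbench`, the 10 ns clock period, the reset length, and the 100-cycle timeout are all illustrative assumptions. It relies on the `top` module and the modules it instantiates, and it only reports stores rather than checking for a particular result.

```verilog
// Hypothetical testbench for the top-level module (names and timing are assumptions).
`timescale 1ns/1ps
module testbench;
  reg         clk, reset;
  wire [31:0] writedata, dataadr;
  wire        memwrite;

  // device under test: the top-level system
  top dut (clk, reset, writedata, dataadr, memwrite);

  // hold reset for the first 22 ns, then release it
  initial begin
    reset = 1;
    #22 reset = 0;
  end

  // free-running clock with a 10 ns period
  initial clk = 0;
  always #5 clk = ~clk;

  // report every store the processor performs
  always @(negedge clk)
    if (!reset && memwrite)
      $display("store: mem[%0d] <= %0d", dataadr, writedata);

  // stop after 100 clock periods so the simulation always terminates
  initial #1000 $finish;
endmodule
```

Sampling on the falling edge keeps the check away from the rising edge on which the processor state changes.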
Chapter 7 <44>
//------------------------------------------------
// [email protected] 23 October 2005
// External data memory used by MIPS single-cycle processor
//------------------------------------------------
module dmem (input         clk, we,
             input  [31:0] a, wd,
             output [31:0] rd);

  reg [31:0] RAM[63:0];

  assign rd = RAM[a[31:2]];        // word-aligned read

  always @(posedge clk)
    if (we) RAM[a[31:2]] <= wd;    // word-aligned write
endmodule
Verilog Model of Data Memory
Chapter 7 <45>
module imem (input      [5:0]  addr,
             output reg [31:0] instr);

  // imem is modeled as a lookup table, a stored-program byte-addressable ROM
  always @(addr)
    case ({addr, 2'b00})
      // address  instruction             disassembly
      // -------  -----------             -----------
      8'h00: instr = 32'h20020005;   // addi $2, $0, 5
      8'h04: instr = 32'h2003000c;   // addi $3, $0, 12
      8'h08: instr = 32'h2067fff7;   // addi $7, $3, -9
      8'h0c: instr = 32'h00e22025;   // or   $4, $7, $2
      8'h10: instr = 32'h00642824;   // and  $5, $3, $4
      8'h14: instr = 32'h00a42820;   // add  $5, $5, $4
      8'h18: instr = 32'h10a7000a;   // beq  $5, $7, +10 instructions
      8'h1c: instr = 32'h0064202a;   // slt  $4, $3, $4
      8'h20: instr = 32'h10800001;   // beq  $4, $0, +1 instruction
      default: instr = {32{1'bx}};   // unknown instruction
    endcase
endmodule
Verilog Model of Instr. Memory
Chapter 7 <46>
module imem (input  [5:0]  addr,
             output [31:0] instr);

  reg [31:0] RAM[63:0];

  // imem is RAM, loaded from memfile.dat file with hex values at startup
  initial
    begin
      $readmemh("memfile.dat", RAM);
    end

  assign instr = RAM[addr];   // instr at RAM[addr] is read out
endmodule

// imem can be created with CoreGen for Xilinx synthesis
Alternate Model of Instr. Memory
Chapter 7 <47>
// single-cycle MIPS processor
module mips (input         clk, reset,
             output [31:0] pc,
             input  [31:0] instr,
             output        memwrite,
             output [31:0] aluout, writedata,
             input  [31:0] readdata);

  wire       memtoreg, pcsrc, zero, alusrc, regdst, regwrite, jump;
  wire [2:0] alucontrol;

  controller c (instr[31:26], instr[5:0], zero,
                memtoreg, memwrite, pcsrc,
                alusrc, regdst, regwrite, jump,
                alucontrol);

  datapath dp (clk, reset, memtoreg, pcsrc,
               alusrc, regdst, regwrite, jump,
               alucontrol, zero, pc, instr,
               aluout, writedata, readdata);
endmodule
Verilog Model of MIPS Processor
Chapter 7 <48>
module controller (input  [5:0] op, funct,
                   input        zero,
                   output       memtoreg, memwrite,
                   output       pcsrc, alusrc,
                   output       regdst, regwrite,
                   output       jump,
                   output [2:0] alucontrol);

  wire [1:0] aluop;
  wire       branch;

  maindec md (op, regwrite, regdst, alusrc, branch,
              memwrite, memtoreg, aluop, jump);

  aludec  ad (funct, aluop, alucontrol);

  assign pcsrc = branch & zero;
endmodule
Verilog Model of Controller
Chapter 7 <49>
module maindec (input  [5:0] op,
                output       regwrite, regdst, alusrc, branch,
                output       memwrite, memtoreg,
                output [1:0] aluop,
                output       jump);

  reg [8:0] controls;

  assign {regwrite, regdst, alusrc, branch,
          memwrite, memtoreg, aluop, jump} = controls;

  always @(*)
    case (op)
      6'b000000: controls = 9'b110000100;   // Rtype
      6'b100011: controls = 9'b101001000;   // LW
      6'b101011: controls = 9'b001010000;   // SW
      6'b000100: controls = 9'b000100010;   // BEQ
      6'b001000: controls = 9'b101000000;   // ADDI
      6'b000010: controls = 9'b000000001;   // J
      default:   controls = 9'bxxxxxxxxx;   // ???
    endcase
endmodule
Verilog Model of Main Decoder
Chapter 7 <50>
module aludec (input      [5:0] funct,
               input      [1:0] aluop,
               output reg [2:0] alucontrol);

  always @(*)
    case (aluop)
      2'b00: alucontrol = 3'b010;               // add
      2'b01: alucontrol = 3'b110;               // sub
      default: case (funct)                     // RTYPE
          6'b100000: alucontrol = 3'b010;       // ADD
          6'b100010: alucontrol = 3'b110;       // SUB
          6'b100100: alucontrol = 3'b000;       // AND
          6'b100101: alucontrol = 3'b001;       // OR
          6'b101010: alucontrol = 3'b111;       // SLT
          default:   alucontrol = 3'bxxx;       // ???
        endcase
    endcase
endmodule
Verilog Model of ALU Decoder
Chapter 7 <51>
module datapath (input         clk, reset,
                 input         memtoreg, pcsrc,
                 input         alusrc, regdst,
                 input         regwrite, jump,
                 input  [2:0]  alucontrol,
                 output        zero,
                 output [31:0] pc,
                 input  [31:0] instr,
                 output [31:0] aluout, writedata,
                 input  [31:0] readdata);

  wire [4:0]  writereg;
  wire [31:0] pcnext, pcnextbr, pcplus4, pcbranch;
  wire [31:0] signimm, signimmsh, srca, srcb, result;

  // next PC logic
  flopr #(32) pcreg (clk, reset, pcnext, pc);
  adder       pcadd1 (pc, 32'b100, pcplus4);
  sl2         immsh (signimm, signimmsh);
  adder       pcadd2 (pcplus4, signimmsh, pcbranch);
  mux2 #(32)  pcbrmux (pcplus4, pcbranch, pcsrc, pcnextbr);
  mux2 #(32)  pcmux (pcnextbr, {pcplus4[31:28], instr[25:0], 2'b00},
                     jump, pcnext);
Verilog Model of Datapath
Chapter 7 <52>
  // register file logic
  regfile     rf (clk, regwrite, instr[25:21], instr[20:16],
                  writereg, result, srca, writedata);
  mux2 #(5)   wrmux (instr[20:16], instr[15:11], regdst, writereg);
  mux2 #(32)  resmux (aluout, readdata, memtoreg, result);
  signext     se (instr[15:0], signimm);

  // ALU logic
  mux2 #(32)  srcbmux (writedata, signimm, alusrc, srcb);
  alu         alu (srca, srcb, alucontrol, aluout, zero);
endmodule
Verilog Model of Datapath (con’t)
Chapter 7 <53>
module regfile (input         clk,
                input         we3,
                input  [4:0]  ra1, ra2, wa3,
                input  [31:0] wd3,
                output [31:0] rd1, rd2);

  reg [31:0] rf [31:0];

  // three ported register file:
  // read two ports combinationally,
  // write third port on rising edge of clock.
  // Register 0 hardwired to 0.
  always @(posedge clk)
    if (we3) rf[wa3] <= wd3;

  assign rd1 = (ra1 != 0) ? rf[ra1] : 0;
  assign rd2 = (ra2 != 0) ? rf[ra2] : 0;
endmodule
Verilog Model of Register File
Chapter 7 <54>
module adder (input  [31:0] a, b,
              output [31:0] y);
  assign y = a + b;
endmodule

module sl2 (input  [31:0] a,
            output [31:0] y);
  // shift left by 2
  assign y = {a[29:0], 2'b00};
endmodule

module signext (input  [15:0] a,
                output [31:0] y);
  assign y = {{16{a[15]}}, a};
endmodule
Verilog Models of Other Parts
Chapter 7 <55>
module flopr #(parameter WIDTH = 8)
              (input                  clk, reset,
               input      [WIDTH-1:0] d,
               output reg [WIDTH-1:0] q);
  always @(posedge clk, posedge reset)
    if (reset) q <= 0;
    else       q <= d;
endmodule

module flopenr #(parameter WIDTH = 8)
                (input                  clk, reset, en,
                 input      [WIDTH-1:0] d,
                 output reg [WIDTH-1:0] q);
  always @(posedge clk, posedge reset)
    if      (reset) q <= 0;
    else if (en)    q <= d;
endmodule

module mux2 #(parameter WIDTH = 8)
             (input  [WIDTH-1:0] d0, d1,
              input              s,
              output [WIDTH-1:0] y);
  assign y = s ? d1 : d0;
endmodule
Verilog for Parameterized Parts
Chapter 7 <56>
• Unscheduled function call to exception handler
• Caused by:
  – Hardware, also called an interrupt, e.g. keyboard
  – Software, also called traps, e.g. undefined instruction
• When exception occurs, the processor:
  – Records cause of exception (Cause register)
  – Jumps to exception handler (0x80000180)
  – Returns to program (EPC register)
Review: Exceptions
Chapter 7 <57>
Example Exception
Chapter 7 <58>
• Not part of register file; in Coprocessor 0
  – Cause
    • Records cause of exception
    • Coprocessor 0 register 13
  – EPC (Exception PC)
    • Records PC where exception occurred
    • Coprocessor 0 register 14
• Move from Coprocessor 0
  – mfc0 $t0, Cause (= mfc0 $t0, $13)
  – Moves contents of Cause into $t0
mfc0 encoding:
Bits    31:26    25:21   20:16     15:11        10:0
Field   010000   00000   $t0 (8)   Cause (13)   00000000000
Review: Exception Registers
Chapter 7 <59>
Exception                  Cause
Hardware Interrupt         0x00000000
System Call                0x00000020
Breakpoint / Divide by 0   0x00000024
Undefined Instruction      0x00000028
Arithmetic Overflow        0x00000030
Extend single-cycle MIPS processor to handle last two types of exceptions
Review: Exception Causes
Chapter 7 <60>
Exception RTLs
Undefined Instruction
IM[PC]
. . .                   # problem in decoding (bad op or funct)
Cause ← 40              # = 0x28
EPC ← PC
PC ← 0x80000180         # exception handler address

Arithmetic Overflow
IM[PC]
. . .                   # ALU operation overflows
Cause ← 48              # = 0x30
EPC ← PC
PC ← 0x80000180         # exception handler address

mfc0 instruction (e.g. mfc0 $t1, $13)
IM[PC]
RF[rt] ← RFc0[rd]
PC ← PC + 4
Chapter 7 <61>
[Figure: multicycle MIPS datapath extended with exception hardware: an EPC register (enable EPCWrite) that captures the PC; a Cause register (enable CauseWrite) loaded through a multiplexer, controlled by IntCause, with 0x30 on input 0 and 0x28 on input 1; an Overflow output from the ALU; and the handler address 0x8000 0180 as input 11 of the PCSrc1:0 multiplexer]
Exception Hardware: EPC & Cause
Never mind the multi-cycle datapath, focus on the exception hardware.
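The exception registers in the figure can be sketched on their own. The module name `exceptionregs` and its port names are assumptions for illustration; only the signals EPCWrite, CauseWrite, and IntCause and the two cause codes come from the slides. The sketch stores whatever value the datapath presents on `pc`, following the RTL "EPC ← PC", and assumes IntCause = 0 selects overflow.

```verilog
// Sketch of the EPC and Cause registers (module and port names are assumptions).
module exceptionregs (input             clk, reset,
                      input             epcwrite, causewrite,
                      input             intcause,    // assumed: 0 = overflow, 1 = undefined instruction
                      input      [31:0] pc,
                      output reg [31:0] epc, cause);

  // cause code selected by IntCause
  wire [31:0] causein = intcause ? 32'h00000028    // undefined instruction
                                 : 32'h00000030;   // arithmetic overflow

  always @(posedge clk, posedge reset)
    if (reset) begin
      epc   <= 32'b0;
      cause <= 32'b0;
    end else begin
      if (epcwrite)   epc   <= pc;        // EPC <- PC
      if (causewrite) cause <= causein;   // Cause <- 0x28 or 0x30
    end
endmodule
```

The controller would assert both enables in the cycle where it also steers 0x80000180 into the PC.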
Chapter 7 <62>
[Figure: the same multicycle datapath with mfc0 support: Instr(15:11) selects a Coprocessor 0 register (01101 = Cause, 01110 = EPC) through a multiplexer whose output C0 feeds a new input 10 of the widened MemtoReg1:0 multiplexer]
Exception Hardware: mfc0
Never mind the multi-cycle datapath, focus on the exception hardware.
Chapter 7 <63>
• Temporal parallelism
• Divide single-cycle processor into 5 stages:
  – Fetch
  – Decode
  – Execute
  – Memory
  – Writeback
• Add pipeline registers between stages
Pipelined MIPS Processor
Chapter 7 <64>
[Figure: timing diagrams over a time axis from 0 to 1900 ps. Single-cycle: each instruction completes Fetch Instruction, Decode / Read Reg, Execute ALU, Memory Read/Write, and Write Reg before the next instruction starts. Pipelined: instructions 1 to 3 overlap, each starting one stage after the previous one]
Single-Cycle vs. Pipelined
Chapter 7 <65>
[Figure: pipeline diagram over cycles 1 to 10; each instruction passes through IM, RF (read), ALU, DM, and RF (write), starting one cycle after the previous instruction]
lw  $s2, 40($0)
add $s3, $t1, $t2
sub $s4, $s1, $s5
and $s5, $t5, $t6
sw  $s6, 20($s1)
or  $s7, $t3, $t4
Pipelined Processor Abstraction
Chapter 7 <66>
[Figure: single-cycle datapath and pipelined datapath, the latter divided into Fetch, Decode, Execute, Memory, and Writeback stages by pipeline registers; pipelined signal names carry a stage suffix (F, D, E, M, W), e.g. PCF, InstrD, SrcAE, ALUOutM, ResultW]
Single-Cycle & Pipelined Datapath
Chapter 7 <67>
[Figure: pipelined datapath in which WriteRegE4:0 is carried through the Memory and Writeback pipeline registers (WriteRegM4:0, WriteRegW4:0) before driving register file address A3]
WriteReg must arrive at same time as Result
Corrected Pipelined Datapath
Chapter 7 <68>
[Figure: pipelined datapath with the Control Unit in the Decode stage; control signals travel down the pipeline with the instruction: RegWrite and MemtoReg through E, M, W; MemWrite and Branch through E, M; ALUControl, ALUSrc, and RegDst to E; PCSrcM = BranchM AND ZeroM]
• Same control unit as single-cycle processor
• Control delayed to proper pipeline stage
Pipelined Processor Control
Chapter 7 <69>
• Occur when an instruction depends on the result of an instruction that hasn’t completed
• Types:
  – Data hazard: register value not yet written back to register file
  – Control hazard: next instruction not decided yet (caused by branches) or target address not calculated yet (jumps and branches)
Pipeline Hazards
Chapter 7 <70>
[Figure: pipeline diagram over cycles 1 to 8; add writes $s0 in cycle 5, but and reads $s0 in cycle 3 and or reads it in cycle 4]
add $s0, $s2, $s3
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
Data Hazard
Chapter 7 <71>
2 SW fixes
• Insert nops in code at compile time
• Rearrange code at compile time
2 HW fixes
• Stall the processor at run time
• Forward data at run time
Handling Data Hazards
Chapter 7 <72>
[Figure: pipeline diagram over cycles 1 to 10 with two nops inserted after add]
add $s0, $s2, $s3
nop
nop
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
• Insert enough nops for result to be ready
• Or move independent useful instructions forward
Compile-Time Hazard Elimination
Chapter 7 <73>
[Figure: pipeline diagram over cycles 1 to 8; the add result is forwarded from the Memory stage to the Execute stage of and, and from the Writeback stage to the Execute stage of or]
add $s0, $s2, $s3
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
Data Forwarding
Chapter 7 <74>
[Figure: pipelined processor with a Hazard Unit; inputs RsE, RtE, WriteRegM4:0, WriteRegW4:0, RegWriteM, and RegWriteW; outputs ForwardAE and ForwardBE control 3:1 multiplexers (inputs 00, 01, 10) in front of the ALU that select among the register file output, ResultW, and ALUOutM]
Data Forwarding
Chapter 7 <75>
• Forward to Execute stage from either:
  – Memory stage or
  – Writeback stage
• Forwarding logic for ForwardAE:
  if      ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10
  else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01
  else    ForwardAE = 00
• Forwarding logic for ForwardBE is the same, but with rtE in place of rsE
Data Forwarding
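The forwarding conditions above translate almost line for line into combinational Verilog. This is a sketch only: the module name `forwardunit` and the lower-case port names are assumptions, and the deck does not give a Verilog model of the Hazard Unit.

```verilog
// Sketch of the Execute-stage forwarding logic (module and port names are assumptions).
module forwardunit (input      [4:0] rsE, rtE,
                    input      [4:0] writeregM, writeregW,
                    input            regwriteM, regwriteW,
                    output reg [1:0] forwardaE, forwardbE);

  always @(*) begin
    // SrcA: the Memory stage has priority because it holds the newer result
    if      ((rsE != 0) && (rsE == writeregM) && regwriteM) forwardaE = 2'b10;
    else if ((rsE != 0) && (rsE == writeregW) && regwriteW) forwardaE = 2'b01;
    else                                                    forwardaE = 2'b00;

    // SrcB: same logic with rtE in place of rsE
    if      ((rtE != 0) && (rtE == writeregM) && regwriteM) forwardbE = 2'b10;
    else if ((rtE != 0) && (rtE == writeregW) && regwriteW) forwardbE = 2'b01;
    else                                                    forwardbE = 2'b00;
  end
endmodule
```

The test against register 0 matters because $0 is hardwired to zero and must never be replaced by a forwarded value.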
Chapter 7 <76>
[Figure: pipeline diagram over cycles 1 to 8; lw has the loaded value only at the end of cycle 4 (Memory stage), but and needs it at the start of cycle 4 (Execute stage): Trouble!]
lw  $s0, 40($0)
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
Stalling: forwarding on a load-use hazard isn’t possible!
Chapter 7 <77>
[Figure: pipeline diagram over cycles 1 to 9; and is held in Decode and or is held in Fetch for one extra cycle (Stall), so that the lw result can be forwarded from the Writeback stage to the Execute stage of and]
lw  $s0, 40($0)
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
Stalling: the HW solution is to stall the pipeline
Chapter 7 <78>
[Figure: pipelined processor with the Hazard Unit extended for stalls: new inputs RsD, RtD, and MemtoRegE; new outputs StallF and StallD, which disable the enable (EN) inputs of the PC and Fetch/Decode pipeline registers, and FlushE, which drives the clear (CLR) input of the Decode/Execute pipeline register]
Stalling Hardware
Chapter 7 <79>
lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall
• Stalling the Fetch and Decode stages holds the dependent instruction in Decode, and flushing the Execute stage inserts a bubble behind the lw. The held instruction is simply decoded again in the next clock cycle, this time with correct (forwarded) data!
Stalling Logic
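A minimal sketch of the load-use stall logic above. The module name `lwstallunit` and the port names are assumptions; MemtoRegE identifies the instruction in Execute as a lw because lw is the only instruction in this subset that sets MemtoReg.

```verilog
// Sketch of the load-use stall logic (module and port names are assumptions).
module lwstallunit (input  [4:0] rsD, rtD, rtE,
                    input        memtoregE,
                    output       lwstall,
                    output       stallF, stallD, flushE);

  // the instruction in Decode reads the register a lw in Execute will load
  assign lwstall = ((rsD == rtE) || (rtD == rtE)) && memtoregE;

  assign stallF = lwstall;   // hold the PC
  assign stallD = lwstall;   // hold the Fetch/Decode pipeline register
  assign flushE = lwstall;   // clear the Decode/Execute register, inserting a bubble
endmodule
```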
Chapter 7 <80>
• beq:
  – Branch not determined until 4th stage of pipeline
  – Instructions after the branch are fetched before the branch occurs
  – These instructions must be flushed if the branch is taken
• Branch misprediction penalty
  – The number of instructions flushed when the branch is taken
  – May be reduced by determining the branch earlier
Control Hazards
Chapter 7 <81>
[Figure: pipelined processor with Hazard Unit (StallF, StallD, FlushE, ForwardAE, ForwardBE); the branch decision PCSrcM = BranchM AND ZeroM is made in the Memory stage]
Control Hazards: Original Pipeline
Chapter 7 <82>
[Figure: pipeline diagram over cycles 1 to 9; the branch is resolved in cycle 4, so the three instructions fetched after beq must be flushed before slt is fetched from the branch target]
20  beq $t1, $t2, 40
24  and $t0, $s0, $s1
28  or  $t1, $s4, $s0
2C  sub $t2, $s0, $s5
30  ...
    ...
64  slt $t3, $s2, $s3
Control Hazards
Chapter 7 <83>
[Figure: pipelined processor with early branch resolution: an equality comparator (=) on the register file outputs in the Decode stage produces EqualD; PCSrcD = BranchD AND EqualD; the branch target adder moves to Decode (PCBranchD); PCSrcD also drives the clear (CLR) input of the Fetch/Decode pipeline register to flush the instruction fetched after a taken branch]
But: introduced another data hazard in Decode stage!
Early Branch Resolution
Chapter 7 <84>
[Figure: pipeline timing diagram, cycles 1-9, with the branch resolved in the Decode stage. Only the one instruction fetched after the taken beq is flushed.]
20  beq $t1, $t2, 40
24  and $t0, $s0, $s1    (flushed)
28  or  $t1, $s4, $s0
2C  sub $t2, $s0, $s5
30  ...
...
64  slt $t3, $s2, $s3
Early Branch Resolution
Chapter 7 <85>
[Figure: pipelined processor datapath with forwarding to the early-branch hardware. Multiplexers controlled by ForwardAD and ForwardBD select the inputs of the Decode-stage equality comparator, and the hazard unit additionally takes BranchD and RegWriteE as inputs.]
Forwarding to Early-branch HW
Chapter 7 <86>
• Forwarding logic:
ForwardAD = (rsD != 0) AND (rsD == WriteRegM) AND RegWriteM
ForwardBD = (rtD != 0) AND (rtD == WriteRegM) AND RegWriteM
• Stalling logic:
branchstall = BranchD AND RegWriteE AND (WriteRegE == rsD OR WriteRegE == rtD)
              OR
              BranchD AND MemtoRegM AND (WriteRegM == rsD OR WriteRegM == rtD)
StallF = StallD = FlushE = (lwstall OR branchstall)
Control Forwarding & Stalling Logic
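The forwarding and stalling equations on this slide can be checked with a small software model. The sketch below is only an illustration: the function name `branch_hazard_signals`, the dictionary return value, and the use of MIPS register numbers (9 for $t1, 10 for $t2) are assumptions, not part of the slides.

```python
def branch_hazard_signals(rsD, rtD, BranchD,
                          WriteRegE, RegWriteE,
                          WriteRegM, RegWriteM, MemtoRegM,
                          lwstall=False):
    """Evaluate the Decode-stage forwarding and stalling equations for one cycle."""
    # Forward a result from the Memory stage to the equality comparator inputs.
    ForwardAD = bool((rsD != 0) and (rsD == WriteRegM) and RegWriteM)
    ForwardBD = bool((rtD != 0) and (rtD == WriteRegM) and RegWriteM)

    # Stall if a branch source is still being computed in Execute,
    # or is still being loaded by an lw in the Memory stage.
    branchstall = bool(
        (BranchD and RegWriteE and (WriteRegE == rsD or WriteRegE == rtD))
        or (BranchD and MemtoRegM and (WriteRegM == rsD or WriteRegM == rtD))
    )
    stall = bool(lwstall or branchstall)
    return {
        "ForwardAD": ForwardAD,
        "ForwardBD": ForwardBD,
        "StallF": stall,
        "StallD": stall,
        "FlushE": stall,
    }


# beq $t1, $t2 in Decode while an ALU instruction writing $t1 is in Memory:
# the value can be forwarded, so no stall is needed.
forward_case = branch_hazard_signals(rsD=9, rtD=10, BranchD=True,
                                     WriteRegE=0, RegWriteE=False,
                                     WriteRegM=9, RegWriteM=True, MemtoRegM=False)

# Same branch while the instruction writing $t1 is still in Execute:
# the result does not exist yet, so the pipeline must stall.
stall_case = branch_hazard_signals(rsD=9, rtD=10, BranchD=True,
                                   WriteRegE=9, RegWriteE=True,
                                   WriteRegM=0, RegWriteM=False, MemtoRegM=False)
print(forward_case)
print(stall_case)
```

Modeling the signals as plain booleans keeps the code a line-by-line transcription of the equations, which makes it easy to compare against the slide.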
Chapter 7 <87>
• Guess whether the branch will be taken
– Backward branches are usually taken (in bottom-tested loops)
– Consider history to improve the guess
• Good prediction significantly reduces the fraction of branches requiring a flush
• Requires hardware for a history table, etc.
Branch Prediction
Chapter 7 <88>
• SPECINT2000 benchmark:
– 25% loads
– 10% stores
– 11% branches
– 2% jumps
– 52% R-type
• Suppose:
– 40% of loads are used by the next instruction
– 25% of branches are mispredicted
– All jumps flush the next instruction (JTA not ready)
• What is the average CPI?
Pipelined Performance Example
Chapter 7 <89>
• Average CPI is the weighted average of CPIlw, CPIsw, CPIbeq, CPIj, and CPIR-type
• For pipelined processors, CPI = 1 + number of stall cycles
• Load CPI = 1 when there is no stall, 2 when a load-use hazard occurs (1 stall)
– CPIlw = 1(0.6) + 2(0.4) = 1.4
– CPIsw = 1
• Branch CPI = 1 when there is no stall, 2 when it mispredicts (1 stall)
– CPIbeq = 1(0.75) + 2(0.25) = 1.25
• Jump CPI = 2 since it always requires 1 stall
– CPIj = 2
– CPIR-type = 1
• Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) = 1.15
Calculation of Average CPI
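The weighted average above can be reproduced in a few lines. This is a minimal sketch using the instruction mix and hazard rates from the SPECINT2000 example; the helper name `cpi_with_stalls` and the dictionary keys are illustrative assumptions.

```python
# Fraction of dynamic instructions in each class (SPECINT2000 example).
mix = {"lw": 0.25, "sw": 0.10, "beq": 0.11, "j": 0.02, "rtype": 0.52}


def cpi_with_stalls(stall_fraction, stall_cycles=1):
    """CPI = 1 + (fraction of instructions that stall) * (cycles lost per stall)."""
    return 1 + stall_fraction * stall_cycles


cpi = {
    "lw": cpi_with_stalls(0.40),    # 40% of loads are used by the next instruction
    "sw": 1.0,                      # stores never stall in this example
    "beq": cpi_with_stalls(0.25),   # 25% of branches are mispredicted
    "j": cpi_with_stalls(1.0),      # every jump flushes the next instruction
    "rtype": 1.0,
}

# Weight each class CPI by how often that class occurs.
average_cpi = sum(mix[kind] * cpi[kind] for kind in mix)
print(f"Average CPI = {average_cpi:.4f}")
```

The unrounded result is about 1.1475, which the slide reports as 1.15.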
Chapter 7 <90>
• Pipelined processor critical path:
Tc = max {
  tpcq + tmem + tsetup,                              (Fetch)
  2(tRFread + tmux + teq + tAND + tmux + tsetup),    (Decode)
  tpcq + tmux + tmux + tALU + tsetup,                (Execute)
  tpcq + tmemwrite + tsetup,                         (Memory)
  2(tpcq + tmux + tRFwrite) }                        (Writeback)
Pipelined Performance
Chapter 7 <91>
Element                 Parameter    Delay (ps)
Register clock-to-Q     tpcq_PC      30
Register setup          tsetup       20
Multiplexer             tmux         25
ALU                     tALU         200
Memory read             tmem         250
Register file read      tRFread      150
Register file setup     tRFsetup     20
Equality comparator     teq          40
AND gate                tAND         15
Memory write            tmemwrite    220
Register file write     tRFwrite     100

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup)
   = 2[150 + 25 + 40 + 15 + 25 + 20] ps = 550 ps
Pipelined Performance Example
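To see why the Decode-stage term sets the cycle time, all five candidate paths can be evaluated with the delays from the table. The sketch below is illustrative; the stage labels attached to each term are an assumption about which term belongs to which pipeline stage, and the dictionary names are hypothetical.

```python
# Element delays in picoseconds, taken from the table.
d = {
    "tpcq": 30, "tsetup": 20, "tmux": 25, "tALU": 200, "tmem": 250,
    "tRFread": 150, "teq": 40, "tAND": 15, "tmemwrite": 220, "tRFwrite": 100,
}

# One candidate critical path per term of the max{} expression.
# Stage labels are an assumption made for readability.
stage_paths_ps = {
    "Fetch": d["tpcq"] + d["tmem"] + d["tsetup"],
    "Decode": 2 * (d["tRFread"] + d["tmux"] + d["teq"]
                   + d["tAND"] + d["tmux"] + d["tsetup"]),
    "Execute": d["tpcq"] + d["tmux"] + d["tmux"] + d["tALU"] + d["tsetup"],
    "Memory": d["tpcq"] + d["tmemwrite"] + d["tsetup"],
    "Writeback": 2 * (d["tpcq"] + d["tmux"] + d["tRFwrite"]),
}

# The clock period must accommodate the slowest path.
slowest_stage = max(stage_paths_ps, key=stage_paths_ps.get)
Tc_ps = stage_paths_ps[slowest_stage]
for stage, delay in stage_paths_ps.items():
    print(f"{stage:10s} {delay} ps")
print(f"Tc = {Tc_ps} ps (limited by {slowest_stage})")
```

With these delays the other four terms come out between 270 ps and 310 ps, so the 550 ps term dominates by a wide margin.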
Chapter 7 <92>
Program with IC = 100 billion instructions
Execution Time = IC × CPI × Tc
= (100 × 10^9)(1.15)(550 × 10^-12 s)
= 63 seconds
Pipelined Performance Example
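The execution-time product is easy to verify numerically. A minimal sketch, with variable names chosen for illustration:

```python
instruction_count = 100e9      # IC: 100 billion instructions
average_cpi = 1.15             # from the average CPI calculation
cycle_time_s = 550e-12         # Tc: 550 ps expressed in seconds

# Execution Time = IC x CPI x Tc
execution_time_s = instruction_count * average_cpi * cycle_time_s
print(f"Execution time = {execution_time_s:.2f} s")
```

The product is 63.25 s, which the slide rounds to 63 seconds.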
Chapter 7 <93>
Processor       Execution Time (seconds)    Speedup (single-cycle as baseline)
Single-cycle    92.5                        1
Multicycle      133                         0.70
Pipelined       63                          1.47
Processor Performance Comparison
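The speedup column is the baseline execution time divided by each processor's execution time. A short sketch that recomputes it from the table (the dictionary names are illustrative):

```python
# Execution times in seconds, from the comparison table.
execution_time_s = {"Single-cycle": 92.5, "Multicycle": 133.0, "Pipelined": 63.0}

baseline_s = execution_time_s["Single-cycle"]

# Speedup > 1 means faster than the single-cycle processor.
speedup = {name: baseline_s / t for name, t in execution_time_s.items()}
for name, value in speedup.items():
    print(f"{name:12s} {value:.2f}")
```

A speedup below 1, as for the multicycle processor here, means it is slower than the baseline.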
Chapter 7 <94>
• Deep Pipelining
• Branch Prediction
• Superscalar Processors
• Out of Order Processors
• Register Renaming
• SIMD
• Multithreading
• Multiprocessors
Advanced Microarchitecture
Chapter 7 <95>
• 10-20 stages typical
• Number of stages limited by:
– Pipeline hazards
– Sequencing overhead
– Power
– Cost
Deep Pipelining
Chapter 7 <96>
• Ideal pipelined processor: CPI = 1
• Branch misprediction increases CPI
• Static branch prediction:
– Check direction of branch (forward or backward)
– If backward, predict taken
– Else, predict not taken
• Dynamic branch prediction:
– Keep history of the last several hundred branches in a branch target buffer, recording:
  • Branch destination
  • Whether the branch was taken
Branch Prediction
Chapter 7 <97>
      add  $s1, $0, $0       # sum = 0
      add  $s0, $0, $0       # i = 0
      addi $t0, $0, 10       # $t0 = 10
for:  beq  $s0, $t0, done    # if i == 10, branch
      add  $s1, $s1, $s0     # sum = sum + i
      addi $s0, $s0, 1       # increment i
      j    for
done:
Branch Prediction Example
Chapter 7 <98>
• Remembers whether the branch was taken the last time and does the same thing
• Mispredicts the first and last branch of the loop
1-Bit Branch Predictor
Chapter 7 <99>
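• Only mispredicts the last branch of the loop
[Figure: state transition diagram of the 2-bit predictor with four states: strongly taken (predict taken), weakly taken (predict taken), weakly not taken (predict not taken), strongly not taken (predict not taken). A taken branch moves the state one step toward strongly taken; a not-taken branch moves it one step toward strongly not taken.]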
2-Bit Branch Predictor
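The misprediction counts claimed for the two predictors can be checked by simulating the beq in the loop example, which is not taken for i = 0..9 and taken when i == 10. The sketch below is an illustration under stated assumptions: the loop is executed several times in a row, the 1-bit predictor starts in the "taken" state left behind by an earlier execution of the loop, and the 2-bit states are encoded as a counter from 0 (strongly not taken) to 3 (strongly taken). The function names are hypothetical.

```python
def loop_branch_outcomes(iterations=10):
    """Outcomes of the loop's beq: not taken for each iteration, then taken to exit."""
    return [False] * iterations + [True]


def run_one_bit(outcomes, last_taken):
    """1-bit predictor: predict whatever the branch did last time."""
    mispredicts = 0
    for taken in outcomes:
        if last_taken != taken:
            mispredicts += 1
        last_taken = taken
    return mispredicts, last_taken


def run_two_bit(outcomes, counter):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    mispredicts = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            mispredicts += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return mispredicts, counter


one_bit_state = True   # assumption: an earlier execution of the loop ended with a taken branch
two_bit_state = 0      # assumption: start in "strongly not taken"
one_bit_counts, two_bit_counts = [], []
for _ in range(3):     # execute the whole loop three times
    outcomes = loop_branch_outcomes()
    n1, one_bit_state = run_one_bit(outcomes, one_bit_state)
    n2, two_bit_state = run_two_bit(outcomes, two_bit_state)
    one_bit_counts.append(n1)
    two_bit_counts.append(n2)

print("1-bit mispredictions per loop execution:", one_bit_counts)
print("2-bit mispredictions per loop execution:", two_bit_counts)
```

Under these assumptions the 1-bit predictor misses twice per execution of the loop (first and last branch) and the 2-bit predictor misses once (last branch), because a single taken branch only moves the counter to "weakly not taken".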
Chapter 7 <100>
• Multiple copies of the datapath execute multiple instructions at once
• Dependencies make it tricky to issue multiple instructions at once
[Figure: two-way superscalar datapath. The instruction memory supplies two instructions per cycle, the register file has six ports (four read, two write), there are two ALUs, and the data memory has two ports.]
Superscalar
Chapter 7 <101>
lw  $t0, 40($s0)
add $t1, $s1, $s2
sub $t2, $s1, $s3
and $t3, $s3, $s4
or  $t4, $s1, $s5
sw  $s5, 80($s0)
Ideal IPC: 2
Actual IPC: 2
[Figure: two-way superscalar pipeline timing diagram, cycles 1-8. The six instructions have no dependencies on each other, so two instructions issue in every cycle.]
Superscalar Example
Chapter 7 <102>
lw  $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or  $t3, $s5, $s6
sw  $s7, 80($t3)
Ideal IPC: 2
Actual IPC: 6/5 = 1.2
[Figure: two-way superscalar pipeline timing diagram, cycles 1-9. The add stalls until lw has produced $t0, so the six instructions take five cycles to issue.]
Superscalar with Dependencies
Chapter 7 <103>
• Looks ahead across multiple instructions
• Issues as many instructions as possible at once
• Issues instructions out of order (as long as there are no dependencies)
• Dependencies:
– RAW (read after write): one instruction writes, later instruction reads a register
– WAR (write after read): one instruction reads, later instruction writes a register
– WAW (write after write): one instruction writes, later instruction writes a register
Out of Order Processor
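The three dependency types can be found mechanically by tracking, for each register, the most recent writer and the readers since that write. The sketch below applies this to the six-instruction program used in the superscalar and out-of-order examples. It is an illustration only: the tuple encoding of instructions (opcode, destination, sources) and the function name `find_dependencies` are assumptions, and it tracks register dependencies only, not memory.

```python
# (opcode, destination register or None, source registers)
program = [
    ("lw",  "$t0", ["$s0"]),          # lw  $t0, 40($s0)
    ("add", "$t1", ["$t0", "$s1"]),   # add $t1, $t0, $s1
    ("sub", "$t0", ["$s2", "$s3"]),   # sub $t0, $s2, $s3
    ("and", "$t2", ["$s4", "$t0"]),   # and $t2, $s4, $t0
    ("or",  "$t3", ["$s5", "$s6"]),   # or  $t3, $s5, $s6
    ("sw",  None,  ["$s7", "$t3"]),   # sw  $s7, 80($t3)
]


def find_dependencies(instrs):
    """Return (kind, earlier_index, later_index, register) for each register dependency."""
    last_writer = {}   # register -> index of the most recent instruction that wrote it
    readers = {}       # register -> indices that read it since that write
    deps = []
    for i, (_op, dest, srcs) in enumerate(instrs):
        for reg in srcs:
            if reg in last_writer:
                deps.append(("RAW", last_writer[reg], i, reg))
            readers.setdefault(reg, []).append(i)
        if dest is not None:
            for r in readers.get(dest, []):
                if r != i:  # an instruction does not conflict with itself
                    deps.append(("WAR", r, i, dest))
            if dest in last_writer:
                deps.append(("WAW", last_writer[dest], i, dest))
            last_writer[dest] = i
            readers[dest] = []
    return deps


dependencies = find_dependencies(program)
for kind, earlier, later, reg in dependencies:
    print(f"{kind}: {program[earlier][0]} -> {program[later][0]} on {reg}")
```

Besides the RAW and WAR dependencies on $t0 and the RAW dependency on $t3, this also reports a WAW dependency between lw and sub, since both write $t0.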
Chapter 7 <104>
• Instruction-level parallelism (ILP): number of instructions that can be issued simultaneously (average < 3)
• Scoreboard: table that keeps track of:
– Instructions waiting to issue
– Available functional units
– Dependencies
Out of Order Processor
Chapter 7 <105>
lw  $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or  $t3, $s5, $s6
sw  $s7, 80($t3)
Ideal IPC: 2
Actual IPC: 6/4 = 1.5
[Figure: out-of-order pipeline timing diagram, cycles 1-8. Issue order: lw and or in cycle 1, sw in cycle 2, add and sub in cycle 3, and in cycle 4. Marked dependencies: RAW from lw to add on $t0 (two-cycle latency between the load and the use of $t0), WAR from add to sub on $t0, RAW from sub to and on $t0, RAW from or to sw on $t3.]
Out of Order Processor Example
Chapter 7 <106>
Original program:
lw  $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or  $t3, $s5, $s6
sw  $s7, 80($t3)
Ideal IPC: 2
Actual IPC: 6/3 = 2
With $t0 renamed to $r0 in sub and and:
lw  $t0, 40($s0)
add $t1, $t0, $s1
sub $r0, $s2, $s3
and $t2, $s4, $r0
or  $t3, $s5, $s6
sw  $s7, 80($t3)
[Figure: out-of-order pipeline timing diagram with register renaming, cycles 1-7. Issue order: lw and sub in cycle 1, and and or in cycle 2, add and sw in cycle 3. Marked dependencies: 2-cycle RAW from lw to add on $t0, RAW from sub to and on $r0, RAW from or to sw on $t3.]
Register Renaming
Chapter 7 <107>
• Single Instruction Multiple Data (SIMD)
– Single instruction acts on multiple pieces of data at once
– Common application: graphics
– Performs short arithmetic operations (also called packed arithmetic)
• For example, add four 8-bit elements:
padd8 $s2, $s0, $s1

Bit position    31:24      23:16      15:8       7:0
$s0             a3         a2         a1         a0
$s1             b3         b2         b1         b0
$s2             a3 + b3    a2 + b2    a1 + b1    a0 + b0
SIMD
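The effect of padd8 on a 32-bit register can be modeled in software by adding each 8-bit lane separately. This is a minimal sketch, not a definition of the instruction: the Python function `padd8` is hypothetical, and it assumes that each lane wraps around modulo 256 on overflow, which the slide does not specify (a saturating variant would clamp at 255 instead).

```python
def padd8(a, b):
    """Add four packed 8-bit lanes of two 32-bit values, lane by lane."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_a = (a >> shift) & 0xFF
        lane_b = (b >> shift) & 0xFF
        # Assumption: wrap on overflow; the carry never spills into the next lane.
        lane_sum = (lane_a + lane_b) & 0xFF
        result |= lane_sum << shift
    return result


s0 = 0x01020304   # a3 = 0x01, a2 = 0x02, a1 = 0x03, a0 = 0x04
s1 = 0x10203040   # b3 = 0x10, b2 = 0x20, b1 = 0x30, b0 = 0x40
s2 = padd8(s0, s1)
print(f"{s2:#010x}")
```

Masking each lane sum to 8 bits is what distinguishes packed addition from an ordinary 32-bit add, where a carry out of one byte would change the byte above it.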
Chapter 7 <108>
• Multithreading
– Word processor: threads for typing, spell checking, printing
• Multiprocessors
– Multiple processors (cores) on a single chip
Advanced Architecture Techniques
Chapter 7 <109>
• Process: program running on a computer
– Multiple processes can run at once: e.g., surfing the Web, playing music, writing a paper
• Thread: part of a program
– Each process has multiple threads: e.g., a word processor may have threads for typing, spell checking, printing
Threading: Definitions
Chapter 7 <110>
• One thread runs at a time
• When one thread stalls (for example, waiting for memory):
– Architectural state of that thread is stored
– Architectural state of the waiting thread is loaded into the processor and it runs
– Called context switching
• Appears to the user as if all threads are running simultaneously
Threads in Conventional Processor
Chapter 7 <111>
• Multiple copies of architectural state
• Multiple threads active at once:
– When one thread stalls, another runs immediately
– If one thread can’t keep all execution units busy, another thread can use them
• Does not increase instruction-level parallelism (ILP) of a single thread, but increases throughput
• Intel calls this “hyperthreading”
Multithreading
Chapter 7 <112>
• Multiple processors (cores) with a method of communication between them
• Types:
– Homogeneous: multiple cores with shared memory
– Heterogeneous: separate cores for different tasks (for example, DSP and CPU in a cell phone)
– Clusters: each core has its own memory system
Multiprocessors