cs 61c: great ideas in computer architecture control and...

Post on 25-May-2020


CS 61C: Great Ideas in Computer Architecture

Control and Pipelining

Instructors: Vladimir Stojanovic and Nicholas Weaver
http://inst.eecs.Berkeley.edu/~cs61c/sp16

Datapath Control Signals

• ExtOp: “zero”, “sign”
• ALUsrc: 0 ⇒ regB; 1 ⇒ immed
• ALUctr: “ADD”, “SUB”, “OR”
• MemWr: 1 ⇒ write memory
• MemtoReg: 0 ⇒ ALU; 1 ⇒ Mem
• RegDst: 0 ⇒ “rt”; 1 ⇒ “rd”
• RegWr: 1 ⇒ write register

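[Figure: single-cycle datapath. Next-PC logic (PC, two adders, PC Ext, mux, driven by nPC_sel & Equal) produces the instruction address; the register file (Rw, Ra, Rb; 32-bit busA, busB, busW; RegWr; clk) is addressed by Rs, Rt, and Rd through the RegDst mux; the extender widens Imm16 from 16 to 32 bits under ExtOp; the ALUSrc mux feeds the ALU (ALUctr); data memory (WrEn, Adr, Data In; MemWr; clk) and the MemtoReg mux return the result on busW.]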

Summary of the Control Signals (1/2)

inst    Register Transfer

add     R[rd] ← R[rs] + R[rt]; PC ← PC + 4
        ALUsrc=RegB, ALUctr=“ADD”, RegDst=rd, RegWr, nPC_sel=“+4”

sub     R[rd] ← R[rs] – R[rt]; PC ← PC + 4
        ALUsrc=RegB, ALUctr=“SUB”, RegDst=rd, RegWr, nPC_sel=“+4”

ori     R[rt] ← R[rs] | zero_ext(Imm16); PC ← PC + 4
        ALUsrc=Im, ExtOp=“Z”, ALUctr=“OR”, RegDst=rt, RegWr, nPC_sel=“+4”

lw      R[rt] ← MEM[R[rs] + sign_ext(Imm16)]; PC ← PC + 4
        ALUsrc=Im, ExtOp=“sn”, ALUctr=“ADD”, MemtoReg, RegDst=rt, RegWr, nPC_sel=“+4”

sw      MEM[R[rs] + sign_ext(Imm16)] ← R[rt]; PC ← PC + 4
        ALUsrc=Im, ExtOp=“sn”, ALUctr=“ADD”, MemWr, nPC_sel=“+4”

beq     if (R[rs] == R[rt]) then PC ← PC + 4 + (sign_ext(Imm16) || 00) else PC ← PC + 4
        nPC_sel=“br”, ALUctr=“SUB”
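The register transfers on this slide can be read as a tiny interpreter. The following is a minimal sketch, not the course's reference model: the function name `step`, the dictionary-based memory, and the argument layout are illustrative assumptions. It assumes standard MIPS semantics, in which `sw` stores R[rt] and the branch offset is added to PC + 4.

```python
MASK32 = 0xFFFFFFFF

def sign_ext(imm16):
    """Interpret a 16-bit immediate as a signed value."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def zero_ext(imm16):
    """Interpret a 16-bit immediate as an unsigned value."""
    return imm16 & 0xFFFF

def step(pc, R, MEM, op, rs=0, rt=0, rd=0, imm16=0):
    """Apply one instruction's register transfer and return the next PC.

    R is a list of 32 register values and MEM a dict from address to word.
    Register 0 is not forced to zero in this sketch.
    """
    if op == "add":
        R[rd] = (R[rs] + R[rt]) & MASK32
    elif op == "sub":
        R[rd] = (R[rs] - R[rt]) & MASK32
    elif op == "ori":
        R[rt] = R[rs] | zero_ext(imm16)
    elif op == "lw":
        R[rt] = MEM.get((R[rs] + sign_ext(imm16)) & MASK32, 0)
    elif op == "sw":
        MEM[(R[rs] + sign_ext(imm16)) & MASK32] = R[rt]
    elif op == "beq":
        if R[rs] == R[rt]:
            # Branch: word offset shifted left by 2, relative to PC + 4.
            return (pc + 4 + (sign_ext(imm16) << 2)) & MASK32
    else:
        raise ValueError(f"unsupported instruction: {op}")
    return (pc + 4) & MASK32

# Usage: build a value, store it, load it back, then branch on equality.
R, MEM = [0] * 32, {}
pc = step(0, R, MEM, "ori", rs=0, rt=8, imm16=5)
pc = step(pc, R, MEM, "sw", rs=0, rt=8, imm16=16)
pc = step(pc, R, MEM, "lw", rs=0, rt=9, imm16=16)
```

Every non-branch path falls through to PC + 4, which mirrors nPC_sel=“+4” in the control settings.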

Summary of the Control Signals (2/2)

See Appendix A for the func and op encodings.

             add       sub       ori       lw        sw        beq       jump
func         100000    100010    (we don’t care :-)
op           000000    000000    001101    100011    101011    000100    000010
RegDst       1         1         0         0         x         x         x
ALUSrc       0         0         1         1         1         0         x
MemtoReg     0         0         0         1         x         x         x
RegWrite     1         1         1         1         0         0         0
MemWrite     0         0         0         0         1         0         0
nPCsel       0         0         0         0         0         1         ?
Jump         0         0         0         0         0         0         1
ExtOp        x         x         0         1         1         x         x
ALUctr<2:0>  Add       Subtract  Or        Add       Add       Subtract  x

Instruction formats:

          31-26   25-21   20-16   15-11   10-6    5-0
R-type    op      rs      rt      rd      shamt   funct      add, sub
I-type    op      rs      rt      immediate (15-0)            ori, lw, sw, beq
J-type    op      target address (25-0)                       jump

Boolean Expressions for Controller

RegDst    = add + sub
ALUSrc    = ori + lw + sw
MemtoReg  = lw
RegWrite  = add + sub + ori + lw
MemWrite  = sw
nPCsel    = beq
Jump      = jump
ExtOp     = lw + sw
ALUctr[0] = sub + beq   (assume ALUctr is 00 ADD, 01 SUB, 10 OR)
ALUctr[1] = ori

Where:

rtype = ~op5 • ~op4 • ~op3 • ~op2 • ~op1 • ~op0
ori   = ~op5 • ~op4 •  op3 •  op2 • ~op1 •  op0
lw    =  op5 • ~op4 • ~op3 • ~op2 •  op1 •  op0
sw    =  op5 • ~op4 •  op3 • ~op2 •  op1 •  op0
beq   = ~op5 • ~op4 • ~op3 •  op2 • ~op1 • ~op0
jump  = ~op5 • ~op4 • ~op3 • ~op2 •  op1 • ~op0

add = rtype • func5 • ~func4 • ~func3 • ~func2 • ~func1 • ~func0
sub = rtype • func5 • ~func4 • ~func3 • ~func2 •  func1 • ~func0

How do we implement this in gates?
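One way to check the expressions before committing them to gates is to evaluate them in software. The sketch below is an illustration only: the function names `control`, `bit`, and `inv` are made up here, and ALUctr[1] is read as being driven by `ori`. It first forms one product term per instruction, then ORs those terms into the control signals.

```python
def bit(value, i):
    """Return bit i of value as 0 or 1."""
    return (value >> i) & 1

def inv(b):
    """Logical NOT of a single bit."""
    return b ^ 1

def control(opcode, func=0):
    """Evaluate the controller equations for a 6-bit opcode and func field."""
    op = [bit(opcode, i) for i in range(6)]
    fn = [bit(func, i) for i in range(6)]

    # One product (AND) term per instruction.
    rtype = inv(op[5]) & inv(op[4]) & inv(op[3]) & inv(op[2]) & inv(op[1]) & inv(op[0])
    ori = inv(op[5]) & inv(op[4]) & op[3] & op[2] & inv(op[1]) & op[0]
    lw = op[5] & inv(op[4]) & inv(op[3]) & inv(op[2]) & op[1] & op[0]
    sw = op[5] & inv(op[4]) & op[3] & inv(op[2]) & op[1] & op[0]
    beq = inv(op[5]) & inv(op[4]) & inv(op[3]) & op[2] & inv(op[1]) & inv(op[0])
    jump = inv(op[5]) & inv(op[4]) & inv(op[3]) & inv(op[2]) & op[1] & inv(op[0])
    add = rtype & fn[5] & inv(fn[4]) & inv(fn[3]) & inv(fn[2]) & inv(fn[1]) & inv(fn[0])
    sub = rtype & fn[5] & inv(fn[4]) & inv(fn[3]) & inv(fn[2]) & fn[1] & inv(fn[0])

    # Sums (OR) of product terms give the control signals.
    return {
        "RegDst": add | sub,
        "ALUSrc": ori | lw | sw,
        "MemtoReg": lw,
        "RegWrite": add | sub | ori | lw,
        "MemWrite": sw,
        "nPCsel": beq,
        "Jump": jump,
        "ExtOp": lw | sw,
        # Two-bit encoding assumed on the slide: 00 ADD, 01 SUB, 10 OR.
        "ALUctr": (ori << 1) | (sub | beq),
    }

lw_signals = control(0b100011)           # opcode for lw
sub_signals = control(0b000000, 0b100010)  # R-type with func for sub
```

Don't-care entries come out as 0 here, because the equations simply never assert them.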

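Controller Implementation

[Figure: the opcode and func bits feed the “AND” logic, which produces one line per instruction (add, sub, ori, lw, sw, beq, jump); the “OR” logic combines those lines into RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, nPCsel, Jump, ExtOp, ALUctr[0], and ALUctr[1].]

P&H Figure 4.17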

Summary: Single-cycle Processor

• Five steps to design a processor:
1. Analyze instruction set → datapath requirements
2. Select set of datapath components & establish clock methodology
3. Assemble datapath meeting the requirements
4. Analyze implementation of each instruction to determine setting of control points that effect the register transfer.
5. Assemble the control logic
   • Formulate Logic Equations
   • Design Circuits

[Figure: the five components of a computer: Processor (Control and Datapath), Memory, Input, Output.]

Single Cycle Performance

• Assume time for actions are
  – 100 ps for register read or write; 200 ps for other events
• Clock period is?

Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
lw         200 ps        100 ps          200 ps   200 ps          100 ps           800 ps
sw         200 ps        100 ps          200 ps   200 ps                           700 ps
R-format   200 ps        100 ps          200 ps                   100 ps           600 ps
beq        200 ps        100 ps          200 ps                                    500 ps

• Clock rate (cycles/second = Hz) = 1/Period (seconds/cycle)
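The clock period question can be worked through numerically. This is a small sketch under the slide's stated latencies; the dictionary names are illustrative. In a single-cycle design every instruction takes one clock, so the period must cover the slowest instruction.

```python
# Latency of each action in picoseconds, from the slide's assumptions.
STEP_PS = {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100}

# Actions used by each instruction class, as listed in the table.
USES = {
    "lw": ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "sw": ["fetch", "reg_read", "alu", "mem"],
    "R-format": ["fetch", "reg_read", "alu", "reg_write"],
    "beq": ["fetch", "reg_read", "alu"],
}

totals = {inst: sum(STEP_PS[s] for s in steps) for inst, steps in USES.items()}

# The single clock must be long enough for the slowest instruction (lw).
clock_period_ps = max(totals.values())

# 1 / (1 ps) = 1000 GHz, so f in GHz = 1000 / period in ps.
clock_rate_ghz = 1000 / clock_period_ps

print(totals, clock_period_ps, clock_rate_ghz)
```

Faster instructions such as beq still occupy the full period, which is the inefficiency pipelining goes after.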


• What can we do to improve clock rate?
• Will this improve performance as well?

Want increased clock rate to mean faster programs

Levels of Representation/Interpretation

High Level Language Program (e.g., C)
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  ↓ Compiler
Assembly Language Program (e.g., MIPS)
    lw $t0, 0($2)
    lw $t1, 4($2)
    sw $t1, 0($2)
    sw $t0, 4($2)
  ↓ Assembler
Machine Language Program (MIPS)
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
  ↓ Machine Interpretation
Hardware Architecture Description (e.g., block diagrams)
  ↓ Architecture Implementation
Logic Circuit Description (Circuit Schematic Diagrams)

Anything can be represented as a number, i.e., data or instructions

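No More Magic!

[Figure: the layers of a computer system, from software (application, e.g. a browser; compiler; assembler; operating system, e.g. Mac OSX) through the instruction set architecture to hardware (processor, memory, I/O system; datapath & control; digital design; circuit design; transistors), annotated with the courses that cover each layer: CS61A, CS61B, CS61C, EE40, Phys 7B. The CS61C layers are checked off, with an arrow marking the current one.]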

Administrivia

• Project 2-2 due 3/8 @ 23:59:59 (Tue)
• Guerrilla Sessions: MIPS CPU
  – Wed 3/09 3-5 PM @ 241 Cory
  – Sat 3/12 1-3 PM @ 651@611 Soda

Gotta Do Laundry

• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away
  – Washer takes 30 minutes
  – Dryer takes 30 minutes
  – “Folder” takes 30 minutes
  – “Stasher” takes 30 minutes to put clothes into drawers

[Figure: the four loads, labeled A, B, C, D.]

Sequential Laundry

• Sequential laundry takes 8 hours for 4 loads

[Figure: task order (A, B, C, D) against time, 6 PM to 2 AM; the loads run back to back, each as four 30-minute steps.]

Pipelined Laundry

• Pipelined laundry takes 3.5 hours for 4 loads!

[Figure: task order (A, B, C, D) against time, 6 PM to 2 AM; each load starts 30 minutes after the previous one, so the 30-minute steps overlap.]

Pipelining Lessons (1/2)

• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Multiple tasks operating simultaneously using different resources
• Potential speedup = number of pipe stages
• Time to “fill” pipeline and time to “drain” it reduces speedup: 2.3x (8/3.5) vs. 4x (8/2) in this example

[Figure: the pipelined laundry timeline; loads A to D start at 6 PM in overlapping 30-minute steps.]
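The laundry numbers follow from a simple count of 30-minute steps. The sketch below uses hypothetical helper names and assumes every stage takes the same time. It reproduces the 8-hour and 3.5-hour totals and shows how fill and drain time matter less as the number of loads grows.

```python
def sequential_minutes(loads, stages, stage_min):
    """Each load runs all stages before the next load starts."""
    return loads * stages * stage_min

def pipelined_minutes(loads, stages, stage_min):
    """The first load needs `stages` steps; each later load finishes one step after."""
    return (stages + loads - 1) * stage_min

seq = sequential_minutes(4, 4, 30)    # 480 minutes = 8 hours
pipe = pipelined_minutes(4, 4, 30)    # 210 minutes = 3.5 hours
speedup_4_loads = seq / pipe

# With many loads, fill and drain are amortized and speedup nears the stage count.
speedup_1000_loads = sequential_minutes(1000, 4, 30) / pipelined_minutes(1000, 4, 30)

print(seq, pipe, round(speedup_4_loads, 2), round(speedup_1000_loads, 2))
```

With only four loads the pipeline is partly empty for 6 of its 7 steps, which is why the speedup lands near 2.3x rather than 4x.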

Pipelining Lessons (2/2)

• Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?
• Pipeline rate limited by slowest pipeline stage
• Unbalanced lengths of pipe stages reduces speedup

[Figure: the pipelined laundry timeline; loads A to D start at 6 PM in overlapping 30-minute steps.]
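To explore the faster washer and stasher, the pipeline can be simulated stage by stage. This is a rough sketch, not the slide's own calculation: it assumes a load may wait between stages for as long as needed, and the function name `completion_times` is illustrative.

```python
def completion_times(stage_min, loads):
    """Return the finish time (in minutes) of each load in an unclocked pipeline."""
    free_at = [0] * len(stage_min)   # when each stage next becomes available
    finished = []
    for _ in range(loads):
        ready = 0                    # when this load leaves the previous stage
        for s, t in enumerate(stage_min):
            start = max(ready, free_at[s])  # wait for both the load and the stage
            ready = start + t
            free_at[s] = ready
        finished.append(ready)
    return finished

balanced = completion_times([30, 30, 30, 30], 4)   # washer, dryer, folder, stasher
faster = completion_times([20, 30, 30, 20], 4)

# Gap between consecutive completions: the steady-state rate of the pipeline.
gaps_balanced = [b - a for a, b in zip(balanced, balanced[1:])]
gaps_faster = [b - a for a, b in zip(faster, faster[1:])]

print(balanced, faster, gaps_balanced, gaps_faster)
```

Under these assumptions the loads still complete 30 minutes apart, because the dryer and folder set the rate; only the fill and drain portions shrink.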

Execution Steps in MIPS Datapath

1) IFtch: Instruction Fetch, Increment PC
2) Dcd: Instruction Decode, Read Registers
3) Exec:
   Mem-ref: Calculate Address
   Arith-log: Perform Operation
4) Mem:
   Load: Read Data from Memory
   Store: Write Data to Memory
5) WB: Write Data Back to Register

Single Cycle Datapath

[Figure: single-cycle datapath (PC, +4 adder, instruction memory, registers with rs, rt, rd, imm, ALU, data memory) divided into five stages: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back.]

Pipeline registers

• Need registers between stages
  – To hold information produced in previous cycle

[Figure: the five-stage datapath (1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back) with registers placed between the stages.]

More Detailed Pipeline

IF for Load, Store, …

ID for Load, Store, …

EX for Load

MEM for Load

WB for Load – Oops!

Wrong register number!

Corrected Datapath for Load

Pipelined Execution Representation

• Every instruction must take same number of steps, so some stages will idle
  – e.g. MEM stage for any arithmetic instruction

Time →
IF  ID  EX  MEM WB
    IF  ID  EX  MEM WB
        IF  ID  EX  MEM WB
            IF  ID  EX  MEM WB
                IF  ID  EX  MEM WB
                    IF  ID  EX  MEM WB
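The staggered diagram can be generated for any instruction sequence. The following is a small sketch with made-up helper names, assuming an ideal pipeline with no stalls, so that each instruction enters IF one cycle after the one before it.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Return a text diagram with one row per instruction, shifted one cycle per row."""
    width = max(len(s) for s in STAGES) + 1      # column width for one cycle
    rows = []
    for i, name in enumerate(instructions):
        cells = [" " * width] * i + [s.ljust(width) for s in STAGES]
        rows.append(f"{name:<6}" + "".join(cells).rstrip())
    return "\n".join(rows)

def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles until the last instruction leaves the final stage."""
    return n_stages + n_instructions - 1

program = ["lw", "add", "sw", "sub", "or", "beq"]
print(pipeline_diagram(program))
print("cycles:", total_cycles(len(program)))
```

Every row shows all five stages even when an instruction does no work in one of them, which matches the point that some stages idle.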

Graphical Pipeline Diagrams

• Use datapath figure below to represent pipeline:

IF    ID    EX    Mem   WB
I$    Reg   ALU   D$    Reg

[Figure: five-stage datapath (PC, +4 adder, instruction memory, register file with rs, rt, rd, imm, ALU, data memory, MUX) labeled 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back.]

Graphical Pipeline Representation

• RegFile: left half is write, right half is read

[Figure: instruction order (Load, Add, Store, Sub, Or) against time in clock cycles; each instruction passes through I$, Reg, ALU, D$, Reg, starting one cycle after the previous one.]

Pipelining Performance (1/3)

• Use Tc (“time between completion of instructions”) to measure speedup
  – Tc,pipelined ≥ Tc,single-cycle / Number of stages
  – Equality only achieved if stages are balanced (i.e. take the same amount of time)
• If not balanced, speedup is reduced
• Speedup due to increased throughput
  – Latency for each instruction does not decrease

Pipelining Performance (2/3)

• Assume time for stages is
  – 100 ps for register read or write
  – 200 ps for other stages
• What is pipelined clock rate?
  – Compare pipelined datapath with single-cycle datapath

Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
lw         200 ps        100 ps          200 ps   200 ps          100 ps           800 ps
sw         200 ps        100 ps          200 ps   200 ps                           700 ps
R-format   200 ps        100 ps          200 ps                   100 ps           600 ps
beq        200 ps        100 ps          200 ps                                    500 ps

Pipelining Performance (3/3)

Single-cycle: Tc = 800 ps, f = 1.25 GHz

Pipelined: Tc = 200 ps, f = 5 GHz
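The two Tc values can be derived from the stage latencies. This is a minimal sketch, assuming pipeline register overhead is ignored and using illustrative names: the single-cycle period is the sum of the stages on the longest path, while the pipelined period is set by the slowest single stage.

```python
# Stage latencies in ps: instr fetch, register read, ALU op, memory access, register write.
STAGE_PS = [200, 100, 200, 200, 100]

def ghz(period_ps):
    """Convert a clock period in picoseconds to a frequency in GHz."""
    return 1000 / period_ps

single_cycle_tc = sum(STAGE_PS)    # lw uses every stage
pipelined_tc = max(STAGE_PS)       # every stage gets the same clock period

speedup = single_cycle_tc / pipelined_tc
ideal_speedup = len(STAGE_PS)      # only reached when stages are balanced

# Latency of one lw in the pipeline: five cycles of the pipelined clock.
lw_latency_pipelined = len(STAGE_PS) * pipelined_tc

print(single_cycle_tc, ghz(single_cycle_tc), pipelined_tc, ghz(pipelined_tc))
print(speedup, ideal_speedup, lw_latency_pipelined)
```

The speedup is 4 rather than 5 because the 100 ps stages waste half of each cycle, and a single lw takes longer end to end (1000 ps) than in the single-cycle design.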

Clicker/Peer Instruction

Logic in some stages takes 200 ps and in some 100 ps. Clk-Q delay is 30 ps and setup-time is 20 ps. What is the maximum clock frequency at which a pipelined design can operate?
• A: 10 GHz
• B: 5 GHz
• C: 6.7 GHz
• D: 4.35 GHz
• E: 4 GHz
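One way to reason about this question is to add the pipeline register overhead to the slowest stage. The sketch below assumes hold time and clock skew are ignored; the function name `max_clock_ghz` is illustrative. Each cycle must fit the clk-to-Q delay, the longest stage logic, and the setup time of the next register.

```python
def max_clock_ghz(stage_logic_ps, clk_to_q_ps, setup_ps):
    """Maximum clock frequency in GHz for a pipeline with the given stage logic delays."""
    # The slowest stage, plus register overhead, sets the minimum period.
    period_ps = clk_to_q_ps + max(stage_logic_ps) + setup_ps
    return 1000 / period_ps

print(max_clock_ghz([200, 100], clk_to_q_ps=30, setup_ps=20))
```

The 100 ps stages do not affect the result; they simply leave slack in their cycles.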
