ca-4 multi cycle processor

Upload: vu-doan

Post on 05-Apr-2018

222 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/2/2019 CA-4 Multi Cycle Processor

    1/50

    BKTP.HCM

    2010

    dce

    KIN TRC MY TNH

    CE2010Khoa Khoa hc v K thut My tnh

    BM K thut My tnh

    inh c Anh Vhttp://www.cse.hcmut.edu.vn/~anhvu

    http://www.cse.hcmut.edu.vn/~anhvuhttp://www.cse.hcmut.edu.vn/~anhvu
  • 8/2/2019 CA-4 Multi Cycle Processor

    2/50

    2010

    dce

    2Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Chapter 4

    The Processor

  • 8/2/2019 CA-4 Multi Cycle Processor

    3/50

    2010

    dce

    56Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    What is Pipelining?

    A way of speeding up execution of instructions

    Key idea: overlap execution of multipleinstructions

    Analogy: doing your laundry1. Run load through washer

    2. Run load through dryer

    3. Fold clothes

    4. Put away clothes

    5. Go to 1

    Observation: we can start another load as soonas we finish step 1!

  • 8/2/2019 CA-4 Multi Cycle Processor

    4/50

    2010

    dce

    57Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    The Laundry Analogy

    Ann, Brian, Cathy, Dave

    each have one load of clothesto wash, dry, and fold

    Washer takes 30 minutes

    Dryer takes 30 minutes

    Folder takes 30 minutes

    Stasher takes 30 minutesto put clothes into drawers

    A B C D

  • 8/2/2019 CA-4 Multi Cycle Processor

    5/50

    2010

    dce

    58Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    If we do laundry sequentially...

    Time Required: 8 hours for 4 loads

    30Tas

    k

    Orde

    r

    TimeA

    30 30 3030

    B

    30 3030

    C

    30 30 3030

    D

    30 30 3030

    6 PM 7 8 9 10 11 12 1 2 AM

  • 8/2/2019 CA-4 Multi Cycle Processor

    6/50

    2010

    dce

    59Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    To Pipeline, We Overlap Tasks

    Time Required: 3.5 Hours for 4 Loads Latency remains 2 hours

    Throughput improves by factor of 2.3 (decreases for more loads)

    12 2 AM6 PM 7 8 9 10 11 1Time30

    A

    C

    D

    B

    30 30 3030 30 30Tas

    k

    Order

  • 8/2/2019 CA-4 Multi Cycle Processor

    7/50

    2010

    dce

    60Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Pipelining a Digital System

    Key idea: break big computation up into pieces

    Separate each piece with a pipeline register

    1ns

    200ps 200ps 200ps 200ps 200ps

    Pipeline

    Register

  • 8/2/2019 CA-4 Multi Cycle Processor

    8/50

    2010

    dce

    61Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Pipelining a Digital System

    Why do this?

    Because it's faster for repeated computations

    1ns

    Non-pipelined:

    1 operation finishes

    every 1ns

    200ps 200ps 200ps 200ps 200ps

    Pipelined:

    1 operation finishes

    every 200ps

  • 8/2/2019 CA-4 Multi Cycle Processor

    9/50

    2010

    dce

    62Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Comments about pipelining

    Pipelining increases throughput, but not

    latency Answer available every 200ps, BUT

    A single computation still takes 1ns

    Limitations:

    Computations must be divisible into stage size

    Pipeline registers add overhead

  • 8/2/2019 CA-4 Multi Cycle Processor

    10/50

    2010

    dce

    63Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    IFetch: Instruction Fetch and Update PC Dec: Instruction Decode, Register Read, Sign

    Extend Offset Exec: Execute R-type; Calculate MemoryAddress; Branch Comparison; Branch and JumpCompletion

    Mem: Memory Read; Memory Write Completion;R-type Completion (RegFile write)

    WB: Memory Read Completion (RegFile write)

    Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

    IFetch Dec Exec Mem WBlw

    INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

    The Five Steps of the Load Instruction

  • 8/2/2019 CA-4 Multi Cycle Processor

    11/50

    2010

    dce

    64Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Review Single-Cycle Processor

    IFInstruction Fetch

    IDInstruction Decode

    EXExecute/ Address Calc.

    MEMMemory Access

    WBWrite Back

    Review: Single-Cycle Processor All 5 steps done in a single clock cycle

    Dedicated hardware required for each step

  • 8/2/2019 CA-4 Multi Cycle Processor

    12/50

    2010

    dce

    65Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Pipelining - Key Idea

    Question:

    What happens if we break execution into multiplecycles, but keep the extra hardware?

    Answer: In the best case, we can start executing a new

    instruction on each clock cycle

    this is pipelining Pipelining stages: IF - Instruction Fetch ID - Instruction Decode

    EX - Execute / Address Calculation MEM - Memory Access (read / write) WB - Write Back (results into register file)

  • 8/2/2019 CA-4 Multi Cycle Processor

    13/50

    2010

    dce

    66Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Basic Pipelined Processor

    IF/ID ID/EX EX/MEM MEM/WB

    Pipeline Registers

  • 8/2/2019 CA-4 Multi Cycle Processor

    14/50

    2010

    dce

    67Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Comments about Pipelining

    The good news

    Multiple instructions are being processed at same time

    This works because stages are isolated by registers

    Best case speedup of N

    The bad news Instructions interfere with each other hazards

    Example: different instructions may need the same piece ofhardware (e.g., memory) in same clock cycle

    Example: instruction may require a result produced by anearlier instruction that is not yet complete

    Worst case: must suspend execution stall

  • 8/2/2019 CA-4 Multi Cycle Processor

    15/50

    2010

    dce

    68Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Multicycle Datapath Approach

    Let an instruction take more than 1 clock cycle to complete

    Break up instructions into steps where each steptakes a cyclewhile trying to balance the amount of work to be done in each step

    restrict each cycle to use only one major functional unit

    Not every instruction takes thesamenumber of clock cycles

    In addition to faster clock rates, multicycle allows functionalunits that can be used more than once per instruction aslong as they are used on differentclock cycles, as a result only need one memory but only one memory access per cycle

    need only one ALU/adder but only one ALU operation percycle

  • 8/2/2019 CA-4 Multi Cycle Processor

    16/50

    2010

    dce

    69Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    At the end of a cycle Store values needed in a later cycle by the current instruction in an internal register (not visible

    to the programmer). All (except IR) hold data only between a pair of adjacent clock cycles (nowrite control signal needed)

    IR Instruction Register MDR Memory Data Register

    A, B regfile read data registers ALUout ALU output register

    Data used by subsequent instructions are stored in programmer visible registers(i.e., register file, PC, or memory)

    Our Multicycle Datapath Approach

    Address

    Read Data(Instr. or Data)

    MemoryPC

    Write Data

    Read Addr 1

    Read Addr 2Write Addr

    Register

    File

    Read

    Data 1

    Read

    Data 2

    ALU

    Write Data

    IR

    MDR

    A

    B ALUout

  • 8/2/2019 CA-4 Multi Cycle Processor

    17/50

    2010

    dce

    70Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Clocking the Multicycle DatapathSystem Clock

    MemWrite RegWrite

    clock cycle

    Address

    Read Data

    (Instr. or Data)

    Memory

    PC

    Write Data

    Read Addr 1

    Read Addr 2

    Write Addr

    Register

    File

    Read

    Data 1

    Read

    Data 2

    ALU

    Write Data

    IR

    MD

    R

    A

    BALUout

    The Multicycle Datapath with Control

  • 8/2/2019 CA-4 Multi Cycle Processor

    18/50

    2010

    dce

    71Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    The Multicycle Datapath with ControlSignals

    Address

    Read Data

    (Instr. or Data)

    MemoryPC

    Write Data

    Read Addr 1

    Read Addr 2

    Write Addr

    Register

    File

    Read

    Data 1

    Read

    Data 2

    ALU

    Write Data

    IR

    MDR

    A

    B

    ALUout

    SignExtend

    Shiftleft 2 ALU

    control

    Shiftleft 2

    ALUOp

    Control

    IRWriteMemtoReg

    MemWriteMemRead

    IorD

    PCWrite

    PCWriteCond

    RegDstRegWrite

    ALUSrcAALUSrcB

    zero

    PCSource

    1

    1

    1

    1

    1

    1

    0

    0

    0

    0

    0

    0

    2

    2

    3

    4

    Instr[5-0]

    Instr[25-0]

    PC[31-28]

    Instr[15-0]

    Instr[31-26]

    32

    28

    M l i l C l U i

  • 8/2/2019 CA-4 Multi Cycle Processor

    19/50

    2010

    dce

    72Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Multicycle Control Unit

    Multicycle datapath control signals are not determined solely by thebits in the instruction

    e.g., op code bits tell what operation the ALU should be doing, but notwhat instruction cycle is to be done next

    Must use a finite state machine (FSM) for control a set of states (current state stored in State Register)

    next state function (determined

    by current state and the input) output function (determined bycurrent state and the input) Combinational

    control logic

    State RegInst

    Opcode

    Datapath

    control

    points

    Next State. . .

    . . .

    .

    .

    .

    u t cyc e vantages

  • 8/2/2019 CA-4 Multi Cycle Processor

    20/50

    2010

    dce

    73Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Uses the clock cycle efficiently the clock cycle is

    timed to accommodate the slowest instruction step

    Multicycle implementations allow functional units to beused more than once per instruction as long as theyare used on different clock cycles

    but

    Requires additional internal state registers, moremuxes, and more complicated (FSM) control

    Clk

    Cycle 1

    IFetch Dec Exec Mem WB

    Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10

    IFetch Dec Exec Mem

    lw sw

    IFetch

    R-type

    u t cyc e vantagesDisadvantages

    dceSi l C l M lti l C l Ti i

  • 8/2/2019 CA-4 Multi Cycle Processor

    21/50

    2010

    74Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Clk Cycle 1

    Multiple Cycle Implementation:

    IFetch Dec Exec Mem WB

    Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10

    IFetch Dec Exec Memlw sw

    IFetchR-type

    Clk

    Single Cycle Implementation:

    lw sw Waste

    Cycle 1 Cycle 2

    multicycle clock slowerthan 1/5th of single cycle

    clock due to stateregister overhead

    Single Cycle vs. Multiple Cycle Timing

    dceH C W M k It F t ?

  • 8/2/2019 CA-4 Multi Cycle Processor

    22/50

    2010

    75Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    How Can We Make It Faster?

    Start fetching and executing the next instruction before thecurrent one has completed Pipelining (all?) modern processors are pipelined for

    performance Remember theperformance equation:

    CPU time = CPI * CC * IC

    Under idealconditions and with a large number ofinstructions, the speedup from pipelining is approximatelyequal to the number of pipe stages A five stage pipeline is nearly five times faster because the CC

    is nearly five times faster

    Fetch (and execute) more than one instruction at a time Superscalar processing stay tuned

    dceA Pi li d MIPS P

  • 8/2/2019 CA-4 Multi Cycle Processor

    23/50

    2010

    76Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    A Pipelined MIPS Processor

    Start the next instruction before the current one has completed improves throughput total amount of work done in a given time

    instruction latency (execution time, delay time, response time timefrom the start of an instruction to its completion) is notreduced

    clock cycle (pipeline stage time) is limited by the slowest stage for some stages dont need the whole clock cycle (e.g., WB) for some instructions, some stages are wasted cycles (i.e., nothing is

    done during that cycle for that instruction)

    Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

    IFetch Dec Exec Mem WBlw

    Cycle 7Cycle 6 Cycle 8

    sw IFetch Dec Exec Mem WB

    R-type IFetch Dec Exec Mem WB

    2010

    dceSingle C cle s Pi eline

  • 8/2/2019 CA-4 Multi Cycle Processor

    24/50

    2010

    77Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Single Cycle vs. Pipeline

    To complete an entire instruction in the pipelined casetakes 1000 ps (as compared to 800 ps for the single cyclecase). Why ?

    How long does each take to complete 1,000,000 adds ?

    lw IFetch Dec Exec Mem WB

    Pipeline Implementation (CC = 200 ps):

    IFetch Dec Exec Mem WBsw

    IFetch Dec Exec Mem WBR-type

    Clk

    Single Cycle Implementation (CC = 800 ps):

    lw sw Waste

    Cycle 1 Cycle 2

    400 ps

    2010

    dcePipeline Performance

  • 8/2/2019 CA-4 Multi Cycle Processor

    25/50

    2010

    78Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Pipeline Performance

    Assume time for stages is

    100ps for register read or write 200ps for other stages

    Compare pipelined datapath with single-cycledatapath

    Instr Instr fetch Registerread

    ALU op Memoryaccess

    Registerwrite

    Total time

    lw 200ps 100 ps 200ps 200ps 100 ps 800ps

    sw 200ps 100 ps 200ps 200ps 700ps

    R-format 200ps 100 ps 200ps 100 ps 600ps

    beq 200ps 100 ps 200ps 500ps

    2010

    dcePipeline Performance

  • 8/2/2019 CA-4 Multi Cycle Processor

    26/50

    2010

    79Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Pipeline Performance

    Single-cycle (Tc= 800ps)

    Pipelined (Tc= 200ps)

    2010

    dcePipeline Speedup

  • 8/2/2019 CA-4 Multi Cycle Processor

    27/50

    2010

    80Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Pipeline Speedup

    If all stages are balanced

    i.e., all take the same time

    If not balanced, speedup is less

    Speedup due to increased throughput

    Latency (time for each instruction) does notdecrease

    nonpipelined

    pipelined

    Time between instructionsTime between instructions

    Number of stages

    2010

    dcePipelining the MIPS ISA

  • 8/2/2019 CA-4 Multi Cycle Processor

    28/50

    2010

    81Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Pipelining the MIPS ISA

    What makes it easy

    all instructions are the same length (32 bits) can fetch in the 1st stage and decode in the 2nd stage

    few instruction formats (three) with symmetry acrossformats

    can begin reading register file in 2nd stage

    memory operations occur only in loads and stores can use the execute stage to calculate memory addresses

    each instruction writes at most one result (i.e.,changes the machine state) and does it in the last few

    pipeline stages (MEM or WB) operands must be aligned in memory so a single data

    transfer takes only one data memory access

    2010

    dce

    MIPS Pipeline Datapath Additions/Mods

  • 8/2/2019 CA-4 Multi Cycle Processor

    29/50

    82Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    State registers between each pipeline stage to isolate themIF: IFetch ID: Dec EX: Execute MEM:

    MemAccess

    WB:

    WriteBack

    Read

    Address

    InstructionMemory

    Add

    PC

    4

    Write Data

    Read Addr 1

    Read Addr 2

    Write Addr

    Register

    File

    Read

    Data 1

    Read

    Data 2

    16 32

    ALU

    Shift

    left 2

    Add

    DataMemory

    Address

    Write Data

    Read

    Data

    IF/ID

    SignExtend

    ID/EX EX/MEM

    MEM/WB

    System Clock

    MIPS Pipeline Datapath Additions/Mods

    2010

    dceMIPS Pipeline Control Path Modifications

  • 8/2/2019 CA-4 Multi Cycle Processor

    30/50

    83Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    MIPS Pipeline Control Path Modifications All control signals can be determined during Decode

    and held in thestate registers between pipeline stages

    Read

    Address

    InstructionMemory

    Add

    PC

    4

    Write Data

    Read Addr 1

    Read Addr 2

    Write Addr

    Register

    File

    Read

    Data 1

    Read

    Data 2

    16 32

    ALU

    Shift

    left 2

    Add

    DataMemory

    Address

    Write Data

    Read

    Data

    IF/ID

    SignExtend

    ID/EXEX/MEM

    MEM/WB

    Control

    ALU

    cntrl

    RegWrite

    MemRead

    MemtoReg

    RegDst

    ALUOp

    ALUSrc

    Branch

    PCSrc

    2010

    dcePipeline Control

  • 8/2/2019 CA-4 Multi Cycle Processor

    31/50

    84Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Pipeline Control

    IF Stage: read Instr Memory (always

    asserted) and write PC (on System Clock) ID Stage: no optional control signals to set

    EX Stage MEM Stage WB Stage

    RegDst

    ALUOp1

    ALUOp0

    ALUSrc

    Brch MemRead

    MemWrite

    RegWrite

    MemtoReg

    R 1 1 0 0 0 0 0 1 0

    lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 X

    beq X 0 1 0 1 0 0 0 X

    2010

    dceGraphically Representing MIPS Pipeline

  • 8/2/2019 CA-4 Multi Cycle Processor

    32/50

    85Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Can help with answering questions like: How many cycles does it take to execute this

    code?

    What is the ALU doing during cycle 4?

    Is there a hazard, why does it occur, and how canit be fixed?

    Graphically Representing MIPS Pipeline

    EX MEM WBIDIF

    2010dce

    Why Pipeline? For Performance!

  • 8/2/2019 CA-4 Multi Cycle Processor

    33/50

    86Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    I

    n

    s

    t

    r.

    O

    r

    d

    e

    r

    Time (clock cycles)

    Inst 0

    Inst 1

    Inst 2

    Inst 4

    Inst 3

    Once the pipelineis full, one

    instruction iscompleted everycycle, so CPI = 1

    Time to fill the pipeline

    Why Pipeline? For Performance!

    EX MEM WBIDIF

    EX MEM WBIDIF

    EX MEM WBIDIF

    EX MEM WBIDIF

    EX MEM WBIDIF

    2010dce

    Can Pipelining Get Us Into Trouble?

  • 8/2/2019 CA-4 Multi Cycle Processor

    34/50

    87Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Yes: Pipeline Hazards Situations that prevent starting the next instruction in the next cycle

    structural hazards: attempt to use the same resource by two differentinstructions at the same time

    data hazards: attempt to use data before it is ready An instructions source operand(s) are produced by a prior instruction still in

    the pipeline

    control hazards: attempt to make a decision about program control flow

    before the condition has been evaluated and the new PC targetaddress calculated

    branch and jump instructions, exceptions

    Can usually resolve hazards by waiting

    pipeline control must detect the hazard and take action to resolve hazards

    Can Pipelining Get Us Into Trouble?

    2010dce

    Structural hazard Conflict for use of a resource

  • 8/2/2019 CA-4 Multi Cycle Processor

    35/50

    88Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    I

    n

    s

    t

    r.

    O

    r

    d

    er

    Time (clock cycles)

    lw

    Inst 1

    Inst 2

    Inst 4

    Inst 3

    Structural hazard Conflict for use of a resource

    Fix with separate instr and data memories (I$ and D$)

    Reading data frommemory

    Reading instructionfrom memory

    EX MEM WBIDIF

    EX MEM WBIDIF

    EX MEM WBIDIF

    EX MEM WBIDIF

    EXMEM WBIDIF

    Instruction fetchwould have to

    stall for that cycle

    2010dce

    How About Register File Access?

  • 8/2/2019 CA-4 Multi Cycle Processor

    36/50

    89Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    How About Register File Access?

    I

    n

    s

    t

    r.

    O

    r

    d

    er

    Time (clock cycles)

    Inst 1

    Inst 2

    Fix register file

    access hazard bydoing reads in thesecond half of thecycle and writes in

    the first half

    add$1,

    add $2,$1,

    clock edge that controlsregister writing

    clock edge that controls loadingof pipeline state registers

    EX

    MEM WBIDIF

    EX MEM WBIDIF

    EX

    MEM WBIDIF

    EX MEM WBIDIF

    can mot thoi gian de doc vao thanhghi. Neu muon thuc hien nhu ben,ta phai co xung clock lon hon 2 lantime can thiet

    2010dce

    Register Usage Can Cause Data Hazards

  • 8/2/2019 CA-4 Multi Cycle Processor

    37/50

    91Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Read before write data hazard

    I

    n

    s

    t

    r.

    O

    r

    d

    er

    add$1,

    sub $4,$1,$5

    and $6,$1,$7

    xor $4,$1,$5

    or $8,$1,$9

    Register Usage Can Cause Data Hazards

    Dependencies backward in time cause hazards

    EX MEM WBIDIF

    EX MEM WBIDIF

    EX MEM WBIDIF

    EX MEM WBIDIF

    EXMEM WBIDIF

    2010dce

    One Way to Fix a Data Hazard

  • 8/2/2019 CA-4 Multi Cycle Processor

    38/50

    92Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    I

    n

    s

    t

    r.

    O

    r

    d

    er

    add$1,

    sub $4,$1,$5

    and $6,$1,$7

    EX MEM WBIDIF

    EX MEM WBIDIF

    EXMEM WBIDIF

    stall

    stall

    Can fix datahazard by

    waitingstallbut impacts CPI

    One Way to Fix a Data Hazard

    2010dce

    Forwarding (aka Bypassing)

  • 8/2/2019 CA-4 Multi Cycle Processor

    39/50

    93Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Forwarding (aka Bypassing)

    forwarding results as soon as they are

    available to where they are needed Use result when it is computed

    Dont wait for it to be stored in a register

    Requires extra connections in the datapath

    add$s0,$t0,$t1

    sub $t2,$s0,$t3

    EX MEM WBIDIF

    EX MEM WBIDIF

    de khoai phai cho lenh load word, ta dua thang ket qua tuex xuong lenh sau (ky thuat forwarding)

    - can them nhung duong ket noi dat biet- chi dung duoc mot so truong hop

    2010dce

    Loads Can Cause Data Hazards

  • 8/2/2019 CA-4 Multi Cycle Processor

    40/50

    94Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Dependencies backward in time cause hazards

    Load-use data hazard

    EX MEM WBIDIF

    Loads Can Cause Data Hazards

    I

    n

    s

    t

    r.

    O

    r

    d

    er

    lw $1,4($2)

    sub $4,$1,$5

    and $6,$1,$7

    xor $4,$1,$5

    or $8,$1,$9

    EX MEM WBIDIF

    EX MEM WBIDIF

    EX MEM WBIDIF

    EXMEM WBIDIF

    2010dce

    Load-Use Data Hazard

  • 8/2/2019 CA-4 Multi Cycle Processor

    41/50

    95Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Load Use Data Hazard

    Cant always avoid stalls by forwarding

    If value not computed when needed Cant forward backward in time!

    lw $s0, 20($t1)

    sub $t2,$s0,$t3

    EX MEM WBIDIF

    EXMEM WBIDIF

    su dung ky thuat: cho va forwardingket hop nhieu phuong phap khac nhau de toi uu chuong trinh

    hazard do truy suat du lieu

    2010dce

    Code Scheduling to Avoid Stalls

  • 8/2/2019 CA-4 Multi Cycle Processor

    42/50

    96Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    g

    Reorder code to avoid use of load result in the

    next instruction C code for A = B + E; C = B + F;

    lw $t1, 0($t0)

    lw $t2, 4($t0)

    add $t3, $t1, $t2

    sw $t3, 12($t0)

    lw $t4, 8($t0)

    add $t5, $t1, $t4

    sw $t5, 16($t0)

    stall

    lw $t1, 0($t0)

    lw $t2, 4($t0)

    lw $t4, 8($t0)

    add $t3, $t1, $t2

    sw $t3, 12($t0)

    add $t5, $t1, $t4

    sw $t5, 16($t0)

    11 cycles13 cycles

    stall

    de tranh cho, ta sap xep lai cac cau lenh cua chuong trinh

    stall: doi 1 chu ky[cach nhau 1 chu ky

    cach nhau 2 chu ky,khong doi

    2010dce

    Branch Instructions Cause Control Hazards

  • 8/2/2019 CA-4 Multi Cycle Processor

    43/50

    97Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Dependencies backward in time cause hazards

    I

    n

    s

    t

    r.

    O

    r

    d

    e

    r

    lw

    Inst 4

    Inst 3

    beqEX MEM WBIDIF

    EX MEM WBIDIF

    EX MEM WBIDIF

    EX MEM WBIDIF

    2010dce

    Stall on Branch

  • 8/2/2019 CA-4 Multi Cycle Processor

    44/50

    98Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Wait until branch outcome determined before

    fetching next instruction

    beq $s1, $s2, L

    sub $t2,$t1,$t3

    EX MEM WBIDIF

    EXMEM WBIDIF

    - cach don gian nhat la cho

    2010dce Branch Prediction

  • 8/2/2019 CA-4 Multi Cycle Processor

    45/50

    99Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Longer pipelines cant readily determine

    branch outcome early Stall penalty becomes unacceptable

    Predict outcome of branch

    Only stall if prediction is wrong In MIPS pipeline

    Can predict branches not taken

    Fetch instruction after branch, with no delay

    - ky thuat thuong dung la doan- cham nhat cung bang truong hop cho- neu doan dung thi khong can phai cho

    - lenh Mips doan voi dieu kien khong say ratuc la lenh ngay sau beq- Tien doan tinh (khong phai tien doan dong

    neu doan dung, thi coi nhu khong mat chu ky nao ca

    2010dce MIPS with Predict Not Taken

  • 8/2/2019 CA-4 Multi Cycle Processor

    46/50

    100Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    beq $s1, $s2, L

    sub $t2,$t1,$t3

    EX MEM WBIDIF

    EX MEM WBIDIF

    beq $s1, $s2, L

    add $t2,$t1,$t3

    EX MEM WBIDIF

    EX MEM WBIDIF

    Predictioncorrect

    Predictionincorrect

    2010dce More-Realistic Branch Prediction

  • 8/2/2019 CA-4 Multi Cycle Processor

    47/50

    101Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    Static branch prediction

    Based on typical branch behavior Example: loop and if-statement branches

    Predict backward branches taken

    Predict forward branches not taken

    Dynamic branch prediction Hardware measures actual branch behavior e.g., record recent history of each branch

    Assume future behavior will continue the trend When wrong, stall while re-fetching, and update history

    tien doan tinh

    tien doan dong

    khong dua vao nhung thong tin truoc do

    nghia la luc nao cung chon 1 trong 2:taken hoat not taken

    co chien thuat:- dua vao 2 bit lich su: neu 2 lan minh tien doan sai thi se thay doi quan diem

    dua vao nhung lan tien doan truoc do

    2010dce Other Pipeline Structures Are Possible

  • 8/2/2019 CA-4 Multi Cycle Processor

    48/50

    102Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    What about the (slow) multiply operation? Make the clock twice as slow or

    let it take two cycles (since it doesnt use the DM stage)

    What if the data memory access is twice as slow as the instructionmemory? make the clock twice as slow or let data memory access take two cycles (and keep the same clock

    rate)

    MUL

    EX MEM WBIDIF

    EX MEM1 WBIDIF MEM2

    lenh nhan thuc hien cham hon lenh congnen lenh nhan se thuc hien cham hon

    ta lua chon giua viec gian xung clock hayde lenh nhan nhieu chu ki

    cho lenh dung nhieu chu ky

    cho lenh mem thanh 2 chu ky lien tiep

    lenh load thuong rat cham so voi cac lenh khac

    2010dce Other Sample Pipeline Alternatives

  • 8/2/2019 CA-4 Multi Cycle Processor

    49/50

    103Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    ARM7

    XScaleALU

    IM1 IM2 DM1 RegDM2

    IM Reg EX

    PC updateIM access

    decodereg

    access

    ALU opDM access

    shift/rotate

    commit result

    (write back)

    Reg SHFT

    PC update

    BTB access

    start IM access

    IM access

    decode

    reg 1 access

    shift/rotate

    reg 2 access

    ALU op

    start DM access

    exception

    DM write

    reg write

    theo cau truc Reg: tap lenh don gian, cau truc (thiet ke) don gian

    2010dce Summary

  • 8/2/2019 CA-4 Multi Cycle Processor

    50/50

    104Computer Architecture Chapter 4 2010, Dr. Dinh Duc Anh Vu

    All modern day processors use pipelining

    Pipelining doesnt help latency of single task, it helpsthroughput of entire workload

    Potential speedup: a CPI of 1 and fast a CC

    Pipeline rate limited by slowest pipeline stage

    Unbalanced pipe stages makes for inefficiencies The time to fill pipeline and time to drain it can impact

    speedup for deep pipelines and short code runs

    Must detect and resolve hazards

    Stalling negatively affects CPI (makes CPI less than theideal of 1)

    nap lenh goi dau

    can bang cac giai doan : rd, sd, mem ...

    lap day cac giai doan. Neu ong day lien tuc thi hieu suat toi da